My last post briefly touched on Needleman-Wunsch and Smith-Waterman dynamic programming techniques, for what I thought was a good, accessible explanation of how the algorithms work under the hood (for comparison to BLAST, which built off their foundation).
This plot, from a new paper from Mark Gerstein's group at Yale really shows how far the field has moved on from these early works however -- Muir et al. (2016) The real cost of sequencing: scaling computation to keep pace with data generation.
Alignment tools have co-evolved with sequencing technology to meet the demands placed on sequence data processing. The decrease in their running time approximately follows Moore's Law (Fig. 3a). This improved performance is driven by a series of discrete algorithmic advances. In the early Sanger sequencing era, the Smith-Waterman and Needleman-Wunsch algorithms used dynamic programming to find a local or global optimal alignment. But the quadratic complexity of these approaches makes it impossible to map sequences to a large genome. Following this limitation, many algorithms with optimized data structures were developed, employing either hash-tables (for example, Fasta, BLAST (Basic Local Alignment Search Tool), BLAT (BLAST-like Alignment Tool), MAQ, and Novoalign) or suffix arrays with the Burrows-Wheeler transform (for example, STAR (Spliced Transcripts Alignment to a Reference), BWA (Burrows-Wheeler Aligner) and Bowtie).
In addition to these optimized data structures, algorithms adopted different search methods to increase efficiency. Unlike Smith-Waterman and Needleman-Wunsch, which compare and align two sequences directly, many tools (such as FASTA, BLAST, BLAT, MAQ, and STAR) adopt a two-step seed-and-extend strategy. Although this strategy cannot be guaranteed to find the optimal alignment, it significantly increases speeds by not comparing sequences base by base. BWA and Bowtie further optimize by only searching for exact matches to a seed. The inexact match and extension approach can be converted into an exact match method by enumerating all combinations of mismatches and gaps.
Their paper is open access and covers plenty of ground - from new approaches like Salmon ('lightweight' alignment, using estimated position of reads on the genome) and Kallisto ('pseudoalignment', determining theoretical compatibility of reads rather than aligning) -- as described by Rob Patro over on his blog last year -- to discussion of budgets, Big Data, and observations on Moore's and (a new one to me) Kryder's laws.
It echoes Ewan Birney's call for life science infrastructure during his visit to the U.S. National Human Genome Research Institute, where he compared the state of biology and its information services against the monument to research architecture supplying CERN. Muir and colleagues suggest a 'biomedical cloud' on which datasets, analysis programs, and so on were stored on, rather than the commercial (and regularly failing) commercial options that have become the norm.
Academic/scientific-specialised systems exist, but a meltdown just last night in a similar package manager, npm, as has been pointed out, highlights the liabilities such systems open up in the long term.
Npm blog, official response: kik, left-pad, and npm
Ars Technica: Rage-quit: Coder unpublished 17 lines of JavaScript and "broke the Internet"
The Verge: How an irate developer briefly broke JavaScript
















