This is a really quick summary of the interesting paper “Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms, by Rob Patro, Stephen M. Mount, Carl Kingsford” that was on arXiv and @Haldanesive early this week.
Sailfish is a new computational approach that can quantify isoform level expression using RNA-seq data. What is novel about Sailfish is that unlike other commonly used approaches Sailfish does not align RNA-seq reads to either genome or transcriptome.
“Alignment-free” approach to quantify expression? It does raise a few eyebrows and questions. What does “Alignment-Free” mean and does it really work?
One can understand the “Alignment-Free” novelty, by getting to the core challenge of RNA-seq analysis. The first step of RNA-seq isoform quantification programs like TopHat/Cufflinks, RSEM, is to align RNA-seq reads to a genome or transcriptome to assign a read to one or more transcripts. The ambiguity resulting from a read mapping to multiple transcripts or genes is resolved using an Expectation Maximization algorithm.
An EM based approach assign reads to transcripts based on the alignment and use these assignments to estimate transcript abundances. As the Sailfish paper say the initial
abundances are then used to re-estimate the read assignments, weighting potential matches in proportion to the currently estimated relative abundances, and these steps are repeated until convergence.
Computationally, EM approaches rely on the alignment profile, which is a sparse binary matrix of size up to N x T, where N is the number of reads and T is the number of transcripts in the genome. The probability of each alignment is obtained by running thousands of EM iterations, until it converges.
It is easy to see that the alignment step and the EM step are the huge bottlenecks in RNA-seq isoform analysis. The complexity in alignment and EM steps, grow exponentially as we sequence more reads. And this makes isoform quantitation time and resource consuming. For example, assuming things converge at the same rate, by doubling the read depth, the time to align doubles and the EM matrix size also doubles. In reality, with larger alignment profile, the convergence time also increases significantly.
This is where Sailfish comes to help. Sailfish kind of simplified the computational bottlenecks by simply taking the alignment process away. Sailfish starts with a known transcriptome and creates unique k-mer index from the reference transcriptome. Then it runs through each sequenced read and catalogs the kmer counts in the read. Essentially this is the “alignment-step”. Instead of doing a real alignment, it keeps a record of k-mers from RNA-seq reads if they are present in the reference k-mer index. For example, a 100 base read can create 81, 20-mers and it will increase count for corresponding 20-mer, if it is present in the reference k-mer index. At the end of the k-mer indexing process, for each unique “k-mer” in the reference transcriptome we get counts observed in the RNA-seq data.
Human reference transcriptome has over 60 Million k-mers of size 20 bases and of these close to 40 Million k-mers appear at least once. And in the data that they used there were only about 150,000 distinct k-mer equivalent classes with non-zero counts. This makes EM sparse matrix in Sailfish almost a constant size; ~150K x number of human transcripts. And the size does not grow exponentially with RNA-seq data depth. With the help of SQUAREM, Sailfish could make the convergence of EM even faster. The result is an isoform quantitation method that is super-fast and not memory hogging. Check the figure from the paper below that compares the alignment plus quantitation speed of RSEM, eXpress, an Cufflinks with Sailfish. On the real data, Sailfish finished in just about 10 mins, while RSEM running complete EM on all reads took close to 47 hours.
The biggest novelty of the paper is the “alignment-free” aspect and using efficient hash-table to create the index. Taking the alignment away has paved way for simpler EM as well. Although, the further reduction of EM complexity by collapsing millions of k-mers in to equivalent classes greatly improve computational complexity, it is not totally novel. Isoform estimating tools, IsoEM and MMSEQ employ similar techniques to reduce the EM matrix size by collapsing reads with similar profiles as “equivalent classes” and improve speed. IsoEM uses “Computer Science-y” way to reduce the parameter space and MMSEQ uses a similar (to Sailfish) but a model based approach to collapse reads with similar profiles into equivalent classes.
Sailfish paper completely missed comparing and referencing these two papers. This could be a genuine oversight. However, there is no doubt that the “alignment-free” first step and the fixed “k-mer” classes could easily beat the other approaches on speed. It will be great if the Sailfish manuscript cites these two papers that employ “equivalent classes” technique to reduce the EM complexity. For sure, the authors would appreciate the previous work in RNA-seq anlaysis domain.
Another biggest advantage of the “alignment-free” approach is that isoform quantitation can be immune to SNPs and indels, if the k-mer size is small enough. Note that, only the k-mers that do not have a match in the k-mer index will be disregarded. Another way to look at the hashing is that each read gets multiple chances to be counted in expression quantitation. So by losing a few k-mers corresponding to variants from a read, we are probably not going to lose all the information from a read.
Sailfish software written in C++ is available for free download. Check the website for more details
Would love to test and comment more on Sailfish, but it has to wait for some more time. The paper was on arXiv just this Sunday with the program ready to download and here is our quick comment on the paper within two days. #OA rocks.