Sailfish: Isoform Quantitation at the Speed of Making a Cup of Coffee

Sailfish: Isoform Quantitation from RNA-seq at the speed of making coffee

Sailfish: Isoform Quantitation from RNA-seq at the speed of making coffee

This is a really quick summary of the interesting paper “Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms, by Rob Patro, Stephen M. Mount, Carl Kingsford” that was on arXiv and @Haldanesive early this week.

Sailfish is a new computational approach that can quantify isoform level expression using RNA-seq data. What is novel about Sailfish is that unlike other commonly used approaches Sailfish does not align RNA-seq reads to either genome or transcriptome.

“Alignment-free” approach to quantify expression? It does raise a few eyebrows and questions. What does “Alignment-Free” mean and does it really work?

One can understand the “Alignment-Free” novelty, by getting to the core challenge of RNA-seq analysis. The first step of RNA-seq isoform quantification programs like TopHat/Cufflinks, RSEM, is to align RNA-seq reads to a genome or transcriptome to assign a read to one or more transcripts. The ambiguity resulting from a read mapping to multiple transcripts or genes is resolved using an Expectation Maximization algorithm.

An EM based approach assign reads to transcripts based on the alignment and use these assignments to estimate transcript abundances. As the Sailfish paper say the initial

abundances are then used to re-estimate the read assignments, weighting potential matches in proportion to the currently estimated relative abundances, and these steps are repeated until convergence.

Computationally, EM approaches rely on the alignment profile, which is a sparse binary matrix of size up to N x T, where N is the number of reads and T is the number of transcripts in the genome.  The probability of each alignment is obtained by running thousands of EM iterations, until it converges.

It is easy to see that the alignment step and the EM step are the huge bottlenecks in RNA-seq isoform analysis. The complexity in alignment and EM steps, grow exponentially as we sequence more reads. And this makes isoform quantitation time and resource consuming. For example, assuming things converge at the same rate, by doubling the read depth, the time to align doubles and the EM matrix size also doubles. In reality, with larger alignment profile, the convergence time also increases significantly.


This is where Sailfish comes to help. Sailfish kind of simplified the computational bottlenecks by simply taking the alignment process away. Sailfish starts with a known transcriptome and creates unique k-mer index from the reference transcriptome. Then it runs through each sequenced read and catalogs the kmer counts in the read. Essentially this is the “alignment-step”. Instead of doing a real alignment, it keeps a record of k-mers from RNA-seq reads if they are present in the reference k-mer index. For example, a 100 base read can create 81, 20-mers and it will increase count for corresponding 20-mer, if it is present in the reference k-mer index. At the end of the k-mer indexing process, for each unique “k-mer” in the reference transcriptome we get counts observed in the RNA-seq data.

Human reference transcriptome has over 60 Million k-mers of size 20 bases and of these close to 40 Million k-mers appear at least once. And in the data that they used there were only about 150,000 distinct k-mer equivalent classes with non-zero counts. This makes EM sparse matrix in Sailfish almost a constant size; ~150K x number of human transcripts. And the size does not grow exponentially with  RNA-seq data depth.  With the help of SQUAREM, Sailfish could make the convergence of EM even faster. The result is an isoform quantitation method that is super-fast and not memory hogging. Check the figure from the paper below that compares the alignment plus quantitation speed of RSEM, eXpress, an Cufflinks with Sailfish. On the real data, Sailfish finished in just about 10 mins, while RSEM running complete EM on all reads took close to 47 hours.

Saifish-speed-comparison

Saifish-speed-comparison (Source: Sailfish arXiv manuscript)

The biggest novelty of the paper is the “alignment-free” aspect and using efficient hash-table to create the index. Taking the alignment away has paved way for simpler EM as well. Although, the further reduction of EM complexity by collapsing millions of k-mers in to equivalent classes greatly improve computational complexity, it is not totally novel.  Isoform estimating tools, IsoEM and MMSEQ employ similar techniques to reduce the EM matrix size by collapsing reads with similar profiles as “equivalent classes” and improve speed. IsoEM uses “Computer Science-y” way to reduce the parameter space and MMSEQ uses a similar (to Sailfish) but a model based approach to collapse reads with similar profiles into equivalent classes.

Sailfish paper completely missed comparing and referencing these two papers.  This could be a genuine oversight.  However, there is no doubt that the “alignment-free” first step and the fixed “k-mer” classes could easily beat the other approaches on speed. It will be great if the Sailfish manuscript cites these two papers that employ “equivalent classes” technique to reduce the EM complexity. For sure, the authors would appreciate the previous work in RNA-seq  anlaysis domain.

Another biggest advantage of the “alignment-free” approach is that isoform quantitation can be immune to SNPs and indels, if the k-mer size is small enough. Note that, only the k-mers that do not have a match in the k-mer index will be disregarded. Another way to look at the hashing is that each read gets multiple chances to be counted in expression quantitation. So by losing a few k-mers corresponding to variants from a read, we are probably not going to lose all the information from a read.

Sailfish software written in C++ is available for free download. Check the website for more details

Would love to test and comment more on Sailfish, but it has to wait for some more time. The paper was on arXiv just this Sunday with the program ready to download and here is our quick comment on the paper within two days.  #OA rocks.

Thanks to @Caseybergman, who tweeted about the blog post by RNA-seq guru Lior Pachter on SailFish.

Comments

  1. Rob Patro says:

    Thanks for your great post on Sailfish. I think you did a very nice job of summarizing the main ideas behind the approach and the
    main results of the paper. Thank you for pointing out the references to IsoEM and MMSEQ. We will include them in
    an updated arXiv pre-print and in any eventual journal version of the paper. Please let us know if you or your readers have any questions or comments on the pre-print.

    • nextgenseek says:

      Thanks for commenting. I am really glad that you think I did a good job summarizing your work. It is a really nice piece of work and a well written manuscript. Sure. will be in touch if I need more details about the work. BTW love the next-gen-sailfish logo :)

    • So can Sailfish find novel isoform?

    • Zengmiao Wang says:

      Hello Rob Patro,
      I am a Ph.D candidate. May I ask you a few questions?

      First one, I do not completely understand the concept of “equivalent classes”, is there a case that a k-mer belongs to two different “equivalent classes”? Could you give me some specific examples?

      Second one, in the paper, you wrote the EM procedure directly, but what was the likelihood function ? or the complete likelihood function, or the hidden variable?

      Thank you very much

  2. Rob Patro says:

    Hi Marco,

    No; Sailfish is a reference-based method for isoform quantification. The “reference” can be any target set of sequences of your choice (e.g. a set of isoforms discovered via another tool on a related dataset), but Sailfish will not attempt to assemble and report novel isoforms. Though we’re interested in the problem of novel isoform detection, we haven’t really begun looking into it deeply yet.

  3. Congratulates on this great piece of work! Speaking of collapsing of reads (not k-mers) into equivalent classes, it was discussed in a more theoretical manner in this paper http://projecteuclid.org/euclid.ss/1307626566

  4. nextgenseek says:

    Thanks a lot for pointing the reference. I knew of this work, sorry I missed it earlier.

  5. nextgenseek says:

    Here is the same article on arXiv. Statistical Modeling of RNA-Seq Data http://arxiv.org/abs/1106.3211

Trackbacks

  1. […] we had quick a blogpost on Sailfish almost immediately @nextgenseek (the only useful post from @nextgenseek in a longtime :-)).  The arXiv manuscript also allowed a […]

  2. […] abstracts that look interesting. One of them is RNA-Skim, a k-mer based RNA-seq method inspired by Sailfish.  RNA-Skim abstract claims that it uses less than 4% of the k-mers and less than 10% of the CPU […]

  3. […] is a new method, inspired by the recently published method Sailfish, that does quantitation of transcript/gene expression from RNA-seq […]

Speak Your Mind

*