This is a quick post with a summary of, and thoughts* on, a new paper that got accepted at the upcoming ISMB 2014 and published in the special issue of Bioinformatics:
- RNA-Skim: a rapid method for RNA-Seq quantification at transcript level, by Zhaojun Zhang and Wei Wang
The paper is from a duo at UNC and UCLA. (*Thoughts are always half-baked :))
The RNA-Skim paper is a great example of one of the benefits of publishing early on an open preprint server. Sailfish was first posted on the open preprint server arXiv on 16th August 2013 and was published in Nature Biotechnology on 20th April 2014. The RNA-Skim paper, meanwhile, must have been submitted before 10th January (the ISMB 2014 paper submission deadline) and was conditionally accepted in late March.
It looks like the early publication of Sailfish inspired RNA-Skim. For sure, the early availability of Sailfish enabled the authors to compare against it, present a method that is faster than Sailfish at quantitation, and get it accepted in a journal in the same month as the original paper. It would have been great to see RNA-Skim on a preprint server too :) [Not sure about ISMB’s open preprint policy here.]
What is new in RNA-Skim?
Like Sailfish, RNA-Skim uses a k-mer based approach instead of the standard “alignment” that we know of. If you remember, Sailfish creates a k-mer index of a transcriptome, where the transcriptome is the set of all known transcripts/isoforms and the k-mer index is the mapping from each unique k-mer in the transcriptome to the transcripts it belongs to.
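To make the idea of a k-mer index concrete, here is a minimal Python sketch. It is only an illustration of the mapping described above, not Sailfish’s actual data structure; the function names and the toy transcripts are made up for this post, and it ignores details such as reverse complements.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return dict(index)


# Hypothetical toy transcriptome: two isoforms that share a prefix.
toy_transcripts = {
    "tx1": "ACGTAC",
    "tx2": "ACGTTT",
}
toy_index = build_kmer_index(toy_transcripts, k=4)
```

Here the k-mer `ACGT` maps to both toy transcripts (a multi-mapping k-mer), while `GTAC` maps only to `tx1`.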
The “alignment step” in Sailfish consists of looking up every k-mer of every read in the index and keeping a record of the hits. This “alignment” step is essentially a k-mer counting step, where one counts how often each reference k-mer occurs among the k-mers from the reads. And one can use fast k-mer counting algorithms to do this task efficiently.
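A rough sketch of that counting step in Python, assuming a plain set of reference k-mers and ignoring reverse complements and sequencing errors. The names and toy reads are invented for illustration; real tools use much faster, specialised k-mer counters.

```python
from collections import Counter


def count_reference_kmers(reads, reference_kmers, k):
    """Count how often each reference k-mer occurs among the reads' k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Read k-mers that are not in the reference are simply skipped.
            if kmer in reference_kmers:
                counts[kmer] += 1
    return counts


# Hypothetical reference k-mers and reads.
toy_reference_kmers = {"ACGT", "CGTA", "GTTT"}
toy_reads = ["ACGTA", "TACGT", "GGGGG"]
toy_counts = count_reference_kmers(toy_reads, toy_reference_kmers, k=4)
```

No read is ever placed on a transcript here; the only output is a count per reference k-mer, which is why the step can be so much cheaper than conventional alignment.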
The “quantitation step” in Sailfish cleverly uses the EM algorithm to probabilistically resolve all the multi-mapping k-mers (the k-mers that map perfectly to multiple transcripts) and get expression estimates from the k-mer x transcript “alignment” profile.
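For intuition, here is a stripped-down EM loop in Python. It is a sketch under strong simplifying assumptions (no transcript-length correction, each k-mer counted once per transcript, a fixed number of iterations) and is not the EM that Sailfish actually implements. All names and numbers are hypothetical.

```python
def em_abundances(kmer_counts, kmer_to_transcripts, n_iter=100):
    """Estimate relative transcript abundances from k-mer counts with a toy EM."""
    transcripts = sorted({t for owners in kmer_to_transcripts.values() for t in owners})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    # Keep only k-mers that were observed and are present in the index.
    observed = {kmer: c for kmer, c in kmer_counts.items()
                if c > 0 and kmer in kmer_to_transcripts}
    total = sum(observed.values())
    if total == 0:
        return theta
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each k-mer's count among its transcripts in
        # proportion to the current abundance estimates.
        for kmer, count in observed.items():
            owners = kmer_to_transcripts[kmer]
            norm = sum(theta[t] for t in owners)
            if norm == 0.0:
                continue
            for t in owners:
                expected[t] += count * theta[t] / norm
        # M-step: new abundances are the normalised expected counts.
        theta = {t: expected[t] / total for t in transcripts}
    return theta


# Hypothetical example: one unique k-mer per transcript plus one shared k-mer.
toy_kmer_to_transcripts = {
    "AAAA": {"tx1"},
    "CCCC": {"tx2"},
    "GGGG": {"tx1", "tx2"},
}
toy_kmer_counts = {"AAAA": 30, "CCCC": 10, "GGGG": 40}
toy_theta = em_abundances(toy_kmer_counts, toy_kmer_to_transcripts)
```

In this toy case the unique k-mers favour `tx1` three to one, so the EM ends up assigning the shared k-mer’s counts in that same ratio and the estimates settle at about 0.75 for `tx1` and 0.25 for `tx2`.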
RNA-Skim takes a similar approach with a twist. It does not use all the k-mers for quantitation. Instead, it uses only a subset of k-mers that are special and more informative, which it calls “sig-mers”. The use of informative sig-mers simplifies the EM algorithm and enables faster quantitation than Sailfish. The paper claims that using all k-mers is so redundant that restricting to sig-mers does not cost any accuracy.
So, what are these “sig-mers”, aka “special” k-mers?
Instead of working with all the transcripts as a whole, RNA-Skim clusters/partitions the transcripts based on sequence similarity (a k-mer based similarity measure). For each cluster, it finds all k-mers and uses a subset of the k-mers that are unique to the cluster. These unique k-mers, aka “sig-mers”, are present only in a single cluster (not present in any other cluster). In the toy example from the paper shown above, there are two clusters in the transcriptome; cluster 1 has three sig-mers and cluster 2 has two sig-mers. The biggest advantage of working with only the sig-mers from each cluster is that the EM part gets much simpler. One can run EM separately on the sig-mers from each cluster, and the number of k-mers that each EM deals with is really small.
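Here is a small Python sketch of the “unique to one cluster” idea. It assumes the clustering is already given and returns every k-mer that occurs in exactly one cluster; how RNA-Skim clusters the transcripts and how it picks its final subset of sig-mers are not shown. The clusters and sequences are made up and are not the toy example from the paper.

```python
from collections import defaultdict


def find_candidate_sigmers(clusters, k):
    """Return, per cluster, the k-mers that occur in that cluster and no other."""
    kmer_to_clusters = defaultdict(set)
    for cluster_name, sequences in clusters.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                kmer_to_clusters[seq[i:i + k]].add(cluster_name)
    sigmers = {cluster_name: set() for cluster_name in clusters}
    for kmer, owners in kmer_to_clusters.items():
        if len(owners) == 1:
            (only_cluster,) = owners
            sigmers[only_cluster].add(kmer)
    return sigmers


# Hypothetical clusters of similar transcripts.
toy_clusters = {
    "cluster1": ["ACGTAC", "ACGTAG"],
    "cluster2": ["TTGTAC", "TTGTAA"],
}
toy_sigmers = find_candidate_sigmers(toy_clusters, k=4)
```

The k-mer `GTAC` occurs in both clusters, so it is dropped; `ACGT` is shared by both transcripts of cluster 1 but by no other cluster, so it still qualifies.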
Speed: RNA-Skim vs Sailfish
Using 44 million PE reads from a real RNA-seq dataset, the RNA-Skim authors showed that RNA-Skim is faster than Sailfish, and also faster than TopHat + Cufflinks, Bowtie + RSEM, and Bowtie + eXpress. RNA-Skim is not multithreaded yet, so the comparison below is of total CPU time across the methods.
Why might “sig-mers” be good enough for RNA-seq quantitation?
Although it is clear that using “sig-mers” gives a huge computational advantage, it is not that clear why it is also equally good in terms of accuracy. Here is a half-baked rationale (from me) for why sig-mers are informative and may be good enough for quantitation.
The underlying problem that sig-mers address is the challenge of dealing with multi-mappability. When a read aligns to multiple transcripts, finding its true origin is difficult. EM based approaches use the information from all the reads to estimate a probability for each of a read’s alignments.
The multi-mappability problem gets worse with k-mers than with whole-read alignment. Since a k-mer is just 20 or 30 bases long and the transcriptome is complex, with many homologous transcripts, a lot of k-mers may have a large number of multi-mappings. One dumb solution is to ignore multi-mapping k-mers completely. As is easy to see, this does not make sense and will result in wrong expression estimates. Sailfish, on the other hand, uses all the k-mer multi-mappings as they are, with some clever tricks to simplify the EM computations.
Among the multi-mapping k-mers, some are informative and some are not. For example, a k-mer mapping to a large number of transcripts may be less informative than a k-mer mapping to a select set of transcripts or to a single transcript.
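To see why shorter sequences multi-map more, here is a tiny Python sketch that measures the fraction of distinct k-mers shared by more than one transcript, for different values of k. The two sequences are hypothetical “homologous” transcripts that differ at a single base.

```python
def multimapping_fraction(transcripts, k):
    """Fraction of distinct k-mers that occur in more than one transcript."""
    owners = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(name)
    if not owners:
        return 0.0
    multi = sum(1 for names in owners.values() if len(names) > 1)
    return multi / len(owners)


# Hypothetical homologous transcripts that differ at one position.
toy_homologs = {
    "txA": "ACGTACGGTCA",
    "txB": "ACGTACGCTCA",
}
short_k_fraction = multimapping_fraction(toy_homologs, k=3)
long_k_fraction = multimapping_fraction(toy_homologs, k=8)
```

With k=3, five of the eleven distinct k-mers are shared between the two transcripts. With k=8, every window covers the differing base, so no k-mer is shared at all.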
RNA-Skim’s sig-mer approach seems to take the middle ground and tries to find and use the informative k-mers. It uses multi-mapping k-mers, but with the restriction that the multi-mapping k-mers should be from transcripts with some sequence similarity. By clustering transcripts and using only the k-mers unique to a cluster, RNA-Skim allows multi-mapping k-mers only within the cluster. Provided the clustering is right, these sig-mers are informative and can give estimates as accurate as Sailfish’s. If the clustering is not right, though, one may not have enough k-mer coverage per transcript and may end up with incorrect estimates.
It can also result in huge computational benefits:
- a smaller number of k-mers to work with, and they are the most informative ones (either they are unique k-mers or they have fewer multi-mappings, and only to similar transcripts that are likely evolutionarily related too).
- one big EM algorithm is now split into multiple, but smaller EM computations.