kallisto paper summary: Near-optimal RNA-seq quantification

In case you missed it, kallisto, the new RNA-seq quantification method from Lior Pachter‘s group, was one of the biggest highlights from this year’s The Biology of the Genomes conference. kallisto team made the kallisto software immediately available and the preprint describing kallisto was available at arXiv pretty soon after the #BOG15.

Here is a detailed summary of some interesting aspects of kallisto paper.

What is kallisto?

kallisto is a new approach to quantify isoform level expression abundances from RNA-seq data and it is really fast. kallisto can take 30 Million RNA-seq reads as input and estimate expression abundance in five minutes with 20 cores (in parallel). It is fair to say that the origin of kallisto is inspired by Sailfish, a k-mer based fast quantitation method that came out in 2013. Sailfish takes 54 mintues using the same data and resources. The manuscript also show that, kallisto is not just fast, but also more accurate (Fig 2a). And that’s why the title “near optimal”, in speed and accuracy.

Pseudo-alignment in kallisto

One of the first tasks of any sequencing analysis is to find where all each of the reads come from in the genome? In case of RNA-seq: which gene or isoform? Standard alignment procedures try to find the read origin by aligning reads to the genome or transcriptome. RNA-seq alignments were slow, mainly because the aligner needed to be splice-aware. Thanks to recent developments, like STAR and HISAT, the alignment process is getting much faster and memory efficient, but still slower.

In a typical RNA-seq analysis estimating expression abundances, we are mainly interested in finding out the number of reads originated from each gene/transcript. In other words, it is a simply a counting exercise (with multi-read complexity) and do not need base level resolution for quantifying expression. Until recently, we had to rely on full-alignment approaches.

A major force behind kallisto being super-fast is its pseudo-alignment process. kallisto’s pseudo-alignment approach finds potential transcript origins of RNA-seq reads really fast using k-mers in the read and the transcriptome De Bruijn graph built from the transcritome, instead of doing base level read alignment.

One of the main distinguishing features of kallisto from Sailfish is that, Sailfish splits the reads into k-mers and lookup the kmers from the read in to k-mers from the transcriptome. The k-mer based approach throws away the information in the reads, thereby leading to loss of useful information affecting the accuracy of expression abundance. It is also need to be mentioned that Sailfish’s next iteration Salmon also does not shred the kmers and throw away information present in k-mers of a read. Remember having a chat with Rob Patro on twitter about the keeping “read-info” in Sailfish and came to know a bit about Sailfish’s next version Salmon. Salmon is around for about a year and uses different algorithms to do pseudo-alignment or lightweight alignment in Salmon-speak.

How does Pseudo-alignment work in kallisto?

The transcriptome De Bruijn graph (T-DBG) is constructed from k-mers in the transcriptome. Basically each k-mer in the transcriptome is a node in the T-DBG and if the k-mer occurs in 3 transcripts, it will have three colors.

Naively, the Pseudo-alignment process takes a read and looks up the k-mers in the read in the T-DBG.  For each k-mer, we can get the potential transcripts it could have come from the k-mer colors in the T-DBG. And the set of all potential transcripts a read could have come from (or aligned to ) can be obtained by simply intersecting the transcript list for each k-mer in the read. I think, if we union instead of intersection, we might get an implementation that is sort of like Sailfish at read level. Need to beef up my knowledge on T-DBG and the preprint also seems to lack a bit on algorithmic details on T-DBG and hope there is more in the final paper.

Pseudo-alignment by k-mers

Pseudo-alignment by k-mers

kallisto has implemented a really cool trick that speeds up this whole pseduo-alignment process. In the T-DBG the neighboring k-mers often belong to the same transcripts. kallisto T-DBG constructs a contig (or set of contigs) from the continuous k-mers that belong to the same set of transcripts.

kallisto uses the contig information when the first k-mer from a read is looked up. Basically, kallisto finds the distances to the end of contigs when the first k-mer of a read is pseudo-aligned, then it skips to the last k-mer in the read depending on the distances. Since all the k-mers in a contig gives the same set of transcripts and remaining kmers in a short read will also get the same set of transcripts and therefore they are redundant.  Therefore, kallisto does not need to do the pseudo-alignment for every k-mer in a read and skips the majority of k-mers. Actually, for over 60% of the reads, kallisto uses only two k-mers in pseudo-alignment process. Yes, kallisto literally throws away 38% of ~60% of 100 bp reads :-), because they do not give us anything new (default k-mer size of kallisto is 31).

Another trick that kallisto uses that greatly helps in speeding up is their use of simple model for equivalent classes read. What is an equivalent class? When we align millions of reads to transcripts, often many reads have same alignment or pseudo-alignment profiles, .i.e. these reads align to the same set of transcripts.  An equivalent class is basically the set of reads that aligns to the same transcripts.

These tricks enable kallisto to work with compatibility matrix of equivalent classes instead of read level compatibility matrix as illustrated in the figure shown (1’s corresponding to the pseudo-alignment of read or class to a transcript). Instead of using these reads separately in the EM algorithm, kallisto uses the equivalent classes. This greatly reduces complexities in EM algorithm.

Compatibility Matrix and Equivalent Class in kallisto

Compatibility Matrix and Equivalent Class in kallisto

The use of equivalent class is a huge advantage as the size of matrix that EM algorithm deals with is much smaller and therefore the running time of EM algorithm is much smaller. In addition, the number of equivalent classes also do not grow linearly as the number of reads in a sample grows. The concept of equivalent class in doing RNA-seq analysis is not new, the programs isoEM, MMSEQ, and Sailfish uses it in their implementation. However, finding which reads belong to each equivalent class has been a bit complex and time-consuming until k-mer based approaches like kallisto and Sailfish came along.  The EM algorithm in kallisto also seems to be much simpler. kallisto’s EM algorithm estimates the probabilities of each nonzero element in all equivalent classes.

Bootstrapping in kallisto

kallisto’s speed also enables bootstrap implementation to estimate the confidence interval for each transcript’s expression abundance. Stepping back a little bit, the EM algorithm results in maximum-likelihood estimate of expression abundances for transcripts. It is only a point estimate of expression abundance given the data. what this means is for some genes the estimated expression estimates may not be accurate.

In order to find how reliable each expression estimate, we can compute standard error/confidence-intervals for each estimates. One way to estimate the confidence intervals is to use bootstrapping. Bootstrapping is a pretty cool statistical technique developed by Brad Efron in the 70s. The way it works is, one samples the observed data with replacement and estimate the parameter of interest a bunch of times. And uses the bootstrapped estimate to compute standard error.

Bootstrapping Algorithm by Brad Efron

Bootstrapping Algorithm from Brad Efron’s book

In case of RNA-seq, naively, we could sample pseduo-alignment profiles with replacement and re-estimate expression abundances multiple times. And standard error/variance from bootstrapped abundance estimates will tell us how reliable our expression estimates from observed pseudo-alignment. For example, high variance of a gene’s expression informs us that our abundance estimate is not reliable. This could typically occur for genes with lower expression abundance and genes with large contributions from multi-mapping reads. For any post-hoc analysis it is a very useful information.

kallisto samples counts for each equivalent class using the multinomial probability over the observed class counts and bootstraps over multinomial distribution on the equivalent classes sizes many times. For example, for the toy example used above, the bootstrapped equivalent class sizes are

> ec_prob = c(2/6,1/6,1/6,2/6)
> number_of_bootstraps = 10
> number_of_reads = 6
> rmultinom(n=number_of_bootstraps, size=number_of_reads, prob=ec_prob)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 4 2 2 2 2 0 1 2 4
[2,] 0 1 0 2 1 0 0 0 1 0
[3,] 1 0 0 1 1 0 4 3 1 1
[4,] 3 1 4 1 2 4 2 2 2 1

Then kallisto uses EM algorithm on the bootstrapped count. Since the EM algorithm over the equivalent class is much simpler, it can quickly get bootstrapped estimates by running EM with new equivalent class counts and compatibitlity matrix and get estimate of variance for expression abundance estimates.

Bootstrapped confidence intervals are really nice. In the most simplest case, it lets your to flag some gene’s expression.  However, I think we need better methods that can make use of this in downstream analysis. Cufflinks and RSEM have confidence intervals on the expression estimates by using different approaches. And I also very vaguely remember, the confidence interval feature on Cufflinks were updated the least, mainly because not many used it.

You might also be interested in these blogs posts on kallisto/pseudo-alignment/light-weight alignment.

Comments

  1. “30 Million RNA-seq reads”? Really fast. Very detailed information about kallisto.

Speak Your Mind

*