One of the most common applications of single-cell RNA-sequencing (RNA-seq) data is characterizing the heterogeneity of cell populations and identifying new cell types or sub-populations. Typically, one takes single-cell RNA-seq data, quantifies gene expression in every cell, and uses the gene-expression profiles across cells to characterize cell-to-cell heterogeneity with clustering or dimensionality-reduction techniques. Although single-cell RNA-seq experiments typically have lower read depth per cell, the number of single cells being profiled keeps growing as single-cell technologies improve.
A really neat methods paper for analyzing single-cell RNA-seq data came out on bioRxiv last week from David Tse’s group and Lior Pachter:
Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts, by Vasilis Ntranos, Govinda M. Kamath, Jesse Zhang, Lior Pachter, David N. Tse
The paper makes clever use of Kallisto and its underlying data to make single-cell RNA-seq analysis faster.
RNA-seq quantitation methods that do not throw away multi-mapping reads, such as RSEM, Sailfish/Salmon, and Kallisto, typically use an EM algorithm to probabilistically assign multi-mapping reads to transcripts. As the number of reads in an RNA-seq dataset increases, so does the computational cost of the EM algorithm. To reduce that burden, clever quantitation approaches exploit the fact that many reads share the same alignment pattern, i.e. they all align to the same set of transcripts. Reads with the same alignment pattern are grouped into an equivalence class, and the number of reads in each class is recorded as its count.
Collapsing the read alignment profile into equivalence classes offers a huge computational advantage when running the EM algorithm. Instead of running EM on an alignment matrix of size N x T (number of reads x number of transcripts), one works with the equivalence-class matrix (number of equivalence classes x number of transcripts), which is much smaller and whose size is bounded by the number of distinct alignment patterns rather than growing in proportion to read depth.
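To make the collapsing step concrete, here is a minimal Python sketch. It is a toy illustration, not how Kallisto, Salmon, or RSEM are actually implemented: the read list, the transcript names (t1, t2, t3), and both function names are made up, and the EM update ignores transcript lengths and other details that real tools model.

```python
from collections import Counter


def collapse_to_equivalence_classes(read_alignments):
    """Count reads per alignment pattern (the set of compatible transcripts)."""
    # Reads with no compatible transcript are skipped.
    return Counter(frozenset(ts) for ts in read_alignments if ts)


def em_on_classes(class_counts, n_iter=100):
    """Simplified EM over equivalence classes (no transcript-length correction)."""
    transcripts = sorted({t for ec in class_counts for t in ec})
    total = sum(class_counts.values())
    # Start from uniform relative abundances.
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: split each class count among its transcripts,
        # in proportion to the current abundance estimates.
        for ec, count in class_counts.items():
            denom = sum(abundance[t] for t in ec)
            for t in ec:
                assigned[t] += count * abundance[t] / denom
        # M-step: renormalize the assigned counts into abundances.
        abundance = {t: assigned[t] / total for t in transcripts}
    return abundance


# Hypothetical toy data: each entry lists the transcripts one read is compatible with.
toy_reads = [
    ("t1",), ("t1", "t2"), ("t1", "t2"),
    ("t2", "t3"), ("t1",), ("t3",),
]

classes = collapse_to_equivalence_classes(toy_reads)
abundance = em_on_classes(classes)

for ec, count in classes.items():
    print(sorted(ec), count)
print(abundance)
```

Each EM iteration loops over the equivalence classes rather than over individual reads, so its cost depends on the number of distinct alignment patterns, not on the number of reads.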
Instead of using the equivalence class counts (transcript-compatibility counts, or TCCs, as the paper calls them) to quantify expression abundances, this paper uses them directly to characterize single-cell data with existing clustering approaches. Its main result is that clustering on equivalence class counts, rather than on expression abundances, can identify cell types. The authors used a modified version of Kallisto to export the equivalence class counts.
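As a rough sketch of what clustering on these counts could look like, the Python snippet below turns each cell's count vector into a probability distribution and computes pairwise distances between cells. The Jensen-Shannon distance is used here purely as an illustrative choice of distance between distributions, and the count matrix and function names are invented for the example; none of this is taken from the authors' code.

```python
import math


def normalize(counts):
    """Turn a cell's raw class counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_distances(count_matrix):
    """Cells x cells matrix of Jensen-Shannon distances (sqrt of the divergence)."""
    profiles = [normalize(row) for row in count_matrix]
    n = len(profiles)
    return [
        [math.sqrt(max(js_divergence(profiles[i], profiles[j]), 0.0))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy matrix: rows are cells, columns are equivalence classes.
toy_counts = [
    [90, 10, 0],   # cell 0
    [80, 15, 5],   # cell 1, similar to cell 0
    [5, 10, 85],   # cell 2
    [0, 20, 80],   # cell 3, similar to cell 2
]

dist = pairwise_distances(toy_counts)
for row in dist:
    print([round(d, 3) for d in row])
```

The resulting distance matrix can be handed to any clustering method that accepts precomputed distances. No expression abundances are estimated along the way, which is where the speed-up comes from.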
The modified Kallisto is available here, and the equivalence class counts can be obtained with the command “kallisto pseudoalign”. Note that the basic idea can also be used with equivalence class counts from other tools such as Sailfish and Salmon. The paper re-analyzed two published single-cell RNA-seq datasets (the 271 primary human myoblasts from Trapnell et al. and the 3005 mouse brain cells from Zeisel et al.) and showed that the method works as well as previous approaches, with a huge computational gain. The code used for all analyses in the paper can be obtained from the GitHub page.