Just a few weeks ago Nature Methods published an interesting paper on Bayesian approach that takes care of technical variations in Single Cell RNA-seq data and perform differential expression analysis . This is one of many interesting papers that came out recently addressing technical variations in single-cell RNA-seq data.
- Bayesian approach to single-cell differential expression analysis by Peter V Kharchenko, Lev Silberstein & David T Scadden, Nature Methods, (2014) doi:10.1038/nmeth.2967
Here are some thoughts on challenges in analyzing single-cell RNA-seq data and (half-baked) short summary/statistical model used in the paper.
Variability in Single-Cell RNA-seq
So what is the problem with single-cell transcriptomics? In a nut shell, gene expression is stochastic and there is a lot of cell-to-cell expression variability for real. For example, transcriptional bursting, where genes are transcribed in a sudden burst is a big biological component of the variability.
In addition to the real biological variability, single cell transcriptomics can have a huge technical variability. Single-Cell RNA-seq protocols takes mRNA sample from a single-cell, which is tiny in amount and amplifies over million fold so that they can be sequenced. The low sample quantity plus amplification steps can cause genes to be “missed” from sequencing.
And we are just beginning to understand the cause and effect of the technical variability in single-cell RNA-seq. And thats why we are seeing a flood of papers on it. The combination of biological and technical variabilities increase the noise in the data.
Statistical Model for Standard RNA-seq
Early statistical models for standard/pooled RNA-Seq count data used Poisson distribution for differential expression analysis. The rationale behind using Poission distribution was that if the reads are sampled independently from a population then it will follow multinomial distribution and it can be approximated to Poisson distribution.
Under Poisson model, mean and variance are be the same. What this means in RNA-seq is, if a gene has mean gene expression n, then if we look at the same gene’s expression in multiple samples (in theory tech. reps.) we expect the variance is also n under Poisson model.
However, this assumption is almost never true in practice and one finds variance is much larger than mean. This extra-variation not explained by Poisson model is called over-dispersion.
A simple Negative Binomial distribution with two parameters accounts for the over dispersion and it is the go-to model for total gene expression analysis using RNA-seq. And the whole business of differential expression analysis methods like edgeR and DESeq lies in how can we better model the variance using the data.
Statistical Model for Single-Cell RNA-seq
In case of Single-Cell there is more twist. If we compare gene expression from two single-cell RNA-seq samples, the overall correlation will be small and there will be a lot of genes that are expressed in one cell but not the other as shown in the figure. In addition to the over-dispersion and big outliers, one finds a lot of “drop out” genes that is missing in one cell.
By looking at pairwise comparison of all single cell RNA-seq samples, the authors found two common threads about these “drop out” events.
- some cells had higher dropout rate than the other
- dropouts are more frequent with low expressed genes.
This variation from dropout is in addition to the standard over-dispersion that is seen like pooled RNA-seq data.
To account for the extra variation due to dropouts, the paper models each gene’s expression as a mixture of two probabilistic process;
- Negative-binomial distribution for the genes that gets amplified and gets sequenced
- Poisson distribution for the genes that gets dropped out.
And the contribution of each distribution is determined by the amplitude of gene expression, as they found low expressed genes gets dropped out more frequently.
Another suitable model one can think of using under such scenario is “zero inflated negative-binomial-model”. However, it makes the assumption that the dropout events result in “zero counts”. That is not the case in the single-cell RNA-seq data that the authors used.
Skipping a whole lot of the paper, the Bayesian approach uses the Single-Cell RNA-seq data itself to learn about the parameters describing these distributions. The authors used the comparison all pairs of single-cell RNA-seq data with in a group and learned about expected expression of genes.
This paper used data single-cell RNA-seq data from two papers published recently. One is the data from Stephen Quake’s group’s study of 92-cell set consisting of mouse embryonic fibroblast (MEF) and embryonic stem (ES) cells and the other is the data from multiple developmental stages of early mouse embryos of B6xCAST F1 hybrids.