Single-cell RNA-seq technology has gained a lot of traction in the last few years, as it can help address a variety of interesting biological problems, ranging from development to cancer heterogeneity. Since the first paper demonstrating the feasibility of single-cell transcriptomics, the number of cells one can assay has increased continuously. Here is a quick look at the increase in the number of cells used in single-cell RNA-seq studies in the last few years. The plot does not include the most recent publications on new single-cell RNA-seq technologies, DropSeq and inDrops, which assayed over 44,000 cells and 5,000 cells respectively.
Just like any new technology in high-throughput genomics, single-cell RNA-seq needs new analysis methods. Although some of the methods available for bulk RNA-seq can be used for analyzing single-cell RNA-seq data, the unique nature of single-cell data brings additional challenges. Recent papers on single-cell RNA-seq show that we are slowly beginning to grapple with the biases and challenges in analyzing these data.
One important aspect that has not received much attention is good experimental design for single-cell RNA-seq experiments. Stephanie C Hicks from Rafael A Irizarry’s group at Harvard has a really interesting preprint on bioRxiv looking at the prevalence of biases and batch effects in published single-cell RNA-seq studies.
When one is interested in finding differences between two experimental groups (like control vs wild type or cancer vs normal) and the experiments on the two groups were done at two different times, the observed differences can be either real or due to the experimental batches.
When the experimental batch variable (in this case, the date) correlates completely with the experimental group, the study is said to be completely confounded, and we cannot determine the real cause of the observed differences.
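To make "completely confounded" concrete, here is a minimal sketch that scores the association between batch labels and group labels with Cramér's V, which is 0 for a balanced design and 1 when batch determines group. Cramér's V is my choice for illustration and is not necessarily the statistic used in the preprint; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(batches, groups):
    """Cramér's V between two categorical labelings of the same samples."""
    if len(batches) != len(groups):
        raise ValueError("batches and groups must have the same length")
    n = len(batches)
    joint = Counter(zip(batches, groups))
    batch_totals = Counter(batches)
    group_totals = Counter(groups)
    k = min(len(batch_totals), len(group_totals))
    if k < 2:
        raise ValueError("need at least two batches and two groups")
    # Pearson chi-squared statistic over the batch-by-group contingency table.
    chi2 = 0.0
    for b, nb in batch_totals.items():
        for g, ng in group_totals.items():
            expected = nb * ng / n
            observed = joint.get((b, g), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Toy design 1: every control cell was run on day 1, every treated cell on day 2.
confounded = cramers_v(["day1"] * 4 + ["day2"] * 4, ["ctrl"] * 4 + ["trt"] * 4)
# Toy design 2: each day contains two control and two treated cells.
balanced = cramers_v(["day1", "day1", "day2", "day2"] * 2, ["ctrl", "trt"] * 4)
print(confounded, balanced)  # prints 1.0 0.0
```

A value near 1 means the batch label alone nearly tells you the group label, so group differences cannot be separated from batch differences.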
Information on experimental batches is hard to get, as batches are usually not specified in published papers. Just as Yoav Gilad’s group inferred experimental batches in the ENCODE human and mouse expression data using information from FASTQ files, Hicks et al. inferred the sequencing batches from FASTQ files. Of the 15 single-cell RNA-seq studies examined, only eight were interested in finding differences between two or more groups. In these studies, the confounding due to batch ranged from 82% to 100%, that is, perfect confounding (see Table 1 from the paper, reproduced above).
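As a rough illustration of how a batch can be read off a FASTQ file, the sketch below parses the first read header of each sample and groups samples by instrument, run, flowcell and lane. It assumes the colon-separated Illumina header layout used since Casava 1.8; older headers and headers rewritten by public archives will not parse and are grouped under None. The function names, sample names and example headers are hypothetical, and this is not the authors' actual pipeline.

```python
from collections import defaultdict


def parse_batch(header):
    """Return (instrument, run, flowcell, lane) from a Casava 1.8+ style
    FASTQ header line, or None if the header does not match that layout."""
    header = header.strip()
    if not header.startswith("@"):
        return None
    parts = header[1:].split()
    if not parts:
        return None
    # Expected layout: instrument:run:flowcell:lane:tile:x:y
    fields = parts[0].split(":")
    if len(fields) != 7:
        return None
    return tuple(fields[:4])


def group_samples_by_batch(first_headers):
    """Map each inferred batch key to the sample names that share it."""
    batches = defaultdict(list)
    for sample, header in first_headers.items():
        batches[parse_batch(header)].append(sample)
    return dict(batches)


# Hypothetical first-read headers for three cells.
headers = {
    "cell_A": "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
    "cell_B": "@EAS139:136:FC706VJ:2:1101:1234:5678 1:N:0:GGCTAC",
    "cell_C": "@EAS139:140:FC801AB:5:1101:4321:8765 1:N:0:TTAGGC",
}
for batch, samples in group_samples_by_batch(headers).items():
    print(batch, samples)
```

Here cell_A and cell_B share a run, flowcell and lane, while cell_C falls into a separate inferred batch.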
Digging deeper, the authors found that the proportion of detected genes is a major source of technical cell-to-cell noise. There was a strong correlation between the first principal component and the proportion of detected genes. In the cases the authors could check, they found that batch explained more of the variation in the proportion of detected genes than the biological variable of interest did. Basically, this suggests that “Batch effects lead to differences in detection rates, which lead to apparent differences between biological groups”.
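The comparison described above can be sketched on toy data: compute the proportion of detected genes for each cell, then ask how much of its variance is explained by batch versus by the biological group. The measure here is the R-squared of a one-way ANOVA (between-label sum of squares over total sum of squares), which is my assumption for illustration rather than the exact analysis in the preprint. The function names and counts are made up.

```python
from collections import defaultdict


def detection_rate(counts):
    """Proportion of genes with a nonzero count in one cell."""
    return sum(1 for c in counts if c > 0) / len(counts)


def variance_explained(values, labels):
    """Fraction of the variance in `values` explained by a categorical
    label (between-label sum of squares / total sum of squares)."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    by_label = defaultdict(list)
    for value, label in zip(values, labels):
        by_label[label].append(value)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_label.values()
    )
    return between_ss / total_ss


# Toy count matrix: one row per cell, one column per gene.
cells = [
    [5, 0, 3, 0],
    [2, 0, 0, 1],
    [4, 1, 0, 2],
    [1, 3, 2, 0],
]
batch = ["b1", "b1", "b2", "b2"]
group = ["x", "y", "x", "y"]

rates = [detection_rate(cell) for cell in cells]
print(rates)  # prints [0.5, 0.5, 0.75, 0.75]
print(variance_explained(rates, batch), variance_explained(rates, group))  # prints 1.0 0.0
```

In this toy example the detection rate tracks the batch perfectly and the biological group not at all, which is the pattern the authors warn about.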
Note that when the confounding is small, one can use the batch as a covariate and adjust for it in the analysis. However, when there is complete confounding, the solution is a better experimental design with biological replicates. Hicks et al. also offer a design that uses biological replicates and might help reduce the impact of batch effects in some experiments. Although early on it was difficult to plan the best experimental design with biological replicates, our ability to assay a large number of cells now makes such designs feasible.
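Here is a minimal sketch of why partial confounding can be adjusted for while complete confounding cannot. It estimates the group difference within each batch and combines those differences with weights n1*n0/(n1+n0). For an additive model with one group effect and one intercept per batch, this weighted average should match the least-squares group coefficient, but treat it as a simple stand-in for a proper linear model and not as the method from the preprint. The function name and data are hypothetical.

```python
from collections import defaultdict


def batch_adjusted_difference(values, groups, batches, case, control):
    """Estimate mean(case) - mean(control) using only within-batch
    comparisons. Returns None when no batch contains both groups,
    i.e. when batch and group are completely confounded."""
    per_batch = defaultdict(lambda: {case: [], control: []})
    for value, g, b in zip(values, groups, batches):
        if g in (case, control):
            per_batch[b][g].append(value)
    weighted_sum = 0.0
    weight_total = 0.0
    for by_group in per_batch.values():
        n1, n0 = len(by_group[case]), len(by_group[control])
        if n1 == 0 or n0 == 0:
            continue  # a single-group batch says nothing about the difference
        weight = n1 * n0 / (n1 + n0)
        diff = sum(by_group[case]) / n1 - sum(by_group[control]) / n0
        weighted_sum += weight * diff
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


# Toy data: the true group effect is +2, and batch b2 adds +10 to every cell.
values = [3, 3, 3, 1, 13, 11, 11, 11]
groups = ["trt", "trt", "trt", "ctrl", "trt", "ctrl", "ctrl", "ctrl"]
batches = ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]

# Partially confounded: the naive difference of means is -3, the adjusted one is 2.
print(batch_adjusted_difference(values, groups, batches, "trt", "ctrl"))  # prints 2.0

# Completely confounded: each batch holds one group only, so nothing can be estimated.
confounded_batches = ["b1" if g == "trt" else "b2" for g in groups]
print(batch_adjusted_difference(values, groups, confounded_batches, "trt", "ctrl"))  # prints None
```

With complete confounding there is no batch in which the two groups can be compared, so no amount of statistical adjustment can recover the group effect. That is why the fix has to come from the experimental design.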