If there is one aspect of Next-Gen Sequencing (NGS) experiments that does not get enough attention, it is “good design principles”. Although a lot of computational and statistical methods are being developed for analyzing NGS data, there do not seem to be many resources on designing NGS experiments. At least, it is not openly talked about.
Granted, the technology itself is constantly evolving, and the design might differ between technologies. Still, the conversation around NGS experimental design is highly important, not just for those who conduct the experiments but also for those who analyze the data.
Just a week ago, the CoreGenomics blog published a great post on “How to do better NGS experiments?”. The post highlighted the importance of the major factors, Design, Replication, and Multiplexing, that one needs to consider in order to do a good NGS experiment.
We just wanted to highlight one really important factor that is not “explicitly” mentioned in the post: the importance of Randomization in NGS experiments. Randomization is an age-old and important concept in statistics. CoreGenomics talks about randomization indirectly, but we think it deserves equal attention. Here we give a simple example for a Next-Gen Sequencing experiment, more specifically the experimental design for RNA-seq.
Let us say you are interested in finding differences between two groups of samples, a control group and an experimental group. It is a well-designed experiment with six biological replicates in each of the two groups.
Let us say you are sequencing these samples using two lanes on one of Illumina’s HiSeq/GAIIx machines, with each lane multiplexed to sequence six samples. One can think of at least two ways to sequence the samples.
A Bad Next Gen Sequencing Experimental Design
A naive design is to put all six samples from the same group in a single lane (or to sequence them all on the same day). For example, multiplex all six control samples and sequence them in one lane, and multiplex all six experimental group samples and sequence them in the other lane. This is a bad design despite the fact that there are six biological replicates and the samples are multiplexed.
The reason it is a bad design is the same as the good old saying, “Don’t put all your eggs in one basket”. If all the “control” samples are sequenced in the same lane (or on the same day) and anything goes wrong with that lane (or that day), the whole of the control data is useless. In statistical parlance, the lane effect, a systematic bias (or batch effect), is confounded with the two experimental groups.
For example, let us say we are doing an RNA-Seq experiment and are interested in differential gene expression between the control and experimental group samples. If the lane with the “control samples” turns rogue and results in fewer reads, one can easily misconstrue that there is strong differential expression between the control and experimental groups, whereas the difference is mainly due to the “rogue lane”. With this design, we cannot separate the lane effect from differential expression.
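To make the “rogue lane” scenario concrete, here is a minimal sketch in Python. It is only an illustration under made-up assumptions: a single gene, an expected count of 1000 reads per sample, a rogue lane that yields half as many reads, Gaussian noise, and no true difference between the groups. None of these numbers come from a real experiment.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Confounded design: all controls in lane 1, all experimental samples in lane 2.
groups = ["control"] * 6 + ["experimental"] * 6
lanes = [1] * 6 + [2] * 6

LANE_YIELD = {1: 0.5, 2: 1.0}  # assumption: lane 1 is the "rogue" lane
BASE_COUNT = 1000              # assumption: same expected count in both groups

# Simulated read counts for one gene; only the lane changes the expected value.
counts = [BASE_COUNT * LANE_YIELD[lane] + rng.gauss(0, 50) for lane in lanes]

control_mean = mean(c for g, c in zip(groups, counts) if g == "control")
experimental_mean = mean(c for g, c in zip(groups, counts) if g == "experimental")
apparent_difference = experimental_mean - control_mean

print(f"control mean:        {control_mean:.0f}")
print(f"experimental mean:   {experimental_mean:.0f}")
print(f"apparent difference: {apparent_difference:.0f}")
```

The groups differ by roughly the size of the lane effect even though no group effect was simulated. Because lane and group are the same split of the samples, nothing in the data can tell the two explanations apart.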
A Good Next Gen Sequencing Experimental Design
A better approach for the above example is “not to” sequence all samples from an experimental group in a single lane, but to make sure each lane contains samples from both the control and experimental groups.
How do we do that? How do we decide which samples should go in which lane?
That is where randomization comes in. One good NGS design is to randomly pick three samples from each of the control and experimental groups and sequence them in one lane, and then sequence the remaining six samples in the second lane.
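The random assignment described above could be sketched in Python as follows. The sample names and the `randomize_to_lanes` helper are hypothetical, made up for illustration. The idea is simply to shuffle the samples within each group and then deal them out to the lanes in turn, so that every lane gets the same number of samples from each group.

```python
import random

def randomize_to_lanes(samples_by_group, n_lanes=2, seed=None):
    """Randomly split each group's samples evenly across lanes.

    Hypothetical helper: shuffles within each group, then deals samples
    to lanes round-robin so the groups stay balanced in every lane.
    """
    rng = random.Random(seed)
    lanes = {lane: [] for lane in range(1, n_lanes + 1)}
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order within the group
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes + 1].append(sample)
    return lanes

# Made-up sample names: six biological replicates per group.
samples_by_group = {
    "control": [f"ctrl_{i}" for i in range(1, 7)],
    "experimental": [f"exp_{i}" for i in range(1, 7)],
}

layout = randomize_to_lanes(samples_by_group, n_lanes=2, seed=42)
for lane, samples in layout.items():
    print(f"lane {lane}: {sorted(samples)}")
```

Passing a seed makes the layout reproducible, which is handy for recording exactly which sample went into which lane. Leaving it as `None` gives a fresh random layout on each run.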
In this design, even if one lane goes rogue, it affects the control and experimental group samples equally, and we still have one more “well behaved” lane with both groups. In statistical parlance, the lane effect and the experimental group effect are no longer confounded.
This is a really simple example that focuses only on the lane effect and how one can deal with it. In reality, since we are multiplexing six samples in a lane with bar codes, there could also be a bar code effect on sequencing. We have not addressed the bar code effect at all here.
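Here is a small Python sketch of why the balanced design helps. As before, everything in it is an assumption for illustration: one gene, an expected count of 1000 reads, a rogue lane with half the yield, Gaussian noise, and no true group difference. Comparing the groups within each lane keeps the lane effect out of the group comparison.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Balanced design: three control and three experimental samples in each lane.
design = [(group, lane)
          for lane in (1, 2)
          for group in ("control", "experimental")
          for _ in range(3)]

LANE_YIELD = {1: 0.5, 2: 1.0}  # assumption: lane 1 is the "rogue" lane
BASE_COUNT = 1000              # assumption: no true difference between groups

counts = [(group, lane, BASE_COUNT * LANE_YIELD[lane] + rng.gauss(0, 50))
          for group, lane in design]

def group_mean(group, lane):
    """Mean simulated count for one group within one lane."""
    return mean(c for g, l, c in counts if g == group and l == lane)

# Group difference estimated separately inside each lane.
within_lane_diff = {
    lane: group_mean("experimental", lane) - group_mean("control", lane)
    for lane in (1, 2)
}

# Lane difference, now estimable because both groups appear in both lanes.
lane_diff = (mean(c for _, l, c in counts if l == 2)
             - mean(c for _, l, c in counts if l == 1))

for lane, diff in within_lane_diff.items():
    print(f"lane {lane}: experimental - control = {diff:.0f}")
print(f"lane 2 - lane 1 = {lane_diff:.0f}")
```

The within-lane group differences stay close to zero (only noise), while the lane difference shows up on its own. That is what it means for the two effects to be unconfounded.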
And yes, there is at least one better design as well :)
A good paper that addressed various aspects of good practices of Next-Gen sequencing experimental design (with the focus on RNA-Seq Experimental design) is
Another paper that is worth reading is
Fundamentals of experimental design for cDNA microarrays, by Gary Churchill published in Nature Genetics.
Although this paper is on microarray technologies, not Next-Gen Sequencing technologies, the basic principles of experimental design are the same and are useful for NGS experiments as well.
[Update] Two years after this post was written, there was a lot of interesting conversation about batch effects in RNA-seq gene expression analysis, prompted by the re-analysis of an ENCODE paper looking at the differences in gene expression between human and mouse. To learn more about how batch effects may affect such analyses, see:
- Rafael Irizarry’s beautiful post at SimplyStatistics: Is it species or is it batch? They are confounded, so we can’t know
- F1000 Research publication: A reanalysis of mouse ENCODE comparative gene expression data