Nature Reviews Genetics has just published a nice review on how biological replicates in sequencing experiments can help find and correct errors.
- The role of replicates for error mitigation in next-generation sequencing by Kimberly Robasky, Nathan E. Lewis & George M. Church
More specifically, the paper delves into the sources of errors in sequencing, argues that multiple types of replicates can be used to correct stochastic sequencing errors, and shows how biological replicates can be used to measure the specificity and sensitivity of sequence variant-calling methods.
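As a quick refresher on what those two metrics mean for a variant caller, here is a minimal sketch; the counts are made up for illustration and are not from the paper:

```python
def sensitivity(tp, fn):
    # fraction of true variants the caller recovers: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # fraction of non-variant sites correctly left uncalled: TN / (TN + FP)
    return tn / (tn + fp)

# hypothetical counts from comparing calls against a truth set
tp, fn, tn, fp = 950, 50, 9900, 100
print(f"sensitivity = {sensitivity(tp, fn):.3f}")  # 0.950
print(f"specificity = {specificity(tn, fp):.3f}")  # 0.990
```

The key point of the review is that, absent a gold-standard truth set, agreement between biological replicates can stand in for "truth" when estimating these rates.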
Although using replicates is a time-tested principle, the next-gen sequencing field has been a bit slow to embrace it, owing to cost and to the fact that sequencing at higher depth already offers one type of replication. The scope of “read depth” is limited, though, and the other types of replicates, such as technical replicates, biological replicates, and cross-platform replicates, can be extremely useful.
Thanks to our ability to multiplex samples, a good experimental design can include technical replicates: sequencing the same sample multiple times. Many studies already use technical replicates and pool the data together for further analysis.
A number of studies have combined multiple sequencing technologies to improve variant calling. Remember correcting PacBio long-read errors using Illumina short reads? That is a great example of cross-platform replicates.
Biological replicates, where one prepares and sequences multiple biological samples from the same host under the same conditions, are the bread and butter of differential expression analysis in a tissue. This review focuses on using biological replicates from different tissues to correct errors in variant calling.
It looks at three whole-genome sequencing datasets from Complete Genomics for one of the Personal Genome Project participants (it is easy to guess who that individual is: yes, PGP1). The authors classify a SNP as concordant or discordant depending on whether all replicates agree. Using multiple scoring schemes for each SNP, such as read depth, gene expression score, and genomic quality score, the authors analysed the proportions of true positives and true negatives using ROC-like curves. Interestingly, using read depth as the quality score to call variants performed poorly compared with the genomic quality score and the expression score.
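The two steps above, classifying SNPs by replicate agreement and sweeping a quality-score threshold to trace a ROC-like curve, can be sketched as follows. All site names, genotypes, and scores here are hypothetical; this is an illustration of the idea, not the authors' code:

```python
# Hypothetical calls: site -> genotype reported by each of three replicates.
calls = {
    "chr1:1000": ["A/G", "A/G", "A/G"],   # all replicates agree -> concordant
    "chr1:2000": ["C/T", "C/C", "C/T"],   # replicates disagree  -> discordant
    "chr2:3000": ["G/G", "G/G", "G/G"],
}

def classify(replicate_calls):
    """Concordant if every replicate reports the same genotype, else discordant."""
    return "concordant" if len(set(replicate_calls)) == 1 else "discordant"

# Hypothetical SNP table: (concordant_across_replicates, quality_score).
snps = [
    (True, 42.0), (True, 38.5), (False, 12.0), (True, 55.1),
    (False, 30.2), (True, 47.9), (False, 8.7), (True, 25.0),
]

def roc_points(snps):
    """Treat concordance as the positive label and sweep the score threshold
    from high to low, recording (false-positive rate, true-positive rate)."""
    pos = sum(1 for concordant, _ in snps if concordant)
    neg = len(snps) - pos
    points = []
    tp = fp = 0
    for concordant, _score in sorted(snps, key=lambda x: -x[1]):
        if concordant:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

for site, genotypes in calls.items():
    print(site, classify(genotypes))
print(roc_points(snps))
```

A score that ranks concordant SNPs above discordant ones hugs the top-left of the curve; this is how one can compare read depth against the genomic quality and expression scores without a gold-standard truth set.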
You may already know this, but not all sequencing errors can be solved by using replicates. We definitely need to look elsewhere for help in correcting errors from incomplete reference genomes, insertions, deletions, gene families, repeats, and batch effects :)
Although the paper hints at good experimental design (buried deep in the supplements), it does not go further. A good, balanced experimental design matters a lot, especially when dealing with replicates in a next-gen sequencing experiment. We wrote a post on the “use of randomization in next-gen sequencing experimental design” a while back, and it may be of interest here.
The paper nicely catalogues the experimental sources of errors and the publications that (possibly) found those errors first. Although most NGS practitioners are aware of the possible sequencing errors, it is nice to see all the documented errors in one place with attributions. Just as one can imagine, the list below shows that errors can creep in at every step of going from preparing samples to handling fastq files.
Just remember the NGS Murphy’s Law:
Anything that can go wrong in sequencing has gone wrong already
But, there is hope :)
Here are the three broad categories of sequencing errors and how things have gone wrong in each of them.
Sources of Sequencing Errors in Sample Preparation
- User errors; for example, mislabelling
- Degradation of DNA and/or RNA from preservation methods; for example, tissue autolysis, nucleic acid degradation and crosslinking during the preparation of formalin-fixed, paraffin-embedded (FFPE) tissues
- Identification of high-confidence somatic mutations in whole genome sequence of formalin-fixed breast cancer specimens. Nucleic Acids Research, 2012
- Effect of duration of fixation on quantitative reverse transcription polymerase chain reaction analyses
- Targeted high throughput sequencing in clinical cancer settings: formaldehyde fixed-paraffin embedded (FFPE) tumor tissues, input amount and tumor heterogeneity
- Alien sequence contamination; for example, those of mycoplasma and xenograft hosts
- Low DNA input
Sources of Sequencing Errors from Library Preparation
- User errors; for example, carry-over of DNA from one sample to the next and contamination from previous reactions
- PCR amplification errors
- Primer biases; for example, binding bias, methylation bias, biases that result from mispriming, nonspecific binding and the formation of primer dimers, hairpins and interfering pairs, and biases that are introduced by having a melting temperature that is too high or too low
- 3′ end capture bias that is introduced during poly(A) enrichment in high-throughput RNA sequencing
- Private mutations; for example, those introduced by repeat regions and mispriming over private variation
- Machine failure; for example, incorrect PCR cycling temperatures
- Chimeric reads
- Barcode and/or adaptor errors; for example, adaptor contamination, lack of barcode diversity and incompatible barcodes
Sources of Sequencing Errors from Sequencing and Imaging
- User errors; for example, cluster crosstalk caused by overloading the flow cell
- Dephasing; for example, incomplete extension and addition of multiple nucleotides instead of a single nucleotide
- ‘Dead’ fluorophores, damaged nucleotides and overlapping signals
- Sequence context; for example, GC richness, homologous and low-complexity regions, and homopolymers
- Machine failure; for example, failure of laser, hard drive, software and fluidics
- Strand biases