Alignment and calling variants are probably the two most common tasks in Next Gen Sequencing analysis. Both are challenging as we typically work with short sequence reads and complex genomes like human. Thats why there are over 60 short read aligners and numerous pipelines for calling variants like SNPs and indels. Although there are many tools, a good evaluation of multiple tools for variant calling is still missing.
A team led by Gholson Lyon from CSHL systematically evaluate the commonly used bioinformatics pipelines for calling variants from exome and whole genome sequence data and ask really simple question “How well do the results from multiple pipelines agree?” And they published their results in a paper titled Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Here is a quick summary of the paper.
The team sequenced 15 exomes from four families using Illumina HiSeq 200 with about ~120 X coverage and tested five commonly used alignment and variant calling pipelines;
Among these five pipelines three (GATK, SOAP, and Samtools) can also call indels. In addition to the 15 exomes, the team had also sequenced the whole genome of an individual using Complete Genomics. They also validated a set SNPs and indels by amplicon sequencing in MiSeq.
The gist of the findings is that there is very poor concordance of variants called from these five pipelines. Across all the samples there was only about 57% of the SNPs called were concordant in the five pipelines used. Each pipeline came up with its own unique SNPs with varying degree; 0.5% to 5.1%. As one expects, SNP calling fares better than indel calling. Only about 26% of the indels called agreed with each other in the three indel pipelines.
In addition, a comparison of variants from a Illumina exome sequence sample to exomic regions from whole genome sequencing from Complete Genomics (CG) show that a large number SNPs inferred from CG data is missing in the Illumina data and the SNPs that are present in the Illumina sample have low sequence coverage.
So which is the best “variant calling pipeline”? This a loaded question, under the “default” conditions for Illumina sequencing, 97% of GATK-only SNPs and 54% of GATK-only indels could be validated using MiSeq sequencing. In comparison, only about 60% of SOAP-only SNPs and 44% of SOAP-only indels could be validated. This suggests that GATK is superior.
This paper nicely highlights the problems with the current state of “calling variants”, which are both at the technology and pipelines that are available to call variants. Stressing the need for change, the authors came with multiple recommendations to improve variant calling. They include
- discussion on variant quality and the case for multiple methods for analyzing personal genomes
- standardization of indel callings
- need for studying large families for discovering variations
- need for multiple platforms to comprehensively look at exomes
They are all valid recommendations that can improve the current state of variant discovery.
Here are a few thoughts on this paper and reproducibility of variant calling methods in general.
The authors have presented “pooled” results from comparing 15 across the five different pipelines. This basically gives an average picture of “pipeline-specific” and “pipeline-shared” variants. Although it helped to see how discordant the results from multiple pipelines are, it will be interesting to see how each exome data fared under the five pipelines (May be it is in the Supp. Section and we missed it). The detailed supplementary section provides data on exome capture for each sample and it shows that there is lot variation between samples. For example, total number of reads mapped varies from 61M to 94M; mean depth of the target region varies from 100 to 150.
Comparing samples independently across multiple pipelines might give hint on how these pipeline specific and shared variants vary from sample to sample. And looking at the outliers might help understanding the strength and weakness of a pipeline.
On a more general note, the trouble with the “default” in a pipeline is that, it is tuned for one kind of analysis. And often, there is no way one can tune all the parameters to infer the best parameters. In addition to the suggestions in this paper, another work around is that variant calling methods (and the way we use) need to get more quantitative/statistical. This will not solve all the problems, but will definitely help.
Reproducibility is a common problem in next-gen sequencing analysis. For example, just like multiple variant calling algorithms, there are multiple peak-calling algorithms for ChIP-Seq analysis. And ChIP-Seq also faces poor reproducibility across multiple pipelines for the same sample. Recently, Peter Bickel’s group at Berkeley developed Irreproducibility Discovery Rate (IDR) for measuring reproducibility of high-throughput experiments (manuscript at arXiv). IDR is similar to FDR, but gives us a handle on reproducibility. ENCODE project successfully adapted IDR for its ChIP-Seq analysis.
It may not be trivial, but should be possible to extend or develop “IDR-like” approaches for assessing reproducibility in variant calling. It is possibly more challenging in the variant calling setting as the SNP calling and Indel calling are often inter-related. Such quantitative masure of reproducibility will readily allow us to compare multiple variant discovery procedures/platforms and a handle on the reliability of the results.