Nature Genetics just published a really interesting paper from the Gerton Lunter group (and the McVean group) at the University of Oxford. It describes a variant calling approach that integrates ideas from mapping, assembly, and haplotypes. I can hear you saying “Yeah, right. Yet another variant caller?” Believe me, this one looks different. The paper looks really interesting, and I want to read the whole thing.
Here is a really short summary of what the paper is about.
The most common approach, used by older versions of GATK [Check the comments below from GATK on the benefits of the new GATK Haplotype caller that also does local re-assembly like ], is to call variants by aligning reads to a reference genome and finding locations where nucleotides differ from the reference base. This approach has served us well: it has high sensitivity; it can use most of the human genome, including repetitive regions; it exploits information in paired-end reads; and it does not need crazy computing resources.
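To make the alignment-based idea concrete, here is a minimal toy sketch in Python. It is not how GATK or any real caller works: the function name `call_snps`, the thresholds, and the assumption that reads are gapless alignments given as `(start, sequence)` pairs are all made up for illustration.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=3, min_alt_fraction=0.2):
    """Report positions where enough reads disagree with the reference.

    aligned_reads is a list of (start, sequence) pairs. Gapless alignment
    is an assumption of this toy example; real callers handle indels,
    base qualities and mapping qualities.
    """
    # Count the bases observed at every reference position (a "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        for base, n in counts.most_common():
            if base != reference[pos] and n / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls

reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
calls = call_snps(reference, reads)
print(calls)
```

Each call is a tuple of (position, reference base, alternate base, supporting reads, depth). Note that every position is judged on its own, which is exactly the per-site view that the rest of this post contrasts with haplotype-level calling.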
One weakness is that alignment-based approaches focus on a single variant type, such as SNPs or indels. This can cause errors around indels and larger variants. They are also prone to false positives in highly diverged regions. They also rely mainly on alignment accuracy at the nucleotide level, and the realignment around indels that is used to improve accuracy can be costly. Multi-sample variant calling helps by borrowing information between samples to call variants that do not look reliable in a single sample.
Alternative variant calling approaches use reference-free sequence assembly, building a de Bruijn graph to find evidence of polymorphisms. Such approaches work at the level of local haplotypes rather than individual variants and do well in highly divergent regions. However, they have huge computational requirements. They also have lower sensitivity than alignment-based approaches and are limited by repetitive sequence, as contiguity information is lost when the reads are broken up into their consecutive k-mers during graph construction.
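The loss of contiguity is easy to see in a toy example. The sketch below is a generic textbook-style de Bruijn construction, not code from any particular assembler; the function names `kmers` and `de_bruijn` and the example read are assumptions made for illustration.

```python
from collections import defaultdict

def kmers(read, k):
    """Return the consecutive k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy read in which "ACG" occurs twice, followed by a different base each time.
read = "TTACGGAACGCC"
graph = de_bruijn([read], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Both copies of `ACG` collapse into a single node with two outgoing edges (`CGG` and `CGC`). The graph no longer records which continuation followed which copy, even though the original read did, and that is the contiguity information that repeats destroy.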
The Nature Genetics paper presents a new approach that integrates local sequence assembly and haplotype-based, multi-sample variant calling within a single Bayesian statistical framework. It is implemented in the software Platypus. Platypus takes mapped and sorted BAM files as input and generates candidate variants from read alignments, local assembly and external sources. Platypus can identify SNPs, MNPs and short indels smaller than the read length, as well as larger indels: deletions of up to several kb and insertions of up to maybe 200 bp.
First, Platypus generates candidate variants from the read alignments, from local assembly and from external sources. The local assembler looks at a small window (a few kb) at a time and uses all the reads in the window, together with their mates, to build a colored de Bruijn graph. Candidate alleles are generated by finding all unique paths in the graph with a depth-first traversal. Unlike other assemblers, Platypus is tuned for high sensitivity and returns an exhaustive list of paths. Candidate haplotypes are generated by clustering the candidate alleles across windows. Haplotype frequencies are then estimated with an expectation-maximization (EM) algorithm, and variants are called using the estimated haplotype frequencies. A lot of interesting details on how it works are hidden in the supplementary methods section.
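Two of these steps can be sketched in a few lines of Python. This is only a rough illustration under strong assumptions, not Platypus's actual code: the graph is a plain (uncolored) de Bruijn graph, and the EM step treats reads as a simple mixture over haplotypes, whereas the paper's model works with diploid genotypes across samples. All names (`build_graph`, `enumerate_paths`, `spell`, `em_frequencies`) and the toy likelihood values are hypothetical.

```python
from collections import defaultdict

def build_graph(sequences, k):
    """Plain de Bruijn graph: (k-1)-mer nodes linked by observed k-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            graph[seq[i:i + k - 1]].add(seq[i + 1:i + k])
    return graph

def enumerate_paths(graph, source, sink, max_len=50):
    """Yield every cycle-free path from source to sink, depth first."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            yield path
            continue
        if len(path) >= max_len:
            continue  # give up on overly long paths
        for nxt in sorted(graph.get(node, ()), reverse=True):
            if nxt not in path:  # skip cycles so the traversal terminates
                stack.append((nxt, path + [nxt]))

def spell(path):
    """Turn a path of overlapping (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])

def em_frequencies(likelihoods, n_iter=100):
    """Estimate haplotype frequencies from per-read likelihoods.

    likelihoods[r][h] is P(read r | haplotype h). Simplified mixture
    model, used here only to show the E-step / M-step structure.
    """
    n_hap = len(likelihoods[0])
    freqs = [1.0 / n_hap] * n_hap
    for _ in range(n_iter):
        totals = [0.0] * n_hap
        for row in likelihoods:
            # E-step: probability that this read came from each haplotype.
            weights = [f * l for f, l in zip(freqs, row)]
            norm = sum(weights)
            for h, w in enumerate(weights):
                totals[h] += w / norm
        # M-step: new frequency is the average responsibility.
        freqs = [t / len(likelihoods) for t in totals]
    return freqs

# Two toy sequences over the same window, differing by one SNP.
graph = build_graph(["GATTACAGC", "GATTGCAGC"], k=4)
alleles = sorted(spell(p) for p in enumerate_paths(graph, "GAT", "AGC"))
print(alleles)

# Three reads fit the first haplotype well, one read fits the second.
likelihoods = [[0.9, 0.1]] * 3 + [[0.1, 0.9]]
freqs = em_frequencies(likelihoods)
print([round(f, 3) for f in freqs])
```

The SNP shows up as a bubble in the graph, and the traversal returns both sides of it as candidate sequences. The EM loop then alternates between assigning reads to haplotypes in proportion to the current frequencies and re-estimating the frequencies from those assignments.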
The paper goes on to show how this approach is useful in four different variant calling applications:
- calling variation from whole-genome data
- calling SNPs and indels from whole-exome data
- de novo mutations in parent-offspring trios
- genotyping HLA loci
The paper also shows that integrating the approaches yields high sensitivity and specificity in several clinically relevant experimental designs, and that it is also an order of magnitude faster.