Platypus: A New Variant Caller that Integrates Mapping, Assembly and Haplotype-based approaches

Nature Genetics just published a really interesting paper from the Gerton Lunter group (and the McVean group) at the University of Oxford. It describes a variant calling approach that integrates ideas from mapping, assembly, and haplotypes. I can hear you saying, “Yeah, right. Yet another variant caller?” Believe me, this one looks different. The paper looks really interesting, and I want to read the whole thing.

Here is a really short summary of what the paper is about.

The most common approach, used by older versions of GATK [check the comments below from GATK on the benefits of the new GATK HaplotypeCaller, which also does local re-assembly], is to call variants by aligning reads to a reference genome and finding locations where nucleotides differ from the reference base. This approach has served us well: it has high sensitivity; it uses most of the human genome, including repetitive regions; it exploits information in paired-end reads; and it does not need crazy computing resources.
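To make the alignment-based idea concrete, here is a toy sketch in Python. It is not how GATK or any real caller works: it assumes gap-free reads that are already placed on the reference, ignores base and mapping qualities, and the function name and thresholds (`call_snps`, `min_depth`, `min_alt_fraction`) are made up for illustration.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=3, min_alt_fraction=0.3):
    """Toy alignment-based SNP caller (illustrative assumptions only).

    aligned_reads is a list of (start, sequence) pairs: gap-free reads
    already placed on the reference at a 0-based start position.
    """
    # Count the bases observed at every reference position (a "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    # Report positions where a non-reference base is common enough.
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls

reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (3, "TATGTA"), (4, "ATGTAC")]
snp_calls = call_snps(reference, reads)
print(snp_calls)  # one call at position 5: reference C, alternate T
```

Each call is a tuple of (position, reference base, alternate base, alternate count, depth). Note that the sketch only ever looks at one position at a time, which is exactly the single-variant focus discussed next.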


One weakness is that alignment-based approaches focus on a single variant type, like SNPs or indels. This can cause errors around indels and larger variants. These approaches are also prone to high false-positive rates in highly diverged regions. They also rely mainly on alignment accuracy at the nucleotide level, and the realignment around indels that improves accuracy can be costly. Multi-sample variant calling helps by borrowing information between samples to call variants that do not look reliable in a single sample.

Alternative variant calling approaches use reference-free sequence assembly, building a de Bruijn graph to find evidence of polymorphisms. Such approaches work at the local haplotype level rather than at the level of individual variants, and do well in highly divergent regions. However, they have huge computational requirements. They also have lower sensitivity than alignment-based approaches and are limited by repetitive sequence, as contiguity information is lost when the reads are broken up into their consecutive k-mers during graph construction.
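A small Python sketch shows why repeats are a problem for this kind of graph. It is a bare-bones de Bruijn construction under my own assumptions (error-free reads, nodes are (k-1)-mers, no coverage counts), and the helper names are invented for this post rather than taken from any assembler.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the consecutive k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branch_nodes(graph):
    """Nodes with more than one successor, where the graph forks."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Two sequences that differ by a single SNP (G vs T): the graph forks at the SNP.
snp_graph = build_de_bruijn(["AACCGTTA", "AACCTTTA"], k=4)

# One sequence with no variation at all, but "ACG" occurs twice: the graph forks too.
repeat_graph = build_de_bruijn(["TACGAACGT"], k=4)

print(branch_nodes(snp_graph))     # ['ACC']
print(branch_nodes(repeat_graph))  # ['ACG']
```

Both graphs contain a fork, but only the first one reflects a real polymorphism. Once the reads are chopped into k-mers, the graph alone cannot tell which copy of the repeat a given k-mer came from.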


Platypus: A New Multi-sample Variant Caller that Integrates Mapping, Assembly and Haplotype-based approaches (Image: Nature Genetics)

The Nature Genetics paper presents a new approach that integrates local sequence assembly and haplotype-based, multi-sample variant calling within a single Bayesian statistical framework, implemented as the software Platypus. Platypus takes mapped and sorted BAM files as input and calls candidate variants from read alignments, local assembly, and external sources. Platypus can identify SNPs, MNPs, and short indels of size less than the read length, as well as larger indels: deletions up to several kb and insertions of maybe 200 bp.

First, Platypus generates candidate variants from the read alignments, from local assembly, and from external sources. The local assembler looks at a small window (~few kb) at a time and uses all the reads in the window and their mates to generate a colored de Bruijn graph. Candidate alleles are generated by finding all unique paths in the graph with a depth-first traversal. Platypus is tuned for high sensitivity and, unlike other assemblers, returns an exhaustive list of paths. Candidate haplotypes are generated by clustering the candidate alleles across windows. Haplotype frequencies are estimated by EM magic (expectation-maximization). Variants are then called using the estimated haplotype frequencies. A lot of interesting details on how it works are hidden in the supplementary methods section.
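The steps above can be sketched roughly in Python. This is my own simplified illustration, not the Platypus implementation: the graph is uncolored, the clustering step is skipped (each path is treated directly as a haplotype), reads are scored by exact substring match with a made-up `error` value, and the EM is a plain mixture-proportion EM rather than the paper's Bayesian model. All function names are hypothetical.

```python
from collections import defaultdict

def build_graph(sequences, k):
    """De Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def enumerate_paths(graph, source, sink):
    """Depth-first enumeration of every simple path from source to sink."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # never revisit a node, so cycles cannot loop forever
                stack.append((nxt, path + [nxt]))
    return paths

def spell(path):
    """Turn a path of overlapping (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])

def candidate_haplotypes(reference, reads, k):
    """All sequences spelled by paths between the ends of the reference window."""
    graph = build_graph([reference] + reads, k)
    source, sink = reference[:k - 1], reference[-(k - 1):]
    return sorted(spell(p) for p in enumerate_paths(graph, source, sink))

def em_haplotype_frequencies(haplotypes, reads, error=0.01, iterations=50):
    """Estimate haplotype frequencies with a simple mixture-proportion EM."""
    # Toy likelihood: high if the read matches the haplotype exactly, low otherwise.
    lik = [[1.0 - error if r in h else error for h in haplotypes] for r in reads]
    freqs = [1.0 / len(haplotypes)] * len(haplotypes)
    for _ in range(iterations):
        totals = [0.0] * len(haplotypes)
        for row in lik:
            # E-step: probability that this read came from each haplotype.
            weights = [f * l for f, l in zip(freqs, row)]
            norm = sum(weights)
            for j, w in enumerate(weights):
                totals[j] += w / norm
        # M-step: new frequency is the average responsibility over reads.
        freqs = [t / len(reads) for t in totals]
    return freqs

reference = "AACCGTTA"
reads = ["AACCTTT", "CCTTTA", "CCGTTA", "CCGTT", "ACCGT"]
haps = candidate_haplotypes(reference, reads, k=4)
freqs = em_haplotype_frequencies(haps, reads)
print(haps)   # ['AACCGTTA', 'AACCTTTA']
print(freqs)  # the reference haplotype gets the larger share (3 of 5 reads support it)
```

Enumerating every path is what makes the approach exhaustive, and it is only affordable because the window is small. In a real caller the likelihoods would come from aligning each read to each haplotype with base qualities, not from an exact-match test.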

The paper goes on to show how this approach is useful in four different variant calling scenarios.

  1. calling variation from whole-genome data
  2. calling SNPs and indels from whole-exome data
  3. de novo mutations in parent-offspring trios
  4. genotyping HLA loci

The paper also shows that integrating the approaches yields high sensitivity and specificity in several clinically relevant experimental designs, and that it is also an order of magnitude faster.

 

Comments

  1. Interesting article, thank you — that’s a great summary of the method. A quick correction however: when you say “The most common approach, like GATK”, that actually refers to the UnifiedGenotyper, which was the GATK’s *old* variant caller. The UG has been (mostly) deprecated in favor of the HaplotypeCaller, which uses local re-assembly and re-alignment to determine haplotypes. As far as I can tell the methods are very similar (see https://www.dropbox.com/s/yywmm779sahrccr/GATKwr4-BP-4-Variant_calling_genotyping.pdf for an overview of the HC method), although we are moving away from multi-sample calling due to the computational requirements for large cohorts, and the so-called “N+1 Problem”. See https://www.dropbox.com/s/6evtqqa25h2p14h/GATKwr4-X-3-Analyzing%20cohorts.pdf for an overview of the workflow we came up with to solve that problem.

  2. nextgenseek says:

    Thanks for the comment and for correcting the way I referred to GATK. I wrote the post in all the excitement after quickly reading the Nature Genetics paper. I will add/correct/change the usage soon.

  3. Don’t want to be THAT guy who goes around correcting spelling mistakes, but this is rather a large one that as an Australian I pounced on… it’s platypus, not playtpus!

  4. nextgenseek says:

    this is embarrassing. :( thanks for pointing that out :) I think it is corrected.
