Illumina’s long-read sequencing technology, Moleculo, has been put to work haplotyping a whole human genome. In a recent paper published in Nature Biotechnology, a Stanford team led by Mike Snyder (the other authors include the two Moleculo founders) demonstrates the approach.
- Whole-genome haplotyping using long reads and statistical methods, Volodymyr Kuleshov, Dan Xie, Rui Chen, Dmitry Pushkarev, Zhihai Ma, Tim Blauwkamp, Michael Kertesz & Michael Snyder, Nature Biotechnology
Haplotyping/Phasing in Clinical Genomics
Humans are diploid: we have two copies of every autosomal chromosome, one maternal and one paternal. Each copy can carry its own genetic variants. Current short-read sequencing can identify a heterozygous SNP, but it cannot tell whether the variant allele came from mom or dad. To find that out, we need either sequence data from the parents or more complex experimental approaches. Determining which parental chromosome each allele (and thus the sequence containing it) comes from is called haplotyping or phasing.
Haplotyping/phasing can be extremely useful in clinical genomics. For example, suppose two specific mutations in a gene are known to cause a disease. Phased variants immediately tell us whether the two variants in a patient’s sample sit on the same copy of the gene or on different copies. The two scenarios can have very different consequences: with both variants on one copy, the other copy stays intact, whereas with one variant on each copy, neither copy is unaffected. In addition, phasing information helps us understand allele specificity in methylation, binding, and gene expression.
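To make the two-variant example concrete, here is a minimal sketch in Python. It assumes the phased genotypes are written VCF-style as "0|1" or "1|0" (0 is the reference allele, the left and right sides are the two chromosome copies) and that both variants belong to the same phased block. The function name phase_relation is made up for illustration and is not part of Prism.

```python
def phase_relation(gt_a: str, gt_b: str) -> str:
    """Classify two phased genotypes (VCF-style 'a|b') as cis or trans.

    Assumes both genotypes come from the same phased block; otherwise the
    left/right sides of the two genotypes are not comparable.
    """
    a = gt_a.split("|")
    b = gt_b.split("|")
    if len(a) != 2 or len(b) != 2:
        # Unphased genotypes such as '0/1' end up here.
        raise ValueError("expected phased diploid genotypes such as '0|1'")

    carries_a = any(allele != "0" for allele in a)
    carries_b = any(allele != "0" for allele in b)
    if not (carries_a and carries_b):
        return "not both present"

    # Copy i carries both variants if both alleles at index i are non-reference.
    same_copy = any(a[i] != "0" and b[i] != "0" for i in range(2))
    return "cis" if same_copy else "trans"


if __name__ == "__main__":
    print(phase_relation("0|1", "0|1"))  # both variants on the second copy
    print(phase_relation("0|1", "1|0"))  # one variant on each copy
```

Without phasing, both calls above would look the same to a short-read variant caller: two heterozygous SNPs in one gene.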
SLRH: Statistically aided, Long-Read Haplotyping
The Nature Biotechnology paper describes the application of Moleculo sequencing to human genome haplotyping. The approach is named SLRH: Statistically aided, Long-Read Haplotyping. SLRH uses 30 Gbp of Moleculo long-read sequencing in addition to standard 50X whole-genome data to haplotype a human genome.
The paper also describes Prism, an algorithm that augments long-fragment haplotyping with statistical techniques: dynamic programming for local phasing and a hidden Markov model (HMM) for global phasing. Prism starts with short fragments and produces haplotype blocks whose size and quality equal or exceed those from existing haplotyping technologies.
The paper shows that SLRH phases 99% of SNPs in three human genomes into long haplotype blocks of size 0.2–1 Mbp. Further, the paper illustrates the use of haplotyping information to determine allele-specific methylation patterns at base-resolution in a human genome.
How Does Moleculo’s SLRH Work?
The SLRH technology has two parts. The experimental part covers library prep and sequencing. The computational part stitches the short reads into long reads and phases them using Prism.
The library prep step involves
- shearing DNA into fragments of 10 kbp size
- diluting the fragments and placing them into 384 wells such that about 3,000 fragments are in a single well
- amplifying fragments in each well by long-range PCR and cutting into short fragments and barcoding the fragments
- pooling the barcoded fragments from each well together to sequence them all
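A quick back-of-the-envelope calculation shows why the dilution step matters. The sketch below uses the numbers from the list above (10 kbp fragments, 3,000 fragments per well, 384 wells). The haploid genome size of about 3.1 Gbp and the Poisson model for fragment placement are assumptions added here for illustration; they are not taken from the paper.

```python
import math

FRAGMENT_BP = 10_000          # from the protocol described above
FRAGMENTS_PER_WELL = 3_000    # from the protocol described above
WELLS = 384                   # from the protocol described above
HAPLOID_GENOME_BP = 3.1e9     # assumption: approximate human haploid genome size


def dilution_stats():
    """Rough DNA amounts per well and the chance of mixing both haplotypes."""
    dna_per_well = FRAGMENT_BP * FRAGMENTS_PER_WELL
    dna_total = dna_per_well * WELLS

    # Fragments come from both chromosome copies, so each haplotype
    # contributes about half of the DNA in a well.
    per_haplotype_coverage = (dna_per_well / 2) / HAPLOID_GENOME_BP

    # Assumption: fragments land uniformly at random (Poisson model).
    # Given that one haplotype covers a locus in this well, this is the
    # chance that the other haplotype covers the same locus too.
    p_other_haplotype = 1 - math.exp(-per_haplotype_coverage)

    return {
        "dna_per_well_bp": dna_per_well,
        "dna_total_bp": dna_total,
        "per_haplotype_coverage": per_haplotype_coverage,
        "p_other_haplotype": p_other_haplotype,
    }


if __name__ == "__main__":
    for name, value in dilution_stats().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions each well holds about 30 Mbp of DNA, roughly 1% of a haploid genome, so reads from one well that cover the same locus almost always come from a single chromosome copy. That is what lets the barcode stand in for haplotype identity.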
After sequencing, the computational steps involve
- separating the reads from each well based on the attached barcodes and grouping them into fragments
- assembling the fragments at their overlapping heterozygous SNVs into haplotype blocks
- phasing the blocks statistically based on a phased reference panel and producing long haplotype contigs
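The first two computational steps can be sketched with a toy example in Python. This is not Prism, which according to the paper uses dynamic programming and an HMM; it is a simple greedy stand-in to show the idea. The data layout is an assumption: reads are (barcode, read_id) pairs, and each fragment is a dict mapping a heterozygous SNV position to 0 or 1, the allele seen on that fragment. The third step, statistical phasing against a reference panel, is not shown.

```python
from collections import defaultdict


def group_by_well(reads):
    """Group read IDs by the well barcode attached to each read."""
    wells = defaultdict(list)
    for barcode, read_id in reads:
        wells[barcode].append(read_id)
    return dict(wells)


def flip(fragment):
    """Return the complementary haplotype of a fragment (swap 0 and 1)."""
    return {pos: 1 - allele for pos, allele in fragment.items()}


def merge_fragments(fragments):
    """Greedily join fragments that share heterozygous SNVs into blocks.

    A fragment that mostly disagrees with the block at the shared sites is
    assumed to come from the other chromosome copy and is flipped first.
    """
    blocks = []
    pending = [dict(f) for f in fragments]
    while pending:
        block = pending.pop(0)
        grew = True
        while grew:
            grew = False
            for frag in list(pending):
                shared = block.keys() & frag.keys()
                if not shared:
                    continue
                agree = sum(block[p] == frag[p] for p in shared)
                # Majority vote on orientation tolerates a few point errors.
                oriented = frag if 2 * agree >= len(shared) else flip(frag)
                for pos, allele in oriented.items():
                    block.setdefault(pos, allele)
                pending.remove(frag)
                grew = True
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    reads = [("AAC", "r1"), ("GGT", "r2"), ("AAC", "r3")]
    print(group_by_well(reads))

    fragments = [{100: 0, 200: 1}, {200: 0, 300: 0}, {900: 1}]
    print(merge_fragments(fragments))
```

In the example, the second fragment disagrees with the first at position 200, so it is treated as coming from the other chromosome copy and flipped before joining. The fragment at position 900 shares no SNV with the others and stays in its own block, which is the kind of gap the statistical step is meant to bridge.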
The SLRH library prep protocol is similar to, but much improved over, LR-Seq, an earlier version used by Moleculo to sequence B. schlosseri (published in eLife last year). One of the biggest concerns with Moleculo long-read technology is bias from PCR amplification. The paper claims that PCR artifacts “mainly introduce point errors at individual variants”, and that these errors do not significantly affect the long-range phase information.
Comparing Moleculo haplotyping to other existing technologies, the paper shows that Moleculo needs a much smaller amount of data (in addition to standard WGS) than the other approaches. With 30Gb-60Gb of data, SLRH could phase 99% of SNPs, while the other approaches need 110Gb to phase 94% of SNPs (Kaper et al.) or 203Gb-409Gb to phase 97% of SNPs.
However, library prep time with Moleculo could be a bottleneck. In the comparison with three other approaches, SLRH currently needs 2 days (6 hours hands-on) of library prep time, while the approach by Kaper et al. and Complete Genomics’ LFR each need just one day. The approach by Kitzman et al. needs 7 days.
I hope to write another post soon going over the details of the Prism algorithm for phasing/haplotyping.
Data and Code Access
Want to play with the data from the publication? Check it out at SRA: SRP036864. The Prism algorithm, which does the phasing with Moleculo reads, is freely available under the Illumina open source license (https://github.com/sequencing/licenses). Prism is written in Python and C and can be installed from (also from the GitHub page of Prism)
tar -zxvf prism.tar.gz
python setup.py install
Making the software and data publicly available is great. However, one also wonders why the paper is not #openaccess, given the importance of the Moleculo long read technology, its possible applications and the fact that Illumina is offering it as a service.
An earlier version of this work was presented as a poster at AGBT 2013, and the posters are available from here: