Moleculo Long Reads in Action for Haplotyping a Whole Human Genome

Illumina’s long-read sequencing technology, Moleculo, is in action for haplotyping a whole human genome. In a recent paper published in Nature Biotechnology, a Stanford team led by Mike Snyder (other authors include the two Moleculo founders) demonstrated its use for phasing a whole human genome.

Haplotyping/Phasing in Clinical Genomics

Humans are diploid: we have two copies of every autosomal chromosome, one maternal and one paternal. Both the maternal and the paternal chromosome can carry their own genetic variations. If there are SNPs on the maternal and paternal chromosomes, current short-read sequencing can identify the SNPs, but it cannot tell whether a given SNP came from mom or dad. To do that, we either need the parents’ sequencing information or some complex experimental approach. Identifying the origin of an allele (and thus of the sequence containing it), whether it is from mom or dad, is called haplotyping or phasing.

Haplotyping/Phasing-Example

Haplotyping/phasing can be extremely useful in clinical genomics. For example, suppose we know that two specific mutations in a gene cause a disease. Phased variants immediately tell us whether the two variants in a patient’s sample sit on the same allele or on different alleles. The two scenarios can have different consequences: when a single allele carries both variants, the other copy of the gene is untouched, whereas when each allele carries one variant, both copies are affected. In addition, phasing information lets one study allele specificity in methylation, binding, and gene expression.
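To make the same-allele versus different-allele distinction concrete, here is a minimal sketch that classifies two heterozygous variants from their phased genotypes. It assumes VCF-style phased genotype strings such as "0|1" (left and right of the bar are the two haplotypes, 0 is reference, 1 is the variant) and that both genotypes belong to the same phase block. The function name `phase_relationship` is made up for this illustration and is not part of Prism or any other tool.

```python
def phase_relationship(gt_a: str, gt_b: str) -> str:
    """Classify two heterozygous variants as 'cis' (same allele) or 'trans'.

    Assumes phased, VCF-style genotypes ("0|1" or "1|0") from the same
    phase block. An unphased genotype such as "0/1" cannot be classified.
    """
    haplotypes = []
    for gt in (gt_a, gt_b):
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} is not phased")
        alleles = gt.split("|")
        if sorted(alleles) != ["0", "1"]:
            raise ValueError(f"genotype {gt!r} is not heterozygous")
        haplotypes.append(alleles)
    # Same pattern: both variants ride on the same haplotype (cis).
    # Opposite pattern: one variant on each haplotype (trans).
    return "cis" if haplotypes[0] == haplotypes[1] else "trans"


print(phase_relationship("0|1", "0|1"))
print(phase_relationship("0|1", "1|0"))
```

Without the phase separator, the two calls above would be indistinguishable, which is exactly the information short reads alone fail to provide.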

A few phasing possibilities

SLRH: Statistically aided, Long-Read Haplotyping

The Nature Biotechnology paper describes the application of Moleculo sequencing to human genome haplotyping. The approach is named SLRH: Statistically aided, Long-Read Haplotyping. SLRH uses 30 Gbp of sequencing from Moleculo’s long-read technology, in addition to standard 50X-coverage whole-genome data, to haplotype a human genome.

The paper also describes Prism, an algorithm that augments long-fragment haplotyping with statistical techniques: it uses dynamic programming for local phasing and a hidden Markov model (HMM) for global phasing. Prism starts with short fragments and produces long haplotype blocks of equal or greater quality than the ones from existing haplotyping technologies.

The paper shows that SLRH phases 99% of SNPs in three human genomes into long haplotype blocks of size 0.2–1 Mbp. Further, the paper illustrates the use of haplotyping information to determine allele-specific methylation patterns at base-resolution in a human genome.

How Does Moleculo’s SLRH Work?

The SLRH technology has two parts: an experimental part (library prep and sequencing) and a computational part (stitching the short reads together to create long reads, then phasing with Prism).

The library prep step involves

  • shearing DNA into fragments of 10 kbp size
  • diluting the fragments and placing them into 384 wells such that about 3,000 fragments are in a single well
  • amplifying fragments in each well by long-range PCR and cutting into short fragments and barcoding the fragments
  • pooling the barcoded fragments from each well together to sequence them all
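Why dilute to a few thousand fragments per well? A back-of-the-envelope calculation shows that each well then holds only a small slice of the genome, so two fragments in the same well rarely cover the same locus from different haplotypes. The sketch below uses the numbers from the list above; the 3.2 Gbp haploid genome size and the uniform (Poisson) placement of fragments are my assumptions, not figures from the paper.

```python
import math

FRAGMENT_BP = 10_000        # fragment size, from the protocol above
FRAGMENTS_PER_WELL = 3_000  # fragments per well, from the protocol above
WELLS = 384                 # number of wells, from the protocol above
HAPLOID_GENOME_BP = 3.2e9   # ASSUMPTION: approximate human haploid genome size


def dna_per_well_bp() -> int:
    """Total fragment DNA in one well, in base pairs."""
    return FRAGMENT_BP * FRAGMENTS_PER_WELL


def physical_coverage() -> float:
    """Haploid-genome equivalents of fragment DNA summed over all wells."""
    return WELLS * dna_per_well_bp() / HAPLOID_GENOME_BP


def collision_probability() -> float:
    """Chance that a fragment overlaps one from the OTHER haplotype in its well.

    ASSUMPTION: fragment start sites are uniform over the diploid genome,
    so the number of overlapping fragments is roughly Poisson distributed.
    A fragment overlaps ours if it starts within one fragment length on
    either side (a window of 2 * FRAGMENT_BP) on the other haplotype.
    """
    expected = FRAGMENTS_PER_WELL * 2 * FRAGMENT_BP / (2 * HAPLOID_GENOME_BP)
    return 1 - math.exp(-expected)


print(f"DNA per well: {dna_per_well_bp() / 1e6:.0f} Mbp")
print(f"Physical coverage over all wells: {physical_coverage():.1f}x")
print(f"Haplotype collision probability: {collision_probability():.2%}")
```

Under these assumptions a well contains about 30 Mbp of DNA, roughly 1% of a haploid genome, and the chance of a haplotype collision for any one fragment is also around 1%. That is what lets reads sharing a barcode and a locus be treated as coming from one haplotype.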

After sequencing, the computational steps involve

  • separating the reads from each well based on the attached barcodes and grouping them into fragments
  • assembling the fragments at their overlapping heterozygous SNVs into haplotype blocks
  • phasing the blocks statistically based on a phased reference panel and producing long haplotype contigs
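The first two computational steps can be sketched in a few lines of Python. This is a toy illustration and not the Prism implementation: the data layout (a fragment as a dict from SNV position to allele 0/1), the greedy majority-vote merging, and all function names are assumptions I made for the example. The third step, statistical phasing against a reference panel, is not covered here.

```python
from collections import defaultdict


def group_by_barcode(reads):
    """Demultiplex (barcode, read_id) pairs into {barcode: [read_id, ...]}."""
    wells = defaultdict(list)
    for barcode, read_id in reads:
        wells[barcode].append(read_id)
    return dict(wells)


def flip(hap):
    """Return the complementary haplotype (swap alleles 0 and 1)."""
    return {pos: 1 - allele for pos, allele in hap.items()}


def orientation(block, fragment):
    """+1 if fragment matches block at shared SNVs, -1 if it matches the
    complement, 0 if there is no overlap or the vote is tied."""
    shared = block.keys() & fragment.keys()
    agree = sum(block[pos] == fragment[pos] for pos in shared)
    disagree = len(shared) - agree
    return (agree > disagree) - (agree < disagree)


def assemble_blocks(fragments):
    """Greedily merge fragments that share heterozygous SNVs into blocks.

    Each fragment is {snv_position: allele} and comes from one haplotype.
    A block stores one haplotype; the other is its complement.
    """
    blocks = []
    for fragment in sorted(fragments, key=min):
        merged = dict(fragment)
        untouched = []
        for block in blocks:
            sign = orientation(block, merged)
            if sign == 0:
                untouched.append(block)
                continue
            oriented = block if sign > 0 else flip(block)
            # On a conflicting call, keep the allele already in the block.
            merged = {**merged, **oriented}
        blocks = untouched + [merged]
    # Block labels are arbitrary, so report each with allele 0 at its first SNV.
    blocks = [b if b[min(b)] == 0 else flip(b) for b in blocks]
    return sorted(blocks, key=min)


fragments = [
    {100: 0, 200: 1, 300: 0},  # haplotype A
    {300: 1, 400: 1},          # haplotype B, overlaps the first at SNV 300
    {400: 0, 500: 1},          # haplotype A, overlaps the second at SNV 400
    {900: 1, 1000: 0},         # no overlap with the others: separate block
]
for block in assemble_blocks(fragments):
    print(dict(sorted(block.items())))
```

The greedy majority vote is the simplest way to show the idea; per the paper, Prism uses dynamic programming for this local step, which handles conflicting fragments more carefully.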

The SLRH library prep protocol is similar to, but much improved over, LR-Seq, an earlier version that Moleculo used to sequence B. schlosseri (published in eLife last year). One of the biggest concerns with Moleculo long-read technology is the bias from PCR amplification. The paper claims that PCR artifacts “mainly introduce point errors at individual variants”, and that these errors do not significantly affect the long-range phase information.

Comparing haplotyping by Moleculo with other existing technologies, the paper shows that Moleculo needs a much smaller amount of data (in addition to standard WGS) than the other approaches. With 30–60 Gb of data, SLRH could phase 99% of SNPs, while the other approaches need 110 Gb to phase 94% of SNPs (Kaper et al.) or 203–409 Gb to phase 97% of SNPs.

However, the library prep time for Moleculo could be a bottleneck. In the comparison with three other approaches, SLRH currently needs 2 days of library prep (6 hours hands-on), while the approach by Kaper et al. and Complete Genomics’ LFR each need just one day. The approach by Kitzman et al. needs 7 days.

Hoping to write another post soon going over the details of the Prism algorithm for phasing/haplotyping.

Data and Code Access

Want to play with the data from the publication? Check it out at SRA under accession SRP036864. The Prism algorithm, which does the phasing with Moleculo reads, is freely available under the Illumina open source license (https://github.com/sequencing/licenses). Prism is written in Python & C and can be installed as follows (it is also available from the Prism GitHub page):

wget http://www.stanford.edu/~kuleshov/prism.tar.gz
tar -zxvf prism.tar.gz
cd prism
python setup.py install

Making the software and data publicly available is great. However, one also wonders why the paper is not #openaccess, given the importance of the Moleculo long read technology, its possible applications and the fact that Illumina is offering it as a service.

Update:

An earlier version of this work was presented as a poster at AGBT 2013, and the posters are available from here:

 

Comments

  1. Hi,
    First author of the paper here. Let me know if you have any questions! Also, if you’re interested in the Prism phasing algorithm, have a look at the following tutorial on how to run Prism on a subset of chr. 22: http://www.stanford.edu/~kuleshov/prism/tutorial.html
    — Volodymyr

    • nextgenseek says:

      Thanks for dropping by the blog post and offering your help. Really nice paper. Hoping to read the paper in depth soon. Thanks again.

    • Hello Volodymyr!
      I would like to phase the data from the publication, but it is not clear to me how to obtain the local blocks. I have downloaded one dataset and have what appear to be barcoded Illumina reads. How can I obtain “long reads” out of this?
      Best

      L
