Just ahead of the upcoming AGBT 2014 meeting at Marco Island, Florida, PacBio released 54x coverage human genome sequence data for public use. The new data was generated with PacBio's P5-C3 sequencing chemistry on a well-studied human haploid cell line (CHM1htert). PacBio unveiled the P5-C3 chemistry in fall 2013; it produces read lengths greater than 8,500 bp, with about 50% of the reads over 10,000 bp in length.
CHM cells have a diploid genome resulting from replication of a haploid paternal (sperm) genome. This is the same cell line being used to sequence and assemble an alternate reference genome (“platinum genome”) by Rick Wilson of Washington University in St. Louis and Evan Eichler of the University of Washington, in collaboration with investigators from the National Center for Biotechnology Information (NCBI). A variety of sequence data is therefore already available for this cell line, and the high-depth long PacBio reads will be a valuable addition.
PacBio said one of its main reasons for making the human genome dataset public is to “accelerate the understanding of genome-wide variation at all genome size scales, and to improve assembly techniques”. PacBio-related presentations on the data at AGBT will include:
- Gene Myers’ talk “A De Novo Whole Genome Shotgun Assembler for Noisy Long Read Data.”
- PacBio Senior Director of Bioinformatics Jason Chin’s talk “String Graph Assembly For Diploid Genomes With Long Reads.”
PacBio also claimed that de novo assembly of the human genome from its long-read data, using the Hierarchical Genome Assembly Process (HGAP) in collaboration with Google, has produced a 3.25 Gb assembly with a contig N50 of 4.38 Mb and a longest contig of 44 Mb. Compare this to a total assembly size of 2.83 Gb and a contig N50 of 144 kb for the most recent reference-guided assembly of the same sample, which used Illumina data and BAC-clone finishing. That is roughly a 30-fold difference in contig N50.
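For readers less familiar with the metric, contig N50 is the contig length at which the contigs of that length or longer together cover at least half of the total assembly. Here is a minimal Python sketch of that calculation. The function name `contig_n50` and the toy contig lengths are made up for illustration and are not taken from the PacBio assembly.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units), total = 300.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_lengths))  # prints 70, since 80 + 70 = 150 covers half of 300
```

A higher N50 means more of the assembly sits in long contigs, which is why the jump from 144 kb to 4.38 Mb is notable.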
The PacBio data is not available yet, but will be released on February 12 via the PacBio Blog. PacBio released 10x coverage data from the same cell line just ahead of ASHG 2013, and last year it also released both DNA and RNA-seq data from multiple organisms. Here is a link pointing to some of PacBio's long-read data.
Announcing the release of the high-depth PacBio human data, Michael Hunkapiller, Chief Executive Officer of Pacific Biosciences, said:
Recent performance increases on the PacBio RS II — notably substantial improvements in throughput — are allowing scientists to approach much larger genome sequencing projects. The list of recent accomplishments using SMRT Sequencing includes agriculturally valuable plants such as spinach, model organisms such as Drosophila and Arabidopsis, and now, very high-quality work on human genomes. We are delighted with the diversity of applications that will be showcased by our customers this year at AGBT, including unprecedented work on the human genome.