Long reads of the year 2013

A bit late to the yearly review posts, but here it is: Long Reads of the year 2013. As you can see, these “Long Reads” are slightly different :) Here we summarize a few “long read” sequence data sets that became publicly available last year and point to where one can download them. They are awesome resources and great to start playing with in the new year.

One of the most exciting things that happened in “next-gen sequencing” in 2013 is the availability of “long” sequence reads, be they genomic or transcriptomic. Two sequencing technologies that already have “long reads” and got a lot of attention during the year are Illumina’s Moleculo and PacBio. And Oxford Nanopore data is just around the corner. With Oxford Nanopore’s early access program, we might see some data by February 2014 (AGBT 2014?).

The year 2013 started with Illumina acquiring Moleculo for its long-read technology. Another big change is that PacBio got more social (possibly realizing the threat from Illumina) :). PacBio started blogging in mid-2012 but had just two blog posts that year. Then 2013 came and PacBio got really prolific; it now has over 55 posts. In addition, PacBio started making its data publicly available through the blog.

Moleculo and PacBio sequence data from Drosophila

After acquiring Moleculo, Illumina launched its Fast Track Long Read sequencing service using Moleculo long-read technology. As part of the early access launch, Illumina shared a long-read data set from Dr. Dmitri Petrov’s group at Stanford, comprising two libraries of Drosophila melanogaster, each run on a single HiSeq lane and producing ~30 Gb of data. Visit Illumina’s BaseSpace to get the data.

Around the same time, Casey Bergman’s lab made PacBio long reads publicly available. The raw PacBio data totals 1,357,183,439 bp, or ~7.5x coverage of the 180 Mb male D. melanogaster genome. The 63G of PacBio data can be downloaded from the Bergman lab website. In addition, the Bergman lab had Illumina data from the same sample and combined it with the PacBio reads to offer error-corrected sequence data.
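As a quick sanity check on the ~7.5x figure, fold coverage is simply total sequenced bases divided by genome size. Here is a minimal sketch in Python using only the two numbers quoted above; the function name `fold_coverage` is just an illustrative choice.

```python
def fold_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Return average fold coverage: sequenced bases per genome base."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


# Numbers quoted in the post for the Bergman lab data set.
raw_pacbio_bases = 1_357_183_439   # raw PacBio bases
genome_size_bp = 180_000_000       # 180 Mb male D. melanogaster genome

coverage = fold_coverage(raw_pacbio_bases, genome_size_bp)
print(f"{coverage:.1f}x")  # prints "7.5x"
```

The same one-liner works for any of the data sets in this post once you know the total yield and an approximate genome size.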

Another possible Moleculo data set is from the first publication using Moleculo technology. The Moleculo team worked on the project before the technology was named Moleculo, and the results came out in a paper in eLife. However, it looks like the data is not freely available. Are there other Moleculo data sets out in the wild?

PacBio RNA-seq data from Human MCF-7

PacBio generated long-read sequencing data of RNA from MCF-7, a human breast cancer cell line, and made it available on its website. The data was obtained with P4-C2 sequencing chemistry and contains 44,531 non-redundant transcript-length consensus sequences with read lengths ranging from 400 bp to 4,900 bp (average length 1,929 bp). Here is the PacBio blog post offering more details on the “long read” data.

Long-Read Shotgun Sequencing of a Human Genome

PacBio released data generated with P5-C3 scaffolding sequencing chemistry. It contains over 3.6 M reads with an average length of 8,849 bases (half of the sequenced bases are in reads longer than 10,985 bp). The data is from an interesting human cell line derived from a complete hydatidiform mole (CHM).
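The “half of the sequenced bases are in reads longer than” statistic is what is usually called the read N50. Below is a minimal sketch of how one might compute it from a list of read lengths, along with a rough total-yield estimate from the numbers above. The toy read lengths are made up for illustration, and the yield is only approximate because the post says “over” 3.6 M reads.

```python
def n50(read_lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    lengths = sorted(read_lengths, reverse=True)
    half_of_bases = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("need at least one read")


# Toy example (made-up lengths): 20 bases in total, so the halfway point is 10.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)  # 6 + 5 = 11 >= 10, so the N50 is 5

# Rough yield from the quoted figures: reads x average length.
approx_reads = 3_600_000
average_length = 8_849
approx_yield_bp = approx_reads * average_length
print(f"N50 of toy reads: {toy_n50}")
print(f"Approximate yield: {approx_yield_bp / 1e9:.1f} Gb")
```

Note that the N50 (10,985 bp) being larger than the mean (8,849 bp) is expected: long reads contribute disproportionately to the total base count.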

A hydatidiform mole is defined as a pregnancy with no embryo and clinically presents in approximately 1 in 1,500 pregnant women in North America. The CHM cells have a diploid genome, typically XX, that is a result of replication of a haploid paternal (sperm) genome. Through the corresponding absence of allelic variation, this sample has been used to generate a haploid reference genome sequence, and many associated resources are available, including physical maps, genotypes (iSCAN), and a large-insert BAC library (CHORI-17). It is also one of the targets for the production of a higher quality “platinum” genome assembly.

Visit the PacBio blog to access the data.

PacBio RNA-seq data

Mike Snyder’s group from Stanford did the first long-read survey of the human transcriptome and generated 476,000 CCS reads from cDNA with an average length of 1 kb to investigate the isoform complement of a diverse pool of RNA samples representing 20 human tissues and organs. Data from the 454 platform on the same samples, with an average read length of 522 bp, is also available. PacBio RNA-seq data on ENA: PRJEB3969

PacBio RNA-seq data from hESC cell line

Wing Wong’s team from Stanford published in PNAS a new method that uses PacBio and Illumina reads to identify isoforms. The team used C2 chemistry to generate over 7.5 M reads of average length 2-3 kb from the hESC cell line H1. Data can be accessed at GSE51861.
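The two accessions mentioned in this post live in different archives: PRJEB3969 is an ENA project and GSE51861 is a GEO series. Here is a small sketch of a helper that turns such an accession into a landing-page URL. The helper name `accession_url` is hypothetical, and the URL patterns are an assumption based on how the ENA browser and GEO pages are laid out at the time of writing; they may change.

```python
import re


def accession_url(accession: str) -> str:
    """Map a project or series accession to its archive landing page (assumed URL layout)."""
    acc = accession.strip().upper()
    # PRJEB/PRJNA/PRJDB are project accessions shared across ENA, NCBI and DDBJ.
    if re.fullmatch(r"PRJ(EB|NA|DB)\d+", acc):
        return f"https://www.ebi.ac.uk/ena/browser/view/{acc}"
    # GSE accessions are GEO series.
    if re.fullmatch(r"GSE\d+", acc):
        return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"
    raise ValueError(f"unrecognized accession: {accession!r}")


for acc in ("PRJEB3969", "GSE51861"):
    print(acc, "->", accession_url(acc))
```

From those pages you can follow the links to the actual read files; this sketch only builds the URLs and does not download anything.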



  1. Great idea, this post! Some comments:

    The Drosophila moleculo data is available through Illumina’s basespace (free registration required).

    PacBio released several bacterial genome datasets, from projects illustrating the potential for finished genomes using this platform.

  2. And then I forgot to include the Arabidopsis Pacbio long reads, as well as the reads generated from the Human Microbiome Project ‘mock community’ sample – both released by the company and available through pacbiodevnet.com

    • nextgenseek says:

      Thanks a lot for mentioning basespace registration for getting Moleculo data and pointing to more PacBio data. Thank you


