It has been exactly five years and six months since the first paper on RNA-seq came out. In that short span of time, RNA-seq has replaced microarrays for gene expression studies and opened up many interesting new applications. No doubt, RNA-seq has been great. However, if you ever need a good RNA-seq reality check, think of one of the biggest challenges for RNA-seq with second-generation short-read technology: isoform identification and quantitation.
Briefly, reads from Illumina technology are short, on the order of 100-250 bases, which is much smaller than the median human transcript length of about 2,500 bases. The short reads make it difficult to identify and quantify isoforms. Aligning these reads to the genome gives us exon alignments, where a read aligns completely within an exon, and junction alignments, where a read spans two exons.
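To make the two alignment types concrete, here is a minimal sketch that labels a read as an exon or junction alignment from its aligned blocks. The coordinates, the half-open interval convention, and the function name `classify_read` are all made up for illustration; real aligners report this information through spliced alignment records.

```python
def classify_read(blocks, exons):
    """Label a read as 'exon', 'junction', or 'other'.

    blocks: aligned segments of one read as (start, end), half-open.
    exons:  annotated exons as (start, end), half-open, sorted by start.
    All coordinates here are hypothetical.
    """
    def containing_exon(block):
        # Index of the exon that fully contains this block, if any.
        for i, (start, end) in enumerate(exons):
            if start <= block[0] and block[1] <= end:
                return i
        return None

    hits = [containing_exon(b) for b in blocks]
    if None in hits:
        return "other"
    if len(blocks) == 1:
        return "exon"
    if len(blocks) == 2 and hits[0] < hits[1]:
        # A junction read leaves one exon at its end and enters
        # a later exon at its start.
        left_ok = blocks[0][1] == exons[hits[0]][1]
        right_ok = blocks[1][0] == exons[hits[1]][0]
        if left_ok and right_ok:
            return "junction"
    return "other"


exons = [(100, 200), (300, 400), (500, 600)]
within_exon = classify_read([(120, 180)], exons)
adjacent_junction = classify_read([(170, 200), (300, 330)], exons)
skipping_junction = classify_read([(170, 200), (500, 540)], exons)
unexplained = classify_read([(190, 250)], exons)
```

Note that the exon-skipping read is still just a junction read: it tells us exon 1 was joined to exon 3 in some transcript, but nothing about what happens further along that transcript.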
RNA-seq reality check
In theory, isoform deconvolution from short RNA-seq reads is not an identifiable problem. That means a given set of exon and junction reads might be compatible with multiple sets of isoforms and expression levels. Mathematically, we cannot uniquely identify the isoforms (and their expression) from the exon and splice junction reads alone. Current approaches have to resort to some sort of approximation. For example, Cufflinks estimates a parsimonious set of isoforms supported by both types of reads (and also multireads). Long sequencing reads from PacBio (and Moleculo, Nanopore) will help us find new isoforms and, in turn, improve isoform quantitation.
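A toy example may help show what non-identifiability looks like. The sketch below assumes a hypothetical five-exon gene where E2 and E4 are cassette exons, and where E3 is long enough that no short read can cover both splicing events at once. It also ignores exon lengths and sequencing noise, so the expected signal is just the isoform abundances summed over each exon and junction. The gene, the isoform names, and the abundances are all invented for illustration.

```python
import numpy as np

# Observable features: five exons plus every junction used by some isoform.
FEATURES = ["E1", "E2", "E3", "E4", "E5",
            "E1-E2", "E2-E3", "E1-E3", "E3-E4", "E4-E5", "E3-E5"]

# Four hypothetical isoforms: E2 and E4 are each included or skipped.
ISOFORMS = {
    "both_included": ["E1", "E2", "E3", "E4", "E5"],
    "only_E2":       ["E1", "E2", "E3", "E5"],
    "only_E4":       ["E1", "E3", "E4", "E5"],
    "both_skipped":  ["E1", "E3", "E5"],
}


def feature_vector(exon_chain):
    """0/1 indicator of which exons and junctions an isoform contains."""
    junctions = {f"{a}-{b}" for a, b in zip(exon_chain, exon_chain[1:])}
    present = set(exon_chain) | junctions
    return [1.0 if f in present else 0.0 for f in FEATURES]


# Design matrix: rows are features, columns are isoforms.
A = np.array([feature_vector(chain) for chain in ISOFORMS.values()]).T

# Two very different abundance vectors (same column order as ISOFORMS).
theta_a = np.array([0.50, 0.00, 0.00, 0.50])
theta_b = np.array([0.25, 0.25, 0.25, 0.25])

# Expected exon and junction signal under each scenario.
signal_a = A @ theta_a
signal_b = A @ theta_b

same_signal = np.allclose(signal_a, signal_b)
rank = np.linalg.matrix_rank(A)
```

Here `same_signal` comes out true and the matrix has rank 3 with four unknowns, so exon and junction counts alone cannot separate the two scenarios. A single long read covering E2 through E4 would.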
IDP: Isoform Detection and Prediction
Last week, PNAS Early Edition published an interesting paper titled “Characterization of the human ESC transcriptome by hybrid sequencing” by Au et al. from Stanford’s Wing Wong group. The paper addresses the problem of isoform identification and quantitation using PacBio and Illumina RNA-seq technologies. The team developed a new computational method, called Isoform Detection and Prediction (IDP), which combines the accuracy of short Illumina reads with long error-corrected PacBio reads for isoform detection and quantitation.
At a high level (with lots of hand-waving), the approach uses error-corrected PacBio long reads to identify potential isoforms. As the mean PacBio read length is about 2-3 kb, with some reads longer than 8 kb, they are good at finding long isoforms. The authors also use the splice junctions obtained from Illumina reads to predict the possible isoforms.
The authors have developed a new statistical method for quantifying the expression of isoform candidates. Instead of simply using the alignments from Illumina reads to quantify isoform abundance, the authors use the PacBio data as a prior and estimate the abundance from the Illumina data in a Maximum A Posteriori (MAP) setting. The basic framework used in the method is very similar to an earlier method developed by Wing Wong’s group. The big difference is that the earlier method estimates isoform abundances using a Maximum Likelihood approach. IDP takes the Bayesian route by using the known isoform information from PacBio as a prior in a similar model and computing a MAP estimate.
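To illustrate the difference between the two kinds of estimate, here is a toy sketch. It is not IDP's actual model. It uses a simple mixture model where each short read is compatible with one or more isoforms, fitted with EM. The prior is expressed as pseudo-counts per isoform, which stand in for long-read support (a Dirichlet prior, an assumption made here for convenience). The function name, read counts, and pseudo-counts are all made up.

```python
import numpy as np


def em_abundance(compat, pseudo=None, n_iter=500):
    """Estimate isoform proportions with EM.

    compat: reads x isoforms 0/1 matrix (1 = read fits the isoform).
    pseudo: optional prior pseudo-counts per isoform. None gives the
            maximum likelihood estimate; non-negative pseudo-counts
            give a MAP estimate under a Dirichlet prior.
    """
    n_iso = compat.shape[1]
    pseudo = np.zeros(n_iso) if pseudo is None else np.asarray(pseudo, float)
    theta = np.full(n_iso, 1.0 / n_iso)
    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms.
        weights = compat * theta
        resp = weights / weights.sum(axis=1, keepdims=True)
        # M-step: expected read counts plus prior pseudo-counts.
        counts = resp.sum(axis=0) + pseudo
        theta = counts / counts.sum()
    return theta


# 60 reads fit isoforms 0 and 1 equally well; 40 reads fit only isoform 2.
compat = np.array([[1, 1, 0]] * 60 + [[0, 0, 1]] * 40, dtype=float)

mle = em_abundance(compat)
# Hypothetical long-read support: 9 reads for isoform 0, 1 for isoform 1.
map_est = em_abundance(compat, pseudo=[9, 1, 0])
```

With short reads alone, nothing separates isoforms 0 and 1, and EM started from a uniform guess simply splits the ambiguous reads evenly. With the pseudo-counts added, the ambiguous reads are pulled toward the isoform that the long reads support.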
IDP-like approaches look promising and should help annotate isoforms better. I have not fully gone through the results yet. A quick glance shows that IDP identifies thousands of new isoforms that are not in any annotation. The use of PacBio reads seems to help IDP perform better than Cufflinks run on second-gen data alone. Another interesting result from the paper, although part of it is already known, is that
Most genes only express one or two isoforms, while there is no such preference for number of junctions used by a gene. And also there is no correlation between the number of isoforms used and number of junctions used by a gene (Fig 8).
The IDP software tool is freely available and is written in Python. Check http://www.stanford.edu/~kinfai/IDP/IDP.html
Update: PacBio posted a summary of this paper on its blog a few days after this post. Glad to beat PacBio :) Here is the post