Cataloging Splice Junctions in all Human RNA-seq data from Sequence Read Archive

One of the biggest sources of complexity in transcriptomes comes from ubiquitous splicing. Our knowledge of the prevalence of alternative splicing has greatly increased thanks to our ability to sequence the transcriptome directly with RNA-seq technology. Granted, fully characterizing the isoforms that result from splicing requires long-read transcriptome sequencing. Still, the abundance of RNA-seq data, mainly from Illumina sequencing technology, has been a boon for characterizing splicing.

Until now, no study has looked at all the available data together to characterize splicing. A new paper on bioRxiv looks at splicing diversity in humans across all the sequence data available at the Sequence Read Archive (SRA).

Earlier, in the summer of 2015, Nellore et al. (from Jeff Leek and Ben Langmead's team at JHU) developed Rail-RNA, a splice-aware, annotation-agnostic RNA-seq read aligner for analyzing population-scale RNA-seq data. Rail-RNA uses the MapReduce framework and alternates between aggregating and computing steps. In the aggregating steps, Rail-RNA removes alignment redundancy by grouping duplicate reads from all the samples, so that each unique read is aligned just once. Reads that do not align perfectly are then divided into segments, and these segments are aggregated further for alignment.
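To make the aggregation idea concrete, here is a minimal Python sketch. It is not Rail-RNA's code: the function names, the toy reference, the exact-match `toy_align` stand-in, and the segment size are all assumptions made up for illustration. It only shows the principle of aligning each distinct read sequence once and copying the result back to every sample that carries it.

```python
from collections import defaultdict


def aggregate_reads(samples):
    """Map each distinct read sequence to the (sample, read_id) pairs that carry it."""
    owners = defaultdict(list)
    for sample, reads in samples.items():
        for read_id, seq in reads.items():
            owners[seq].append((sample, read_id))
    return owners


def align_once(samples, align):
    """Call `align` once per distinct sequence, then copy the hit to every owner."""
    owners = aggregate_reads(samples)
    results = {}
    calls = 0
    for seq, who in owners.items():
        hit = align(seq)
        calls += 1
        for key in who:
            results[key] = hit
    return results, calls


def split_segments(seq, size):
    """Cut a read that failed to align into fixed-size segments (size is an assumption)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def toy_align(seq, reference="ACGTACGTTTGACCA"):
    """Hypothetical stand-in aligner: exact substring match, offset or None."""
    pos = reference.find(seq)
    return pos if pos >= 0 else None


# Two toy samples that share one read sequence.
samples = {
    "s1": {"r1": "ACGTAC", "r2": "TTGACC"},
    "s2": {"r1": "ACGTAC", "r2": "GGGGGG"},
}
results, calls = align_once(samples, toy_align)

# Reads with no hit would be segmented and sent through aggregation again.
unaligned = sorted(key for key, hit in results.items() if hit is None)
segments = {key: split_segments(samples[key[0]][key[1]], 3) for key in unaligned}
```

In this toy input there are four reads but only three distinct sequences, so the aligner is called three times. Across thousands of samples the saving from this kind of deduplication would presumably be much larger.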


Rail-RNA: Aligning RNA-seq reads at population scale

In the new preprint, Abhinav Nellore et al. used Rail-RNA to analyze over 21,500 human RNA-seq samples available at the SRA and traced the growth of splice junction discovery over time. Yes, you got that right: they took over 21,500 publicly available RNA-seq samples, aligned them to the human reference genome, and cataloged all the splice junctions. Using all the data, the work identified over 42.88M splice junctions with at least one supporting read (in contrast to just about 350,000 junctions from known annotation).
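As a rough illustration of what "cataloging junctions with at least one read of support" and comparing them with annotation could look like, here is a small Python sketch. The junction tuples, the sample names, and the `min_reads` parameter are invented for the example; this is not the preprint's pipeline or its file format.

```python
from collections import Counter


def catalog_junctions(per_sample_junctions):
    """Sum the supporting read counts for each junction across all samples."""
    total = Counter()
    for junctions in per_sample_junctions.values():
        total.update(junctions)  # adds the per-sample counts
    return total


def split_by_annotation(catalog, annotated, min_reads=1):
    """Return (annotated, novel) junctions that have at least `min_reads` reads."""
    supported = {j for j, n in catalog.items() if n >= min_reads}
    return supported & annotated, supported - annotated


# A junction is keyed here as (chromosome, donor, acceptor, strand); toy values.
j1 = ("chr1", 1000, 2000, "+")
j2 = ("chr1", 3000, 4500, "+")
j3 = ("chr2", 500, 900, "-")

per_sample = {
    "sampleA": {j1: 12, j2: 1},
    "sampleB": {j1: 7, j3: 2},
}
annotated = {j1}

catalog = catalog_junctions(per_sample)
known, novel = split_by_annotation(catalog, annotated)
known_strict, novel_strict = split_by_annotation(catalog, annotated, min_reads=2)
```

Raising `min_reads` is the obvious knob to turn if single-read junctions look too noisy, which is worth keeping in mind when reading the 42.88M figure.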

Evolution of Splice Junction Discovery

Evolution of Splice Junction Discovery using all data from SRA

It is fascinating to see the evolution of splice-junction discovery over time. Early RNA-seq projects in 2009, each sequencing ~40-70 human samples (Cheung et al. and Pickrell et al.), produced a spike in splice-junction discovery.

The 2011 spikes in splice-junction discovery were mainly due to the University of Washington Human Reference Epigenome Mapping Project (UWE), the Illumina Body Map 2.0 project (BM2) and the ENCODE project. By 2013, new splice-junction discovery had plateaued, and the GEUVADIS project, one of the largest RNA-seq projects, did not add much to splice-junction discovery.
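The spike-then-plateau pattern is easy to reason about with a toy discovery curve: order the projects by date and count only the junctions that no earlier project has already seen. The sketch below is my own illustration with made-up project names and junction IDs, not the method or the data behind the figure.

```python
def discovery_curve(projects):
    """projects: list of (year, name, junction_set).

    Returns a list of (year, name, new_junctions, cumulative_junctions).
    """
    seen = set()
    curve = []
    # Sort on the year only; ties keep their input order.
    for year, name, junctions in sorted(projects, key=lambda p: p[0]):
        new = junctions - seen
        seen |= junctions
        curve.append((year, name, len(new), len(seen)))
    return curve


# Hypothetical projects with integer junction IDs.
projects = [
    (2013, "large_late_project", {1, 2, 3, 4, 5, 6, 7}),
    (2009, "early_project", {1, 2, 3}),
    (2011, "multi_tissue_project", {2, 3, 4, 5, 6, 7}),
]
curve = discovery_curve(projects)
```

In this toy case the late project is the largest but contributes nothing new, because everything it detects was already found earlier. That is one way a big dataset like GEUVADIS could add few junctions to the cumulative count.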

Interestingly, another large-scale RNA-seq project (Montgomery et al., with about 60 samples from the CEU population), published at the same time as the Pickrell et al. paper, is not mentioned in the figure. Maybe it was just an oversight and the 2009 spike includes the data from the Montgomery et al. paper too.

I am not sure how the analysis deals with the mappability of genomic regions (and processed pseudogenes); it would be really nice if it could relate mappability to false positives among the splice junctions.

The list of all the splice junctions from the analysis is a fantastic resource, and it is available for download at intropolis. The preprint analyzes these splice junctions in multiple ways; I would love to dissect the junctions in tissue-specific and gene-specific ways. It would also be great to play with the data in an interactive browser like ExAC. (No.. not playing the 3rd reviewer here :-))
