Just a few weeks ago, Mike Snyder’s team at Stanford published an interesting paper in PNAS. Snyder’s team is again using PacBio long reads to understand and characterize the human transcriptome.
by Hagen Tilgner, Fabian Grubert, Donald Sharon, and Michael P. Snyder
In this paper, Tilgner et al. focused on defining a personal transcriptome at the allele level.
Diploid organisms like humans have two copies (alleles) of every autosomal gene: one allele comes from the mother and the other from the father. A number of studies have shown that preferential expression of mom’s or dad’s allele is pretty common. However, until now these studies have used short-read RNA-seq technology to study the transcriptome at the allele level.
This paper is the first attempt to characterize personal transcriptomics, where an individual’s genetic variations and allele-level isoforms are defined and quantified over the transcript’s full length. Snyder’s team produced a personal transcriptome at the allele level by generating long-read PacBio RNA-seq data from the GM12878 cell line (lymphoblastoid) and its parental cell lines (GM12891 and GM12892). In addition to the long reads from PacBio, the team also used Illumina sequencing to give the most comprehensive view of a personal transcriptome. By sequencing the transcriptomes of the trio with PacBio, the paper provides a glimpse of allele-level expression across full-length transcripts.
They sequenced 711,000 circular consensus (CCS) reads/molecules from unamplified, polyA-selected RNA from the GM12878 cell line. The average CCS read length is 1,188 bp, with some reads up to 6 kb long. The sequencing effort in this paper is much larger than their earlier effort to define transcriptomes using PacBio long-read technology (published in Nature Biotechnology in 2013: A single-molecule long-read survey of the human transcriptome). The early part of the paper is all about how long-read PacBio sequencing fares in finding isoforms and improving GENCODE annotations.
- How is a CCS read able to capture the entire exon-intron structure of a transcript in a single read?
- How does gene detection by long-read PacBio compare to Illumina 101 bp reads?
- How do long reads enhance GENCODE annotations?
One of the interesting parts of the paper was finding out the parental origin of a given PacBio long read, i.e. whether the read originated from mom or dad. This basically helps quantify allele-specific gene expression.
Traditionally, using Illumina reads, one aligns reads to the genome, looks at known SNP locations, and counts the number of reads with the reference allele and the number of reads with the alternate allele to quantify allele-specific expression. After accounting for the alignment bias induced by the reference genome, either at the alignment step or later, this approach gives us an idea of how many genes show allele-specific expression. However, it gives SNP-level allelic expression, not isoform-level or gene-level expression. Therefore, quantifying allelic expression of a transcript with multiple variants is challenging using Illumina reads.
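To make the SNP-level counting concrete, here is a minimal Python sketch. It is not the paper's pipeline: the pileup is made up, the function names are hypothetical, and the exact two-sided binomial test against a 50/50 expectation is just one common way to flag allelic imbalance at a single heterozygous SNP. It also ignores reference alignment bias.

```python
from math import comb

def allelic_counts(bases_at_snp, ref, alt):
    """Count reads supporting the reference and the alternate allele at one SNP."""
    ref_n = sum(1 for b in bases_at_snp if b == ref)
    alt_n = sum(1 for b in bases_at_snp if b == alt)
    return ref_n, alt_n

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-9)))

# Hypothetical pileup at one heterozygous SNP (ref = A, alt = G): 15 A's and 5 G's.
pileup = list("AAAAAAAAAAAAAAAGGGGG")
ref_n, alt_n = allelic_counts(pileup, "A", "G")
ratio = ref_n / (ref_n + alt_n)            # fraction of reads carrying the reference allele
p_value = binomial_two_sided_p(ref_n, ref_n + alt_n)
print(ref_n, alt_n, ratio, p_value)
```

Note that the result is tied to one SNP. A transcript with several heterozygous SNPs yields several such ratios, and short reads rarely tell us which SNP alleles sit together on the same molecule.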
Allele Specific Expression by Principal Component Analysis (PCA)
Interestingly, this paper addressed the quantification of allele-specific expression in a PCA/SVD framework instead of looking at known SNP locations. Possible reasons for the approach are that
- this gives read level inference of parental origin instead of SNP level
- the random errors in PacBio may make it difficult to quantify allele level expression at SNP level.
At a high level, in the PCA framework, the SNP profiles of aligned reads are the input data and the parental origin of each read is the unknown variable to be inferred from the SNP profile. The SNP profile, a read-by-mismatch matrix, is created by coding each position as 1, 0, or -1 depending on whether the nucleotide is different from the reference, the same as the reference, or absent.
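The coding scheme can be sketched in a few lines of Python. This is only an illustration of the 1/0/-1 coding as described in this post. The input format (a dict from position to base for each read), the function name, and the choice of one column per position are assumptions made for the example, not details taken from the paper.

```python
def encode_read(read_bases, reference, columns):
    """Encode one aligned read as a row of the read-by-mismatch matrix.

    read_bases: dict mapping position -> base observed in the read (assumed format)
    reference:  dict mapping position -> reference base
    columns:    ordered list of positions used as matrix columns
    """
    row = []
    for pos in columns:
        base = read_bases.get(pos)
        if base is None:
            row.append(-1)   # read does not cover this position
        elif base != reference[pos]:
            row.append(1)    # nucleotide differs from the reference
        else:
            row.append(0)    # nucleotide matches the reference
    return row

# Toy example: three positions, three reads.
reference = {10: "A", 25: "C", 40: "G"}
columns = sorted(reference)
reads = [
    {10: "A", 25: "T"},            # mismatch at 25, does not reach 40
    {25: "C", 40: "G"},            # starts after 10, matches elsewhere
    {10: "G", 25: "T", 40: "G"},   # mismatches at 10 and 25
]
matrix = [encode_read(r, reference, columns) for r in reads]
print(matrix)
```

A mismatch column can hold a real SNP or a sequencing error. The matrix does not distinguish them, which is what the PCA step is meant to sort out.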
Assuming the PacBio errors are random and there are no other factors affecting the reads, one can do PCA on the SNP profiles of aligned reads, and one of the principal components will correspond to the parental origin of the reads. With enough reads, a gene with multiple SNPs will have enough signal for it to be captured as one of the principal components. The basic idea is that instead of computing ASE from single SNPs, the PCA approach uses information from multiple mismatches (either real SNPs or errors) to compute the parental origin of reads. Each column in the read-by-mismatch matrix is a potential SNP; PCA performs dimensionality reduction, and the PC that explains the most variance gives us the parental origin of each read.
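Here is a small simulation of that idea, written as a hedged sketch rather than a reimplementation of the paper. The reads are simulated, every read is assumed to cover all columns (so no -1 entries), one haplotype matches the reference and the other carries the alternate allele at five made-up SNP columns, and errors flip entries at random. To stay dependency-free, the first principal component is found by power iteration instead of a library SVD. All names and parameters are hypothetical.

```python
import random

def first_principal_component(matrix, n_iter=100, seed=0):
    """Return (loadings, scores) of PC1, using power iteration on the centered matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_cols)]      # random starting direction
    for _ in range(n_iter):
        xv = [sum(r[j] * v[j] for j in range(n_cols)) for r in centered]     # X v
        w = [sum(centered[i][j] * xv[i] for i in range(n_rows))
             for j in range(n_cols)]                                         # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:        # constant matrix: no variance to explain
            break
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(n_cols)) for r in centered]
    return v, scores

def simulate_reads(n_per_hap=40, n_cols=30, snp_cols=(3, 11, 17, 22, 28),
                   error_rate=0.02, seed=1):
    """Simulate read-by-mismatch rows for two haplotypes with random errors."""
    rng = random.Random(seed)
    rows, truth = [], []
    for hap in (0, 1):
        for _ in range(n_per_hap):
            # Haplotype 1 carries the alternate allele at every SNP column.
            row = [1 if (hap == 1 and j in snp_cols) else 0 for j in range(n_cols)]
            # Random sequencing errors flip match <-> mismatch.
            row = [1 - x if rng.random() < error_rate else x for x in row]
            rows.append(row)
            truth.append(hap)
    return rows, truth

rows, truth = simulate_reads()
loadings, scores = first_principal_component(rows)
calls = [1 if s > 0 else 0 for s in scores]           # split reads by the sign of PC1
agree = sum(c == t for c, t in zip(calls, truth)) / len(truth)
accuracy = max(agree, 1 - agree)                      # the sign of a PC is arbitrary
print(f"fraction of reads assigned to the correct haplotype: {accuracy:.2f}")
```

The SNP columns are perfectly correlated across reads while the error columns are not, so the direction of largest variance lines up with the haplotype. The `max(agree, 1 - agree)` line reflects that PCA separates the two groups but does not say which group is which.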
An advantage of the approach is that we do not need to know about the SNPs a priori. A problem with the approach is that it works only when the data come from trios. The reason is that although a PC from PCA can separate maternal and paternal reads from an individual, it cannot tell which group of reads is maternal and which is paternal. A way to get around the problem is to include the reads from the parents in the PCA analysis; using them, we can tell which reads are paternal and which are maternal. The approach will also fail if a gene has few SNPs, few reads, or reads with many errors.
It would be great to understand the nitty-gritty details of how this method compares with the SNP-based approach and what its limitations are. Some other time, if needed :)
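One simple way to picture the trio step is a nearest-centroid rule on the PC scores. This is a simplification of my own, not necessarily what the paper does. It assumes that, for the gene in question, the mother's reads and the father's reads each fall mostly on one side of the PC, which is only true at informative sites. The scores below are made up and the function name is hypothetical.

```python
def label_child_reads(child_scores, mother_scores, father_scores):
    """Label each child read by whichever parent's mean PC score it is closer to."""
    mother_mean = sum(mother_scores) / len(mother_scores)
    father_mean = sum(father_scores) / len(father_scores)
    if mother_mean == father_mean:
        raise ValueError("parental reads do not separate on this component")
    return [
        "maternal" if abs(s - mother_mean) < abs(s - father_mean) else "paternal"
        for s in child_scores
    ]

# Made-up PC scores: child reads form two groups; the parents orient them.
child_scores = [-1.1, -0.9, 1.0, 1.2]
mother_scores = [0.9, 1.1, 1.0]
father_scores = [-1.0, -0.8]
labels = label_child_reads(child_scores, mother_scores, father_scores)
print(labels)
```

Without the parental scores, the two child groups would still be separable, but there would be nothing to attach the "maternal" and "paternal" names to.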