Large Scale Genetics of Human Gene Expression Studies Turn To Next-Gen Sequencing

Understanding how naturally occurring genetic variation affects gene expression levels has been a promising first step towards understanding the genetics of complex traits at the molecular level. Expression quantitative trait loci (eQTL) studies attempt to map all the genomic regions associated with gene expression levels by genotyping and measuring genome-wide expression levels on the same set of individuals from a population.
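As a rough illustration of the association test at the heart of an eQTL scan, the sketch below regresses expression on genotype dosage for a single SNP-gene pair. The function name `eqtl_association` and the toy data are hypothetical, not taken from any of the studies discussed here; real analyses also normalize expression, adjust for covariates and correct for the very large number of tests.

```python
from math import sqrt


def eqtl_association(dosages, expression):
    """Least-squares fit of expression ~ dosage for one SNP-gene pair.

    dosages: allele counts (0, 1 or 2) per individual.
    expression: expression level of one gene in the same individuals.
    Returns (slope, pearson_r).
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched values for at least 3 individuals")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    # Sums of squares and cross-products around the means
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression shows no variation")
    slope = sxy / sxx            # expression change per extra allele copy
    r = sxy / sqrt(sxx * syy)    # strength of the linear association
    return slope, r


# Hypothetical toy data: six individuals, one SNP, one gene
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
slope, r = eqtl_association(dosages, expression)
print(f"slope per allele: {slope:.2f}, Pearson r: {r:.3f}")
```

A genome-wide scan repeats a test of this kind for every SNP-gene pair of interest, which is why sample size and multiple-testing correction matter so much in the studies below.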

Although RNA-seq has almost replaced microarrays in small- and medium-scale gene expression studies, large-scale studies of the genetics of gene expression have been slow to embrace next-gen sequencing. Until recently, almost all eQTL studies relied on microarrays to measure expression profiles.

Only two studies have been published that looked at the genetics of human gene expression by RNA-seq. Pritchard’s group at U. Chicago characterized the genetics of gene expression in 69 African samples from HapMap using RNA-seq, and a group from Europe analyzed RNA-seq from 60 European HapMap samples. The results from these studies were published back-to-back in the 2010 issue of Nature celebrating “The human genome at ten”.

Since those two publications, no further studies have used RNA-seq to understand the genetics of gene expression. In fact, apart from the ENCODE/modENCODE projects, no large-scale RNA-seq data has been published yet. ENCODE published over 410 RNA-seq experiments, but they came mainly from a few cell lines and are not population-level data.

It is a bit surprising that the genetics of gene expression has been slow to embrace sequencing. There are surely many valid reasons, including funding, the scale of such projects, the challenge of collecting genotype, expression and possibly other phenotypes from the same population, and the challenge of analyzing RNA-seq data.

Genetics of Gene Expression Goes Next-Gen Sequencing

That is all going to change pretty soon. The recently concluded Biology of Genomes conference highlighted at least two major studies on the genetics of human gene expression using RNA-seq data.

One study, the Geuvadis project from Europe, uses a subset of samples from the 1000 Genomes Project, with the aim of setting up standards for the biological/medical interpretation of sequence data in relation to clinical phenotypes. The Geuvadis project has sequenced both mRNA and microRNA on 465 lymphoblastoid cell line (LCL) samples from 5 populations of the 1000 Genomes Project: one African (Yoruba, YRI) and four European (CEPH (CEU), Finns (FIN), British (GBR), and Toscani (TSI)).

Of these samples, 423 were part of the 1000 Genomes Phase 1 dataset with genome/exome sequencing data. Although the project results are not published yet, the underlying RNA-seq and microRNA-seq data are available from Geuvadis under the Fort Lauderdale Agreement. The Geuvadis website also says that a publication describing the project's results can be expected soon. Until then, here are the tweets from the #BOG13 talk by the lead author of the Geuvadis project.

Genetics of Gene Expression of samples from 1000 Genomes #bog13

The talk by the Geuvadis project's lead author, @tuuliel, at #BOG13, storified.

Storified by NextGenSeek · Mon, May 13 2013 04:06:13

@tuuliel up at the #bog13 podium on functional variation on human populations – eQTLs as an intermediate phenotypes. (Ewan Birney)
Tuuli Lappalainen on the Geuvadis project. Genome meets Transcriptome… #bog13 (Nicolas Robine)
.@tuuliel using gene expression as an intermediate phenotype to understand how genomic variation leads to phenotypic variation. #bog13 (Chris Gunter)
Lappalainen on how gene expression varies among human populations. Uniformly generated RNAseq data from 1 wAfr, 4 Eur populations. #bog13 (Nathan Pearson)
TL Higher proportion of splicing variation between YRI and other pop than within other european populations #bog13 (Nicolas Robine)
TL: splicing variation disproportionately contributes to continental population differences #bog13 (Magdalena Skipper)
Splicing differences stronger than expression between populations. This surprises me. #bog13 (Ewan Birney)
.@tuuliel: miRNA sequencing revealed that miRNAs affect mRNAs, AND the other way around too. #bog13 (Chris Gunter)
Lappalainen findings: 1) Splicing varies greatly between wAfr & Eur. 2) miRNA and mRNA levels correlate, but variably #bog13 (Nathan Pearson)
. @tuuliel shows exon eQTL, transcript ration QTL, mirQTL, repeat eQTL and RNA edit QTL. Cool! #bog13 (Nicolas Robine)
. @tuuliel now on Allele-specific expression (ASE) and transcript structure (ASTS). #bog13 (Nicolas Robine)
TL: allele specific expression is a strong genetic trait #bog13 (Magdalena Skipper)
This is a tour de force of eQTL analysis – I hope tuuli touches in the repeat eQTLs. Currently talking about allele specific effects #bog13 (Ewan Birney)
.@tuuliel: Genetic effects on splicing no less important than genetic effects on expression levels. #bog13 (Chris Gunter)
Lappalainen: Carefully probed (in hets vs. homs) cis-regulation of allele-specific expression. #bog13 (Nathan Pearson)
@ewanbirney I want mirQTL! Looking forward to eating the papers. Great work! #bog13 (Nicolas Robine)
. @tuuliel how to discover causal regulatory variants and links to human disease? #bog13 (Nicolas Robine)
TL: large-scale analysis further confirms that eQTLs often map to promoters, enhancers etc #bog13 (Magdalena Skipper)
Tl: The 1st eQTL variant is causal in 54%of the eQTL loci #bog13 (Magdalena Skipper)
TL: GWAS variant in eQTL is no guarantee that gene expression change underlies disease #bog13 (Magdalena Skipper)
TL: important to keep in mind when GWAS hit is also eQTL-doesn’t necessarily imply regulatory mechanism bc null overlap is high. #bog13 (Alicia Martin)
@tuuliel: 1 in 6 GWAS hits is an eQTL. More than null-expected (1/9). Why? Disease may not always trace to expression diff. #bog13 (Nathan Pearson)
Nice job by @tuuliel of summarizing biological meaning from vast amounts of data. #bog13 (Chris Gunter)
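
The overlap figures quoted in the tweets above (about 1 in 6 GWAS hits being an eQTL, against a null expectation of about 1 in 9) can be turned into a fold enrichment with a quick calculation. The sketch below treats the tweeted fractions as exact, which is an assumption; the talk's underlying counts are not given here.

```python
from fractions import Fraction

# Figures as quoted in the tweets; treated as exact for illustration only
observed_overlap = Fraction(1, 6)   # share of GWAS hits that are also eQTLs
null_overlap = Fraction(1, 9)       # share expected under the null

fold_enrichment = observed_overlap / null_overlap
print(f"fold enrichment: {float(fold_enrichment):.2f}")
```

On these numbers the enrichment is modest, which fits the caution voiced in the talk: because the null overlap is already high, a GWAS hit that is also an eQTL does not by itself imply a regulatory mechanism.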

The second study, from a Stanford group led by Alexis Battle, has sequenced transcriptomes for 922 individuals from the same population and genotyped them for 737,187 common SNPs. With over 900 individuals, it is the largest transcriptome sequencing effort so far. The mRNA from whole blood was sequenced to a high depth of over 60 million reads per individual. This work does not have a project page yet, but one can learn a bit about it from the tweets on Alexis Battle's #BOG13 talk, storified below.

Large-scale RNA-seq study to understand the genetics of gene expression in humans #bog13

Alexis Battle’s talk on the largest RNA-seq project at #bog13, storified.

Storified by NextGenSeek · Mon, May 13 2013 04:12:08

Alexis Battle: Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals #bog13 (Nicolas Robine)
Alexis Battle on genetics of gene expression #bog13 (Elisabeth Rosenthal)
AB: 922 individuals, 720K autosomal snps, RNA from whole blood #bog13 (Elisabeth Rosenthal)
AB: detected potential splicing QTLs (sQTLs) for about 2,851 transcripts #bog13 (Elisabeth Rosenthal)
Battle points out the prevalence of cis regulation in the human genome ; base on eQTL analyst in 922 individuals #bog13 (Magdalena Skipper)
Next up is Alexis Battle on RNA-seq of 922 blood samples. Found ~11K expression QTLs, nearly 3K splicing QTLs. #bog13 (Daniel MacArthur)
Alex’s Battle: basically every single gene has a cis-eQTL if you have a big enough sample. Data have been going this way for a while #bog13 (Jeffrey Barrett)
Battle: distal regulation is less prevalent but has potential for broader effects #bog13 (Magdalena Skipper)
AB: asks do variants outstide of standard promoter act in cis? or have an indirect effect via a TF, for example? #bog13 (Elisabeth Rosenthal)
Grr. Alexis Battle. Bloody autocorrect. #bog13 (Jeffrey Barrett)
Battle: finding lots of s(plicing)QTLs…and trans-eQTLs #bog13 #statisticalpowerisgood (Nathan Pearson)
Battle: many distal elements might act in cis and affect enhancers #bog13 (Magdalena Skipper)
Alexis Battle: find 803 SNPs affecting expression levels of other multiple genes at a distance. #bog13 (Chris Gunter)
AB: found 269 variants that appear to affect expression of genes over 1M bases away. Many act in haplotype-specific fashion. #bog13 (Daniel MacArthur)
AB: found 803 SNPs that affect multiple genes. genes are often colocated (linearly and in 3D by HiC) #bog13 (Elisabeth Rosenthal)
AB: 56 genes have sQTL >1MB away. possible has effect through 3D interaction. #bog13 (Elisabeth Rosenthal)
Battle talks about eQTLs on large sample. Powered to look at trans effects. #bog13 (Ewan Birney)
AB shows qqplot that does not look very good, to me. appears overly skewed and am wondering if an adjustment missing #bog13 (Elisabeth Rosenthal)
Battle: TF and hub-like genes are depleted in eQTLs #bog13 (Magdalena Skipper)
AB: genes with more protein-protein interactions are less likely to have genetic variants altering expression. #bog13 (Daniel MacArthur)
A Battle – fantastic work presented at lightening speed. Glad I have notes from Koller’s April lecture at the Broad to follow along. #bog13 (Kate Blair)
Battle: very few trans-sQTLs. Cites, as example, a cis-eQTL for known splicing factor [no surprise there…]. #bog13 (Nathan Pearson)
AB: looks at encode data near their eQTLs and sQTLs. enriched in chip-SEQ annotation and chromatin marks #bog13 (Elisabeth Rosenthal)
Battle: latent regulatory variant model to predict impact of regulatory variants #bog13 (Magdalena Skipper)
AB: model uses EM maximization of logistic model with ‘hidden’ driver variables #bog13 (Elisabeth Rosenthal)
Battle trained an LD-corrected logistic model to integrate potential cis-regulatory effects of gene-flanking allelisms. #bog13 (Nathan Pearson)
Battle: finding lots of new eqtls. Interested to hear a bit more about the supporting stats but perhaps this is too technical for bog #bog13 (Mark Gerstein)
Many of the key features of eQTL have remained consistent with what we saw in our original study in 2002:… #Bog13 (Leonid Kruglyak)
AB: also working on environmental effects. smoking has broad impact on expression (but I am worried about qqplot, again) #bog13 (Elisabeth Rosenthal)
Battle has data about some environmental exposures in her 922 indivs. Finding that smoking has very broad impacts on transcriptome. #bog13 (Chris Gunter)
AB shows possible GXE effect on expression (smoking and APOE SNP). (I’m not convinced by the boxplots, though) #bog13 (Elisabeth Rosenthal)
Eeek. Confounders when we bring in environment variables. Smoking might well be confounded eg to socio econ status #bog13 (Ewan Birney)
AB: SNPs that affect APOE expression in smokers, not non-smokers. #bog13 (Matthew Herper)
Battle: finds 21 environment specific eQTLs; had some epidemiology information about their individuals #bog13 (Magdalena Skipper)
Battle surveyed transcriptomees’ lifestyles. Turns out smoking affects many genes’ expression, incl. by genotype intxn. #bog13 #whodathunk (Nathan Pearson)
Battle shows a cool GxE analysis: expression qtl that show effect only with smoking. Awesome! #bog13 (Yaniv erlich)
AB: Continues theme at #bog13 that details of biological inputs (environment, behaviors, aka #metadata) essential for interpreting data (EL Hong)
@bullymom2 I was worried about the first qq plot as well. About long range allele specificity #bog13 (Ewan Birney)
Wow. I need to hear that talk again. AB went so fast she lost me, but the environmental effects stuff was fascinating. #bog13 (Matthew Herper)
A Battle – RNA-Seq + SNP data + environmental info => SNP affects expression of APOE in smokers but not non-smokers. #justsayno? #bog13 (Kate Blair)
AB APOE expressed in both smokers and non-smokers. Not driven just by expression. #bog13 (Matthew Herper)
I think we need to be skeptical of AB’s results. sample size seems small to me. qqplots very skewed and possibly missing adjustment. #bog13 (Elisabeth Rosenthal)
#bog13 Battle et al observed an eQTL in APOE for smoking! Very cool! (JL Rodriguez-Flores)
@ewanbirney And of course by genomic susceptibility to nicotine addiction, etc. #bog13 (Nathan Pearson)
Also, I think outliers are having undue influence on the smoking by genotype effect. #bog13 (Elisabeth Rosenthal)
Lander Q: could it be not regulation intracellularly, but regulation of what cell types? AB: Maybe. #bog13 (Matthew Herper)
Battle got many questions all essentially on the same theme/last slide: we don’t know enough yet about confounders and smoking. #bog13 (Chris Gunter)
. @ewanbirney: couldn’t this be correlated to age or some other confounder? #bog13 (Matthew Herper)
@matthewherper watch the confounder thing. There is about 100 years of being burnt by case control studies in epidemiology. #bog13 (Ewan Birney)
Great talk by Alexis Battle about regulatory variation. Would be awesome to apply their prediction model on genome seq data #bog13 (Tuuli Lappalainen)
Question from @denizkural: A neutral theory of variability in gene expression? #bog13 (Kate Blair)
AB: study in case-control depression cohort. Depression correlated with smoking? #bog13 (Leonid Kruglyak)
@ewanbirney smoking/class bias well known wrt classic Doll/Peto studies. Also see for bias (and refs therein)… #bog13 (Douglas Kell)

NIH’s Genotype-Tissue Expression (GTEx) Project

These two large-scale, population-level RNA-seq efforts are only the beginning. NIH’s Genotype-Tissue Expression (GTEx) project is underway to create a large resource of genotype and RNA-seq gene expression data on 30 to 50 tissues in the human body, including the brain, lung, heart and muscle. On scale alone, the GTEx project could be called “popENCODE” :) (ENCODE has over 4,000 experiments, including 2,142 ChIP-seq, 418 RNA-seq and 318 DNase-seq.) The GTEx pilot project’s final data, covering 190 individuals with genotype information and RNA-seq on over 1,800 tissue samples, is available now. The GTEx project plans to scale up the sample size in the future.

