Earlier today, Lior Pachter gave a great talk at the symposium “Computation-Intensive Probabilistic and Statistical Methods for Large-Scale Population Genomics” at UC Berkeley. The talk was about using gene expression from the GEUVADIS RNA-seq data, which covers five populations from the 1000 Genomes Project, to find a geographic signature.
Using principal component analysis (PCA) of genotype data to decipher population substructure or signatures of geography has been well studied. One of the early papers that used PCA to infer geography was by Menozzi P, Piazza A, and Cavalli-Sforza L, published in 1978: “Synthetic maps of human gene frequencies in Europeans”. They had data on 39 loci across Europe and the Near East and showed how PCA reflects geography.
More recently, in 2006, Price et al. used principal components from genotype data to infer population structure and showed how this can help remove false associations in a GWAS setting. Later, two papers by John Novembre and colleagues showed how genotypes reflect continuous population structure across Europe:
- Interpreting principal component analyses of spatial population genetic variation by John Novembre & Matthew Stephens
- Genes mirror geography within Europe, John Novembre et al.
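To make the idea of PCA on genotypes concrete, here is a minimal sketch in Python with NumPy. It is not the code from any of the papers above: the data are simulated, the function name `genotype_pca` and all parameters are made up for illustration, and the per-SNP scaling is just one common normalization for genotype PCA.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Return sample coordinates on the top principal components.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0              # per-SNP allele frequency
    keep = (freq > 0) & (freq < 1)           # drop SNPs with no variation
    g, freq = g[:, keep], freq[keep]
    # Center each SNP and scale by its binomial standard deviation.
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated toy data (an assumption, not real genotypes):
# two populations whose allele frequencies differ slightly at every SNP.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 500
base = rng.uniform(0.1, 0.9, n_snps)
shift = rng.normal(0.0, 0.1, n_snps)
freq_a = np.clip(base + shift, 0.05, 0.95)
freq_b = np.clip(base - shift, 0.05, 0.95)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
])
labels = np.repeat([0, 1], n_per_pop)

pcs = genotype_pca(genotypes)
for pop in (0, 1):
    print(f"population {pop}: mean PC1 = {pcs[labels == pop, 0].mean():.2f}")
```

In this toy setup the two simulated populations should land on opposite sides of the first principal component, which is the basic effect the genotype papers exploit on real data.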
Does Gene Expression Reflect Geography?
In the talk, Lior Pachter asks: okay, genotypes reflect geography, but what about gene expression? Does it also reflect geography? We have enough gene expression data to address the question: microarray gene expression data from multiple HapMap populations, and the latest RNA-seq data from the GEUVADIS project, which covers five populations from the 1000 Genomes Project.
Lior Pachter shows that PCA of gene expression from the GEUVADIS RNA-seq data is noisy and does not reveal geography easily, although the data does contain the signal, and he presents methods to address this problem. Watch the 40-minute video at the end of the post, made available by the Simons Institute, to learn more about this really interesting talk.
One immediate thought was that, when using all the gene expression data, it may be difficult to see the geography because gene expression also contains variation from environmental factors. However, the geographic signal might be more enriched if we run PCA only on genes with cis associations, since cis associations are driven by variants in the physical neighborhood of the gene.
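The mechanics of that thought can be sketched with a toy simulation in Python. Everything here is an assumption made for illustration: the data are simulated rather than GEUVADIS, `pc1_population_score` is a hypothetical helper, the environmental factor is a made-up two-level condition balanced within each population, and the population effect is placed only on the "cis" genes by construction. So this shows how the comparison could be set up, not evidence that it works on real data.

```python
import numpy as np

def pc1_population_score(expr, labels):
    """Absolute correlation between PC1 of an expression matrix and a 0/1 label."""
    x = expr - expr.mean(axis=0)             # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    pc1 = u[:, 0] * s[0]
    return abs(np.corrcoef(pc1, labels)[0, 1])

rng = np.random.default_rng(1)
n_samples, n_genes, n_cis = 100, 2000, 100
labels = np.repeat([0, 1], n_samples // 2)   # two simulated populations
pop_sign = np.where(labels == 0, -1.0, 1.0)

# Hypothetical environmental factor, balanced within each population.
env = np.tile([1.0, -1.0], n_samples // 2)

# Assumption: only the first n_cis genes carry a population effect.
is_cis = np.zeros(n_genes, dtype=bool)
is_cis[:n_cis] = True

env_loading = rng.normal(0.0, 0.5, n_genes)  # environment touches every gene
pop_effect = np.where(is_cis, rng.normal(0.0, 1.0, n_genes), 0.0)

expr = (rng.normal(0.0, 1.0, (n_samples, n_genes))
        + np.outer(env, env_loading)
        + np.outer(pop_sign, pop_effect))

score_all = pc1_population_score(expr, labels)
score_cis = pc1_population_score(expr[:, is_cis], labels)
print(f"PC1 vs population, all genes:      {score_all:.2f}")
print(f"PC1 vs population, cis genes only: {score_cis:.2f}")
```

With these settings the environmental factor dominates PC1 when all genes are used, while PC1 of the cis-only subset tracks the population label much more closely. On real data the list of cis-associated genes would have to come from an eQTL analysis, and whether the enrichment holds is an open question.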