The meaning of a large-scale study producing massive amounts of data has changed with the next-gen sequencing revolution. Thanks to next-gen sequencing applications, these days even small projects are data intensive. Then there is a different breed of projects, like ENCODE or the 1000 Genomes Project, producing next-gen sequencing data on a massive scale. The data from these projects are a real boon to anybody who is computationally inclined.
In addition to the large-scale data resource projects, this year also saw publications from large-scale sequencing efforts on populations and cohorts aimed at understanding complex diseases like cancer and autism.
Here is a brief look at five papers/projects published in 2012 with massive next-gen sequencing data.
The ENCODE Project: ENCyclopedia Of DNA Elements
The ENCODE project, published in early fall this year, is as massive as any project can get. The five-year project, involving over 400 researchers, mapped functional elements in the human genome and produced 30 papers. It did draw a lot of criticism for its claims about the amount of “not-so-junk” DNA, for how its conclusions were portrayed in the public media, and for its data release policy. However, the resulting 15 terabytes of raw data will be a great resource for the community.
Visit http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html to play with ENCODE data.
1000 Genomes Project
The 1000 Genomes Project, which set out to map genetic variation in humans across the world, sequenced 1,092 individuals from 14 populations. The resulting 20,000 Gb of raw sequence data, at about 5X coverage, will be the basis for understanding human genetic variation. The 1000 Genomes Project was initiated in 2008 and ultimately will have data from the genomes of more than 2,600 people from 26 populations around the world.
Sequencing the 1,092 exomes and whole genomes yielded over 38 million SNPs, 1.4 million indels (small insertions and deletions), and 14,000 large deletions. Over 50% of the variants in each of the three classes are novel. What this means for you and me is that, on average, any two humans differ by about 3.6 million SNPs, over 300K indels, and about 700 large deletions.
Just to put the amount of variation discovered by the 1000 Genomes Project in perspective: the HapMap project, which started mapping human genetic variation, had discovered just 3 million human DNA variants as of 2003.
Visit http://www.1000genomes.org/ to get your feet wet with human genetic variation data.
The Human Microbiome Project
The teeny tiny microorganisms that live in the human body greatly outnumber human cells. In an attempt to catalog them, the Human Microbiome Project (HMP) collected samples from 242 healthy U.S. volunteers, sampling tissues from 15 body sites in men and 18 body sites in women.
Like the ENCODE project, the HMP is an effort of over 200 researchers from over 80 research institutions who spent five years producing the first reference catalog of microbial diversity in the human body. The resulting 5 terabytes of genomic data, covering over 5 million microbial genes, will be useful in our future metagenomics endeavors.
Visit http://hmpdacc.org/ to get your hands on the data and the tools to access the data.
De Novo Mutations in Autism
Three teams independently addressed the role of de novo mutations in autism. De novo mutations are mutations present in a child but not in either parent. The teams used whole-exome sequencing to look for de novo mutations in coding sequence that might be associated with autism. Each study sequenced a few hundred exomes (in total, over 2,000 exomes from over 500 families) and identified genes, and networks of genes, that might be related to autism. The results were published in three Nature papers in the same issue.
- Patterns and rates of exonic de novo mutations in autism spectrum disorders: 425 exomes: 175 mother-father-child trios in which the child was diagnosed as autistic.
- De novo mutations revealed by whole-exome sequencing are strongly associated with autism: whole-exome sequencing of 928 individuals (238 families), including 200 families with affected and unaffected sibling pairs.
- Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations: 677 individual exomes from 209 families with an affected child.
NHLBI Exome Sequencing Project (ESP)
The National Heart, Lung, and Blood Institute (NHLBI)’s Exome Sequencing Project (ESP), spanning multiple cohorts and sequencing centers, aimed to identify new genes behind complex heart, lung, and blood disorders by sequencing the exomes of a large number of individuals with relevant measured phenotypes.
The project has resulted in two major publications this year. An initial paper, published in Science this summer, addressed the functional impact of rare mutations in coding sequences using whole-exome sequence data from 2,440 individuals sequenced at a median depth of 111X.
The second paper, published in Nature this fall, sequenced the exomes of 6,515 individuals of European American and African American ancestry. They found that rapid human population growth has resulted in a large number of new and rare mutations. Comparing the amount of rare variation between the two groups, they found that European Americans have a higher fraction of rare coding variants than African Americans.
This is possibly the biggest study (so far) of human coding genetic variation. The project is so big that the authors threw out more than 300 exomes at the QC step. A back-of-the-envelope calculation using the sequencing statistics suggests that the project has yielded at least 50 terabytes of raw sequence data (each sample has at least 30 million read pairs of 50- or 76-base paired-end data).
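That back-of-the-envelope check can be sketched in a few lines of Python. All the constants below are illustrative assumptions (30 million read pairs per sample, the longer 76-base read length, and roughly 2 bytes of FASTQ per base for the sequence and quality lines), not figures from the ESP papers themselves:

```python
# Rough sanity check of the ">= 50 TB" estimate for 6,515 ESP exomes.
# Every constant here is an illustrative assumption, not a published figure.
samples = 6_515
read_pairs = 30_000_000            # assumed read pairs per sample
read_len = 76                      # the longer of the 50/76-base runs
bases = samples * read_pairs * 2 * read_len
fastq_bytes = bases * 2            # ~2 bytes per base: sequence + quality lines
print(f"~{bases / 1e12:.0f} terabases, ~{fastq_bytes / 1e12:.0f} TB of FASTQ")
```

With these assumptions the total comes out near 60 TB of FASTQ, comfortably above the "at least 50 terabytes" figure.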
- 6,515 Exomes: Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants
- Evolution and functional impact of rare coding variation from deep sequencing of human exomes.
Exome Sequencing Project data: It is not clear whether the raw sequence data from this project will be made public. The Exome Sequencing Project website does not say much about the data or its release policy.
Massively “bleak” future
If you think 6500 exomes is huge, wait for a year; there will be more than one project sequencing a larger number of exomes.
We already know of one “mind-spinning” effort: Daniel MacArthur from Harvard Medical School is coordinating a project that aggregates over 26,000 exomes already sequenced by projects like ESP, GoT2D, and the autism studies. [Thanks to comments from Daniel MacArthur on Twitter.]
Assuming each sample has at least 30 million read pairs of 70-base paired-end data, 26,000 exomes would amount to about 270 terabytes of raw sequence data (26,000 samples × 30 million read pairs × 2 × 0.7 × 250 MB per million reads). One can only imagine the challenges in analyzing that data.
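That arithmetic can be written out as a small helper. The 250 MB per million 100-base reads, scaled linearly by read length, is the post’s rough rule of thumb, not a measured constant, and the function name is just for illustration:

```python
def raw_data_tb(samples, read_pairs_millions, read_len_bp,
                mb_per_million_100bp_reads=250):
    """Back-of-envelope raw FASTQ volume in terabytes."""
    reads_millions = read_pairs_millions * 2              # paired-end: 2 reads/pair
    # Scale the per-million-reads size linearly with read length.
    mb_per_million = mb_per_million_100bp_reads * read_len_bp / 100
    total_mb = samples * reads_millions * mb_per_million
    return total_mb / 1_000_000                           # MB -> TB

print(raw_data_tb(26_000, 30, 70))  # 273.0, i.e. "about 270 terabytes"
```

The same helper applied to other projects’ read counts gives a quick first guess at their storage footprint before any alignment or compression enters the picture.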
Chime in to add any ongoing “massive” next-gen sequencing data project.