The 1000 Genomes Project has announced that it is releasing initial data from its Phase 3 analysis. The 1000 Genomes Project is the first major effort to catalog genetic variation across human populations by sequencing. It has been divided into multiple phases because of the challenges in sample collection and data generation.
The initial phase of the 1000 Genomes Project was called the pilot project. Phase 1 focused on low-coverage and exome data analysis of 1092 samples. Phase 2 used an expanded sample set of around 1700 individuals.
The Phase 3 analysis uses only Illumina platform data with reads of 70 bp or longer, and uses the new methods developed in Phase 2 to call complex variants. The initial Phase 3 data release contains over 79 million variant sites, including SNPs, indels, deletions, complex short substitutions and other structural variant classes. The initial call set from the 1000 Genomes Project Phase 3 analysis can be obtained from the FTP site in the directory release/20130502/.
Here are some basic statistics about the sites, as reported by vcflib:
- total variant sites: 79449759
- of which 79174635 (0.996537) are biallelic and 275124 (0.00346287) are multiallelic
- total variant alleles: 79729551
- unique variant alleles: 79770999
- snps: 77520219
- mnps: 0
- indels: 2250780
- complex: 46530
- mismatches: 77520219
- ts/tv ratio: 2.09751
- deamination ratio: 1.44763
- biallelic snps: 77011302 @ 2.1149
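As a quick sanity check, the fractions in the summary above can be recomputed from the raw counts. The following is a minimal Python sketch, not part of vcflib; the variable names are made up for illustration, and the counts are copied from the list. The comparison of SNPs plus MNPs plus indels against the unique variant allele count is only an observation about these particular numbers, not a documented vcflib rule.

```python
# Counts copied from the vcflib summary above (variable names are illustrative).
total_sites = 79449759
biallelic = 79174635
multiallelic = 275124
snps = 77520219
mnps = 0
indels = 2250780
unique_alleles = 79770999

# Share of sites with exactly one alternate allele vs. more than one.
biallelic_fraction = biallelic / total_sites
multiallelic_fraction = multiallelic / total_sites

# Assumption being tested: the per-type allele counts sum to the
# "unique variant alleles" figure. This is an observation, not a vcflib guarantee.
alleles_by_type = snps + mnps + indels

print(f"biallelic fraction:    {biallelic_fraction:.6g}")
print(f"multiallelic fraction: {multiallelic_fraction:.6g}")
print(f"sites add up:          {biallelic + multiallelic == total_sites}")
print(f"types match unique:    {alleles_by_type == unique_alleles}")
```

Run as-is, this reproduces the two fractions in the list to the precision shown there, and confirms that the biallelic and multiallelic counts sum to the total number of variant sites.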