1000 Genomes Project Releases Phase 3 Initial Variant Data

The 1000 Genomes Project announced that it is releasing initial data from its Phase 3 analysis. The 1000 Genomes Project is the first major effort to catalog genetic variation across human populations by sequencing. It has been divided into multiple phases because of the challenges in sample collection and data generation.

The initial phase of the 1000 Genomes Project was called the pilot project. Phase 1 focused on low-coverage and exome data analysis of 1092 samples. Phase 2 used an expanded set of around 1700 individuals.

Phase 3 of the 1000 Genomes Project contains 2535 individuals from 26 different populations around the world. New samples include individuals from Africa and South Asia. Each of the 26 populations has about 60-100 individuals.


1000 Genomes Project Phase 3 Populations

Phase 3 analysis uses only Illumina platform data with reads of 70 bp or longer and uses the new methods developed in Phase 2 to call complex variants. The initial Phase 3 data release contains over 79 million variant sites, including SNPs, indels, deletions, complex short substitutions and other structural variant classes. The initial call set from the 1000 Genomes Project Phase 3 analysis can be obtained from the FTP site in the directory release/20130502/.
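To make the variant classes above more concrete, here is a minimal Python sketch that tallies sites and allele types from VCF-style lines. It is a simplified heuristic written for illustration, not the method used by the 1000 Genomes Project or by vcflib. The function names classify_allele and summarize_sites, the class labels, and the example records are all hypothetical.

```python
from collections import Counter
from typing import Iterable


def classify_allele(ref: str, alt: str) -> str:
    """Roughly classify one ALT allele relative to REF (simplified heuristic)."""
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"       # e.g. <DEL> or <CN0>, used for structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "snp"            # single-base substitution
    if len(ref) == len(alt):
        return "substitution"   # same-length, multi-base change
    if ref.startswith(alt) or alt.startswith(ref):
        return "indel"          # pure deletion or pure insertion
    return "complex"            # length change combined with a base change


def summarize_sites(lines: Iterable[str]) -> Counter:
    """Count sites, biallelic/multiallelic sites, and allele classes."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue            # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        counts["sites"] += 1
        counts["biallelic" if len(alts) == 1 else "multiallelic"] += 1
        for alt in alts:
            counts[classify_allele(ref, alt)] += 1
    return counts


# Hypothetical records, made up for this example (not taken from the release).
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tAT\tA\t.\tPASS\t.",
    "1\t300\t.\tC\tT,CA\t.\tPASS\t.",
    "1\t400\t.\tG\t<CN0>\t.\tPASS\t.",
]
counts = summarize_sites(example)
print(dict(counts))
```

For a real sites file the same function could be fed the lines of a decompressed VCF, although a dedicated tool such as vcflib will be much faster on tens of millions of sites and may draw the class boundaries differently.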

Here are some basic statistics about the sites, computed with vcflib:

  • total variant sites: 79449759
  • of which 79174635 (0.996537) are biallelic and 275124 (0.00346287) are multiallelic
  • total variant alleles: 79729551
  • unique variant alleles: 79770999
  • snps: 77520219
  • mnps: 0
  • indels: 2250780
  • complex: 46530
  • mismatches: 77520219
  • ts/tv ratio: 2.09751
  • deamination ratio: 1.44763
  • biallelic snps: 77011302 @ 2.1149
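As a quick sanity check, the reported counts can be re-derived with a few lines of Python. The numbers are copied from the list above and the variable names are made up for this sketch. The observation that SNPs plus indels equals the unique variant allele count follows from the arithmetic here and is not a statement about how vcflib defines that field.

```python
# Counts copied from the statistics listed above.
total_sites = 79449759
biallelic = 79174635
multiallelic = 275124
snps = 77520219
indels = 2250780
unique_variant_alleles = 79770999

# Biallelic and multiallelic sites should partition the full set of sites.
sites_add_up = (biallelic + multiallelic == total_sites)

# Fractions of sites in each group.
biallelic_fraction = biallelic / total_sites
multiallelic_fraction = multiallelic / total_sites

# In these numbers, SNP and indel alleles sum to the unique variant allele count.
alleles_add_up = (snps + indels == unique_variant_alleles)

print(f"sites add up: {sites_add_up}")
print(f"biallelic fraction: {biallelic_fraction:.6f}")
print(f"multiallelic fraction: {multiallelic_fraction:.6g}")
print(f"snps + indels == unique variant alleles: {alleles_add_up}")
```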
