HISAT2 is out. It is the next iteration of HISAT, the splice-aware, low-memory alignment program by Daehwan Kim from Steven Salzberg’s team. Actually, the beta version of HISAT2 has been available since September, but I somehow missed it until HISAT2 was presented as a talk at #GI2015.
HISAT2 is not just the next iteration of HISAT; it can do much more. HISAT2 is a fast and sensitive aligner not only for aligning to a single reference genome, but also for aligning against a population of human genomes.
Till now we have been using a linear (haploid) model for the reference genome. As we sequence more individuals and characterize genetic variations at population scale, the linear reference genome model shows a number of limitations. The Genome Reference Consortium has been painstakingly characterizing difficult and alternate regions. However, the linear genome model makes it difficult to fully utilize the characterized genetic variations. In addition, reads containing alternate alleles are known to suffer from reference genome bias.
The field is moving towards modeling the reference genome as a graph, where sequence variations embedded in the linear genome sequence yield a reference graph genome. A reference graph genome that accounts for genetic variations can offer much greater alignment accuracy for reads containing SNPs and indels.
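To make the idea of a graph genome concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions (SNPs only, one node per base, exact matching by brute-force path walking) and says nothing about how HISAT2 or any other tool actually stores or searches the graph. The function names `build_variant_graph` and `read_matches` are made up for this example.

```python
def build_variant_graph(reference, snps):
    """Build a toy graph genome.

    reference: linear reference sequence (string).
    snps: dict mapping 0-based position -> alternate base (hypothetical input).
    Returns (nodes, edges): node id -> base, node id -> set of successor ids.
    """
    nodes = {}
    edges = {}
    for i, base in enumerate(reference):
        nodes[("ref", i)] = base
    for pos, alt in snps.items():
        nodes[("alt", pos)] = alt  # alternate allele sits beside the reference base

    def nodes_at(i):
        ids = [("ref", i)]
        if i in snps:
            ids.append(("alt", i))
        return ids

    # Connect every allele at position i to every allele at position i + 1.
    for i in range(len(reference) - 1):
        for u in nodes_at(i):
            edges.setdefault(u, set()).update(nodes_at(i + 1))
    return nodes, edges


def read_matches(nodes, edges, read):
    """Return True if the read spells out an exact path somewhere in the graph."""
    if not read:
        return False

    def extend(node, offset):
        if nodes[node] != read[offset]:
            return False
        if offset == len(read) - 1:
            return True
        return any(extend(nxt, offset + 1) for nxt in edges.get(node, ()))

    return any(extend(start, 0) for start in nodes)


nodes, edges = build_variant_graph("ACGTAC", {2: "T"})
print(read_matches(nodes, edges, "ACGTA"))  # read carrying the reference allele
print(read_matches(nodes, edges, "ACTTA"))  # read carrying the alternate allele
```

A read carrying the alternate allele matches the graph exactly, whereas against the linear reference alone it would need a mismatch. That mismatch penalty is the source of the reference bias mentioned above.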
HISAT2 uses a “hierarchical graph FM index” (HGFM) to align to a “graph genome”. The HISAT2 website says that it has built a graph FM index (GFM) using 11 million single nucleotide polymorphisms, 728,000 deletions, and 555,000 insertions. Even with over 12 million variants (SNPs plus indels), HISAT2’s index size is just about 6.2 GB.
The HISAT2 website says that it comes with four different indices:
- Hierarchical FM index (HFM) for a reference genome (index base: genome)
- Hierarchical Graph FM index (HGFM) for a reference genome plus SNPs (index base: genome_snp)
- Hierarchical Graph FM index (HGFM) for a reference genome plus transcripts (index base: genome_tran)
- Hierarchical Graph FM index (HGFM) for a reference genome plus SNPs and transcripts (index base: genome_snp_tran)
Although the index name mentions “snp”, the index is made from both SNPs and indels. The HISAT2 page gives basic information about the aligner and offers these indices for human and mouse for download. It is not clear how, or whether, one can create customized indices using SNP/indel data. The HISAT2 page still seems to be a work in progress (the FAQ is empty as of now), but it will be interesting to see once a preprint/paper associated with HISAT2 and the full website are out.
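As a rough sketch of how one of the downloaded indices might be used, the Python snippet below assembles a `hisat2` command line for single-end reads. The `-x` (index base), `-U` (unpaired reads), `-S` (SAM output) and `-p` (threads) options are standard `hisat2` options; the file names `reads.fq` and `out.sam`, the helper name `hisat2_command`, and the assumption that the index files sit in the working directory under the index base names listed above are all hypothetical.

```python
import shlex

# Index base names as listed on the HISAT2 page.
INDEX_BASES = ("genome", "genome_snp", "genome_tran", "genome_snp_tran")


def hisat2_command(index_base, reads_fastq, out_sam, threads=1):
    """Return the argv list for a single-end hisat2 run (nothing is executed)."""
    if index_base not in INDEX_BASES:
        raise ValueError(f"unknown index base: {index_base}")
    return [
        "hisat2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_base,     # index base name, without file extension
        "-U", reads_fastq,    # unpaired reads
        "-S", out_sam,        # SAM output file
    ]


cmd = hisat2_command("genome_snp", "reads.fq", "out.sam", threads=4)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once HISAT2 is installed and the index has been downloaded.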
Check the HISAT2 slides for information on how it works.
Aligning to Graph Genome
HISAT2 is not the only tool available for aligning to a reference graph genome. Richard Durbin’s group at Sanger has been developing methods and tools that use a reference graph genome. The unpublished variant graph tool “vg” has been under public development by Erik Garrison for a while and is available on GitHub.
Gil McVean’s group at Oxford has also been developing tools for graph genomes. A few papers from McVean’s group:
- De novo assembly and genotyping of variants using colored de Bruijn graphs
- Improved genome inference in the MHC using a population reference graph
David Haussler’s group has also been working on theoretical/computational aspects of modeling the genome as a graph.
On the industry side, Seven Bridges Genomics has developed a tool for aligning reads to a graph genome.
And the recent RNA-seq tool Kallisto, developed by Lior Pachter’s group, uses colored de Bruijn graphs of the target sequences (transcriptome) under the hood and performs graph (pseudo-)alignment.
Please check the following blog posts/comments, which cover interesting issues around representing the reference genome as a graph and point to other relevant publications.
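The idea behind pseudoalignment can be sketched in a few lines of Python: instead of computing a base-by-base alignment, find the set of transcripts compatible with all k-mers of a read. This is a deliberately simplified illustration using a plain k-mer hash table, not Kallisto's actual implementation (which works on the transcriptome de Bruijn graph and has optimizations not shown here); the function names and the toy transcripts are made up.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect the compatibility sets of successive k-mers.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    # A read shorter than k has no k-mers, so nothing is compatible.
    return compatible if compatible is not None else set()


# Toy transcriptome: two hypothetical transcripts sharing a prefix.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACTTT"}
index = build_kmer_index(transcripts, k=4)
print(pseudoalign("ACGTAC", index, k=4))  # read from the shared prefix
print(pseudoalign("GTACGG", index, k=4))  # read spanning the t1-specific part
```

A read from the shared region stays compatible with both transcripts, while a read that reaches into the distinguishing sequence narrows the set down to one.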
- On the graphical representation of sequences
- Graphical Fragment Assembly (GFA) Format Specification
- Graph alignment and variant calling
- On graph-based representations of a (set of) genomes
- Extending reference assembly models
Please feel free to add any relevant article that is missing here.