HISAT: A Fast and Memory Lean RNA-seq aligner

HISAT is the new kid on the RNA-seq aligner block. There is a good chance you are thinking, "Oh no, why do we need one more aligner when there are already over 90 short read aligners?" The problem is that, out of those 90 aligners, only a handful can deal with short reads from RNA-seq. Most of the currently used RNA-seq aligners are either slow or need a lot of memory to be fast. Rather than being yet another short read aligner, HISAT addresses this problem and offers a fast RNA-seq aligner that is lean on memory.

Current aligners use a single global indexing scheme, either Burrows-Wheeler Transform (BWT) based or suffix tree based. (The advantage of BWT index based aligners is that they have low memory requirements, while suffix tree based methods can be faster but require more memory.) HISAT, in contrast to other aligners, uses two different types of indexes instead of a single index:

(i) a global FM index that represents the entire genome and (ii) numerous small FM indexes for regions that collectively cover the genome

Each small FM index represents 64,000 bp. For the human genome, HISAT creates about 48,000 local FM indexes, such that each local index overlaps its neighbor by 1,024 bp and together they cover the entire 3 billion bases of the human genome. The overlapping boundaries make it easier to align reads that would otherwise straddle the regions covered by two indexes.
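The figure of about 48,000 local indexes follows from the window size and overlap quoted above. Here is a minimal Python sketch of that arithmetic. It assumes the genome is one continuous 3 billion base sequence tiled by evenly spaced windows; the function names are made up for illustration, and the real HISAT index builder may lay out its windows differently (for example per chromosome).

```python
import math

GENOME_LENGTH = 3_000_000_000  # approximate human genome size quoted in the text
LOCAL_INDEX_SPAN = 64_000      # bases covered by each local FM index
OVERLAP = 1_024                # bases shared with the neighboring local index


def local_index_count(genome_length, span=LOCAL_INDEX_SPAN, overlap=OVERLAP):
    """Number of overlapping windows needed to tile a sequence of this length."""
    step = span - overlap  # each additional window contributes this many new bases
    if genome_length <= span:
        return 1
    return 1 + math.ceil((genome_length - span) / step)


def local_index_window(i, span=LOCAL_INDEX_SPAN, overlap=OVERLAP):
    """Half-open [start, end) interval covered by the i-th window (0-based)."""
    start = i * (span - overlap)
    return start, start + span


print(local_index_count(GENOME_LENGTH))               # close to the ~48,000 quoted above
print(local_index_window(0), local_index_window(1))   # neighbors share 1,024 bases
```

Because each window only advances by 64,000 - 1,024 = 62,976 new bases, the count comes out slightly higher than a plain 3,000,000,000 / 64,000 division would suggest.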

How do these global and local indexes help HISAT?

One of the challenges in aligning RNA-seq reads to a genome is that the aligner has to deal with short reads that span one or more exon boundaries, which makes the genome alignment “non-continuous”. HISAT uses different alignment strategies for reads that align completely within an exon and for reads that overlap one or more exon junctions. To do this, it classifies reads into five types:

  1. exonic reads – reads that align within a single exon (62% of reads).
  2. reads that span a junction such that there is at least a 15-bp alignment in both exons (25% of reads).
  3. reads that span a junction such that there is only an 8 to 15-bp alignment in one exon (5% of reads).
  4. reads that span a junction such that there is only a 1 to 7-bp alignment in one exon (4% of reads).
  5. reads that span more than two exons (3% of reads).

RNA-seq read types based on how they align (Source: HISAT preprint)
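The five read types above can be written down as a small classifier. This is only an illustrative Python sketch: the function name and labels are hypothetical, the input is assumed to be the list of read piece lengths (one piece per exon the read touches), a 15-bp piece is assumed to count as a long anchor, and any read with three or more pieces is assumed to fall into the last type.

```python
def classify_read(segment_lengths):
    """Classify a read from the lengths of its pieces, one piece per exon touched.

    Hypothetical helper for illustration; the thresholds follow the list above.
    """
    if not segment_lengths or any(n <= 0 for n in segment_lengths):
        raise ValueError("segment lengths must be positive")
    if len(segment_lengths) == 1:
        return "exonic"                         # type 1: no junction at all
    if len(segment_lengths) > 2:
        return "multi-junction"                 # type 5: three or more pieces
    shortest = min(segment_lengths)             # the smaller side of the junction
    if shortest >= 15:
        return "junction, long anchors"         # type 2 (assumes 15 bp counts as long)
    if shortest >= 8:
        return "junction, intermediate anchor"  # type 3
    return "junction, short anchor"             # type 4: 1 to 7 bp


for pieces in ([100], [60, 40], [90, 10], [97, 3], [40, 30, 30]):
    print(pieces, "->", classify_read(pieces))
```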

Most exonic reads are easy to align. However, the 12% of reads that span one or more junctions with only a small alignment in one exon are difficult for aligners with a single global FM index. The paper states that these reads

make up to 30–60% of the total run time, and many of those reads are ultimately aligned incorrectly or left unaligned.

HISAT, on the other hand, uses both its global and local indexes to handle these harder read types with different alignment strategies. The basic idea is that once the end with the longer alignment is anchored, the shorter end can easily be found using the local indexes, because it must lie nearby in the genome. HISAT can also use known splice junction information to get these alignments right.
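To make the anchor-then-extend idea concrete, here is a toy Python sketch. It is not HISAT's actual algorithm: plain `str.find` stands in for both the global and the local FM index lookups, only exact matches are handled, and the function name, the `anchor_len` parameter, and the tiny made-up genome are all assumptions for illustration.

```python
def align_spliced_read(genome, read, anchor_len, window=64_000):
    """Toy anchor-and-extend; returns (anchor_pos, remainder_pos) or None.

    str.find is a stand-in for FM index searches (assumption, exact matches only).
    """
    anchor, remainder = read[:anchor_len], read[anchor_len:]
    # Step 1: "global" search, look for the long anchor anywhere in the genome.
    anchor_pos = genome.find(anchor)
    if anchor_pos == -1:
        return None
    anchor_end = anchor_pos + anchor_len
    if not remainder:
        return anchor_pos, anchor_end
    # Step 2: "local" search, look for the short remainder only in a window
    # immediately downstream of the anchor instead of in the whole genome.
    remainder_pos = genome.find(remainder, anchor_end, anchor_end + window)
    if remainder_pos == -1:
        return None
    return anchor_pos, remainder_pos


# Made-up sequences: a decoy copy of the short piece sits far upstream.
upstream = "CTGAGGAT" + "TTTT"
exon1 = "ACGTTGCAAGGCTTAC"
intron = "GTAAGTTTTTTTTTAG"
exon2 = "CTGAGGATCC"
genome = upstream + exon1 + intron + exon2
read = exon1[-12:] + exon2[:8]  # 12 bp in exon 1, 8 bp in exon 2

print(align_spliced_read(genome, read, anchor_len=12))
print(genome.find(exon2[:8]))  # a genome-wide search hits the upstream decoy first
```

The 8-bp piece on its own is ambiguous genome-wide (the decoy matches it too), but once the 12-bp anchor is placed, restricting the search to the neighborhood of the anchor picks the copy in the next exon.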

HISAT processes over 120K reads per second with 4.3 GB (Image credit: HISAT paper)

HISAT can align 120K reads per second using 4.3GB memory

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts) is the fastest splice-aware aligner for RNA-seq data, and it has the lowest memory footprint. In the paper's benchmarks, HISAT is 50 times faster than TopHat2, 12 times faster than GSNAP, and slightly faster than STAR while using far less memory. In its fastest mode, HISAT can align about 120K reads per second using just 4.3 GB of memory, compared with the next fastest aligner, STAR, which uses 28 GB to align about 80,000 reads per second.

Here is a quick comparison of running time and memory usage for aligning 109 million reads with HISAT and other popular RNA-seq aligners (Table 2 from the paper). HISAT runs for just short of 25 minutes and requires just 4.3 GB, whereas STAR also runs fast but needs 28 GB of memory. TopHat and oLego require only about 4 GB, but need more than 950 minutes to run.

HISAT run time comparison with other RNA-seq aligners (Using data from Table 2)

HISAT memory usage comparison with other RNA-seq aligners

A few cool tidbits about HISAT aligner:

  • HISAT can work directly with SRA data from NCBI, on demand over the internet. Yes, this means there is no need to download SRA reads manually and then convert them to fasta/fastq format before running the alignment. All you need to do is run hisat -x /path/to/index --sra-acc SRRxxxxx. That is sweet, right?
  • HISAT uses Bowtie2 for the low-level implementation of its FM indexes.
  • HISAT will be the core of the next version of TopHat, aka TopHat 3.
  • Yes, HISAT can handle genomes of any size, including those larger than 4 billion base pairs.
  • HISAT is optimized for reads of length 75 to 150 bp, but it will also handle the 250- to 300-bp reads from MiSeq machines.

HISAT is available from here.

 

Interesting discussion on the performance of HISAT on twitter.

Comments

  1. Thanks for a great breakdown of the paper! I love seeing computational approaches that use biological information to become better and faster. It makes sense that to align the second half of a split read you need only search nearby in the genome, nice to see this concept integrated into this aligner.
