TopHat/RNA-seq Analysis Nostalgia

I recently started playing with TopHat again, after a long break, to quantify gene expression from RNA-seq data, and got all nostalgic about how RNA-seq analysis has changed over the years.

Although there are over 70 short read aligners to choose from, accurately aligning non-contiguous RNA-seq reads and quantifying expression is still challenging. Only a few of them (about 10), such as TopHat, STAR, GSNAP, RUM, and MapSplice, are useful for RNA-seq analysis.

The evolution of the RNA-seq analysis space is interesting. Almost everyone takes the TopHat/Cufflinks route for expression quantification and then complains about how bad it is. In the last few years, TopHat has faced a (very) little heat from aligners like STAR, GSNAP, and RUM, thanks to early adopters. The lightning-fast STAR seemed to rise up the ranks quickly with the support of the large-scale ENCODE project. For regular RNA-seq users, though, TopHat has always had the monopoly. The 2009 TopHat paper has over 400 citations in PubMed and over 1100 on Google Scholar.

Whether you hate it or love it, TopHat has been pretty quick on its feet to adopt new approaches and come out with an updated version, be it by adding a suite of tools like Cufflinks and Cuffdiff or by adopting transcriptome-plus-genome alignment. TopHat 2 has already gone through eight updates. It has been around for a little more than a year, and it first came out adding support for the Bowtie 2 aligner, which can handle short insertions and deletions. Just recently, the TopHat team published a paper on TopHat 2 in Genome Biology describing new improvements.

If you have been working with RNA-seq data for a “while”, you may remember these annoying/interesting tidbits about early TopHat:

  • TopHat used Bowtie and Maq for aligning reads. Yes, you needed two aligners (well, read lengths were in the 20-30 base range).
  • TopHat could analyze paired-end RNA-seq data, but it did not use the mate information.
  • TopHat wanted sequence data in a single file
  • TopHat could not deal with multireads
  • TopHat itself reported expression quantification, and there was no Cufflinks
  • TopHat alignments were not output in SAM format
  • and possibly many more…

No doubt every other NGS analysis pipeline has also changed a lot since it first appeared. It would be interesting to see the early “annoying features” of other NGS analysis tools.

