ISMB 2014, one of the largest bioinformatics/computational biology conferences, to be held in Boston this year from July 11 – 15, 2014, has announced the list of papers accepted for the conference proceedings. The accepted papers will be presented as talks at the conference and published as conference proceedings in a special online issue of the journal Bioinformatics. The ISMB special issue will also be open access and will be available about a month before the conference.
In addition to the traditional papers and posters, ISMB 2014 also hosts a Highlights Track, which lets recently published work and work that is not yet in print be presented as talks at ISMB. Check the abstracts selected for the Highlights Track here.
Among the many papers accepted for publication, here are a few abstracts that look interesting. One of them is RNA-Skim, a k-mer based RNA-seq quantification method inspired by Sailfish. The RNA-Skim abstract claims that it uses less than 4% of the k-mers and less than 10% of the CPU time required by Sailfish.
Check below for the abstracts of a few other interesting papers.
RNA-Skim: a rapid method for RNA-Seq quantification at transcript level
Zhaojun Zhang, UNC – Chapel Hill, United States
Wei Wang, University of California, Los Angeles, United States
Motivation: RNA-Seq technique has been demonstrated as a revolutionary means for exploring transcriptome because it provides deep coverage and base-pair level resolution. RNA-Seq quantification is proven to be an efficient alternative to Microarray technique in gene expression study, and it is a critical component in RNA-Seq differential expression analysis. Most existing RNA-Seq quantification tools require the alignments of fragments to either a genome or a transcriptome, entailing a time-consuming and intricate alignment step. In order to improve the performance of RNA-Seq quantification, an alignment-free method, Sailfish, has been recently proposed to quantify transcript abundances using all k-mers in the transcriptome, demonstrating the feasibility of designing an efficient alignment-free method for transcriptome quantification. Even though Sailfish is substantially faster than alternative alignment-dependent methods such as Cufflinks, using all k-mers in the transcriptome quantification impedes the scalability of the method.
Results: We propose a novel RNA-Seq quantification method, RNA-Skim, which partitions the transcriptome into disjoint transcript clusters based on sequence similarity and introduces the notion of sig-mers that are a special type of k-mers uniquely associated with each cluster. We demonstrate that the sig-mer counts within a cluster are sufficient for estimating transcript abundances with accuracy comparable to any state of the art method. This enables RNA-Skim to perform transcript quantification on each cluster independently, reducing a complex optimization problem into smaller optimization tasks that can be run in parallel. As a result, RNA-Skim uses less than 4% of the k-mers and less than 10% of the CPU time required by Sailfish. It is able to finish transcriptome quantification in less than 10 minutes per sample by using just a single thread on a commodity computer, which represents more than 100× speedup over the state of the art alignment-based methods, while delivering comparable or higher accuracy.
Availability: The software is available at http://www.csbio.unc.edu/rs
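To make the sig-mer idea in the RNA-Skim abstract concrete, here is a minimal Python sketch that treats a sig-mer as any k-mer found in exactly one transcript cluster. The function names, the toy sequences and the choice of k=4 are assumptions made for illustration; the abstract does not describe how RNA-Skim actually selects or stores its sig-mers, so this is not the authors' implementation.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_sig_mers(clusters, k):
    """Map each cluster id to the k-mers that occur in no other cluster.

    clusters: dict of cluster id -> list of transcript sequences.
    """
    owners = defaultdict(set)  # k-mer -> ids of the clusters containing it
    for cid, transcripts in clusters.items():
        for seq in transcripts:
            for kmer in kmers(seq, k):
                owners[kmer].add(cid)

    sig = {cid: set() for cid in clusters}
    for kmer, cids in owners.items():
        if len(cids) == 1:  # unique to a single cluster
            sig[next(iter(cids))].add(kmer)
    return sig


# Hypothetical toy clusters; real clusters would hold full transcripts.
toy_clusters = {
    "A": ["ACGTAC", "CGTACG"],
    "B": ["TTGACA", "GTACTT"],
}
sig_mers = find_sig_mers(toy_clusters, k=4)
```

In this toy input the 4-mer GTAC occurs in both clusters, so it is excluded from both sig-mer sets. Counting only the remaining k-mers in the reads is what would let each cluster be quantified on its own.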
Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation
Tarmo Äijö, Aalto University, Finland
Vincent Butty, Massachusetts Institute of Technology, United States
Zhi Chen, University of Turku, Finland
Verna Salo, University of Turku, Finland
Subhash Tripathi, University of Turku / Åbo Akademi University, Finland
Christopher Burge, Massachusetts Institute of Technology, United States
Riitta Lahesmaa, University of Turku, Finland
Harri Lähdesmäki, Aalto University, Finland
Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course have not been proposed.
Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T cell activation (Th0). To quantify Th17 specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non-parametric Gaussian process to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional time point wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.
Availability: An implementation of the proposed computational methods will be available at http://research.ics.aalto.fi/csb/software/
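The DyNB abstract combines a Gaussian process with a negative binomial likelihood for the read counts. The sketch below shows only the likelihood part, using the common mean/dispersion parameterization (variance = mean + dispersion × mean²). That parameterization, the function names and the example numbers are assumptions; the abstract does not say which form DyNB uses, and the Gaussian process and MCMC steps are left out.

```python
import math


def nb_log_pmf(y, mean, dispersion):
    """Log probability of count y under a negative binomial distribution.

    Parameterized so that variance = mean + dispersion * mean**2.
    """
    r = 1.0 / dispersion  # "size" parameter of the distribution
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mean))
            + y * math.log(mean / (r + mean)))


def nb_log_likelihood(counts, means, dispersion):
    """Sum the log probabilities of observed counts at each time point.

    means would come from the latent expression curve (for example a
    Gaussian process sample); here they are simply given.
    """
    return sum(nb_log_pmf(y, m, dispersion) for y, m in zip(counts, means))


# Hypothetical counts for one gene at four time points, scored against
# two candidate mean curves.
observed = [12, 30, 55, 80]
rising = [10.0, 30.0, 60.0, 85.0]
flat = [44.0, 44.0, 44.0, 44.0]
ll_rising = nb_log_likelihood(observed, rising, dispersion=0.1)
ll_flat = nb_log_likelihood(observed, flat, dispersion=0.1)
```

A sampler would evaluate a likelihood like this many times while proposing new latent curves, so the rising curve scoring higher than the flat one for these counts is the kind of comparison that drives the inference.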
Deep learning of the tissue-regulated splicing code
Michael Leung, University of Toronto, Canada
Hui Xiong, University of Toronto, Canada
Leo Lee, University of Toronto, Canada
Brendan Frey, University of Toronto, Canada
Motivation: Alternative splicing is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on alternative splicing.
Methods: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture utilizes hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters.
Results: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting alternative splicing patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.
Probabilistic Method for Detecting Copy Number Variation in a Fetal Genome using Maternal Plasma Sequencing
Ladislav Rampášek, University of Toronto, Canada
Aryan Arbabi, University of Toronto, Canada
Michael Brudno, University of Toronto, Canada
Motivation: The past several years have seen the development of methodologies to identify genomic variation within a fetus through the non-invasive sequencing of maternal blood plasma. These methods are based on the observation that maternal plasma contains a fraction of DNA (typically 5-15%) originating from the fetus, and such methodologies have already been used for the detection of whole-chromosome events (aneuploidies), and to a more limited extent for smaller (typically several megabases long) Copy Number Variants (CNVs).
Results: Here we present a probabilistic method for non-invasive analysis of de novo CNVs in fetal genome based on maternal plasma sequencing. Our novel method combines three types of information within a unified Hidden Markov Model: the imbalance of allelic ratios at SNP positions, the use of parental genotypes to phase nearby SNPs, and depth of coverage to better differentiate between various types of CNVs and improve precision. Our simulation results, based on in silico introduction of novel CNVs into plasma samples with 13% fetal DNA concentration, demonstrate a sensitivity of 90% for CNVs >400 kilobases (with 13 calls in an unaffected genome), and 40% for 50-400kb CNVs (with 108 calls in an unaffected genome).
Availability: Implementation of our model and data simulation method is available at http://github.com/compbio-UofT/fCNV
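The first signal named in the fetal CNV abstract is the imbalance of allelic ratios at SNP positions. The arithmetic behind that signal can be sketched as a simple mixture of maternal and fetal DNA. The function below is an illustrative assumption, not the emission model of the authors' Hidden Markov Model, and it ignores sequencing noise; the 13% fetal fraction is the one quoted for the simulations in the abstract.

```python
def expected_allele_fraction(fetal_fraction, maternal, fetal):
    """Expected fraction of plasma reads carrying allele A at one SNP.

    maternal and fetal are (copies_of_A, copies_of_B) tuples, so a normal
    heterozygous site is (1, 1) and a duplication of A is (2, 1).
    """
    m_a, m_b = maternal
    f_a, f_b = fetal
    a_reads = (1 - fetal_fraction) * m_a + fetal_fraction * f_a
    all_reads = ((1 - fetal_fraction) * (m_a + m_b)
                 + fetal_fraction * (f_a + f_b))
    return a_reads / all_reads


f = 0.13  # fetal DNA fraction used in the abstract's simulations
normal = expected_allele_fraction(f, maternal=(1, 1), fetal=(1, 1))
duplication = expected_allele_fraction(f, maternal=(1, 1), fetal=(2, 1))
deletion = expected_allele_fraction(f, maternal=(1, 1), fetal=(0, 1))
```

With a heterozygous mother, a fetal duplication of allele A shifts the expected fraction from 1/2 to (1 + f) / (2 + f), and a deletion shifts it to (1 − f) / (2 − f). The shifts are only a few percentage points at this fetal fraction, which suggests why the method also draws on phasing and depth of coverage.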
Detecting independent and recurrent copy number aberrations using interval graphs
Hsin-Ta Wu, Brown University, United States
Iman Hajirasouliha, Brown University, United States
Benjamin Raphael, Brown University, United States
Somatic copy number aberrations are frequent in cancer genomes, but many of these are random, passenger events. A common strategy to distinguish functional aberrations from passengers is to identify those aberrations that are recurrent across multiple samples. However, the extensive variability in the length and position of copy number aberrations makes the problem of identifying recurrent aberrations notoriously difficult.
We introduce a combinatorial approach to the problem of identifying independent and recurrent copy number aberrations, focusing on the key challenge of separating the overlaps in aberrations across individuals into independent events. We derive independent and recurrent copy number aberrations as maximal cliques in an interval graph constructed from overlaps between aberrations. We efficiently enumerate all such cliques, and derive a dynamic programming algorithm to find an optimal selection of non-overlapping cliques, resulting in a very fast algorithm, which we call RAIG (Recurrent Aberrations from Interval Graphs).
We show that RAIG outperforms other methods on simulated data and performs well on data from three cancer types from The Cancer Genome Atlas (TCGA). In contrast to existing approaches that employ various heuristics to select independent aberrations, RAIG optimizes a well-defined objective function. We show that this allows RAIG to identify rare aberrations that are likely functional, but are obscured by overlaps with larger passenger aberrations.
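The RAIG abstract relies on enumerating maximal cliques in an interval graph, which can be done with a single sweep over sorted interval endpoints. The sketch below shows that standard sweep on closed intervals. The function name, the input format and the toy coordinates are assumptions, and the dynamic programming step that selects non-overlapping cliques is omitted because the abstract does not state its objective function.

```python
def maximal_cliques(intervals):
    """Enumerate the maximal cliques of an interval graph with one sweep.

    intervals: list of (start, end) closed intervals, one per aberration.
    Returns a list of (region_start, region_end, member_indices), where
    the region is the segment shared by every interval in the clique.
    """
    events = []
    for idx, (start, end) in enumerate(intervals):
        events.append((start, 0, idx))  # 0 sorts starts before ends
        events.append((end, 1, idx))    # so touching intervals overlap
    events.sort()

    cliques = []
    active = set()
    grew = False       # has an interval opened since the last clique?
    last_start = None
    for pos, kind, idx in events:
        if kind == 0:
            active.add(idx)
            grew = True
            last_start = pos
        else:
            if grew:   # the active set cannot be extended any further
                cliques.append((last_start, pos, frozenset(active)))
                grew = False
            active.remove(idx)
    return cliques


# Hypothetical aberrations from five samples on one chromosome arm.
toy_aberrations = [(10, 50), (20, 60), (40, 45), (70, 90), (80, 100)]
cliques = maximal_cliques(toy_aberrations)
```

Each clique is reported just before the first of its members ends, so the shared region runs from the most recent start to that end. For the toy input this gives one region shared by the first three samples and another shared by the last two.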
Primate Transcript and Protein Expression Levels Evolve under Compensatory Selection Pressures
Zia Khan, University of Maryland, United States
Michael Ford, MS Bioworks, LLC, United States
Darren Cusanovich, University of Chicago, United States
Amy Mitrano, University of Chicago, United States
Jonathan Pritchard, Stanford University, United States
Yoav Gilad, University of Chicago, United States
Due to the technical and computational challenges of conducting comparative, genome-scale proteomics, essentially all studies of gene regulatory evolution across primates and other mammals have focused on mRNA levels rather than protein levels. Yet, proteins perform much of the work of the cell and are subject to regulation not revealed by mRNA levels alone. Using quantitative mass spectrometry and novel computational analysis methods, we obtained thousands of comparative mRNA and protein expression measurements from human, chimpanzee, and rhesus macaque lymphoblastoid cell lines. We used data from all three species to identify genes whose regulation might have evolved under natural selection, and considered jointly, our data allowed us to identify genes where lineage-specific changes might specifically affect post-transcriptional or post-translational regulation. Our analyses indicate that on an evolutionary timescale, there is surprising flexibility in primate mRNA levels, as these changes are often either buffered or compensated for at the protein level.
Privacy Preserving Protocol for Detecting Genetic Relatives Using Rare Variants
Farhad Hormozdiari, University of California, Los Angeles, United States
Jong Wha Joo, University of California, Los Angeles, United States
Feng Guan, University of California, Los Angeles, United States
Akshay Wadia, University of California, Los Angeles, United States
Rafail Ostrovsky, University of California, Los Angeles, United States
Amit Sahai, University of California, Los Angeles, United States
Eleazar Eskin, University of California, Los Angeles, United States
Motivation: High-throughput sequencing technologies have impacted many areas of genetic research. One such area is the identification of relatives from genetic data. The standard approach for the identification of genetic relatives collects the genomic data of all individuals and stores it in a database. Then, each pair of individuals is compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing your genetic data with a trusted third party to perform the relatedness test.
Results: In this work, we propose a secure protocol to detect the genetic relatives from sequencing data while not exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants, which provides the ability to detect much more distant relationships securely. We use simulated data generated from the 1000 Genomes data and illustrate that we can easily detect up to fifth degree cousins, which was not possible using the existing methods. We also show in the 1000 Genomes data with cryptic relationships that our method can detect these individuals.