Computational protocol: Mining RNA–Seq Data for Infections and Contaminations

Protocol publication

[…] The original implementation of ContextMap, presented previously, focused on refining initial mappings provided by other mapping algorithms. We recently developed a standalone version that can provide these initial mappings itself. The central concept of both ContextMap versions is the so-called read context. This is defined as a set of reads originating from the same stretch of the genome, indicating that these reads were derived from the same transcript or from different transcripts of the same gene. Contexts are defined based on the initial mapping and then extended in a subsequent re-alignment step, which allows a high degree of ambiguity both between and within contexts. For this purpose, ContextMap uses a modified version of Bowtie to identify spliced read alignments by combining forward and backward alignments. For each read, ContextMap investigates not only the alignment with the minimum number of mismatches but every alignment to any context that stays within a maximum number of mismatches. The unique mapping of the read to a single context is then determined by first finding the best mapping for the read in each context and subsequently finding the best context. For this purpose, a support score is used that takes into account the number of reads mapping within and around the region to which the read is aligned. Until the final step, contexts are treated independently of each other.

As we show in this article, the advantage of this approach is that it allows investigating many alternative sources of reads in parallel, such as rRNA sequences, which are generally not included in reference genome assemblies of higher eukaryotes, as well as viral and microbial genomes. Contexts are then identified separately for each genome, including the optimal context in each genome for each read.
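The two-stage selection described above (best mapping per context, then best context) can be illustrated with a minimal Python sketch. It is not the ContextMap implementation: the `Alignment` record, the window-based `support_score`, the tie-breaking by mismatch count and the default thresholds are all assumptions made for illustration, and the actual support score is likely more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_id: str
    context_id: str  # hypothetical identifier of a read context
    position: int    # start position of the alignment
    mismatches: int


def support_score(aln, context_positions, window=50):
    """Toy support score (assumption): number of reads in the same context
    whose start lies within `window` bases of the alignment."""
    starts = context_positions.get(aln.context_id, [])
    return sum(1 for p in starts if abs(p - aln.position) <= window)


def resolve_read(candidates, context_positions, max_mismatches=2, window=50):
    """Return one alignment for a read, or None if no candidate qualifies."""
    # Keep every alignment within the mismatch limit, not only the minimum.
    allowed = [a for a in candidates if a.mismatches <= max_mismatches]
    if not allowed:
        return None

    def key(a):
        # Higher support wins; fewer mismatches breaks ties (assumption).
        return (support_score(a, context_positions, window), -a.mismatches)

    # Stage 1: best alignment within each context, contexts kept independent.
    best_per_context = {}
    for a in allowed:
        current = best_per_context.get(a.context_id)
        if current is None or key(a) > key(current):
            best_per_context[a.context_id] = a

    # Stage 2 (final step): best context across all genomes.
    return max(best_per_context.values(), key=key)


# Start positions of reads already assigned to two hypothetical contexts.
context_positions = {
    "human:ctxA": [100, 110, 120, 130],
    "virus:ctxB": [500],
}
candidates = [
    Alignment("r1", "human:ctxA", 115, 1),  # one mismatch, well supported
    Alignment("r1", "human:ctxA", 400, 0),  # perfect match, no support
    Alignment("r1", "virus:ctxB", 505, 0),  # perfect match, weak support
    Alignment("r1", "virus:ctxB", 900, 3),  # exceeds the mismatch limit
]
best = resolve_read(candidates, context_positions)
print(best)
```

In this toy example the read is assigned to the well-supported position in the human context even though two alternative alignments have fewer mismatches, which is the behaviour the support score is meant to enable.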
The final step then decides, for each read, which of the contexts in any of the genomes considered results in the best mapping.

The parallel multi-species mapping is implemented by ContextMap in the following way. First, independent Bowtie indices are created for the different potential read sources. Separate indices are necessary because Bowtie is limited to 2^32 − 1 characters per index. This is relevant as the human genome alone needs 73% of the maximum index size, and all microbial genomes from the NCBI database taken together require 134% of the maximum index size. We thus generally use one index for rRNA sequences, one for the host genome (e.g. the human reference genome), one for virus genomes and two for microbe genomes. The number of indices can easily be increased as soon as the growing number of sequenced virus and microbe genomes makes this necessary. After the initial alignment against all indices, ContextMap is run without any further changes to define contexts, then the optimal mapping for each read in each context it may belong to, and finally the optimal and unique mapping for each read to any context.

In contrast to ContextMap, other RNA-seq mapping tools, most of which also use Bowtie, cannot be used for this application for two reasons. They do not support the multiple indices required here due to the size and number of reference sequences, and they provide no way to distinguish between alternative alignments of a read to two different but related genomes with the same number of mismatches. Thus, they can only be applied sequentially, e.g. by first mapping all reads against rRNA sequences, then the unmapped reads against the host reference genome, and then against microbe or virus genomes one after the other. However, this sequential approach also poses problems: for closely related species or strains, it can lead to different results depending on the order in which the genomes are mapped. […]
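The index-size argument can be checked with a few lines of arithmetic. The sketch below assumes a per-index limit of 2^32 − 1 characters (the capacity of a 32-bit index, treated here as an assumption) and uses the 73% and 134% figures quoted in the text. The helper names `indices_needed` and `pack_genomes` are hypothetical, and the greedy grouping is only one plausible way to split genomes across indices, not necessarily the procedure used by the authors.

```python
MAX_INDEX_CHARS = 2**32 - 1  # assumed per-index character limit


def indices_needed(total_chars, limit=MAX_INDEX_CHARS):
    """Minimum number of indices for `total_chars` characters (ceiling division)."""
    return -(-total_chars // limit)


def pack_genomes(genome_sizes, limit=MAX_INDEX_CHARS):
    """Group genomes into indices with a first-fit-decreasing heuristic.

    `genome_sizes` maps a genome name to its length in characters.
    Returns a list of name lists, one per index.
    """
    bins = []  # each entry: [characters used, [genome names]]
    ordered = sorted(genome_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size > limit:
            raise ValueError(f"{name} does not fit into a single index")
        for entry in bins:
            if entry[0] + size <= limit:
                entry[0] += size
                entry[1].append(name)
                break
        else:
            # No existing index has room, so open a new one.
            bins.append([size, [name]])
    return [names for _, names in bins]


# Figures from the text, expressed as fractions of the maximum index size.
human_chars = int(0.73 * MAX_INDEX_CHARS)
microbe_chars = int(1.34 * MAX_INDEX_CHARS)
print(indices_needed(human_chars))    # host genome fits into one index
print(indices_needed(microbe_chars))  # microbial genomes need two indices

# Toy example with a small limit to show the grouping.
toy_groups = pack_genomes({"a": 6, "b": 5, "c": 4, "d": 3}, limit=10)
print(toy_groups)
```

With these numbers the host genome fits into a single index while the microbial genomes require two, matching the layout described above. Adding newly sequenced genomes only changes the input sizes, so the number of indices grows as needed.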

Pipeline specifications

Software tools: ContextMap, Bowtie
Application: RNA-seq analysis
Diseases: Infection