Evaluate the quality of your read alignment with GeneQC
RNA sequencing has replaced gene arrays and is now the leading technology in gene expression analysis. After sequencing, reads must be mapped to a reference genome. However, this step is imperfect, and mapping errors can affect all downstream analyses.
To address this issue, Adam McDermaid and colleagues have developed GeneQC, a tool to evaluate the quality of read alignment. Here, he describes GeneQC and its features.
Quality control of read alignment
One of the main benefits of modern RNA-sequencing (RNA-Seq) technology is more accurate gene expression estimation than previous generations of expression data, such as microarrays, could provide. However, for a number of reasons, an RNA-Seq read can be mapped to multiple locations on the reference genome with the same alignment score, a problem that occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). MMRs affect gene expression estimation and, through it, all downstream analyses, including differential gene expression and functional enrichment.
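To make the definition of an MMR concrete, here is a minimal sketch that flags reads whose best alignment score is tied across more than one location. The input format (tuples of read ID, location, and alignment score) and the function name `find_multi_mapped_reads` are assumptions made for illustration; they are not part of GeneQC, and a real pipeline would read this information from the aligner's output.

```python
from collections import defaultdict


def find_multi_mapped_reads(alignments):
    """Return {read_id: [locations]} for reads whose top score is tied.

    `alignments` is an iterable of (read_id, location, score) tuples.
    This layout is a hypothetical simplification of aligner output.
    """
    hits_by_read = defaultdict(list)
    for read_id, location, score in alignments:
        hits_by_read[read_id].append((location, score))

    multi_mapped = {}
    for read_id, hits in hits_by_read.items():
        best_score = max(score for _, score in hits)
        # Keep only the distinct locations that share the best score.
        best_locations = sorted({loc for loc, score in hits if score == best_score})
        if len(best_locations) > 1:
            multi_mapped[read_id] = best_locations
    return multi_mapped


# Toy data: read1 ties between two genes, read2 maps once,
# read3 has two hits but one clearly better score.
example_alignments = [
    ("read1", "geneA", 60),
    ("read1", "geneB", 60),
    ("read2", "geneA", 60),
    ("read3", "geneB", 55),
    ("read3", "geneC", 40),
]

mmrs = find_multi_mapped_reads(example_alignments)
print(mmrs)
```

In this toy data only read1 qualifies as an MMR, because read3 has a single best-scoring location even though it aligns to two genes.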
Current analysis pipelines lack tools to effectively test the reliability of gene expression estimates and are therefore unable to ensure the validity of downstream analyses.
GeneQC is a computational tool that evaluates the quality of read alignment through feature extraction and statistical modeling. It calculates a D-score for each annotated gene, which gives a distinct measure of the mapping uncertainty. To demonstrate the application of GeneQC, we ran the tool on RNA-Seq datasets from seven plant and animal species and found that all seven have extensive mapping uncertainty issues. Extracted features indicated that the animal samples had comparatively lower percentages of genes with mapping uncertainty. However, the animal genes that were affected were much more likely to show high mapping uncertainty. Across the five plant species analyzed, roughly 20% of genes showed mapping uncertainty, although uncertainty levels were more evenly distributed across the Low, Moderate, and High categories than in the animal species.
The GeneQC algorithm has two main parts. The first is feature extraction, in which genomic, transcriptomic, and network-level information is collected for each gene from the read alignment and reference genome. Genomic feature information is collected in the form of sequence similarity between two genes, transcriptomic feature information is derived from the proportion of shared multi-mapped reads, and network information is collected from the number of gene pair interactions. The second part is statistical modeling: these three levels of information are used with elastic-net regularization methods to calculate a set of D-scores for each sample. The D-scores represent the level of mapping uncertainty for each gene. To provide a qualitative representation of the mapping uncertainty, extensive mixture model fitting is used to assign each D-score to one of three categories of mapping uncertainty. Confidence in that categorization is conveyed through an alternative likelihood, which represents the maximum probability of the D-score belonging to any other category. The three extracted features, D-score, mapping uncertainty categorization, and alternative likelihood value are provided in the output file for users to evaluate the quality of their read alignment.
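The categorization and alternative likelihood described above can be illustrated with a short sketch. It assumes a three-component Gaussian mixture whose parameters are already fitted; the component family, the parameter values, and the function names are all hypothetical choices for illustration and are not taken from GeneQC, whose actual mixture model may differ.

```python
import math

CATEGORIES = ("Low", "Moderate", "High")


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def categorize_d_score(d_score, components):
    """Return (category, alternative_likelihood) for one D-score.

    `components` holds one (weight, mean, std) tuple per category, in the
    order of CATEGORIES. Gaussian components are an assumption here.
    """
    weighted = [w * normal_pdf(d_score, m, s) for w, m, s in components]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("D-score is too far from every component to classify")

    # Posterior probability of each category given the D-score.
    posteriors = [value / total for value in weighted]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)

    # Alternative likelihood: the largest posterior among the other categories.
    alternative = max(p for i, p in enumerate(posteriors) if i != best)
    return CATEGORIES[best], alternative


# Hypothetical fitted parameters, invented for this example only.
example_components = [
    (0.6, 0.10, 0.05),  # Low
    (0.3, 0.40, 0.10),  # Moderate
    (0.1, 0.80, 0.10),  # High
]

category, alternative = categorize_d_score(0.12, example_components)
print(category, round(alternative, 4))
```

Under this reading, a small alternative likelihood means the assigned category is well separated from the others, while a value approaching 0.5 signals a D-score that sits near a category boundary.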
Adam McDermaid et al. (2018). GeneQC: A quality control tool for gene expression estimation based on RNA-sequencing reads mapping. bioRxiv preprint.