Evaluate the quality of your read alignment with GeneQC

RNA sequencing has replaced gene arrays and is now the leading technology for gene expression analysis. After sequencing, reads must be mapped to a reference genome. This mapping step is imperfect, however, and its errors can affect all downstream analyses.
To address this issue, Adam McDermaid and colleagues have developed GeneQC, a tool to evaluate the quality of read alignment. Here, he describes GeneQC and its features.

Quality control of read alignment

One of the main benefits of modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation compared with previous generations of expression data, such as microarrays. However, numerous issues can cause an RNA-Seq read to map to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). MMRs affect gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment.
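
To make the definition concrete, the following minimal sketch flags reads whose best alignment score is tied across more than one target. The records, the function name `find_multi_mapped`, and the use of gene names as stand-ins for genomic locations are all illustrative assumptions; this is not GeneQC code and does not parse a real alignment file.

```python
from collections import defaultdict

# Hypothetical alignment records: (read_id, gene, alignment_score).
# Genes stand in for genomic locations to keep the example small.
alignments = [
    ("r1", "geneA", 60),
    ("r2", "geneA", 60),
    ("r2", "geneB", 60),  # r2 ties between geneA and geneB
    ("r3", "geneB", 58),
    ("r3", "geneC", 42),  # r3 has a unique best hit on geneB
    ("r4", "geneA", 55),
    ("r4", "geneC", 55),  # r4 ties between geneA and geneC
]


def find_multi_mapped(records):
    """Return {read_id: sorted genes tied at that read's best score}."""
    by_read = defaultdict(list)
    for read_id, gene, score in records:
        by_read[read_id].append((gene, score))

    mmrs = {}
    for read_id, hits in by_read.items():
        best = max(score for _, score in hits)
        tied = sorted({gene for gene, score in hits if score == best})
        if len(tied) > 1:  # same top score at more than one target
            mmrs[read_id] = tied
    return mmrs


mmrs = find_multi_mapped(alignments)
total_reads = len({read_id for read_id, _, _ in alignments})
mmr_fraction = len(mmrs) / total_reads

print(mmrs)
print(f"{mmr_fraction:.0%} of reads are multi-mapped")
```

In this toy input, two of the four reads tie at their best score, so any expression estimate for geneA, geneB, or geneC depends on how those ties are resolved.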

Current analysis pipelines lack the tools to effectively test the reliability of gene expression estimates and therefore cannot ensure the validity of downstream analyses.

GeneQC is a computational tool that evaluates the quality of read alignment through feature extraction and statistical modeling. It calculates a D-score for each annotated gene, which gives a distinct measure of the mapping uncertainty. To demonstrate the application of GeneQC, we ran the tool on RNA-Seq datasets from seven plant and animal species and found that all seven have extensive mapping uncertainty issues. The extracted features indicated that the animal samples had relatively lower percentages of genes with mapping uncertainty. However, the animal genes that did show mapping uncertainty were much more likely to fall in the high-uncertainty category. In the five analyzed plant species, roughly 20% of genes showed mapping uncertainty, but uncertainty levels were more evenly distributed across the Low, Moderate, and High categories than in the animal species.

GeneQC Features

The GeneQC algorithm has two main parts. The first is feature extraction, in which genomic, transcriptomic, and network-level information is collected for each gene from the read alignment and reference genome. The genomic feature is the sequence similarity between two genes, the transcriptomic feature is the proportion of shared multi-mapped reads, and the network feature is the number of gene pair interactions. The second is statistical modeling: these three levels of information are used with elastic-net regularization methods to calculate a set of D-scores for each sample. These D-scores represent the level of mapping uncertainty for each gene. To provide a qualitative representation of the mapping uncertainty, extensive mixture model fitting is used to sort the D-scores into three categories of mapping uncertainty. The certainty of each categorization is reported as an alternative likelihood, which is the maximum probability of that D-score belonging to any other category. The three extracted features, the D-score, the mapping uncertainty categorization, and the alternative likelihood value are provided in the output file so that users can evaluate the quality of their read alignment.
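
The second part can be illustrated with a small sketch, under several stated assumptions. The feature values, the weights in `WEIGHTS`, and the mixture parameters in `MIXTURE` are invented for illustration: GeneQC derives its coefficients with elastic-net regularization and fits its mixture models to the data, and neither step is reproduced here. The sketch also assumes a normal mixture and reads "alternative likelihood" as the largest posterior probability among the categories that were not chosen.

```python
import math

# Hypothetical, already-normalized features per gene (all in [0, 1]):
# (sequence similarity, proportion of shared MMRs, degree weight).
features = {
    "geneA": (0.10, 0.05, 0.10),
    "geneB": (0.55, 0.40, 0.50),
    "geneC": (0.95, 0.90, 0.80),
}

# Assumed weights. GeneQC obtains its coefficients via elastic-net
# regularization; these numbers are placeholders, not fitted values.
WEIGHTS = (0.4, 0.4, 0.2)

# Assumed pre-fitted normal mixture over D-scores:
# category -> (mixing proportion, mean, standard deviation).
MIXTURE = {
    "Low": (0.5, 0.10, 0.08),
    "Moderate": (0.3, 0.50, 0.10),
    "High": (0.2, 0.90, 0.08),
}


def d_score(feats, weights=WEIGHTS):
    """Weighted sum of the three features (a stand-in for the D-score)."""
    return sum(w * f for w, f in zip(weights, feats))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def categorize(score, mixture=MIXTURE):
    """Return (most probable category, best posterior among the others)."""
    weighted = {
        name: prop * normal_pdf(score, mean, sd)
        for name, (prop, mean, sd) in mixture.items()
    }
    total = sum(weighted.values())
    posterior = {name: value / total for name, value in weighted.items()}
    category = max(posterior, key=posterior.get)
    alternative = max(p for name, p in posterior.items() if name != category)
    return category, alternative


results = {}
for gene, feats in features.items():
    score = d_score(feats)
    category, alternative = categorize(score)
    results[gene] = (score, category, alternative)
    print(f"{gene}\tD={score:.3f}\t{category}\talt={alternative:.3g}")
```

A small alternative value means the D-score sits well inside its category, while a value approaching 0.5 would indicate a gene near the boundary between two categories.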

Fig 1. Mapping Uncertainty and GeneQC. (A) The MMR percentages for the 95 datasets across seven species. More detailed information is showcased in Table 1; (B) GeneQC takes a read alignment, reference genome, and annotation file as inputs; (C) The first step of GeneQC is to extract features related to mapping uncertainty for each annotated gene; (D) Using the extracted features, elastic-net regularization is used to calculate the D-score, which represents the mapping uncertainty for each gene; (E) A series of Mixture Normal and Mixture Gamma distributions are fit to the D-scores; and (F) The mixture models are used to categorize the D-scores into different levels of mapping uncertainty along with a statistical alternative likelihood value for each gene.

Fig 2. GeneQC application. The results related to the analysis of seven datasets representing five plant and two animal species. (A) Categorizations for the level of mapping uncertainty per gene are shown relative to all categorizations. (B) Boxplots for the three extracted features of each gene are shown for each analyzed sample. D1, D2, and D3 represent the sequence similarity, proportion of shared MMR, and degree weight, respectively. Each value is shown normalized between 0 and 1. Only genes with mapping uncertainty are displayed. (C) Derived D-scores for each gene are shown by species, as calculated from the three features in (B). Higher D-scores represent higher levels of mapping uncertainty.

GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html

Reference

Adam McDermaid et al. (2018). GeneQC: A quality control tool for gene expression estimation based on RNA-sequencing reads mapping. bioRxiv preprint.