Computational protocol: RNA-Seq Count Data Modelling by Grey Relational Analysis and Nonparametric Gaussian Process

Protocol publication

[…] The proposed methodology for analysis of RNA-seq read counts is graphically presented in . One of the basic tasks in the analysis of RNA-seq count data is the detection of differentially expressed genes []. In this paper, the RNA-seq read counts are first transformed using the voom method []. The transformation alleviates the skewness, mean-variance dependency, and extreme values typical of RNA-seq data. After the transformation, RNA-seq data can be treated as if they were microarray data, which means that any normal-based method or gene set testing procedure can be applied. We then design a GRA-based aggregate feature selection method that combines the outcomes of five individual methods, namely the two-sample t-test, the entropy test (also known as the Kullback-Leibler distance or divergence) [], the Bhattacharyya distance [], the Wilcoxon test [] and the receiver operating characteristic (ROC) curve [], to select significant genes as biomarkers. The GRA-based method acts as a filter approach under the assumption that the features are independent. This assumption is often made for high-dimensional, low-sample-size data because too few observations are available to estimate the dependence structure among the features effectively [, –].

Once discriminant feature subsets have been selected, they serve as inputs to the GP models for classification. GP inference is fast and computationally tractable because it rests on analytic formulae. A GP is completely characterized by its mean and covariance functions, yet it is not limited to a parametric form. Because GP is a nonparametric method, the number and nature of its parameters are flexible: they are not fixed in advance but determined from the data. The uncertainty and complexity of RNA-seq data can therefore be addressed effectively by GP models. Under the GP viewpoint, the models are transparent and hence amenable to interpretation, in contrast to black-box methods such as neural networks [].
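As a rough illustration of the aggregation step, the grey relational grade can be computed from a genes × methods score matrix as below. This is a minimal Python sketch, not the authors' implementation: it assumes each of the five methods yields a per-gene score where larger means more discriminative (and non-constant per column), and uses the common distinguishing coefficient ζ = 0.5, which is an assumed default rather than a value taken from the protocol.

```python
import numpy as np

def grey_relational_grade(scores, zeta=0.5):
    """Aggregate a (genes x methods) score matrix by grey relational analysis.

    scores: larger value = more discriminative under that method.
    Returns one grade per gene; a higher grade marks a stronger biomarker
    candidate.
    """
    s = np.asarray(scores, dtype=float)
    # Normalise each method's scores to [0, 1] ("larger is better").
    s = (s - s.min(axis=0)) / (s.max(axis=0) - s.min(axis=0))
    # Reference sequence: the ideal gene that scores 1 under every method.
    delta = np.abs(1.0 - s)                  # deviation from the reference
    d_min, d_max = delta.min(), delta.max()
    # Grey relational coefficient, then grade = mean over methods.
    coeff = (d_min + zeta * d_max) / (delta + zeta * d_max)
    return coeff.mean(axis=1)

# Toy example: 4 genes scored by 3 hypothetical methods.
scores = np.array([[0.9, 0.8, 0.95],
                   [0.2, 0.1, 0.30],
                   [0.5, 0.6, 0.40],
                   [0.1, 0.2, 0.05]])
grades = grey_relational_grade(scores)
top = np.argsort(grades)[::-1]   # gene 0 dominates every method, so it ranks first
```

Genes are then ranked by grade and the top-ranked subset is passed on to the classifier.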
The generalization capability of GP, grounded in the Bayesian formalism, can yield high classification performance for RNA-seq data modelling. Details of the voom transform, the GRA-based feature selection method and the GP models are presented sequentially in the following subsections. [...] Two benchmark real datasets are utilized in this study for comparisons. We name the first dataset “Mont-Pick” as it is obtained by combining two studies, Montgomery et al. [] and Pickrell et al. []. This dataset is available through the ReCount RNA-seq database developed by Frazee et al. []. The data can be used to analyze differential expression between two ethnicities: the Montgomery group sequenced Utah residents with ancestry from northern and western Europe (the HapMap CEU population), and the Pickrell group sequenced Yoruba residents in Ibadan, Nigeria (the HapMap YRI population). These two groups are treated as two separate classes in this study. There are 60 samples from the CEU group and 69 samples from the YRI group. A total of 52,580 genes are processed, from which genes with zero counts in all samples are removed; the number of nonzero-count genes in the Mont-Pick dataset is 12,984.

The second dataset, “cervical cancer”, is available from the Gene Expression Omnibus [] under accession number GSE20592 and was utilized in [, ]. The data comprise 29 tumor and 29 non-tumor cervical tissue samples measured on 714 microRNAs, which are small RNAs 18–30 nucleotides in length. The classification task is to distinguish between tumor and non-tumor samples. Details of the experimental datasets are presented in .

In the Mont-Pick dataset, the CEU and YRI samples were sequenced by different groups, potentially using different facilities. The batch effect is therefore a factor that could affect performance comparisons of RNA-seq data analysis approaches.
To deal with this issue, we have included a design option that accounts for the batch effect in the voom transformation, as implemented in the limma package []. Figs and show pseudocolor heat maps of the expression levels before and after voom transformation for the Mont-Pick and cervical cancer datasets respectively. The x-axis represents genes or genomic features of interest whilst the y-axis represents data samples of different groups (classes). Both datasets used in this study have two classes of samples, and the horizontal red line in every heat map divides the samples into the two classes. In each heat map, the corresponding color bar representing expression levels is plotted adjacent to the color map. White represents midpoints, warm colors represent high expression levels, and cool colors indicate low expression or sparse regions.

Before the voom transform, the white region lies at the bottom of the color bars in both datasets (see Figs and ). This shows that read counts follow a positively skewed distribution, which would hinder the application of normal-based methods. Moreover, the color maps of the data before voom are almost entirely blue, with a very small proportion of warm spots representing extreme values or outliers. The range of expression levels before voom transformation is extremely large: from 0 to 91,991 in the Mont-Pick dataset and from 0 to 476,438 in the cervical cancer dataset. The large blue areas in the color maps represent sparse data, especially in . In contrast, the data after voom transformation are continuously distributed, with the white region located near the middle of the color bars. In addition, the data after the voom transform are less sparse than the original read counts, as the heat maps show a more diverse combination of cool and warm colors. This demonstrates that the transformed data practically follow a normal distribution and can be processed by normal-based statistical methods. [...]
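The preprocessing described above, dropping all-zero genes and replacing skewed raw counts with voom's log2-CPM values, can be sketched in Python as follows. Only the log2-CPM transform at the core of voom is shown; the precision weights that voom also estimates are omitted, and the negative-binomial counts are simulated for illustration, not real data.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment: positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(0)
# Simulated genes x samples count matrix with heavy right skew.
counts = rng.negative_binomial(2, 0.01, size=(5000, 4))

# Remove genes with zero counts in all samples.
counts = counts[counts.sum(axis=1) > 0]

# log2-CPM with a 0.5 offset, as computed inside voom
# (the precision weights are omitted in this sketch).
lib_sizes = counts.sum(axis=0)                 # per-sample library sizes
log_cpm = np.log2((counts + 0.5) / (lib_sizes + 1.0) * 1e6)

s_before = sample_skewness(counts.ravel())
s_after = sample_skewness(log_cpm.ravel())
# The transform pulls in the long right tail, so |s_after| < s_before.
```

Running this mirrors what the heat maps show: the raw counts are strongly right-skewed, while the transformed values are much closer to symmetric.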
To highlight the advantages of the GRA-based feature selection method, we implement a number of competing methods for comparison, including ReliefF [], the iterative search margin based algorithm (Simba) [], the signal-to-noise ratio (SNR) [], and information gain (IG) [].

The following methods are also applied for comparison with the designed GP classifier: k-nearest neighbors (kNN) [], multilayer perceptron (MLP) [], support vector machine (SVM) [] and ensemble learning AdaBoost []. Specifically, the number of nearest neighbors in kNN is set to 5, and the SVM kernel function is the Gaussian radial basis function with a scaling factor of 1. The MLP is constructed with two hidden layers of five nodes each. AdaBoost uses a collection of 100 decision trees as individual learners.

Four performance metrics, namely the accuracy rate, the F1 score (F-measure), AUC and mutual information (MI), are used to evaluate the classification methods. The F-measure considers both the precision (denoted Pr) and recall (Re) of the classification procedure:

F-measure = (2 × Pr × Re) / (Pr + Re)    (10)

The MI between the estimated and true labels is calculated by:

MI(Ĉ, C) = Σ_{ĉ=0}^{1} Σ_{c=0}^{1} p(ĉ, c) log [ p(ĉ, c) / ( p(ĉ) p(c) ) ]    (11)

where p(ĉ, c) is the joint probability distribution of the estimated and true class labels Ĉ and C, and p(ĉ) and p(c) are the marginal probability distributions of Ĉ and C respectively.

The five-fold cross validation method is employed for the experiments. Data samples are divided at random into five distinct subsets; four subsets are used for training the classifiers whilst the remaining subset is used for testing. For unbiased comparisons, each classifier is run 30 times and the average performance is reported along with the standard error.

To draw convincing conclusions when evaluating the performance of the feature selection methods and classifiers, we implement the Mann-Whitney U-test [] to compare two sets of results.
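For binary labels, Eqs (10) and (11) can be computed directly from the confusion counts. The Python sketch below uses illustrative toy vectors `y_true` and `y_pred`, not results from the study.

```python
import numpy as np

def f_measure(y_true, y_pred):
    """F1 score as in Eq. (10): harmonic mean of precision and recall."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    pr = tp / (tp + fp)          # precision
    re = tp / (tp + fn)          # recall
    return 2 * pr * re / (pr + re)

def mutual_information(y_true, y_pred):
    """MI between estimated and true binary labels, as in Eq. (11)."""
    n = len(y_true)
    mi = 0.0
    for c_hat in (0, 1):
        for c in (0, 1):
            p_joint = np.sum((y_pred == c_hat) & (y_true == c)) / n
            if p_joint > 0:      # 0 * log(0) contributes nothing
                p_hat = np.sum(y_pred == c_hat) / n
                p_c = np.sum(y_true == c) / n
                mi += p_joint * np.log(p_joint / (p_hat * p_c))
    return mi

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
f1 = f_measure(y_true, y_pred)            # 2/3 here
mi = mutual_information(y_true, y_pred)   # positive: labels are dependent
```

Accuracy and AUC are standard and are available in common libraries, so they are not re-implemented here.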
The Mann-Whitney U-test is a nonparametric test of the null hypothesis that two populations have distributions with equal medians, against the alternative hypothesis that they do not. As the results over 30 trials may not be normally distributed, the Mann-Whitney U-test is more appropriate than normal-based tests []. Note that the test compares the set of 30 outcomes generated by the GRA method with that obtained by each competing feature selection method using the same classifier. A similar procedure is performed to compare the GP classifier with its competitors, i.e. kNN, MLP, SVM and AdaBoost, using the same feature selection method. […]
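As an example of this comparison procedure, the test can be run on two sets of 30 trial accuracies with SciPy. The accuracy values below are simulated for illustration only and do not come from the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical accuracies over 30 repeated cross-validation trials
# for two competing methods (simulated, clearly separated on purpose).
acc_method_a = rng.normal(0.95, 0.01, size=30)
acc_method_b = rng.normal(0.90, 0.01, size=30)

# Two-sided test of the null hypothesis of equal medians.
stat, p_value = mannwhitneyu(acc_method_a, acc_method_b,
                             alternative="two-sided")
significant = p_value < 0.05   # reject the null at the 5% level
```

The same call is repeated for each pair of result sets, keeping the classifier (or the feature selection method) fixed across the pair.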

Pipeline specifications

Software tools: voom, limma, Inbix
Applications: RNA-seq analysis, GWAS