Computational protocol: Unsupervised Outlier Profile Analysis

Protocol publication

[…] To assess the performance of the skewness, kurtosis and K2 tests, we first conducted simulation studies using biologically relevant parameters. We studied two situations. In the first, all outlier genes were assumed to be overexpressed only in a few tissues. In the second, each outlier gene was both induced and repressed in a few tissue types. In all simulations, we generated gene expression measurements for a total of N genes across T tissue types. We allowed n0 genes to have a non-preferential expression pattern across tissues and n1 genes to have a tissue-preferential expression pattern (N = n0 + n1). For the n0 genes that are not differentially expressed, we took the baseline gene expression to be normally distributed, denoted by N(μ0, σ0²).

In the situation where outlier genes are only induced in a few tissues, we considered n1 genes to be induced in pt tissues. These n1 genes had a distribution of N(μ1, σ1²) in the induced tissues and a baseline distribution of N(μ0, σ0²) in the other tissues. We compared the performance of the proposed methodology to the entropy-based approach of Kadota et al. using the area under the receiver operating characteristic (ROC) curve (AUC). An AUC value close to 1 indicates good performance, whereas an AUC value of 0.5 indicates performance no better than chance. The simulation results show that all three moment-based methods perform better than the entropy-based approach. Among the moment-based methods, the test based on skewness performs the best, while the K2 test performs better than the kurtosis test when the false-positive rate (FPR) is low.

Next, we simulated the situation where each outlier gene is induced and repressed in a few tissues. We generated the baseline distribution for the n0 genes as described above.
Each of the n1 outlier genes had a distribution of N(μ1, σ1²) in pt tissues, a distribution of N(−μ1, σ1²) in pt other tissues, and the baseline distribution in the remaining T − 2 × pt tissues. The results show that the entropy method performs better with larger μ1 and pt, while for smaller μ1 and pt, the kurtosis test performs the best. The reduced performance of skewness is expected, as the simulated outlier genes have a symmetrical distribution.

In practice, of course, the true data-generating distribution is unknown to the analyst. Thus, we would recommend the K2 test, which combines information on skewness and kurtosis, as its performance seemed to be quite competitive with the best method in every simulation setting. An open question beyond the scope of the current paper is whether data-adaptive weights can be constructed to combine the skewness and kurtosis tests in a powerful manner.

Finally, we ran a simulation comparing supervised methods, such as that of Ghosh and Chinnaiyan, to the unsupervised methods developed here. In particular, we compared the B–H approach from Ghosh and Chinnaiyan, which we term GOBH, to the various methods. Note that GOBH is a supervised algorithm that requires the samples to be labeled as diseased or non-diseased. The proposed kurtosis, skewness, and K2 tests as well as the entropy-based method (ROKU) are unsupervised. One would expect supervised methods to outperform unsupervised methods in general, because a strong hint (i.e., the class labels) is given to the supervised method. However, if the labels are not informative, one would expect the supervised methods to perform worse than the unsupervised methods. We performed simulation analyses using the GOBH algorithm with correct sample labels (GOBH) and with shuffled sample labels (GOBH-shuffle). We used the same settings as in the simulations above, but for the purpose of comparison, we plotted the results separately.
The results show that the supervised method with correct labels (GOBH) consistently outperforms the other methods. However, with shuffled class labels, the supervised method (GOBH-shuffle) shows variable performance, with an average AUC around 0.5, and is worse than the unsupervised methods. [...]

The real data example features data from copy number and transcript mRNA microarrays, some of which are analyzed in Kim et al. We have data on 7534 genes for 47 subjects, 18 of whom have prostate cancer. We show the results of the analysis using the K2 test; similar results were found using the other two methods. We did an initial analysis using all 47 samples; however, no statistically significant genes were found using the procedure of Phillips and Ghosh. This appeared to be because of large differences in expression patterns between the cancer and non-cancer cases. Because of these differences, the P-values were all clustered near the origin. This rendered the procedure of Phillips and Ghosh numerically unstable, as was alluded to in the bivariate extension section.

Next, we performed an analysis of the cancer samples only. The goal was to identify genes that show extreme heterogeneity across the cancer subjects, which might be putative cancer biomarkers. First, we applied the analysis to gene expression only. Based on the q-value analysis, we selected 490 genes as significant using an FDR cutoff of 0.05. The q-value analysis estimated that about 20% of the genes show significant expression based on outlier transcript profiles. Next, we repeated the analysis using copy number by itself. Many more significant genes are found using the copy number data: about 45% of the genes are called statistically significant using the q-value method. This mirrors what was seen in Kim et al. using a supervised statistic.
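As a rough illustration of selecting genes at an FDR cutoff of 0.05 on each platform, the sketch below applies the Benjamini-Hochberg adjustment to per-gene P-values. This is a swapped-in technique: the source uses the q-value method, which additionally estimates the proportion of null genes. The function name `bh_adjust` and the P-values are hypothetical and chosen only for illustration; they are not data from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: used here as a simple stand-in for the q-value method.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P-values for five genes on two platforms.
p_expr = [0.001, 0.010, 0.020, 0.300, 0.800]  # gene expression
p_copy = [0.002, 0.400, 0.015, 0.010, 0.900]  # copy number

fdr_cutoff = 0.05
sig_expr = [q < fdr_cutoff for q in bh_adjust(p_expr)]
sig_copy = [q < fdr_cutoff for q in bh_adjust(p_copy)]

# Genes called significant on both platforms.
n_both = sum(a and b for a, b in zip(sig_expr, sig_copy))
print(sig_expr, sig_copy, n_both)
```

Requiring significance on both platforms separately is conservative, since a gene must pass two independent thresholds.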
If we were to intersect the results of the individual copy number and gene expression analyses, we would find that 190 genes have an estimated FDR of less than 0.05 on both platforms.

Next, we show the results of the joint copy number and gene expression analysis. The points in the lower-left corner of the accompanying figure correspond to genes that will be of interest, as they show signal on both the copy number and the transcript mRNA scale. Applying the procedure of Phillips and Ghosh identifies 734 genes as significant at an FDR of 0.05. Note that using the information from copy number and transcript mRNA levels jointly, via the method of Phillips and Ghosh, leads to almost four times as many rejections as the intersection analysis.

Enrichment analysis of the selected genes using DAVID found pathways such as the cell-cycle pathway, the D4-GDI signaling pathway, and various metabolic pathways to be statistically overrepresented among the selected genes. […]
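The first simulation setting in the excerpt above (outlier genes induced in a few tissues) and the AUC comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all parameter values (900 baseline genes, 100 outlier genes, 30 tissues, 3 induced tissues, μ1 = 5) and the function names are hypothetical, each gene is scored directly by its sample skewness or excess kurtosis rather than by a formal test P-value, and the K2 test and the entropy-based ROKU method are not implemented.

```python
import random


def moments(values):
    """Return (sample skewness, excess kurtosis) of one expression profile."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def simulate(n0, n1, T, pt, mu0, sd0, mu1, sd1, rng):
    """Setting 1: n0 baseline genes, then n1 genes induced in pt of T tissues."""
    genes, labels = [], []
    for g in range(n0 + n1):
        is_outlier = g >= n0
        profile = [rng.gauss(mu0, sd0) for _ in range(T)]
        if is_outlier:
            # Replace pt randomly chosen tissues with induced expression.
            for t in rng.sample(range(T), pt):
                profile[t] = rng.gauss(mu1, sd1)
        genes.append(profile)
        labels.append(1 if is_outlier else 0)
    return genes, labels


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney statistic); ties receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(0)
genes, labels = simulate(n0=900, n1=100, T=30, pt=3,
                         mu0=0.0, sd0=1.0, mu1=5.0, sd1=1.0, rng=rng)
skew_scores, kurt_scores = zip(*(moments(g) for g in genes))
auc_skew = auc(skew_scores, labels)
auc_kurt = auc(kurt_scores, labels)
print(f"AUC skewness: {auc_skew:.3f}, AUC kurtosis: {auc_kurt:.3f}")
```

The second setting could be sketched the same way by also replacing pt other tissues with draws centered at −μ1; in that case the absolute skewness or the kurtosis would be the more natural score, since the simulated outlier profiles are symmetric.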

Pipeline specifications

Software tools: ROKU, DAVID
Applications: Gene expression microarray analysis, Transcription analysis
Diseases: Prostatic Neoplasms