Computational protocol: Imputing gene expression to maximize platform compatibility

Protocol publication

[…] We restricted our GEO query to human samples and selected all series records (GSEs) as of March 2015 that were based on the Affymetrix HG-U133 Plus 2.0 platform. Each GSE was then mapped to its corresponding samples (GSMs). If a GSM appeared in more than one GSE, we assigned it to the oldest GSE. GSMs without associated microarray CEL files were treated as invalid for our purposes, and only GSEs containing at least three valid samples were retained. In total, 97 049 microarray CEL files from 2753 accepted GSEs were downloaded. The R packages GEOquery and GEOmetadb were used to perform the above tasks.

The CEL files within each GSE were processed using robust multi-array average (RMA). Technical bias correction was done using the R package bias v0.0.5. The probe sets were then mapped to Entrez gene identifiers using the R package Jetset v3.1.2. The mapping from Jetset yielded 20 089 and 12 210 unique Entrez gene identifiers for the HG-U133 Plus 2.0 and HG-U133A platforms, respectively, with the latter being a proper subset of the former. Of these, only 10 103 Entrez gene identifiers were obtained from the same probes on both platforms. We refer to these 10 103 genes as the ‘common gene set’. [...]

All external evaluation datasets were downloaded separately from GEO and processed using RMA. We then mapped the probe sets to Entrez gene identifiers using Jetset and used our model to predict the expression levels of genes not measured on the HG-U133A platform. We define the gene set comprising the measured genes on the HG-U133A platform and the predicted genes as the ‘imputed gene set’, and the corresponding transformed HG-U133A sample as the ‘imputed sample’.

To evaluate the accuracy of our model, we used three previously published works (GSE17700, GSE23906 and GSE3061) that assessed the concordance of data from both platforms.
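The series selection rules stated at the start of this section (assign a shared GSM to its oldest GSE, discard GSMs without CEL files, keep only GSEs with at least three valid samples) can be illustrated with a minimal Python sketch. This is not the authors' code, which used the R packages GEOquery and GEOmetadb. The function name `select_series`, its input structures and the toy identifiers are hypothetical, and the sketch assumes the three-sample threshold is checked after shared GSMs have been assigned.

```python
from collections import defaultdict


def select_series(gse_dates, gse_to_gsms, gsms_with_cel, min_valid=3):
    """Apply the three selection rules and return {gse: [valid GSMs]}.

    gse_dates:     {gse_id: ISO date string}, used to find the oldest series
    gse_to_gsms:   {gse_id: iterable of gsm ids}
    gsms_with_cel: set of gsm ids that have a CEL file
    All inputs are hypothetical stand-ins for GEO metadata.
    """
    assigned = defaultdict(list)
    seen = set()
    # Visit series from oldest to newest so a shared GSM goes to the oldest GSE.
    for gse in sorted(gse_to_gsms, key=lambda g: (gse_dates[g], g)):
        for gsm in sorted(set(gse_to_gsms[gse])):
            if gsm in seen:
                continue  # already claimed by an older series
            seen.add(gsm)
            if gsm in gsms_with_cel:  # GSMs without a CEL file are invalid
                assigned[gse].append(gsm)
    # Assumption: the threshold is applied after the assignment step.
    return {g: s for g, s in assigned.items() if len(s) >= min_valid}


# Toy identifiers, not real GEO accessions.
gse_dates = {"GSE_A": "2010-01-01", "GSE_B": "2012-06-01", "GSE_C": "2014-03-01"}
gse_to_gsms = {
    "GSE_A": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "GSE_B": ["GSM3", "GSM4", "GSM5", "GSM6"],
    "GSE_C": ["GSM7", "GSM8", "GSM9"],
}
gsms_with_cel = {"GSM1", "GSM2", "GSM3", "GSM5", "GSM6", "GSM7", "GSM8", "GSM9"}

selected = select_series(gse_dates, gse_to_gsms, gsms_with_cel)
print(selected)
```

In this toy input, `GSE_B` is dropped because two of its samples already belong to the older `GSE_A`, which leaves it with only two valid samples.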
These three studies consist of samples that were measured using both the HG-U133A and the HG-U133 Plus 2.0 platforms. We compared the 9986 imputed HG-U133 Plus 2.0 genes to those measured on the HG-U133 Plus 2.0 array using Spearman’s correlation. We additionally correlated the imputed sample of 20 089 genes with the measured sample to assess how similar the imputed and measured values of the imputed gene set are for downstream analysis.

We also applied our model to another three studies (GSE11482, GSE3893 and GSE26712) to demonstrate the effect of the increased number of features on downstream data analysis. GSE11482 consists of 53 samples representing four different types of pediatric kidney tumors measured on the HG-U133A platform. We performed hierarchical clustering (using 1 - Spearman’s correlation as the distance metric) on the samples restricted to the common gene set as defined previously and also on the imputed samples.

In analyzing GSE3893 and GSE26712, we applied the methods described in the original papers to the samples, filtered by (i) the full probe set used in the original work, (ii) the common gene set and (iii) the imputed gene set. GSE3893 consists of 24 breast cancer samples from 20 tumors with ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC). Ten samples were profiled using the HG-U133A array, and the remaining 14 samples used the HG-U133 Plus 2.0 array. We constructed imputed samples from the HG-U133A samples, and used these alongside the original HG-U133 Plus 2.0 samples in our analysis. For their analysis, Schuetz et al. performed hierarchical clustering using the neighbor-joining method with 1 - Pearson’s correlation as the distance metric. We applied the same method using the ape v.3.4 package to the original data and the imputed arrays.

For GSE26712, Bonome et al. derived a gene signature to predict survival in suboptimally debulked ovarian carcinoma patients and validated their signature in an independent dataset from Berchuck et al. In concordance with their methods, we constructed univariate Cox proportional hazards models for each gene. Genes with a p-value less than 0.01 were used to form a gene signature that differentiates long from short survival time in suboptimally debulked ovarian cancer patients. A compound covariate regression model was constructed using the significant genes from GSE26712 and tested on the data from Berchuck et al. Data from both studies were median-adjusted as done in Bonome et al. Validation data were obtained via the R package FULLVcuratedOvarianData, which contained 28 of the original 29 suboptimally debulked ovarian cancer samples. We evaluated performance on the validation data using the chi-squared test as done in Bonome et al. We treated short-survival (poor prognosis) patients as the positive class, and we further measured performance using accuracy, precision, recall and the F1 measure. […]
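The classification metrics named above, with short survival as the positive class, can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `binary_metrics`, the label strings and the toy labels are hypothetical, and the toy labels are not results from the study.

```python
def binary_metrics(true_labels, predicted_labels, positive="short"):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for eight hypothetical patients.
true_labels = ["short", "short", "short", "long", "long", "long", "short", "long"]
predicted = ["short", "short", "long", "long", "short", "long", "short", "long"]

metrics = binary_metrics(true_labels, predicted)
print(metrics)
```

Fixing the positive class matters because precision, recall and F1 are asymmetric: swapping the roles of short and long survival would generally give different values whenever the two error types are unequal.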

Pipeline specifications

Software tools: GEOquery, GEOmetadb, Jetset, APE
Databases: Gene
Applications: Phylogenetics, Transcription analysis
Organisms: Homo sapiens