Computational protocol: Data Mining of Gene Arrays for Biomarkers of Survival in Ovarian Cancer

Protocol publication

[…] Gene array data were downloaded from Array Express. The datasets were built from tissue from patients with ovarian cancer who had been treated with the same care pathway. Full data and information are available at Array Express under experiments E-GEOD-13876 and E-GEOD-26712 [].

Based on the patient information and data annotations provided with both datasets, survival time was selected as the basis for this investigation, as it was the only listed variable common to both datasets. Both datasets could be used to identify genes whose expression is significantly and consistently associated with survival time in Stage III serous ovarian cancer, and to validate or refute any genes recently reported to be linked to ovarian cancer but not yet fully validated.

Cohort 1: Full data and information are available at Array Express under experiment E-GEOD-13876 []. Array: A-GEOD-7759 (Operon human v3 ~35 K 70-mer two-color oligonucleotide microarrays). Sample information: 157 consecutive patients, treated at University Medical Center Groningen (UMCG, Groningen, The Netherlands) in the period 1990–2003, donated tumor from cyto-reductive surgery prior to platinum-based chemotherapy [].

Cohort 2: Full data and information are available at Array Express under experiment E-GEOD-26712 []. Array: A-AFFY-33 (Affymetrix GeneChip Human Genome HG-U133A [HG-U133A]). Sample information: 185 late-stage (III–IV), high-grade (2, 3) ovarian cancer tumors donated by previously untreated patients at Memorial Sloan-Kettering Cancer Center between 1990 and 2003 [].

[...] A set of six three-layered backpropagation ANNs, each with an architecture of 1 input node, 2 hidden-layer nodes and 1 output node, was trained to identify gene probes that perform well as predictors of short and long survival. The ANN algorithm was developed at NTU [,]; contact CompanDX [] for further details. Multiple ANNs were trained to accommodate a categorical analysis around a continuous variable.
A backpropagation algorithm was used to update the weights of each ANN, which was trained to convergence with early stopping on a randomly extracted subset comprising 20% of the global dataset. A sigmoidal transfer function was used in the architecture to relate input gene expression to survival.

First, the survival distribution of the population in the two datasets was examined, and three possible cut-off time points separating short from long survival were defined: 16, 23 and 30 months. Using these three survival cut-offs, ANN analyses were conducted on the two datasets. Within each of the six ANN analyses, the gene probes were ranked by their root mean gained error on an internal blind validation step comprising a different 20% of the global dataset, and gene probes ranking outside the top 0.05% were disregarded. The gene short names of the shortlisted gene probes were then cross-referenced across the three ANNs (one per time point) in each dataset. Gene names were then weighted by how often they appeared among the top 0.05% ranking probes of the three ANNs. The list of weighted gene names with a consistent predictive performance between long- and short-term survival was taken forward to the meta-analysis (see for full gene probe listings).

Cox univariate survival analysis was conducted on every gene probe individually to determine whether its expression was significantly correlated with survival. To do this, a macro was created within Statistica software that cycled through each of the thousands of gene probes within each dataset and produced a report for each one. Due to software limitations, this had to be done in several batches of 4000 probes for each dataset. The individual output reports were compiled and converted to an Excel spreadsheet. Gene probes were ranked by their p-value and any above 0.05 were disregarded.
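The bookkeeping in the steps above (labelling survival at each cut-off, weighting gene names by how often they are shortlisted, batching probes, and keeping probes with p ≤ 0.05) can be sketched in Python as follows. This is a minimal illustration only, not the authors' Statistica macro or ANN software: the function names, the gene and probe identifiers, and the choice to count a survival time exactly equal to the cut-off as "short" are all assumptions made for the example.

```python
from collections import Counter

# Survival cut-offs (months) named in the protocol.
CUTOFFS_MONTHS = (16, 23, 30)


def label_survival(months, cutoff):
    """Return 'long' if survival exceeds the cut-off, else 'short'.

    Assumption: a survival time equal to the cut-off counts as 'short'.
    """
    return "long" if months > cutoff else "short"


def weight_genes(shortlists):
    """Count in how many shortlists (one per cut-off) each gene name appears."""
    counts = Counter()
    for shortlist in shortlists:
        counts.update(set(shortlist))  # a gene counts once per shortlist
    return dict(counts)


def batches(items, size=4000):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def keep_significant(p_values, alpha=0.05):
    """Return probes with p-value <= alpha, ordered by ascending p-value."""
    kept = {probe: p for probe, p in p_values.items() if p <= alpha}
    return sorted(kept, key=kept.get)


if __name__ == "__main__":
    # Hypothetical patient surviving 20 months, labelled at each cut-off.
    print([label_survival(20, c) for c in CUTOFFS_MONTHS])
    # Hypothetical shortlists from the three cut-off ANNs of one dataset.
    print(weight_genes([["GENE_A", "GENE_B"], ["GENE_A"], ["GENE_A", "GENE_C"]]))
    # Hypothetical Cox p-values for three probes.
    print(keep_significant({"probe_1": 0.20, "probe_2": 0.01, "probe_3": 0.05}))
```

A gene with a weight of 3 in this sketch was shortlisted at all three cut-offs, which is one way to read "consistent predictive performance" across time points.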
The gene codes of the gene probes with a p-value of ≤0.05 were taken forward for the meta-analysis (p-values available in ).

The Pivot table function within Excel was used to cross-compare the gene codes that performed well as predictors in the MLP-ANNs and had a significant p-value in the Cox univariate survival analysis. Gene probes that did not occur in all four categories (both analyses in both datasets) were disregarded. The data corresponding to the gene probes of the genes of interest were extracted. T-tests were conducted using the same time point cut-offs as described for the ANNs. Genes that did not have a significant p-value for at least one probe in both datasets were disregarded. Finally, the means for each gene were compared. Genes whose expression trends with respect to survival differed between the datasets were disregarded.

The final list of 56 gene codes () was cross-referenced using STRING to highlight any known association or link between them [,]. Literature and online resources such as Gene Cards and Human Protein Atlas were further mined to create a database of genomic, proteomic, expression, oncologic and pathway information to direct avenues of further investigation [,].

The probability of this discovery occurring by chance was 1.39859 × 10^-11. It was calculated as the number of genes found to be of interest divided by the number of possible probes in each dataset, multiplied across both analyses: (56/37,632) × (56/22,283) × (56/37,632) × (56/22,283) = 1.39859 × 10^-11. If the work of Fury et al. [] is taken into consideration, this probability may be even lower. […]
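The chance-probability arithmetic above can be checked with a short Python sketch. It simply multiplies the four ratios given in the text, which implicitly treats the four selections (two analyses in each of two datasets) as independent. The function and variable names are illustrative assumptions, not part of the original protocol.

```python
from fractions import Fraction

GENES_OF_INTEREST = 56
PROBE_COUNTS = (37_632, 22_283)  # possible probes in each dataset, as quoted in the text
N_ANALYSES = 2                   # ANN and Cox univariate analyses


def chance_probability(genes, probe_counts, n_analyses):
    """Multiply genes/probes over every dataset for every analysis.

    Assumes the selections are independent, as the product in the text does.
    """
    p = Fraction(1)
    for _ in range(n_analyses):
        for n_probes in probe_counts:
            p *= Fraction(genes, n_probes)
    return p


probability = float(chance_probability(GENES_OF_INTEREST, PROBE_COUNTS, N_ANALYSES))
print(f"{probability:.5e}")  # agrees with the reported 1.39859 × 10^-11
```

Exact fractions are used until the final conversion so that rounding cannot affect the reported digits.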

Pipeline specifications

Software tools: affy, Statistica
Applications: Miscellaneous, Gene expression microarray analysis
Organisms: Homo sapiens
Diseases: Ovarian Neoplasms