Computational protocol: Prediction Errors in Learning Drug Response from Gene Expression Data – Influence of Labeling, Sample Size, and Machine Learning Algorithm

Similar protocols

Protocol publication

[...] The gene expression microarray data for the NCI60 panel have been downloaded from the European Bioinformatics Institute (EBI) ArrayExpress website (identifier: E-GEOD-32474). The gene expression microarray data for the BPH cell line panel have been generated in-house. The microarray data have been downloaded as CEL files and processed to obtain the gene expression matrix used for modeling, according to the following procedure:
- CEL files from the real data sets were uniformly processed using the MAS5 algorithm as implemented in the R package simpleaffy (version 2.28.0).
- All expression values were transformed to log2 values.
- In case several probe sets shared the same gene symbol, the probe set with the largest mean expression over all samples was used as representative for that symbol.
- A subset of 10,846 genes was selected to allow comparison with existing studies covering these genes.

Synthetic panel data have been generated according to a previously published procedure. Virtual gene expression values of 1,000 genes have been generated for 100 samples based on 5 state variables. One state variable, called state1, represents the response to be predicted. Another response variable, state0, has been constructed by randomly sampling the state1 vector; consequently, this vector carries no correlation between gene expression and response. State1 thus represents a positive and state0 a negative control for the computational workflow, providing upper and lower bounds for predictivity. [...]

A support vector machine (SVM) constructs a hyperplane in a high- or infinite-dimensional space to separate classes. By maximizing the distance between the hyperplane and the closest samples, the generalization error is intuitively minimized and the risk of overfitting is reduced. SVMs have also been formulated for binary classification problems in the form of a primal optimization problem. We have utilized the implementation from the R package e1071 (version 1.5-25), which provides a wrapper for the LIBSVM library. For classification, the class weights have been set to the reciprocal of the class sizes. Nested cross-validation (i.e., a cross-validation in which, in each validation run, the training set is itself validated by a further split into training and test sets) has been used to select the tuning parameters for regression and for classification. The SVM types used in this study were type = eps-regression and type = C-classification. [...]

The random forest (RF) algorithm is an ensemble method with many decision trees as individual learners. The idea is based on bagging, a method in which each learner is trained on a different bootstrap sample to increase the variation among the learners. The algorithm is described in detail elsewhere. The implementation utilized by us is from the R package randomForest (version 4.6-2), keeping the default values. The number of trees has been set to 500 for both regression and classification. The number of variables sampled as candidates for each split depends on the problem type: for classification, it equals √p, where p is the total number of variables; for regression, it has been set to p/3. The class probabilities have been calculated using the normalized votes of the base learners (trees). [...]
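For illustration, a minimal R sketch of the preprocessing steps described above is given here. It assumes the CEL files sit in a local directory and that a probe-set-to-gene-symbol mapping (probe_symbols, e.g. from a chip annotation package) is available; the protocol itself uses simpleaffy, whereas the sketch calls the closely related affy::mas5() summarisation for brevity.

library(affy)

## Load CEL files (directory name is a placeholder).
raw  <- ReadAffy(celfile.path = "cel_files")

## MAS5 summarisation and log2 transformation of all expression values.
eset <- mas5(raw)
expr <- log2(exprs(eset))

## Collapse probe sets sharing a gene symbol: for each symbol, keep the
## probe set with the largest mean expression over all samples.
## probe_symbols is an assumed character vector, one entry per probe set.
mean_expr <- rowMeans(expr)
sel <- tapply(seq_len(nrow(expr)), probe_symbols,
              function(idx) idx[which.max(mean_expr[idx])])
gene_expr <- expr[unlist(sel), , drop = FALSE]
rownames(gene_expr) <- names(sel)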
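The synthetic panel can be sketched as follows. Since the cited generation procedure is only summarized above, the loading matrix and noise level used here are purely illustrative assumptions, not the published parameters; only the roles of state1 and state0 follow the description.

set.seed(1)
n_samples <- 100
n_genes   <- 1000
n_states  <- 5

## Latent state variables driving the virtual gene expression.
states   <- matrix(rnorm(n_samples * n_states), nrow = n_samples)
loadings <- matrix(rnorm(n_states * n_genes), nrow = n_states)   # assumed loadings
noise    <- matrix(rnorm(n_samples * n_genes, sd = 0.5), nrow = n_samples)
expr_syn <- states %*% loadings + noise                          # 100 x 1,000 matrix

## state1: response linked to the expression data (positive control).
state1 <- states[, 1]
## state0: random permutation of state1, hence uncorrelated with the
## expression data (negative control).
state0 <- sample(state1)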
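A sketch of the SVM setup with e1071 follows, assuming a training expression matrix x_train (samples in rows), a binary factor y_train, and a held-out test matrix x_test. The tuning grid is a placeholder, and only the inner loop of the nested cross-validation (handled by tune()) is shown; the outer training/test split is assumed to be provided by the surrounding workflow.

library(e1071)

## Class weights: reciprocal of the class sizes.
cw <- 1 / table(y_train)
cw <- setNames(as.numeric(cw), names(cw))

## Inner cross-validation over a (placeholder) parameter grid; tune() splits
## the training set again into internal training and validation folds.
tuned <- tune(svm,
              train.x = x_train, train.y = y_train,
              type = "C-classification",
              class.weights = cw,
              ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)),  # placeholder grid
              tunecontrol = tune.control(cross = 5))

## Predictions of the selected model on the outer test set.
pred <- predict(tuned$best.model, x_test)

## For regression, the same scheme applies with type = "eps-regression"
## and without class weights.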
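A corresponding sketch for the random forest models with the randomForest package, using the same assumed x_train, y_train, and x_test objects; apart from ntree, the defaults are kept, so mtry follows the √p (classification) and p/3 (regression) rules stated above.

library(randomForest)

## Classification forest: y_train is a factor, so mtry defaults to floor(sqrt(p)).
rf <- randomForest(x = x_train, y = y_train, ntree = 500)

## Class probabilities as the normalized votes of the individual trees.
prob <- predict(rf, x_test, type = "prob")

## Regression forest: with a numeric response, mtry defaults to p/3.
## rf_reg <- randomForest(x = x_train, y = y_numeric, ntree = 500)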

Pipeline specifications

Software tools simpleaffy, LIBSVM, randomForest
Databases ArrayExpress
Applications Miscellaneous, Gene expression microarray analysis
Diseases Neoplasms