Computational protocol: Quantitative sensory testing response patterns to capsaicin- and ultraviolet-B–induced local skin hypersensitization in healthy subjects: a machine-learned analysis

Protocol publication

[…] The QST parameters differing most relevantly among the 3 experimental conditions, and thus qualifying as a basis by which to distinguish the study conditions, were identified using techniques of supervised machine learning and feature selection. Specifically, the input-output pairs were submitted to random forest machine learning. A random forest consists of a set of different, uncorrelated, and often very simple decision trees. Each decision tree uses a tree data structure with conditions on variables (parameters) as vertices and classes as leaves. Each tree in the random forest votes for a class. The final classification assigned to a data point follows the majority of these class votes.

In our analysis, the original data set was split in 1000 repeated experiments into 2/3 training and 1/3 test data subsets by means of class-proportional bootstrap resampling, using the R library “sampling” (https://cran.r-project.org/package=sampling), to increase the robustness of the analysis. For each of the 1000 experiments, 500 random decision trees were created, each tree containing between 1 and 10 features randomly drawn from the d = 10 QST parameters. The number of trees was based on a visual analysis of the relationship between the number of decision trees and the accuracy of the classification. This analysis indicated no improvement beyond approximately 300 trees; therefore, building 500 trees was considered to provide robust results. Within this analysis, bootstrap resampling was used again to split the training data subset into further training and test subsamples, and trees were created on these training subsamples and applied to the test subsamples. Thus, the whole analysis accommodated the concept of a nested cross-validation, with the inner loop consisting of the decision tree analysis on data resampled from the training data subset and the outer loop consisting of the 1000 splits of the whole data set into training and test data. The trees were analyzed with respect to the features included and the accuracy of classification into the 3 treatment groups.

For each feature, we computed the mean decrease in Gini impurity (https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) observed when the respective parameter was excluded from random forest building. This provided a rating criterion for the importance of each QST parameter as a basis for inferring the experimental condition under which its value had been acquired, ie, the magnitude of this decrease indicated the importance of the particular feature. We performed these calculations using the “randomForest” library (https://cran.r-project.org/package=randomForest); an illustrative sketch of this step is given below. Following each of the 1000 random forest analyses on resampled data, we submitted the values of the mean decrease in Gini impurity, obtained when excluding each parameter from random forest analysis, to computed ABC analysis. ABC analysis is a categorization technique for the selection of the most important subset among a larger set of items, and we chose it because it fits the basic requirements of feature selection using filtering techniques, ie, it easily scales to very high-dimensional data sets, is computationally simple and fast, and is independent of the classification algorithm.
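As an illustration of the resampling and importance-ranking step described above, a minimal R sketch might look as follows. It assumes a data frame qst holding the 10 QST parameters plus a factor column Condition (both names are hypothetical), replaces the protocol's class-proportional bootstrap resampling with plain class-proportional subsampling for brevity, and uses the built-in MeanDecreaseGini importance of the “randomForest” package rather than the exclusion-based computation described in the text; it is a sketch under these assumptions, not the published implementation.

## Sketch: repeated class-proportional 2/3 vs 1/3 splits and random-forest
## importance ranking; 'qst' is an assumed data frame with 10 QST parameter
## columns and a factor column 'Condition' holding the 3 study conditions.
library(randomForest)

set.seed(42)
n_runs  <- 1000   # outer resampling runs
n_trees <- 500    # trees per forest, as in the protocol

params <- setdiff(names(qst), "Condition")
imp_per_run <- matrix(NA_real_, nrow = n_runs, ncol = length(params),
                      dimnames = list(NULL, params))
acc_per_run <- numeric(n_runs)

for (i in seq_len(n_runs)) {
  # class-proportional draw of 2/3 of each condition into the training subset
  # (subsampling used here as a simplified stand-in for bootstrap resampling)
  train_idx <- unlist(lapply(split(seq_len(nrow(qst)), qst$Condition),
                             function(ix) sample(ix, size = round(2 / 3 * length(ix)))))
  train <- qst[train_idx, ]
  test  <- qst[-train_idx, ]

  # forest of 500 trees; extract the per-parameter mean decrease in Gini impurity
  rf <- randomForest(Condition ~ ., data = train, ntree = n_trees)
  gini <- importance(rf)[, "MeanDecreaseGini"]
  imp_per_run[i, names(gini)] <- gini
  acc_per_run[i] <- mean(predict(rf, newdata = test) == test$Condition)
}

# importance ranking averaged over the resampling runs
sort(colMeans(imp_per_run), decreasing = TRUE)

The per-parameter importance values collected in this way are the quantities on which the ABC analysis described next operates.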
ABC analysis aims at dividing a set of data into 3 distinct subsets called “A,” “B,” and “C.” Set A should contain the “important few,” ie, those elements that allow us to obtain a maximal yield with a minimal effort. Set B comprises those elements for which an increase in effort is proportional to the increase in yield. By contrast, set C contains the “trivial many,” ie, those elements with which the yield can be achieved only with a disproportionally great additional effort. The target QST parameters for further exploration of the hypersensitization effects were sought in ABC set “A.” The final size of the feature set was equal to the most frequent size of set “A” in the 1000 runs. The final members of the feature set were chosen in decreasing order of their appearances in ABC set “A” among the 1000 runs. These calculations were done using our software package (http://cran.r-project.org/package=ABCanalysis). [...]

The QST parameters most frequently assigned to set “A” in the 1000 random forest and ABC analyses of resampled data, at the number corresponding to the most frequent size of set “A,” were used to create a decision tree associating ranges of QST parameter values with the 3 study conditions. This tree was obtained by means of decision tree learning using the classification and regression tree algorithm. As in random forests, a tree data structure is created with conditions on variables (parameters) as vertices and classes as leaves. In random forests, however, the tree structure is created randomly, making it impossible to interpret single trees, whereas in classical decision trees, local decisions follow statistical criteria and aim at providing a simple and easily understandable set of classification rules suitable for topical interpretation, which is why we chose them for the present exploratory analysis. We used the concept of Gini impurity to find optimal (local) mutually exclusive decisions. The calculations used the “ctree” function of the “party” software package (https://cran.r-project.org/package=party). We assessed the significance of the splits at each decision node by applying permutation tests as implemented in the “ctree” function. Nodes were split only if the null hypothesis of independence between the response variable, ie, the treatment class, and the predictors, ie, the QST parameter values, could be rejected at the given level of significance.

Finally, we assessed the performance of the identified decision tree in correctly assigning the treatment during which the data had been acquired by calculating the overall classification accuracy of treatment assignment using standard equations. We repeated these steps in 1000 runs on randomly resampled data subsets drawn class-proportionally from the complete data set. The accuracy was taken as the median of the values obtained in the 1000 runs, and the 95% confidence interval spanned the 2.5th to 97.5th percentiles of the 1000 associated accuracy values. […]
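The ABC-based selection of the “important few” parameters could be sketched as follows, using the “ABCanalysis” package cited above. The assumption that ABCanalysis() returns the indices of set “A” in the list element Aind, and the object imp_per_run carried over from the previous sketch, are illustrative and not taken verbatim from the protocol.

## Sketch: computed ABC analysis of the per-run importance values; set "A"
## collects the "important few" QST parameters in each of the 1000 runs.
library(ABCanalysis)

a_counts <- setNames(integer(ncol(imp_per_run)), colnames(imp_per_run))
a_sizes  <- integer(nrow(imp_per_run))

for (i in seq_len(nrow(imp_per_run))) {
  abc <- ABCanalysis(imp_per_run[i, ])   # assumed to return list element 'Aind'
  a_idx <- abc$Aind
  a_counts[a_idx] <- a_counts[a_idx] + 1 # how often each parameter lands in set "A"
  a_sizes[i] <- length(a_idx)            # size of set "A" in this run
}

# final feature set: most frequent size of set "A", filled with the parameters
# most often assigned to set "A" across the 1000 runs
k <- as.integer(names(which.max(table(a_sizes))))
selected <- names(sort(a_counts, decreasing = TRUE))[seq_len(k)]
selected

This mirrors the selection rule described in the text, in which the final feature-set size equals the most frequent size of set “A” and its members are the parameters most often assigned to “A.”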
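A sketch of the interpretable decision tree and its resampled accuracy estimate, using the ctree() function of the “party” package on the selected parameters, might look as follows; the variable names qst and selected carry over from the sketches above, and the threshold mincriterion = 0.95 for the permutation tests is an illustrative assumption corresponding to alpha = 0.05.

## Sketch: conditional inference tree on the selected QST parameters and a
## resampled estimate of classification accuracy (median and 95% CI).
library(party)

fml <- reformulate(selected, response = "Condition")

acc <- numeric(1000)
for (i in seq_along(acc)) {
  # class-proportional 2/3 vs 1/3 split, as in the earlier sketch
  train_idx <- unlist(lapply(split(seq_len(nrow(qst)), qst$Condition),
                             function(ix) sample(ix, size = round(2 / 3 * length(ix)))))
  train <- qst[train_idx, ]
  test  <- qst[-train_idx, ]

  # nodes are split only where the permutation test rejects independence
  # (mincriterion = 0.95, ie, 1 - alpha with alpha = 0.05)
  ct <- ctree(fml, data = train,
              controls = ctree_control(mincriterion = 0.95))
  acc[i] <- mean(predict(ct, newdata = test) == test$Condition)
}

median(acc)                      # accuracy point estimate
quantile(acc, c(0.025, 0.975))   # nonparametric 95% confidence interval

The last two lines correspond to the reported accuracy, ie, the median over the 1000 runs, and its 95% confidence interval from the 2.5th and 97.5th percentiles.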

Pipeline specifications

Software tools randomForest, ctree (party)
Applications Miscellaneous, Phylogenetics