Computational protocol: Characterization and prediction of chemical functions and weight fractions in consumer products

Protocol publication

[…] The merged function-ingredient dataset was used to develop a series of machine-learning QSPR classification models for both function and weight fraction. QSPR models describe the relationship between a chemical’s known descriptors (e.g., structural or physicochemical information) and another property or characteristic of the chemical. QSPR models are based on either regression or classification methods and can employ a variety of data-driven statistical techniques. The classification models built here take categorical or continuous chemical descriptors (i.e., predictive variables) as input and return an assignment of the chemical to the class of interest (here, function or weight fraction bin). These descriptors (defined in SI Table S3) included 13 predicted or measured chemical properties obtained from EPI-Suite and 16 simple descriptors of chemical use previously developed for the Tox21 chemical library and evaluated for inclusion in heuristic models of exposure. Descriptors were available for 2981 chemicals for building the function models. Multiple classification models (one for each function with >10 chemicals for which descriptors were available) were built using random forests with the R package randomForest. Random forest classifiers are ensembles of decision trees; each tree is built from a sampled subset of the training data. The classification models were built by analyzing the descriptors for all the chemicals that had a given function versus all the chemicals that did not; descriptors that best “separate” these two groups were identified. Each resulting model returns the probability that an arbitrary chemical performs the function, based on its descriptors; this probability is equal to the fraction of the trees in the forest returning a positive classification for the chemical. Models were built using 5000 decision trees, and downsampling was implemented to account for imbalanced groups in the data.
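
The two mechanics described above, downsampling an imbalanced yes/no dataset and reading the class probability as the fraction of trees voting "yes", can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names, the example class counts, and the vote counts are hypothetical, and the source does not state how downsampling was configured in randomForest, so reducing the majority class to the minority class size is an assumption.

```python
import random


def downsample(labels, rng):
    """Return sorted indices with the larger class randomly reduced
    to the size of the smaller class (assumed balancing scheme)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


def vote_fraction(tree_votes):
    """Probability of the positive class, taken as the fraction of
    trees in the forest that voted 1 (chemical has the function)."""
    return sum(tree_votes) / len(tree_votes)


rng = random.Random(0)

# Hypothetical data: 5 chemicals with the function, 45 without.
labels = [1] * 5 + [0] * 45
idx = downsample(labels, rng)
balanced = [labels[i] for i in idx]

# Hypothetical votes from a 5000-tree forest for one chemical.
votes = [1] * 3500 + [0] * 1500
prob = vote_fraction(votes)

print(len(balanced), sum(balanced), prob)
```

In this toy case the balanced sample holds five chemicals from each group, and 3500 positive votes out of 5000 trees give a probability of 0.7.
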
Estimates of the model error, sensitivity, specificity, and balanced accuracy (BA; mean of the specificity and sensitivity) were obtained using 5-fold cross-validation. In addition, the method of y-scrambling was used to further test the validity of the predictive models; models for each function were built for 10 sets of randomly scrambled dependent variables (yes/no classifications for each function), and the mean and range of their errors were compared with the true model errors. Models with error greater than or equal to those generated by using the y-scrambled data were considered invalid.
An additional random forest model for weight fraction was built using a subset of the functional use dataset that could be merged with the ingredient weight fraction data; 17103 observations (828 chemicals) could be matched to the existing descriptors. The continuous quantitative weight fractions in the ingredient data were transformed using a logit (inverse logistic) function and then divided into three weight fraction bins (high: 0.3-1.0, medium: 0.01-0.3, and low: 0-0.01) for use in the predictive model; candidate bin boundaries were determined by a visual examination of a histogram of the transformed data (SI Fig. S1 and Table S4). A random forest model for weight fraction bin was then built using function and property/use descriptors (5000 trees); the model error was estimated using 5-fold cross-validation and the model was tested using y-scrambling.
Predictive variable (descriptor) importance for both the function and weight fraction models was evaluated via a measure of the Gini importance, a mean (across all trees in the forest) of the decrease in the Gini impurity criterion (a measure of class impurity at a node) that results when a tree is split using a given descriptor as a classifier. […]
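
The quantities named above (balanced accuracy, y-scrambling, the logit transform with weight fraction bins, and the Gini impurity decrease for a single split) can be written out as a short Python sketch. It is illustrative only and is not the authors' R implementation: all function names and example values are hypothetical, and the source does not say which side of each bin boundary is inclusive, so the choice below is an assumption.

```python
import math
import random
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def y_scramble(labels, rng):
    """Return a shuffled copy of the labels; class counts are kept,
    but the link between descriptors and labels is broken."""
    scrambled = list(labels)
    rng.shuffle(scrambled)
    return scrambled


def logit(w):
    """Logit of a weight fraction; defined only for 0 < w < 1."""
    return math.log(w / (1 - w))


def weight_fraction_bin(w):
    """Assign a bin using the boundaries 0.01 and 0.3 from the text.
    Upper-inclusive boundaries are an assumption."""
    if w <= 0.01:
        return "low"
    if w <= 0.3:
        return "medium"
    return "high"


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split in two children."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))


# Hypothetical predictions: sensitivity 3/4, specificity 5/6.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)

scrambled = y_scramble(y_true, random.Random(0))
bins = [weight_fraction_bin(w) for w in (0.005, 0.1, 0.5)]
decrease = gini_decrease([1, 1, 0, 0], [1, 1], [0, 0])
```

Because the logit is strictly increasing, binning the raw weight fractions at 0.01 and 0.3 gives the same assignments as binning the transformed values at logit(0.01) and logit(0.3). Gini importance, as described in the text, would then average decreases like `gini_decrease` over all trees in the forest for each descriptor.
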

Pipeline specifications

Software tools: EPI Suite, randomForest
Applications: Drug design, Miscellaneous