Computational protocol: A Genetic Algorithm Based Support Vector Machine Model for Blood-Brain Barrier Penetration Prediction

Protocol publication

[…] The logBB dataset used in this study was compiled by Abraham et al. [] and combines in vivo and in vitro data for 302 substances (328 data points). Abraham et al. applied a linear free energy relationship (LFER) to the dataset and obtained a good correlation between logBB values and LFER descriptors plus two indicator variables []. CODESSA [] could not calculate descriptors for the first five gases ([Ar], [Kr], [Ne], [Rn], and [Xe]) of the original dataset, so they were excluded. The final dataset contained 297 compounds (323 data points). The indicator variables I_v and AbsCarboxy used in Abraham's study [] were retained in this study. I_v was defined as I_v = 1 for in vitro data and I_v = 0 for in vivo data. AbsCarboxy was an indicator for carboxylic acids (AbsCarboxy = 1 for a carboxylic acid, otherwise AbsCarboxy = 0).

The initial structures in SMILES format were imported into Marvin [] and exported in MDL MOL format. The AM1 method in AMPAC [] was used for geometry optimization and for the calculation of frequencies and thermodynamic properties. The generated output files were used by CODESSA to calculate a large number of constitutional, topological, geometrical, electrostatic, quantum-chemical, and thermodynamic descriptors. Marvin was also used to calculate some physicochemical properties of the compounds, including logP, logD, polar surface area (PSA), polarizability, and refractivity. All these descriptors and properties were used as candidate features in later modeling.

Features with missing values or with no variation across the dataset were removed. If the correlation coefficient of two features was higher than a specified cutoff value (0.999999 here), one of them was randomly chosen and removed. The cutoff is very high because very high variable correlation does not imply the absence of variable complementarity []. A total of 326 descriptors were left for further analysis.
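The feature-filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `pearson` and `filter_features`, the dictionary-of-columns input, the fixed random seed, and the use of the absolute value of the correlation coefficient are all assumptions made here for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_features(columns, cutoff=0.999999, seed=0):
    """Return the names of the features that survive both filtering steps.

    columns: dict mapping a feature name to its list of values
             (None or NaN marks a missing value).
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible

    # Step 1: drop features with missing values or with no variation.
    kept = {
        name: values
        for name, values in columns.items()
        if all(v is not None and not math.isnan(v) for v in values)
        and max(values) > min(values)
    }

    # Step 2: for each pair above the cutoff, drop one feature at random.
    # Assumption: the absolute value of the correlation is compared.
    names = list(kept)
    removed = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in removed:
                break
            if b in removed:
                continue
            if abs(pearson(kept[a], kept[b])) > cutoff:
                removed.add(rng.choice([a, b]))
    return [name for name in names if name not in removed]


example = {
    "a": [1, 2, 3, 4],
    "b": [3, 5, 7, 9],        # exactly 2*a + 1, so perfectly correlated with a
    "c": [1, 1, 1, 1],        # no variation
    "d": [1.0, None, 2.0, 3.0],  # missing value
    "e": [4, 1, 3, 2],
}
survivors = filter_features(example)
```

In this toy input, `c` and `d` are dropped in the first step, and exactly one of `a` and `b` is dropped in the second; which one depends on the random choice.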
Many of the highly correlated features, however, have very similar physicochemical meanings. In our final analysis, similar features were grouped by their physicochemical meaning, which we hope will unveil some of the underlying molecular properties that determine BBB penetration.

The dataset was then split into a training set and a test set using the Kennard-Stone method [], which selects a subset of representative data points uniformly distributed in the sample space []. The method starts by choosing the data point closest to the center of the dataset, measured by Euclidean distance. After that, from all remaining data points, the one furthest from those already selected is added to the training set. This process continues until the training set reaches the specified size. Here, 260 data points were selected as the training set and the other 63 were used as the test set.

[...] Details about SVM regression can be found in the literature [–]. As in other multivariate statistical models, the performance of SVM regression depends on the combination of several parameters. In general, C is a regularization parameter that controls the tradeoff between training error and model complexity. If C is too large, the model heavily penalizes nonseparable points, may store too many support vectors, and tends to overfit. If C is too small, the model may underfit. The parameter ε controls the width of the ε-insensitive zone used to fit the training data. The value of ε affects the number of support vectors used to construct the regression function: the larger ε is, the fewer support vectors are selected. On the other hand, larger ε values result in flatter estimates. Hence, both C and ε affect model complexity, but in different ways. The kernel type is another important parameter.
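As a rough illustration of the Kennard-Stone split described earlier in this section, the following Python sketch starts from the point closest to the dataset center and then repeatedly adds the point furthest from the current selection. It is not the authors' implementation: the function names are made up for this example, and "furthest from those already selected" is interpreted here as the usual maximin rule (the candidate whose distance to its nearest selected point is largest), which is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone_split(points, n_train):
    """Return (train_indices, test_indices) for a Kennard-Stone style split."""
    n = len(points)
    if not 0 < n_train <= n:
        raise ValueError("n_train must be between 1 and the number of points")

    # Seed point: the one closest to the center (mean) of the dataset.
    dim = len(points[0])
    center = [sum(p[k] for p in points) / n for k in range(dim)]
    first = min(range(n), key=lambda i: euclidean(points[i], center))
    selected = [first]

    # min_dist[i] is the distance from point i to its nearest selected point;
    # already selected points are marked with -1 so they are never picked again.
    min_dist = [euclidean(p, points[first]) for p in points]
    min_dist[first] = -1.0

    while len(selected) < n_train:
        # Maximin rule (assumed): take the point furthest from the selection.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            if min_dist[i] >= 0:
                min_dist[i] = min(min_dist[i], euclidean(p, points[nxt]))
        min_dist[nxt] = -1.0

    chosen = set(selected)
    test = [i for i in range(n) if i not in chosen]
    return selected, test


toy_points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
train_idx, test_idx = kennard_stone_split(toy_points, n_train=3)
# train_idx == [3, 4, 0], test_idx == [1, 2]
```

For the study's data, the equivalent call would use the 323 data points in descriptor space with `n_train=260`, leaving 63 points for the test set.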
In SVM regression, the radial basis function (RBF) kernel, given in (1), is the most commonly used kernel function because of its better generalization ability, smaller number of parameters, and fewer numerical difficulties [], and it was used in this study. The parameter γ controls the width of the RBF kernel and therefore the generalization ability of SVM regression. The LIBSVM package (version 2.81) [] was used in this study for the SVM regression calculations. The RBF kernel takes the form

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0, \tag{1}$$

where x_i and x_j are training vectors (i ≠ j, x_i ≠ x_j) and γ is the kernel parameter. […]
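To make the roles of γ and ε concrete, here is a minimal Python sketch of the RBF kernel in (1) and of the standard ε-insensitive loss used in SVM regression. The function names are illustrative assumptions; this is not LIBSVM code and does not train a model.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel of (1): exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-wide zone around y_true, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A larger gamma makes the kernel decay faster with distance.
k_small_gamma = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
k_large_gamma = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=2.0)

# Residuals smaller than epsilon cost nothing; larger ones are penalized.
loss_inside = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
loss_outside = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

With γ = 0.5 the two example vectors have kernel value exp(-1), and with γ = 2.0 the value drops to exp(-4). The first residual falls inside the ε-insensitive zone and contributes no loss, which is why larger ε values leave fewer training points acting as support vectors.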

Pipeline specifications

Software tools Marvin, LIBSVM
Applications Drug design, Miscellaneous
Chemicals Hydrogen