Computational protocol: Prediction of protein-protein interaction sites using an ensemble method

Protocol publication

[…] The evolutionary conservation score is based on multiple sequence alignments (MSAs) and a phylogenetic tree. Following the method used by ConSurf, similar amino acid sequences in the PDB are collected with PSI-BLAST and then aligned with MUSCLE. The evolutionary conservation of each amino acid position in the alignment is calculated with the Rate4Site algorithm. [...]

In this section, we first present a component ensemble classifier, namely Sub-EnClassifier, that makes effective use of every feature space and handles the imbalanced classification problem. Figure shows an overview of the proposed component ensemble classifier. In most cases, the number of non-interaction sites (the majority class) is much larger than the number of interaction sites (the minority class), and the ratio between them is usually greater than three. To deal with this imbalance, the Sub-EnClassifier applies an ensemble of m classifiers and a decision fusion technique to the training set of each feature space. An asymmetric bootstrap resampling approach is adopted to generate the subsets for all component classifiers: it performs random sampling with replacement only on the majority class, so that the sampled size equals the number of minority samples, and keeps all minority samples in every subset.

In the first step, the majority class of non-interaction sites is under-sampled into m groups by random sampling with replacement, where each group has the same or a similar size as the minority class of interaction sites. This sampling procedure yields m new datasets drawn from the set of non-interaction sites. Each new dataset is combined with the set of interaction sites, giving m new training sets. We then train m classifiers on the m training sets, one classifier per training set. Each of these classifiers is a Support Vector Machine (SVM).
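The asymmetric bootstrap resampling step described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `asymmetric_bootstrap`, the `seed` parameter, and the toy data are assumptions made for illustration, and only the Python standard library is used.

```python
import random


def asymmetric_bootstrap(majority, minority, m, seed=0):
    """Build m balanced training sets (illustrative sketch).

    Each set keeps every minority sample and adds len(minority) majority
    samples drawn at random with replacement.
    """
    rng = random.Random(seed)
    training_sets = []
    for _ in range(m):
        # Sample with replacement from the majority class only.
        sampled_majority = rng.choices(majority, k=len(minority))
        # The minority class is kept whole in every subset.
        training_sets.append(sampled_majority + list(minority))
    return training_sets


# Toy data (hypothetical): (feature_vector, label) pairs,
# label 1 = interaction site, label 0 = non-interaction site.
non_interaction = [((i, i + 1), 0) for i in range(12)]
interaction = [((100 + i, 0), 1) for i in range(4)]

subsets = asymmetric_bootstrap(non_interaction, interaction, m=3)
for subset in subsets:
    n_pos = sum(1 for _, label in subset if label == 1)
    print(len(subset), n_pos)
```

Each of the three subsets holds all four interaction sites plus four resampled non-interaction sites, so every component classifier would see a balanced training set.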
The LIBSVM package (version 2.8) is used with a radial basis function kernel. Finally, the fusion unit applies simple majority voting: the final result is determined by the majority vote among the outputs of the m classifiers, and is then passed on for further processing with 10-fold cross-validation. […]
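The majority-voting fusion can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the published implementation: the three threshold functions are hypothetical stand-ins for trained SVMs, and `majority_vote` is a name introduced here.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers for sample x."""
    votes = Counter(clf(x) for clf in classifiers)
    # most_common(1) gives the (label, count) pair with the highest count.
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for m = 3 trained SVMs: each maps a single
# feature value to 1 (interaction site) or 0 (non-interaction site).
classifiers = [
    lambda x: 1 if x > 0.3 else 0,
    lambda x: 1 if x > 0.5 else 0,
    lambda x: 1 if x > 0.7 else 0,
]

predictions = [majority_vote(classifiers, x) for x in (0.2, 0.6, 0.9)]
print(predictions)  # [0, 1, 1]
```

With binary labels, choosing an odd m guarantees that the vote cannot tie.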

Pipeline specifications

Software tools: ConSurf, BLASTP, MUSCLE, Rate4Site, LIBSVM
Applications: Miscellaneous, Phylogenetics, Nucleotide sequence alignment
Diseases: HIV Infections