Computational protocol: Modularity of Protein Folds as a Tool for Template-Free Modeling of Structures

Protocol publication

[…] The modeling algorithm consists of the following steps. First, PSIPRED is used to predict the secondary structure and identify the putative Smotifs in the query protein. Next, suitable Smotif fragments are sampled from a set of related sequences with known three-dimensional structures in the PDB. These sequences are detected using three different methods: Psi-BLAST, delta-BLAST and HHblits/HHsearch. From these remote homologs, Smotifs are collected into a “dynamic” Smotif library tailor-made for the query protein. The Smotif fragments are larger (27.44 amino acid residues per Smotif on average in the current data set) than those used by other fragment assembly methods (for instance, Rosetta uses fragments of 3 and 9 residues), which makes it feasible to exhaustively enumerate all possible combinations of the chosen fragments during the decoy sampling step. The scoring function used to select the best model from the decoys is a linear combination of four knowledge-based components: (1) an orientation-dependent statistical pairwise potential using a shuffled reference state, (2) the radius of gyration, (3) a main-chain-only hydrogen bond potential and (4) an implicit solvation potential. The method was developed on a set of 20 randomly selected proteins that represent different folds and was tested on a set of 16 ab initio targets selected from newly released PDB entries to avoid any bias. The SmotifTF method is compared to other state-of-the-art structure prediction methods, I-tasser, Rosetta and HHpred, which were chosen because they have performed well in recent CASP benchmarking experiments. The results of these predictions are discussed below. [...] Recent CASP experiments show that template-free modeling is still a work in progress and requires further methodological development before it can provide useful models.
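The decoy sampling and scoring steps described above can be illustrated with a minimal Python sketch. It is not the SmotifTF implementation: the weights, the `toy_terms` function, the fragment tuples and the "lower score is better" convention are all assumptions made for illustration, since the text gives neither the coefficients nor the functional form of the four potentials.

```python
from itertools import product

# Placeholder weights: the source does not report the actual coefficients.
WEIGHTS = {"pair": 1.0, "rg": 1.0, "hbond": 1.0, "solvation": 1.0}


def combined_score(terms, weights=WEIGHTS):
    """Weighted sum of the four component scores for one decoy."""
    return sum(weights[name] * terms[name] for name in weights)


def enumerate_decoys(candidates_per_smotif):
    """Yield every combination that takes one fragment per Smotif position."""
    return product(*candidates_per_smotif)


def best_decoy(candidates_per_smotif, score_terms):
    """Return the lowest-scoring combination (lower = better is an assumption)."""
    return min(
        enumerate_decoys(candidates_per_smotif),
        key=lambda decoy: combined_score(score_terms(decoy)),
    )


def toy_terms(decoy):
    """Hypothetical stand-in for the real potentials: sums a per-fragment cost."""
    total = sum(cost for _, cost in decoy)
    return {"pair": total, "rg": 0.0, "hbond": 0.0, "solvation": 0.0}


# Two Smotif positions with 2 and 3 candidate fragments: 2 * 3 = 6 decoys.
library = [
    [("a1", 2.0), ("a2", 0.5)],
    [("b1", 1.0), ("b2", 3.0), ("b3", 0.2)],
]
n_decoys = sum(1 for _ in enumerate_decoys(library))
winner = best_decoy(library, toy_terms)
print(n_decoys, [frag_id for frag_id, _ in winner])
```

The number of decoys is the product of the candidate counts per Smotif position, which is why exhaustive enumeration stays tractable only when the fragments are long and the positions are few.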
Some of the methods that performed best in the template-free category in recent CASP experiments include I-tasser, HHpred and Rosetta. I-tasser and Rosetta are fragment assembly-based methods that use different kinds of fragments and sampling algorithms, as described earlier. HHpred is a template-based modeling method that uses Hidden Markov Model (HMM) profiles and an HMM-HMM comparison algorithm to identify remotely related templates for homology modeling. The HMM-based sequence search is more sensitive and is known to perform better than traditional heuristic sequence search methods.

The benchmarks against the above three methods were carried out on a test set of 16 proteins obtained from weekly new releases of the PDB from 10-08-2015 to 12-31-2015. These were submitted to the I-tasser and HHpred servers online, while Rosetta calculations were carried out using a local installation. In each case, the trivial prediction using the self-template was eliminated. HHpred requires the user to choose the templates for model building after the HHsearch step. Where available, multiple templates were chosen to obtain the maximum possible query coverage and were then submitted to Modeller. In the case of Rosetta, 10000 decoys were sampled from 100 parallel simulations. The resulting models were then clustered using the algorithm provided in the Rosetta package, and the center of the largest cluster was taken as the best model. The results of this analysis are summarized in the table. By mean GDT_TS, I-tasser performs best (36.97), SmotifTF comes second (33.05), and HHpred and Rosetta take the third and fourth positions (31.56 and 30.70, respectively). The mean GDT_TS is thus comparable across the four methods, at roughly 31–37%. Each method has some highlight performances, where its prediction is the best among the four.
For instance, I-tasser has the best prediction for targets 2mpvA, 3wzsA and 4uzxA, whereas Rosetta does better with 4nknA and 4rd5A. HHpred has better models for 4ux3B and 4v1am, and SmotifTF has better predictions for 4pqzA, 2mpoA and 4o7kA. For 9 of the 16 proteins in this benchmark test set (56%), SmotifTF predicts a model with GDT_TS over 30%, indicating an overall correct fold prediction for the ab initio targets. I-tasser, Rosetta and HHpred have predictions above 30% GDT_TS for 9, 9 and 7 targets, respectively. The proteins in the table are sorted by the e-value of the best hit in the PDB (column 4). For the target proteins with the least trivial templates (only high e-value hits are retained in the Smotif library), SmotifTF has an advantage over the other methods, as reflected in the mean GDT_TS scores of the last ten entries in the table, which have e-values > 0.1. For these most difficult targets, SmotifTF has a mean GDT_TS score of 35.24, essentially tied for best with I-tasser (35.28). If we consider only the entries with e-values > 2.0 (bottom 4 rows), the difference in performance is even more striking, with SmotifTF and I-tasser showing the best performance among all the methods, with mean GDT_TS scores of 27.65 and 27.77, respectively. As expected, the performance of HHpred drops the most (mean GDT_TS drops from 31.56 to 19.00), as this method depends explicitly on finding a reasonable overall template, while all the other methods are able to combine fragments from a larger variety of possible hits.
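The subset comparisons above (all targets, e-values > 0.1, e-values > 2.0) amount to filtering the results table by the e-value of the best PDB hit and averaging GDT_TS per method. The following Python sketch shows that bookkeeping. The rows, the column names and the method labels are invented for illustration and are not the benchmark data; only the 30% GDT_TS cutoff and the 9-of-16 count come from the text.

```python
from statistics import mean


def mean_gdt_above_evalue(rows, method, threshold):
    """Mean GDT_TS of one method over targets whose best-hit e-value exceeds threshold."""
    return mean(row[method] for row in rows if row["evalue"] > threshold)


def count_correct_folds(rows, method, cutoff=30.0):
    """Number of targets for which the method's GDT_TS exceeds the cutoff."""
    return sum(1 for row in rows if row[method] > cutoff)


# Made-up rows with hypothetical column names, NOT the published results.
rows = [
    {"target": "t1", "evalue": 1e-5, "methodA": 50.0, "methodB": 60.0},
    {"target": "t2", "evalue": 0.5, "methodA": 40.0, "methodB": 30.0},
    {"target": "t3", "evalue": 3.0, "methodA": 20.0, "methodB": 10.0},
]

hard_mean_a = mean_gdt_above_evalue(rows, "methodA", 0.1)
hard_mean_b = mean_gdt_above_evalue(rows, "methodB", 0.1)
correct_a = count_correct_folds(rows, "methodA")
correct_b = count_correct_folds(rows, "methodB")

# The 56% quoted in the text is 9 of 16 targets, rounded.
percent_over_30 = round(100 * 9 / 16)
```

Raising the e-value threshold shrinks the subset quickly (ten targets above 0.1, four above 2.0 in the source), so the means over the hardest subset rest on very few proteins.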
Overall, there appears to be a trend in which SmotifTF performs better than the other methods as the difficulty of prediction increases, as expressed by the e-value of the best available template.

While the amount of data is not sufficient to draw statistically conclusive results, among the 10 most difficult targets (with e-values to the best PDB hit above 0.1), SmotifTF has the most accurate models of the methods compared in four out of the five large targets (sizes 119, 131, 182, 190 and 236), and in the fifth case it is a close second.

We calculated the relative contact order for the target proteins, but no apparent correlation with accuracy could be seen. We also identified the protein classes of these targets. Among the 16 proteins there are 8, 2 and 6 cases in the mainly-α, α+β and mainly-β classes, respectively.

In terms of the time scales of the four methods, the HHpred server is the fastest, providing results within minutes for all proteins in our benchmark set. The I-tasser server, due to its intensive public use and waiting period, provided results within 24–48 hours of submission. SmotifTF and Rosetta were run on our in-house Linux cluster with 100 computing cores. SmotifTF completed all jobs within 6–12 hours, while Rosetta completed most jobs within 12–24 hours. [...] The SmotifTF prediction algorithm uses HHblits and HHalign to obtain and align Hidden Markov Model (HMM) profiles of the query Smotifs to the Smotifs in the dynamic library. The method further relies on the e-values provided by HHalign to rank the Smotif fragments in the library, and this ranking is then used to choose the best fragments for the decoy sampling step. The e-values provided by HHalign fail to pick the best available fragment in the library (in terms of RMSD to the query Smotif) for 66 of the 99 Smotifs in the benchmarking data set.
However, for 43 of these 66 Smotifs, the method does sample a library Smotif within 1 Å of the best available one, thereby neutralizing the effect of the missed fragment. […]
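The fragment-ranking check described above can be sketched in Python as follows. This is an illustrative reading rather than the authors' procedure: it assumes that "within 1 Å of the best available one" means the sampled fragment's RMSD to the query is at most 1 Å worse than the best RMSD in the library, and the `classify_smotif` function, its `(evalue, rmsd)` tuples and the `n_sampled` parameter are hypothetical.

```python
def classify_smotif(candidates, n_sampled, tolerance=1.0):
    """Classify one query Smotif given (evalue, rmsd_to_query) library candidates.

    Assumption: the n_sampled fragments with the lowest e-values are the ones
    passed on to decoy sampling.
    """
    ranked = sorted(candidates, key=lambda c: c[0])  # ascending e-value
    best_rmsd = min(rmsd for _, rmsd in candidates)
    if ranked[0][1] == best_rmsd:
        return "top_hit_is_best"
    # The top e-value hit is not the best fragment; check for a near-best sample.
    if any(rmsd - best_rmsd <= tolerance for _, rmsd in ranked[:n_sampled]):
        return "recovered"
    return "missed"


# Counts reported in the text: 99 Smotifs, 66 where the top hit is not the
# best fragment, 43 of those recovered by a sampled fragment within 1 Å.
total, top_hit_not_best, recovered = 99, 66, 43
unrecovered = top_hit_not_best - recovered
unrecovered_fraction = unrecovered / total
```

With the reported counts, 23 of the 99 Smotifs (about 23%) are left without a sampled fragment close to the best available one.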

Pipeline specifications