Computational protocol: Function Prediction and Analysis of Mycobacterium tuberculosis Hypothetical Proteins

Protocol publication

[…] In previous work [], we suggested using the underlying biological principle, referred to as the “trace” of the functional network structure under consideration, to predict the functions of uncharacterized proteins by observing the occurrence patterns of functional annotations among their level 1 and level 2 neighbors. The approach from [] was used to predict, where possible, functional classes from TubercuList [] and GO biological process terms for uncharacterized proteins, including PE/PPE proteins.

We denote by Level 1 the approach that exploits guilt-by-association, that is, the level 1 interacting neighbors, to predict the functional class of an uncharacterized protein. The Level 2 approach uses the level 2 interacting neighbors, and the Level 1–2 approach combines level 1 and level 2 interacting neighbors to predict the functional class. The Level 1:2 approach classifies a protein from its level 1 neighbors and, to improve coverage, falls back on level 2 neighbors only when the level 1 neighbors of the protein under consideration are themselves uncharacterized.

These four approaches were evaluated on proteins with known functions using Receiver Operating Characteristic (ROC) [,] and Precision-Recall Operating Characteristic (P-ROC) [] curve analyses, computed with the ROCR [] package in the R programming language [,]. To compare the performance of these approaches, we combined their ROC and P-ROC curves, and results are shown in . These results indicate that the Level 1, or guilt-by-association, approach yields the best prediction quality, so we used this approach to classify uncharacterized proteins. We were able to predict functional classes for 1466 of the 1784 uncharacterized proteins (unknown + PE/PPE functional classes), that is, 82% of them.
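The four neighbor-based approaches can be illustrated with a minimal Python sketch. It assumes that a class is chosen by simple majority vote among annotated neighbors, which may differ from the scoring used in the cited work. The protein names, edges, class labels, and the helper names `build_graph`, `vote`, and `predict` are all hypothetical.

```python
from collections import Counter

def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def level1(graph, protein):
    """Level 1 neighbors: direct interacting partners."""
    return set(graph.get(protein, ())) - {protein}

def level2(graph, protein):
    """Level 2 neighbors: partners of partners that are not level 1."""
    first = level1(graph, protein)
    second = set()
    for neighbor in first:
        second |= graph.get(neighbor, set())
    return second - first - {protein}

def vote(neighbors, classes):
    """Most frequent known class among neighbors, or None if none is annotated.

    Majority voting is an assumption made for this sketch.
    """
    counts = Counter(classes[n] for n in neighbors if n in classes)
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda c: (-counts[c], c))

def predict(graph, classes, protein, approach="level1"):
    """Predict a functional class with one of the four approaches."""
    n1, n2 = level1(graph, protein), level2(graph, protein)
    if approach == "level1":
        return vote(n1, classes)
    if approach == "level2":
        return vote(n2, classes)
    if approach == "level1-2":
        return vote(n1 | n2, classes)
    if approach == "level1:2":
        # Use level 2 only when no level 1 neighbor is characterized.
        first = vote(n1, classes)
        return first if first is not None else vote(n2, classes)
    raise ValueError(f"unknown approach: {approach}")

# Toy network: A and X are uncharacterized, Y is an uncharacterized partner of X.
graph = build_graph([("A", "B"), ("A", "C"), ("A", "D"),
                     ("D", "E"), ("X", "Y"), ("Y", "B")])
classes = {"B": "lipid metabolism", "C": "lipid metabolism",
           "D": "cell wall", "E": "cell wall"}

print(predict(graph, classes, "A", "level1"))    # lipid metabolism
print(predict(graph, classes, "X", "level1"))    # None
print(predict(graph, classes, "X", "level1:2"))  # lipid metabolism
```

In this toy network, protein X shows why the Level 1:2 variant improves coverage: its only direct partner is uncharacterized, so Level 1 returns nothing, while the fallback to level 2 neighbors still yields a class.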
These predictions bring the number of proteins with known or predicted functional classes to 3877 out of the 4195 found in the non-redundant list of MTB proteins from the UniProt database [–], which represents 92% of the proteome.

For predicting GO biological process terms, we evaluated five approaches, namely the GO-GA, GO-PIND, GO-GAPIND-1, GO-GAPIND-2 and GO-FS approaches described in [], with scores and GO semantic similarity computed using GO-universal similarity metrics []. The GO-GA approach is the guilt-by-association approach applied to GO annotations, in which relationships between GO terms in the GO directed acyclic graph (GO-DAG) are taken into account through semantic similarity scores. The GO-GAPIND approaches are GO annotation prediction models in which the potential annotations of the target protein are the annotations occurring among its direct interacting partners, together with those of other proteins whose direct interacting partners share significant similarity with the set of direct interacting partners of the target protein. GO-GAPIND-1 uses only level 1 interacting neighbors, and GO-GAPIND-2 combines level 1 and level 2 interacting neighbors. Finally, the GO-FS approach exploits level 1 and level 2 neighbor similarity weights to identify the neighbors that are most likely to share functions with the target protein. Note that all of these approaches achieved their best precision at a GO score threshold of 0.1.

The known protein GO annotation data for the MTB proteome were extracted from the Gene Ontology Annotation (GOA) project [–]. Most, if not all, of these annotations have been inferred electronically and carry the GO evidence code IEA. We relied on the fact that the quality of these IEA annotations is high: precision reaches up to 100% and, in the worst-case scenario, InterPro2GO, SPKW2GO and EC2GO predict the correct GO term 60 to 70% of the time [].
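The candidate-gathering step described for GO-GAPIND-1 can be sketched as follows. This is only an illustration: the Jaccard index and the 0.5 cutoff stand in for the "significant similarity" between partner sets and are assumptions, not the measure used in the cited work. The scoring of candidates against the 0.1 GO score threshold is omitted, and the protein names and annotations are toy data.

```python
def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def jaccard(a, b):
    """Jaccard index of two sets (assumed similarity measure for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_terms(graph, annotations, target, min_similarity=0.5):
    """Collect candidate GO terms for `target`.

    Candidates are (1) terms annotating the target's direct partners and
    (2) terms annotating other proteins whose partner set is similar
    enough to the target's partner set.
    """
    partners = graph.get(target, set()) - {target}
    candidates = set()
    for partner in partners:
        candidates |= annotations.get(partner, set())
    for other, other_partners in graph.items():
        if other == target or other in partners:
            continue
        if jaccard(partners, other_partners - {other}) >= min_similarity:
            candidates |= annotations.get(other, set())
    return candidates

# Toy data: T is the target; Q shares two of its three partners with T.
graph = build_graph([("T", "A"), ("T", "B"),
                     ("Q", "A"), ("Q", "B"), ("Q", "C"),
                     ("R", "C")])
annotations = {
    "A": {"GO:0006629"},
    "Q": {"GO:0008610"},
    "R": {"GO:0006412"},
}

print(sorted(candidate_terms(graph, annotations, "T")))
```

Here Q is not a partner of T, but its partner set overlaps T's with a Jaccard index of 2/3, so Q's annotation joins the candidates, whereas R's does not.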
The ROC and P-ROC curves for the five protein function prediction approaches are depicted in , and show that all of these approaches achieve good performance in terms of the ROC analysis. To produce these curves, we used a leave-one-out cross-validation strategy in which the positives for a given known protein are the GO terms annotating the protein, and a true positive is any predicted GO term whose semantic similarity score with the protein’s known annotations is at least 0.4. Negatives are annotations occurring among a protein’s neighbors whose semantic similarity score with the protein’s known annotations is less than 0.4. The P-ROC curves show the difference between these approaches and reveal that the combination of the GO-GA and PIND approaches yields better-quality annotations.

To ensure higher genome coverage, we ran the prediction model with the GO-GAPIND-2 method to predict GO biological process terms for uncharacterized proteins in the MTB proteome. The GO annotation data extracted from the GOA website contained a total of 2340 proteins characterized with biological process terms. After running the annotation prediction model on the new MTB functional network, annotations were predicted for 1770 of the 1855 uncharacterized proteins, representing 95% of previously uncharacterized proteins in the MTB proteome. Thus, the resulting annotation dataset consists of 4110 proteins with known or predicted GO biological process terms, which represents approximately 98% of the whole proteome. Eighty-five proteins are still uncharacterized, representing about 2% of the MTB proteome. […]
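The labeling rule used in the cross-validation can be sketched for a single held-out protein. The sketch is written in Python rather than with ROCR, and it makes two simplifying assumptions: rates are computed only over the candidate terms that were predicted, and the semantic similarity is a hypothetical lookup table, not the GO-universal metric. All names and values are illustrative.

```python
def label_predictions(predicted, known, similarity, cutoff=0.4):
    """Label each predicted term of one held-out protein.

    A term counts as positive when its best semantic similarity with the
    protein's known annotations reaches `cutoff`, and negative otherwise.
    Returns a list of (prediction_score, is_positive) pairs.
    """
    labeled = []
    for term, score in predicted.items():
        best = max((similarity(term, k) for k in known), default=0.0)
        labeled.append((score, best >= cutoff))
    return labeled

def rates(labeled, threshold):
    """True positive rate, false positive rate and precision at a score threshold."""
    tp = sum(1 for s, pos in labeled if pos and s >= threshold)
    fp = sum(1 for s, pos in labeled if not pos and s >= threshold)
    positives = sum(1 for _, pos in labeled if pos)
    negatives = len(labeled) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tpr, fpr, precision

# Hypothetical similarity values between pairs of GO terms.
SIM = {frozenset(("GO:0006629", "GO:0008610")): 0.7}

def toy_similarity(a, b):
    """Return 1.0 for identical terms, else the table value (default 0.0)."""
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)

known = {"GO:0006629"}            # annotations of the held-out protein
predicted = {"GO:0006629": 0.9,   # term -> prediction score (toy values)
             "GO:0008610": 0.6,
             "GO:0006412": 0.3}

labeled = label_predictions(predicted, known, toy_similarity)
for threshold in (0.1, 0.5):
    print(threshold, rates(labeled, threshold))
```

Sweeping the threshold over all observed scores and pooling the labeled pairs from every held-out protein would give the points of a ROC or precision-recall curve. Lowering the threshold from 0.5 to 0.1 in the toy data admits the one negative term and so reduces precision.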

Pipeline specifications

Software tools: ROCR, InterPro2GO
Databases: TubercuList
Application: Miscellaneous
Organisms: Mycobacterium tuberculosis, Homo sapiens
Diseases: Infection, Tuberculosis