Computational protocol: Simple topological properties predict functional misannotations in a metabolic network

Protocol publication

[…] Metabolic networks were represented as bipartite graphs with reaction and compound nodes, generated from the KEGG LIGAND database. To reconstruct the metabolic network for each species, all gene functions annotated for that species were collected. The reactions mapped to each function were then retrieved. Finally, the compounds attached to each reaction were added to produce a bipartite metabolic network for each species. All reactions were treated as reversible. Network topological properties were calculated using the NetworkX library in Python.

[...] A random forest was used to separate correct from incorrect annotations. A random forest is an ensemble of decision trees. During training, to obtain a variety of different decision trees, a random subset of parameters is considered at each node. Then, as in a standard decision tree, the parameter chosen at each node is the one that most decreases the entropy, that is, the one giving the largest information gain. To predict the label of an entry, the entry is assessed by every tree of the ensemble. The distribution of label votes returned is the random forest prediction. In our case, the probability of an annotation being correct is taken as the proportion of trees that labelled it as correct.

The random forest used was the implementation in the randomForest R package, which follows Breiman's random forest algorithm. The default parameters were used in both the randomForest and predict functions. To build the receiver operating characteristic (ROC) curves, the type = "prob" option of the predict function was used.

[...] have reconstructed a highly resolved tree of life. Their species tree is built from a concatenation of 31 unambiguous orthologues present in 191 species. This tree and the multiple alignments used to build it were downloaded from iTOL. iTOL also provides other types of data related to these species, including genome sizes, domains per genome and publication dates.
Pairwise distances between the species were calculated from the multiple alignment using protdist from PHYLIP, a package of programs for inferring phylogenies. The classifier was applied to the metabolic networks present in KEGG for each species included in the iTOL phylogeny. […]
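
The network reconstruction described above (gene functions, then reactions, then compounds) can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library rather than NetworkX, and it is not the authors' code. The identifiers `F1`, `R1`, `C1` and the three input mappings are hypothetical placeholders, not KEGG data. Node degree stands in for the topological properties, which the excerpt does not list.

```python
def build_bipartite_network(gene_functions, function_to_reactions, reaction_to_compounds):
    """Return an undirected adjacency map with ("reaction", id) and ("compound", id) nodes."""
    adjacency = {}
    for function in gene_functions:
        for reaction in function_to_reactions.get(function, ()):
            r_node = ("reaction", reaction)
            adjacency.setdefault(r_node, set())
            for compound in reaction_to_compounds.get(reaction, ()):
                c_node = ("compound", compound)
                # Store the edge in both directions: every reaction is
                # treated as reversible, so the graph is undirected.
                adjacency[r_node].add(c_node)
                adjacency.setdefault(c_node, set()).add(r_node)
    return adjacency


def node_degrees(adjacency):
    """Degree of every node, one simple topological property."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


# Hypothetical toy input: functions annotated for one species, and the
# function -> reaction and reaction -> compound mappings.
annotated_functions = ["F1", "F2"]
function_to_reactions = {"F1": ["R1"], "F2": ["R1", "R2"], "F3": ["R3"]}
reaction_to_compounds = {"R1": ["C1", "C2"], "R2": ["C2", "C3"], "R3": ["C4"]}

network = build_bipartite_network(
    annotated_functions, function_to_reactions, reaction_to_compounds
)
degrees = node_degrees(network)

for node in sorted(degrees):
    print(node, degrees[node])
```

Reaction `R3` never enters the network because its function `F3` is not annotated for the toy species. An equivalent graph could be built in NetworkX by calling `Graph.add_edge` for each reaction and compound pair.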

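The voting scheme described above, in which the probability of an annotation being correct is the proportion of trees that label it correct, can be illustrated with a deliberately small sketch. It is written in Python with the standard library only and is not the randomForest R package or the authors' pipeline. Each "tree" is a single-split stump, the two features are hypothetical topological properties, and the toy data are invented for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def fit_stump(rows, labels, rng, n_candidates):
    """Fit a one-node tree: consider a random subset of features and keep
    the split with the largest entropy reduction (information gain)."""
    majority = Counter(labels).most_common(1)[0][0]
    best_gain, best_rule = 0.0, None
    for f in rng.sample(range(len(rows[0])), n_candidates):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = entropy(labels) - children
            if gain > best_gain:
                best_gain = gain
                best_rule = (f, threshold,
                             Counter(left).most_common(1)[0][0],
                             Counter(right).most_common(1)[0][0])
    if best_rule is None:
        # No useful split (e.g. a one-class bootstrap sample): vote the majority.
        return lambda row: majority
    f, threshold, left_label, right_label = best_rule
    return lambda row: left_label if row[f] <= threshold else right_label


def fit_forest(rows, labels, n_trees=200, seed=0):
    """Train stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n_candidates = max(1, int(math.sqrt(len(rows[0]))))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], rng, n_candidates))
    return forest


def probability_correct(forest, row):
    """Proportion of trees that vote 'correct' for this entry."""
    return sum(tree(row) == "correct" for tree in forest) / len(forest)


# Invented toy data: two hypothetical topological features per annotation.
rows = [(1, 0.1), (2, 0.2), (3, 0.3), (8, 0.8), (9, 0.9), (10, 1.0)]
labels = ["incorrect"] * 3 + ["correct"] * 3

forest = fit_forest(rows, labels)
p_high = probability_correct(forest, (9.5, 0.95))
p_low = probability_correct(forest, (0.5, 0.05))
print(p_high, p_low)
```

Sweeping a threshold over these vote proportions is what produces a ROC curve. In the R package the corresponding per-class proportions are what `predict` returns with `type = "prob"`.
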
Pipeline specifications

Software tools: randomForest, iTOL, PHYLIP
Applications: Miscellaneous, Phylogenetics
Organisms: Escherichia coli, Homo sapiens