Computational protocol: Predicting the Extension of Biomedical Ontologies


Protocol publication

[…] The intuition behind our proposed strategy is that information encoded in the ontology or its annotation resources can be used to support the prediction of ontology areas that will be extended in a future version. This notion is inspired by change capturing strategies based on implicit requirements. However, in existing change capturing approaches these requirements are manually defined based on expert knowledge. Our system attempts to go beyond this by learning these requirements from previous extension events using supervised learning.

In our test case using GO, we use as attributes for learning a series of ontology features based on structural, annotation, or citation data. These are calculated for each GO term and then used to train a model able to capture whether a term will be extended in a following version of GO.

Structural features give information on the position of a term and the surrounding structure of the ontology, such as height (i.e., distance to a leaf term) and the number of sibling or child terms. A term is considered a direct child if it is connected to its parent by an is_a or part_of relation, whereas the total number of children of a term encompasses all of its descendants, regardless of the number of links between them. Annotation features are based on the number of annotations a term has, according to distinct views (direct vs. indirect, manual vs. all). Direct annotations are annotations made specifically to the term, whereas indirect annotations are annotations made to a descendant of the term and thus inherited by it. Manual annotations correspond to those made with evidence codes that reflect a manual intervention in the evidence supporting the annotation, while the full set of annotations also includes electronic annotations. Citation features are based on citations of ontology terms in external resources, in our case PubMed. Finally, hybrid features combine some of the previous features into a single value.
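To make the structural and annotation features above concrete, the following toy sketch computes height, direct-children counts, total descendants, and indirect annotation counts over a hypothetical mini-ontology (the term names, edges, and annotation counts are invented for illustration and are not part of GO):

```python
from collections import defaultdict, deque

# Hypothetical toy ontology: child -> parents (is_a / part_of edges collapsed).
parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"]}

children = defaultdict(list)
for child, ps in parents.items():
    for p in ps:
        children[p].append(child)

def descendants(term):
    """All descendants of a term, regardless of the number of links between them."""
    seen, queue = set(), deque(children[term])
    while queue:
        t = queue.popleft()
        if t not in seen:
            seen.add(t)
            queue.extend(children[t])
    return seen

def height(term):
    """Distance to the farthest leaf below the term (0 for leaf terms)."""
    if not children[term]:
        return 0
    return 1 + max(height(c) for c in children[term])

# Hypothetical direct annotation counts per term; indirect annotations are
# those inherited from the direct annotations of a term's descendants.
direct_ann = {"A": 1, "B": 2, "C": 0, "D": 5, "E": 3}

def indirect_ann(term):
    return sum(direct_ann.get(d, 0) for d in descendants(term))

print(len(children["A"]), len(descendants("A")))  # direct children vs. all descendants: 2 4
print(height("A"), indirect_ann("B"))             # 2 8
```

The same feature values would in practice be computed per term over a full GO release, one vector per term.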
These features can be mapped onto the change discovery types: structural features belong to the homonymous change discovery type; annotation features can be seen as both data-driven and usage-driven, since annotations can be interpreted both as ontology instances and as ontology usage; and citation features correspond to discovery-driven change, since they are derived from external sources. In total we defined 14 features, which we grouped into seven sets (see ): all, structure, annotations, uniformity, direct, indirect, and best. The first three sets are self-explanatory. Uniformity set features were based on , where we considered annotations to represent usage. The direct set joins the direct features of terms, in terms of children and annotations, whereas the indirect set joins the same kinds of features in their indirect versions. The best set is based on the best features found after running the prediction algorithm on individual features.

Due to the complexity of ontology extension, we have established a framework for outlining ontology extension in an application scenario.
This framework defines the following parameters:

Extension type:
- refinement, where a term is considered to be extended if it has novel child terms
- enrichment, where a term is considered to be extended if it has novel hierarchical relations to existing terms
- extension, where a term is considered to be extended if it has novel child terms and/or novel hierarchical relations to existing terms

Extension mode:
- direct, where a term is considered to be extended if it has new child terms (according to extension type)
- indirect, where a term is considered to be extended if it has any new descendant terms (according to extension type)

Term set:
- all terms
- terms at a given depth (maximum distance to root)
- terms at a given distance to GOSlim terms

Time parameters:
- nVer, the number of versions used to calculate the features
- FC, the time interval (in number of ontology versions) between the versions used to calculate features and the version used to verify extension (i.e., in our dataset an FC of two equals a time interval of one year, since we use ontologies spaced by six months)

By clearly describing the ontology extension process according to this framework, we are able to accurately circumscribe our ontology extension prediction efforts.

The datasets used for classification were then composed of vectors of attributes followed by a boolean class value, corresponding to extension in the version to be predicted according to the chosen parameters. To compose the datasets we need not only the parameters but also an initial set of ontology versions used to calculate features, and the ontology version used to calculate the extension outcome (i.e., class labels). So, given a set of sequential ontology versions {v1, ..., vn}, we need to choose one ontology version vp in which to predict extension, and then, based on the time parameters nVer and FC, select the set of ontologies to be used to calculate features.
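The selection of feature versions from the time parameters can be sketched as follows (the function name and version labels are hypothetical; the indexing assumes versions are ordered chronologically):

```python
def feature_versions(versions, predict_idx, n_ver, fc):
    """Pick the n_ver versions used to compute features when predicting
    extension in versions[predict_idx]; fc is the gap (in number of
    versions) between the last feature version and the prediction version."""
    last = predict_idx - fc          # index of the most recent feature version
    first = last - n_ver + 1
    if first < 0:
        raise ValueError("not enough earlier versions for these parameters")
    return versions[first:last + 1]

# With six-monthly GO releases, FC = 2 corresponds to a one-year horizon.
vs = ["2019-01", "2019-07", "2020-01", "2020-07", "2021-01", "2021-07"]
print(feature_versions(vs, predict_idx=5, n_ver=2, fc=2))  # ['2020-01', '2020-07']
```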
For example, for a set of ontologies {v1, ..., v6}, if we chose to predict extension in v6, along with nVer = 2 and FC = 2, the set of ontologies used to calculate features would be {v3, v4}.

We tested several supervised learning algorithms, namely Decision Tables, Naive Bayes, SVMs, Neural Networks, and Bayesian Networks, using their WEKA implementations. For Support Vector Machines, we used the LibSVM implementation with an RBF kernel and optimized the cost and gamma parameters through a coarse grid search. For Neural Networks we used the Multilayer Perceptron implementation, with the number of hidden layers equal to , a training time of 500 epochs, and a coarse grid search to optimize the learning rate. Regarding Bayesian Networks, we estimated probabilities directly from the data and focused on testing different search algorithms, namely Simulated Annealing, K2, and Hill Climbing.

Furthermore, we had to take into consideration that, between two sequential ontology versions, there are many more terms that are not extended than terms that are, which creates unbalanced training sets. To address this issue we used the SMOTE algorithm. SMOTE (synthetic minority over-sampling technique) handles unbalanced datasets by over-sampling the minority class and under-sampling the majority class, and has been shown to support better classification results for the minority class. […]
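We used SMOTE through its WEKA filter; to illustrate the core idea, here is a minimal SMOTE-style over-sampling sketch in Python (a simplified stand-in, not the WEKA implementation: it interpolates synthetic minority samples between nearest neighbours and omits the majority-class under-sampling step, and the data is invented):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """SMOTE-style over-sampling: each synthetic sample is an interpolation
    between a random minority point and one of its k nearest minority
    neighbours, so new points stay inside the minority region."""
    rng = rng or np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every other minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])  # one of the k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy minority class (e.g. feature vectors of the few terms that were extended).
X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_new = smote_oversample(X_minority, n_new=6)
print(X_new.shape)  # (6, 2)
```

After balancing, the augmented training set is fed to the classifiers as usual; only the training folds should be over-sampled, never the evaluation data.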

Pipeline specifications

Software tools ECOMICS, Weka
Application Information extraction