Computational protocol: Improving integrative searching of systems chemical biology data using semantic annotation

Protocol publication

[…] Once we had created an initial set of terms derived from use-case queries, we defined a set of primary classes: SmallMolecule, Drug, Protein Target, Disease, SideEffect, Pathway, BioAssay, Literature and Interaction (Table ), based partially on the BioPAX classes []. BioPAX offers a standard, well-defined representation of biological pathway data using OWL, and it has been widely used in biological data integration [,]. We imported the terms from BioPAX and then extended them based on our use cases. The primary classes were refined in accordance with the structure of the current instance data. SmallMolecule, Drug and Protein were placed under PhysicalEntity. Their relations with Disease and SideEffect were elaborated under Interaction, which is further classified into DrugInducedSideEffect, DrugTreatment, DrugDrugInteraction, ProteinProteinInteraction and ChemicalProteinInteraction. BioAssay and Literature serve as Evidence to support the relations. Pathway was treated as a 'black box' because its instance data consist only of the pathway name. Apart from Interaction, we did not intend to classify the individual major classes further.

After the major classes were determined, some utility classes were created to help represent primary classes for which a single class is insufficient to capture the hierarchical behavior. For instance, ChemicalStructure, consisting of a structure format and a structure representation, is considered a utility class that represents the structure of a small molecule. A small molecule may have multiple structure representations, so several instances of ChemicalStructure may relate to the same small molecule. Without the small molecule that bears it, an instance of ChemicalStructure is meaningless.

Relations between entities that carry properties (or contexts), such as experimental conditions and references, were separated out as individual classes and placed under Interaction; relations without such properties were represented as object properties.
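The class layout described above can be sketched as a simple parent map. This is only an illustrative sketch, not the ontology itself: the class names come from the text, while the `SUBCLASS_OF` dictionary, the helper functions, and the choice to model BioAssay and Literature as subclasses of Evidence are assumptions made for the example.

```python
# Parent links taken from the description above. Treating BioAssay and
# Literature as subclasses of Evidence is an assumption of this sketch.
SUBCLASS_OF = {
    "SmallMolecule": "PhysicalEntity",
    "Drug": "PhysicalEntity",
    "Protein": "PhysicalEntity",
    "DrugInducedSideEffect": "Interaction",
    "DrugTreatment": "Interaction",
    "DrugDrugInteraction": "Interaction",
    "ProteinProteinInteraction": "Interaction",
    "ChemicalProteinInteraction": "Interaction",
    "BioAssay": "Evidence",
    "Literature": "Evidence",
}


def ancestors(cls):
    """Return the chain of superclasses of `cls`, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, parent):
    """True if `cls` is `parent` or one of its (transitive) subclasses."""
    return cls == parent or parent in ancestors(cls)


def interaction_subclasses():
    """List the classes placed directly under Interaction."""
    return sorted(c for c, p in SUBCLASS_OF.items() if p == "Interaction")
```

For example, `is_a("DrugTreatment", "Interaction")` holds, whereas Pathway has no parent in this map because it is treated as a black box.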
The Relation Ontology (RO) [] was imported to help represent basic relations. For example, ProteinProteinInteraction not only covers the binary relation between two proteins, but also carries its experimental conditions (e.g., organism and interaction type). Protein serves as a participant in that interaction. Similarly, Chemical and Protein serve as participants in ChemicalProteinInteraction, which includes other information such as the strength of the interaction. Figure shows major classes and their relations.

Data properties that appear in the original database sources were not fully covered; instead, only the important ones related to our purpose (chemogenomics and systems chemical biology) were included. This simplifies the ontology without losing essential knowledge. Terms, including data property names, class names and relation names, were manually mapped to terms in relevant ontologies in OBO and the NCBO BioPortal, and terms from existing ontologies were preferred when multiple terms were found. For example, for a chemical formula we chose chemicalFormula because this term is used in BioPAX. In addition, each term must conform to our naming convention. If there were multiple results or no results at all, we used the terms from the primary databases. A table was created to map data source terms to the standardized terms and was later applied to annotate instances. The properties of classes, object properties and data properties were further edited in Protégé [].

[...] Figure shows the data integration workflow we used to populate the ontology. Customized Java scripts, along with the OWL API Java package [], were used to automate the annotation of Chem2Bio2RDF data using Chem2Bio2OWL. The Pellet reasoner [] was then applied to infer new relations. The annotated data plus the new relations were uploaded to the Virtuoso triple store [] for querying. Efforts were made to cope with data redundancy, inconsistency and provenance. Data redundancy arises because the same objects appear in multiple data sources.
Chemical compounds, for example, are represented in various formats (e.g., SMILES, InChI, MOL), and many data sources have their own identifiers for compounds. The URI of an individual instance in Chem2Bio2OWL is based on the primary data source ID, or on a fake ID if the primary ID is unavailable. PubChem, as the largest public compound hub, is considered the primary source for chemicals. Its identifier, the Compound ID (CID), was used to identify compounds (e.g., […]). Compounds with unknown CIDs were assigned CIDs by searching PubChem using InChI, a universal structure representation. A fake CID was assigned if the compound did not exist in PubChem. Drugs, proteins and side effects use the DrugBank ID, UniProt entry name and UMLS ID, respectively, as primary IDs. The pathway name is used as the pathway identifier. Diseases can be represented by MeSH, OMIM ID, UMLS or free text, but no universal disease identifier has been agreed upon. Since the Disease Ontology [] has already mapped terms to various public disease identifiers, we adopted the Disease Ontology ID as the primary ID. Free-text disease names occurring in TTD, Diseasome and other sources were mapped to the Disease Ontology using string matching algorithms.

Maintaining data provenance (i.e., its source and history) is useful for data validation and confidence weighting, and it facilitates data updates and maintenance. The class UnificationXref defines a reference to an entity in an external resource that has the same biological identity as the referring entity. Its data properties DB and ID hold the name of the external source and the related identifier, respectively; comments holds additional information, such as why, who and how, if needed. For example, compound5591 has ID 5591 in PubChem and ID 9753 in ChEBI; both are represented using the class UnificationXref.
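The identifier and cross-reference scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java/OWL API code: the `BASE_URI` namespace, the `compound_fake` naming for placeholder IDs, and the `Compound` container are hypothetical. Only the `compound5591` local name, the DB/ID/comments properties, and the PubChem and ChEBI identifiers come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical namespace; the real Chem2Bio2OWL base URI is not given here.
BASE_URI = "http://example.org/chem2bio2owl/"


@dataclass(frozen=True)
class UnificationXref:
    """Reference to the same entity in an external resource."""
    db: str             # name of the external source (DB)
    id: str             # identifier in that source (ID)
    comments: str = ""  # optional provenance notes


@dataclass
class Compound:
    uri: str
    xrefs: List[UnificationXref] = field(default_factory=list)


def compound_uri(cid: Optional[int], fake_id: int) -> str:
    """Build the instance URI from the PubChem CID when it is known.

    When no CID exists, fall back to a placeholder ID. The
    'compound_fake' prefix is an assumption of this sketch.
    """
    if cid is not None:
        return f"{BASE_URI}compound{cid}"
    return f"{BASE_URI}compound_fake{fake_id}"


# The example from the text: one compound, two external identifiers.
example = Compound(
    uri=compound_uri(5591, fake_id=0),
    xrefs=[
        UnificationXref(db="PubChem", id="5591"),
        UnificationXref(db="ChEBI", id="9753"),
    ],
)
```

Keeping each external identifier as its own record, rather than as a property on the compound, leaves room for per-reference comments about where the mapping came from.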
For some assertions (e.g., interactions), PublicationXref is applied to record the original paper reporting the assertion.

Table shows the statistics of sample instances of primary classes as well as sample primary data sources. The total number of triples is 3,084,836, which increases to 4,411,817 after reasoning. They were later used for evaluation and are available at the Chem2Bio2OWL web site. […]
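The term-mapping step described earlier in this section (prefer a term from an existing ontology, fall back to the primary database term when the lookup gives several results or none, then apply the table to annotate instances) can be sketched as below. The mapping in the protocol was done manually, so this is only an illustration of the stated rule. The function names, the `ontology_hits` lookup, and the `mol_weight` entry with its candidate terms are hypothetical; only the choice of chemicalFormula comes from the text. The naming-convention check is omitted.

```python
def standardize(source_term, ontology_hits):
    """Pick the standardized name for one data source term.

    `ontology_hits` maps a source term to candidate terms found in
    existing ontologies (a hypothetical lookup result). A single hit is
    adopted; with several hits or none, the source term is kept.
    """
    hits = ontology_hits.get(source_term, [])
    if len(hits) == 1:
        return hits[0]
    return source_term


def build_mapping(source_terms, ontology_hits):
    """Create the table that maps source terms to standardized terms."""
    return {term: standardize(term, ontology_hits) for term in source_terms}


def annotate(record, mapping):
    """Rename the keys of one source record using the mapping table."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Hypothetical lookup results for two source column names.
ontology_hits = {
    "formula": ["chemicalFormula"],            # term used in BioPAX
    "mol_weight": ["molecularWeight", "mass"], # ambiguous: two candidates
}
mapping = build_mapping(["formula", "mol_weight", "cas"], ontology_hits)
annotated = annotate({"formula": "C2H6O", "cas": "64-17-5"}, mapping)
```

Here `formula` is renamed to `chemicalFormula`, while `mol_weight` and `cas` keep their source names because the lookup was ambiguous or empty.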

Pipeline specifications