Computational protocol: A semantic-based workflow for biomedical literature annotation

Protocol publication

[…] Information extraction tools produce several annotation formats. Migrating this data to semantic web formats and services adds value for knowledge sharing. To enable this transition, our methodology relies on the Ann2RDF modular algorithms. Ann2RDF builds modular integration algorithms to handle the different formats produced by text-mining tools. Because data can be acquired from many heterogeneous annotation formats, developers can each implement and integrate their own format behind a common interface. The algorithms rely on Object-Relational Mapping (ORM) techniques to map different data structures to a single representation, and on advanced Extract-Transform-Load (ETL) procedures that select and extract annotation content using regular expressions and data-parsing languages such as XPath (XML Path Language). Currently, the system supports most BioNLP Workshop formats out of the box, such as the BioC and Standoff formats, and new formats can be added through customization.
After these selection and extraction steps, annotation objects are semantically enriched through ontology mapping procedures: the system uses an external JSON-based configuration file to drive the ontology mapping process. This configuration file maps classified concept categories and relation properties (i.e. associations between concepts) to the respective ontology terms. This standardizes the annotations' content, e.g. rewriting ‘A relatedWith B’ as ‘A dc:relation B’ using, for instance, the Dublin Core ontology. Next, the detected concepts can be normalized. Because many NER tools do not perform concept normalization, the system offers an optional normalization service. It is invoked from the same configuration file, by declaring external HTTP POST requests.
For this invocation, two properties are needed: the service location and the regular expression used to select the desired output. With this external support, services such as BioPortal Annotator (e.g. service: ‘ = XXXX’, query: ‘[*][email protected]’) or BeCAS (e.g. service: ‘’, query: ‘*.*.refs’) can be easily integrated, which enriches the annotated data and simplifies the semantic integration process.
Finally, harmonization methods link the extracted content to the respective structured model.
To represent the processed data, our architecture model is based on the Annotation Ontology (AO), an open model for representing interoperable annotations in RDF (Resource Description Framework) that is currently used by the W3C community. It provides a robust set of methods for connecting web resources, for instance textual information in scientific publications, to ontological elements, with full representation of annotation provenance, i.e. contextual metadata describing the origin or source. By linking new scientific content to computationally defined terms and entity descriptors, AO helps to establish semantic interoperability across the biomedical field. Through this model, existing domain ontologies and vocabularies can be reused, creating extremely rich stores of metadata on web resources. […]
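The configuration-driven ontology mapping described above (rewriting ‘A relatedWith B’ as ‘A dc:relation B’) can be illustrated with a minimal Python sketch. The layout of the JSON file is an assumption made for illustration only: the keys `prefixes` and `relations` and the helpers `expand` and `map_relation` are hypothetical and are not taken from Ann2RDF, whose actual configuration schema is not given in this excerpt.

```python
import json

# Hypothetical configuration layout (NOT the actual Ann2RDF schema):
# "prefixes" declares namespace prefixes, "relations" maps extracted
# relation names to prefixed ontology terms.
CONFIG_JSON = """
{
  "prefixes": {"dc": "http://purl.org/dc/elements/1.1/"},
  "relations": {"relatedWith": "dc:relation"}
}
"""


def expand(prefixed_name, prefixes):
    """Expand a prefixed name such as 'dc:relation' into a full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local


def map_relation(subject, relation, obj, config):
    """Rewrite an extracted (subject, relation, object) tuple so that the
    relation is replaced by its configured ontology term.

    Returns None when the configuration defines no mapping for the relation.
    """
    term = config["relations"].get(relation)
    if term is None:
        return None
    return (subject, expand(term, config["prefixes"]), obj)


config = json.loads(CONFIG_JSON)
triple = map_relation("A", "relatedWith", "B", config)
print(triple)
```

Keeping the mappings in a data file rather than in code means a new relation or ontology could be supported by editing the configuration alone, which appears to be the motivation for the external file described in the text.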

Pipeline specifications

Software tools: BioC, becas
Application: Information extraction