Computational protocol: Generation of Silver Standard Concept Annotations from Biomedical Texts with Special Relevance to Phenotypes

Protocol publication

[…] To generate the different SSC, we applied four established concept recognition systems, each of which is described below. We chose tools that were developed with different goals in mind, so as to span the variety of text types: the NCBO annotator, a universal concept recognition tool for a broad range of ontologies; cTAKES, developed for the analysis of electronic health records (EHRs) using SNOMED CT and the RxNORM subset of UMLS; BeCAS, designed to analyse the scientific literature for a number of concept types such as diseases and chemical entities; and MetaMap, which annotates any text with concepts from the UMLS.

NCBO annotator: The NCBO annotator (http://bioportal.bioontology.org/annotator) is an online system that identifies and indexes biomedical concepts in unstructured text by exploiting a range of over 300 ontologies in the UMLS and NCBO BioPortal []. These ontologies include many that have particular relevance to disorders and phenotypes, such as SNOMED CT [], LOINC [] and the Foundational Model of Anatomy []. The NCBO annotator operates in two stages: concept recognition and semantic expansion. Concept recognition performs lexical matching by pooling terms and their synonyms from across the ontologies and then applying a multiline version of grep to match lexical variants in free text. During semantic expansion, various rules such as transitive closure and semantic mapping using the UMLS Metathesaurus are used to suggest related concepts from within and across ontologies based on extant relationships. The mappings and the depth of transitive closure are customisable within the tool.

cTAKES: cTAKES from Mayo Clinic consists of a staged pipeline of modules that are both statistical and rule-based. The order of processing is somewhat similar to MetaMap and consists of the following stages: sentence boundary detection with OpenNLP (https://wiki.apache.org/solr/OpenNLP), tokenisation, lexical normalisation (SPECIALIST lexical tools), POS tagging and shallow parsing using OpenNLP trained in-domain on Mayo Clinic EHRs, concept recognition, negation detection using NegEx [] and temporal status detection. Concept recognition is conducted within the boundaries of noun phrases using dictionary matching on a synonym-extended version of SNOMED CT and the RxNORM [] subset of UMLS. cTAKES was subject to a rigorous component-by-component evaluation during development. Although the focus of testing was on EHRs, the system was also tested on combinations of the GENIA corpus of Medline abstracts and the Penn Treebank corpus.

BeCAS: BeCAS (http://bioinformatics.ua.pt/becas) [] from the University of Aveiro is the newest integrated system of the four that we tried. Its pipeline involves the following stages: sentence boundary detection, tokenisation, lemmatisation, part-of-speech tagging and chunking, abbreviation disambiguation, and CUI tagging. The first four stages are performed by GDep [], a dependency parser that incorporates domain adaptation using unlabelled data from the target domain. CUI tagging is conducted using regular expressions for specific types such as anatomical entities and diseases. Dictionaries used as sources for the regular expressions include the UMLS, LexEBI [] and the Jochem joint chemical dictionary []. During development the concept recognition system was tested on abstracts and full-length scientific articles using an overlapping matching strategy.

MetaMap: MetaMap (http://metamap.nlm.nih.gov/) [] is a widely used system from the NLM for finding mentions of clinical terms based on CUI mappings to the UMLS Metathesaurus. The system exploits a fusion of linguistic and statistical methods in a staged analysis pipeline. The first stages of processing perform mundane but important tasks such as sentence boundary detection, tokenisation, acronym/abbreviation identification and POS tagging. In the next stages, candidate phrases are identified by dictionary lookup in the SPECIALIST lexicon [] and shallow parsing using the SPECIALIST parser []. String matching then takes place on the UMLS Metathesaurus before candidates are mapped to the UMLS and compared according to the amount of variation. A final stage of word sense disambiguation uses local contextual and domain-sensitive clues to arrive at the correct CUI. MetaMap is highly configurable: for example, users can specify their own vocabulary lists (e.g. for abbreviations), enable negation detection and set the permitted degree of variation between text mentions and UMLS terms. […]
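
The two-stage design described for the NCBO annotator above (lexical matching over pooled terms and synonyms, followed by depth-limited transitive closure) can be illustrated with a minimal sketch. This is not the annotator's actual implementation: the toy ontology, the `EX:` identifiers and the function names are hypothetical, and Python's `re` module stands in for the multiline grep mentioned in the text.

```python
import re

# Hypothetical toy ontology: concept id -> (preferred term, synonyms, parent ids).
ONTOLOGY = {
    "EX:1": ("disease", [], []),
    "EX:2": ("heart disease", ["cardiac disease"], ["EX:1"]),
    "EX:3": ("cardiomyopathy", ["myocardial disease"], ["EX:2"]),
}


def build_matcher(ontology):
    """Pool preferred terms and synonyms into one case-insensitive regex."""
    term_to_id = {}
    for cid, (label, synonyms, _parents) in ontology.items():
        for term in [label, *synonyms]:
            term_to_id[term.lower()] = cid
    # Longest terms first, so "heart disease" is preferred over "disease".
    alternatives = sorted(term_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, term_to_id


def recognise(text, pattern, term_to_id):
    """Stage 1: return (start, end, concept id) for each lexical match."""
    return [
        (m.start(), m.end(), term_to_id[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def expand(cid, ontology, max_depth):
    """Stage 2: collect ancestors of cid, at most max_depth levels up."""
    seen = []
    frontier = [cid]
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for parent in ontology[current][2]:
                if parent not in seen:
                    seen.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    pattern, term_to_id = build_matcher(ONTOLOGY)
    text = "Dilated cardiomyopathy is a myocardial disease."
    for start, end, cid in recognise(text, pattern, term_to_id):
        print(text[start:end], cid, expand(cid, ONTOLOGY, max_depth=2))
```

The `max_depth` argument plays the role of the customisable closure depth: a depth of 1 suggests only direct parents, while larger values walk further up the hierarchy.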

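As a rough illustration of how dictionary matching and NegEx-style negation detection, both mentioned in the cTAKES description above, can be combined, the sketch below looks up terms in a sentence and checks for a preceding negation cue. It is a simplified stand-in rather than cTAKES or NegEx code: the term dictionary, the short cue and terminator lists, and the five-token window are assumptions chosen for the example, and the noun-phrase restriction used by cTAKES is omitted.

```python
import re

# Illustrative subset of negation cues; the real NegEx lexicon is much larger.
NEGATION_CUES = ("no", "not", "without", "denies")
# Words assumed to end the scope of a preceding negation cue.
TERMINATORS = ("but", "however")
# Hypothetical dictionary of surface terms to look up.
DICTIONARY = ("chest pain", "fever", "cough")
# Assumed scope: a cue can negate a term that starts up to 5 tokens later.
WINDOW = 5


def is_negated(tokens, start):
    """Scan backwards from the term; a terminator blocks earlier cues."""
    for token in reversed(tokens[max(0, start - WINDOW):start]):
        if token in TERMINATORS:
            return False
        if token in NEGATION_CUES:
            return True
    return False


def annotate(sentence):
    """Return (term, negated) pairs for dictionary terms found in a sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    results = []
    for term in DICTIONARY:
        term_tokens = term.split()
        width = len(term_tokens)
        for i in range(len(tokens) - width + 1):
            if tokens[i:i + width] == term_tokens:
                results.append((term, is_negated(tokens, i)))
    return results


if __name__ == "__main__":
    print(annotate("Patient denies chest pain but reports fever."))
```

In the example sentence, "chest pain" falls within the scope of "denies", whereas "fever" follows the terminator "but" and is therefore left affirmed.
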
Pipeline specifications

Software tools: Annotator, cTAKES, BeCAS, MetaMap, NegEx
Application: Information extraction