Computational protocol: Annotation analysis for testing drug safety signals using unstructured clinical notes

Protocol publication

[…] We created a standalone Annotator Workflow (Figure ) based upon the existing National Center for Biomedical Ontology (NCBO) Annotator Web Service [] that annotates clinical text from electronic health record (EHR) systems and extracts disease and drug mentions. Unlike natural language processing (NLP) methods that analyze grammar and syntax, the Annotator is mainly a term extraction system: it takes biomedical terms from the NCBO BioPortal library and matches them against input text. We have also extended the Annotator Workflow with the NegEx algorithm [] to add negation detection, that is, the ability to discern whether a term is negated within the context of the narrative. We are also extending the system to discern additional contextual cues [] such as family history versus recent diagnosis.

One strength of the Annotator is the highly comprehensive and interlinked lexicon that it uses. It can incorporate the entire NCBO BioPortal ontology library of over 250 ontologies to identify biomedical concepts in text, using a dictionary of terms generated from those ontologies. Terms from these ontologies are linked together via mappings []. For this study, we specifically configured the workflow to use a subset of those ontologies (Table ) that are most relevant to clinical domains, including Unified Medical Language System (UMLS) terminologies such as SNOMED-CT, the National Drug File (NDFRT) and RxNORM, as well as ontologies like the Human Disease Ontology. The resulting lexicon contains 2.8 million unique terms.

Another strength of the Annotator is its speed. We have optimized the workflow for both space and time when performing large-scale annotation runs. It takes about 7 hours and 4.5 GB of disk space to process 9 million notes from over 1 million patients. Furthermore, the entire system fits on a USB stick and takes 45 minutes to configure and launch on most systems. To the best of our knowledge, existing NLP tools do not function at this scale.

The output of the annotation workflow is a set of negated and non-negated terms from each note (Figure , step 3). As a result, for each patient we end up with a temporal series of terms mentioned in the notes (red denotes negated terms in Figure , step 4). We also include manually encoded ICD9 terms for each patient encounter as additional terms. Because each encounter's date is recorded, we can order each set of terms for a patient to create a timeline view of the patient's record. Using the terms as features, we can define patterns of interest (such as patients with rheumatoid arthritis, who take rofecoxib, and then get myocardial infarction), which we can use in data mining applications. […]
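The steps described above (dictionary-based term matching, negation detection, and ordering of terms into a patient timeline) can be illustrated with a minimal Python sketch. It is not the Annotator Workflow or the NegEx implementation: the three-term lexicon, the short trigger list, the five-token negation window, and the function names `annotate`, `build_timeline`, and `matches_pattern` are all assumptions made for illustration, and the real NegEx algorithm uses a much larger trigger list together with post-term triggers and scope terminators.

```python
import re
from datetime import date

# Hypothetical toy lexicon; the lexicon described in the text has 2.8 million terms.
LEXICON = ("rheumatoid arthritis", "rofecoxib", "myocardial infarction")

# A few pre-term negation triggers in the spirit of NegEx (assumed, far from complete).
NEGATION_TRIGGERS = ("no", "denies", "without", "no evidence of")
NEGATION_WINDOW = 5  # number of preceding tokens inspected (assumed value)


def annotate(text):
    """Return (term, negated) pairs found in the text by dictionary lookup."""
    lowered = text.lower()
    results = []
    for term in LEXICON:
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            # Look only at the current sentence, approximated by the last period.
            sentence_start = lowered.rfind(".", 0, match.start()) + 1
            preceding = lowered[sentence_start:match.start()]
            window = " ".join(re.findall(r"[a-z]+", preceding)[-NEGATION_WINDOW:])
            negated = any(
                re.search(r"\b" + re.escape(trigger) + r"\b", window)
                for trigger in NEGATION_TRIGGERS
            )
            results.append((term, negated))
    return results


def build_timeline(notes):
    """Turn (date, text) notes into date-sorted (date, term) pairs, dropping negated terms."""
    timeline = []
    for note_date, text in notes:
        for term, negated in annotate(text):
            if not negated:
                timeline.append((note_date, term))
    return sorted(timeline)


def matches_pattern(timeline, ordered_terms):
    """True if the terms occur in the given order (same-day mentions are allowed)."""
    current = date.min
    for wanted in ordered_terms:
        later = [d for d, term in timeline if term == wanted and d >= current]
        if not later:
            return False
        current = min(later)
    return True


if __name__ == "__main__":
    notes = [
        (date(2001, 3, 1), "Patient has rheumatoid arthritis. Denies chest pain; no myocardial infarction."),
        (date(2001, 6, 15), "Started rofecoxib for joint pain."),
        (date(2002, 1, 10), "Admitted with acute myocardial infarction."),
    ]
    timeline = build_timeline(notes)
    pattern = ["rheumatoid arthritis", "rofecoxib", "myocardial infarction"]
    print(matches_pattern(timeline, pattern))
```

In this sketch the negated mention of myocardial infarction in the first note is excluded from the timeline, so only the later, non-negated mention can complete the pattern. A production system would also need the ontology mappings and the ICD9 codes mentioned in the text, which are omitted here.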

Pipeline specifications

Software tools: Annotator, NegEx
Databases: BioPortal
Application: Information extraction
Organisms: Homo sapiens
Diseases: Arthritis, Rheumatoid; Myocardial Infarction