Computational protocol: Identifying medical terms in patient-authored text: a crowdsourcing-based approach

Protocol publication

[…] To test our second hypothesis, we create a crowd-labeled dataset comprising 10 000 MedHelp sentences, and an expert-labeled dataset comprising 1000 CureTogether sentences. Using the procedures described above, these datasets cost approximately $600 and $150, respectively. We train two models, a dictionary and a CRF, on the MedHelp dataset, and evaluate their performance via fivefold cross-validation; we compare the output of MetaMap, OBA, and TerMine directly. Finally, we compare the performance of all five models against the CureTogether gold standard. […]

CRFs are probabilistic graphical models particularly suited to labeling sequence data. Their suitability stems from the fact that they relax several independence assumptions made by Hidden Markov Models; moreover, they can encode arbitrarily related feature sets without having to represent the joint dependency distribution over features. As such, CRFs can incorporate sentence-level context into their inference procedure. Our CRF training procedure takes, as input, labeled training data coupled with a set of feature definitions, and determines the model feature weights that maximize the likelihood of the observed annotations. We use the Stanford Named Entity Recognizer package, a trainable Java implementation of a CRF classifier, and its default feature set. Examples of default features include word substrings (eg, ‘ology’ from ‘biology’) and windows (previous and trailing words); the full list is detailed in online supplementary Appendix A. We refer to our trained CRF model as ADEPT (Automatic Detection of Patient Terminology). […]
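The two example feature types mentioned above (word substrings and a previous/trailing word window) can be illustrated with a minimal Python sketch. This is not the Stanford NER feature extractor and does not reproduce its default feature set, which is listed in the supplementary appendix. The function names `char_ngrams` and `token_features`, the substring length range of 2 to 5 characters, and the `<START>`/`<END>` boundary markers are all assumptions made for illustration.

```python
from typing import Dict, List


def char_ngrams(word: str, min_n: int = 2, max_n: int = 5) -> List[str]:
    """Return every contiguous substring of `word` with length in [min_n, max_n].

    The length range is an illustrative assumption, not a Stanford NER default.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position `i` in a sentence."""
    word = tokens[i].lower()
    features: Dict[str, object] = {"word": word}

    # Substring features, e.g. 'ology' from 'biology'.
    for gram in char_ngrams(word):
        features["sub=" + gram] = True

    # Window features: the previous and trailing words, with hypothetical
    # boundary markers at the sentence edges.
    features["prev"] = tokens[i - 1].lower() if i > 0 else "<START>"
    features["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in the sentence, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    sentence = "my doctor mentioned biology".split()
    feats = token_features(sentence, 3)
    print(feats["prev"], feats["next"], "sub=ology" in feats)
```

A CRF trainer would take one such feature dictionary per token, together with the token's label, and fit one weight per feature. Because each dictionary includes the neighboring words, the model can use sentence-level context when labeling a term.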

Pipeline specifications

Software tools: MetaMap, TerMine, Stanford NER
Application: Information extraction
Organisms: Homo sapiens
Diseases: Eye Diseases, Hereditary