Computational protocol: Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013

Protocol publication

[…] Each mention of a relevant physical entity is annotated as a contiguous span of text that is assigned a type such as CELL or SIMPLE CHEMICAL from a closed set of entity types defined for the task. Figure shows examples of entity annotation.

The recognition of mentions of physical entities in free text is a well-studied task in both general-domain and biomedical natural language processing [,], and the recognition of many key entity types relevant to the CG and PC tasks has been considered, in particular for molecular-level entities, in a number of tasks in the BioCreative series of community evaluations [,,]. Thus, to focus the efforts of participants on the novel aspects of the CG and PC tasks, manually created ("gold standard") physical entity mention annotations are provided to participants for the test data as well, following a convention first established in the BioNLP ST'09.

[...] The annotation of the CG and PC task corpora followed the same overall process: document selection, automatic pre-annotation of entity mentions, manual finalization of entity annotations, and manual event annotation.

While some aspects of the entity annotation are novel, many of the annotated entity types are within the scope of established domain tools and resources. To reduce the overall annotation effort, we thus created preliminary annotations using a selection of automatic named entity and entity mention taggers. For SIMPLE CHEMICAL tagging, we used the OSCAR4 system, which was trained on the chemical entity mention recognition corpus of Corbett and Copestake []. For GENE OR GENE PRODUCT mention detection, we used BANNER [] for the CG task and NERsuite [] for the PC task. Both of these systems were trained on the Gene Mention task corpus introduced in the BioCreative 2 evaluation []. NERsuite was also applied for anatomical entity mention detection (CELLULAR COMPONENT only for the PC task). For these tagging tasks, NERsuite, a general machine learning-based system, was trained on the Anatomical Entity Mention (AnEM) corpus [] following the approach presented by Pyysalo and Ananiadou []. As no broad-coverage corpus annotated specifically for mentions of macromolecular complexes was available, we applied heuristics based on the GENE OR GENE PRODUCT annotation and dictionary-based tagging to create the initial annotations for the PC task COMPLEX type. Finally, LINNAEUS [] was applied for the CG task ORGANISM mentions. The overall processing used the pipeline first introduced for similar analysis for the BioNLP ST'11 []. These tools were additionally integrated into the Argo workflow system [] to support the PC task curation process. Following initial automatic entity mention annotation, we performed manual revision of the outputs to correct tagger errors prior to advancing to the event annotation stage.

We acknowledge that automatic annotation is not only far from perfect, but also carries a risk of introducing systematic errors, some of which may persist through subsequent manual revision. As entity annotations were not a target of extraction in either of the tasks, the possibility of some remaining bias from such errors was considered acceptable. By contrast, we wished to ensure that the quality of the event annotations was as high as possible and to avoid any possibility of introducing systematic errors that might call into question whether the evaluation provides a fair representation of the comparative performance of different extraction approaches. For this reason, the event annotation of both tasks was created manually from scratch, forgoing any initial automatic annotation.

All manual annotation, including the revision of the initial automatic entity mention annotations as well as the primary event annotation, was performed using the open source BRAT annotation tool [].

The task-specific annotation process details are presented below. […]
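The pre-annotation step described above, in which outputs of several taggers are combined and then loaded into BRAT for manual revision, can be illustrated with a minimal Python sketch. It is not the pipeline used by the task organizers. The names `Span`, `merge_spans`, and `to_standoff` are hypothetical, the rule that earlier taggers win when spans overlap is an assumed policy that the text does not specify, and the underscore-style type labels are assumed for illustration. The output lines follow the text-bound annotation layout of the BRAT standoff format (ID, type with character offsets, and covered text, separated by tabs).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    """One entity mention: [start, end) character offsets plus a type label."""
    start: int
    end: int
    type: str


def merge_spans(tagger_outputs: Iterable[Iterable[Span]]) -> List[Span]:
    """Combine spans from several taggers into one non-overlapping list.

    Assumed policy (not from the source): taggers listed earlier take
    priority, and a later span is dropped if it overlaps a kept span.
    """
    kept: List[Span] = []
    for spans in tagger_outputs:
        for s in spans:
            # Two half-open intervals are disjoint if one ends before the other starts.
            if all(s.end <= k.start or s.start >= k.end for k in kept):
                kept.append(s)
    return sorted(kept, key=lambda s: (s.start, s.end))


def to_standoff(text: str, spans: Iterable[Span]) -> str:
    """Serialize spans as BRAT text-bound annotation lines (T1, T2, ...)."""
    lines = []
    for i, s in enumerate(spans, start=1):
        lines.append(f"T{i}\t{s.type} {s.start} {s.end}\t{text[s.start:s.end]}")
    return "\n".join(lines)


# Toy input standing in for the outputs of three separate taggers.
text = "p53 binds cisplatin in mouse cells"
gene_spans = [Span(0, 3, "Gene_or_gene_product")]
chemical_spans = [Span(10, 19, "Simple_chemical")]
organism_spans = [Span(23, 28, "Organism"), Span(0, 3, "Organism")]  # second one conflicts

merged = merge_spans([gene_spans, chemical_spans, organism_spans])
print(to_standoff(text, merged))
```

The resulting lines could be saved as a `.ann` file next to the corresponding `.txt` file so that annotators correct tagger errors rather than annotating every mention from scratch. Event annotations would still be created manually, as the text describes.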

Pipeline specifications

Software tools: BioCreative, BANNER, NERsuite, LINNAEUS, Argo, BRAT
Application: Information extraction
Diseases: Neoplasms