Computational protocol: A document processing pipeline for annotating chemical entities in scientific documents

Protocol publication

[…] We applied a supervised machine-learning approach, using Conditional Random Fields (CRFs) [] as provided by MALLET []. Additionally, we compiled a dictionary of chemical entity names and used the matches of these names in the texts as features for the CRF model.

The method applied in this work was developed on top of two frameworks: Gimli [] was used for feature extraction and for training the machine learning (ML) models, and Neji [] was used for pre- and post-processing tasks and as the framework for multi-threaded document annotation. Figure illustrates the overall architecture and the steps performed.

[...] The system described in this work was trained and evaluated on the BioCreative IV CHEMDNER corpus [], which is provided in three sub-sets: a training set containing 3500 Medline abstracts annotated with 29478 mentions of chemical entities, a development set composed of 3500 abstracts with 29526 entity mentions, and a test set composed of 3000 abstracts containing 25351 mentions. Seven chemical entity classes were defined in the corpus annotation guidelines. However, instead of treating each class separately, we grouped all classes into a single class.

The training and development sets were used to train and refine the machine learning models, and to perform the feature evaluation studies. The final model was trained on the combined training and development sets and evaluated on the test set.

The standard evaluation metrics were used, namely Recall = TP/(TP + FN), Precision = TP/(TP + FP) and F1 = 2 × Precision × Recall/(Precision + Recall), where TP refers to true positives, FP to false positives, and FN to false negatives.

[...] As shown in Figure , the first fundamental step is sentence splitting, which divides the texts into their basic units of logical thought. For this step, we take advantage of LingPipe [], which provides a model trained on biomedical corpora that achieves high performance [].
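The evaluation metrics above can be computed directly from the TP/FP/FN counts. A minimal sketch (the function name and example counts are ours, for illustration only):

```python
def evaluate(tp, fp, fn):
    """Compute Precision, Recall and F1 from raw counts,
    following the formulas given in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with hypothetical counts: 8 true positives,
# 2 false positives, 2 false negatives.
p, r, f = evaluate(tp=8, fp=2, fn=2)
```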
The following Natural Language Processing (NLP) tasks are performed with a customized version of GDep [], a dependency parser for the biomedical domain, built on top of the GENIA tagger, that performs tokenization, lemmatization, part-of-speech (POS) tagging and chunking. We modified the tokenizer in GDep so that words containing the symbols "/", "-" or "." are always divided into multiple tokens, making its behaviour more consistent. This simple change proved to be effective when applied to gene/protein entity recognition in different corpora []. Finally, the corpus annotations were encoded with the BIO scheme. […]

Pipeline specifications

Software tools: MALLET, Neji, BioCreative, GENIA Tagger
Application: Information extraction
Organisms: Homo sapiens