Computational protocol: Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection

Protocol publication

[…] DNR is usually formalized as a sequence labeling problem, where each word in a sentence is labeled with a tag that denotes whether the word is part of a drug name and, if so, its position within the drug name. BIO and BILOU are the two most popular tagging schemes used for NER. In the BIO tagging scheme, B, I, and O indicate that a token is at the beginning of an entity, inside an entity, or outside any entity, respectively. In the BILOU tagging scheme, B, I, L, O, and U indicate that a token is at the beginning of an entity, inside an entity, the last token of an entity, outside any entity, or a unit-length entity, respectively. Compared with the BIO tagging scheme, BILOU is more expressive and can capture more fine-grained distinctions of entity components. Some previous studies [, , ] have also shown that BILOU outperforms BIO on NER tasks in different fields. Following them, we adopt BILOU to label drug names in this study. As four types of drugs are defined in the DDIExtraction 2013 challenge (“drug,” “brand,” “group,” and “no-human”), 17 tags (B-drug, I-drug, L-drug, U-drug, B-brand, I-brand, L-brand, U-brand, B-group, I-group, L-group, U-group, B-no-human, I-no-human, L-no-human, U-no-human and O) are actually used in our DNR system.

CRF is a typical sequence labeling algorithm and has been demonstrated to be superior to other machine learning methods for NER. A CRF-based method achieved the best performance on the DNR subtask of the DDIExtraction 2013 challenge []. Moreover, CRF was also utilized by highly ranked systems on the medical concept extraction task of i2b2 2010 [], the bio-entity recognition task of JNLPBA [], and the gene mention finding task of BioCreAtIve []. Therefore, we use CRF in our DNR system. An open source implementation of CRF, CRFsuite (http://www.chokkan.org/software/crfsuite/), is used. [...]

Singleton features used for DNR in this paper are as follows.
Word Feature. The word feature is the word itself.
POS Feature.
POS type is generated by the GENIA toolkit (http://www.nactem.ac.uk/tsujii/GENIA/tagger/) for a word.
Chunk Feature. Chunk information is generated by the GENIA toolkit for a word.
Orthographical Feature. Words are classified into four classes {“All-capitalized,” “Is-capitalized,” “All-digits,” and “Alphanumeric”} based on regular expressions. The class label is used as a word's orthographical feature. In addition, {“Y”, “N”} are used to denote whether a word contains a hyphen or not.
Affix Feature. Prefixes and suffixes of lengths 3, 4, and 5 are used.
Word Shape Feature. Similar to [], two types of word shapes, “generalized word class” and “brief word class,” are used. The “generalized word class” maps each uppercase letter, lowercase letter, digit, and other character in a word to “X,” “x,” “0,” and “O,” respectively, while the “brief word class” maps each run of consecutive uppercase letters, lowercase letters, digits, and other characters to a single “X,” “x,” “0,” and “O,” respectively. For example, the word shapes of “Aspirin1+” are “Xxxxxxx0O” and “Xx0O.”

In addition to the above features that are commonly used for NER, dictionary features are also widely used in DNR systems [, ]. Three drug dictionaries are used to generate dictionary features in a way similar to [], denoting whether a word appears in a dictionary by {“Y”, “N”}. Dictionaries used in this paper are described as follows.
DrugBank. DrugBank [] contains 6825 drug entries including 1541 FDA-approved small molecule drugs, 150 FDA-approved biotech (protein/peptide) drugs, 86 nutraceuticals, and 5082 experimental drugs (http://www.drugbank.ca/downloads).
Drugs@FDA. Drugs@FDA (http://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm) is a database provided by the U.S. Food and Drug Administration. It contains information about FDA-approved drug names, generic prescription, over-the-counter human drugs, and biological therapeutic products.
In total, 8391 drug names are extracted from the Drugname and Activeingred fields of Drugs@FDA.
Jochem. Jochem [] is a joint chemical dictionary. 1527751 concepts are extracted from Jochem.

Moreover, a word embeddings feature, which can capture semantic relations among words, is also used.
Word Embeddings Feature. Word embedding learning algorithms can induce dense, real-valued vector representations (i.e., word embeddings) for words from large-scale unstructured texts. We use the skip-gram model proposed in [] to learn word embeddings on the article abstracts in the 2013 version of MEDLINE (http://www.nlm.nih.gov/databases/journal.html). Following previous works, we set the dimension of word embeddings to 50, and the word2vec tool (https://code.google.com/p/word2vec/) is used as the implementation of the skip-gram model. After inducing word embeddings, words are clustered into different semantic classes by the k-means clustering algorithm. The semantic class that a word belongs to is used as its word embeddings feature. The optimal number of semantic classes is selected from {100, 200, 300, …, 1000} via 10-fold cross-validation on the training set of the DDIExtraction 2013 challenge, and 400 is determined as the optimal number. […]
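
The BILOU labeling described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `bilou_tag_set` and `encode_bilou`, and the representation of an entity as a (start, end, type) token span with an exclusive end index, are assumptions made for illustration only.

```python
from typing import List, Tuple

# The four drug types defined in the DDIExtraction 2013 challenge.
DRUG_TYPES = ["drug", "brand", "group", "no-human"]


def bilou_tag_set(types: List[str] = DRUG_TYPES) -> List[str]:
    """Return every B/I/L/U tag for each type, plus the single O tag."""
    tags = [f"{prefix}-{t}" for t in types for prefix in "BILU"]
    return tags + ["O"]


def encode_bilou(n_tokens: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BILOU tags to a sentence of n_tokens tokens.

    entities: hypothetical format, non-overlapping (start, end, type)
    token spans where end is exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"U-{etype}"          # unit-length entity
        else:
            tags[start] = f"B-{etype}"          # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # interior tokens
            tags[end - 1] = f"L-{etype}"        # last token
    return tags


if __name__ == "__main__":
    print(len(bilou_tag_set()))
    print(encode_bilou(6, [(0, 1, "drug"), (2, 5, "group")]))
```

With four drug types, the tag set has 4 × 4 + 1 = 17 members, which matches the count given in the text.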
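
The orthographical, affix, and word shape features listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the source does not give the regular expressions, so the patterns, their order of precedence, the fallback label "Other", and the choice to skip affixes longer than the word are all hypothetical. Only the two word shape mappings follow directly from the description in the text.

```python
import re
from typing import Dict

# Assumed patterns, checked in this order; the source does not specify them.
ORTHO_PATTERNS = [
    ("All-capitalized", re.compile(r"[A-Z]+")),
    ("All-digits", re.compile(r"[0-9]+")),
    ("Is-capitalized", re.compile(r"[A-Z].*")),
    ("Alphanumeric", re.compile(r"[A-Za-z0-9]+")),
]


def orthographic_class(word: str) -> str:
    """Return the first matching class label, or "Other" (assumed fallback)."""
    for label, pattern in ORTHO_PATTERNS:
        if pattern.fullmatch(word):
            return label
    return "Other"


def hyphen_flag(word: str) -> str:
    """"Y" if the word contains a hyphen, otherwise "N"."""
    return "Y" if "-" in word else "N"


def affix_features(word: str, lengths=(3, 4, 5)) -> Dict[str, str]:
    """Prefixes and suffixes of lengths 3, 4, and 5 (skipped if the word is shorter)."""
    feats = {}
    for n in lengths:
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


def generalized_word_class(word: str) -> str:
    """Map each character to X (upper), x (lower), 0 (digit), or O (other)."""
    def char_class(ch: str) -> str:
        if ch.isupper():
            return "X"
        if ch.islower():
            return "x"
        if ch.isdigit():
            return "0"
        return "O"
    return "".join(char_class(ch) for ch in word)


def brief_word_class(word: str) -> str:
    """Collapse each run of identical shape characters to a single character."""
    return re.sub(r"(.)\1+", r"\1", generalized_word_class(word))


if __name__ == "__main__":
    print(generalized_word_class("Aspirin1+"), brief_word_class("Aspirin1+"))
```

The example word from the text, “Aspirin1+”, yields the two shapes quoted there.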
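
The dictionary features can be pictured as one {“Y”, “N”} flag per dictionary. The sketch below assumes single-token, case-insensitive lookup, which the source does not state; the function name `dictionary_features`, the feature key format, and the tiny placeholder dictionaries are hypothetical and stand in for the real DrugBank, Drugs@FDA, and Jochem name lists.

```python
from typing import Dict, Set


def dictionary_features(word: str, dictionaries: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return "Y" or "N" per dictionary, depending on whether the word is listed.

    Assumes each dictionary is a set of lowercased single-token names.
    """
    key = word.lower()
    return {
        f"in_{name}": ("Y" if key in entries else "N")
        for name, entries in dictionaries.items()
    }


if __name__ == "__main__":
    # Placeholder entries for illustration only, not extracted from the real resources.
    toy_dictionaries = {
        "DrugBank": {"aspirin"},
        "DrugsAtFDA": {"aspirin", "tylenol"},
        "Jochem": {"aspirin", "benzene"},
    }
    print(dictionary_features("Tylenol", toy_dictionaries))
```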
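
The step from word embeddings to a discrete feature can be sketched with a small k-means routine. This is a toy illustration, not the authors' pipeline: it uses a plain-Python Lloyd's algorithm instead of any particular clustering library, 2-dimensional made-up vectors instead of 50-dimensional word2vec output, 2 clusters instead of 400, and a deterministic initialization from the first k vectors. All function names and the "C<id>" label format are assumptions.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Sequence[float]], k: int, n_iter: int = 20) -> List[int]:
    """Lloyd's algorithm; centroids start at the first k vectors (assumed init)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def embedding_cluster_features(embeddings: Dict[str, Sequence[float]], k: int) -> Dict[str, str]:
    """Map each word to the label of the semantic class (cluster) it falls in."""
    words = sorted(embeddings)
    assignment = kmeans([embeddings[w] for w in words], k)
    return {w: f"C{a}" for w, a in zip(words, assignment)}


if __name__ == "__main__":
    # Made-up 2-D vectors for illustration only.
    toy_embeddings = {
        "aspirin": [0.9, 0.1],
        "ibuprofen": [0.8, 0.2],
        "tablet": [0.1, 0.9],
        "capsule": [0.2, 0.8],
    }
    print(embedding_cluster_features(toy_embeddings, k=2))
```

In the setting described in the text, the resulting class label would be used as one more categorical feature of the word, with the number of classes chosen by cross-validation.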

Pipeline specifications

Software tools: BioCreative, GENIA Tagger
Databases: DrugBank
Application: Information extraction