Computational protocol: Improving chemical entity recognition through h-index based semantic similarity

Protocol publication

[…] For this competition, we used the implementation of Conditional Random Fields (CRFs) in Mallet [], with the default values; in particular, we used only an order-1 CRF. The following features were extracted from the training data to train the classifiers:

- Stem: stem of the word token, obtained with the Porter stemming algorithm.
- Prefix and suffix, size 3: the first and last three characters of a word token.
- Number: Boolean that indicates whether the token contains digits.

Furthermore, each token was given a different label depending on whether it was not a chemical entity, a single-word chemical entity, or the start, middle, or end of a chemical entity.

Since Mallet does not provide a confidence score for each label, we adapted the source code, following suggestions from the developers, so that a probability value is also returned for each label, according to the features of that token. This information was useful for adjusting the precision of our predictions and for ranking them by how confident the system is that the extracted mention is correct.

We used the provided CHEMDNER corpus, the DDI corpus, and the patents corpus to train multiple CRF classifiers, based on the different types of entities considered in each dataset. Each title and abstract from the test set was classified with each of these classifiers. In total, our system combined the results of fourteen classifiers: eight trained with the CHEMDNER corpus (7 types + 1 with every type), five trained with the DDI corpus (4 types + 1 with every type), and one trained with the patents corpus.

After participating in the BioCreative IV challenge, we implemented a more comprehensive feature set with the purpose of detecting chemical entities that would be missed by a smaller feature set.
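As an illustration, the basic feature set and the token labelling scheme described above can be sketched as follows. This is a hypothetical Python re-implementation (the original system used Mallet's Java CRF); the stem feature is omitted here, and the label names are illustrative:

```python
def token_features(token):
    """Basic per-token features: 3-character prefix and suffix, and a
    digit indicator (the Porter stem feature is omitted in this sketch)."""
    return {
        "prefix3": token[:3],
        "suffix3": token[-3:],
        "has_number": any(c.isdigit() for c in token),
    }

def label_entity_span(n_tokens, is_entity):
    """One label per token: O for non-entities, S for a single-word
    chemical entity, and B/I/E for the start, middle and end of a
    multi-word chemical entity."""
    if not is_entity:
        return ["O"] * n_tokens
    if n_tokens == 1:
        return ["S"]
    return ["B"] + ["I"] * (n_tokens - 2) + ["E"]
```

For example, `token_features("H2O")` sets the digit indicator to true, and a three-token entity mention is labelled `B I E`.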
These new features are based on orthographic and morphological properties of the words used to represent the entity, inspired by other CRF-based chemical named entity recognition systems that had also participated in the challenge [-,]. We integrated the following features:

- Prefix and suffix, sizes 1, 2 and 4: the first and last n characters of a word token.
- Greek symbol: Boolean that indicates whether the token contains Greek symbols.
- Case pattern: "Lower" if all characters are lower case, "Upper" if all characters are upper case, "Title" if only the first character is upper case, and "Mixed" if none of the others apply.
- Word shape: normalized form of the token, obtained by replacing every number with '0', every letter with 'A' or 'a', and every other character with 'x'.
- Simple word shape: simplified version of the word shape feature in which consecutive symbols of the same kind are merged.
- Periodic table element: Boolean that indicates whether the token matches a periodic table element symbol or name.
- Amino acid: Boolean that indicates whether the token matches a three-letter amino acid code.

With these new features, we were able to achieve better recall values while maintaining high precision. However, only the original list of features was used for the BioCreative IV challenge.

[...] With the named chemical entities successfully mapped to ChEBI identifiers, we were able to calculate Gentleman's simUI [] for each pair of entities in a fragment of text. This measure is a structural approach that exploits the directed acyclic graph (DAG) organization of ChEBI []. We then used the maximum semantic similarity value for each entity as a feature for filtering and ranking.

The output provided for each putative chemical named entity found is the classifier's confidence score, together with the most similar putative chemical named entity mentioned in the same document, identified through the maximum semantic similarity score.
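Two of the components above are concrete enough to sketch in code: the case pattern and word shape features, and Gentleman's simUI, which can be computed as the Jaccard index of the two terms' ancestor sets in the ontology DAG. This is a hypothetical Python sketch; `TOY_CHEBI` is a tiny stand-in for the ChEBI is-a graph, not real ChEBI data:

```python
from itertools import groupby

def case_pattern(token):
    """'Lower', 'Upper', 'Title', or 'Mixed', as defined above."""
    if token.islower():
        return "Lower"
    if token.isupper():
        return "Upper"
    if token[0].isupper() and token[1:].islower():
        return "Title"
    return "Mixed"

def word_shape(token):
    """Digits -> '0', letters -> 'A'/'a', everything else -> 'x'."""
    return "".join(
        "0" if c.isdigit()
        else (("A" if c.isupper() else "a") if c.isalpha() else "x")
        for c in token
    )

def simple_word_shape(token):
    """Word shape with runs of identical symbols merged."""
    return "".join(symbol for symbol, _ in groupby(word_shape(token)))

def ancestors(dag, term):
    """The term plus everything reachable upward in a DAG given as a
    child -> list-of-parents mapping."""
    seen, stack = {term}, [term]
    while stack:
        for parent in dag.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim_ui(dag, t1, t2):
    """Gentleman's simUI: Jaccard index of the two ancestor sets."""
    a, b = ancestors(dag, t1), ancestors(dag, t2)
    return len(a & b) / len(a | b)

# A toy stand-in for the ChEBI is-a DAG (hypothetical terms).
TOY_CHEBI = {
    "ethanol": ["alcohol"],
    "methanol": ["alcohol"],
    "alcohol": ["organic compound"],
    "benzene": ["organic compound"],
}
```

On the toy graph, `sim_ui(TOY_CHEBI, "ethanol", "methanol")` is higher than `sim_ui(TOY_CHEBI, "ethanol", "benzene")`, since the two alcohols share more ancestors; the protocol keeps, for each entity, the maximum such value over the other entities in the same text fragment.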
Using this information, along with the ChEBI mapping score, we were able to gather 29 features for each prediction. When a chemical entity mention was detected by at least one classifier, but not by all of them, the confidence score of the classifiers that did not detect the mention was considered to be 0. These features were used to train a classifier able to filter false positives from our results with minimal effect on recall. We used the predictions obtained by cross-validation on the training and development sets to train different Weka [] classifiers, using the various methods implemented by Weka. The method that returned the best results was Random Forest, so we used that classifier on our test set predictions. […]
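The zero-filling rule for the per-classifier confidence scores can be sketched as follows. This is a hypothetical Python sketch: the classifier names are placeholders for the fourteen CRF models, and the exact composition of the 29 features is not reproduced here:

```python
# Placeholder names for the fourteen CRF classifiers described above:
# 8 from CHEMDNER (7 types + all), 5 from DDI (4 types + all), 1 patents.
CLASSIFIERS = (
    ["chemdner_all"] + ["chemdner_t%d" % i for i in range(7)]
    + ["ddi_all"] + ["ddi_t%d" % i for i in range(4)]
    + ["patents"]
)

def feature_vector(confidences, chebi_score, max_sim):
    """Per-mention features: one confidence per classifier, with 0.0
    for every classifier that did not detect the mention, followed by
    the ChEBI mapping score and the maximum simUI similarity."""
    return ([confidences.get(name, 0.0) for name in CLASSIFIERS]
            + [chebi_score, max_sim])
```

Vectors built this way would then be fed to the filtering classifier (Random Forest in Weka, in the original protocol).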

Pipeline specifications

Software tools: BioCreative, Weka
Databases: ChEBI, DDI Corpus
Application: Information extraction