Species named entity recognition/normalization software tools | Text mining
Species recognition has become increasingly important for the text mining community in recent years. In particular, it has been shown that accurately recognizing species and linking them to relevant genes or proteins is critical to the success of many downstream tasks such as gene normalization (GN) and protein-protein interaction extraction.
A named entity recognition system intended primarily for biomedical text. BANNER uses conditional random fields as the primary recognition engine and includes a wide survey of the best techniques described in recent literature.
An open source, stand-alone software system that recognizes and normalizes species name mentions with speed and accuracy, and that can therefore be integrated into a range of bioinformatics and text-mining applications. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against its authors' manually annotated corpus, LINNAEUS achieves 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level.
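The dictionary-based idea can be illustrated with a small token trie, which behaves like a deterministic automaton over words. This is only a sketch of the general technique, not LINNAEUS's actual implementation: the `SPECIES` table, `build_trie`, and `find_species` are hypothetical names, the dictionary is a toy subset, and no disambiguation heuristics are included.

```python
import re

# Toy dictionary (assumption): species name or synonym -> NCBI Taxonomy ID.
SPECIES = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "escherichia coli": 562,
}

END = "$id"  # marker key; cannot collide with purely alphabetic tokens


def build_trie(names):
    """Build a token-level trie; terminal nodes store the taxonomy ID."""
    root = {}
    for name, tax_id in names.items():
        node = root
        for token in name.split():
            node = node.setdefault(token, {})
        node[END] = tax_id
    return root


def find_species(text, trie):
    """Return (mention, start, end, tax_id) tuples using longest-match lookup."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"[A-Za-z]+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        # Follow trie transitions as far as the tokens allow.
        while j < len(tokens) and tokens[j][0] in node:
            node = node[tokens[j][0]]
            j += 1
            if END in node:
                best = (j, node[END])  # remember the longest accepted match
        if best:
            end_idx, tax_id = best
            start, end = tokens[i][1], tokens[end_idx - 1][2]
            hits.append((text[start:end], start, end, tax_id))
            i = end_idx
        else:
            i += 1
    return hits
```

Calling `find_species("Genes in Homo sapiens and mouse differ.", build_trie(SPECIES))` yields one hit per mention with its character offsets and taxonomy ID. A production system would compile a far larger dictionary and add rules for ambiguous names.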
Attempts to index taxonomic names using pattern-matching expressions and a lexicon of English words, providing a confidence score for each resulting name. Its rule-based, word-frequency and regular expression-based approaches capture genus–species combinations with high levels of precision and recall.
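A rough sketch of this style of detection, assuming a simple "Genus species" pattern, a tiny English lexicon used as a filter, and an invented scoring rule based on Latin-looking endings. The names `BINOMIAL`, `ENGLISH_WORDS`, `LATIN_ENDINGS`, and `score_candidates`, and the score weights, are all hypothetical and do not reproduce the tool's own rules.

```python
import re

# Capitalized word followed by a lowercase word: a candidate binomial.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

# Tiny stand-in for an English lexicon (assumption); real lexicons are much larger.
ENGLISH_WORDS = {"the", "this", "these", "and", "with", "from", "results",
                 "show", "that", "were", "was", "found", "samples"}

# Common Latin-looking suffixes, used here only as an illustrative signal.
LATIN_ENDINGS = ("us", "a", "um", "is", "ae", "ii", "ensis")


def score_candidates(text):
    """Return (candidate, confidence) pairs for binomial-like strings."""
    results = []
    for m in BINOMIAL.finditer(text):
        genus, epithet = m.group(1), m.group(2)
        # Discard candidates whose parts are ordinary English words.
        if genus.lower() in ENGLISH_WORDS or epithet in ENGLISH_WORDS:
            continue
        score = 0.5  # base confidence for matching the pattern at all
        if genus.endswith(LATIN_ENDINGS):
            score += 0.25
        if epithet.endswith(LATIN_ENDINGS):
            score += 0.25
        results.append((m.group(0), score))
    return results
```

For example, `score_candidates("These results show that Quercus alba was found with Pinus strobus.")` keeps the two binomials and rejects "These results" through the lexicon filter.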
A hybrid rule-based/machine learning system to extract organism mentions from the literature. OrganismTagger includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID.
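The normalization step described above (expanding an abbreviation to a canonical name, then grounding it to an NCBI Taxonomy ID) might look roughly like the following. This is a minimal sketch under stated assumptions: `NCBI_IDS` is a two-entry toy lookup, `normalize_mentions` is a hypothetical helper, and abbreviations are resolved only against full names found in the same text. It is not OrganismTagger's actual pipeline.

```python
import re

# Toy lookup (assumption): canonical name -> NCBI Taxonomy ID.
NCBI_IDS = {
    "Escherichia coli": 562,
    "Saccharomyces cerevisiae": 4932,
}

FULL = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")  # e.g. "Escherichia coli"
ABBR = re.compile(r"\b([A-Z])\. ([a-z]+)\b")      # e.g. "E. coli"


def normalize_mentions(text):
    """Resolve abbreviated mentions to (mention, canonical name, tax_id)."""
    # Index full names seen in the text by (genus initial, species epithet).
    seen = {}
    for m in FULL.finditer(text):
        name = m.group(0)
        if name in NCBI_IDS:
            seen[(name[0], m.group(2))] = name

    resolved = []
    for m in ABBR.finditer(text):
        full = seen.get((m.group(1), m.group(2)))
        if full:  # ground only abbreviations with a known expansion
            resolved.append((m.group(0), full, NCBI_IDS[full]))
    return resolved
```

Given "Escherichia coli was cultured. E. coli grew quickly.", the abbreviated mention is expanded to "Escherichia coli" and grounded to ID 562. An abbreviation with no matching full name in the text is left unresolved.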
An open source tool for species recognition and disambiguation in biomedical text. In addition to the species detection function found in existing tools, SR4GN is optimized for the gene normalization task: it links detected species with the corresponding gene mentions in a document. SR4GN achieves 85.42% accuracy and compares favorably to other state-of-the-art techniques in benchmark experiments. SR4GN is implemented as a standalone software tool, making it convenient and robust for use in many text-mining applications.
Detects scientific names in plain text. Given a string, taxonfinder scans its contents and uses a dictionary-based approach to identify which words and strings are Latin scientific organism names. It detects names at all ranks, including species, genera, subspecies and more.
An NLP (natural language processing) solution written in PHP, which leverages the systematic rules of taxonomic nomenclature used in scientific publications. TaxonGrab may show general utility for indexing documents containing embedded taxonomic names.