

Allows users to explore PubMed search results with the Gene Ontology (GO), a hierarchically structured vocabulary for molecular biology. GoPubMed provides the following benefits. First, it gives an overview of the literature by categorizing abstracts according to the GO, allowing users to quickly navigate through the abstracts by category. Second, it automatically shows general ontology terms related to the original query, which often do not even appear directly in the abstract. Third, it enables users to verify its classification because GO terms are highlighted in the abstracts and each term is labelled with an accuracy percentage. Fourth, it shows definitions of GO terms without the need for further lookup.

CIIPro / Chemical In vitro-In vivo Profiling

A package to link chemical features and in vitro biological data with targeted in vivo biological activity. The CIIPro portal can automatically extract in vitro biological data from public resources for user-supplied compounds and identify the most similar compounds based on their optimized bioprofiles. Compared to existing hybrid approaches, the CIIPro portal provides a new read-across strategy to deal with missing and biased data when using public data sources.

SSBT / Split Sequence Bloom Tree

An indexing scheme to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the Sequence Bloom Tree (SBT) data structure for the same task. This tool is independent of the eventual queries, so the approach is not limited to searching only for known genes, but can potentially identify arbitrary sequences. SSBTs can be efficiently built, extended, and stored in limited space and do not require retaining the original sequence files. Using SSBT, datasets can be searched using low memory for the existence of arbitrary query sequences.
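
The query idea behind a Sequence Bloom Tree can be illustrated with a minimal sketch. It is not the SSBT implementation: it shows only the plain SBT scheme (each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the union of its children) and omits the split filters that SSBT adds. The class names, filter size, hash construction, and the threshold `theta` are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter stored as an integer bitmask (illustrative sizes)."""

    def __init__(self, size=1024, num_hashes=3, bits=0):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        return BloomFilter(self.size, self.num_hashes, self.bits | other.bits)


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom, self.name, self.children = bloom, name, children


def build_tree(experiments, k):
    """experiments maps a name to a list of reads; returns the root node."""
    nodes = []
    for name, reads in experiments.items():
        bloom = BloomFilter()
        for read in reads:
            for kmer in kmers(read, k):
                bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    # Merge neighbouring nodes pairwise until a single root remains.
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])
            else:
                parent = pair[0].bloom.union(pair[1].bloom)
                merged.append(Node(parent, children=tuple(pair)))
        nodes = merged
    return nodes[0]


def query(node, query_kmers, theta=0.8):
    """Return leaf names whose filters contain at least theta of the query k-mers."""
    hits = sum(kmer in node.bloom for kmer in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune: no experiment below this node can match
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found
```

Because a parent filter is the union of its children, a subtree can be skipped as soon as its root fails the threshold, which is why the original reads do not need to be kept once the filters are built. Bloom filters can report false positives, so the returned experiments are candidates rather than guaranteed matches.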


A text-mining software tool that integrates several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem, and tmVar) and offers a batch-processing mode able to process arbitrary text input (e.g., scholarly publications, patents, and medical records) in multiple formats (e.g., BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have pre-processed all PubMed articles and use a computer cluster for processing large requests of arbitrary text.

Cell line recognition

Cell line recognition and normalization system, supporting corpora and tagged documents. The aim is to create corpora that are suitable for training and evaluating machine learning systems to recognize and normalize established cell line names from text. We created two manually annotated corpora, Gellus and CLL. Gellus is suitable for training machine learning systems to recognize cell line name mentions, while CLL is for evaluating how well such systems recognize Cellosaurus cell line names.

PWTEES / PathWay Turku Event Extraction System

Extracts pathway interactions from the literature utilizing an existing event extraction tool and pathway named entity recognition (PathNER). PWTEES can be used to enrich the molecular context of diseases by applying large-scale text mining of events involving genes and pathways. We extended a state-of-the-art text mining system by introducing pathway named entity recognition to identify interactions involving both genes/proteins and pathways.


Improves grounding and relationship resolution for molecular entities commonly encountered in mining and curation of biomedical text. Bioentities is a curated resource that contains a set of identifiers representing protein families and complexes along with multiple types of mappings: (i) links between text strings and Bioentities identifiers, (ii) between Bioentities identifiers and identifiers representing protein families and complexes in other resources, and (iii) between Bioentities families/complexes and their constituent members.


Describes systematic chemical nomenclature. LeadMine is used for the identification and annotation of chemicals, protein targets, genes, diseases, species, named reactions, company names, and cell lines. It uses a mixture of expertly curated grammars and dictionaries, as well as dictionaries automatically derived from public resources. We show that the heuristics developed to filter our dictionary of trivial chemical names (from PubChem) yield a better-performing dictionary than the previously published Jochem dictionary. LeadMine differs from conventional machine learning approaches by being able to attribute all entities to a specific dictionary or grammar.


Exploits unlabeled data for incorporating domain knowledge into a named entity recognition model. BANNER-CHEMDNER includes natural language processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER unstructured information management architecture interface.


Identifies non-elliptical entity mentions in a coordinated noun phrase (NP) with ellipses. medtextmining proposes both an intuitive graph-like representation and a formal algebraic representation of a coordinated NP with ellipses. It is based on a practical named entity recognition (NER) system that identifies non-elliptical entity mentions using linguistic rules and an entity mention dictionary. The system was optimized with the Apriori algorithm, which greatly reduces the processing time for resolving ellipses.

miRLiN / miRNA Literature Network

A semantic indexing method to extract relationships between terms and miRNAs directly from the biomedical literature. miRLiN provides access to a latent semantic indexing model, which contains the most recent and comprehensive collection of miRNA abstracts in MEDLINE. Users can query the model with any combination of terms or miRNAs. When querying with terms, miRLiN ranks all miRNAs in the collection with respect to semantic associations to the query. Selected miRNAs and terms can be visualized as a network graph, where the nodes represent the selected miRNAs and terms and the edges represent cosine values. LSI modeling of MEDLINE abstracts can be useful for knowledge discovery.
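
The cosine-based ranking and network edges described above can be sketched as follows. This is an illustration only, not miRLiN code: it assumes the miRNAs and terms have already been projected into a latent semantic space, and the function names and the `min_cosine` cutoff are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Rank candidate names by cosine similarity to the query, highest first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def network_edges(vectors, min_cosine=0.0):
    """Build weighted edges between every pair of nodes at or above the cutoff."""
    edges = []
    for (a, vec_a), (b, vec_b) in combinations(vectors.items(), 2):
        weight = cosine(vec_a, vec_b)
        if weight >= min_cosine:
            edges.append((a, b, weight))
    return edges
```

A query vector is compared against every miRNA vector to produce the ranking, and the same cosine values serve as edge weights when selected miRNAs and terms are drawn as a graph.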


A text mining tool to find new associations between drugs. DrugQuest clusters DrugBank records based on their textual information in a multidimensional vector space. We mainly apply partitional clustering algorithms in order to group together DrugBank records based on their textual information. Toxicity, targeted pathways, targeted proteins, diseases and/or other interactors are a few examples of such textual information. Uniquely assigning DrugBank records to clusters, based on tagged terms such as pathways, diseases, molecules, and biological processes, can make DrugQuest a promising tool for new concept discovery and detection of new drug associations.
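
Partitional clustering of text records in a vector space can be sketched with a small k-means example. This is a generic illustration under stated assumptions, not DrugQuest's pipeline: the bag-of-words vectorization, the choice of k-means, and the explicit seed indices are all assumptions, and any input texts are placeholders rather than DrugBank records.

```python
from collections import Counter


def vectorize(texts):
    """Turn each text into a term-count vector over the shared vocabulary."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, seeds, max_iter=20):
    """Assign each vector to exactly one cluster; seeds are indices of initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        for c in range(len(centroids)):
            members = [vec for vec, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each record receives exactly one label, which matches the unique assignment of records to clusters described above. Seeding from fixed indices keeps the sketch deterministic; a real system would more likely use randomized or k-means++ initialization.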


A web-based NCBI-PubMed search application, which can analyze articles for selected biomedical verbs and give users relational information, such as subject, object, location, manner, time, etc. After receiving keyword query input, BWS retrieves matching PubMed abstracts and lists them along with snippets by order of relevancy to protein-protein interaction. Users can then select articles for further analysis, and BWS will find and mark up biomedical relations in the text. The analysis results can be viewed in the abstract text or in table form.

biomsef / BIOMedical Search Engine Framework

An open-source framework for the fast and lightweight development of domain-specific search engines. biomsef integrates taggers for major biomedical concepts, such as diseases, drugs, genes, proteins, compounds and organisms, and enables the use of domain-specific controlled vocabulary. The rationale behind this framework is to incorporate core features typically available in search engine frameworks with flexible and extensible technologies to retrieve biomedical documents, annotate meaningful domain concepts, and develop highly customized Web search interfaces.


A web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents.

NERsuite / Named Entity Recognition Suite

Simplifies research experiments. NERsuite combines NLP components such as a tokenizer, POS tagger, lemmatizer, and chunker. It contains three sub-functions: (1) a tokenizer, (2) a modified version of the GENIA tagger, and (3) a named entity recognizer. This tool was tested on two biomedical Named Entity Recognition (NER) tasks. It can compute the beginning and past-the-end positions of a given sentence.

OSCAR / Open-Source Chemistry Analysis Routines

Software for the recognition of named entities and data in chemistry publications. OSCAR4 can be used to identify chemical names, reaction names, ontology terms, enzymes, chemical prefixes and adjectives, and chemical data such as state, yield, IR, NMR and mass spectra, and elemental analyses. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup or by name-to-structure parsing using the Open Parser for Systematic IUPAC Nomenclature (OPSIN), or with identifiers from the Chemical Entities of Biological Interest (ChEBI) ontology.


An open-source software tool for identifying chemical names in biomedical literature, including chemical identifiers, drug brand and trade names, and systematic formats. tmChem uses conditional random fields with a rich feature set and rule-based post-processing modules for resolving local abbreviations and improving consistency. tmChem achieved the highest performance of any submission to the BioCreative IV CHEMDNER task (over 87% F-measure). The tmChem system combines two linear-chain conditional random field (CRF) models employing different tokenizations and feature sets. Model 1 is an adaptation of the BANNER named entity recognizer. It uses the MALLET toolkit and is implemented in Java. Model 2 is repurposed from part of the tmVar system for locating genetic variants. It uses the CRF++ toolkit and is implemented in Perl and C++. Both models employ multiple post-processing steps.


A hybrid system for extracting chemical entities from natural language texts. ChemSpot is based on a conditional random field trained for identifying International Union of Pure and Applied Chemistry (IUPAC) entities and a dictionary built from ChemIDplus for extracting drugs, abbreviations, molecular formulas and trivial names. Evaluations showed a major performance advantage compared with a freely available named entity recognition tool for chemical entities, OSCAR4. Thus, we believe that ChemSpot sets a new state-of-the-art in the recognition of chemical entities.

eFIP / extracting Functional Impact of Phosphorylation

A tool to support article selection and information extraction of functional impact of phosphorylated proteins. The current version focuses on protein-protein interactions (PPIs) as functional impact. In eFIP, PPIs refer to interactions between protein elements, including protein complexes and classes of proteins. Impact is defined as any direct relation between protein phosphorylation and PPI. The relation could be positive (phosphorylation of A increases binding to B), negative (when phosphorylated A dissociates from B) or neutral (phosphorylated A binds B).

BioInfer / Bio Information Extraction Resource

Provides the key types of annotation for a single set of sentences, expressing complex relationships between both physical and abstract entities. BioInfer is a public resource providing an annotated corpus of biomedical English that is aimed at developing information extraction (IE) systems and their components in the biomedical domain. This corpus is unique in the domain in combining annotation types for a single set of sentences, and in the level of detail of the relationship annotation.

OpenDMAP / Open Source Direct Memory Access Parser

Advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. OpenDMAP is an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.