

MetaMap

Maps biomedical text to the Unified Medical Language System (UMLS) Metathesaurus, or, equivalently, discovers Metathesaurus concepts referred to in text. MetaMap breaks the text into phrases and then, for each phrase, returns the candidate mappings ranked according to the strength of the mapping. It is meant for applications that emphasize processing speed and ease of use. The tool is modular for local use thanks to its Java implementation. It allows the user to supply customized dictionaries, either to focus on a specific domain or to provide broad coverage of text and semantic types.
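The phrase-level candidate ranking can be illustrated with a toy scorer. This is only a minimal sketch: MetaMap's actual strength-of-mapping function combines centrality, variation, coverage, and cohesiveness, and the concept identifiers below are invented placeholders, not real UMLS CUIs.

```python
def rank_mappings(phrase, candidates):
    """Rank candidate concepts for one phrase by a crude overlap score.

    `candidates` is a list of (concept_id, preferred_name) pairs; the
    score used here (word overlap plus coverage of the concept name)
    only gestures at MetaMap's far richer strength-of-mapping measure.
    """
    phrase_words = set(phrase.lower().split())
    scored = []
    for cid, name in candidates:
        concept_words = set(name.lower().split())
        overlap = len(phrase_words & concept_words)
        coverage = overlap / len(concept_words)
        scored.append((overlap + coverage, cid, name))
    scored.sort(reverse=True)          # best mapping first
    return [(cid, name) for score, cid, name in scored if score > 0]
```

For the phrase "ocular complications", a candidate whose name covers both words outranks one matching a single word, mirroring how fuller mappings score higher.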


NegEx

Identifies negation in textual medical records. NegEx implements a list of phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. It supports lexical representations in other languages. The tool can be used to scan the documents and charts of an outgoing patient to make sure that the doctors haven't missed anything important. It has been translated into Swedish, French, and German and compared on corpora from each of these languages.
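The three steps described above (trigger phrases, pseudo-negation filtering, scope limiting) can be sketched as follows. The trigger and pseudo-negation lists are tiny illustrative samples, not NegEx's actual lexicon, and the fixed token window is a simplification of its scope rules.

```python
NEGATION_TRIGGERS = {"no", "denies", "without"}        # tiny sample of the real lexicon
PSEUDO_NEGATIONS = ["no increase", "not certain if"]   # look negated but are not

def negated_concepts(sentence, concepts, scope=4):
    """Return the concepts falling within `scope` tokens after a
    negation trigger, after filtering pseudo-negation phrases."""
    text = sentence.lower()
    for pseudo in PSEUDO_NEGATIONS:
        text = text.replace(pseudo, " ")   # step 2: drop false negations
    tokens = text.replace(",", " ").split()
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEGATION_TRIGGERS:       # step 1: find a trigger
            window = " ".join(tokens[i + 1 : i + 1 + scope])  # step 3: limit scope
            negated.update(c for c in concepts if c.lower() in window)
    return negated
```

In "Patient denies chest pain and reports severe ongoing fever", only "chest pain" lies inside the trigger's scope, while "no increase in pain" is filtered as a pseudo-negation.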


PubMedPortable

Provides the infrastructure to combine standalone applications by exporting different data formats. PubMedPortable automatically builds a PostgreSQL relational database schema and a Xapian full-text index on PubMed XML files, and provides an interface to BioC. The aim of PubMedPortable is to enable users with only basic programming knowledge to develop text mining applications and use cases. The integrated workflow allows users to retrieve, store, and analyse a disease-specific data set. The software library is small, easy to use, and scalable to the user's system requirements.


AnatomyTagger

Allows recognition of anatomical entity mentions in free text. AnatomyTagger is a machine learning-based system that identifies mentions ranging from molecular entities to whole organisms, facilitating comprehensive analysis of entity references in biological and medical text. The tagger integrates a variety of techniques shown to benefit tagging performance, including manually curated lexical resources, word representations induced from unannotated text, statistical truecasing, and non-local features.

AnatEM / Anatomical Entity Mention

Provides a corpus of documents that was used to train machine learning-based taggers. AnatEM was built in part on the Anatomical Entity Mention (AnEM) and Multi-Level Event Extraction (MLEE) corpora. It contains 1,212 documents: 600 drawn randomly from abstracts and full texts as in AnEM, and 612 from a targeted selection of PubMed abstracts relating to the molecular mechanisms of cancer. The corpus was split into separate training, development, and test sets.


ConText

Determines whether clinical conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient. ConText obtains reasonable to good performance for negated, historical, and hypothetical conditions across all report types that contain such conditions. It infers the status of a condition with regard to these properties from simple lexical clues occurring in the context of the condition. The tool is based on the approach used by NegEx for finding negated conditions in text. It can improve the precision of information retrieval and information extraction from various types of clinical reports.
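The "simple lexical clues" idea can be sketched by assigning a condition the property of the nearest trigger preceding it. The trigger lists below are small invented fragments; the real ConText lexicon is much larger and also handles scope-terminating terms and the "experiencer" property.

```python
import re

# Illustrative trigger fragments, one list per contextual property.
TRIGGERS = {
    "negated": ["no", "denies", "without"],
    "historical": ["history of", "previous"],
    "hypothetical": ["if", "should"],
}

def condition_context(sentence, condition):
    """Infer a condition's contextual property from the nearest trigger
    preceding it; default to 'affirmed' when no trigger is found."""
    text = sentence.lower()
    pos = text.find(condition.lower())
    if pos < 0:
        return None
    prefix = text[:pos]
    best_prop, best_pos = "affirmed", -1
    for prop, triggers in TRIGGERS.items():
        for trig in triggers:
            for m in re.finditer(r"\b" + re.escape(trig) + r"\b", prefix):
                if m.start() > best_pos:       # keep the nearest trigger
                    best_prop, best_pos = prop, m.start()
    return best_prop
```

In "History of diabetes, denies chest pain", "diabetes" resolves to historical while "chest pain" resolves to negated, because each condition takes the property of its closest preceding trigger.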

Vigi4Med Scraper

Extracts structured data from web forums. Vigi4Med Scraper is part of the Vigi4Med project for detecting adverse drug reactions in social networks. It is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource.

gnparser / Global Names Parser

Parses scientific names of any complexity. gnparser identifies which combinations of the most atomic parts of a name-string represent words or dates. It allows developers to define the rules that describe the general structure of target strings thanks to its implementation of a Parsing Expression Grammar (PEG). The tool can be used to form normalized names automatically, transforming names of taxa into their semantic elements. gnparser aims at complete coverage of the name-string variants found in biodiversity data, verified against its test suite.
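The decomposition into semantic elements can be illustrated with a toy rule for one simple binomial pattern. This is a single invented regex, not gnparser's grammar: the real PEG covers vastly more name-string variants (hybrids, infraspecific ranks, multiple authorships, and so on).

```python
import re

# One toy rule: "Genus species" with an optional "Author, Year" tail.
NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<species>[a-z]+)"
    r"(?:\s+(?P<author>[A-Z][A-Za-z.]+),\s*(?P<year>\d{4}))?$"
)

def parse_name(name_string):
    """Split a simple binomial name-string into its semantic elements."""
    m = NAME_RE.match(name_string.strip())
    if not m:
        return None
    parts = {k: v for k, v in m.groupdict().items() if v is not None}
    parts["canonical"] = f"{parts['genus']} {parts['species']}"
    return parts
```

Parsing "Homo sapiens Linnaeus, 1758" yields the genus, species epithet, authorship, year, and a normalized canonical form, which is the kind of output that lets names be matched across data sets.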


QUT In4M

Compiles and ranks information about more than 200 leading global research institutions according to their influence on industry and innovation. QUT In4M is an open database scoring these institutions using a method that merges scholarly work cited in the patent literature with the estimated perceived value of the patents. The repository is composed of three main panels that give an overview of the different institutions, allow users to make comparisons between them, and explain the methods applied for ranking.

ADEPt / Adverse Drug Event annotation Pipeline

Allows users to detect and annotate temporally anchored mentions of adverse drug events (ADEs) in a clinical text corpus. ADEPt is a modular pipeline that first identifies ADE mentions, then organizes them, and finally refines the classification using contextual indicators found in the source text. The application also includes a means of targeting ADE-specific patterns in psychiatric clinical text and an expandable dictionary covering over 60 common ADEs.

BNER / Biomedical Named Entity Recognition

Allows users to recognize important biomedical entities (such as genes and proteins) in text. BNER is a recurrent neural network (RNN) framework that captures the morphological and orthographic information of words. It uses an attention model to encode the character information of a word into a character-level representation. It then combines the character- and word-level representations and feeds them into a long short-term memory (LSTM) layer to model the context of each word.
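The attention-based pooling of character information can be sketched in plain Python. The vectors and the query below are made-up toy values: in the actual model these are learned parameters, and the concatenated representation feeds an LSTM layer (not shown here).

```python
import math

def attend_chars(char_vecs, query):
    """Pool per-character vectors into one word-level vector using
    softmax attention against a query vector (toy stand-in for the
    learned attention described above)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(c, query) for c in char_vecs]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]            # softmax attention weights
    dim = len(char_vecs[0])
    return [sum(w * c[d] for w, c in zip(weights, char_vecs)) for d in range(dim)]

def word_representation(char_vecs, word_vec, query):
    """Concatenate the attended character representation with the word
    embedding, mirroring the combination step in the description."""
    return attend_chars(char_vecs, query) + word_vec
```

Characters whose vectors align with the query receive higher weights, so orthographically informative characters (e.g. capitals or digits in a gene symbol) dominate the pooled representation.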


Gene-cocite

Accepts a list of genes and returns a list of the papers that cocite any two or more of the genes. Gene-cocite is a web app designed to be an easy-to-use first step for biological researchers investigating the background of their gene list. It is intended as a step between browsing PubMed and looking for functional assignments. Sixteen different organisms can be investigated. The proportion of genes in the list that are cocited with at least one other gene is also given.
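The core cocitation step reduces to inverting a gene-to-papers mapping. A minimal sketch, with placeholder gene and paper identifiers:

```python
def cocitations(gene_papers):
    """Given {gene: set of paper IDs citing it}, return the papers that
    cocite two or more input genes, plus the proportion of input genes
    cocited with at least one other gene."""
    papers = {}
    for gene, paper_ids in gene_papers.items():
        for pid in paper_ids:
            papers.setdefault(pid, set()).add(gene)   # invert the mapping
    cociting = {pid: genes for pid, genes in papers.items() if len(genes) >= 2}
    cocited = set().union(*cociting.values()) if cociting else set()
    return cociting, len(cocited) / len(gene_papers)
```

If geneA and geneB share one citing paper while geneC shares none, the result is that single cociting paper and a cocited proportion of 2/3.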

Multi-Label Representation

Recognizes not only disorder mentions in the form of contiguous or discontiguous words but also mentions whose spans overlap with each other. Multi-Label Representation is an approach to recognizing disorder mentions in clinical narratives, which can be very complicated in some circumstances. By using binary digits to record the details of each disorder mention, the multi-label scheme enables recognition of complicated disorder mentions, e.g. those overlapping with each other.
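The binary-digit idea can be sketched directly: each token carries one digit per mention, so overlapping and discontiguous spans pose no representational problem. The example tokens and spans below are invented for illustration.

```python
def encode_multilabel(tokens, mention_spans):
    """Label each token with a tuple of binary digits, one digit per
    mention; digit i is 1 when the token belongs to mention i. This
    represents discontiguous and mutually overlapping mentions, which
    a single BIO tag per token cannot."""
    return [
        tuple(1 if idx in span else 0 for span in mention_spans)
        for idx in range(len(tokens))
    ]
```

For "left atrium dilated and hypertrophied", the overlapping mentions "left atrium dilated" (tokens 0-2) and the discontiguous "left atrium ... hypertrophied" (tokens 0, 1, 4) are both recoverable from the per-token tuples.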

MedXN / Medication Extraction and Normalization

Extracts comprehensive medication information and normalizes it to the most appropriate RxNorm concept unique identifier (RxCUI) as specifically as possible. MedXN focuses on medication normalization by mapping the comprehensive medication description to the best-matching RxCUI. It externalizes its pattern-matching rules and allows end users to easily customize them according to their needs, creating a flexible mechanism for the adaptation process. This method is part of the Open Health Natural Language Processing (OHNLP) package.
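The externalized-rules idea can be sketched with user-editable regex fragments. The attribute names and patterns below are invented stand-ins (MedXN ships its own rule files), and the RxCUI lookup step is omitted entirely.

```python
import re

# Externalized, user-editable rule fragments; swapping a pattern here
# adapts the extractor without touching the code, which is the point
# of externalizing rules.
RULES = {
    "drug": r"(?P<drug>[A-Za-z]+)",
    "strength": r"(?P<strength>\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml))",
    "frequency": r"(?P<frequency>once daily|twice daily|bid|tid)",
}
PATTERN = re.compile(
    RULES["drug"] + r"\s+" + RULES["strength"] + r"\s+" + RULES["frequency"],
    re.IGNORECASE,
)

def extract_medication(text):
    """Pull drug, strength, and frequency attributes from free text."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None
```

A full pipeline would then map the extracted attributes to the best-matching RxCUI; here the sketch stops at the structured attributes.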

CLAMP / Clinical Language Annotation Modeling and Processing Toolkit

Enables recognition and automatic encoding of clinical information in narrative patient reports. CLAMP is based on methods proven in many clinical Natural Language Processing (NLP) challenges. It is customizable thanks to components such as named entity recognition, assertion detection, a UMLS encoder, and component customization. The tool allows users to annotate target documents, generate models, and process clinical notes.

iProLINK / integrated Protein Literature, Information and Knowledge

Facilitates text mining/NLP research in the areas of literature-based database curation, named entity recognition, and ontology development. iProLINK is a resource for protein literature mining. The database can be used by computational or biological researchers to explore literature information on proteins and their features or properties. It also serves as a knowledge link bridging protein databases and scientific literature.


TextHunter

Assists in making concept extraction technologies accessible to groups without informatics support. TextHunter is a program that guides a user through all of the processes required to create and apply a concept extraction model for a selection of documents, from start to finish. It was designed to operate as a standalone 'offline' program on desktop hardware. The tool has been used to extract a diverse set of concepts that are typically in demand in clinical research environments.

Java MIAPE API / Java Minimal Information About a Proteomics Experiment API

Permits extraction and management of MIAPE information from commonly used proteomics data files. Java MIAPE API is an open source Java application programming interface (API) designed in modules that contain: (i) the classes needed to represent the MIAPE information of the different types of experiments, (ii) the utility classes for the creation of the MIAPE document objects, (iii) the methods for extracting the MIAPE information from commonly used proteomics data files, and (iv) the methods to be implemented by a persistence system.

YASMEEN / Yet Another Species Matching Execution Environment

Identifies species names matching between a set of input data and multiple reference data sets. YASMEEN is based on the Concept Matching Engine and Tools (COMET), which models and supports generic data matching processes. It can be configured to include and combine a set of matchlets in one run, where each matchlet deals with specific attributes of the species data model. Each matchlet in turn produces a matching score according to its nature.
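The matchlet combination can be sketched as a set of attribute-specific scoring functions merged into one weighted score. The two matchlets and the weighting scheme below are invented illustrations, not COMET's actual implementation.

```python
def combine_matchlets(record, reference, matchlets, weights=None):
    """Run each matchlet (one attribute-specific scorer) and merge the
    per-attribute scores into a weighted overall match score."""
    scores = {name: fn(record, reference) for name, fn in matchlets.items()}
    weights = weights or {name: 1.0 for name in matchlets}
    overall = sum(scores[n] * weights[n] for n in scores) / sum(weights.values())
    return overall, scores

# Two toy matchlets over a minimal species data model.
def genus_matchlet(rec, ref):
    return 1.0 if rec["genus"] == ref["genus"] else 0.0

def epithet_matchlet(rec, ref):
    # Crude prefix comparison of the species epithet.
    return 1.0 if rec["species"][:4] == ref["species"][:4] else 0.0
```

Keeping each matchlet independent means new attribute comparisons can be plugged in without changing the combination logic, which is the configurability the description refers to.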

CLAMP-Cancer / Clinical Language Annotation Modeling and Processing Toolkit for Cancer

Extracts comprehensive types of cancer-related information from pathology reports. CLAMP-Cancer builds on a set of high-performance Natural Language Processing (NLP) components that were proven in several clinical NLP challenges such as i2b2, ShARe/CLEF, and SemEval. It checks for errors in sequence and directs the user to the appropriate logical order, inserting the components required for a working pipeline. Its components are supported by knowledge resources consisting of medical abbreviations, dictionaries, section headers, and a corpus of 400 annotated clinical notes derived from MTsamples.


TinySVM

Implements Support Vector Machines (SVMs) for pattern recognition problems. TinySVM belongs to a new generation of learning algorithms based on recent advances in statistical learning theory, applied to a large number of real-world applications such as text categorization and hand-written character recognition. It uses a sparse vector representation and supports standard C-SVR and C-SVM. TinySVM can handle tens of thousands of training examples and hundreds of thousands of feature dimensions.
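The sparse vector representation is what makes hundreds of thousands of feature dimensions tractable: only non-zero features are stored and multiplied. A minimal sketch of a linear-kernel decision function over dict-based sparse vectors, with invented support vectors and coefficients (training is not shown):

```python
def svm_decision(x, support_vectors, alphas, labels, bias=0.0):
    """Linear-kernel SVM decision value f(x) = sum_i a_i y_i <sv_i, x> + b,
    with vectors stored sparsely as {feature: value} dicts."""
    def sparse_dot(a, b):
        if len(a) > len(b):              # iterate over the smaller dict
            a, b = b, a
        return sum(v * b.get(k, 0.0) for k, v in a.items())
    return sum(
        a * y * sparse_dot(sv, x)
        for sv, a, y in zip(support_vectors, alphas, labels)
    ) + bias
```

Because the dot product touches only stored features, the cost per example depends on the number of non-zero features rather than the full dimensionality, which is exactly the property a sparse text-categorization SVM relies on.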