A free service that tags gene, protein, and small molecule names in any web page within a few seconds. Clicking on a tagged term opens a small popup showing summary information. Reflect can be installed as a plugin for Firefox or Internet Explorer, or can be used by entering a URL in the field provided on its web page.
Permits users to annotate entities by using a graphical web-based user interface called BRAT. NeuroNER performs named-entity recognition (NER) and offers the following advantages: i) the exploitation of the state-of-the-art prediction capabilities of neural networks, and ii) the creation or modification of annotations for a new or existing group.
A web-based text mining tool that extracts and incorporates comprehensive knowledge about E3s with their underlying mechanisms. E3Miner integrates available E3 data not only from the published literature but also from the biological databases, using natural language processing techniques.
An open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art.
A web application that combines Information Retrieval and Extraction from Medline. EBIMed finds Medline abstracts in the same way PubMed does. Then it goes a step beyond and analyses them to offer a complete overview on associations between UniProt protein/gene names, GO annotations, Drugs and Species.
Provides a part-of-speech tagger trained on the MEDLINE corpus. MedPost accepts text for tagging in either native MEDLINE format or XML, both available as save options in PubMed. It is based on a stochastic tagger that employs a hidden Markov model (HMM). The tagger is able to achieve high accuracy by using the contextual information in the HMM to resolve ambiguities.
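To illustrate how an HMM tagger of the kind described for MedPost can use context to resolve ambiguity, here is a minimal Viterbi decoding sketch in Python. The tag set, the probabilities, and the `viterbi` function are invented for illustration; they are not MedPost's actual model or code.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    def logp(p):
        # Floor zero probabilities so the logarithm stays defined.
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (logp(start_p.get(t, 0.0)) + logp(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for word in words[1:]:
        row = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + logp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda item: item[1],
            )
            row[t] = (score + logp(emit_p[t].get(word, 0.0)), prev)
        best.append(row)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy parameters (assumptions, not trained on MEDLINE).
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 1.0},
    "NN": {"cells": 0.5, "signal": 0.5},
    "VB": {"signal": 0.9, "divide": 0.1},
}

after_noun = viterbi(["the", "cells", "signal"], TAGS, START, TRANS, EMIT)
after_determiner = viterbi(["the", "signal"], TAGS, START, TRANS, EMIT)
```

In this toy model the ambiguous word "signal" is tagged as a verb after the noun "cells" and as a noun after the determiner "the", which is the kind of contextual disambiguation the transition probabilities provide.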
Implements the alpha-closed frequent subtree method. The Glycan Miner Tool was able to extract a significant pattern from glycan array data. It was proved using a viral infection experiment on cells with modified glycans on the cell surface. It is also used to analyze the glycan array data of influenza viruses to find novel glycan structures other than sialic acid (SA) that may be involved in viral infection.
A free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities.
Allows different types of sentence extraction. BioIE employs predefined categories of interest relating to proteins and custom extraction around different entities and concepts, together with statistical feedback on the source and extracted text. It uses five predefined categories of interest relating to proteins: structure, function, diseases and therapeutic compounds, localization, and familial relationships.
Builds protein reports from related entries in Swiss-Prot. METIS employs data in the Swiss-Prot entries to find relevant literature, or to find search terms with which to seek this out. It reduces the time required to seek out and read relevant literature. This tool is able to extract pertinent sentences from the biomedical literature.
A web-based interactive knowledge exploration platform with significant advances over its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents.
A text-mining software tool that integrates several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem, and tmVar) and offers a batch-processing mode able to process arbitrary text input (e.g., scholarly publications, patents and medical records) in multiple formats (e.g., BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have pre-processed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text.
Cell line recognition and normalization system, supporting corpora and tagged documents. The aim is to create corpora that are suitable for training and evaluating machine learning systems to recognize and normalize established cell line names from text. We created two manually annotated corpora, Gellus and CLL. Gellus is suitable for training any machine learning system to recognize cell line name mentions, while CLL is for evaluating systems in recognizing the Cellosaurus cell line names.
A learning algorithm for unsupervised feature extraction, specifically designed for analysing noisy and high-dimensional datasets. KODAMA consists of two main parts: (i) the first step involves random assignment of each sample to a different class; (ii) in the second step, the cross-validated accuracy is maximized by an iterative procedure by swapping the class labels.
Uses both character embedding and word embedding for biomedical named entity recognition (NER) tasks. GRAM-CNN is an end-to-end model that extracts local information between a target word and its neighbors and requires no task-specific resources or handcrafted features. The software can theoretically be applied to a wide range of BioNER tasks. The approach was evaluated on three biomedical datasets.
A tool to support article selection and information extraction of functional impact of phosphorylated proteins. The current version focuses on protein-protein interactions (PPIs) as functional impact. In eFIP, PPIs refer to interactions between protein elements, including protein complexes and classes of proteins. Impact is defined as any direct relation between protein phosphorylation and PPI. The relation could be positive (phosphorylation of A increases binding to B), negative (when phosphorylated A dissociates from B) or neutral (phosphorylated A binds B).
Helps to identify environment descriptors, organisms, tissues and diseases mentioned in text and to annotate these using ontology/taxonomy terms. EXTRACT consists of a server that performs the Named Entity Recognition (NER) task, a bookmarklet that allows users to submit text from a web page to the server, and a popup that allows users to inspect the identified terms and extract these annotations in tabular form.
Identifies non-elliptical entity mentions in a coordinated noun phrase (NP) with ellipses. medtextmining proposes both intuitive graph-like and formal algebraic representation of a coordinated NP with ellipses. It is based on a practical named entity recognition (NER) system that effectively identified non-elliptical entity mentions using linguistic rules and an entity mention dictionary. The system was optimized by the Apriori algorithm which greatly reduces processing time for resolving ellipsis.
Recognizing and normalizing protein name mentions in biomedical literature is a challenging task that is important for text mining applications such as protein-protein interaction extraction, pathway reconstruction and many more. ProNormz is an integrated approach for human protein (HP) tagging and normalization.
Retrieves structured information from free-text clinical narratives. This approach starts with the detection of the named entities of interest and their relations. It then builds “information frames” from each extracted item. This method is based on a natural language processing (NLP) system and unsupervised machine learning techniques. It was applied in the particular context of mammography reports.
Provides the key types of annotation for a single set of sentences, expressing complex relationships between both physical and abstract entities. BioInfer is a public resource providing an annotated corpus of biomedical English aimed at supporting the development of information extraction (IE) systems and their components in the biomedical domain. This corpus is unique in the domain in combining annotation types for a single set of sentences, and in the level of detail of the relationship annotation.
Advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. OpenDMAP is an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.
A machine learning-based solution for biomedical Named Entity Recognition (NER), whose goal is to automatically extract names of biomedical entities from scientific text documents. Currently, Gimli supports the recognition of gene/protein, DNA, RNA, cell line and cell type names.
An open-source framework for the fast and lightweight development of domain-specific search engines. biomsef integrates taggers for major biomedical concepts, such as diseases, drugs, genes, proteins, compounds and organisms, and enables the use of domain-specific controlled vocabulary. The rationale behind this framework is to incorporate core features typically available in search engine frameworks with flexible and extensible technologies to retrieve biomedical documents, annotate meaningful domain concepts, and develop highly customized Web search interfaces.
Identifies terms and documents that are relevant to a gene and its products. Additional functionalities of eGIFT include finding terms in documents for a group of genes, finding genes sharing a specific term, finding related terms and related genes.
Trains a text document classifier and classifies text documents. This machine learning method allows users to identify articles that contain protein-protein interaction (PPI) data. Simple Classifier performs binary classification to determine whether a given article is PPI relevant or not. An analysis of incorrectly classified articles shows that false positives occurred when an article contained terms that usually indicate PPI but were not used in that context. Simple Classifier can apply the feature selection method with different classifier algorithms.
A semantic indexing method to extract relationships between terms and miRNAs directly from the biomedical literature. miRLiN provides access to a latent semantic indexing model, which contains the most recent and comprehensive collection of miRNA abstracts in MEDLINE. Users can query the model with any combination of terms or miRNAs. When querying with terms, miRLiN ranks all miRNAs in the collection with respect to semantic associations to the query. Selected miRNAs and terms can be visualized as a network graph, where the nodes represent the selected miRNAs and terms and the edges represent cosine values. LSI modeling of MEDLINE abstracts can be useful for knowledge discovery.
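The latent semantic indexing idea behind miRLiN's description can be sketched with a truncated singular value decomposition followed by cosine ranking. This is a minimal illustration that assumes NumPy is available; the count matrix, the labels, and the helper names (`lsi_vectors`, `rank_mirnas`) are invented for the example and do not reflect miRLiN's actual model or data.

```python
import numpy as np

# Toy term-by-abstract count matrix (rows: terms or miRNAs, columns: abstracts).
# All counts are made up for illustration.
labels = ["miR-21", "miR-155", "apoptosis", "inflammation"]
counts = np.array([
    [2, 1, 0, 0],   # miR-21
    [0, 0, 3, 1],   # miR-155
    [1, 2, 0, 0],   # apoptosis
    [0, 0, 1, 3],   # inflammation
], dtype=float)

def lsi_vectors(matrix, k):
    """Project each row into a k-dimensional latent space via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine of the angle between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_mirnas(query, vectors, labels):
    """Rank rows labelled 'miR-...' by cosine similarity to the query term."""
    q = vectors[labels.index(query)]
    scored = [(lab, cosine(q, vectors[i]))
              for i, lab in enumerate(labels) if lab.startswith("miR-")]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

vectors = lsi_vectors(counts, k=2)
ranking = rank_mirnas("apoptosis", vectors, labels)
```

Here `ranking` places the miRNA that shares abstracts with the query term first. The same cosine values could serve as edge weights in a network view of selected miRNAs and terms.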
A text mining tool to find new associations between drugs. DrugQuest clusters DrugBank records based on their textual information in a multidimensional vector space. We mainly apply partitional clustering algorithms in order to group together DrugBank records based on their textual information. Toxicity, targeted pathways, targeted proteins, diseases and/or other interactors are a few examples of such textual information. Uniquely assigning DrugBank records into clusters, based on tagged terms such as pathways, diseases, molecules and biological processes, can make DrugQuest a promising tool for new concept discovery and detection of new drug associations.
Improves grounding and relationship resolution for molecular entities commonly encountered in mining and curation of biomedical text. Bioentities is a curated resource that contains a set of identifiers representing protein families and complexes along with multiple types of mappings: (i) links between text strings and Bioentities identifiers, (ii) between Bioentities identifiers and identifiers representing protein families and complexes in other resources, and (iii) between Bioentities families/complexes and their constituent members.
A tool for the systematic detection of pathway mentions in the literature. PathNER is based on soft dictionary matching and rules, with the dictionary generated from public pathway databases. The rules utilise general pathway-specific keywords, syntactic information and gene/protein mentions. Detection results from both components are merged.
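The two-component design described for PathNER (soft dictionary matching plus keyword rules, with merged results) can be sketched as follows. This is only an illustration under stated assumptions: the three-entry dictionary, the 0.9 similarity threshold, the regular expression, and all function names are hypothetical, and `difflib` string similarity is swapped in as a stand-in for whatever soft-matching method the tool actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-dictionary; a real one would be generated from pathway databases.
PATHWAY_DICT = ["Wnt signaling pathway", "MAPK signaling pathway", "apoptosis"]

# Hypothetical rule: a capitalized name, optionally "signaling"/"signalling", then "pathway".
RULE = re.compile(r"\b[A-Z][\w/-]*(?: signall?ing)? pathway\b")

def soft_dictionary_matches(text, threshold=0.9):
    """Return token windows whose similarity to a dictionary entry reaches the threshold."""
    tokens = text.split()
    found = set()
    for entry in PATHWAY_DICT:
        n = len(entry.split())
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n]).strip(".,;()")
            ratio = SequenceMatcher(None, window.lower(), entry.lower()).ratio()
            if ratio >= threshold:
                found.add(window)
    return found

def rule_matches(text):
    """Return mentions found by the keyword rule."""
    return set(RULE.findall(text))

def detect_pathways(text):
    """Merge the detections of both components."""
    return soft_dictionary_matches(text) | rule_matches(text)

text = "The Wnt signalling pathway and the Notch pathway were altered."
mentions = detect_pathways(text)
```

In this example the soft matcher tolerates the British spelling "signalling", while the rule picks up "Notch pathway", which is absent from the toy dictionary.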
Extracts pathway interactions from the literature utilizing an existing event extraction tool and pathway named entity recognition (PathNER). PWTEES can be used to enrich the molecular context of diseases by applying large-scale text mining of events involving genes and pathways. We extended a state-of-the-art text mining system by introducing pathway named entity recognition to identify interactions involving both genes/proteins and pathways.
Aims to build a flexible, extensible system for a variety of natural language processing tasks. Distiller consists of a knowledge extraction framework that extracts and infers knowledge from texts. The main focus of the framework is the task of Automatic Keyphrase Extraction (AKE) which is the process of extracting relevant phrases from a document.
Allows biomedical text annotation. OGER consists of a web application that generates a high recall, low precision set of all the possible entities that can be found in a document. It is part of a two-stage pipeline, composed of a dictionary-based pre-annotator (OGER) and a machine-learning classifier (Distiller). This tool serves for biomedical entity recognition based on dictionary lookup and flexible matching.
A cross-language tool for searching MEDLINE/PubMed. Queries can be submitted as single terms or complex phrases in French, Spanish and Portuguese. Citations will be sent to the user in English. BabelMeSH uses a smart parser interface with a medical terms database in MySQL.
Labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. Stanford NER is based on a Monte Carlo method used to perform approximate inference in factored probabilistic models. It provides a general implementation of linear chain Conditional Random Field (CRF) sequence models. The tool can be used as a command line tool, runs on a server and provides a Java API.
Detects biomedical named entities such as genes, proteins, chemicals, diseases and cell lines. This approach consists of a multi-task learning framework based on a neural network method. It provides a text-mining solution to assist researchers in exploiting knowledge disseminated in the biomedical literature in a systematic and unbiased way. This method was improved by sharing character- and word-level information between different biomedical entity types.
Allows users to predict hypoxemia risk from electronic data. Prescience is a machine learning-based method that aims to exploit data from electronic medical record systems to help anesthesiologists detect physiological events. The application can assist with both preoperative and real-time prediction of risk.
Simplifies tasks during biological investigations. miRAFinder is a Python package that finds miRNA names and keywords. This script was developed for the semi-automatic extraction and arrangement of updated information on miRNA and additional data from published article abstracts in PubMed. The information gleaned through such an approach finds utility in miRNA analysis of specific diseases.
Text mining tool for relation extraction of protein-DNA and protein-RNA interactions. Relna expands the NLPBA corpus with protein-RNA relations and protein-DNA element relations. It provides a method that, given a text or PMID, recognizes these kinds of relations.
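The kind of miRNA name and keyword lookup described for miRAFinder can be approximated with a regular expression over abstract sentences. The sketch below is an assumption-laden illustration, not the package's code: the pattern covers only common name forms such as "miR-155", "let-7a" and "hsa-miR-21-5p", and the function name, keywords, and sample abstract are invented.

```python
import re

# Simplified, hypothetical pattern: optional three-letter species prefix,
# "miR"/"mir" or "let", a number, an optional letter suffix, and an optional arm (-3p/-5p).
MIRNA_PATTERN = re.compile(r"\b(?:[a-z]{3}-)?(?:mi[Rr]|let)-\d+[a-z]?(?:-[35]p)?\b")

def find_mirna_sentences(abstract, keywords):
    """Return sentences that mention at least one miRNA name and one keyword."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        mirnas = sorted(set(MIRNA_PATTERN.findall(sentence)))
        hits = [k for k in keywords if k.lower() in sentence.lower()]
        if mirnas and hits:
            results.append({"mirnas": mirnas, "keywords": hits, "sentence": sentence})
    return results

abstract = (
    "hsa-miR-21-5p was upregulated in breast cancer tissue. "
    "Expression of let-7a and miR-155 did not change. "
    "Tumor size correlated with age."
)
records = find_mirna_sentences(abstract, ["breast cancer", "upregulated"])
```

Only the first sentence is kept here, because it is the only one that contains both a miRNA name and a keyword.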
Allows extraction of information from UniProtKB and published literature, or from users' own uploaded text. MINOTAUR aims to assist users who want to search specific types of information from PubMed. It aims to assist users to grab salient facts from biomedical literature.
A product of the PubGene Company designed to be used by anyone seeking information on health, medicine and biology. It is ideal for those seeking an overview of a complex subject while allowing the possibility to "drill down" to specific details. Search results are presented in a dashboard format comprised of panels containing various categories of information ranging from introductory sources to the latest scientific articles.
Analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. If you need to extract information from biomedical documents, this tagger might be a useful preprocessing tool.
Presents pre-processed input from the underlying parsing, protein recognition and DB identifier assignment systems. Eighteen thousand full-text articles are indexed by GNSuite, and more than eighteen million abstracts from PubMed by MEDIE. The system accepts several sources of input, such as MEDIE, GNSuite, and LINNAEUS. This can easily be extended with other systems that provide stand-off annotations, since each system is presented in a separate tab in the user interface. All underlying results are integrated to improve recall.
Allows users to mine information from a range of unstructured data sources to create structured outputs. MC Miner is a state-of-the-art text mining engine that provides data usable to develop custom knowledgebases. This method can also be used as an indexing engine for developing an ontology. It could be used in different areas such as biology, genes, proteins, species, diseases, processes, chemistry, and many others.
Simplifies research experiments. NERsuite combines several NLP applications, such as a tokenizer, POS-tagger, lemmatizer and chunker. It contains three sub-functions: (1) a tokenizer, (2) a modified version of the GENIA tagger and (3) a named entity recognizer. This tool was tested on two biomedical Named Entity Recognition (NER) tasks. It is able to compute the beginning and past-the-end positions of a given sentence.
Extracts UMLS concepts from biomedical texts such as scientific paper abstracts, experiment descriptions or medical notes. It can be used to automatically curate and annotate biomedical literature, to index large document databases and improve searches, or to discover relationships between them. Recognizing specific biomedical concepts in free text is an increasingly important process, and Biolabeler focuses on this task to help human and computer annotators be more precise, in order to improve the quality of the huge biomedical text databases that bioinformaticians and biologists have to deal with nowadays.
A text mining tool that detects co-occurring biomedical concepts in abstracts from the MEDLINE literature database. CoPub allows batch input of multiple human, mouse or rat genes and produces lists of keywords from several biomedical thesauri that are significantly correlated with the set of input genes.
A web-based tool to mine human PPIs from PubMed abstracts based on their co-occurrences and interaction words, followed by evidence in human PPI databases and shared terms in the GO database. PPI Finder provides a useful tool for biologists to uncover potential novel PPIs.
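Beginning and past-the-end positions form a half-open interval, so that slicing the text with the two offsets returns exactly the sentence. A minimal sketch of that convention is shown below; the naive punctuation-based splitter and the name `sentence_spans` are assumptions made for illustration and are not NERsuite's implementation (for instance, abbreviations such as "e.g." would be split incorrectly).

```python
import re

def sentence_spans(text):
    """Return (begin, end) character offsets per sentence; `end` is one past the last character."""
    pattern = re.compile(r"\S[^.!?]*[.!?]+|\S[^.!?]*$")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

text = "IL-2 binds IL-2R. It activates T cells."
spans = sentence_spans(text)
sentences = [text[begin:end] for begin, end in spans]
```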
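Co-occurrence mining of this kind can be illustrated by counting how often two concepts appear in the same abstract and comparing that with what independence would predict. The sketch below uses pointwise mutual information (PMI) as a stand-in score; this is an assumption chosen for simplicity, not the statistic CoPub actually computes, and the concept sets are invented.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(abstract_concepts):
    """Score each co-occurring concept pair with PMI over a list of per-abstract concept sets."""
    n = len(abstract_concepts)
    single = Counter()
    pair = Counter()
    for concepts in abstract_concepts:
        single.update(concepts)
        pair.update(combinations(sorted(concepts), 2))
    # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated from abstract counts.
    return {
        (a, b): math.log2((joint * n) / (single[a] * single[b]))
        for (a, b), joint in pair.items()
    }

# Hypothetical concept annotations, one set per abstract.
abstracts = [
    {"TP53", "apoptosis"},
    {"TP53", "apoptosis"},
    {"BRCA1", "DNA repair"},
    {"TP53", "DNA repair"},
]
scores = cooccurrence_pmi(abstracts)
```

A positive score means the pair co-occurs more often than expected by chance in this toy collection, and a negative score means less often. A real system would also need a significance test and far larger counts.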
Aims to explore concept profiles and facilitates a broad range of tasks, including literature-based knowledge discovery. The Anni functionality is based on the use of an ontology, which defines concepts such as genes, biological processes and diseases and their relations. The concept of this tool is to associate concepts to each other based on their associated sets of texts.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans. In plants, however, this phenomenon has not yet been explored. We are planning to discover fusion transcripts/gene fusions in different types of plants by using RNA-Seq datasets. Further, we are planning to understand the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), there is a huge amount of sequencing data available in the public domain. Deriving novel findings from these available datasets is a major challenge for a computational biologist. We are interested in the analysis of whole genome and transcriptome sequencing data of different plants to extract useful information from those datasets with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Database development is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is to develop automated pipelines and tools for the analysis of high throughput genomics data, generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017- present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial Board Member of Journal: Theoretical Biology and Medical Modelling.
Reviewer: PloS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).