Promotes the development of biomedical text mining applications. BioCreative works closely with biocurators to understand their curation workflows, the text mining (TM) tools they use and their major needs. One aim of the BioCreative challenge is to determine the state of the art for a given task in biomedical text mining. This can be achieved if a considerable number of participants from a given community take part and the results of each system are evaluated by domain experts using well-defined evaluation metrics. To address the barriers to using TM in biocuration, BioCreative has been conducting user requirements analyses and user-based evaluations, and fostering standards development for TM tool reuse and integration.
Helps in your workflow for text analysis and database curation. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names.
A Web-based tool for accelerating manual literature curation (e.g. annotating biological entities and their relationships) through the use of advanced text-mining techniques. As an all-in-one system, PubTator provides one-stop service for annotating PubMed citations.
Identifies negation in textual medical records. NegEx implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. It enables lexical representations in other languages. The tool can be used to scan the documents and charts of an outgoing patient to make sure that the doctors haven’t missed anything important. It was translated to Swedish, French, and German and compared on corpora from each language.
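The three steps named above (trigger phrases, filtering of pseudo-negation phrases, and a limited scope) can be sketched roughly as follows. This is a minimal illustration, not the NegEx implementation: the phrase lists, the five-token window and the function name `negated_terms` are all assumptions chosen for the example.

```python
import re

# Illustrative phrase lists (assumed); a real negation lexicon is much larger.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]
PSEUDO_NEGATIONS = ["no increase", "not only", "no change"]
SCOPE_TERMINATORS = {"but", "however", "although"}
MAX_SCOPE = 5  # assumed window: tokens after a trigger that can be negated


def negated_terms(sentence, terms):
    """Return the subset of `terms` that fall inside a negation scope."""
    text = sentence.lower()
    # Step 1: mask pseudo-negation phrases so they cannot act as triggers.
    for phrase in PSEUDO_NEGATIONS:
        text = text.replace(phrase, "_" * len(phrase))
    found = set()
    # Step 2: look for each trigger phrase as a whole word or phrase.
    for trigger in NEGATION_TRIGGERS:
        for match in re.finditer(r"\b" + re.escape(trigger) + r"\b", text):
            # Step 3: limit the scope to a few tokens, stopping early at a
            # terminator such as "but".
            window = []
            for token in re.findall(r"[a-z_]+", text[match.end():]):
                if token in SCOPE_TERMINATORS or len(window) == MAX_SCOPE:
                    break
                window.append(token)
            scope = " ".join(window)
            for term in terms:
                if re.search(r"\b" + re.escape(term.lower()) + r"\b", scope):
                    found.add(term)
    return found


if __name__ == "__main__":
    print(negated_terms("The patient denies chest pain but reports fever.",
                        ["chest pain", "fever"]))
```

The function expects one sentence at a time, so sentence splitting would have to happen upstream.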
Allows users to perform annotation. ezTag provides training data interactively. It supports all PubMed abstracts and PMC open access articles. Users have multiple ways of annotating bio-entities: (i) the pre-trained state-of-the-art bioentity taggers, (ii) the string pattern match tagger, which uses a user-provided lexicon and (iii) the customized tagger by training TaggerOne. It also supports training and iterative text annotation, which produces an annotated set of documents and a custom tagging module for any bio-entity.
An API and web-based tool for biomedical concept identification. MEDLINE abstracts or free text can be annotated directly in the web interface, where identified concepts are enriched with links to reference databases. Using its customizable widget, it can also be used to augment external web pages with concept highlighting features. Furthermore, all text-processing and annotation features are made available through an HTTP REST API, allowing integration in any text-processing pipeline.
A free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities.
Automatically annotates full scientific articles with categories from the first layer of the Core Scientific Concept (CoreSC) scheme. SAPIENT relies on supervised machine learning and sequence labelling, employing conditional random fields (CRFs). The software can build extractive summaries of full papers in chemistry and biochemistry. This tool recognizes and qualifies discourse structure from the scientific literature.
Allows users to perform clinical text annotation. NCBO Annotator+ is a web application that offers several functions, such as scoring, detection of context (negation, experiencer, temporality), and coarse-grained concept recognition (with Unified Medical Language System (UMLS) Semantic Groups). To do so, this tool uses a dictionary of biomedical terms drawn from about 600 semantic resources (notably all of UMLS and all the Open Biomedical Ontologies (OBO) Library ontologies).
Allows semantic disambiguation via approximate string matching. SimSem exploits a collection of strings such as dictionaries, LibLinear as its machine-learning component and SimString for fast approximate string matching. It uses semantic category disambiguation (SCD) to assign the appropriate semantic category. This tool is applicable to manual annotation support tasks and can be used as a high-recall component in text processing pipelines.
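To make the idea of approximate string matching against a lexicon concrete, here is a brute-force sketch based on character trigram cosine similarity. It is only an illustration under stated assumptions: it does not reproduce SimString's indexing or SimSem's classifier, and the lexicon, the 0.6 threshold and the function names are hypothetical.

```python
import math


def char_ngrams(text, n=3):
    """Set of character n-grams, padded so string edges contribute too."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(a, b):
    """Cosine similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def approximate_lookup(query, lexicon, threshold=0.6):
    """Return (entry, category, score) for lexicon entries similar to query.

    `lexicon` maps a surface string to a semantic category; every entry is
    compared in turn, which is fine for a sketch but slow for large lexicons.
    """
    q = char_ngrams(query)
    hits = []
    for entry, category in lexicon.items():
        score = cosine(q, char_ngrams(entry))
        if score >= threshold:
            hits.append((entry, category, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical three-entry lexicon for illustration only.
    lexicon = {"interleukin-2": "Protein", "insulin": "Protein",
               "aspirin": "Chemical"}
    print(approximate_lookup("interleukin 2", lexicon))
```

Lowering the threshold trades precision for recall, which is the trade-off a high-recall pipeline component would lean on.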
A workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components.
Excludes patients with psychogenic nonepileptic seizures (PNES). YTEX employs natural language processing (NLP) methods to retrieve clinical notes from an electronic health record (EHR). It annotates syntactic constructs, named entities, and their negation context in clinical text by employing a modular pipeline of Unstructured Information Management Architecture (UIMA) annotators.
A web-based annotation tool for biomedical literature. BioQRator was designed to support any task annotating entities and relationships. It is also one of the first web tools which support the BioC format for annotation.
Provides simplified text to enhance the performance of natural language processing (NLP) systems and text mining (TM) applications. iSimp outputs simplified sentences in a corpus file, along with the annotation of simplification constructs in the original sentence. It uses shallow parsing and recursive transition networks to detect all forms of simplifications. This tool is able to detect six types of simplification constructs: coordination, relative clause, apposition, introductory phrase, subordinate clause and parenthetical element.
A modular framework for coreference resolution in biomedical text. Bio-SCoRes incorporates a variety of coreference types and their mentions, and allows fine-grained specification of strategies for resolving distinct coreference type-mention pairs. Bio-SCoRes follows a pipeline architecture, consisting of several mandatory and optional steps (linguistic pre-processing, domain-specific pre-processing, configuring resolution strategies, coreferential mention detection and post-processing). Experiments on several types of biomedical corpora demonstrated the extensibility of the architecture and its ease of adaptation.
A platform for Biomedical Text Mining (BioTM) that aims at the effective translation of advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows annotations to be corrected; and a Text Mining Module supporting dataset preparation and algorithm evaluation.
A semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools.
Provides the bioinformatics community with annotated Web services descriptions in diverse formats. BioSWR is a web services registry that provides standard Resource Description Framework (RDF) based Web services descriptions along with the traditional Web Service Definition Language (WSDL) based ones. The registry provides a Web-based interface for Web services registration, querying and annotation, and is also accessible programmatically via a Representational State Transfer (REST) API or through SPARQL Protocol and RDF Query Language (SPARQL) queries.
Get annotations for biomedical text with concepts from the ontologies. The Annotator service has access to a large dictionary of biomedical terms derived from the Unified Medical Language System (UMLS) and NCBO ontologies. To generate annotations for text, simply enter text in the box and press the submit button. The system matches words in the text to terms in ontologies by doing an exact string comparison (a “direct” match) between the text and ontology term names, synonyms, and ids.
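A "direct" match of this kind can be pictured as a dictionary lookup over the text. The sketch below is an assumption-laden illustration rather than the Annotator's actual code: the mini-dictionary, the `EX:` concept identifiers and the `annotate` function are hypothetical, and matching is done case-insensitively at word boundaries.

```python
import re

# Hypothetical mini-dictionary: surface form (name or synonym) -> concept ID.
DICTIONARY = {
    "melanoma": "EX:0001",
    "malignant melanoma": "EX:0001",
    "skin": "EX:0002",
}


def annotate(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, concept_id) for every exact match."""
    annotations = []
    for term, concept_id in dictionary.items():
        # Word boundaries prevent matching "skin" inside "skinny".
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((m.start(), m.end(), m.group(0), concept_id))
    return sorted(annotations)


if __name__ == "__main__":
    for annotation in annotate("Malignant melanoma of the skin"):
        print(annotation)
```

Overlapping matches (here "melanoma" inside "Malignant melanoma") are all reported; a longest-match filter could be added on top if only one annotation per span is wanted.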
A recognition component for NLP (Natural Language Processing) pipelines. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system’s matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks.
A fast and intuitive method for indexing tasks. By only relying on neighbor documents, the User-oriented Semantic Indexer does not need a representative learning set. Yet, it provides better results than the other approaches by giving a consistent annotation scored with a global criterion — instead of one score per concept.
Simplifies annotation of protein subcellular localizations. LocText combines named-entity recognition (NER) and relation extraction (RE), based on machine-learned models of the semantics and syntax of scientific text, and provides normalized entities linked to the original sources. It is able to extract protein-in-location relationships from texts. This tool aims to reduce annotation time spent by curators but still requires further expert verification.
Allows a semantic search in multiple biomedical databases (PubMed included) and runs a query via relationships between concepts, so that you can more easily retrieve pertinent results and navigate them by "key concepts". Quertle uses an advanced ontology of biological, medical, and chemical terms, so you can use the form of a term you are most comfortable with and Quertle will find all the synonyms automatically.
A simple format to share text data and annotations. BioC allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities.
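As a rough picture of how text and named-entity annotations sit together in a BioC-style XML file, the sketch below builds a small collection with Python's standard library. It is a simplified illustration and makes no claim to cover the full BioC DTD; the function name `build_collection` and the `source`, `date` and `key` values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, entities):
    """Build a minimal BioC-style collection.

    entities: list of (annotation_id, entity_type, offset, length) tuples,
    with offsets counted from the start of `text`.
    """
    collection = ET.Element("collection")
    # Placeholder metadata values, for illustration only.
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = "1970-01-01"
    ET.SubElement(collection, "key").text = "example.key"
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for ann_id, ent_type, offset, length in entities:
        ann = ET.SubElement(passage, "annotation", id=ann_id)
        ET.SubElement(ann, "infon", key="type").text = ent_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        ET.SubElement(ann, "text").text = text[offset:offset + length]
    return collection


if __name__ == "__main__":
    root = build_collection(
        "doc1",
        "BRCA1 is linked to breast cancer.",
        [("T1", "Gene", 0, 5), ("T2", "Disease", 19, 13)],
    )
    print(ET.tostring(root, encoding="unicode"))
```

Because annotations point into the text by offset and length, the original passage stays untouched while any number of annotation layers can be attached to it.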
A handbook of domain-customized syntactic parsing guidelines based on iterative annotation and adjudication between two institutions (Kaiser Permanente and Vanderbilt University). Special considerations were incorporated into the Medical Treebank guidelines for handling ill-formed sentences, which are common in clinical text. Medical Treebank (currently containing 1100 sentences) is the first one that applies Foster’s computationally verified approach to annotating ungrammatical clinical sentences.
Handles annotations in BioC format. BioC Viewer is a collaborative platform aiming to facilitate curation through a visualization interface compatible with PubMed Central (PMC) articles. This program allows researchers to select and retrieve protein-protein interaction and genetic interaction pairs that can be detected in publications available as full text, and eases the correction of misaligned annotations.
A text-mining platform for the management of annotation projects. Markyt provides an intuitive interface to visualize and edit annotated document sets, keeps track of multiple rounds of annotation and allows the comparison of annotation quality across rounds and among annotators. Annotation classes are represented by HTML class labels and customized to meet the specifications of the project.
A web-based tool for text annotation; that is, for adding notes to existing text documents. BRAT is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and interpreted by a computer.
A general-purpose text annotation tool that is integrated with the Protégé knowledge representation system. Knowtator facilitates the manual creation of training and evaluation corpora for a variety of biomedical language processing tasks.
Assists users to annotate patient phenotypes and diseases at the time of writing clinical reports or manuscripts. Phenotero allows researchers to reference classes from ontologies within clinical reports or manuscripts at the time of writing. This tool is based on the open-source citation software Zotero and thus builds on existing features: a word-processor plugin, standardized and extendible citation styles, a search function, and active community support.
A web-based management platform for collaborative annotation & curation. GATE Teamware is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.
A GUI-based text annotation tool for creating and visualizing annotations. MMAX2 uses a flexible stand-off XML data format, and has advanced and customizable methods for information and relation visualization.
Identifies, annotates, and indexes clinical documents. TIES is a natural language processing (NLP) pipeline and clinical document search engine. It supports tissue ordering and acquisition, building of Tissue Microarrays (TMAs), and integration with tissue banks and honest brokers. The tool provides a collaborative work space that enables research teams to work on queries and case sets together, even across institutions with separate TIES installations.
Simplifies the annotation process for scientific papers. SAPIENT implements functions that allow each sentence to be annotated with a list of concepts from the Core Scientific Concept (CoreSC) scheme. It lets users precisely describe scientific investigations as well as chemical named entities.
Connects the static content of scientific articles to the dynamic world of online content. Utopia Documents brings up-to-date information directly to the desktop with a brand new look and feel that blends real-time updates with the typographic elegance of published articles. It allows readers to explore cited literature more easily and to navigate directly to cited articles (where available) or to find an article's online presence.
Adds ontology term selection to Excel spreadsheets. RightField can specify a range of allowed terms from a chosen ontology (subclasses, individuals or combinations). The resulting spreadsheet presents these terms to the users as a simple drop-down list. The tool enables users to import Excel spreadsheets, or generate new ones from scratch. It enables the scientist to consistently annotate their data without the need to explore and understand the numerous standards and ontologies available to them, and it does not require them to change normal practice.
Permits annotation with ontology terms. OnASSIs is able to compute semantic similarity measures, based on the structure of the ontology, between different annotated samples. It allows users to retrieve concepts from OBO ontologies in a given text with different options. This tool offers the possibility to annotate Gene Expression Omnibus (GEO) metadata for stored experiments and samples.
Receives and edits batches of abstracts in standard North American Association of Central Cancer Registries (NAACCR) format into the central registry. Prep Plus is a program that can run in file-server or client-server mode and stores tracking information in a database. The software can handle abstracts created by any software system. It allows abstracts to be edited and cases to be presented individually for correction, and supports the generation of error reports and visual editing of cases.
Helps to quickly find interpretations of results from high-throughput experiments together with relevant literature or to simply scan the literature for discussed genes. GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. It accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. GoGene has a high recall of 75%, and orthologous gene pairs can be distinguished from non-orthologous pairs purely based on text-mined annotations.
Allows annotation of lists of genes derived from microarray results by user defined terms. MILANO expands the gene names to include all their informative synonyms while filtering out gene symbols that are likely to be less informative as literature searching terms. It supports searching two literature databases: GeneRIF and Medline (through PubMed), allowing retrieval of both quick and comprehensive results. MILANO also has two major advances over similar tools: the ability to expand gene names to include all their informative synonyms while removing synonyms that are not informative and access to the GeneRIF database which provides short summaries of curated articles relevant to known genes.
Collects full-text biomedical journal articles. CRAFT is a manually annotated corpus with all coreferential phenomena of identity and apposition. It also identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology.
Filters target articles with ontology name and concept. Biotea Explorer facilitates the processing of biomedical literature by delivering a semantic dataset for PubMed Central (PMC) and its open-access subset. This subset has been improved with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology.
Provides a ranking service and an interface embedded into a curation platform. NeXtA5 aims to supply an annotation workflow that will leverage automated methods for text analysis to speed up curation and furnish a better ranking of MEDLINE articles. Searches can be made by using two methods: by performing Boolean queries directly using PubMed via the e-Utils and by using a search engine based on a vector-space model that locally indexes the content of MEDLINE.
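For readers unfamiliar with vector-space retrieval, the following sketch ranks a handful of documents against a query using TF-IDF weights and cosine similarity. It is a generic textbook-style illustration, not NeXtA5's engine: the weighting formula, the tokenizer and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return document indices ordered by TF-IDF cosine similarity to query."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms never seen in the collection are ignored.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vectorize(tokenize(query))
    scores = [(cosine(q, vectorize(tokens)), i)
              for i, tokens in enumerate(doc_tokens)]
    # Highest score first; ties broken by original position.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    docs = [
        "protein kinase signalling in yeast",
        "kinase inhibitors in cancer therapy",
        "plant genome sequencing",
    ]
    print(rank("cancer kinase", docs))
```

Unlike a Boolean query, which only separates matching from non-matching articles, this kind of scoring orders every article by how closely it resembles the query.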
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans. In plants, however, this phenomenon has not yet been explored. We plan to discover fusion transcripts/gene fusions in different types of plants using RNA-Seq datasets. Further, we plan to understand the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analysing whole genome and transcriptome sequencing data from different plants to extract useful information from those datasets with the help of bioinformatics tools. Currently, we plan to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Developing databases is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is the development of automated pipelines and tools for the analysis of high-throughput genomics data generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017-present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate Editor: Journal of Translational Medicine.
Editorial Board Member: Theoretical Biology and Medical Modelling.
Reviewer: PLoS ONE, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of genomes. (http://crdd.osdd.net/raghava/genomesrs/genotrick/).
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).