Assists users in exploring data using inductive learning. Weka includes methods for inducing interpretable piecewise linear models of non-linear processes. It contains several kinds of learning algorithms: (i) classifiers for both classification and regression, (ii) meta-classifiers that can improve the performance of the base classifiers, (iii) association rule learners, (iv) unsupervised learning methods (clustering), and (v) a number of methods for pre-processing data, called filters.
An abbreviation dictionary automatically constructed from the whole MEDLINE as of April, 2009. Acromine identifies abbreviation definitions by assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form. Applied to the whole MEDLINE (9,635,599 abstracts), the implemented system extracted 68,007 abbreviation candidates and recognized 467,402 expanded forms. The current Acromine achieves 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE.
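The co-occurrence idea described for Acromine can be illustrated with a minimal Python sketch. This is not Acromine's implementation: the regular expressions, the `max_words` limit and the first-letter filter are assumptions made here for illustration, and the names `candidate_pairs` and `count_expansions` are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: a short capitalised token inside parentheses, e.g. "(HMM)".
ABBREVIATION = re.compile(r"\(([A-Z][A-Za-z0-9-]{1,9})\)")
WORD = re.compile(r"[A-Za-z][A-Za-z-]*")


def candidate_pairs(text, max_words=5):
    """Yield (abbreviation, candidate expanded form) pairs from one abstract."""
    for match in ABBREVIATION.finditer(text):
        abbreviation = match.group(1)
        preceding = WORD.findall(text[:match.start()])
        # Every word sequence ending just before the parenthesis is a candidate.
        for n in range(1, min(max_words, len(preceding)) + 1):
            phrase = " ".join(preceding[-n:]).lower()
            # Assumed filter: keep phrases sharing the abbreviation's first letter.
            if phrase[0] == abbreviation[0].lower():
                yield abbreviation, phrase


def count_expansions(abstracts):
    """Count how often each (abbreviation, expanded form) pair co-occurs."""
    counts = Counter()
    for text in abstracts:
        counts.update(candidate_pairs(text))
    return counts
```

Calling `count_expansions(abstracts).most_common()` ranks the candidate expanded forms by how often they precede a given parenthetical abbreviation; a real system would need a more careful scoring step than raw counts.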
A Java-based data extraction tool that reads and extracts required data from files in tab-delimited text format. SRADE was originally developed to extract data from next-generation sequencing files that contained a lot of redundant data.
Converts between the MSA formats used by popular tools and collapses sequences to haplotypes. ALTER implements a straightforward workflow that guides the user through a four-step wizard in which the different options are automatically activated when the required information is available. The tool can eliminate redundancy to speed up phylogenetic analyses.
A web tool for integrated text mining and literature-derived bio-entity relation extraction. PLAN2L facilitates a more efficient retrieval of information relevant to heterogeneous biological topics, from implications in biological relationships at the level of protein interactions and gene regulation, to sub-cellular locations of gene products and associations to cellular and developmental processes, i.e. cell cycle, flowering, root, leaf and seed development.
Determines relative citation ratios (RCR) of articles available on PubMed. iCite lets users see the total number of citations, the number of citations per year (CPY) received by an article, and the expected number of CPY. It can also report the field citation rate (FCR) for each publication. The web interface allows users to build a PubMed query or to search for PubMed identifiers directly.
Automates data mapping across different datasets or from a dataset on Alzheimer’s disease to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. GEM allows researchers from around the world who have collected data on Alzheimer’s disease and aging to participate in a collaborative effort of data sharing.
Allows development of calibration equations for traits and data sets. BGLR is based on methods commonly used in genome-enabled prediction, including various parametric models and Gaussian processes that can be used for parametric or semiparametric regressions. It draws samples from the posterior density using a Gibbs sampler. This tool supports continuous as well as binary and ordinal traits.
An ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information.
An information extraction (IE) tool, specifically designed to perform biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a biological research assistant for text mining, and incorporates a document search ability with domain-specific IE.
A text mining system for extracting information about molecular processes in biomedical articles. Using the data extracted by BioContext, it is possible to get an overview of a range of biomolecular processes relating to a particular gene, or anatomical location.
Allows recognition of anatomical entity mentions in free text. AnatomyTagger is a machine learning-based system that identifies molecular entities and whole organism mentions to facilitate comprehensive analysis of entity references in biological and medical text. The machine learning based tagger integrates a variety of techniques shown to benefit tagging performance, including manually curated lexical resources, word representations induced from unannotated text, statistical true casing and non-local features.
Serves as an SDK/API for machine learning and information extraction, primarily on text data. Minorthird's toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. Additionally, it combines tools for annotating and visualizing text with state-of-the-art learning methods, and it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging.
An exploitable brain region connectivity database can be extracted from a very large amount of scientific articles. These models extract large amounts of connectivity data from unstructured text and compare favourably against in-vivo connectivity data. They provide a helpful tool for neuroscientists to facilitate the search and aggregation of brain connectivity data.
Assists users in binding site prediction. MI-1 is a Multiple Instance Learning (MIL) algorithm for Calmodulin (CaM) binding site prediction. Its performance was evaluated on a set of CaM binding proteins extracted from the Calmodulin Target Database. It captures the minimal constraints. It generates the CaM binding propensity along a protein’s length but cannot explicitly identify multiple binding sites.
Combines many state-of-the-art methods which are applicable to a wide array of different parameter estimation problems. PESTO offers different features: (1) multi-start local optimization and interfaces to global and hybrid optimizers; (2) optimization-based, integration-based or hybrid profile calculation for uncertainty and identifiability analysis; (3) several sampling methods for uncertainty and identifiability analysis; (4) visualization of all analysis results; and (5) an efficient workflow and optional parallelization. Users can easily customize this tool.
Allows users to convert static biological expression language (BEL) knowledge assemblies into dynamic agent-based models (ABMs). BEL2ABM is an application that provides a graphical user interface as well as a semi-automatic parameter optimization method. The application also allows users to interact with the simulation through initial numbers, homeostatic mimicking, and more. It produces outputs in NetLogo format.
Allows users to store and manipulate experimental data for the purpose of numerical modeling. DataRail is an information processing system that aims to bridge the gap between data acquisition and modeling. The minimum information standard (MIDAS) is part of the DataRail system, and a series of additional tools are also applied to maintain the provenance of data and ensure its integrity through multiple steps of numerical manipulation.
Enables facile generation of the simulation systems of complex glycoconjugates with most sugar types and chemical modifications in the Protein Data Bank (PDB). Glycan Reader is a web application with features including (1) handling of both PDB and PDBx/mmCIF formats, (2) identification of most sugar types and chemical modifications, including various glycolipids, in the PDB, and (3) implementation in CHARMM-GUI and GlycanStructure.Org.
Assists in incorporating data from the Comparative Toxicogenomics Database into user-defined workflows. CTDquerier is an R package with features for querying (at gene, chemical or disease level), visualizing (through a series of plots) and performing downstream analyses such as enrichment analysis. This application is suited for integration into various pipelines and can be applied in association analyses for genetic, toxicological and environmental research.
Assists in extracting and normalizing the location of the infected host of a virus from metadata fields. GeoBoost provides users with a method for addressing sparse or incomplete metadata in GenBank sequence records. Moreover, it assigns probability scores to each possible location of the infected host (LOIH) to simplify probabilistic geospatial modeling.
Enables the intuitive design of experiments in terms of growth conditions and sampling strategy using related Ontologies. Xeml Lab offers a convenient environment to plan and document experiments in a machine-readable manner, and provides information about the experimental design, sampling procedures and environmental conditions. It includes a new ontology for environmental conditions, called Xeml Environment Ontology.
Aims to infer the microbial association network. MPLasso is based on a graph structure learning method where nodes represent microbes and edges represent associations among microbes. It is able to deduce the sign of each edge. This tool performs well in terms of edge recovery accuracy. It can effectively select taxa that are strongly associated, with high statistical confidence. MPLasso can serve to reveal the underlying dynamics of microbial communities.
Enumerates the tetrapyrrole macrocycles formed in a virtual library, identifies all isomers, and calculates the distribution of each product. PorphyrinViLiGe is a program that can enumerate the types and amounts of tetrapyrrole products formed upon combinatorial reactions, and perform subsequent data mining on the resulting virtual library. It also enables data mining to assess the number of products that exhibit chosen patterns of particular substituents.
Allows enumeration of the virtual libraries for four types of modular molecular architectures formed upon combinatorial reaction. Cyclaplex is a program that employs mathematical methods as well as generative algorithms. It enables a quantitative description of the theoretical composition of combinatorial libraries of important linear and cyclic molecular architectures. The software can be useful for understanding the possible diversity formed upon combinatorial reactions.
Allows the detection and correction of misidentified gene symbols, as well as on-the-fly file format conversion of structured data text files. Truke is a data format conversion tool with a unique corrupted gene symbol detection utility. It uses a previously built dictionary of gene symbols susceptible to being converted to dates. This web app also offers a heuristic approach to deal with mixed data without specifying the date pattern.
Determines whether clinical conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient. ConText obtains reasonable to good performance for negated, historical, and hypothetical conditions across all report types that contain such conditions. It infers the status of a condition with regard to these properties from simple lexical clues occurring in the context of the condition. The tool is based on the approach used by NegEx for finding negated conditions in text. It can improve precision of information retrieval and information extraction from various types of clinical reports.
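The lexical-cue approach described for ConText can be sketched roughly as follows. This is a simplified illustration, not the ConText algorithm itself: the tiny `TRIGGERS` lists, the fixed `window` of words before the condition and the function name `condition_status` are all assumptions made here.

```python
import re

# Hypothetical, deliberately tiny cue lists; real lexicons are far larger.
TRIGGERS = {
    "negated": ["no", "denies", "without", "negative for"],
    "historical": ["history of", "previous"],
    "hypothetical": ["if", "should"],
    "not_patient": ["mother", "father", "brother", "sister"],
}


def condition_status(sentence, condition, window=5):
    """Return the set of properties cued by words just before the condition."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    target = condition.lower().split()
    starts = [i for i in range(len(tokens) - len(target) + 1)
              if tokens[i:i + len(target)] == target]
    if not starts:
        raise ValueError("condition not found in sentence")
    start = starts[0]
    # Assumed scope: only the `window` words preceding the condition are checked.
    context = " " + " ".join(tokens[max(0, start - window):start]) + " "
    return {label for label, cues in TRIGGERS.items()
            if any(f" {cue} " in context for cue in cues)}
```

An empty set means that no cue was found, so the condition is treated as affirmed, recent and experienced by the patient.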
Minimizes the number of validation experiments necessary for reliable performance estimation and fair comparison between algorithms through a cost-efficient method. VDA is a method for designing a minimal validation dataset to allow reliable comparisons between the performances of different algorithms. Implementation of the VDA approach achieves this reduction by selecting predictions that maximize the minimum Hamming distance between algorithmic predictions in the validation set. VDA can be used to correctly rank algorithms according to their performances.
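The selection criterion described for VDA, maximizing the minimum Hamming distance between algorithms over the chosen predictions, can be illustrated with a small greedy sketch. The greedy strategy, the tie-breaking rule and the function names are assumptions made for this example; the source does not state how the optimization is actually carried out.

```python
from itertools import combinations


def pairwise_hamming(predictions, chosen):
    """Hamming distance between every pair of algorithms on the chosen items."""
    return [sum(a[i] != b[i] for i in chosen)
            for a, b in combinations(predictions, 2)]


def greedy_validation_set(predictions, k):
    """Greedily pick k item indices that keep the minimum pairwise distance high.

    `predictions` holds one equal-length list of binary calls per algorithm.
    Ties are broken by total pairwise distance, then by lowest index
    (an assumption of this sketch).
    """
    n_items = len(predictions[0])
    if k > n_items:
        raise ValueError("k exceeds the number of candidate items")
    chosen = []
    for _ in range(k):
        def score(i):
            distances = pairwise_hamming(predictions, chosen + [i])
            return (min(distances), sum(distances), -i)

        remaining = [i for i in range(n_items) if i not in chosen]
        chosen.append(max(remaining, key=score))
    return chosen
```

Items on which all algorithms agree contribute nothing to any pairwise distance, so a selection of this kind favours predictions that help to tell the algorithms apart.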
Provides the infrastructure to combine standalone applications by exporting different data formats. PubMedPortable automatically builds a PostgreSQL relational database schema and a Xapian full text index on PubMed XML files, and provides an interface to BioC. The aim of PubMedPortable is to enable users to develop text mining applications and use cases with very basic programming knowledge. The integrated workflow allows users to retrieve, store, and analyse a disease-specific data set. The software library is small, easy to use, and scalable to the user’s system requirements.
Detects and escapes a wide variety of problematic text strings so that they are not erroneously converted into other representations upon importation into Excel. Microsoft Excel automatically converts certain gene symbols, database accessions, and other alphanumeric text and numbers into dates, scientific notation, and other numerical representations, which may lead to subsequent, irreversible corruption of the imported text.
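A rough sketch of this kind of escaping is shown below. The two patterns cover only month-like symbols and scientific-notation-like accessions and are far from exhaustive. Wrapping a value as `="..."` is a commonly used way to make Excel keep a cell as text, and it is assumed here rather than taken from the tool's documentation. The function names are hypothetical.

```python
import re

# Assumed, incomplete patterns for strings that Excel tends to reinterpret.
RISKY_PATTERNS = [
    # Month-like gene symbols such as SEPT2, MARCH1 or DEC1 (read as dates).
    re.compile(r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)"
               r"[-/ ]?\d{1,2}$", re.IGNORECASE),
    # Accessions such as 2310009E13 (read as scientific notation).
    re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE),
]


def is_risky(value):
    """True if the cell text matches one of the assumed risky patterns."""
    return any(pattern.match(value.strip()) for pattern in RISKY_PATTERNS)


def escape_cell(value):
    """Wrap risky text as ="..." so that it is imported as a literal string."""
    if is_risky(value):
        return '="' + value.replace('"', '""') + '"'
    return value


def escape_line(line, sep="\t"):
    """Escape every risky field of one delimited line."""
    return sep.join(escape_cell(cell) for cell in line.split(sep))
```

Applying `escape_line` to each row of a tab-delimited file before opening it in Excel leaves ordinary values such as `TP53` or `0.5` untouched.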
Allows users to detect and annotate temporally anchored mentions of adverse drug events (ADEs) in a clinical text corpus. ADEPt is a modular pipeline that first identifies ADE mentions, then organizes them, and finally refines their classification using contextual indicators provided by the source. The application also includes a way of targeting ADE-specific patterns in psychiatric clinical text and an expandable dictionary covering over 60 common ADEs.
Extracts structured data from web forums. Vigi4Med Scraper is part of the Vigi4Med project for detecting adverse drug reactions in social networks. It is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource.
Summarizes the raw, continuous, inherently noisy, outlier-ridden, biased electronic health record (EHR) data in a high throughput setting. PopKLD aims to reduce the amount of human effort necessary to clean and summarize the data. It employs a non-parametric probability distribution estimate to proceed. This method creates an estimate of the mean and variance for every individual.
A named entity recogniser for the recovery of bioinformatics databases and software from primary literature. bioNerDS can recognise mentions of bioinformatics’ databases and software in primary literature with a reasonable accuracy. It achieved an F-measure of between 63% and 91% on different datasets (63%–78% at the document level).
Provides an easy way to build XML files following the Helmholtz Open BioInformatics Technology network (HOBIT) format descriptions from inside the user's own programs. BioDOM is designed to be a modular system which can easily be extended as necessary to accommodate new formats. Additionally, it provides functions to convert native non-XML output of various bioinformatic tools to the HOBIT XML formats. In addition to these functions, there are conversion functions for many commonly used non-XML formats, which allow traditional tools and services a smooth transition from their data formats towards the XML formats.
A public online searchable index of bioinformatics resources. Information describing the resources has been automatically extracted from the literature and indexed using natural language and text mining techniques. The index is automatically updated by analyzing new papers describing existing resources (databases, tools, services…).
Facilitates a search for online resources that are introduced in peer-reviewed papers. You can search by MeSH terms or author names in addition to free words. OReFiL extracts all URLs from MEDLINE abstracts and PubMed-indexed BioMed Central full-papers (implementation or availability sections), and indexes them with MeSH terms and author names.
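The indexing step described for OReFiL can be sketched in a few lines of Python. The record layout (`abstract`, `mesh` and `authors` keys), the URL pattern and the function name `build_index` are assumptions made for illustration and do not reflect the actual implementation.

```python
import re
from collections import defaultdict

# Assumed URL pattern: stops at whitespace, angle brackets and parentheses.
URL = re.compile(r"https?://[^\s<>()]+")


def build_index(records):
    """Map each lower-cased MeSH term or author name to the URLs it co-occurs with."""
    index = defaultdict(set)
    for record in records:
        for url in URL.findall(record["abstract"]):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            for key in record["mesh"] + record["authors"]:
                index[key.lower()].add(url)
    return index
```

A lookup such as `build_index(records)["software"]` then returns every URL mentioned in an abstract indexed with that MeSH term.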
Assists users in migration and creation of new semantic web applications from scratch. SCALEUS provides an open source graphical interface, focused on the biomedical domain, that can be deployed on top of traditional systems. It includes a RESTful API for data management as well as RDF Schema (RDFS) inference support over SPARQL queries, allowing the establishment of knowledge inference rules and query federation for users’ information.
Consists of a framework for managing software and databases. The BIRCH system consists of a core of commonly-used programs for most typical bioinformatics tasks. It allows for seamless integration of locally-installed programs so that each BIRCH site can be tailored to the needs of the local user-community. It includes a network-centric design, allowing users to do any task from anywhere.
Provides a convolutional neural network (CNN)-based ranking approach for biomedical entity normalization. The software takes advantage of semantic and morphological information of biomedical entity mentions to rank candidates generated by the handcrafted rules used in traditional rule-based systems. Thereby, it is able to generate candidates for a given biomedical entity mention and to rank biomedical entity candidates.
Allows users to recognize important biomedical entities (such as genes and proteins) from text. BNER is a recurrent neural network (RNN) framework allowing capture of morphological and orthographic information of words. It utilizes an attention model to encode character information of a word into its character-level representation. It combines character and word-level representations and then feeds them into the long short-term memory recurrent neural network (LSTM-RNN) layer to model context information of each word.
Recognizes not only disorder mentions in the form of contiguous or discontiguous words but also mentions whose spans overlap with each other. Multi-Label Representation is an approach to recognize disorder mentions from clinical narratives, which can be very complicated in some circumstances. Using binary digits to record the disorder mention details, the multi-label scheme enables recognition of complicated disorder mentions, e.g., those overlapping with each other.
Intends to ease the production of dynamic programming algorithms. Generalized ADP is a framework based on the separation of state-space traversal, scoring, and selection of the solutions the user wants. It aims to help users implement programs by combining multiple reusable components with a standardized grammar that describes the types of the attribute functions attached to each production rule.
Parses scientific names of any complexity. gnparser identifies which combinations of the most atomic parts of a name-string represent words or dates. It allows developers to define the rules that describe the general structure of target strings thanks to the implementation of a Parsing Expression Grammar (PEG). The tool can be used to form normalized names automatically. It transforms names of taxa into their semantic elements. gnparser aims to complete coverage of the biodiversity’s test suite.
Provides a unified graphical user interface for data extraction, data conversion and output composition. Vect is a visual programming tool that allows users to manipulate their sample data inside its user interface, and then generates Perl programs to replicate these tasks. Additionally, this application provides interactive feedback to assist users to identify any error in their code and resolve it.
Manipulates text reports to extract specific terms and knowledge from them. HITEx is an open-source natural language processing (NLP) software application. The software is built on top of the GATE framework and uses GATE as a platform. It consists of a collection of GATE plug-ins that were developed to solve problems in the medical domain and works by assembling these plug-ins into pipeline applications, along with other standard NLP plug-ins.
A domain-specific lemmatization tool for the morphological analysis of biomedical literature. The BioLemmatizer is tailored to the biological domain through integration of several published lexical resources related to molecular biology. It focuses on the inflectional morphology of English, including the plural form of nouns, the conjugations of verbs, and the comparative and superlative form of adjectives and adverbs.
Provides a pipeline to extract abstract patterns within texts. T-GOWler is a system that contains two main engines: (i) WfExtractor extracts workflows from texts and builds a workflow database, and (ii) WfMiner mines frequent generalized patterns. The extraction pipeline is based on a simple natural language processing (NLP) chain and a specific workflow tagger to classify workflow components.
Extracts workflows from texts and builds a workflow database. WfExtractor processes in three steps: (1) named entity extraction, (2) link extraction, and (3) workflow reconstruction from the extracted elements. It applies a domain ontology as a gazetteer to recognize workflow elements in texts. The tool uses a supervised machine learning approach to disambiguate software contexts using specific Java Annotation Patterns Engine (JAPE) rules.
Generates generalized subgraphs of workflows out of concrete ones. WfMiner creates frequent patterns using a level-wise pattern generation on top of the ontology and a specific pattern-to-workflow matching algorithm to filter nonfrequent patterns. It uses a data structure to prune patterns while calculating their supports using the generality relationship between them.
Compiles and ranks information about more than 200 leading global research institutions according to their influence on industry and innovation. QUT In4M is an open database that scores these institutions using a method that merges scholarly work cited in patent literature and the estimated perceived value of the patents. The repository is composed of three main panels that give an overview of the different institutions, allow users to make comparisons between them, and explain the methods applied for ranking.
Gives access to a searchable collection of neuroscience data, a catalog of biomedical resources, and an ontology for neuroscience. NIF is a dynamic inventory of web-based neuroscience resources designed to serve neuroscience investigators by facilitating directed and intelligent access to data and findings, aiding integration, synthesis, and connectivity across related data and findings, stimulating new and enhanced development of neuroinformatic resources, and enabling new and enhanced analyses of data.
Provides about 14 000 protein-ligand complexes. AutoBind was constructed using an algorithm that automatically extracts information about protein-ligand binding affinity. This tool is able to recognize candidate sentences describing protein-ligand binding affinity. It provides a scoring function to rank the identified sentences for extracting binding affinities. The database is searchable by PDB identifier.
Provides a corpus of documents that was used to train machine learning-based taggers. AnatEM was built in part on the Anatomical Entity Mention (AnEM) and Multi-Level Event Extraction (MLEE) corpora. It contains 1,212 documents: 600 drawn randomly from abstracts and full texts as in AnEM and 612 which are a targeted selection of PubMed abstracts relating to the molecular mechanisms of cancer. The corpus was split into separate training, development and test sets.
Facilitates text mining/NLP research in the areas of literature-based database curation, named entity recognition, and ontology development. iProLINK is a resource for protein literature mining. The database can be used by computational or biological researchers to explore literature information on proteins and their features or properties. It also serves as a knowledge link bridging protein databases and scientific literature.
Provides a citation-based database about life science domains. The Colil database contains citations, citation contexts and co-citations extracted from full-text publications. This database is built as Linked Open Data (LOD) and uses the Resource Description Framework (RDF). This tool offers three different services: an easy-to-use search service, an FTP site and an advanced query builder. It aims to make biological research more efficient for researchers.
Aims to bring together donors with seekers of reagents. BRX was built to simplify and promote the sharing of biological reagents and more specifically antibodies. It parses the acknowledgements sections of papers that are available in PubMed Central. This platform supplies antibodies, DNA constructs, and cell-lines. It allows users to search for specific information, to comment on posts and to contact other researchers.
Gathers candidate biomarkers collected from PubMed publications. MarkerHub is an HCC-biomarker database that also provides a network visualization tool to assist bioinformaticians in discovering novel associations between genes and diseases based on direct/neighborhood associations. The database can facilitate biomarker research by providing life scientists with a ranked list that can be validated in a larger population using clinical specimens.
Provides e-specimens over computer networks. e-Foram Stock allows users to access three-dimensional virtual models developed by Tohoku University Museum. The database offers e-specimens based on information from real specimens. It is searchable by generic and specific names, locality and geological age.
Permits users to find semantic relations in PubMed and generate hypotheses. LINK can be used to investigate genes, diseases and drugs. When it does not recognize one of these three entities, it falls back on a natural language processing (NLP) method. This platform can serve to construct hypotheses for how a target affects disease, repurpose existing compounds to a different disease, and find trends.
Facilitates identification of various author-related components, including co-author profiles and the net output of a research organization of interest, and helps in referee searches. MC Identify aggregates metadata from a cluster of records to estimate the author’s profile, such as co-author distributions and keyword distributions, in order to predict how likely it is that a new record is “produced” by the same author.
Provides searchable, enriched and indexed full-text patents. MCPaIRS is a database that allows users to search the full texts of granted patents and published applications from India. This online resource contains a well-designed front page with bibliographic details, abstract and representative drawing. The data is hand-curated by domain experts and provided in an easy-to-use web interface.
Provides a terminological resource that can serve as a hub for modern language processing techniques and data integration solutions connecting literature with biomedical data. LexEBI is composed of (1) the full scope of biomedical-chemical relevant terms, (2) abbreviations and their long forms from the scientific literature, and (3) frequency information from the scientific literature.
Provides access to semantically enriched content for all research articles from PubMed Central (PMC). BioLit is a database that includes full text or excerpts of open access articles directly within existing biological databases, and adds newly generated metadata to the articles for increasing their informative value. It applies a text-mining pipeline to identify ontology terms provided by a number of ontologies from the National Center for Biomedical Ontology (NCBO), as well as PDB IDs.
PhD in Neuroscience. I worked for 8 years on the brain and its diseases, then specialized in bioinformatics (NGS, epigenetics) and worked at CEA and GENETHON before joining OMICX to help the OMICtools community.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans, but in plants this phenomenon has not yet been explored. We plan to discover fusion transcripts/gene fusions in different types of plants using RNA-Seq datasets. Further, we plan to understand the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analyzing whole genome and transcriptome sequencing data of different plants to extract useful information from those datasets with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Database development is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is to develop automated pipelines and tools for the analysis of high throughput genomics data, generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017-present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial board member: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).