Focuses on variant discovery and genotyping. GATK is a toolkit, developed at the Broad Institute, that comprises several tools and can support projects of any size. The application bundles an assortment of command-line tools for analyzing high-throughput sequencing (HTS) data in various formats such as SAM, BAM, CRAM or VCF. The website includes extensive documentation to guide users.
Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between these formats. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
Performs peak finding and downstream analysis of next-generation sequencing data. HOMER offers several tools and methods for working with ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and other types of functional genomics sequencing data sets. This software supports UCSC visualization, peak annotation, quantification of transcripts and repeats, and analysis of differential features, enrichment and expression.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It permits the creation and release of software in an open source spirit. It integrates a range of packages and tools for sequence analysis into a seamless whole. It is free of charge and open source.
A software suite for the comparison, manipulation and annotation of genomic features in the browser extensible data (BED) and general feature format (GFF) formats. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The individual BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions about large genomic datasets.
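The kind of feature comparison described above can be illustrated with a minimal Python sketch. It is not how BEDTools is implemented: the function name intersect and the sample intervals are hypothetical, and the quadratic loop is only meant to show what an overlap between two sets of BED-style intervals looks like.

```python
def intersect(a, b):
    """Return the overlapping parts of two lists of (chrom, start, end) intervals.

    Coordinates are 0-based and half-open, as in BED. This brute-force
    comparison is for illustration only and does not scale to large files.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # a non-empty overlap exists
                overlaps.append((chrom_a, start, end))
    return overlaps


# Hypothetical example data: aligned reads and annotated genes.
reads = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
genes = [("chr1", 150, 550), ("chr2", 80, 120)]
print(intersect(reads, genes))
```

Because BED intervals are half-open, the chr2 read ending at 80 does not overlap the gene starting at 80.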
A Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner.
Performs gene and isoform level quantification from RNA-Seq data. RSEM is a software package that quantifies gene and isoform abundances from single-end (SE) or paired-end (PE) RNA-Seq data. The software enables visualization of its output through probabilistically-weighted read alignments and read depth plots. It does not require a reference genome and thus can be useful for quantification with de novo transcriptome assemblies.
A flexible toolkit for exploring datasets generated by nanopore sequencing devices from MinION for the purposes of quality control and downstream analysis. Poretools operates directly on the native FAST5 (an application of the HDF5 standard) file format produced by ONT and provides a wealth of format conversion utilities and data exploration and visualization tools.
Provides assistance with the problem of mapping various types of IDs to each other. Onto-Translate provides users with a non-redundant and complete mapping from any type of identification system to any other type. This software exploits the custom design of the Onto-Tools database, which contains 20 publicly available biological databases such as KEGG or GenBank. It can convert individual gene identifiers from one format into another.
Builds mapping assemblies from short reads generated by the next-generation sequencing machines. Maq is particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Maq first aligns reads to reference sequences and then calls the consensus. At the mapping stage, maq performs ungapped alignment. For single-end reads, maq is able to find all hits with up to 2 or 3 mismatches, depending on a command-line option; for paired-end reads, it always finds all paired hits with one of the two reads containing up to 1 mismatch. At the assembling stage, maq calls the consensus based on a statistical model.
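The idea of ungapped alignment with a bounded number of mismatches can be sketched in a few lines of Python. This is a naive scan under stated assumptions, not Maq's actual algorithm: the function names, the max_mismatches parameter and the toy sequences are all hypothetical.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def ungapped_hits(read, reference, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement of the read
    on the reference that has at most max_mismatches mismatches.

    A brute-force scan for illustration; real mappers use indexes instead.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Hypothetical toy data.
reference = "ACGTACGTTAGCACGAACGT"
read = "ACGAACG"
print(ungapped_hits(read, reference, max_mismatches=1))
```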
Allows users to manipulate, organize, summarize and visualize MinION nanopore sequencing data. poRe enables users to organize MinION FAST5 files into run folders, extract FASTQ, gather statistics on each run and plot a number of key graphs, such as read-length histograms and yield-over-time. Two graphical user interfaces (GUIs) for MinION data processing, organization and extraction are available through the package.
Improves the design and use of polymerase chain reaction (PCR)-based methylation assays. methPrimer was developed to store and retrieve validated methylation assays. This resource is intended to be a search portal for validated methylation assays. It also aims to establish a certain level of standardization and uniformity in the use of PCR based methylation assays. Each primer set is provided with a unique identifier to access them directly or refer to in a publication.
Permits users to parse, analyze and manipulate VCF files. VCFtools is a software package composed of two modules: the first is a general API that allows various operations to be performed on VCF files, including format validation, merging, comparing, intersecting, making complements and basic overall statistics; the second module analyzes single-nucleotide polymorphism (SNP) data in VCF format, helping researchers to estimate allele frequencies, levels of linkage disequilibrium and various quality control (QC) metrics.
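As an illustration of what estimating allele frequencies from a VCF file involves, here is a minimal Python sketch that counts alleles in the GT field of each sample. It does not use or reproduce the VCFtools API: the function name allele_frequencies and the example records are hypothetical, and only the standard tab-separated VCF column layout is assumed.

```python
def allele_frequencies(vcf_lines):
    """Map (chrom, pos) to a list of allele frequencies [ref, alt1, alt2, ...].

    Assumes tab-separated VCF records with a FORMAT column containing GT.
    Missing alleles (".") are skipped. Illustration only.
    """
    freqs = {}
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], fields[1], fields[4]
        gt_index = fields[8].split(":").index("GT")
        counts = [0] * (len(alt.split(",")) + 1)
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            for allele in genotype.replace("|", "/").split("/"):
                if allele != ".":
                    counts[int(allele)] += 1
        total = sum(counts)
        if total:
            freqs[(chrom, int(pos))] = [c / total for c in counts]
    return freqs


# Hypothetical three-sample example with one missing genotype.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1|1\t./.",
]
print(allele_frequencies(vcf_lines))
```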
Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.
Enables reading of sequencing files from the SRA database and writing files into the same format. The NCBI SRA Toolkit is provided in the form of the SRA SDK, and can be compiled with GCC. It allows users to programmatically access data housed within SRA and convert it from the SRA format to the following formats: ABI SOLiD native, fasta, fastq, sff, sam, Illumina native. This toolkit is available for all common platforms.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command-line tools. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. It also works with next generation sequencing (NGS) data.
A statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data. BCFtools can manipulate variant calls in the variant call format (VCF) and its binary counterpart BCF. It also can discover somatic and germline mutations with appropriate input data, efficiently estimate site allele frequency, allele frequency spectrum and linkage disequilibrium, and test Hardy–Weinberg equilibrium and association.
Generates an ISA-Tab structured investigation out of nmrML files. nmrML2ISA substantially reduces the time needed to capture metadata (capturing 90% of metadata on assay and sample levels) and is much less prone to user input errors. It improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets.
A software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. The BamTools C++ API/library has been successfully integrated into a variety of applications. It provides the BAM file support for several utilities in the BEDtools suite.
Aims to facilitate data exchange and file conversions between population genetics programs. PGDSpider is able to read 27 different file formats and can export data into 29 other file formats. It can be integrated in complex data analysis pipelines thanks to its command-line version. The tool provides a feature to store preferred conversion settings in order to repeat conversions of similar input formats.
Provides comprehensive alignment-based analysis of Nanopore reads through a simple, easy to use interface. NanoOK generates detailed tabular and graphical output plus an in-depth multi-page PDF report including error profile, quality and yield data. NanoOK is multi-reference, enabling detailed analysis of metagenomic or multiplexed samples. Four popular Nanopore aligners are supported and it is easily extensible to include others.
A conversion tool to read and write SAM, BAM and CRAM formats using a unified Application Programming Interface (API). It also permits the most efficient use of threads when converting between differing file formats, automatically balancing the encoder and decoder work loads. Scramble is not a drop-in replacement for the Samtools API; however, a port of the CRAM components of Scramble has been made to the HTSlib library and is available within Samtools.
A suite of software tools for manipulating data common to next-generation sequencing experiments, such as FASTQ, BED and BAM format files. With modules that operate from FASTQ pre-processing through BAM post-processing and RPKM calculations, NGSUtils complements existing tools and provides unique functionality that helps each step of an NGS data analysis pipeline. NGSUtils covers different aspects of NGS data analysis, including pre-processing, post-processing, filtering, format conversion and final result calculations. NGSUtils provides a stable and modular platform for data management and analysis.
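For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the standard formula can be written as a short Python sketch. The function name rpkm, its parameter names and the example numbers are hypothetical and are not taken from NGSUtils.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a library
# of 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)
```

Dividing by both feature length and library size is what makes values comparable across genes of different lengths and across samples of different depths.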
Converts BioPAX level 2 and level 3 files into SBML files including the Qualitative Models extension. BioPAX2SBML converts pathways from BioCarta, Reactome, and the National Cancer Institute from BioPAX formats to the SBML format, including the qual extension. Compared to existing conversion approaches with similar scope, BioPAX2SBML conversions result in comprehensive and correct SBML models, created for all pathways in the Nature PID.
Intends to parse and manipulate multiple aspects and properties of molecular data. fconv is a robust and comprehensive tool involved in a broad range of computational workflows that are currently applied in drug design environment. Typical tasks are as follows: conversion and error correction of formats such as PDB(QT), MOL2, SDF, DLG and CIF; extracting ligands from PDB as MOL2; automatic or ligand-based cavity detection; root-mean-square deviation (RMSD) calculation and clustering; substructure searches; alignment and structural superposition; building of crystal packings; adding hydrogens; calculation of various properties like the number of rotatable bonds; molecular weights or van der Waals volumes.
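Of the tasks listed above, the RMSD calculation is the easiest to show concretely. The Python sketch below computes the RMSD between two equally sized, already superposed coordinate sets; it performs no alignment or superposition and is not fconv's implementation. The function name and the example coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) coordinates.

    Assumes both lists describe the same atoms in the same order and that
    the structures are already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical example: the second pose is the first shifted by 1 along z.
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pose_a, pose_b))
```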
Permits quality control of Next-Generation-Sequencing (NGS) tumor-normal experiments. NGS-Bits is separated into four steps: (1) gather information from raw reads, (2) map reads, (3) extract variant lists, and (4) combine results from the preceding steps and add quality control (QC) metrics for tumor-normal experiments. This tool includes all stages of single-sample NGS data analysis and adds special QC metrics for DNA sequencing of tumor-normal pairs.
Analyzes or annotates VCF files and organizes tools that perform diverse analyses using VCF files. VCF-kit adds essential utilities to process and analyze VCF files, including primer generation for variant validation, dendrogram production, genotype imputation from sequence data in linkage studies, and additional tools. It can be used to produce a phylogenetic tree from a VCF. The tool centralizes a collection of tools and scripts using variant call format.
Identifies a web enabled isomorphic map between Variant Call Format (VCF) and Resource Description Framework (RDF). VCF2RDF is a VCF parser that acts as an isomorphic mapping function to (evolvable) linked data entirely within 3rd generation Web Technologies.
Allows users to reformat and filter bioinformatics files. JVARKIT aims to simplify the grammar used to filter bioinformatics files, making it possible to write a loop or a custom function. JVARKIT is a set of more than 100 Java-based tools for bioinformatics.
A collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencing machines usually produce FASTA or FASTQ files, containing multiple short-read sequences (possibly with quality information). The main processing of such FASTA/FASTQ files is mapping (aka aligning) the sequences to reference genomes or other databases using specialized programs. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others.
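A typical preprocessing step before mapping is to discard low-quality reads. The Python sketch below parses four-line FASTQ records and keeps reads whose mean quality passes a threshold. It is an illustration only, not the toolkit's code: the function names, the threshold of 20 and the Phred+33 quality encoding are assumptions, and the two records are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ file with 4-line records."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Hypothetical input: r1 has high qualities ('I' = 40), r2 very low ('!' = 0).
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n"
kept = [
    name
    for name, sequence, quality in read_fastq(io.StringIO(fastq_text))
    if mean_quality(quality) >= 20
]
print(kept)
```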
Reads files created with the graphical network editor Escher and converts them to files in community standard formats. EscherConverter is written in Java, and it is available as a standalone executable file that includes a graphical user interface with graph drawing capabilities and a command-line interface. It converts between SBML, SBGN-ML, and JSON-files.
Allows users to analyze, filter, annotate or transform biological sequence data. FAST can perform automated sampling, permutation and bootstrapping of sequences and sites, and compute population genetic statistics. It can help empower non-biologist programmers to develop and communicate bioinformatics workflows for scientific investigations and publishing.
Serves as a hub for data input, format conversion, and data export to other applications. ChIP-Convert imports and converts external data formats into compressed SGA. The software also provides more specific conversion schemes such as proper conversion of the BED-like narrow peak format used by ENCODE. It can be used to export data from the mass genome annotation (MGA) repository in other formats such as BED or FPS.
Enables genotyping and variant annotation of resequencing data produced by second generation next generation sequencing (NGS) technologies. CoVaCS is an automated system that provides tools for variant calling and annotation along with a pipeline for the analysis of whole genome shotgun (WGS), whole exome sequencing (WES) and targeted resequencing data (TGS). The software allows non-specialists to perform all steps from quality trimming to variant annotation.
Analyzes raw sequencing data from several next generation sequencing (NGS) platforms. MutAid is a pipeline performing six different steps: (i) quality control and filtering; (ii) mapping reads to a reference genome; (iii) variant detection; (iv) effect prediction; (v) cross-referencing; and (vi) production of a summary of all information generated. It can be used to interpret mutational variants from data generated by targeted gene-panel sequencing or whole genome sequencing.
Allows users to filter, convert and combine multiple data files produced by high-throughput technologies. HTDP aims to aid global, real-time processing of large data sets using a GUI. The software provides unlimited filtering and data reduction capabilities, also using itemized filtering conditions from external files. It can be used for conversion between different standard formats that are commonly used for high-throughput data.
Speeds up pre-processing for next-generation sequencing (NGS) data. sam2bam converts the data format from SAM to BAM. It consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and hardware compression accelerators. This tool provides plug-in functions that can be used to analyze, filter, and convert input data.
Aims to search and retrieve The Cancer Genome Atlas (TCGA) data. TCGA2BED converts these data into the structured BED format for their seamless use and integration. Additionally, it supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). TCGA2BED also provides an automatically updated data repository with publicly available Copy Number Variation (CNV), DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1,V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen metadata in attribute-value text format.
A simple GUI software tool for visualizing published ChIP-seq raw data. SraTailor automatically converts an SRA into a BigWig-formatted file. Simplicity of use is one of the most notable features of SraTailor: entering an accession number of an SRA and clicking the mouse are the only steps required to obtain BigWig-formatted files and to graphically visualize the extents of reads at given loci. SraTailor is also able to make peak calls, generate files of other formats, process users' own data, and accept various command-line-like options. Therefore, this software makes ChIP-seq data fully exploitable by a wide range of biologists.
A scalable bioinformatic tool for exploring and analyzing nanopore sequencing data that can run both on individual computers and in the Hadoop distributed computing framework. The Hadoop environment allows virtually unlimited scaling up in data size and provides better runtimes for datasets containing a large number of reads. HPG Pore allows efficient management of huge amounts of data and thus constitutes a practical solution for data analysis needs in the near future as well as a promising model for the development of new tools to deal with future genomic big data.
Allows users to convert between different Next Generation Sequencing (NGS) file formats. NGS-FC is cross-platform software that summarizes information from 14 NGS databases. It can be used as a converter tool or as a framework to add new conversion classes and databases. It supports external scripts, so format conversion scripts can be integrated.
Reads and writes nucleic/protein sequences in various formats. ReadSeq is a conversion program for bioinformatics, that can read and reformat 18 different formats. The software includes a Graphic User Interface (GUI), Command Line Interface (CLI) and also a Common Gateway Interface (CGI) for use from a web server.
Facilitates translation of biomedical research questions to language amenable for computational analysis. GROK supports various deep sequencing (DS)-related operations such as preprocessing, filtering, file conversion, and sample comparison. It supports major genomic file formats and allows storing custom genomic regions in efficient data structures such as red-black trees and SQL databases. The tool can facilitate answering biomedical research questions and establish experimentally testable predictions.
Provides utility functions implementing commonly used genomic operations. bedr is a formal BED-operations framework that offers a formal R interface to interact with BEDTools and BEDOPS. In addition to sort operations, it also supports identification of overlapping regions which can be collapsed to avoid downstream analytical challenges. This method is compatible with the ubiquitous BED tools paradigm and integrates with R-based workflows.
Consists of a set of programs aimed at exporting various formats (e.g. VCF, BAM, AXT). glactools represents genotype likelihoods or allele counts as block compressed binary files that can be indexed. It introduces two formats: GLF for storing genotype likelihoods and ACF for allele counts. The GLF format contains genotype likelihoods for single individuals. The ACF format stores the number of times a specific base is observed in an individual or population.
Allows users to convert GenBank files into Sequin format for further submission in NCBI databases. GenBank 2 Sequin is a web application which aims to propose a public browser to simplify the conversion of files without the need of advanced computation skills. The program can handle several files and includes options to refine the targeted sequence such as the ability to specify the user-needed organism, genetic code or molecular type.
Generates animated images of a molecular representation. Pdb2mgif creates animated images of molecules to be displayed by all standard web browsers without depending on an extra visualization program. This software exploits the 3D structure and produces an animated image that is not browser-specific.
Converts fsa files into PostScript format. FSA2PS provides an easy to handle software toolkit implemented in Perl. Due to its platform independency, it can be integrated into a variety of software projects or data analysis pipelines. The created PostScript depicts the chromatogram deduced from fragment analysis together with information on the analysis run above the displayed results. This PostScript’s output can afterwards be converted into various file formats.
With a PhD in Neuroscience, I worked for 8 years on the brain and its diseases. I then specialized in bioinformatics (NGS, epigenetics) and worked at CEA and GENETHON before joining OMICX to help the OMICtools community.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans. In plants, however, this phenomenon has not yet been explored. We are planning to discover fusion transcripts and gene fusions in different types of plants by using RNA-Seq datasets. Further, we are planning to understand the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analyzing whole genome and transcriptome sequencing data of different plants with bioinformatics tools to extract useful information from those datasets. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Developing databases is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is to develop automated pipelines and tools for the analysis of high throughput genomics data, generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017- present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University Of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial Board Member of Journal: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).