Identifies structural variation (SV) by whole-genome de novo assembly. SOAPsv aims to show that SVs account for a greater fraction of the diversity between individuals than single nucleotide polymorphisms (SNPs) do. This software also demonstrates that de novo assembly can detect SVs across a large range of lengths. The resulting SV maps of human genomes allow an initial description of the genomic patterns of SVs and their relationship with a variety of genomic features.
To characterize the mutational spectrum of somatic SVs in cancer, it is important to identify both simple (e.g., deletion, insertion, and inversion) and complex SVs at base-pair resolution. Meerkat predicts both germline and somatic SVs directly from short read data, focusing on complex events.
A tool designed for efficient and accurate variant-detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data.
A Perl/C++ package that provides genome-wide detection of structural variants from next generation paired-end sequencing reads. BreakDancer sensitively and accurately detected indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.
Assists users in inferring the underlying genotype of each structural variant (SV). SVTyper is a Bayesian likelihood algorithm that can operate on copy-neutral events such as inversions and translocations as well as copy number variants (CNVs). It permits the production of SV genotypes, useful for meaningful variant interpretation, as well as quantitative estimates of breakpoint allele frequencies that allow inference of the fraction of tumor cells that carry a particular variant.
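The idea of a Bayesian genotype likelihood can be illustrated with a minimal sketch. It is not SVTyper's implementation: the binomial read-count model, the expected alternate-read fractions (0.1, 0.5, 0.9), the uniform prior and the function names are all assumptions made for illustration.

```python
from math import comb

GENOTYPES = ("0/0", "0/1", "1/1")

def genotype_posteriors(ref_reads, alt_reads,
                        alt_fraction=(0.1, 0.5, 0.9),
                        prior=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior probability of each genotype given read counts at a breakpoint.

    alt_fraction holds the assumed share of alternate-supporting reads
    expected under each genotype (hypothetical values, not tool defaults).
    """
    n = ref_reads + alt_reads
    # Binomial likelihood of observing alt_reads out of n under each genotype.
    likelihoods = [comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
                   for p in alt_fraction]
    weighted = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(weighted)
    return {gt: w / total for gt, w in zip(GENOTYPES, weighted)}

def call_genotype(ref_reads, alt_reads):
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(ref_reads, alt_reads)
    return max(posteriors, key=posteriors.get)
```

Under these assumptions, a site with 10 reference-supporting and 10 alternate-supporting reads would be called heterozygous, and the ratio alt_reads / (ref_reads + alt_reads) gives a crude estimate of the breakpoint allele frequency.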
Detects and genotypes insertions and deletions from paired-end reads. CTK is a suite of tools for next-generation sequencing (NGS) data analysis and is based on an internal segment size approach to discover indel variation from paired-end read data. It also contains, among others, a long-indel-aware read mapper (LASER), a converter from BAM to a list of alignment pairs with prior probabilities, and a feature for splitting by chromosome.
A computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. The package is composed of three modules: PEMer workflow, SV-Simulation and BreakDB. PEMer workflow is sensitive software for detecting SVs from paired-end sequence reads. SV-Simulation randomly introduces SVs into a given genome and generates simulated paired-end reads from the ‘novel’ genome. Subsequent analysis of the simulated reads with PEMer workflow can help parameterize PEMer workflow. BreakDB is a web-accessible database developed to store, annotate and display SV breakpoint events identified by PEMer and from other sources.
Provides computational tools and methods for high-quality insertion sequence (IS) annotation. ISsaga uses established ISfinder annotation standards and permits rapid processing of single or multiple prokaryote genomes. ISsaga provides general prediction and annotation tools, information on genome context of individual ISs and a graphical overview of IS distribution around the genome of interest.
Allows identification of genomic rearrangements. GRIDSS is a modular software suite containing tools that perform genome-wide break-end assembly prior to variant calling using a positional de Bruijn graph assembler. The GRIDSS pipeline comprises three distinct stages: extraction, assembly, and variant calling. The software identifies non-template sequence insertions, microhomologies and large imperfect homologies, and supports multi-sample analysis.
Integrates sequencing reads from next-generation sequencing (NGS) and single-molecule sequencing (SMS) technologies to accurately assemble and detect structural variations (SVs) in the human genome. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, the approach turns a whole-genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance assembly of structurally altered regions in the human genome.
Identifies regions of the genome suspected to harbor a complex event. SVelter then resolves the structure by iteratively rearranging the local genome structure, in a randomized fashion, with each structure scored against characteristics of the observed sequencing data. SVelter is able to accurately reconstruct complex chromosomal rearrangements when compared to well-characterized genomes that have been deeply sequenced with both short and long reads. SVelter is able to interrogate many different types of rearrangements, including multi-deletion and duplication-inversion-deletion events as well as distinct overlapping variants on homologous chromosomes.
An approach that uses a 'kmer' strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants.
Detects and visualizes structural variation from paired-end mapping data. Under this scheme, abnormally mapped read pairs are clustered based on the location of a gap signature. Several important features, including local depth of coverage, mapping quality and associated tandem repeats, are used to evaluate the quality of predicted structural variation. Compared with other approaches, it can detect many more large insertions and complex variants with a lower false discovery rate. Moreover, inGAP-sv, written in the Java programming language, provides a user-friendly interface and can be run on multiple operating systems.
Identifies transposase sequences, inverted repeats and candidate target direct repeats of insertion sequences (ISs) in complete genomes. IScan is able to identify ISs with an arbitrary number of ORFs, including ISs with ORFs encoded on both strands. IS annotation in existing genomes may be highly heterogeneous, because different researchers may use different annotation methods. A tool like IScan thus allows the user to create consistent IS annotation with multiple user-specified parameters (repeat length, sequence similarity to a reference family member, etc.) across multiple genomes. This consistency and flexibility is essential for detailed analyses of IS evolution across multiple genomes.
A computational tool for automated annotation of insertion sequences (ISs). OASIS takes advantage of widely available transposase annotations to identify candidate ISs and then uses a computationally efficient maximum likelihood method of multiple sequence alignment to identify the edges of each element. Thanks to its speed and flexibility, OASIS is capable not only of providing detailed IS information for a single genome but also of annotating thousands of genomes within hours, making it a valuable high-throughput tool for a global investigation of IS distribution across diverse taxa.
Given paired-end mapped reads and a candidate high-copy region, Reprever identifies (a) the insertion breakpoints where the extra duplicons were inserted into the donor genome and (b) the actual sequence of the duplicon. Reprever resolves ambiguous mapping signatures from existing homologs, repetitive elements and sequencing errors to identify breakpoints. At each breakpoint, Reprever reconstructs the inserted sequence using profile hidden Markov model (PHMM)-based guided assembly.
A statistical framework and algorithm for structural variant (SV) detection from whole genome sequencing data. SWAN integrates multiple features, including insert size, hanging read pairs and read coverage into one statistical framework and detects putative SVs through genome-wide likelihood ratio scans. SWAN remaps soft-clip/split read clusters to supplement the likelihood analysis, joins multiple sources of evidence and identifies break points whenever possible. SWAN has improved sensitivity for detecting structural variants smaller than 10 kilobases and is particularly successful at identifying deletions smaller than 500 base pairs.
Detects breakpoints of large deletions and medium sized insertions from paired-end short reads. Pindel is a program that uses a pattern growth algorithm to identify the breakpoints of large deletions (1 bp–10 kb) and medium sized insertions (1–20 bp) from 36 bp paired-end short reads. The software can be useful for addressing the structural variations between individuals from next-generation high-throughput sequencing.
Automates identification of insertion sequences (ISs) in prokaryotic genomes. ISEScan is a pipeline capable of providing detailed IS information without requiring pre-annotated genome data such as GenBank genome annotation. It can also perform better than previous automatic IS annotation systems. This resource offers the community a powerful tool to discover the roles of IS elements in the evolution of prokaryotic genomes.
A comprehensive analysis platform for the processing, analysis and visualization of structural variation based on sequencing data or genomic microarrays, enabling the rapid identification of disease loci or genes. Vivar allows you to scale your analysis with your workload over multiple (cloud) servers, has user access control to keep your data safe but still easy to share, and is easily expandable as analysis techniques advance.
Identifies genomic structural variations from paired-end and mate-pair sequencing data. SVDetect isolates and predicts intra- and inter-chromosomal rearrangements from paired-end/mate-pair data produced by high-throughput sequencing technologies. This software proceeds first by collecting all pairs that are suspected to come from the same structural variant (SV). It then employs a sliding-window strategy to detect all groups of pairs sharing a similar genomic location.
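A sliding-window grouping of anomalous pairs can be sketched as follows. This is only an illustration of the general idea, not SVDetect's code: the window size, step, minimum pair count, input layout and function names are assumptions.

```python
from collections import defaultdict
from itertools import product

def windows_for(pos, size, step):
    """Indices i of all windows [i*step, i*step + size) that contain pos."""
    last = pos // step
    first = max(0, (pos - size) // step + 1)
    return range(first, last + 1)

def group_pairs(pairs, size=1000, step=500, min_pairs=2):
    """Group anomalous read pairs whose two ends fall in the same two windows.

    pairs: iterable of ((chrom1, pos1), (chrom2, pos2)) for pairs already
    flagged as anomalous. Returns a dict mapping each link
    (chrom1, window1, chrom2, window2) to the pairs that support it.
    """
    links = defaultdict(list)
    for pair in pairs:
        (chrom1, pos1), (chrom2, pos2) = pair
        # Windows overlap, so one read end can belong to several windows.
        for w1, w2 in product(windows_for(pos1, size, step),
                              windows_for(pos2, size, step)):
            links[(chrom1, w1, chrom2, w2)].append(pair)
    return {link: members for link, members in links.items()
            if len(members) >= min_pairs}
```

Because the windows overlap, the same group of pairs can be reported under several neighbouring links; a real pipeline would typically merge or rank these redundant links in a later step.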
An algorithm for the correct alignment of two nucleotide sequences containing SVs, i.e. deletion, insertion, tandem duplication or inversion. The algorithm does not require the adjustment or modification of the alignment scoring scheme(s) that is usually tuned for a particular alignment purpose, e.g. cross-species, contig or read alignments. Thus, the algorithm can be universally applied in various biological studies relying on alignment.
Detects structural variants from targeted short-DNA reads. Both real and simulated data are used to demonstrate SLOPE's ability to rapidly detect insertion/deletion events of various sizes as well as translocations and viral integration sites with high sensitivity and low false discovery rate.
A method that identifies SVs and their precise breakpoints from whole-genome resequencing data. PRISM uses a split-alignment approach informed by the mapping of paired-end reads, hence enabling breakpoint identification of multiple SV types, including arbitrary-sized inversions, deletions and tandem duplications.
Detects somatic structural variations (SVs) and viral integration events. Seeksv simultaneously uses the split read signal, the discordant paired-end read signal, the read depth signal and fragments with both ends unmapped. It can detect deletion, insertion, inversion and interchromosome transfer at single-nucleotide resolution. Unlike other methods, seeksv merges soft-clipped reads from the same breakpoint into a long clipped sequence by itself and does not rely on any assembly software.
Assists users in discovering and scoring structural variants (SVs), medium-sized indels and large insertions. Manta was developed to discover variants from a sequencing assay’s paired and split-read mapping information. It automates estimation of insert size distribution and exclusion of high depth reference compression regions. This method also includes scoring models for germline analysis of diploid individuals and somatic analysis of tumor-normal sample pairs.
Finds deletions with exact breakpoints from low-coverage next-generation sequencing (NGS) data. SVseq performs two steps: it (1) applies an enhanced split reads mapping approach to identify candidate deletion sites from sequence reads, and (2) uses mapped paired-end reads spanning candidate deletions as supports to filter false positives. It was tested using the 1000 genomes project pilot low-coverage data and pilot high-coverage data.
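The second step described above, using spanning paired-end reads to filter candidate deletions, might look like the following minimal sketch. The data layout, thresholds and function name are hypothetical and are not taken from SVseq.

```python
def filter_deletions(candidates, discordant_pairs,
                     min_support=2, slack=10, max_insert=500):
    """Keep candidate deletions spanned by enough discordant read pairs.

    candidates: list of (start, end) deletion intervals from split reads.
    discordant_pairs: list of (left_end, right_start) coordinates of pairs
        whose mapped distance is larger than expected.
    A pair supports a candidate when its left mate ends just upstream of
    the deletion start and its right mate begins just downstream of the end.
    """
    kept = []
    for start, end in candidates:
        support = sum(
            1 for left_end, right_start in discordant_pairs
            if start - max_insert <= left_end <= start + slack
            and end - slack <= right_start <= end + max_insert
        )
        if support >= min_support:
            kept.append((start, end))
    return kept
```

With these assumed thresholds, a candidate deletion at 1000 to 1500 supported by pairs at (950, 1550) and (980, 1600) is kept, while a candidate with no spanning pairs is discarded as a likely false positive.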
An integrated structural variation (SV) caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes.
A software tool that performs detection and assembly of DNA insertion variants in NGS read datasets with respect to a reference genome. MindTheGap is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads.
Conducts multiple splits at arbitrary locations in a read. Gustaf can deal with single-end and paired-end reads. It discovers local alignments of a read, and then chains local alignments into a semi-global read-to-reference alignment. This tool recognizes dispersed duplications and intra-chromosomal translocations with exact breakpoints. It utilizes standard graph algorithms to assess relationships of the alignments.
A program designed to process previously aligned, Illumina Paired-end whole genome sequence data to identify structural variants such as deletions, insertions and tandem duplications. Simulations using RAPTR-SV and other, similar algorithms for SV detection revealed that RAPTR-SV had superior sensitivity and precision, as it recovered 66.4% of simulated tandem duplications with a precision of 99.2%.
A method for discovering and genotyping novel sequence insertions. PopIns takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions.
Detects structural variants in cancer using whole genome sequencing data with or without a matched normal control sample. SV-Bay uses not only information about abnormal read mappings but also assesses changes in the copy number profile and tries to associate these changes with candidate SVs. The likelihood of each novel genomic adjacency is evaluated using a Bayesian model. In its final step, SV-Bay annotates genomic adjacencies according to their type and, where possible, groups detected genomic adjacencies into complex SVs such as balanced translocations, co-amplifications, and so on. A comparison of SV-Bay with BreakDancer, Lumpy, DELLY and GASVPro demonstrated its superior performance on both simulated and experimental datasets.
Detects structural variations (SVs) in mate pair (MP) datasets. Ulysses is a paired-end method (PEM)-based software including an SV scoring module, which improves SV detection accuracy in MP libraries. This method can annotate the full spectrum of SV, including deletions (DEL), segmental duplications (DUP), inversions (INV), small insertions (sINS, with a size smaller than the library IS), large insertions (INS), reciprocal translocations (RTs) and non-reciprocal translocations (NRT).
Identifies bacterial insertion sequences (ISs) and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS elements.
A probabilistic method for somatic structural variation (SV) prediction by jointly modeling discordant and concordant read counts. PSSV is specifically designed to predict somatic deletions, inversions, insertions and translocations by considering their different formation mechanisms. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer data to identify somatic SVs of key factors associated with breast cancer development.
Employs a two-stage process to evaluate and refine structural variation (SV) predictions. SVmine is an algorithm for further mining of SV predictions from multiple algorithms to improve the sensitivity, specificity and breakpoint resolution of SV detection. It first performs quality evaluation and filters low quality SV predictions. Then, it refines breakpoint positions of the high quality SVs by performing precise “sandwich” realignments of soft-clipped reads. The realignment strategy used by SVmine can also be generalized to Pacbio long read data.
Detects the insertion of all mobile elements or chromosomal rearrangements. PanISa is a program designed for the ab initio detection of insertion sequence (IS) in bacterial genomes. It can accelerate the identification of new ISs from short-read data. It can also be useful for the elucidation of the evolution of prokaryote lineages. This method was validated on the genomes of five major human bacterial pathogens.
Tools for insert site detection and for the assembly of novel insertions. BASIL features an efficient sliding window implementation for clustering read alignments at insertion sites and clipping signals into a novel combination. ANISE allows for the practical assembly of insertions that is robust to repeated copies and uses the overlap-layout-consensus (OLC) assembler approach.
Allows variant detection, combining mismatch, split-read, read pair, and read depth whole genome sequence (WGS) evidence. GROM is able to detect single nucleotide variants (SNVs), indels, structural variants (SVs), and copy number variants (CNVs). It can determine abnormal insert size by employing a sample of 10 million paired reads. This tool provides functions to simultaneously perform duplicate filtering.
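Deriving an abnormal insert size threshold from a sample of read pairs can be sketched as below. This is a simplified illustration rather than GROM's actual procedure: the mean and standard deviation model, the three-standard-deviation cutoff and the function names are assumptions.

```python
from statistics import mean, stdev

def insert_size_bounds(sampled_sizes, n_sd=3.0):
    """Estimate the normal insert size range from a sample of read pairs."""
    mu = mean(sampled_sizes)
    sd = stdev(sampled_sizes)
    return mu - n_sd * sd, mu + n_sd * sd

def is_abnormal(insert_size, bounds):
    """True when a pair's insert size falls outside the estimated range."""
    low, high = bounds
    return insert_size < low or insert_size > high
```

A pair whose insert size is far above the upper bound hints at a deletion between the mates, while one far below the lower bound hints at an insertion. Real insert size distributions are often skewed, so a median-based estimate may be more robust than the mean used here.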
Calculates annotations from one or more aligned BAM files from many high-throughput sequencing technologies, and then builds a one-class model using these annotations to classify candidate structural variants (SVs) as likely true or false positives. The SVClassify method gives the highest scores to SVs that are insertions or large homozygous deletions and that have accurate breakpoints. Deletions smaller than 100 bp often have low scores with this method, so other methods such as svviz are likely to give better results for very small SVs.
Allows detection of deletions, and is particularly designed to handle overlapping deletions. AggloIndel was applied to a data set from several samples taken from an acute lymphoblastic leukaemia patient. It collects into clusters the mappings of paired ends, aligned to a reference genome, that likely originate from the same deletion. This tool employs the technique of agglomerative clustering: it repeatedly merges the pair of clusters of maximum similarity and replaces them with a new cluster.
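The agglomerative step can be sketched in a few lines. This is a generic illustration, not AggloIndel's algorithm: representing each cluster by an averaged interval, using Jaccard overlap as the similarity, and the stopping threshold are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap length divided by union length of two intervals."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / union

def agglomerate(intervals, min_similarity=0.5):
    """Greedily merge the most similar pair of clusters until none is similar enough.

    intervals: list of (start, end) deletion intervals implied by read pairs.
    Returns clusters as (mean_start, mean_end, member_count) tuples.
    """
    clusters = [(float(s), float(e), 1) for s, e in intervals]
    while len(clusters) > 1:
        # Find the pair of clusters with maximum similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: jaccard(clusters[ij[0]], clusters[ij[1]]))
        if jaccard(clusters[i], clusters[j]) < min_similarity:
            break
        a, b = clusters[i], clusters[j]
        n = a[2] + b[2]
        # Replace the two clusters with one whose bounds are the weighted mean.
        merged = ((a[0] * a[2] + b[0] * b[2]) / n,
                  (a[1] * a[2] + b[1] * b[2]) / n,
                  n)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The pairwise search makes each round quadratic in the number of clusters, which is acceptable for a sketch but would need an index or priority queue for genome-scale data.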
Automates the discovery of structural variations (SVs). Tardis is a toolkit that integrates read pair, read depth, and split read (using soft-clipped mappings) sequence signatures to discover several types of SV, while resolving ambiguities among different putative SVs. This application is suitable for cloud use as the memory footprint is low. It is also capable of characterizing deletions, small novel insertions, tandem duplications, inversions, and mobile element retrotransposition.
Finds structural variant breakpoints in Illumina paired-end next-generation sequencing (NGS) data. SoftSearch is a breakpoint detection tool for paired-end NGS data that uses multiple sequence features to infer breakpoints, characterizing the location and type of structural variants. The software can identify large insertions, large deletions, inversions, tandem duplications, novel sequence insertion locations, and chromosomal translocations.
Integrates calls from one or more breakpoint detection methods and classifies the structural variant (SV). CLOVE builds a graph data structure from the provided breakpoint information and then looks for patterns that are characteristic of more complex rearrangement types. It is able to classify complex events from the data. The tool is a flexible method that can be used in any SV calling pipeline. An attractive feature is that it can process joint inputs from multiple methods.
Calculates the distance between two multi-chromosomal genomes with unequal content, but without duplicated markers. UniMoG can also be used to sort one genome into another using double cut and join (DCJ) operations, insertions and deletions. It supports one pair or an arbitrary number of genomes as input.
Provides a structural variation (SV) caller for long reads. Sniffles is mainly designed for PacBio reads, but also works on Oxford Nanopore reads. SVs are larger events on the genome (e.g. deletions, duplications, insertions, inversions and translocations). Sniffles can detect all of these types and more, such as nested SVs (e.g. an inversion flanked by deletions or an inverted duplication). Furthermore, Sniffles incorporates multiple auto-tuning functions that determine data-set-dependent parameters to reduce the overall risk of falsely inferring SVs.
A package for aligning sequences to a reference genome. MUMdex consists of an aligner, an alignment format, analysis software and a portable population database of common structural variants to aid filtering. The aligner saves read pair information in an indexed, lossless, compact binary format as MUMs plus the sequence not covered by MUMs. MUMdex computes a numerical invariant for each pair of MUMs and, depending on the value, signals either genome rearrangements (inversions, translocations or indels) or problems in the reference genome. It can also detect single nucleotide polymorphisms (SNPs), but less efficiently than standard methods.
Identifies structural variants from de novo assemblies. PAVFinder is able to detect translocations, inversions, duplications, insertions, deletions, and simple-repeat expansions/contractions for genomic structural variants. It can also be applied to transcriptomic structural variants and transcriptomic splice variants to find information such as gene fusions, partial tandem duplications (PTD), skipped exons or retained introns, among others.
Searches for a subsequence in a genomic sequence. Cassiopee-c is able to find exact matches and also allows substitutions, insertions and deletions. It can build an index based on a suffix tree with compression. The tool offers a way to save returned sequences to avoid reindexing a whole sequence in the case of large datasets.
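Searching with substitutions, insertions and deletions can be illustrated with a small dynamic-programming scan. This sketch does not use a suffix tree and is not Cassiopee-c's implementation; it is a plain approximate matcher with a hypothetical function name, shown only to make the notion of a match within a given number of edits concrete.

```python
def approximate_matches(pattern, text, max_edits):
    """Return end positions (exclusive) in text where pattern matches
    some substring with at most max_edits substitutions, insertions
    or deletions.
    """
    m = len(pattern)
    # column[i] = fewest edits to align pattern[:i] with a substring of the
    # text ending at the current position; a match may start anywhere, so
    # column[0] is always 0.
    column = list(range(m + 1))
    ends = []
    for j, char in enumerate(text):
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            substitution = column[i - 1] + (pattern[i - 1] != char)
            extra_text_char = column[i] + 1
            missing_text_char = new[i - 1] + 1
            new[i] = min(substitution, extra_text_char, missing_text_char)
        column = new
        if column[m] <= max_edits:
            ends.append(j + 1)
    return ends
```

The scan takes time proportional to the pattern length times the text length for every query, which is the cost an index such as a suffix tree is meant to avoid when many patterns are searched against the same sequence.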
Holding a PhD in neuroscience, I worked for 8 years on the brain and its diseases. I then specialized in bioinformatics (NGS, epigenetics) and worked at CEA and GENETHON before joining OMICX to help the OMICtools community.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans, but in plants this phenomenon has not yet been explored. We plan to discover fusion transcripts and gene fusions in different types of plants using RNA-Seq datasets. We also plan to study the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analysing whole genome and transcriptome sequencing data from different plants to extract useful information with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Databases are necessary to compile information and share it with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is the development of automated pipelines and tools for the analysis of high-throughput genomics data generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017- present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University Of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial Board Member of Journal: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).