A SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates the individual score test statistics of the SNPs in a SNP set and efficiently computes SNP-set level p-values (e.g., a gene- or region-level p-value) while adjusting for covariates, such as principal components that account for population stratification. SKAT also supports power and sample size calculations for designing sequence association studies.
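The aggregation of per-SNP score statistics can be illustrated with a minimal sketch. It is not the SKAT package's code: it assumes a quantitative phenotype, an intercept-only null model (no covariates), and per-variant weights taken from the Beta(1, 25) density of the minor allele frequency, which is a commonly cited default but an assumption here. The function names are hypothetical, and p-value computation is left out.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density evaluated at maf, which equals 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


def skat_q(genotypes, phenotype):
    """Weighted sum of squared per-variant score statistics.

    genotypes: list of per-sample rows, each a list of minor-allele counts (0/1/2).
    phenotype: list of quantitative trait values, one per sample.
    Assumption: the null model contains only an intercept, so the residual
    of each sample is its phenotype minus the sample mean.
    """
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]
    q = 0.0
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        maf = sum(column) / (2.0 * n)
        # Score statistic of variant j: genotype column times null residuals.
        score = sum(g * r for g, r in zip(column, residuals))
        q += (beta_1_25_weight(maf) * score) ** 2
    return q


# Hypothetical toy data: 4 samples, 2 variants.
toy_genotypes = [[0, 1], [1, 0], [2, 0], [0, 0]]
toy_phenotype = [1.0, 2.0, 3.0, 2.0]
toy_q = skat_q(toy_genotypes, toy_phenotype)
```

With these weights, the rarer variant dominates the statistic, which is the intended effect of up-weighting rare variants.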
Recognizes and investigates gene-gene effects. hapConstructor employs a haplotype-mining method that can take into consideration multi-locus data at two genes and test for association and interaction. It is available through a Monte Carlo (MC) testing framework and provides empirical construction-wide significance assessment for hypothesis testing. This tool can be useful for hypothesis generation.
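A Monte Carlo testing framework of this kind can be sketched generically as a label-permutation test. This is not hapConstructor's implementation: the function name, the add-one correction, and the choice of shuffling case/control labels are all assumptions made for illustration.

```python
import random


def permutation_p_value(statistic, labels, data, n_permutations=999, seed=0):
    """Empirical p-value of statistic(labels, data) under label shuffling.

    statistic: callable taking (labels, data) and returning a number,
               where larger values mean stronger evidence of association.
    labels:    case/control labels, one per sample.
    data:      per-sample data passed through to the statistic unchanged.
    """
    rng = random.Random(seed)
    observed = statistic(labels, data)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled, data) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_permutations + 1)
```

The same loop gives a significance level that accounts for searching over many haplotype models if the statistic is taken to be the best result over all models considered in each permuted dataset.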
A suite of R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The main functions are: haplo.em, haplo.glm, haplo.score, haplo.power, and seqhap.
Inference of trait associations with SNP haplotypes and other attributes using the EM algorithm. The R functions are used for inference of trait associations with haplotypes and other covariates in generalized linear models. The functions are developed primarily for data collected in cohort or cross-sectional studies. They can accommodate uncertain haplotype phase and handle missing genotypes at some SNPs.
Offers a platform for performing genome-wide association studies (GWAS) based on haplotypes. ParaHaplo relies on data parallelism to speed up the assessment of both haplotypes and P values. The application can be used in conjunction with other software for: (i) genotype imputation and haplotype reconstruction; (ii) haplotype estimation; and (iii) haplotype-based GWAS.
Provides an assortment of methods to set up and fit a wide range of models. BhGLM is an R package developed to handle about six different types of models, including Bayesian hierarchical, negative binomial, and Cox survival models. The application includes features to compute measures for evaluating a given model, as well as utilities that summarize it numerically and graphically.
A weighted haplotype-based approach and an imputation-based approach, to test for the effect of rare variants with GWAS data. Both methods can incorporate external sequencing data when available. We evaluated our methods and compared them with methods proposed in the sequencing setting through extensive simulations. Our methods clearly show enhanced statistical power over existing methods for a wide range of population-attributable risk, percentage of disease-contributing rare variants, and proportion of rare alleles working in different directions.
Models haplotype association with disease in population studies. GENEBPM is a reversible-jump Markov chain–Monte Carlo (MCMC) algorithm that assesses the evidence in favor of disease association with polymorphisms in a candidate gene or a small candidate region. This method was developed to obtain maximum-likelihood estimates of the relative frequencies of haplotypes consistent with a sample of observed single-nucleotide–polymorphism (SNP) genotypes.
Allows users to solve the single individual haplotyping (SIH) problem. PEATH identifies reliable haplotypes, with low error rates and longer haplotype lengths. It achieves the best phased length and N50 values: the haplotype length is initialized to the total number of mutation sites, and phasing blocks are split only where no overlapping sequence reads connect them. Moreover, this algorithm can be useful for long-read sequencing technologies.
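For readers unfamiliar with the N50 metric, a minimal sketch follows. It uses the common definition (the largest length L such that blocks of length at least L cover half of the total phased length); whether PEATH's evaluation measures block length in base pairs or in variant sites is not stated here, so the unit is left to the caller. The function name is hypothetical.

```python
def n50(block_lengths):
    """Return the N50 of a collection of phased block lengths."""
    total = sum(block_lengths)
    running = 0
    # Walk from the longest block down until half the total is covered.
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("block_lengths must not be empty")


print(n50([2, 3, 4, 5, 6]))  # prints 5
```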
A package designed to call haplotypes from phased marker data. GHap R identifies the different haplotype alleles (HapAllele) present in the data and scores sample haplotype allele genotypes based on HapAllele dose (i.e., 0, 1 or 2 copies). The output is not only useful for analyses that can handle multi-allelic markers, but is also conveniently formatted for existing pipelines intended for bi-allelic markers.
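The dose scoring described above can be sketched as follows. This is an illustration only, not the GHap API: the function name and the input layout (one pair of phased haplotype strings per sample, restricted to a single block) are assumptions.

```python
from collections import Counter


def hapallele_doses(phased_samples):
    """Score haplotype-allele doses (0, 1 or 2 copies) per sample.

    phased_samples: dict mapping sample id -> (haplotype_a, haplotype_b),
                    each haplotype a string of alleles within one block.
    Returns (alleles, doses): the sorted distinct haplotype alleles, and a
    dict mapping sample id -> list of copy counts aligned with alleles.
    """
    alleles = sorted({h for pair in phased_samples.values() for h in pair})
    doses = {}
    for sample, pair in phased_samples.items():
        counts = Counter(pair)
        doses[sample] = [counts[a] for a in alleles]
    return alleles, doses


# Hypothetical toy block of three markers.
toy_samples = {
    "s1": ("AAB", "AAB"),
    "s2": ("AAB", "ABB"),
    "s3": ("ABB", "BBB"),
}
toy_alleles, toy_doses = hapallele_doses(toy_samples)
```

Each haplotype allele becomes its own column of 0/1/2 values, which is why the result can be passed to pipelines written for bi-allelic markers.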
Combines an algorithm designed to cluster haplotypes of interest from a given set of haplotypes with two existing tools: Haploview, for analyses of linkage disequilibrium blocks and haplotypes, and PLINK, to generate all possible diplotypes from given genotypes of samples and calculate linear or logistic regression. In addition, procedures for generating all possible diplotypes from the haplotype clusters and transforming these diplotypes into PLINK formats were implemented. Diplotyper is a fully automated tool for performing association analysis based on diplotypes in a population. Diplotyper is useful for identifying more precise and distinct signals over single-locus tests.
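Enumerating all possible diplotypes from a set of haplotype clusters amounts to listing unordered pairs with repetition. A minimal sketch, assuming cluster labels are plain strings (the labels and the function name are hypothetical and not Diplotyper's own):

```python
from itertools import combinations_with_replacement


def all_diplotypes(haplotypes):
    """Return every unordered pair of haplotype labels, homozygous pairs included."""
    return list(combinations_with_replacement(sorted(set(haplotypes)), 2))


# Three haplotype clusters give 3 * (3 + 1) / 2 = 6 diplotypes.
diplotypes = all_diplotypes(["H1", "H2", "H3"])
```

In general, k haplotype clusters yield k(k + 1)/2 diplotypes, so the number of regression terms grows quadratically with the number of clusters.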
A tree-based ensemble method that takes into account the correlation structure among the genetic markers implied by linkage disequilibrium in GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.
A fast predictor for the inference of blood groups from single nucleotide variant (SNV) databases. BOOGIE correctly predicted the blood group with 94% accuracy for the Personal Genome Project whole genome profiles where good quality SNV annotation was available. Additionally, BOOGIE produces a high quality haplotype phase, which is of interest in the context of ethnicity-specific polymorphisms or traits. The versatility and simplicity of the analysis make it easily interpretable and allow easy extension of the protocol towards other phenotypes.
A generalized linear model with regularization approach for detecting disease-haplotype association using unphased single nucleotide polymorphism data that is applicable to both common disease/common variant (CD/CV) and common disease/rare variant (CD/RV) scenarios. Our simulation study demonstrates the gain in power for detecting associations with moderate sample sizes. For detecting association under CD/RV, regression-type methods such as that implemented in hapassoc may fail to provide coefficient estimates for rare associated haplotypes, resulting in a loss of power compared to rGLM. Furthermore, our results indicate that rGLM can uncover the associated variants much more frequently than hapassoc can.
An R package that performs Logistic Bayesian Lasso for finding association of SNP haplotypes and environmental factors with a trait in a case-control setting. The Bayesian lasso is used to find the posterior distributions of logistic regression coefficients, which are then used to calculate a Bayes factor to test for association.
Converts genotyping input into various outputs. SNPTransformer accepts linkage and chip formats as input and transforms them into the input formats of packages for association analysis, transmission disequilibrium tests (TDT), calculation of linkage disequilibrium (LD) measures, haplotype inference, haplotype block partition, tagSNP selection and multilocus interaction analysis. This tool can be used to perform data analysis for genome-wide association studies (GWAS).
Detects disease association by a set of markers, at any user-specified polymorphic site(s), under an arbitrary disease model and sample size. HaploPowerCalc uses an approach based on haplotype sampling. The software is designed for users who wish to estimate the power (or the sample size required to obtain adequate power) of their association study.
Builds a collection of haplotype polymorphisms (haploSNPs) from phased genotype data. HaploSNP employs the HAPI-UR method to computationally phase genotypes. It then creates haploSNPs (one for the ancestral allele and one for the derived allele) for each polymorphic SNP and extends them until it finds a mismatch explained only by a recombination between the current haploSNP and the mismatch SNP.
Provides haplotype analysis in unrelated individuals that can treat quantitative, binary, survival and polychotomous phenotype analyses. THESIAS is a multiple-imputation algorithm that never assigns haplotypes to individuals. It is based on the Stochastic Expectation Maximisation (SEM) algorithm, a method that has the advantage over the standard EM algorithm of being more robust to problems of lack of convergence and convergence to local minima.
A package for haplotype block identification, visualization and haplotype tagging single nucleotide polymorphism (htSNP) selection. HaploBlockFinder can also compare the haplotype block structure with the local linkage disequilibrium (LD) pattern. As HaploBlockFinder is based on a greedy algorithm, an inherent limitation is that it does not guarantee that the haplotype blocks are globally optimal. In addition, this tool can be run either as a web service or as a standalone executable on a local machine.
Estimates haplotypes within each haplotype block. htSNPer allows molecular geneticists to perform haplotype block analysis and haplotype tagging SNP (htSNP) selection using different definitions, different performance criteria, and different algorithms. It integrates four haplotype block definitions: chromosome coverage, average pairwise LD |D'|, estimated pairwise LD confidence limits with minor modifications, and no historical recombination.
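The pairwise LD measure |D'| used in one of these block definitions can be computed from phased haplotypes as sketched below. This is the standard textbook normalization of D, not htSNPer's code. The 0/1 allele coding and the function name are assumptions.

```python
def abs_d_prime(haplotypes):
    """|D'| between two bi-allelic loci.

    haplotypes: list of (a, b) pairs, one per phased chromosome,
                with each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # allele-1 frequency, locus 1
    p_b = sum(b for _, b in haplotypes) / n   # allele-1 frequency, locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    # D is normalized by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("both loci must be polymorphic")
    return abs(d) / d_max
```

A value of 1 means at most three of the four possible two-locus haplotypes are present, which is the pattern expected in the absence of historical recombination between the two sites.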
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans, but in plants this phenomenon has not yet been explored. We plan to discover fusion transcripts and gene fusions in different types of plants using RNA-Seq datasets. Further, we plan to study the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analysing whole genome and transcriptome sequencing data from different plants to extract useful information with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Databases are necessary to compile information and share it with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is the development of automated pipelines and tools for the analysis of high-throughput genomics data generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017-present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate Editor: Journal of Translational Medicine.
Editorial Board Member: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of genomes. (http://crdd.osdd.net/raghava/genomesrs/genotrick/).
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).