Finds associations in large datasets. IHW is based on the Benjamini-Hochberg procedure and uses weights derived from the data. It divides the tests into groups based on the covariate. This tool assigns low weight to covariate-groups with low signal. It is capable of avoiding loss of false discovery rate (FDR) control by employing randomization in the form of hypothesis splitting into k-folds.
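The weighting idea behind this entry can be illustrated with a weighted Benjamini-Hochberg step-up rule. The following is only a minimal sketch, not IHW itself: it assumes the weights are supplied by the caller, whereas IHW learns them from the covariate across folds. The function name `weighted_bh` and its parameters are hypothetical.

```python
def weighted_bh(pvalues, weights, alpha=0.1):
    """Return the sorted indices rejected by a weighted BH step-up rule.

    Illustrative sketch only: weights are taken as given (assumption),
    not learned from a covariate as IHW does.
    """
    m = len(pvalues)
    if m == 0:
        return []
    if len(weights) != m:
        raise ValueError("pvalues and weights must have the same length")
    mean_w = sum(weights) / m
    if mean_w <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    # Rescale the weights to average 1 so the overall error budget is unchanged.
    norm = [w / mean_w for w in weights]
    # Dividing by the weight favours high-weight hypotheses;
    # a zero weight means the hypothesis can never be rejected.
    adjusted = [p / w if w > 0 else float("inf") for p, w in zip(pvalues, norm)]
    order = sorted(range(m), key=lambda i: adjusted[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose value is under the BH line.
        if adjusted[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

With p-values `[0.02, 0.03, 0.6, 0.9]` and `alpha=0.05`, equal weights reject nothing, while weights `[1.8, 1.8, 0.2, 0.2]` (low weight on the low-signal group) reject the first two hypotheses.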
Provides a batch version of a neural learning algorithm for Independent Component Analysis (ICA). fastICA is an R package based on a fixed-point method. It was introduced using a very simple yet highly efficient fixed-point iteration scheme for finding the local extrema of the kurtosis of a linear combination of the observed variables. The computations can be performed either in batch mode or semi-adaptively.
MXM is a flexible R package which offers feature selection algorithms for predictive or diagnostic models along with (Bayesian) network construction algorithms. State-of-the-art feature selection algorithms include FBED and SES, with the latter returning multiple sets of statistically equivalent variables (one of the few algorithms in the literature to do so). The algorithms can handle many types of response variables, such as continuous, binary, multiclass, ordinal, (censored) time to event, repeated measurements, percentages etc.
Allows a variety of statistical tests. G*Power can analyze the power of tests based on (1) single-sample tetrachoric correlations, (2) comparisons of dependent correlations, (3) bivariate linear regression, (4) multiple linear regression based on the random predictor model, (5) logistic regression, and (6) Poisson regression. It can also be used to compute effect sizes and to display graphically the results of power analyses.
Fixes the rejection region in multiple hypothesis testing adjustment. Myriads uses a discriminant rule based on the maximum distance between the uniform distribution of p-values and the observed one to set the null for a binomial test. It helps users detect true effects jointly with the reasonable proportion of false discoveries one should assume.
Assists in analysing longitudinal and growth curve data. MASAL is based on a high-dimensional smoothing technique and does not impose functional restrictions on time and covariates a priori. It can be used as a guide before using the other models or as a benchline validation. The software is applicable provided that a within-subject covariance matrix is supplied.
Allows users to specify a broad range of models involving continuous parameters by coding their log posteriors up to a proportion. Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation. This resource provides full Bayesian inference for posterior expectations including parameter estimation and posterior predictive inference by defining appropriate derived quantities of interest.
Assists users in conceptualization, visualization and manipulation of datasets available on the DiscoveryDB database. DiscoverySpace supports all possible data models with only minimal configuration on the part of the database administrator. It aims to expose the content and power of the underlying database while abstracting away its low-level complexity. This tool permits users to traverse multiple biological databases.
Offers a large variety of graphical tools for visual inspection of receiver operating characteristic (ROC) curves, including ROC curves, sensitivity and specificity curves and distribution plots. easyROC is a web-tool that combines several R packages. It delivers basic ROC statistics such as the AUC as well as its standard error, confidence interval and statistical significance.
Allows users to upload their own data and easily create Principal Component Analysis (PCA) plots and heatmaps. Data can be uploaded as a file or by copy-pasting it into the text box. The data format is shown under the "Help" tab. Several R packages are used internally, including shiny, ggplot2, pheatmap, RColorBrewer, FactoMineR, pcaMethods, shinyBS and others.
Counts the exact number of “testable” motif combinations and derives a tighter bound of family-wise error rate (FWER), allowing the calibration of the Bonferroni factor. LAMP is a branch-and-bound algorithm. The software can be used to provide an integrated analysis of heterogeneous biological data. It was applied to human breast cancer transcriptome data and permitted to find statistically significant combinations of up to eight motifs.
Allows users to perform adjustment for confounding (AC) variation and dimension reduction simultaneously. AC-PCA provides a standalone software that can be applied to various genomics data for classifying, for instance, yeast mutants using metabolic foot printing or immune cells using DNA methylome. It was tested on a human brain development exon array dataset, a model organism ENCODE RNA sequencing dataset and simulated data.
Computes the statistics for sparse primate infection data. VacMan is useful to calculate an rms standard deviation, which ranks by merit secondary titrations designed, and the non-infection probability at different viral doses, which permits minimal challenge dose (MCD) estimation. It can serve to test P values, which decide whether a treatment is efficient.
Provides functions for conducting univariate and multivariate meta-analysis using a Structural Equation Modelling (SEM) approach. metaSEM is an R package that implements a two-stage structural equation modeling (TSSEM) approach to conducting fixed- and random-effects meta-analytic structural equation modeling (MASEM) on correlation/covariance matrices. Many of the techniques available in this SEM package can be easily extended to meta-analysis.
Provides a set of functions that attempt to streamline the process for creating predictive models. caret is an R package that contains tools for (i) data splitting, (ii) pre-processing, (iii) feature selection, (iv) model tuning using resampling, and (v) variable importance estimation. The package started off as a way to provide a uniform interface to the functions themselves, as well as a way to standardize common tasks (such as parameter tuning and variable importance).
Provides methods for the descriptive and inferential statistical analysis of directional data. CircStat can be used to explore and summarize important properties of a sample of angular data such as central tendency, spread, symmetry or peakedness. The functions implemented in the software allow users to test the popular question of circular uniformity, while other methods allow them to investigate more specific hypotheses about the mean direction of one or multiple samples.
Allows multivariate data analysis. FactoMineR can take into account different types of variables (quantitative or categorical), different types of structure on the data (a partition on the variables, a hierarchy on the variables, a partition on the individuals) and finally supplementary information (supplementary individuals and variables). It performs classical methods such as Principal Component Analysis (PCA), Correspondence Analysis (CA), Multiple Correspondence Analysis (MCA) as well as more advanced methods.
Performs Bayesian clustering using a Dirichlet process mixture model. PReMiuM allows binary, categorical, count and continuous responses, as well as continuous and discrete covariates. This tool supplies several functions for post-processing of different outputs. Moreover, it helps users determine which covariates actively drive the mixture components.
Executes simple and partial Mantel tests. zt is a command-line software, capable of managing very large matrices, that seeks the correlation between two matrices and eliminates the non-valid ones by controlling the effect of a third one. This tool can be used to determine distance between genetic and environmental subjects.
Evaluates and visualizes the performance of scoring classifiers. ROCR features over 25 performance measures that can be freely combined to create two-dimensional performance curves. It uses standard methods for investigating trade-offs between specific performance measures, including receiver operating characteristic (ROC) graphs, precision/recall plots, lift charts and cost curves. The tool allows for studying the intricacies inherent to many biological datasets and their implications on classifier performance.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command line scripts. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. It also works with next generation sequencing (NGS).
Assists users in fitting Structural Equation Models (SEM). OpenMx is an open source application that can estimate maximum likelihood parameters for models with multivariate outcomes given an observed covariance matrix. It allows users to specify matrix algebra calculations as part of their model. It also allows the definition of boundary constraints with respect to constants and with respect to other parameters.
Obtains robust, reproducible pairs of temporal and spatial components at the individual subject level from concurrent electroencephalographic and functional magnetic resonance imaging data. BICAR is an algorithm that finds biologically relevant paired sources involved in visual processing, motor planning, execution, and attention, which are highly reproducible and present in multiple subjects. The algorithm ranks each joint source by a task-independent measure of reproducibility.
Estimates q-values and posterior error probabilities (PEP) directly from score distributions. Qvality employs a standard bootstrap procedure to estimate the prior probability of a score being from the null distribution. It relies upon non-parametric logistic regression to estimate PEP. The tool is able to estimate both types of scores directly from a null distribution, without requiring the user to calculate p-values.
Predicts the survival of cancer patients from microarray data, and classifies obese and lean individuals from metagenomic data. pensim can be applied for high-dimensional feature selection and prediction of genomic data. The tool contains a function for generating synthetic high-dimensional data with time-to-event or binary outcome, and blocks of predictor variables defined by collinearity and association with outcome, with options for introducing labeling errors and for censoring of survival times.
Computes maximal information-based measures of dependence between two variables in large datasets. minepy reduces the large memory requirement of the original Java implementation, has good upscaling properties and offers a native parallelization for the libraries minerva. Its computing times are about twice those of the Java solution, but the speedup is close to 70 for minerva on 100 cores via MPI on a Linux cluster.
Provides the mine function allowing the computation of Maximal Information-based Nonparametric Exploration (MINE) statistics. Minerva allows native parallelization: based on the R package parallel, the number of cores can be passed as parameter to mine, whenever multi-core hardware is available. The main function mine takes the dataset and the parameter configuration as inputs and returns the four MINE statistics.
Analyzes continuous signal and discrete region tracks from high-throughput genomic experiments. ACT is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It correlates related tracks and analyzes them for saturation. The tool takes less than a minute to generate the plot for up to 30 input files each with a few thousand lines. It provides an option to compute the coverage of a random sample of the input file combinations.
Assists in measuring and visualizing early retrieval performance. CROC is a general-principled approach for calculating any portion of the ROC or AC curve, particularly the early part. It also permits users to amplify events of interest and disambiguate the performance of various classifiers by measuring the relevant aspects of their performance. This method has been demonstrated on a publicly available drug discovery benchmark dataset.
Detects statistical connections between grouping structures and grouping factors or correlates. PERM takes in a collection of grouped data and outputs a P-value. It computes P-values from artificial groups built on the basis of perfect homogeneity within each group. The tool has been designed to tackle problems involving the detection of statistical connections between grouping structure and candidate aggregating variables.
Permits multiple analyses of large-scale genotype datasets. bigstatsr can deal with several types of filebacked big matrices (FBMs) such as: unsigned char, unsigned short, integer and double. It includes statistical tests based on linear and logistic regressions. This tool integrates an iterative principal component analysis (PCA) algorithm in which the matrix has to be sequentially accessed over a hundred times. It is useful for genotyped single nucleotide polymorphisms (SNPs).
Aims to detect and genotype simple and complex genetic variants in an individual or population. VARI is an implementation of a succinct colored de Bruijn graph that significantly reduces the amount of memory required to store and use the colored de Bruijn graph. It can be applied in much larger and more ambitious sequence projects than was previously possible. The tool can consume k-mer counts from either Cortex’s binary files or KMC2.
Estimates the null distribution and the p-value of the log-rank based on a recent reformulation. VALORATE was developed and tested for cancer genomics that is heavily affected by unbalanced survival groups. For a given number of alterations that define the size of survival groups, the estimation involves a weighted sum of distributions that are conditional on a co-occurrence term where mutations and events are both present.
Offers access to more than 40 prevalent feature selection algorithms through an interactive interface. FSelector is a Ruby gem that was developed to support various bioinformatics research, such as text mining, microarray analysis and mass spectra analysis. It also offers several data pre-processing techniques related to feature selection, such as normalization, discretization and missing data imputation.
Allows users to detect clinical disease subtypes. SBC combines an Accelerated Failure Time (AFT) model coupled to a Dirichlet Process Gaussian Mixture Model (DPMM) to gather clinical end point data and heterogeneous omics. It is able to both identify and predict a patient sub-group on testing data as well as predict survival-time. The software was tested on cancer patient data from two different datasets.
Implements a number of efficient statistical methods developed for: (i) estimating subgroup treatment effects and gene–treatment interactions, (ii) exploiting the gene–treatment independence dictated by randomization, and (iii) including the case-only estimator, the maximum estimated likelihood estimator and the semiparametric maximum likelihood estimator for parameters in a logistic model. TwoPhaseInd is an R package computationally scalable to genome-wide studies, as illustrated by an example from the Women’s Health Initiative.
Executes receiver operating characteristics (ROC) curve and precision-recall calculations. Precrec is built on a trapezoidal rule to measure Area Under the Curve (AUC) scores. This software offers some visualization features for calculated curves. It depends on additional adjacent points and enables association with the number of support points for the whole curve and non-linear interpolation.
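The trapezoidal rule mentioned in this entry can be sketched for the ROC case as follows. This is a minimal illustration rather than Precrec's code: it covers only ROC curves with linear interpolation (not the non-linear interpolation used for precision-recall curves), assumes labels are coded 1 for positives and anything else for negatives, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all tied scores before emitting a point, so ties
        # produce a single diagonal segment instead of an arbitrary step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]`, the area is 0.75, which matches the fraction of positive-negative pairs that are ranked correctly (3 of 4).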
Discovers significant combinations of alleles. MP-LAMP is parallelized to decrease time-consuming analysis. It allows users to traverse the search tree collectively without load unbalance. This tool is useful for genome wide association study (GWAS) analysis. It is based on the limitless arity multiple-testing procedure (LAMP) approach that aims to reduce the correction factor by a tighter bound of family-wise error rate (FWER).
Permits P-value combinations by using popular analysis methods. OPATs enables a gene region to be extended upstream and downstream by a prespecified width. It can be used to identify genetic markers and marker sets associated with complex diseases and traits of interest. The tool does not require genotypic and phenotypic data in an analysis. It can be useful for analysis of P-values from different types of molecular markers in an omics study, family- and population-based association studies.
Serves for Bayesian structure learning in undirected graphical models. BDgraph is a program that can deal with Gaussian, non-Gaussian, discrete and mixed datasets. This tool includes various functional modules, including data generation for simulation, several search algorithms, graph estimation routines, a convergence check and a visualization tool. Moreover, this package simplifies the analysis of a pipeline by using three functional modules.
Improves the decomposition and interpretation of functional magnetic resonance imaging (fMRI) data with independent component analysis (ICA). RAICAR is an ICA method based on reproducibility. The software utilizes repeated ICA realizations and relies on the reproducibility between them to rank and select components. It estimates the number of components, provides the order of the components, based on component reproducibility and leads to improved data decomposition by selectively averaging across ICA realizations.
Allows exact p-value calculation for the score test of heritability. RL-SKAT is a computational method that can be used in the case of a single variance component and constant response vector. It speeds up the analysis by orders of magnitude. This software could also be employed to answer several questions, such as (i) estimation of the underlying heritability of a phenotype, (ii) estimation of the uncertainty of such an estimate, (iii) phenotype prediction, and many others.
Contains a set of tools for displaying, analyzing, smoothing and comparing receiver operating characteristic (ROC) curves. pROC proposes multiple statistical tests to compare ROC curves, and in particular partial areas under the curve, which allows proper ROC interpretation. It is based on U-statistics theory and an asymptotic normality method to compare the areas under the curve (AUCs). The tool provides a consistent and user-friendly set of functions for building and plotting a ROC curve, several methods for smoothing the curve, computing the full or partial AUC over any range of specificity or sensitivity, as well as computing and visualizing various confidence intervals.
Detects independence between two random variables especially in non-linear situations. BNNPT is based on a permutation test of the square error (SE) of bagging nearest neighbor estimator. It is able to explore the non-linear relationships between two continuous variables without specific domain knowledge. This tool is efficient in testing nonlinear correlation in real data applications.
Predicts the surgical/pathological stage of the disease in a large cohort of endometrial cancer (EC) patients. RERT was developed to preoperatively identify an advanced surgical FIGO stage. It uses sHE4 and sCA125 biomarkers together with other preoperatively available clinical and pathological variables such as covariates (age, body mass index (BMI), number of children, menopause status, contraception, hormone replacement therapy (HRT), hypertension, grading from biopsy, clinical stage).
Computes sample size, effect size, and power statistics for factorial ANOVA designs. MorePower calculates relational confidence intervals for ANOVA effects, as well as Bayesian posterior probabilities for the null and alternative hypotheses. Its high numerical precision and ability to work with complex ANOVA designs could facilitate researchers’ attention to issues of statistical power, Bayesian analysis, and the use of confidence intervals for data interpretation.
Performs support vector classification, regression and distribution estimation. LIBSVM solves C-SVM classification, nu-SVM classification, one-class-SVM, epsilon-SVM regression, and nu-SVM regression. It also provides an automatic model selection tool for C-SVM classification and supports multi-class classification. A typical use of LIBSVM involves two steps: (i) training on a dataset to obtain a model and (ii) using the model to predict information of a testing dataset.
Permits exploration and creation of hypotheses about recurrent neural network (RNN) hidden state dynamics. LSTMVIS allows the formulation of hypotheses about the semantics of a subset of hidden states by selecting a range of words that may express an interesting property. It can be used with external annotations to verify or reject hypotheses. It allows the development of effective visual encodings and interactions.
Allows users to capture the non-linearity in data and also find the best subset model. Parameter Selection Algorithm captures some of the non-linearities of the data in the model, introduces automatic interpretable interactions and transformations among predictions, and also picks the best model. It can produce an optimal subset of variables, rendering the overall process of model selection more efficient.
Helps with machine learning problems in bioinformatics. feseR is an R package for high-dimensional omics data analysis. For feature selection, this application provides a workflow that combines univariate/multivariate correlation filters with wrapper backward feature elimination. It can be applied in combination to address two different machine learning problems: regression and classification.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans. In plants, however, this phenomenon has not yet been explored. We are planning to discover fusion transcripts/gene fusions in different types of plants by using RNA-Seq datasets. Further, we are planning to understand the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of Next Generation Sequencing (NGS), there is a huge amount of sequencing data available in the public domain. Drawing any novel finding from these available datasets is a major challenge for a computational biologist. We are interested in the analysis of whole genome and transcriptome sequencing data of different plants to extract useful information from those datasets with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Developing databases is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is to develop automated pipelines and tools for the analysis of high throughput genomics data, generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017- present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial Board Member of Journal: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).
Nan Xiao, Genomic Data Scientist
Seven Bridges Genomics (United States)
My research focuses on developing scalable statistical machine learning methods with software to detect key signals and reveal meaningful patterns from high-dimensional data. My Erdős number: 4.
I am an active contributor to the R and Bioconductor community, with 20+ open source R packages and Shiny applications for machine learning, data visualization, and reproducible research.