Classification software tools | Gene expression microarray data analysis
Gene expression profiling based on microarray technology has been widely applied to monitoring global transcriptome changes in biological samples. In cancer research, one of the major microarray applications is to identify genes, or features, whose expression patterns can discriminate between samples in distinct states (usually defined by sample phenotype, such as primary or metastatic tumour).
Supports the functional analysis of gene expression and genomic data. Babelomics offers the possibility to explore the effects of alterations in gene expression levels or changes in gene sequences within a functional context. It provides user-friendly access to a full range of methods that cover: (1) primary data analysis; (2) a variety of tests for different experimental designs; and (3) different enrichment and network analysis algorithms for the interpretation of the results of such tests in the proper functional context.
Outperforms a single decision tree in both training and validation. Decision forest is an ensemble method that combines the predictions from multiple independent decision tree models to reach a better prediction. Compared to a single decision tree, this method yields much higher prediction accuracy in the high-confidence regions. Decision forest generally gives higher positive predictivity than other methods, and even higher positive predictivity within definable high-confidence regions.
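The idea of combining several tree predictions and then restricting attention to a high-confidence region can be sketched as follows. This is a minimal illustration and not the Decision forest software itself: averaging the per-tree probabilities, the function name `forest_predict`, and the `high_conf` cutoff are all assumptions made for the example.

```python
from statistics import mean


def forest_predict(tree_probs, high_conf=0.2):
    """Combine per-tree class-1 probabilities for a single sample.

    tree_probs: one probability per tree (hypothetical inputs).
    high_conf:  assumed cutoff on |p - 0.5| that defines the
                high-confidence region; not taken from the source.
    Returns (label, mean probability, is_high_confidence).
    """
    p = mean(tree_probs)                    # ensemble = average of the trees
    label = 1 if p >= 0.5 else 0
    confident = abs(p - 0.5) >= high_conf   # far from the decision boundary
    return label, p, confident


# Three trees agree strongly: the sample falls in the high-confidence region.
print(forest_predict([0.9, 0.8, 0.85]))
# The trees disagree: a label is still returned, but flagged as low confidence.
print(forest_predict([0.4, 0.6, 0.55]))
```

Reporting the confidence flag alongside the label is what allows accuracy or positive predictivity to be evaluated separately inside and outside the high-confidence region.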
Provides several unique features in a modular and flexible system for the analysis of microarray data. The design and modular conception of CARMAweb allows the use of the different analysis modules either individually or combined into an analytical pipeline. CARMAweb performs (i) data preprocessing (background correction, quality control and normalization), (ii) detection of differentially expressed genes, (iii) cluster analysis, (iv) dimension reduction and (v) visualization, classification, and Gene Ontology-term analysis.
Provides a theoretical analysis of the minimal-redundancy-maximal-relevance condition. mRMR is a framework allowing users to minimize redundancy, and it uses a series of intuitive measures of relevance and redundancy to select promising features for both continuous and discrete data sets. The incremental selection scheme of this method avoids the difficult multivariate density estimation in maximizing dependency. It can also be combined with other feature selectors.
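A rough sketch of incremental selection for discrete data is given below. It assumes the "relevance minus mean redundancy" form of the criterion with mutual information as the measure; the function names and the toy data are hypothetical, and the mRMR software may differ in detail.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick k features: high relevance to target, low redundancy.

    features: dict mapping feature name -> list of discrete values.
    """
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best, best_score = None, float("-inf")
        for f in features:
            if f in selected:
                continue
            # Mean mutual information with the features already chosen.
            redundancy = (
                sum(mutual_information(features[f], features[s]) for s in selected)
                / len(selected)
                if selected
                else 0.0
            )
            score = relevance[f] - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected


target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 0],  # informative
    "g2": [0, 0, 0, 0, 1, 1, 1, 0],  # exact copy of g1, fully redundant
    "g3": [0, 0, 1, 0, 1, 1, 0, 1],  # less informative, but not redundant
}
print(mrmr_select(features, target, 2))  # ['g1', 'g3']
```

Only pairwise quantities are ever computed, which is how an incremental scheme of this kind sidesteps estimating a joint density over all selected features.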
Finds alternative and equivalent solutions for a classification problem by building multiple rule-based classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. The software includes an ad-hoc knowledge repository and a querying tool.
Identifies genes with expression values that exhibit a specific pattern of correlations with the endpoint variables defined by prior biological knowledge. PROMISE is a statistical procedure designed to determine whether a variable exhibits a specific pattern of correlations with a set of other variables. The software performs an integrated analysis of microarray gene expression data with multiple endpoint variables. It offers an alternative approach for evaluating genomically-based pleiotropic phenotypes.
Allows multiple-microarray investigation and integrative information discovery. geneCBR aims to assist researchers in the diagnosis of cancer. It addresses the problem of systematically classifying tumor types. This tool can be used in the context of interdisciplinary working groups to classify and cluster cancers. Its goal is to ease the interpretation of predictions in concert with incorporated knowledge.
Permits users to develop statistical models from large-scale datasets. GALGO is built on a genetic algorithm (GA) variable selection strategy, which it uses to select models according to a fitness value. This tool furnishes functions for analyzing the populations of selected models and features in order to reconstruct and determine representative summary models. It is suited to developing multivariate statistical models using multivariate variable selection.
A package that can be used to quickly identify and assess TSP (Top Scoring Pairs) classifiers for gene expression data. Tspair can rapidly calculate the TSP for typical gene expression datasets, with tens of thousands of genes. The TSP can be calculated either in R or with an external C function, which allows for both rapid calculation and flexible development of the tspair package. It includes functions for calculating the statistical significance of a TSP by permutation test, and is fully compatible with Bioconductor expression sets.
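For readers unfamiliar with top scoring pairs, the underlying score can be sketched in a few lines. This is an illustration of the general TSP idea and not the tspair API: the function name `top_scoring_pair`, the dictionary input format, and the toy data are assumptions. A pair is scored by how differently the ordering of its two genes behaves in the two classes.

```python
from itertools import combinations


def top_scoring_pair(expr, labels):
    """Return the gene pair whose relative ordering best separates two classes.

    expr:   dict mapping gene name -> list of expression values (one per sample).
    labels: list of 0/1 class labels, one per sample.
    """
    def frac_less(a, b, cls):
        # Fraction of samples in class `cls` where gene a is below gene b.
        idx = [i for i, y in enumerate(labels) if y == cls]
        return sum(a[i] < b[i] for i in idx) / len(idx)

    best_pair, best_score = None, -1.0
    for g1, g2 in combinations(expr, 2):
        score = abs(
            frac_less(expr[g1], expr[g2], 0) - frac_less(expr[g1], expr[g2], 1)
        )
        if score > best_score:
            best_pair, best_score = (g1, g2), score
    return best_pair, best_score


labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1, 2, 1, 5, 6, 7],
    "geneB": [3, 4, 5, 2, 1, 3],
    "geneC": [2, 1, 2, 4, 3, 2],
}
print(top_scoring_pair(expr, labels))  # (('geneA', 'geneB'), 1.0)
```

Because the loop visits every pair, the cost grows quadratically with the number of genes, which is the usual motivation for moving the inner computation to compiled code.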
Provides functions to perform fast variable selection based on the Wilcoxon rank sum test in the cross-validation or Monte-Carlo cross-validation settings, for use in microarray-based binary classification. WilcoxCV is based on a simple mathematical formula using only the ranks calculated from the original data set. It reduces computation time dramatically (by up to a factor of 50) compared to the standard approach.
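As background, the Wilcoxon rank sum statistic for a single gene can be computed from ranks as sketched below. This shows only the plain statistic that a standard approach would recompute for every training set; it does not reproduce the shortcut formula used by WilcoxCV, and the function names are hypothetical.

```python
def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum(values, labels, positive=1):
    """Wilcoxon rank sum: total rank of the samples in the positive class."""
    r = midranks(values)
    return sum(ri for ri, y in zip(r, labels) if y == positive)


# One gene measured on five samples; the two class-1 samples rank 5th and 4th.
print(rank_sum([1.2, 3.4, 2.2, 5.0, 4.1], [0, 0, 0, 1, 1]))  # 9.0
```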
An R package for kTSP (k–Top Scoring Pairs). SwitchBox selects the gene pairs for the kTSP decision rule. The package has a method for calculating sample-specific scores based on the pairs, which can be extended beyond classification to class discovery problems. For computational efficiency and speed, switchBox calculates the score between all feature pairs using C routines.
Provides a classification model that integrates external relational information. GEDFN is a deep feedforward network classifier that embeds feature graph information. It achieves sparsely connected neural networks by constraining connections between the input layer and the first hidden layer according to the feature graph. This model can be used both for classification and for selection of biologically relevant features.
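The graph constraint on the first layer can be illustrated with a small forward pass. This sketch assumes that the first hidden layer has one unit per input feature, that a weight is kept only where the feature graph has an edge (plus a self-connection), and that the activation is ReLU; these choices and the name `masked_layer` are assumptions for illustration, not the GEDFN implementation.

```python
def masked_layer(x, weights, adjacency, bias):
    """Forward pass of a graph-constrained first hidden layer.

    x:         list of n input feature values.
    weights:   n x n weight matrix (list of lists).
    adjacency: n x n 0/1 feature graph; weights[i][j] is used only if
               adjacency[i][j] is set or i == j (assumed self-connection).
    bias:      list of n biases.
    """
    n = len(x)
    out = []
    for j in range(n):
        z = bias[j] + sum(
            x[i] * weights[i][j]
            for i in range(n)
            if i == j or adjacency[i][j]
        )
        out.append(max(0.0, z))  # ReLU activation (assumed)
    return out


# Features 0 and 1 are linked in the graph; feature 2 is isolated.
adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
weights = [[1.0] * 3 for _ in range(3)]
print(masked_layer([1.0, 2.0, 4.0], weights, adjacency, [0.0] * 3))  # [3.0, 3.0, 4.0]
```

With a dense layer and the same weights, every unit would output 7.0; the mask removes the connections that the feature graph does not support.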
Integrates different multiscale algorithms for binarization and for trinarization of one-dimensional data with methods for quality assessment and visualization of the results. By identifying measurements that show large variations over different time points or conditions, this quality assessment can determine candidates that are related to the specific experimental setting.
Performs normalization and feature selection, and builds classification models. DaMiRseq uses a guided decision-making process to assist the user in selecting the best putative predictors for classification. The software permits users to identify transcriptional biomarkers. It provides functions to filter genomic features and samples for cleaning up data, and to identify and remove unwanted sources of variation for adjusting data.
Allows individualized pathway-based classification. PROPS creates individualized features that reflect pathway activity using Gaussian Bayesian networks. It can calculate the log likelihood of each patient’s data, which can be interpreted as a measure of pathway perturbation and dysregulation. This tool takes into account pathway topology, rather than treating pathways as gene sets. It contributes biological insight into differentiating Crohn’s disease (CD) and ulcerative colitis (UC).
Allows users to determine significant separations in reduced dimensionality data. ClusterSignificance comprises three stages: (1) principal curve projection, (2) separation classification, and (3) score permutation. It attempts to make faithful representations in low dimensional space in order to have a minimal effect on false positives for the significance testing procedure.
Provides a computational method for discretization of gene expression data. RefBool uses a user-defined gene expression library as a reference for defining each gene's state. Each query can be assigned to one of three states: active, inactive, or intermediately expressed, based on user-defined significance thresholds. Measurements are then associated with p- and q-values indicating the significance of each classification.
Enables users to calculate multiple performance measures and explore the stability of feature selection. ClassifyR provides an implementation of a general framework for classification. It also includes a comprehensive set of performance measures to ease post-processing. This method has four stages: data transformation, feature selection, classifier training, and prediction. It is easily extensible.
Assists users in assembling and annotating the whole transcriptome of non-model organisms from RNA-Seq data. CAARS uses sequences from one or several related species to guide transcript assembly and annotation. It implements a trans-species assembly, based on one or several guide sequences, which can be distantly related. Furthermore, it can annotate the transcripts by integrating them in phylogenies built with a set of helper sequences.
Determines the most important and most informative feature values in a Random Forest (RF) model. COMPACT+FV is an algorithm that iterates over every Random Tree (RT) in the RF and, for every feature, extracts every IF-THEN rule (if any) containing the positive value of that feature and uses that rule’s statistics to measure the feature’s importance.