GOBLET / Global Organisation for Bioinformatics Learning, Education & Training
Provides a global, sustainable support and networking structure for bioinformatics educators/trainers and students/trainees. GOBLET is organized by tags, simplifying the users’ navigation and allowing them to find trainers with appropriate expertise. It contains more than 80 training materials and courses that can be attributed to users.
Consists of an open source community-driven platform for dissemination of life science events, such as courses, conferences and workshops. iAnn can be used for entering and annotating announcement information. It offers a collection of services and widgets that can be embedded in external websites, and display pre-programmed filtered lists of announcements.
PBLR / Population based Bounded Low-Rank
Enables single-cell RNA sequencing (scRNA-seq) data imputation. PBLR is a cell sub-population based bounded low-rank method that can (1) recover transcriptomic level and dynamics masked by dropouts, (2) improve low-dimensional representation, and (3) restore the gene-gene co-expression relationship. The software also automatically detects cell subpopulations. It has few parameters, making it generally applicable to data from diverse labs or techniques.
gkm-DNN / gapped k-mer Deep Neural Network
Assists users in training neural networks. gkm-DNN is a software designed for classification. It consists of calculating gapped k-mer frequency vector (gkm-fv) and training the neural networks. This method can also be easily extended to other problems such as regression and ranking. It provides five classes to implement the gapped k-mer deep neural network: MLPBuilder, PresaveData, Train, PredictDataset and PredictCSV.
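The gapped k-mer frequency vector (gkm-fv) step can be illustrated with a minimal sketch. It assumes the usual definition of a gapped k-mer (a window of length l in which k positions are kept and the rest are masked); the function names and the "-" gap symbol are hypothetical and are not taken from the gkm-DNN code base.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=2):
    """Count gapped k-mers: windows of length l with k informative positions."""
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        # Every choice of k kept positions yields one gapped pattern.
        for kept in combinations(range(l), k):
            pattern = "".join(
                window[i] if i in kept else "-" for i in range(l)
            )
            counts[pattern] += 1
    return counts


def frequency_vector(counts):
    """Normalise raw counts so that the entries sum to 1."""
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in sorted(counts.items())}


counts = gapped_kmer_counts("ACGTA", l=3, k=2)
gkm_fv = frequency_vector(counts)
```

A vector of this kind would then serve as the input features of the neural network; how gkm-DNN orders or filters the features is not stated here.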
Serves for the characterization of long noncoding RNA molecules (lncRNA) from raw transcriptome sequencing data. LncPipe is a Nextflow-based computational pipeline for systematic identification and analysis of lncRNAs from RNA-seq data. This software contains four main analysis procedures: read alignment, lncRNA identification, sequence assembly and differential expression analysis. It provides various features such as pipeline cancellation, parameter resetting and analysis resuming from any continuous checkpoint.
SIMD / Statistical Inferences with MeDIP-seq Data
Estimates the methylation level for a single CpG site. SIMD is an R package providing an inferential analysis that identifies differentially expressed CpG sites in methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) data, using a statistical framework and an expectation–maximization (EM) algorithm. The software considers the possible structure whereby immunoprecipitated short reads are mapped to the methylated CpG sites.
HMpre / Human mRNA N6-Methylation predictor
Assists users with class imbalance in the training data of human mRNA prediction. HMpre is a computational method in which m6A samples and non-m6A samples are labeled as positive and negative respectively. The classifier is then trained with all the samples without selecting a subset of negative samples and prevents over-fitting by defining different costs for the misclassified positive and negative samples.
Identifies N6-methyladenosine (m6A) sites. RAM-NPPS is a sequence-based predictor that detects potential m6A sites within uncharacterized RNA sequences. The software uses the single RNA sequence local information of multi-interval nucleotide pairs, as well as the global position information, specially the global information of discrepancy between positive samples and negative samples. It provides m6A site identification specific for three species: Saccharomyces cerevisiae, Homo sapiens, and Arabidopsis thaliana.
Predicts N6-methyladenine (m6A) sites in A. thaliana. RFAthM6A exploits the machine learning method, random forest, and four different types of features to establish four reliable prediction models, which are named RFPSNSP, RFPSDSP, RFKSNPF and RFKNF. The linear combination of the prediction scores of RFPSDSP, RFKSNPF and RFKNF can improve the prediction performance. The software was applied to the A. thaliana gene AT1G36370.1.
CHOISS / Choosing Optimal-Interval SNP Set
Permits users to find a set of single nucleotide polymorphism (SNP) markers according to the interval regularity. CHOISS gathers algorithms for selecting markers at a chosen density, given the desired number of SNPs or interval. It offers features such as graphical visualization of the results and an automatic construction of input file from a text form of NCBI GenBank contig data.
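Selecting markers at a chosen interval can be sketched as follows. This is a simple greedy illustration under stated assumptions, not one of the algorithms shipped with CHOISS: the function name, the example positions and the "nearest marker to each target position" rule are all hypothetical.

```python
def select_by_interval(positions, interval):
    """Pick, for each regularly spaced target, the nearest available SNP."""
    ordered = sorted(positions)
    selected = []
    target = ordered[0]
    while target <= ordered[-1]:
        # Nearest marker to the ideal position (ties go to the lower position).
        nearest = min(ordered, key=lambda p: abs(p - target))
        if nearest not in selected:
            selected.append(nearest)
        target += interval
    return selected


# Hypothetical SNP coordinates (bp) on one contig.
snp_positions = [100, 480, 950, 1020, 1600, 2100]
chosen = select_by_interval(snp_positions, interval=500)
```

With these positions, the targets are 100, 600, 1100, 1600 and 2100 bp, so the sketch keeps one marker near each target and drops the redundant marker at 950 bp.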
MLR-tagging / Multiple Linear Regression tagging
Enables informative single nucleotide polymorphism (SNP) tag selection and genotype prediction. MLR-tagging is a program assisting users to solve the informative SNP selection problem (ISSP) on genotypes. It implements a SNP prediction method based on multiple linear regression analysis. This tool is designed to directly predict genotypes without the explicit requirement of haplotypes.
Provides an application for tag single-nucleotide polymorphism (SNP) selection. WCLUSTAG allows users to prioritize different SNPs and genomic regions in a systematic association screen, depending on current genomic and disease data budget. This method takes account of functional as well as linkage disequilibrium (LD) information. An online web interface also permits users to import their own genotype data, or to directly retrieve HapMap data from the mirror database, for the calculations.
R-SVM / Recursive Support Vector Machine
Selects important genes/biomarkers for the classification of noisy data. R-SVM consists of a method that aims to analyze high-throughput proteomics and microarray data and recover informative features. It uses the class means to represent the samples for feature selection. It also includes features to avoid taking the outlier samples as support vectors (SVs).
GA/KNN / Genetic Algorithm and k-Nearest Neighbor
Enables multi-dimensional classification. GA/KNN chooses a small fraction of genes that jointly discriminate between different classes of samples and evaluates the relative predictive importance of all genes for sample classification. It is based on a stochastic search strategy and a multivariate approach in which sample heterogeneity is accommodated. This tool simplifies subclass discovery.
Serves for class prediction using microarray data. Prophet is a web application that aims to fulfill the demand of a tool for prediction purposes in the microarray context. This program contains two main functions: train and predict. It builds a prediction rule based on genes. There are several options for defining the dataset of genes to be used for training the predictors. It accepts user-defined selections of genes and can find the optimal subset within the whole set of genes.
Permits users to develop statistical models from large-scale datasets. GALGO is built on a genetic algorithm (GA) variable selection strategy and uses this procedure to select models according to a fitness value. This tool furnishes functions for the analysis of the populations of selected models and features to reconstruct and determine representative summary models. It is suited to developing multivariate statistical models using multivariate variable selection.
HSRA / Hadoop Spliced Read Aligner
Enables scalable mapping of very large RNA-seq datasets on distributed memory systems. HSRA is a spliced read aligner that assists researchers in performing their RNA-seq analyses using Big Data platforms and frameworks. It allows biologists to distribute their mapping tasks over the nodes of a computing cluster or cloud platform by combining a multithreaded spliced aligner. It supports single- and paired-end read alignments and is capable of processing compressed input datasets.
parSRA / Parallel execution of Short Read Aligners
Simplifies the execution of existing short read aligners (SRA) on distributed-memory systems. parSRA is a program that can parallelize SRA on multicore clusters and can work with different underlying aligners. Moreover, it uses the following techniques to improve scalability: 1) a splitting of the input reads using the FUSE kernel module available in most current Linux distributions; 2) a balanced on-demand distribution of the reads based on the shared locks of UPC++ for parallel computing that follows the Partitioned Global Address Space (PGAS) paradigm.
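The idea of a balanced, on-demand distribution of reads guarded by a shared lock can be sketched in a few lines. This is only an analogy: Python threads and a `threading.Lock` stand in for UPC++ processes and shared locks, and the function name, chunk size and read labels are hypothetical.

```python
import threading


def process_on_demand(reads, n_workers=4, chunk_size=2):
    """Each worker repeatedly claims the next unprocessed chunk of reads."""
    lock = threading.Lock()
    next_index = 0
    claimed = [[] for _ in range(n_workers)]

    def worker(worker_id):
        nonlocal next_index
        while True:
            # The lock makes "read and advance the shared offset" atomic.
            with lock:
                start = next_index
                next_index += chunk_size
            if start >= len(reads):
                return
            # Stand-in for aligning the chunk with the underlying aligner.
            claimed[worker_id].extend(reads[start:start + chunk_size])

    threads = [
        threading.Thread(target=worker, args=(w,)) for w in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


reads = [f"read_{i}" for i in range(11)]
claimed = process_on_demand(reads)
all_claimed = [r for chunk in claimed for r in chunk]
```

Because workers fetch work only when they are idle, faster workers naturally take more chunks, which is what keeps the load balanced.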
Detects piwi-interacting RNAs (piRNAs) based on a convolutional neural network (CNN). piRNN is a deep learning program suited for detection within four organisms including Drosophila melanogaster and Rattus norvegicus. The application was developed to perform its analysis without requiring additional genomic or epigenomic data, and also provides the training protocol both to train new models for new species and to retrain the existing ones.
Aims to annotate circular RNAs (circRNAs) and determine the specificity of circRNA primers. CircPrimer is a program that offers two main functions: search of circRNAs and annotation of circRNAs. It generates a template for designing divergent primers or primers with one primer spanning the spliced junction. Moreover, users can extract the spliced sequences and genomic sequences of any circRNA, including novel circRNAs.
DECIDE / Decision-aid & E-Counselling for Inherited Disorder Evaluation
Allows creation of clinical decision aids. DECIDE facilitates supportive functions of a genetic counsellor through patient-centred education and enhanced decision-making. It enables users to work individually, as a couple or family, or with a relative or friend, allowing patients to make decisions in ways that they find most comfortable and supportive. This tool provides an opportunity to make a decision about pharmacogenomics incidental findings or carrier status for recessive diseases.
PanACEA / Pan-genome Atlas with Chromosome Explorer and Analyzer
Permits visualization of pan-chromosome data. PanACEA leverages bacterial genomic data for the analysis of pan-genomes in the context of a consensus pan-chromosome. The software consists of multi-tiered views with circular representations of chromosome(s)/plasmid(s) containing selectable and user-configurable colored functional gene annotations/ontologies. It also contains zoomed-in linear illustrations of the per-genome flexible genomic island (fGI) content in the flexible genomic regions (fGRs) located throughout the pan-chromosomes.
BAMSI / BAM Search Infrastructure
Allows users to manage large genomic data sets. BAMSI is an application programming interface (API) that exploits cloud services to perform its analysis. This program enables the formulation of customized queries and includes a monitoring service to screen the system status as well as the progress of tasks. It can be used to filter information from the 1000 Genomes platform, with a focus on the extraction of small subsets of data.
Supports automated pipeline optimization. ToTem enables the integration of a variety of tools and pipelines and their automatic optimization based on benchmarking results controlled through efficient analysis management. Users can start an analysis from any level of the process with the possibility of adding almost any tool or code. This tool assures the reproducibility of the pipeline parameters via cross validation techniques that penalize the final precision, recall and F-measure.
A robust rank-based classifier for gene expression classification. AUCTSP works according to the same principle as top scoring pair (TSP) but differs from the latter in that the probabilities that determine the top scoring pair are computed based on the relative rankings of the two marker genes across all subjects as opposed to for each individual subject.
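For orientation, the classic top scoring pair (TSP) principle that AUCTSP builds on can be sketched as below. This is the standard per-subject version (comparing the two genes within each sample), not the AUC-based variant described above; the function names and the toy expression values are hypothetical.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Return (score, i, j, class0_has_i_below_j) for the best gene pair."""
    n_genes = len(samples[0])
    best = None
    for i, j in combinations(range(n_genes), 2):
        freq = {}
        for cls in (0, 1):
            rows = [s for s, y in zip(samples, labels) if y == cls]
            # Fraction of samples in this class where gene i < gene j.
            freq[cls] = sum(s[i] < s[j] for s in rows) / len(rows)
        score = abs(freq[0] - freq[1])
        if best is None or score > best[0]:
            best = (score, i, j, freq[0] > freq[1])
    return best


def predict(sample, i, j, class0_has_i_below_j):
    """Classify a new sample from the ordering of the selected pair."""
    below = sample[i] < sample[j]
    return 0 if below == class0_has_i_below_j else 1


# Toy data: rows are subjects, columns are genes.
samples = [[1, 5, 3], [2, 6, 1], [7, 2, 4], [8, 1, 5]]
labels = [0, 0, 1, 1]
score, gene_i, gene_j, direction = top_scoring_pair(samples, labels)
predicted = predict([3, 9, 0], gene_i, gene_j, direction)
```

Because only the relative ordering of two genes is used, the rule is insensitive to monotone normalisation of the expression values.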
Extracts subsequences from GenBank annotations. AnnotationBustR (i) reads GenBank annotations in R, (ii) pulls out the gene(s) of interest given a set of search terms and a vector of taxon accession numbers supplied by the user, and (iii) writes the sequence for the gene(s) of interest to FASTA formatted files for each locus. It enables users to extract subsequences from concatenated sequences, plastid, and mitochondrial genomes where gene names for subsequences may vary substantially.
Permits high-throughput classification of European, African, and Native American mitochondrial haplogroup lineages. Hi-MC uses a defined panel of mitochondrial single nucleotide polymorphisms (SNPs) for classification of mitochondrial haplogroups eliminating the need for time-consuming SNP selection. Moreover, it offers an algorithm that leverages a validated human mitochondrial DNA (mtDNA) SNP panel for mitochondrial haplogroup classification and is particularly valuable for studies in which sequencing is not feasible.
HmmUFOtu / Hidden Markov Model [HMM]-based Ultra-Fast OTU tool
Performs assignment for bacterial 16S and amplicon sequencing. HmmUFOtu is a pipeline that aims to assist users in determining microbial community composition and diversity. It classifies every read from submitted sequences within a known reference tree, then performs phylogeny-based operational taxonomic unit (OTU) clustering and produces an assignment for each read. This application supports a wide range of DNA substitution models, including GTR and HKY85.
dNBFA / dependent Negative Binomial Factor Analysis
Analyzes RNA-seq count data. dNBFA is built on a Bayesian covariate-dependent negative binomial factor analysis. This software does not need any ad-hoc data normalization, data preprocessing or co-expression network construction steps. It provides closed-form Gibbs sampling update equations by taking advantage of data augmentation techniques. It can be used to derive meaningful functional modules and is appropriate for two-stage co-expression network-based methods.
Permits users to analyze large genomic sequences. QUAST-LG is a module of QUAST that can evaluate mammalian-size assemblies. It includes quality metrics that take into account specific features of the eukaryotic genomes, such as the abundance of transposons. This tool enables assessment for novel species by incorporating reference-free tools as a part of its pipeline.
Evaluates high-throughput sequencing data and reconstructs viral quasispecies. TenSQR leans on a tensor factorization framework that clusters with successive data removal to infer strains in a quasispecies from the most to the least abundant one. This process guarantees reliable discovery and accurate reconstruction of rare strains existing in highly imbalanced populations and enhances the detection of deletions in such strains.
tusv / Tree Unmixing with Structural Variants
Performs automated joint deconvolution and phylogeny inference of tumor genomic data. tusv simultaneously deconvolves inferences of structural variations (SVs), derived from the Weaver variant caller, and reconstructs the likely evolution of clonal populations via these SV events. The software is able to reconstruct clonal populations and phylogenetic histories from simulated tumor data, and is also effective on real data supportive of a range of tree topologies and complexities.
Serves for mutation-based clustering of tumors. ccpwModel uses binary variables, indicating the status of cancer driver genes in tumor samples and the involvement of those genes in a dozen core cancer pathways (CCPWs), as features in Ward’s hierarchical clustering. It assists users in detecting a lethal LIHC subtype in which tumor cells are characterized by mutation-disturbed RAS/PI3K pathways.
Clusters tumors based on the somatic mutation spectra of the putative cancer driver genes. xGeneModel is a method in which the functional similarities of the putative cancer driver genes and their confidence scores are integrated with the mutation events to calculate the genetic distance between tumors. The clustering process is transparent since the distance of two tumors is calculated from the genotypes of a few pairs of cancer genes.
Exploits independent genome-wide association studies (GWAS) in multiple traits to identify associated variants. CONFIT is an application that uses summary statistics to assess the degree of shared effects between traits. This program can be run on a collection of GWAS on different traits and can be used to highlight unique loci in a given dataset. It was tested on both the North Finland Birth Cohort (NFBC) and the UK Biobank (UKBB) datasets.
TSNet / Tumor Specific Net
Permits the construction of tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet allows users to build tumor-cell specific networks by adequately accounting for tumor purity heterogeneity (TPH) in network inference. This tool is based on Gaussian graphical models (GGMs), which have been extensively utilized for network inference in many studies.
Provides a pipeline for executing chromatin profiling assays. ATAC2GRN gathers optimized ATAC-seq and DNase1-seq pipelines for accurate genome regulatory network (GRN) inference. This software helps maximize ChIP recovery for transcription factor occupancy assessment. The project is composed of three main parts: one to generate figures, one in both bash and Snakemake for the pipelines, and one to estimate pipeline recapitulation of ChIP-seq.
PRICE / Probabilistic inference of codon activities by an EM algorithm
Allows users to detect open reading frames (ORFs) using ribosome profiling experiments embedded in a pipeline for data analysis. PRICE includes a statistical model of the ribo-seq experiments and interprets actively translated codons using maximum likelihood for all reads. This pipeline implements all steps necessary to identify and score codons and ORFs from ribo-seq experiments. It also includes modules to pre-process and map sequencing reads.
Predicts representative structure of a set of sub-optimal structures from homologous RNA sequences. aliFreeFold is an alignment-free method that estimates a representative structure from a set of homologous RNA sequences by using suboptimal secondary structures generated for each sequence. This tool uses an n-motif representation for each sub-optimal structure of an RNA sequence. It also computes a vector representation of structures, which helps to extract representative structures containing conserved structural features.
RNAspa / RNA Shortest Path Approach
Aims to discover the secondary structure of a set of ncRNA molecules. RNAspa is a comparative structure program for a set of RNA sequences. This software is designed to withstand the effects of contaminated data. RNAspa has been validated on three different datasets that contain various ncRNA families taken from the Rfam database.
RNAcast / RNA consensus abstract shapes technique
Calculates near-optimal abstract shape space and predicts an abstract shape common to all sequences to obtain a consensus. RNAcast aims to offer for each sequence the thermodynamically best structure that has this common shape. This software’s approach is essentially linear in the number of sequences since the shape space is smaller than the structure space and identification of common shapes can be performed in linear time.
Guides global alignment via a non-progressive local approach. Align-m does not generate a single multiple alignment but produces a set of largely consistent pairwise alignments that can eventually be converted into a multiple alignment by aligning them to a reference. It proceeds by aligning sequences through three steps: (1) computation of a set of high-scoring local multiple alignments; (2) combination of these data with substitution matrix scores; and (3) use of the alignments that are consistent with each other in the final alignment.
Applies one-dimensional convolutional neural networks (CNNs) to classify pairwise alignments of ncRNA sequences. CNNclust is based on the combination of two types of distributed representation with secondary-structure information specific to ncRNAs and with mapping profiles of next-generation sequence reads. This software can be used for the clustering of novel and unannotated regions by employing other tools to determine expressed block regions in a reference genome.
Serves as a library to train distributed representations of variable-length k-mers. dna2vec’s embeddings can describe variable-length k-mers in a consistent fashion via nucleotide concatenation experiments. The k-mers found are consistent across different lengths. This software can display nearest-neighbor alignment similarity and the arithmetic of this tool’s vectors is comparable to nucleotide concatenation.
IDEA / Interactive Differential Expression Analyzer
Serves for the identification of differentially expressed genes by providing an interactive interface. IDEA assists with the visualization of results via different charts and tables, and users can interact during the analysis process. The interactive features of this tool include dynamic data operation, parameter adjustment and real-time plot rendering. It is available as a web application and as a desktop version for heavy use.
MOMDR / MultiObjective Multifactor Dimensionality Reduction
Offers a method to detect potential single nucleotide polymorphism (SNP)-SNP interactions (SSIs). MOMDR is an approach based on an extension of the multifactor dimensionality reduction (MDR) with no need of a specific mode of inheritance. This program is suited for managing unbalanced data sets and can be applied to depict locus genotype combinations that are linked with high and low-risk disease groups.
Identifies non-tandem repeats and enables analysis of mitochondrial sequences from plant species. ROUSFinder uses BLAST to detect non-tandem repeats within mitochondrial genomes. The software automates the task of identifying repeats in both direct and inverted orientations and removes the full-length match. It provides an output that can be used for annotation in spreadsheets for further analysis.
NBS2 / Network-Based Supervised Stratification
Enables cancer subtype classification. NBS2 is an application based on the supervised random walk algorithm with an additional loss function that determines the direction of network propagation. The software was developed to learn how to adjust the weight of each molecular interaction and can assist users in identifying the underlying biological pathways. It was tested on both glioblastoma and breast tumors.
Provides an interactive web application for high-throughput biological data modeling and visualization. HTPmod includes modules for plant growth modeling based on time-series data (growMod), prediction of important traits based on high-dimensional data (predMod) and visualization of high-throughput data (htpdVis). This platform is suited to modeling and visualizing large-scale, high-dimensional data sets in a broad context.
Offers a method to design benchmark experiments. SummarizedBenchmark is an R package to construct benchmark comparisons with common metrics or customized functions that also aims to simplify reproduction and replication of various benchmark studies. It was tested on benchmarks dealing with various experiments, including differential expression (DE) testing and single-cell RNA-seq simulation.
NG-TAS / Next Generation-Targeted Amplicon Sequencing
Assists in profiling circulating tumor DNA (ctDNA). NG-TAS features several functions: (1) optimization for low input ctDNA; (2) a high level of multiplexing for the analysis of multiple gene targets; (3) a computational pipeline for data analysis; and (4) competitive costing. This tool can be adapted to different cancer types and clinical contexts and is designed to be flexible in the choice of gene targets and regions of interest.
MIPUP / MInimum Perfect Unmixed Phylogenies
Investigates tumor evolution. MIPUP is a multi-sample method intending to reconstruct ancestral relation between clones and samples. This program proposes an integer linear programming (ILP) formulation that uses a relation between perfect phylogenies and branchings in a directed acyclic graph. It also includes features allowing users to report the totality of the optimal solutions or a defined number of these.
Serves as a calculation and visualization tool for high-throughput CRISPR genome-editing data analysis. CRISPRMatch integrates analysis steps such as mapping reads, measuring mutation frequency (deletion and insertion), evaluating the accuracy and efficiency of genome-editing systems, and outputting tables and figures. This software is suited to genome-editing data from CRISPR nuclease transformed protoplasts, which can be used to assess the targeted mutation efficiency of DNA endonucleases and regions of guide RNAs.
Encrypts, in a secure way, genomic data in several formats, and also compacts data in Fasta/Fastq formats. Cryfa is a secure encryption tool that follows industry recommendations for upholding security of in-transit and at-rest genomic data. The software performs a block transformation, followed by shuffling the transformed information and ultimately, a fast authenticated encryption on the shuffled content. The encryption is performed with the advanced encryption standard (AES).
Generates clustering trees and provides a visualization feature for interrogating clustering as resolution increases. clustree allows users to build a layout with only a subset of edges and offers options for controlling the aesthetics of nodes based on the attributes of samples in the clusters represented. This software aims to demonstrate the relationships between clusters at multiple resolutions.
Enables multi-species orthology inference. SonicParanoid is a program that detects orthologous relationships among multiple species. The software treats groups as elements of numeric sets. It permits addition and deletion of proteomes by reusing the results from previous runs, which is beneficial to users who need to maintain their own orthology databases. This tool can contribute to annotate new genomes and find target genes in medical and biotechnological applications.
DARRC / Dynamic Alignment-Free and Reference-Free Read Compression
Serves as a dynamic alignment-free and reference-free read compression tool. DARRC can incrementally update compressed archives with new genome sequences without the need to execute a full decompression of the archives. This tool focuses on the issue of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. It can compress both single-end and paired-end read sequences of any length.
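As background, a plain de Bruijn graph over a set of reads can be built as in the minimal sketch below. It illustrates only the underlying data structure (nodes are (k-1)-mers, edges are k-mers); the guiding information and the compression scheme used by DARRC are not modeled, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # A k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGT", "CGTA"], k=3)
edges = sorted((src, dst) for src, dsts in graph.items() for dst in dsts)
```

Overlapping reads share k-mers, so the k-mer CGT found in both reads above is stored only once; this redundancy removal is what makes graph-based encodings attractive for collections of similar genomes.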
SOBDetector / Strand Orientation Bias Detector
Offers a method for formalin-fixed paraffin-embedded (FFPE)-introduced artifacts removal from a set of targeted variants. SOBDetector is an application based on the combined consideration of an original binary alignment file and a set of mutations to highlight strand orientation bias. This program can be applied to paired-end single-sample variation calling pipelines for multiple species.
ERNIE / Enhanced Research Network Informatics Environment
Consists of a data platform with associated analytical workflow. ERNIE is a cloud-based platform that leverages modern data science and open source software to aggregate data at scale for discovery. It is composed of a database and associated methods for generating and analyzing networks constructed from its data. The platform contains FDA records, clinical trials and clinical guidelines data, bibliographic and patent data, and funding records from the National Institutes of Health.
deGSM / de Bruijn Graph Constructor in Scalable Memory
Serves for constructing very large de Bruijn graphs. deGSM is a program able to build the Burrows-Wheeler transformation (BWT) of the unitigs of a graph and recover these unitigs from the constructed BWT string. This program can be used to manage very large genome sequences, such as the contigs and scaffolds recorded in the GenBank database as well as the Picea abies high-throughput sequencing (HTS) dataset.
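The property that a sequence can be recovered from its BWT string can be shown with a naive textbook sketch. It sorts all rotations explicitly, which is far too slow for genome-scale data and is not the construction deGSM uses; the function names and the "$" end marker are conventional choices made here for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row ending with the end marker is the original text.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("banana")
recovered = inverse_bwt(bwt("ACGTTGCA"))
```

Production tools replace the rotation sort with suffix-array or external-memory methods, but the input/output relationship is the same.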
CharGer / Characterization of Germline variants
Automates the determination of the pathogenicity of germline variants. CharGer is based on the original American College of Medical Genetics and Genomics (ACMG) scoring system. It implements 16 custom modules and a user-adjustable scoring system. This tool attributes custom scores for modules and classification thresholds. It enables researchers to prioritize variants for investigation.
Allows users to write pipelines in the Go programming language. SciPipe is a workflow library based on flow-based programming principles, implemented as a library in the Go programming language. The software leverages the dataflow paradigm to achieve dynamic scheduling of tasks based on input data. It provides features such as provenance tracking, ability to run push-based workflows up to a specific stage of the workflow, or flexible support for file naming of intermediate data files generated by workflows.
Deduces the "lab-of-origin" of DNA sequence datasets. predict-lab-origin is an application intended to identify signatures that characterize the processing of engineered DNA and to extract them in order to attribute a specific dataset to a given lab. This method relies on the training of a deep convolutional neural network (CNN) to perform its classification and does not need hand-selection of features or sequence-function information.
Estimates pairwise distances between input genomes. BinHash is a program that contains the latest improvements for MinHash in data-mining. It reduces each genome into a highly compressed form by using b-bit one-permutation rolling MinHash with optimal densification. Moreover, it can report all statistics reported by Mash and includes the option to report, for each query, only its k closest targets in terms of mutation distance.
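The principle of estimating a distance between genomes from small hash sketches can be illustrated as follows. This is a plain bottom-s MinHash sketch with the Mash distance formula, not the b-bit one-permutation rolling MinHash with densification that BinHash implements; the function names, the use of SHA-1 and the parameter values are assumptions made for the example.

```python
import hashlib
import math


def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hashes."""
    return sorted(kmer_hash(x) for x in kmer_set(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Share of the s smallest hashes of the union found in both sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)


def mash_distance(jaccard, k):
    """Mash distance: D = -ln(2j / (1 + j)) / k, capped at 1 when j = 0."""
    if jaccard == 0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k


k, s = 4, 16
a = sketch("ACGTACGTTGCAGGCTA", k, s)
same = jaccard_estimate(a, a, s)
disjoint = jaccard_estimate(sketch("AAAAAAAA", k, s), sketch("CCCCCCCC", k, s), s)
```

Each genome is reduced to at most s integers, so pairwise comparison cost no longer depends on genome length.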
Describes how quantitative trait loci (QTLs) determine interspecies competition and cooperation in a community. CoCoM analyzes markers from different species throughout their genomes and returns combinations of QTLs. It can depict a complete picture of the genetic architecture by highlighting neglected indirect and genome-genome epistatic genetic effects of QTLs. This tool employs a dynamic model that capitalizes on time series phenotypic data to retrieve interactions.
MOVIE / Multi-Omics VIsualization of Estimated contributions
Provides a framework to compare the performance of unsupervised multi-omics methods through the construction of the contribution plot. MOVIE evaluates method performance by examining the contribution of each sample in each data type towards the common variation space and utilizing a k-fold cross validation to assess stability and potential overfitting.
iVar / intrahost Variant analysis from replicates
Offers a method for processing virus sequencing data and calling intrahost single nucleotide variants (iSNVs) from technical replicates. iVar is an application that is able to measure intrahost virus diversity from cells, mosquitoes, primates, birds, and humans. This program can be applied to highlight ecological, epidemiological, and immunological drivers of virus evolution in a wide range of systems.
Estimates gene expression levels ab initio from sequences. ExPecto is based on a deep learning method with spatial feature transformation and L2-regularized linear models. It can be applied to a wide regulatory region of 40-kb promoter-proximal sequences. This tool builds a repository of potential regulatory sequence representations capable of determining the epigenomic effects of any genomic variant from sequence.
Performs tree reconstruction based on the supermatrix paradigm. OneTwoTree is a web-server that, for a given list of taxa names of interest, (i) retrieves all available sequence data from NCBI GenBank, (ii) clusters these into orthology groups, (iii) identifies the most informative set of markers, (iv) searches for an appropriate outgroup, and (v) assembles a partitioned sequence matrix that is then used for the final phylogeny reconstruction step.
LRScaf / Long Reads Scaffolder
Enables users to improve draft genomes using third generation sequencing (TGS) data. LRScaf introduces a strategy for keeping valid alignments and produces fewer misassemblies. Moreover, it was designed to separate the mapping and scaffolding procedures. For instance, this tool can be used for improving the contiguity of human draft assemblies.
Generates realistic Illumina reads via a sequencing simulator. InSilicoSeq relies on kernel density estimators to model the read quality of real sequencing data. This software is suited to simulating metagenomic samples and to creating sequencing data from a single genome. It can model GC-bias, insert size distribution and PHRED scores, and it features substitution, insertion and deletion errors.
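As a rough illustration of how PHRED scores relate to substitution errors, the following Python sketch converts a quality score Q into an error probability (p = 10^(-Q/10)) and applies substitutions to a read accordingly. It is not InSilicoSeq's implementation: the function names, the uniform choice of replacement base and the fixed seed are assumptions made for this example, and insertions, deletions and kernel density estimation are left out.

```python
import random


def phred_to_error_prob(q):
    """Convert a PHRED quality score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def add_substitutions(read, qualities, rng):
    """Substitute each base with the probability implied by its PHRED score."""
    bases = "ACGT"
    out = []
    for base, q in zip(read, qualities):
        if rng.random() < phred_to_error_prob(q):
            # Assumption: the replacement is drawn uniformly from the other bases.
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
noisy_read = add_substitutions("ACGTACGT", [30] * 8, rng)
print(noisy_read)
```

At Q30 each base has a 0.1% chance of being substituted, so most simulated reads of this length come back unchanged.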
Maps loci within natural populations and determines their contribution to the additive genetic variance. SNP-skimming enables an initial search for loci generating within-population variation. It can identify the effects of alleles that are not at intermediate frequency across sampled individuals. This tool conducts deep sequencing analysis for confirming the targeted loci.
Primal Scheme
Allows users to develop multiplex primer schemes. Primal Scheme is a web application that designs primer schemes for amplicon-based sequencing of RNA virus genomes directly from clinical samples. This platform first produces a set of candidate primers, which are then scored and ranked, and finally reports to the user the primer sequences with the lowest score. It includes options allowing the user to set the desired amplicon length as well as the required overlap.
Consists of an alignment-free functional binning and abundance estimation pipeline. Carnelian is a program that allows users to map translated metagenomic reads (amino-acid sequences) onto low-dimensional manifolds to construct a compact feature space. This tool has three main functions: it (i) translates reads from whole metagenome sequencing studies into amino acid sequences; (ii) leverages the low-density even-coverage Opal-Gallager hashes to encode translated reads into low-dimensional manifolds; and (iii) allows the production of functional vectors containing effective read counts.
MAVIS / Merging Annotation Validation and Illustration of Structural Variants
Offers a pipeline tool for post-processing structural variant calls. MAVIS predicts putative fusion products for structural variant calls with single base-pair resolution. This pipeline proceeds through six steps: converting, clustering, validating, annotating and illustrating, pairing and ultimately summarizing. It is suited to resolving multiple gaps in putative structural variant characterization that cannot be resolved with standalone callers or existing merging tools.
Allows users to determine the best-fit model and to infer maximum likelihood (ML) phylogenies on thousands of independent multiple sequence alignments (MSAs). ParGenes is a program that includes a simple command-line interface that lets users select the best-fit model, infer ML trees, and compute bootstrap support values on thousands of gene MSAs in a single MPI run.
Allows users to deduce whole genome duplication (WGD) from next generation sequencing (NGS) data. GenoDup is a pipeline that relies on the automatic computation of synonymous substitution rates per synonymous site (dS) from paralogous gene pairs and/or anchor gene pairs. This application was tested on two empirical datasets, from Arabidopsis thaliana and Oncorhynchus mykiss.
Allows users to extrapolate species richness and other relevant biodiversity patterns at the whole forest scale from local information on species. This tool is designed to infer macro-ecological patterns from local species’ occurrences. It also permits researchers to obtain information on species abundances to construct the relative species abundance (RSA) at both local and global scales.
Emphasizes global gene level coverage and inter-individual variation in breadth of coverage for genes. WEScover permits users to retrieve breadth and depth of coverage across population scale whole exome sequencing (WES) datasets. It computes a test statistic and p-value from a one-way analysis of variance to check for differences between population means. This tool provides a list of genetic tests with indications allowing users to find optimal genetic tests per phenotype and/or genes of interest.
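The one-way analysis of variance mentioned above can be sketched in a few lines of Python. This is a generic textbook computation of the F statistic, not WEScover's own code; the function name and the coverage values are hypothetical, and the p-value (which requires the F distribution) is left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical breadth-of-coverage values (%) for one gene in three populations.
coverage = [[91.0, 92.0, 93.0], [92.0, 93.0, 94.0], [95.0, 96.0, 97.0]]
print(one_way_anova_f(coverage))
```

A large F statistic indicates that mean coverage differs between populations by more than the within-population spread would suggest.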
gpps / General Parsimony Phylogeny from Single cell
Allows users to infer tumor progression, including mutation losses, from single cell sequencing data. gpps exploits an integer linear programming (ILP) approach that employs a maximum likelihood search to retrieve the best tree that explains the input, starting from single cell data. This software produces the ILP formulation, which is then passed to an ILP solver to obtain the optimal solution.
MVSE / Mosquito-borne Viral Suitability Estimator
Allows estimation of a climate-driven mosquito-borne viral suitability index. MVSE is a program that determines a suitability index based on a climate-driven mathematical expression for the basic reproductive number of a well-established mathematical model for the transmission dynamics of mosquito-borne viruses. For instance, this tool can be used for studying a mosquito-borne viral suitability index.
Predicts cancer-specific synthetic lethal (SL) interactions of any given susceptibility gene using a machine learning approach. DiscoverSL provides an integrative computational pipeline for prediction and in silico validation of SL interactions derived from patient-specific mutations in cancer. The software also includes additional plot modules for intuitive visualization. It can be useful for discovering clinically relevant and targetable synthetic lethal interactions in cancer.
Hapi / Haplotyping with imperfect genotype data
Allows inference of chromosome-length haplotypes using single gamete cells. Hapi enables users to perform phasing of an entire chromosome through three steps: (1) data preprocessing, (2) inference of draft haplotypes, and (3) assembly of high-resolution chromosomal haplotypes. The software also includes a crossover analysis module permitting downstream analyses and visualization of crossover positions identified in the observed gametes. It is able to analyze heterozygous single nucleotide polymorphisms (hetSNPs) data of single gamete cells generated using any genotyping platform.
Identifies microbial pathogens from metagenomic next-generation sequencing (mNGS) data. IDseq is a cloud-based platform whose purpose is to spare users the need for significant on-premise computational infrastructure to perform their analysis. This program was tested by using unbiased mNGS to study the etiology of fever in Ugandan pediatric patients in a rural district hospital.
BARM tools
Provides a collection of tools designed to work on the BARM database. BARM tools can measure the quantity of genes and their annotations within BARM for any kind of Baltic Sea metagenomic and meta-transcriptomic samples. This collection of programs allows researchers to reutilize the content of the repository in order to run their own investigations.
Provides a tool for ortholog identification. JustOrthologs identifies orthologs by exploiting the conservation of gene structure, using the lengths of coding sequence (CDS) regions and dinucleotide percentages. This software permits users to decrease ortholog classification runtimes so that they can run whole genome analyses of different species that were previously difficult to perform. It can also recover orthologous gene relationships without a sequence alignment.
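To make the notion of dinucleotide percentages concrete, here is a minimal Python sketch that computes them for a single CDS region. Counting overlapping dinucleotides is an assumption of this example, and the function name is hypothetical; how JustOrthologs itself counts and compares these features is not stated in the description above.

```python
from collections import Counter


def dinucleotide_percentages(cds):
    """Return the percentage of each overlapping dinucleotide in a CDS string."""
    cds = cds.upper()
    pairs = [cds[i:i + 2] for i in range(len(cds) - 1)]
    if not pairs:
        return {}  # sequences shorter than 2 bases have no dinucleotides
    counts = Counter(pairs)
    return {pair: 100.0 * n / len(pairs) for pair, n in counts.items()}


# Two alignment-free features for one hypothetical CDS region.
cds = "ATGGCCATG"
print(len(cds), dinucleotide_percentages(cds))
```

Features of this kind can be compared between genes without aligning their sequences.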
SECAPR / Sequence Capture Processor
Allows users to process targeted enriched Illumina sequences from raw reads to alignments. SECAPR serves to guide users from raw sequencing results to cleaned and filtered multiple sequence alignments (MSAs) for phylogenetic and phylogeographic analyses. This tool provides functions permitting researchers to choose appropriate settings for their specific datasets.
Offers a method for detecting mutational signatures. SparseSignatures is a package that combines an explicit background model, a LASSO approach and repeated bi-cross-validation, with the aim of reducing background noise as much as possible. This program is able to highlight signatures starting from a set of point mutations while computing their exposure values in each patient.
Deduces ancestral characters on a rooted phylogenetic tree. PastML can determine several likely states reflecting the uncertainty of the inferences. It aggregates the neighboring nodes with identical prediction to produce a compact, tree-shaped and graphical representation of the most likely ancestral scenarios. This tool uses a discrete approximation method of the marginal posterior probabilities of the character states, attached to the internal nodes of the tree.
Allows for reproducible, higher-resolution comparisons across studies. ZetaHunter is an application that uses a dataset with a defined taxonomy based on operational taxonomic units (OTUs). It includes a curated database to identify the members of the Zetaproteobacteria, though it can be used with any curated 16S rRNA gene database. This method can also be used for non-Zetaproteobacteria.
Provides a row-linear model to assess sensitivity and precision of genomic platforms. consensus permits users to make direct comparisons of platforms with each other and can be utilized to evaluate locus-specific biases and characteristics distinct to both the platforms themselves and the genome measured. This tool is independent of the biological variation in the data to assess the measurement quality and can be used as a screening procedure to detect platforms with deviant measurements.
gPCA / groupwise Principal Component Analysis
Quantifies the variational patterns among multiple groups of data sets, derived from several omics modalities, over various experimental and/or disease conditions. gPCA is an integrative framework based on two techniques: (i) the classical principal component analysis (PCA) and (ii) analysis of variance (ANOVA). This tool represents an approach to multivariate and multi-group analysis for reducing complex biological patterns to fundamental components.
CDRPN / Computing Distance for Rooted Phylogenetic Networks
Determines the topological dissimilarity for rooted phylogenetic networks. CDRPN is a web application which provides users with four possible metrics that can be run individually or simultaneously: (i) semi-equivalence metric; (ii) tripartition metric; (iii) equivalence metric and (iv) vector metric. This application can also be run as standalone software and accepts only files in extended Newick format.
Serves for genotype estimation from read depth in polyploids and diploids. polyRAD determines genotype probabilities through a Bayesian framework in which priors are based on mapping population design, Hardy-Weinberg equilibrium (HWE) or population structure with or without linkage disequilibrium (LD). It allows multi-allelic loci and can incorporate population structure and multiple inheritance modes with an option for mapping populations.
Personalized Regression
Estimates patient-specific models by learning patterns of differentiation between samples. Personalized Regression is a framework that introduces a regularizer matching distance in covariate values to distance in regression parameters. By focusing on learning patterns, this method can produce interpretable models of controllable granularity from patient-specific to pan-cancer.
Reconstructs individual haplotypes from next generation sequencing (NGS) data. HaploConduct is composed of two distinct parts: SAVAGE and POLYTE. SAVAGE focuses on intra-host virus strains and uses FM-index based data structures or an ad hoc consensus reference sequence to build overlap graphs from patient sample data. POLYTE creates haplotigs for diploid and polyploid genomes with an iterative scheme in which reads or contigs are joined in each iteration.
Determines the probability of detecting a minimum number of families or cases with variants in the same gene. MendelProb is an R package that can be used to determine the probability of identifying at least N families or unrelated cases with variants in the same gene. The software can serve for grant proposals and for designing Mendelian disease sequencing studies (candidate gene, exome and whole genome).
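A minimal Python sketch of this kind of calculation is a binomial tail probability. It assumes that each of the sequenced families independently carries a variant in the gene of interest with the same probability p; the function name and the numbers are hypothetical, and MendelProb is an R package whose actual model may include further factors not covered here.

```python
from math import comb


def prob_at_least(n_min, n_families, p):
    """P(X >= n_min) for X ~ Binomial(n_families, p)."""
    # Sum the probabilities of observing fewer than n_min carrier families.
    below = sum(
        comb(n_families, k) * p**k * (1 - p) ** (n_families - k)
        for k in range(n_min)
    )
    return 1.0 - below


# Hypothetical study: 50 families, each with a 5% chance of carrying a
# variant in the gene of interest; probability of finding at least 2.
print(round(prob_at_least(2, 50, 0.05), 4))
```

Evaluating the function over a range of sample sizes shows how many families would need to be sequenced to reach a target detection probability.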
CILP / Correlation by Individual Level Product
Determines the degree of correlation between two traits as a function of a variable of interest. CILP is based on a flexible linear modeling framework that builds and uses a vector of products between the two traits after normalization. It was utilized to map ‘correlation quantitative trait loci (QTLs)’, defined as single nucleotide polymorphisms (SNPs) that affect the magnitude of the correlation between two mRNA transcripts.
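The product-vector idea can be sketched in Python as follows. The example assumes plain z-score normalization and an ordinary least-squares slope as the linear model; both choices, the function names and the data are illustrative assumptions and not necessarily what CILP does.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a trait (population standard deviation; must be non-zero)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def product_vector(trait_a, trait_b):
    """Per-individual products of the two normalized traits."""
    return [a * b for a, b in zip(standardize(trait_a), standardize(trait_b))]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: two transcript levels and a genotype coded 0/1/2.
trait_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trait_b = [2.1, 1.9, 3.2, 3.8, 5.1, 6.3]
genotype = [0, 0, 1, 1, 2, 2]
products = product_vector(trait_a, trait_b)
print(slope(genotype, products))
```

The mean of the product vector equals the Pearson correlation between the two traits, so a non-zero slope suggests that the correlation changes with the variable of interest.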
Analyzes single nucleotide variants (SNVs) and copy number alterations (CNAs) of multiple cancer exomes. superFreq annotates and filters SNVs and short indels, calls CNAs and tracks clones over multiple samples from the same individual. It is suited to detecting and tracking somatic mutations in exomes and can be applied to examine breast cancer metastasis, lung cancer xenografts and myeloid leukaemia.
FAD / Fast Amplicon Denoising
Proposes a method dedicated to denoising long Pacific Biosciences amplicons. FAD is a method that mainly focuses on low-noise scenarios with the aim of being efficient in analyzing short amplicons of high-quality sequencing with a good read-per-template coverage. This program first de-replicates reads to sort them by abundance while ignoring all those that do not occur at least twice.
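The de-replication step described above can be sketched in Python as follows. This is an illustration only: the function name, the `min_count` parameter and the alphabetical tie-breaking rule are assumptions made for the example, and FAD's later denoising steps are not shown.

```python
from collections import Counter


def dereplicate(reads, min_count=2):
    """Collapse identical reads, drop rare ones, and sort by abundance."""
    counts = Counter(reads)
    kept = [(seq, n) for seq, n in counts.items() if n >= min_count]
    # Most abundant first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "GGCC"]
print(dereplicate(reads))  # [('ACGT', 3), ('TTGA', 2)]
```

Reads seen only once, such as `GGCC` here, are ignored before any further processing.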