Offers a complete workflow for de novo analysis of users’ own raw 16S rRNA gene amplicon datasets for the sake of comparison with existing data. IMNGS is an innovative platform that uniformly and systematically screens for and processes all prokaryotic 16S rRNA gene amplicon datasets available in the Sequence Read Archive (SRA) and uses them to build sample-specific sequence databases and OTU-based profiles. Users can easily query this integrative sequence resource via a web interface.
Uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all ~55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time.
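The sketching idea described above can be illustrated with a minimal sketch in Python. It is not Mash's implementation: the function names, the default k-mer and sketch sizes, and the use of BLAKE2b are assumptions made for illustration, and canonical (strand-independent) k-mers are omitted. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance computed from the estimated Jaccard index j.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=5, size=50):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes.

    BLAKE2b is an illustrative choice of hash function (an assumption).
    """
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=50):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:size]  # sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance from Jaccard estimate j; 1.0 when nothing is shared."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

For two sequences, `mash_distance(jaccard_estimate(sketch(s1), sketch(s2)), k=5)` gives a value near 0 for near-identical inputs and 1.0 when no sketched k-mer is shared. Only the fixed-size sketches need to be stored and compared, which is where the size reduction comes from.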
Allows users to measure bacterial strain-level gene content, single nucleotide polymorphisms (SNPs) and species abundance from shotgun metagenomes. MIDAS categorizes genetic variants into strains to support large-scale population-genetic analysis of metagenomes. The application provides a computational pipeline that combines taxonomic profiling with alignment to both pan-genomes and whole genomes, permitting comparison against over 30,000 reference genomes.
Offers a reference-free metagenomic binning method for identifying bacterial species and quantifying their distribution. MetaGen clusters short contigs for samples with low coverage and distinguishes species with high sequence similarities. MetaGen can also estimate the relative abundance of cultured and uncultured species simultaneously, which provides a way to study distributional changes in microbial colonies dynamically and spatially.
Allows users to explore the biogeography of genes from marine planktonic organisms. Ocean Gene Atlas is a web application permitting researchers to query protein or nucleotide sequences against global ocean reference gene catalogs. It supplies a submission interface to collect a nucleic or protein sequence query. It then executes data mining procedures on dedicated high-performance hardware, and returns interactive result panels for data exploration.
Parses microbial profiles and, because gene copy number (GCN) estimates are pre-computed for all taxa in the reference taxonomy, rapidly corrects GCN bias. The CopyRighter bioinformatic tool permits rapid correction of GCN bias in microbial surveys, resulting in improved estimates of microbial abundance and of alpha and beta diversity.
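The kind of correction described above can be sketched as follows. This is a minimal illustration, not CopyRighter's code: the function name, the dictionary-based inputs, and the fallback copy number for taxa missing from the lookup table are all assumptions. Each taxon's read count is divided by its estimated 16S copy number and the profile is renormalized to relative abundances.

```python
def correct_gcn(counts, copy_numbers, default_gcn=1.0):
    """Return copy-number-corrected relative abundances.

    counts:       taxon -> observed 16S read count
    copy_numbers: taxon -> estimated 16S gene copies per genome
    default_gcn:  fallback for taxa without an estimate (an assumption)
    """
    corrected = {
        taxon: count / copy_numbers.get(taxon, default_gcn)
        for taxon, count in counts.items()
    }
    total = sum(corrected.values())
    if total == 0:
        return {taxon: 0.0 for taxon in corrected}
    return {taxon: value / total for taxon, value in corrected.items()}
```

With hypothetical taxa `A` (80 reads, 8 copies) and `B` (20 reads, 2 copies), the raw profile suggests an 80:20 split, while the corrected profile is 50:50.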
Consists of a modeling framework that integrates dynamic flux balance analysis with diffusion on a lattice. COMETS implements a dynamic flux balance analysis (FBA) algorithm on a lattice, making it possible to track the spatio-temporal dynamics of multiple microbial species in complex environments with complete genome scale resolution. This dynamic flux balance analysis (dFBA) allows users to perform time-dependent metabolic simulations of microbial ecosystems, bridging the gap between stoichiometric and environmental modeling.
Phylocom integration, community analyses, null-models, traits and evolution. A software package that provides a comprehensive set of tools for analyzing the phylogenetic and trait diversity of ecological communities. The package calculates phylogenetic diversity metrics, performs trait comparative analyses, manipulates phenotypic and phylogenetic data, and performs tests for phylogenetic signal in trait distributions, community structure and species interactions.
Investigates presence/absence probability of metabolic pathways in Metagenome-Assembled Genomes (MAGs). MetaPOAP is composed of three scripts that include functions for false negative estimates, which check for genes potentially missing from a pathway, and for false positive estimates, which evaluate errors in assigning genes to a MAG. This application can facilitate the prediction of metabolic potential.
Identifies significant co-occurrence patterns by finding sparse solutions to a system with a deficient rank. To be specific, we construct the system using log ratios of count or proportion data and solve the system using the l1-norm shrinkage method. Our comprehensive simulation studies show that REBACCA (i) achieves higher accuracy in general than the existing methods when a sparse condition is satisfied; (ii) controls the false positives at a pre-specified level, while other methods fail in various cases and (iii) runs considerably faster than the existing comparable method.
A Bayesian mixture model framework for resolving complex metagenomic mixtures. metaMix is designed to address interpretation issues associated with closely related strains in the sample, low abundance organisms and absence of genomes from the reference database.
A multilevel regularized regression method to simultaneously identify taxa and construct networks. The present methodology offers a new perspective on disease association and community structure that would be particularly valuable in tackling such extensive datasets. The proposed approach can also be applied to most RNA-seq data directly provided they are normalized.
Cleans PSI-BLAST generated profiles of erroneous extensions caused by domain insertions. Hangout is a procedure that cleans the profile and prepares it for subsequent remote homology searches with various tools, such as PSI-BLAST and HHsearch. The software is useful for large-scale bioinformatics efforts initiated from defined structure domains and requiring uncorrupted sequence profiles for subsequent analysis.
Provides a database engine for whole-metagenome sequencing data. Amordad exploits alignment-free principles to describe metagenomes as points in a high-dimensional geometric space. It frames the metagenome comparison problem in a geometric context and utilizes an indexing strategy that merges random hashing with a regular nearest neighbor graph. This software’s main goal is to support rapid indexing and retrieval even as data volumes reach massive scales.
Provides a set of geographic utilities for sequencing-based microbial ecology studies. Although the geographic location of samples is an important aspect of environmental microbiology, none of the major software packages used in processing microbiome data include utilities that allow users to map and explore the spatial dimension of their data. phylogeo solves this problem by providing a set of plotting and mapping functions that can be used to visualize the geographic distribution of samples, to look at the relatedness of microbiomes using ecological distance, and to map the geographic distribution of particular sequences.
Generates "policy prescriptions" for microbiome engineering. MDPbiome builds a model suggesting a “prescription” of external perturbations that should be applied to a given microbiome, and will result in its navigation through a subset of healthy or acceptable states, avoiding disease or other undesirable states, for finally reaching a goal state. The software can evaluate multiple microbiome states and enables a variety of different questions to be asked of the same dataset. It can be applied to various temporal metagenomics datasets.
Subsamples reads from the sequencing file and calculates four different statistics: k-mer frequency, 16S abundance, prokaryotic-read abundance, and viral-read abundance. These metrics are used to train a Random Forest classifier that classifies the sequencing data. PARTIE provides mechanisms for both supervised and unsupervised classification. It is being used to routinely reclassify data sets from the Sequence Read Archive (SRA).
Permits users to analyze data from Oxford Nanopore Technologies (ONT) MinION sequencers. Dyss is a statistical model, derived from a traditional signal-processing algorithm, that yields a constant-time classifier for any background DNA profile. This tool was developed to handle DNA reads of low relative abundance and to model selective sequencing of target DNA.
Automates modeling and computational design of microbial communities. FLYCOP is a framework intended for use with multiple microbial scenarios with distinct goals, while avoiding manual checking steps. It enables the incorporation and analysis of genome-scale metabolic models (GEMs) describing partners in the community. This program aims to assist users in understanding the organization of microbial communities at the systems level.
Identifies associations between phenotypes and microbial taxa. BDMMA is a method that uses Dirichlet-multinomial (DM) regression with integrated batch effects to reduce false discovery rates. It accounts for the characteristics of metagenomic data, considers dependences between microbial taxa, highlights associations between microbial taxa and targeted covariates, and models batch effects.
Allows users to test the mediation effect of the human microbiome. MedTest is a program assisting researchers to detect the structured mediators and can be applied to any genomics data with different structures (e.g., linkage disequilibrium (LD) structure for genetic data). It focuses on identifying mediation effect by using an ensemble of distance measures and can recognize specific taxa or operational taxonomic units (OTUs) accounting for the mediation effect.
A framework that uses distributed computational resources for gene quantification in metagenomes. Tentacle is implemented using a dynamic master-worker approach in which DNA fragments are streamed via a network and processed in parallel on worker nodes. Tentacle is modular, extensible, and comes with support for six commonly used sequence aligners. It is easy to adapt Tentacle to different applications in metagenomics and easy to integrate into existing workflows.
Analyses internal transcribed spacer (ITS) sequences to deduce highly variable ITS1 and ITS2 subregions. ITSx is based on the hidden Markov model (HMM). It can investigate the sequences in the default orientation. This tool repeats the search in the reverse complementary orientation to account for incorrectly cast sequences. It was tested on thirteen data sets of known, full-length ITS sequences from a total of nine major eukaryotic groups.
Detects significant non-random patterns of co-occurrence (copresence and mutual exclusion) in incidence and abundance data. CoNet serves to open new opportunities for future targeted mechanistic studies of the microbial ecology of the human microbiome. It has been designed with (microbial) ecological data in mind, but can be applied in general to infer relationships between objects observed in different samples (for example between genes present or absent across organisms).
Reconstructs haplotypes from complex microbiomes. Hansel/Gretel is composed of a data structure for the storage and manipulation of evidence (Hansel), and an algorithm for the recovery of haplotypes from a metahaplome (Gretel). It is able to recover and rank haplotypes using evidence of pairs of single nucleotide polymorphisms (SNPs) observed on sequenced reads. It can extract haplotypes from metagenomic data of microbial communities and can be applied to analogous haplotyping problems.
Allows characterization of human-associated bacterial and phage communities by their inferred relationships. This network-based approach provides a basic understanding of the network dynamics associated with phage and bacterial communities on and in the human body.
Allows users to identify and analyze small-subunit (SSU) rRNA gene fragments from shotgun metagenomic sequences. SSUsearch is a standalone software that can manage large datasets. The application allows users to perform a SSU rRNA gene fragment search and unsupervised operational taxonomic unit (OTU) analysis. It is able to process diversity analysis with copy number correction on multiple variable regions.
Analyzes the time series data of microbial community profiles. MetaMIS is based on a Lotka-Volterra model and can infer interaction networks. It tolerates a high level of missing data and ignores the influence of rare microbes when estimating interactions. The tool can be used in comparative studies thanks to its capability to organize multiple interaction networks into a consensus network. It allows researchers to analyse interactive relations conveniently and to visualize network topology.
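To make the underlying model concrete, here is a minimal forward simulation of a generalized Lotka-Volterra system, dx_i/dt = x_i (r_i + sum_j a_ij x_j). It is only a sketch of the kind of model named above: the generalized form, the explicit Euler integration, and all names and parameters are assumptions, and the sketch does not perform the inverse task of inferring the interaction matrix from observed time series.

```python
def glv_step(x, r, a, dt):
    """Advance abundances x by one explicit Euler step of size dt.

    r[i] is the intrinsic growth rate of taxon i and a[i][j] is the effect
    of taxon j on taxon i (hypothetical parameters for illustration).
    """
    n = len(x)
    new = []
    for i in range(n):
        growth = r[i] + sum(a[i][j] * x[j] for j in range(n))
        # Clamp at zero so numerical error cannot produce negative abundances.
        new.append(max(0.0, x[i] + dt * x[i] * growth))
    return new


def simulate(x0, r, a, dt=0.01, steps=1000):
    """Return the trajectory [x(0), x(dt), ..., x(steps * dt)]."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(glv_step(trajectory[-1], r, a, dt))
    return trajectory
```

For a single taxon with `r = [1.0]` and `a = [[-1.0]]`, the model reduces to logistic growth and the simulated abundance approaches the carrying capacity of 1. The sign of each off-diagonal `a[i][j]` corresponds to an edge type (positive or negative influence) in an interaction network.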
Serves for maintaining, persisting and searching complete matrices, built on top of BugMat. findNeighbour comprises two components: (1) an OpenMP parallelized C++ application derived from BugMat maintains an in-memory distance matrix derived from mapped genomic data; and (2) a database allows storage and querying of arbitrary meta-data about the sequence. This database permits storage of quality information about the sequence, such as the number of bases called in the sequence.
Builds and persists sparse distance matrices given a set of sequences. findNeighbour2 compresses sequences and stores them both in RAM and on disc following reference-based compression. It allows addition of samples, testing whether an identifier is present, determining distance between a pair of samples, or determining neighbors of one sample.
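The combination of reference-based compression and a sparse distance matrix can be sketched as follows. This is an illustration under stated assumptions, not findNeighbour2's code: the function names and the distance threshold are hypothetical, sequences are assumed to be aligned to the reference with equal length, and uncalled bases are not treated specially. Each sample is stored only as the positions where it differs from the reference, and only pairs within the threshold are kept.

```python
def compress(seq, reference):
    """Store a sequence as {position: base} for positions differing from the reference."""
    if len(seq) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    return {i: b for i, (b, r) in enumerate(zip(seq, reference)) if b != r}


def distance(a, b):
    """Count positions at which two compressed samples differ.

    Positions absent from both dictionaries match the reference in both
    samples, so only the union of variant positions needs checking.
    """
    return sum(1 for pos in a.keys() | b.keys() if a.get(pos) != b.get(pos))


def sparse_matrix(samples, reference, threshold):
    """Return {(name1, name2): distance} for pairs with distance <= threshold."""
    compressed = {name: compress(seq, reference) for name, seq in samples.items()}
    names = sorted(compressed)
    neighbours = {}
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            d = distance(compressed[x], compressed[y])
            if d <= threshold:
                neighbours[(x, y)] = d
    return neighbours
```

Storing only close pairs keeps the matrix small when most samples are distant from one another, and the neighbours of one sample are simply the stored pairs that contain it.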
Serves for building large distance matrices. bugMat performs one-off distance matrix construction using a two-phase computation. It first builds an index of variant bases in the sequence and then performs multi-threaded computation of pairwise distances between every sample pair, using shared memory to store all data. It includes a series of technical optimizations supporting the implementation of the variation model.
Generates consensus taxonomy of targeted amplicon sequence data. CONSTAX is a program that functions independently of operational taxonomic unit (OTU)-picking method to merge taxonomy assignments from multiple classifier programs into an improved consensus taxonomy. It generates several output files that can be used for subsequent community analysis. The software improves taxonomy assignments of environmental OTUs.
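One simple way to merge assignments from several classifiers is a rank-by-rank majority vote, sketched below. This is an assumption for illustration and not CONSTAX's documented rule set: the function name, the list-per-classifier input format, and the tie-handling policy (stop at the first tied rank) are all hypothetical.

```python
from collections import Counter


def consensus_taxonomy(assignments):
    """Merge per-classifier lineages into one consensus lineage.

    assignments: one list per classifier, ordered from highest to lowest
    rank, with None where that classifier made no assignment.
    """
    if not assignments:
        return []
    result = []
    active = list(assignments)  # classifiers that agree with the consensus so far
    for rank in range(max(len(a) for a in assignments)):
        votes = Counter(a[rank] for a in active if rank < len(a) and a[rank])
        if not votes:
            break  # no classifier assigned this rank
        (top, n), *rest = votes.most_common(2)
        if rest and rest[0][1] == n:
            break  # tie: leave this rank and everything below unassigned
        result.append(top)
        active = [a for a in active if rank < len(a) and a[rank] == top]
    return result
```

Restricting lower-rank votes to classifiers that agreed at the higher ranks keeps the merged lineage internally consistent. A real pipeline would typically also weigh each classifier's confidence score.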
Serves computationally demanding jobs under high stress. Visibiome is a microbiome search engine that boasts various architectural features to be scalable to many simultaneous user requests. It also features a comprehensive set of prepared samples against which user samples can be immediately compared removing the need to self-curate databases. Visibiome is available as a web server, as source code or as a pre-configured virtual machine.
Performs similarity searches of short sequences against the ‘‘nt’’ nucleotide database provided by NCBI and, out of every hit, extracts the textual metadata field. Seqenv is a software that determines the types of environment in which a given sequence has previously been found. The strategy adopted in seqenv is to take input sequences and match them against the NCBI’s database using the time-tested BLAST search algorithm.
Identifies the “core microbiome” associated with a given habitat. COREMIC uses presence/absence data to perform a complementary analysis different from that of existing methods. It allows the development of a working hypothesis in the search for microbes well suited for a habitat or host-microbe interaction. The tool can also be used to confirm laboratory studies that have identified target microbes that might be important symbionts or thought to be associated with a specific habitat.
Groups DNA sequences via clustering. STARS uses the K-means algorithm and GSP methods to perform cluster analyses of DNA sequences. This software assesses the differentiation capability of DNA sequences with molecular markers employed in phylogenetic analyses. Markers are selected using three criteria: (1) the marker must code for proteins, (2) the marker should have been used in a wide range of the tree of life and (3) the marker should possess a homogeneous length and minimum number of reported copies.
Simulates horizontal gene transfers (HGTs) between the genomes of microbial communities. HgtSIM can integrate different degrees of similarity for transferred genes found in donor and recipient genomes. It is able to assess the recovery rate of HGTs from a simulated metagenomic shotgun-sequencing dataset after various sequence assembly processes. This tool can assist in the development of robust pipelines that have maximal success in recovering HGT from complex metagenomic data.
Checks and collects defined hypervariable sequence segments (V1-V9) from bacterial, archaeal, and fungal small-subunit rRNA sequences. V-Xtractor is not sensitive to false positives. It was created to simplify subsequent analysis in community assays. This tool uses hidden Markov models. It does not need prior multiple sequence alignments to obtain phylogenetically comparable regions.
Shares, validates, and documents mock community data resources. mockrobiota includes data set and sample metadata, expected composition data, and links to raw data for each mock community data set. It does not supply physical sample materials directly, but the data set metadata included for each mock community indicate whether physical sample materials are available. The tool currently requires expected observation data in the form of sequence annotations, e.g., taxonomy or gene annotations, but also references sequences in the form of accession numbers.
Characterizes microbial samples from nucleotide or protein sequences. Traitar provides phenotype classifiers to predict 67 traits related to the use of various substrates as carbon and energy sources, oxygen requirement, morphology, antibiotic susceptibility, proteolysis, and enzymatic activities. The software suggests protein families associated with the presence of particular phenotypes. It may help researchers in microbiology to pinpoint the traits of interest, reducing the amount of wet lab work required.
Creates sample identifiers that are unique across projects, project teams, and institutions and that have several properties: they are short, correctable with respect to common types of transcription errors, opaque, and compatible with existing standards without reliance on centralized infrastructure. cual-id allows users to assign globally unique universally unique identifiers (UUIDs) to their samples. It generates human-friendly 4- to 12-character identifiers that map to their UUIDs and are unique within a project.
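The relationship between full UUIDs and short, correctable identifiers can be sketched as below. This is a minimal illustration and not cual-id's implementation: taking the trailing hexadecimal characters of the UUID as the short identifier, the fuzzy-matching correction via `difflib`, and all function names and defaults are assumptions.

```python
import difflib
import uuid


def create_ids(n, length=8):
    """Return {short_id: UUID} with n short identifiers unique within the project."""
    if not 4 <= length <= 12:
        raise ValueError("length must be between 4 and 12 characters")
    mapping = {}
    while len(mapping) < n:
        full = uuid.uuid4()
        # Short ID = trailing hex characters; rare collisions are skipped.
        mapping.setdefault(full.hex[-length:], full)
    return mapping


def fix_id(candidate, known_ids, cutoff=0.8):
    """Map a possibly mistyped ID to a known ID, or None if absent or ambiguous."""
    matches = difflib.get_close_matches(candidate, list(known_ids), n=2, cutoff=cutoff)
    return matches[0] if len(matches) == 1 else None
```

Because the short identifier is derived from the UUID, a label written on a tube can be resolved back to the globally unique identifier, and a single mistyped character can often be corrected as long as only one known identifier is close enough.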
Clusters reads with the same barcode into groups of reads that were drawn from a single long fragment. Minerva is based on the utilization of barcode co-occurrence information and k-mer overlaps. It draws heavily from the field of topic modeling in Natural Language Processing (NLP). This tool offers a partial solution to the barcode deconvolution problem for metagenomics. It can be useful to find structural variations and other genetic structures in the human genome.
Serves for paired and longitudinal microbiome analyses. q2-longitudinal is a plugin for the QIIME 2 microbiome analysis platform. It contains several functions for analyzing longitudinal and paired-sample data, including paired differences and distances, volatility analysis, first differencing, and microbial interdependence tests. It assists users in evaluating the magnitude of change within individual subjects and in applying random effects models.
Facilitates usability of scikit-learn classifiers in microbiome data studies. q2-sample-classifier encompasses a wide range of supervised learning algorithms dealing with pattern recognition. The module can be accessed through a graphic interface or as a command-line function and supports many feature observations such as counts of amplicon sequence variants or shotgun metagenome sequencing methods.
Consists of an alignment-free functional binning and abundance estimation pipeline. Carnelian is a program that allows users to represent translated metagenomic reads (amino-acid sequences) in low-dimensional manifolds to construct a compact feature space. This tool has three main functions: it (i) translates reads (amino acid sequences) from whole metagenome sequencing studies; (ii) leverages the low-density even-coverage Opal-Gallager hashes to encode translated reads into low-dimensional manifolds; and (iii) allows the production of functional vectors containing effective read counts.
Assists in computing distance, identification, and placement of genome-skim queries on to a reference collection. Skmer is a Python package that calculates genomic distance between two organisms represented by their k-mer collections obtained from the genome-skims. It also uses distance estimates to match a genome-skim query to a reference collection.
Provides a method for studying population-scale microbiome data. tmap enables the adoption of topological data analysis (TDA) in microbiome data analysis pipelines, thus permitting users to interpret large-scale complex data. It also offers a network-based statistical method for enterotype analysis, driver species identification, and microbiome-wide association of host meta-data.