A classification system designed for metagenomics experiments that assigns taxonomic labels to short DNA reads. PhymmBL combines two components, (i) composition-directed taxonomic predictions from Phymm and (ii) basic local alignment search tool (BLAST)-based homology results, to label each input sequence with its best guess as to the taxonomy of the source organism. Its authors report that input sequences as short as 100 base pairs can be phylogenetically classified with PhymmBL more accurately than with any other existing method. PhymmBL predicts species, genus, family, order, class and phylum for each read, allowing users to arrange results according to levels of specificity relevant to their research goals.
A program for unsupervised binning of metagenomic contigs. CONCOCT uses nucleotide composition (k-mer frequencies), coverage data across multiple samples and linkage data from paired-end reads.
Integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. It automatically forms hundreds of high-quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node.
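The tetranucleotide frequency signal mentioned above can be illustrated with a minimal sketch. It assumes a plain sliding window over one strand and skips windows with ambiguous bases; the function name `tetranucleotide_frequencies` is hypothetical and this is not MetaBAT's own code, which may differ (for example in how reverse complements are handled).

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence, k=4):
    """Return the relative frequency of every k-mer over the ACGT alphabet.

    Assumption: windows containing other characters (for example N)
    are ignored, and reverse complements are NOT merged.
    """
    sequence = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


profile = tetranucleotide_frequencies("ACGTACGT")
print(len(profile), profile["ACGT"])
```

Each contig then becomes a 256-dimensional vector that can be compared with the vectors of other contigs.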
Bins and annotates short paired-end reads. MetaCluster-TA is an assembly-assisted approach which, instead of annotating each read or assembled contig separately, bins similar reads/contigs into the same cluster and annotates the whole cluster. The software consists of three phases: (i) construction of long virtual contigs from assembly and probabilistic grouping of short reads, (ii) q-mer distribution estimation and clustering and (iii) cluster annotation and merging.
A software tool for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm. Users can explore the underlying bins (genomes) of the microbes in their metagenomes by simply providing assembled metagenomic sequences and either read coverage information or the sequencing reads. For convenience, MaxBin reports genome-related statistics, including estimated completeness, GC content and genome size, in the binning summary page. After binning is finished, users can run MEGAN or similar software on the MaxBin bins to determine the taxonomy of each bin.
Performs assignment for bacterial 16S and amplicon sequencing. HmmUFOtu is a pipeline that aims to assist users in determining microbial community composition and diversity. It places every read from the submitted sequences within a known reference tree, then performs a phylogeny-based operational taxonomic unit (OTU) clustering and produces an assignment for each read. This application supports a wide range of DNA substitution models, including GTR and HKY85.
Improves the contig binning produced by existing binning tools. D2SBin calculates the dissimilarity between a contig and its bin’s center based on a Markov model of k-tuple sequence compositions. The software favors a relative sequence composition model over the direct application of absolute sequence composition. In addition, it depends only on the k-tuples of a single metagenomic sample.
A Java-based application which offers efficient and intuitive reference-independent visualization of metagenomic datasets from single samples for subsequent human-in-the-loop inspection and binning. The method is based on nonlinear dimension reduction of genomic signatures and exploits the superior pattern recognition capabilities of the human eye-brain system for cluster identification and delineation. We demonstrate the general applicability of VizBin for the analysis of metagenomic sequence data by presenting results from two cellulolytic microbial communities and one human-borne microbial consortium. The superior performance of our application compared to other analogous metagenomic visualization and binning methods is also presented.
A general framework that automatically bins contigs into OTUs based upon sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is employing L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering by sparsity regularization. In addition, the COCACOLA framework seamlessly incorporates customized knowledge to improve binning accuracy.
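The difference between the L1 and Euclidean distances mentioned above can be seen on two small composition vectors. This is a generic sketch of the two metrics with hypothetical function names and made-up example vectors, not code taken from COCACOLA.

```python
import math


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two hypothetical normalized composition profiles.
contig_a = [0.5, 0.3, 0.2]
contig_b = [0.2, 0.3, 0.5]
print(l1_distance(contig_a, contig_b), euclidean_distance(contig_a, contig_b))
```

Because L1 does not square the differences, a few large deviations dominate it less than they dominate the Euclidean distance.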
Allows analysis of large sets of amplicon sequences and yields abundance tables of Operational Taxonomic Units (OTUs) with their taxonomic affiliation. FROGS is a set of 13 tools, designed for biologists and bioinformaticians, that processes amplicon reads coming from Illumina or Roche sequencing technologies. The software can produce accurate community compositions, including at fine scales (species or genus) and in large communities (>100 different species) with very heterogeneous abundances.
Metagenomes are often characterized by high levels of unknown reads with no similarity to any sequences in GenBank. Although these are often discarded from analysis, they contain a wealth of information for comparative metagenomics. crAss is a tool that enables fast and intuitive analysis of complete metagenomic data sets by counting the number of shared contigs between samples in a cross-assembly of all reads.
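The idea of counting shared contigs in a cross-assembly can be sketched as follows. The input format (a mapping from contig name to the set of samples that contributed reads to it) and the function name are assumptions made for illustration; crAss itself works from assembler output files and applies its own distance formulas on top of such counts.

```python
from collections import Counter
from itertools import combinations


def shared_contig_counts(contig_to_samples):
    """Count, for every pair of samples, the contigs that contain reads from both."""
    counts = Counter()
    for samples in contig_to_samples.values():
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(set(samples)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical cross-assembly of three samples.
cross_assembly = {
    "contig1": {"A", "B"},
    "contig2": {"A", "B", "C"},
    "contig3": {"C"},
}
print(shared_contig_counts(cross_assembly))
```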
Represents a shotgun metagenome from an arbitrary environment as a modified de Bruijn graph consisting of simplified components. MetaFast lies between k-mer spectrum analysis and assembly and combines the best of these two alignment-free approaches: the speed of the former with the precision of the latter. Because it does not depend on a reference, it performs efficiently on both extensively studied and novel microbiota types. For multiple metagenomes, the resulting representation is used to obtain a pairwise similarity matrix. The dimensional structure of the metagenomic components preserved in our algorithm reflects the inherent subspecies-level diversity of microbiota. MetaFast is computationally efficient and especially promising for the analysis of metagenomes from novel environmental niches.
An assembly-assisted approach for reference-free metagenomic binning. MetaProb can deal with short and long reads in a novel probabilistic framework, by using probabilistic sequence signatures. We compared its binning performance on several short- and long-read datasets against other state-of-the-art binning algorithms, showing that MetaProb achieves the best F-measure in most cases. MetaProb can also estimate the number of species in a metagenomic sample, adding a degree of freedom to the analysis.
Investigates and captures the complex structure of metagenomic datasets. BMC3C conducts clusterings on the datasets with different initializations or algorithms. It employs independent statistics of codon usage to represent contigs. The tool combines the advantages of the base clustering methods while neutralizing, or even avoiding, their disadvantages.
Allows users to batch-process FASTA and FASTQ files for amplicon sequencing studies. SEED simplifies clustering, quality filtering/trimming, taxonomic identification, the creation and description of molecular taxa and their phylogenetic placements, and the quick assessment of basic microbial community statistics. Moreover, it includes a graphical user interface (GUI) to process data from Illumina, Ion Torrent and Sanger sequencing.
Separates short paired-end reads from different organisms in a metagenomic dataset. TOSS uses abundance levels to separate genomes and can distinguish unique l-mers from repeats. The tool starts by constructing a graph of l-mers and then clusters the unique l-mers. It can be used for very short reads and is able to handle multiple genomes with arbitrary abundance levels and sequencing errors.
An automated binning tool that combines genomic signatures, marker genes and optional contig coverages within one or multiple samples, in order to visualize the metagenomes and to identify the reconstructed genomic fragments. We demonstrate the superior performance of MyCC compared to other binning tools including CONCOCT, GroopM, MaxBin and MetaBAT on both synthetic and real human gut communities with a small sample size (one to 11 samples), as well as on a large metagenome dataset (over 250 samples). Moreover, the visualization of metagenomes in MyCC aids in the reconstruction of genomes from distinct bins.
Uses tetranucleotide frequencies, differential coverage and read mapping information to bin assembled contigs. MetaWatt uses DIAMOND blastx, HMMER and ARAGORN for quality control. MetaWatt is very fast, runs on a standard PC or laptop and offers a graphical user interface for effective data exploration.
Aligns each mate-pair to produce a composite read with Phred-like quality scores. SHERA recalculates the error probability of each base, given the base call data from both reads for nucleotides in the aligned region. It constitutes a practical program for metagenome sequencing and analysis. This package can be used with any existing or emerging short-read sequencing platform capable of producing mate-pairs.
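How an error probability can be recalculated from two overlapping base calls is sketched below. This is a generic Bayesian model, not SHERA's published formula: it assumes a uniform prior over the four bases, independent errors in the two reads, and that a miscalled base is equally likely to be any of the three wrong bases. All function names are hypothetical.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability (> 0) to a Phred-like score."""
    return -10 * math.log10(p)


def combine_calls(base1, q1, base2, q2):
    """Return (consensus_base, posterior_error) for one aligned position."""
    p1, p2 = phred_to_error(q1), phred_to_error(q2)
    if base1 == base2:
        both_right = (1 - p1) * (1 - p2)
        # Both reads wrong AND agreeing on the same wrong base
        # (3 candidates, each with weight (p1/3)*(p2/3)).
        both_wrong = p1 * p2 / 3
        return base1, both_wrong / (both_right + both_wrong)
    # Disagreement: weigh "read 1 right", "read 2 right", "both wrong".
    w1 = (1 - p1) * (p2 / 3)
    w2 = (p1 / 3) * (1 - p2)
    w_other = 2 * (p1 / 3) * (p2 / 3)
    total = w1 + w2 + w_other
    if w1 >= w2:
        return base1, 1 - w1 / total
    return base2, 1 - w2 / total


base, error = combine_calls("A", 10, "A", 10)
print(base, round(error_to_phred(error), 1))
```

Under these assumptions, agreement between the reads raises the confidence of the composite base, while disagreement keeps the higher-quality call but lowers its confidence.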
Visualizes genome bins. ICoVeR allows users to curate bin assignments based on multiple binning algorithms. It was tested on the refinement of disparate genome bins automatically generated by other binning algorithms for an anaerobic digestion metagenomic dataset. The tool makes the bin refinement process faster and more replicable. It also captures the provenance of changes made in the course of an exploratory task.
Models clusters of sequences. PHYSCIMM uses interpolated Markov models (IMMs). It was tested by clustering sequencing reads from an in vitro-simulated microbial community. The tool partitions the sequences using supervised Phymm classifications before the unsupervised iterative IMM clustering stage. It can be useful in many bioinformatics applications. SCIMM and PHYSCIMM will be valuable tools for researchers seeking to determine the relationships between sequencing reads from a metagenomics project.
A taxonomy-independent approach to cluster environmental shotgun reads, by considering k-mer frequency in reads and Markov properties of the inferred OTUs. Tested on twelve simulated datasets, MBBC reliably estimated the species number, the genome size, and the relative abundance of each species, independent of whether there are errors in reads. Tested on multiple experimental datasets, MBBC outperformed two state-of-the-art taxonomy-independent methods, in terms of the accuracy of the estimated species number, genome sizes, and percentages of correctly assigned reads, among other metrics.
Allows rapid estimation of pairwise dissimilarity between metagenomes. Though we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even metagenomes with novel microbiota. A dissimilarity measure based on the k-mer spectrum provides a wider perspective than measures based on alignment against reference sequence sets. It helps avoid missing notable features of metagenomic composition, particularly those related to the presence of unknown bacteria, viruses or eukaryotes, as well as technical artifacts (sample contamination, reads of non-biological origin, etc.) at the early stages of bioinformatic analysis. Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.
An alignment-free supervised metagenomic classification method. The intrinsic correlation of oligonucleotides provides the feature set, which is selected dynamically using a kernel partial least squares algorithm, and the feature matrices extracted with this set are sequentially employed to train classifiers by support vector machine (SVM). The alignment-free supervised classification method DectICO can accurately classify metagenomic samples without dependence on known microbial genomes. Selecting the intrinsic correlation of oligonucleotides (ICO) dynamically offers better stability and generality compared with sequence-composition-based classification algorithms. DectICO provides new insights in metagenomic sample classification.
Automatically identifies clusters in the feature map using the already-available labelled samples (seeds). S-GSOM is an algorithm that consists of three core procedures: (1) the very small number of available or selected seeds is combined with other unlabeled samples; (2) the combined samples are presented to GSOM for training, in which the seeds are treated the same as the unlabeled data; and (3) S-GSOM performs an extra phase, the cluster identification phase, as post-processing.
A binning tool that primarily uses differential coverage to obtain high fidelity population genomes from related metagenomes. GroopM automatically bins contigs into discrete population genomes based primarily on co-varying coverage profiles across multiple related metagenomes, for example temporal or spatial series of a given ecosystem.
A taxonomic assignment program that produces accurate assignments from metagenome samples, with a precision of 80% or more even for low-ranking taxa. PPS+ is a fully automated successor of the PhyloPythiaS software. It automatically determines the most relevant taxa to be modeled and suitable training sequences directly from the input sample, which are then used to generate a sample-specific structured output SVM taxonomic classifier for the taxonomic binning of a sample. This makes it usable by researchers who lack the experience or the time to search for suitable training sequences and manually construct a taxonomic classifier that matches a particular metagenome sample. PPS+ is best suited for the analysis of large NGS metagenome samples with assembled contigs (> 1 kb) carrying marker genes, or datasets including high-quality, longer PacBio consensus reads.
Allows taxonomic profiling and abundance estimation. SLIMM utilizes coverage information of individual genomes, such as the number of reference genomes with mapping reads, the total number of reads and the average read length, to filter out those that are unlikely to be in the sample. The software is also able to identify organisms with high recall rate.
Helps users detect associations between microbiota and continuous clinical variables. LassoGLMMforMicrobiomes is a program designed to execute several tasks: screening associated clinical variables, incorporating repeated measures of individuals, and addressing the large number of species found in the microbiome. Moreover, it is able to determine which microbes are associated with continuous repeated clinical measures.
Provides assistance for clustering high-dimensional sequence data. LSH-SNN proceeds through two main steps: first hashing and indexing the data space, then creating links between sequences to form the output clusters. This software computes the l-mer distribution of each sequence. It can scale to datasets containing millions of sequences and does not require the number of output clusters to be set in advance.
Allows metagenomic binning and avoids read alignment without loss of accuracy. GATTACA is a program that clusters contigs according to their coverage profiles across a large cohort of metagenomic samples. It enables analysis of single metagenomic samples. This tool also offers functionalities for detecting publicly available metagenomic data that can be incorporated into the set of reference metagenomes.
Performs common tasks in metagenomic data analysis from raw read quality control to bin extraction and analysis. MetaWRAP provides a collection of modules, each being a standalone program addressing one aspect of whole metagenomic (WMG) data processing or analysis, including read quality control (QC), assembly, visualization, taxonomic profiling, and binning. Users can follow the intuitive workflow or use only specific functions. Its modularity gives the investigator flexibility in their analysis approach.
A two-phase algorithm for the binning of metagenomic reads without using reference genomes. Instead of directly clustering reads, the main idea of BiMeta is to provide an additional preprocessing phase in which reads potentially belonging to the same cluster are grouped and each group is represented by a so-called seed of non-overlapping reads. The idea is motivated by a careful observation of the l-mer frequency distributions on sets of non-overlapping reads extracted from microbial genomes. BiMeta achieves higher performance than state-of-the-art binning algorithms on both simulated and real metagenomic datasets. Another strength of BiMeta is that it works well with both short and long reads.
Extracts local ‘texture’ changes from nucleotide sequence data using a technique borrowed from image processing. MrGBP aims to extract local changes in numerical representations of genetic sequence data. To do so, it employs the multi-resolution local binary patterns (MLBP) method, which offers a viable alternative feature space to textual representations of sequence data. The tool can be used to capture genomic signature changes, followed by dimensionality reduction steps to visualise the data in a lower dimension.
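A one-dimensional local binary pattern can be sketched as follows: each position is compared with its neighbours within a given radius, the comparisons form a binary code, and histograms of codes at several radii are concatenated. The nucleotide-to-number mapping, the bit order and the function names are assumptions made for illustration and are not taken from MrGBP.

```python
def lbp_codes(signal, radius):
    """Return one binary-pattern code per position that has full neighbourhoods."""
    codes = []
    for i in range(radius, len(signal) - radius):
        # Left neighbours first, then right neighbours.
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= signal[i]:
                code |= 1 << bit
        codes.append(code)
    return codes


def multi_resolution_histogram(signal, radii=(1, 2)):
    """Concatenate code histograms computed at several radii."""
    features = []
    for radius in radii:
        histogram = [0] * (4 ** radius)  # 2 * radius bits per code
        for code in lbp_codes(signal, radius):
            histogram[code] += 1
        features.extend(histogram)
    return features


# Hypothetical numerical representation of nucleotides.
NUCLEOTIDE_VALUES = {"A": 1, "C": 2, "G": 3, "T": 4}
signal = [NUCLEOTIDE_VALUES[base] for base in "AGCTA"]
print(lbp_codes(signal, 1), len(multi_resolution_histogram(signal)))
```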
Computes a large collection of distances classically used in ecology to compare communities. Simka is a method able to calculate a full range of distances enabling the comparison of any number of datasets. This method uses the Multiple k-mer Count (MKC) algorithm, a strategy that counts k-mers with state-of-the-art time, memory and disk performance. It was also able to capture the essential underlying biological structure with or without the k-mer solid filter.
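As an illustration of an ecological distance applied to k-mer counts, the sketch below computes the Bray-Curtis dissimilarity between two small read sets. It counts k-mers naively in memory and uses hypothetical function names; it does not reproduce Simka's MKC algorithm or its filtering options.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count all overlapping k-mers across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2 * shared abundance / total abundance."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0
    shared = sum(
        min(counts_a[kmer], counts_b[kmer])
        for kmer in counts_a.keys() & counts_b.keys()
    )
    return 1 - 2 * shared / total


sample_a = kmer_counts(["ACGT"], k=2)
sample_b = kmer_counts(["ACGA"], k=2)
print(bray_curtis(sample_a, sample_b))
```

The value is 0 for identical k-mer spectra and 1 when the two samples share no k-mer.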
Calculates performance metrics and comparative visualizations. AMBER is an evaluation package for the comparative assessment of genome reconstructions from metagenome benchmark data sets. It facilitates the assessment of genome binning programs on benchmark metagenome data sets, both for bioinformaticians aiming to optimize data processing pipelines and for method developers. It provides results in several convenient output formats, allowing in-depth comparisons of binnings by different programs, software versions, or with varying parameter settings.
Carries out clustering on a simplified subset of contigs to maximize scaling according to metagenomic complexity from individual metagenome assemblies. Autometa is an algorithm that bins microbial genomes de novo from single shotgun metagenomes using sequence homology, coverage, and nucleotide composition to distinguish between contigs. The presence of marker genes can be used to estimate the genome completeness of bins, as well as the level of contamination, as each marker should only be detected once per bin.
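The marker-gene reasoning above (each marker should be detected once per bin) can be turned into simple estimates. The sketch below uses one common simplification: completeness is the fraction of expected markers seen at least once, and contamination is the number of extra copies divided by the number of expected markers. The function name and marker identifiers are hypothetical, and Autometa's own scoring may differ.

```python
from collections import Counter


def completeness_and_contamination(expected_markers, observed_markers):
    """Estimate bin quality from single-copy marker genes.

    expected_markers: marker identifiers expected once per genome.
    observed_markers: marker hits found in the bin (repeats allowed).
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("at least one expected marker is required")
    counts = Counter(m for m in observed_markers if m in expected)
    found = sum(1 for marker in expected if counts[marker] >= 1)
    # Every copy beyond the first suggests sequence from another genome.
    extra = sum(counts[marker] - 1 for marker in expected if counts[marker] > 1)
    return found / len(expected), extra / len(expected)


expected = ["m1", "m2", "m3", "m4"]
observed = ["m1", "m2", "m2", "m3"]
print(completeness_and_contamination(expected, observed))
```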
Analyzes large species-wide bacterial population genomic datasets. PopPUNK is a suite of algorithms based upon the estimation of pairwise distances between isolates both in terms of divergence between their shared sequences and differences in their gene content. It can divide collections into sequence clusters, genomotypes or strains, depending on the nature of the pathogen.
Solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. bwtCluster is an algorithmic framework that exploits multiple cores by a parallel traversal of the suffix-link tree of the sample. Compared with other space-efficient algorithms, bwtCluster is competitive with the state of the art in both space and time.
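The union-find structure on the set of reads can be sketched as below. The pairs of related reads are supplied directly as hypothetical input; in bwtCluster they would come from traversing the Burrows-Wheeler index, which is not shown here. Class and function names are illustrative only.

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]


def cluster_reads(num_reads, related_pairs):
    """Group read indices that are connected through related pairs."""
    forest = UnionFind(num_reads)
    for a, b in related_pairs:
        forest.union(a, b)
    clusters = {}
    for read in range(num_reads):
        clusters.setdefault(forest.find(read), []).append(read)
    return sorted(clusters.values())


print(cluster_reads(5, [(0, 1), (1, 2), (3, 4)]))
```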
A web server for the taxonomic assignment of metagenome sequences. PhyloPythiaS is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments.
Phylogenetically classifies variable-length DNA sequence fragments. PhyloPythia is a method that uses sequence composition to phylogenetically characterize sequence fragments. The software allows the phylogenetic classification of genomic fragments ≥ 1–3 kb for all taxonomic ranks considered (domain, phylum, class, order and genus). PhyloPythia can also achieve this for fragments originating from new organisms. PhyloPythia was used to analyse three metagenomes: the Sargasso Sea sample and two samples of Enhanced Biological Phosphorus Removal (EBPR) sludge used in industrial wastewater processing.
Assists users in comparing metagenomes and identifying habitat-specific sequences. HabiSign is a program that utilizes patterns of tetra-nucleotide usage in microbial genomes to bring out the differences in the composition of both diverse and related microbial communities. This tool includes features for detecting subsets of sequences that are specific to given metagenomic samples.
Allows users to obtain specified species from next-generation sequencing (NGS) short reads. MetaObtainer uses overlap information to group short reads and then uses composition information to obtain the specified species. It was compared with TOSS (another NGS read classification tool) and tested on synthetic datasets with different numbers of species, phylogenetic distances between species, abundance ratios, and sequencing error rates. The results show that the tool performs well on large-scale datasets on personal computers in acceptable time.