

A pipeline for constructing operational taxonomic units (OTUs) de novo from next-generation reads that achieves high accuracy in biological sequence recovery and improves richness estimates on mock communities. UPARSE works by quality-filtering reads, trimming them to a fixed length, optionally discarding singleton reads and then clustering the remaining reads. UPARSE reports OTU sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods. The improved accuracy results in far fewer OTUs, consistently closer to the expected number of species in a community.
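
The preprocessing steps described above (quality filtering, trimming to a fixed length, optional singleton removal) can be sketched roughly as follows. This is an illustration only, not UPARSE's implementation: the mean-quality criterion, the default thresholds and the function name `preprocess` are assumptions made for the example, and the final clustering step is left out.

```python
from collections import Counter


def preprocess(reads, min_mean_quality=20.0, trim_length=5, drop_singletons=True):
    """Filter, trim and dereplicate reads (hypothetical helper).

    reads: iterable of (sequence, quality_scores) pairs.
    Returns a list of (sequence, abundance), most abundant first.
    """
    counts = Counter()
    for seq, quals in reads:
        # Quality filter: the mean-quality rule is an assumed stand-in.
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            continue
        # Reads shorter than the fixed length cannot be trimmed to it.
        if len(seq) < trim_length:
            continue
        counts[seq[:trim_length]] += 1
    # Optionally discard sequences observed only once (singletons).
    kept = [(s, n) for s, n in counts.items() if n > 1 or not drop_singletons]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

The abundance-sorted output is the kind of input a downstream clustering step would typically consume.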

dbOTU / Distribution-based OTU calling

Provides an algorithm to inform the creation of OTUs for large next-generation sequencing studies using the distribution of 16S rRNA sequences. dbOTU has been implemented in three different ways. The third version is based on the Levenshtein edit distance: to improve efficiency, it approximates sequence dissimilarity by the number of single-position insertions, deletions, or substitutions required to turn one sequence into another.
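
For readers unfamiliar with the Levenshtein edit distance mentioned above, a minimal sketch of the standard dynamic-programming computation is given below. It illustrates the distance itself, not dbOTU's code; the function name is chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-position insertions, deletions or
    substitutions needed to turn sequence a into sequence b."""
    # Keep the shorter sequence in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For example, `levenshtein("ACGT", "AGT")` is 1, because a single deletion converts one sequence into the other.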


A clustering method that exploits USEARCH to assign sequences to clusters. UCLUST is superior to CD-HIT. It is usually significantly faster, uses significantly less memory, can cluster at lower identities and is more sensitive. While CD-HIT often fails to identify the closest cluster, or overlooks that a match is possible (false negative), UCLUST rarely misses a match and in most cases finds the best possible match. UCLUST also enables rapid clustering of much larger numbers of sequences.

CROP / Clustering 16S rRNA for OTU Prediction

Provides a clustering tool that automatically determines the best clustering result for 16S rRNA sequences at different phylogenetic levels. Our study shows that CROP gives accurate clustering results, both in terms of the number of clusters and their abundance levels, for various types of 16S rRNA datasets. In contrast, the standard hierarchical clustering strategy, even with the preclustering process and the average linkage method, still frequently overestimates the number of operational taxonomic units (OTUs) in the presence of sequencing errors, resulting in an underestimation of the abundance level of the underlying OTUs. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it produces more accurate results than conventional hierarchical clustering methods.

RDP Classifier / Ribosomal Database Project Classifier

Provides rapid taxonomic placement and summary data based on rRNA sequence data. For each high-throughput experiment, the RDP Classifier can include the number of input sequences belonging to each taxon. For query sequences from regions of bacterial diversity with less-defined taxonomy, the RDP Classifier tends to provide classification results with low confidence estimates. It can also be adapted to additional phylogenetically coherent bacterial taxonomies.


An algorithm for hierarchical clustering analysis of massive sequence data. To avoid confusion, we note that ESPRIT-Tree is not a program for determining phylogenetic trees, but rather for producing hierarchical clusters of sequences based on sequence similarity, using a tree-like data structure. We extended the concept of space partition used by previous methods to handle sequence data of varying lengths. By assuming that sequence data lives in a pseudometric space, we created a distance-based partition of the data without explicitly defining an inner-product operator to divide the space, and organized the partition results in a pseudometric-based partition tree. By repeatedly applying the triangle inequality, a fast closest-pair searching algorithm was developed within the ESPRIT-Tree framework. An efficient method for dynamic insertion and deletion of tree nodes was also developed.
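
The role of the triangle inequality in speeding up nearest-neighbour searches can be illustrated with a small sketch. It is not the ESPRIT-Tree data structure: it uses a single pivot rather than a partition tree, takes Hamming distance on equal-length sequences as a stand-in metric, and all names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_with_pivot(query, sequences, pivot, pivot_distances, dist=hamming):
    """Find the sequence closest to query, skipping hopeless candidates.

    pivot_distances maps each sequence to its precomputed distance from
    the pivot. The triangle inequality gives
        dist(query, s) >= |dist(query, pivot) - dist(pivot, s)|,
    so a candidate whose lower bound cannot beat the current best is
    skipped without computing its distance.
    Returns (best_sequence, best_distance, number_of_distance_calls).
    """
    d_query_pivot = dist(query, pivot)
    calls = 1
    best_seq, best_dist = None, float("inf")
    for seq in sequences:
        lower_bound = abs(d_query_pivot - pivot_distances[seq])
        if lower_bound >= best_dist:
            continue  # cannot be strictly closer than the current best
        d = dist(query, seq)
        calls += 1
        if d < best_dist:
            best_seq, best_dist = seq, d
    return best_seq, best_dist, calls
```

The pruning is exact: the result matches a brute-force scan, but fewer distance computations are needed when many candidates lie far from the query.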


A fast clustering tool specifically designed for clustering highly-similar DNA sequences. Given a set of sequences and a sequence similarity threshold, DNACLUST creates clusters whose radius is guaranteed not to exceed the specified threshold. Underlying DNACLUST is a greedy clustering strategy that owes its performance to novel sequence alignment and k-mer based filtering algorithms. DNACLUST can also produce multiple sequence alignments for every cluster, allowing users to manually inspect clustering results, and enabling more detailed analyses of the clustered data.

CLUSTOM-CLOUD / CLUSTering 16S NGS sequences by Overlap Minimization

A distributed clustering program that can efficiently and accurately cluster 16S sequences under distributed and cloud-computing environments. CLUSTOM-CLOUD is a significant upgrade to its predecessor, CLUSTOM. The enhancements include: (i) implementation of k-mer transformation, (ii) removal of duplicate sequences (dereplication), and, importantly, (iii) the implementation of IMDG technology to store data directly in RAM rather than on the hard disks of individual nodes. CLUSTOM-CLOUD also inherits the high accuracy of its predecessor CLUSTOM, as confirmed by the comparative exercise.


Divides a set of amplicon reads into clusters. OTUCLUST is a sequence-clustering application that performs sequence dereplication and chimera removal. This method is based on a strategy in which the clusters are constructed incrementally by comparing an abundance-ordered list of input sequences against the representative set of already-chosen sequences. The procedure consists of three main steps: (i) dereplication and abundance estimation, (ii) de novo chimera removal (optional, with UCHIME) and (iii) clustering using the dereplicated sequences as centroids.
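
Steps (i) and (iii) above can be sketched as an abundance-ordered greedy clustering. This is a simplified illustration rather than OTUCLUST's code: chimera removal is omitted, identity is computed position by position on equal-length sequences (an assumption standing in for a real alignment), each sequence joins the first matching centroid, and the function names and the 0.97 default are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions (assumes equal-length sequences)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Return {centroid: [(member_sequence, abundance), ...]}."""
    # Step (i): dereplication and abundance estimation.
    abundance = Counter(reads)
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))
    # Step (iii): compare each sequence with the centroids chosen so far.
    clusters = {}
    for seq in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append((seq, abundance[seq]))
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            clusters[seq] = [(seq, abundance[seq])]
    return clusters
```

Because abundant sequences are processed first, they tend to become centroids, while rarer variants are absorbed into existing clusters.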


Solves the problems of arbitrary global clustering thresholds and centroid selection induced input-order dependency, and creates robust and more natural Operational Taxonomic Units (OTUs) than current greedy, de novo, scalable clustering algorithms. The purpose of Swarm is to provide a novel clustering algorithm that handles massive sets of amplicons. Results of traditional clustering algorithms are strongly input-order dependent, and rely on an arbitrary global clustering threshold. Swarm results are resilient to input-order changes and rely on a small local linking threshold, representing the maximum number of differences between two amplicons. Swarm forms stable, high-resolution clusters, with a high yield of biological information.
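
The idea of a local linking threshold can be illustrated with a small sketch in which a cluster grows by repeatedly adding any amplicon within d differences of a current member. This is not Swarm's implementation: abundance information is ignored, differences are counted as mismatches plus the length difference (a simplification, not a true edit distance), and the names are hypothetical.

```python
from collections import deque


def differences(a, b):
    """Mismatches over the shared length plus the length difference
    (a simplified stand-in for a proper pairwise comparison)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def link_clusters(amplicons, d=1):
    """Group amplicons connected through chains of at most d differences.

    Returns a set of frozensets, so the result does not depend on the
    order of the input.
    """
    unassigned = set(amplicons)
    clusters = set()
    for seed in sorted(set(amplicons)):
        if seed not in unassigned:
            continue
        unassigned.remove(seed)
        cluster = {seed}
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # Local linking: compare against the current member only.
            linked = {s for s in unassigned if differences(current, s) <= d}
            unassigned -= linked
            cluster |= linked
            queue.extend(linked)
        clusters.add(frozenset(cluster))
    return clusters
```

Two amplicons that differ by more than d can still share a cluster when intermediate amplicons connect them, which is what distinguishes a local threshold from a global one.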


A classification method for 16S rDNA sequence samples that uses the natural structure of microbial community data encoded by a phylogenetic tree. We showed that using the phylogenetic information leads to an improved classification accuracy compared with the state-of-the-art classification algorithms. Unlike many popular classification methods, which consider features (or operational taxonomic unit (OTU) frequencies) in isolation, our method takes advantage of the similarities between OTUs encoded by the phylogenetic tree.

LDM / Linear Decomposition Model

Allows investigation of microbial composition data. LDM links the principal components (PCs) of a pre-specified distance matrix with linear combinations of operational taxonomic units (OTUs). It is able to conduct both the global test of any association between microbial composition and arbitrary variables and OTU-specific tests in a coherent way. This tool consists of an ordination-based linear model that can account for the complex designs found in microbiome studies.


Provides a way to use the results of operational taxonomic unit (OTU)-level significance tests to identify and visualize branches in a phylogenetic tree. SigTree converts the one-sided p-values to two-sided p-values, applies the p-value correction, and converts the results back to one-sided adjusted p-values so that directional interpretation is preserved. This tool supplies a convenient interface to a reliable statistical framework allowing meaningful statements regarding significance of the response.
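
The round trip between one-sided and two-sided p-values can be sketched as follows. The conversion formulas are a common convention and are assumed here rather than taken from SigTree's source; the Bonferroni correction is used only as a simple stand-in for whichever correction is applied, and the function names are hypothetical.

```python
def one_to_two_sided(p_one):
    """Two-sided p-value from a one-sided p-value."""
    return 2.0 * min(p_one, 1.0 - p_one)


def two_to_one_sided(p_two, original_one_sided):
    """Map a two-sided value back, keeping the original direction."""
    if original_one_sided <= 0.5:
        return p_two / 2.0
    return 1.0 - p_two / 2.0


def bonferroni(p_values):
    """Bonferroni correction (stand-in for the actual correction)."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def adjust_one_sided(p_one_sided):
    """Adjust one-sided p-values while preserving their direction."""
    two_sided = [one_to_two_sided(p) for p in p_one_sided]
    adjusted = bonferroni(two_sided)
    return [two_to_one_sided(a, p) for a, p in zip(adjusted, p_one_sided)]
```

After adjustment, values near 0 still indicate evidence in one direction and values near 1 evidence in the other, so the directional reading of each test survives the correction.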

FUNGuild / Fungi fUNctional Guild

Used to taxonomically parse fungal operational taxonomic units (OTUs) by ecological guild independent of sequencing platform or analysis pipeline. FUNGuild is a two-component tool consisting of a community-annotated database and a bioinformatics script that parses fungal OTUs into guilds based on their taxonomic assignments. FUNGuild provides a way for researchers to comprehensively examine their datasets from an ecological perspective, with an accuracy and speed not feasible without bioinformatics computing power.


Filters unclassified and/or rare operational taxonomic units from 16S rRNA gene sequence libraries by screening against consensus structural models for small-subunit (SSU) rRNA. SSUnique promotes the exploration of unclassified diversity in microbiome research and enables the discovery of substantial novel taxonomic lineages through the analysis of a large variety of existing data sets. SSUnique contains visualization tools for exploring phylogenetic novelty in microbiome data, especially useful for very large data sets.


Classifies ribosomal RNA sequences in terms of their taxonomy and operational taxonomic unit (OTU) classification. MAPseq uses a reference set of full-length ribosomal RNA sequences for which taxonomies are known, and for which a set of high-quality OTU clusters has been previously generated. It provides sequence read mapping against hierarchically clustered and annotated reference sequences. This tool can be applied to individual samples, but it can also be used to analyze very large and diverse sequence collections.


Enables quantitative visualizations, statistical testing, multivariate analysis, supervised learning, factor analysis, multivariable regression, network analysis and diversity estimates. Calypso is an easy-to-use online software suite that allows non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets. It has a focus on multivariate statistical approaches that can identify complex environment-microbiome associations. Comprehensive help pages, tutorials and videos are provided via a wiki page.

microPITA / microbiomes: Picking Interesting Taxonomic Abundance

Picks interesting taxonomic abundance. microPITA is a computational tool enabling sample selection in two-stage (tiered) studies. Using two-stage designs can more efficiently allocate resources, reducing study costs, and maximizing the use of samples. A selection of samples can be performed to target various microbial communities including: (i) samples with the most diverse community (maximum diversity), (ii) samples dominated by specific microbes (targeted feature), (iii) samples with microbial communities representative of the survey (representative dissimilarity) or (iv) samples with the most extreme microbial communities in the survey (most dissimilar).
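
Criterion (i), maximum diversity, can be illustrated by ranking samples on a diversity index and keeping the top ones. The sketch below is not microPITA's code: the choice of the Shannon index is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def shannon(counts):
    """Shannon diversity index of a vector of taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def select_most_diverse(samples, n):
    """samples: dict mapping sample id to a list of taxon counts.

    Returns the ids of the n most diverse samples, most diverse first;
    ties are broken alphabetically by sample id.
    """
    ranked = sorted(samples, key=lambda s: (-shannon(samples[s]), s))
    return ranked[:n]
```

In a two-stage design, the selected sample ids would then be passed on to the more expensive second-tier assay.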


Infers Operational Taxonomic Units (OTUs) from massive 16S rRNA sequences with high accuracy and low computational complexity. DBH is a clustering method that consists of two distinct elements: (i) based on the DataBase (DB) graph theory, a seed selection strategy is introduced to reduce the read errors and (ii) a greedy heuristic clustering procedure is employed to decrease the computational burden, avoiding the large memory required for storing seeds and/or distance matrix. This method can also efficiently handle large-scale datasets.

NINJA-OPS / NINJA Is Not Just Another - OTU Picking Solution

Takes advantage of Burrows-Wheeler (BW) alignment, using as the BW input an artificial reference chromosome composed of concatenated reference sequences, the “concatesome”. NINJA-OPS also allows for convenient quality control of data, such as fast reverse complementing, base pair trimming, and a specialized denoising transformation. This method can transform an entire MiSeq run into a QIIME-formatted BIOM table in under 10 minutes on a laptop, achieving higher accuracy and more exact matches than USEARCH. It implements several pre-filtering methods that elicit substantial speedup when coupled with existing tools.
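
The concatesome idea, together with reverse complementing, can be sketched in a few lines. This is an illustration only: exact substring search with `str.find` stands in for Burrows-Wheeler alignment, and the function names and data layout are assumptions, not NINJA-OPS internals.

```python
from bisect import bisect_right

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_concatesome(references):
    """Concatenate reference sequences into one string.

    references: dict mapping reference id to sequence.
    Returns (concatesome, start_offsets, ids).
    """
    ids, starts, parts = [], [], []
    position = 0
    for ref_id, seq in references.items():
        ids.append(ref_id)
        starts.append(position)
        parts.append(seq)
        position += len(seq)
    return "".join(parts), starts, ids


def locate(read, concatesome, starts, ids):
    """Id of the reference containing an exact match of read, or None.

    Matches that straddle the boundary between two references are
    rejected, since they do not correspond to any single reference.
    """
    if not read:
        return None
    pos = concatesome.find(read)
    while pos != -1:
        i = bisect_right(starts, pos) - 1
        end = starts[i + 1] if i + 1 < len(starts) else len(concatesome)
        if pos + len(read) <= end:
            return ids[i]
        pos = concatesome.find(read, pos + 1)
    return None
```

Searching one long string and translating the hit position back through the offset table is what lets a single index serve every reference sequence at once.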

MTV-LMM / Microbial community Temporal Variability Linear Mixed Model

Helps quantify the dependency of operational taxonomic units (OTUs) on past occurrences. MTV-LMM is an algorithm developed to detect temporal dependencies. It also has the ability to quantify the dynamics’ consistency across individuals. This method can serve both as a feature selection method (selecting only the OTUs affected by time) and as a prediction model. It can ultimately estimate the strength of ecological interactions within the microbial community.

PhyloToAST / Phylogenetic Tools for Analysis of Species-level Taxa

Distributes BLAST-based OTU picking across computing clusters. PhyloToAST provides several improved/new visualization methods, tools for filtering and sub-setting results files, simple name lookup for OTU IDs, and finally, exposes the API used to build all of these tools for interested developers. In addition, PhyloToAST enables easy reproducibility by displaying the azimuth and angle during the interactive 3D plotting mode, and allowing users to input those values in later sessions.


Performs parallel hierarchical clustering of sequences. ESPRIT-Forest is an algorithm with a cluster version. The software inherits the pipeline of ESPRIT and ESPRIT-Tree, which performs pre-processing, hierarchical clustering and statistical analysis. The algorithm organizes sequences into a pseudometric-based partitioning tree for sub-linear time searching of nearest neighbors, and then uses a new multiple-pair merging criterion to construct clusters in parallel using multiple threads.


Provides pre-defined workflows for metagenomic data analysis and disease prediction modeling based on the Galaxy platform. MetaDP is an automated software for 16S rRNA sequencing data analysis, including data quality control, operational taxonomic unit clustering, diversity analysis, and disease risk prediction modeling. It provides predefined workflows and can be used without registration. It begins with a straightforward process whereby a user uploads sequencing data and the analysis mainly includes three parts: data pre-processing, traditional metagenomic data analysis, and disease prediction.

MICCA / MICrobial Community Analysis

Provides accurate results while reaching a good compromise between modularity and usability. MICCA is a software pipeline for the processing of amplicon metagenomic datasets that efficiently combines quality filtering, clustering of Operational Taxonomic Units (OTUs), taxonomy assignment and phylogenetic tree inference. It provides estimates of the number of OTUs and of other common ecological indices that are more accurate and robust than those of currently available pipelines. Analysis of public metagenomic datasets shows that the higher consistency of results improves understanding of the structure of environmental and human-associated microbial communities.

16S Classifier

A Random Forest-based tool developed to carry out fast, efficient and accurate taxonomic classification of 16S rRNA sequences. 16S Classifier has the unique ability to classify small hypervariable regions of 16S rRNA. It displayed precision values of up to 0.91 on training datasets and up to 0.98 on the test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level.