Offers search and clustering algorithms that can be orders of magnitude faster than BLAST. USEARCH is a sequence analysis package that combines several algorithms in a single program. It searches databases for high-scoring global hits and provides several NGS read-processing features, such as dereplication, paired-read overlapping, quality filtering, FASTQ file statistics, and chimeric sequence filtering.
A widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of data produced by next-generation sequencing technologies, a new version of CD-HIT, accelerated with a novel parallelization strategy and other techniques, has been developed to allow efficient clustering of such datasets.
A clustering method that exploits USEARCH to assign sequences to clusters. UCLUST is superior to CD-HIT. It is usually significantly faster, uses significantly less memory, can cluster at lower identities and is more sensitive. While CD-HIT often fails to identify the closest cluster, or overlooks that a match is possible (false negative), UCLUST rarely misses a match and in most cases finds the best possible match. UCLUST also enables rapid clustering of much larger numbers of sequences.
Provides access to a variety of public and in-house bioinformatics tools. The MPI Bioinformatics Toolkit integrates a curated set of useful methods for the analysis of protein sequences and structures. It offers more than 50 interconnected tools, so that the results of one tool can be forwarded to other tools. It also serves as a platform for teaching bioinformatic enquiry to students in the life sciences.
Infers the structure of a group of genetic sequences. Kpax is a method for aligning multiple sequences that uses a Markov chain Monte Carlo (MCMC) algorithm for the Bayesian approach and a genetic algorithm for MAP estimation; it has been applied to both synthetic and real datasets. This tool is able to generate files on a viral protein. It can be applied to the clustering of protein sequences.
Clusters protein structures based on their sequence and structure similarities. ModClus consists of the following steps: (i) the first chain in the list seeds the first cluster; (ii) the next chain is compared by sequence and/or structure to all chains in each of the existing clusters and either joins the first sufficiently similar cluster or seeds a new cluster. The clustering depends on user-defined thresholds for structure and sequence similarities. ModClus allows clustering either from a list of chains or from a single chain.
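The two steps described for ModClus amount to a greedy incremental clustering, which can be sketched as follows. This is a minimal illustration and not the ModClus implementation: the names `greedy_cluster` and `toy_identity` are hypothetical, the similarity measure is a toy position-wise identity on equal-length strings, and the rule that a chain joins a cluster when it is sufficiently similar to at least one member is an assumption (the description does not say whether one or all members must match).

```python
from typing import Callable, List, Sequence


def toy_identity(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of identical positions.

    Returns 0.0 for empty or unequal-length strings; a real tool would
    use an alignment-based sequence or structure score instead.
    """
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(
    chains: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Greedy incremental clustering in input order."""
    clusters: List[List[str]] = []
    for chain in chains:
        for cluster in clusters:
            # Assumption: one sufficiently similar member is enough to join.
            if any(similarity(chain, member) >= threshold for member in cluster):
                cluster.append(chain)  # join the first matching cluster
                break
        else:
            clusters.append([chain])  # no match: seed a new cluster
    return clusters


if __name__ == "__main__":
    demo = ["AAAA", "AAAT", "CCCC", "CCCG", "AATT"]
    for members in greedy_cluster(demo, toy_identity, threshold=0.75):
        print(members)
```

Because the first matching cluster wins, the result depends on the input order and on the threshold, which mirrors the dependence on user-defined thresholds noted above.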
Fits noisy kinetic data. Kfits aims to differentiate and isolate signal from outliers. It has been tested on two very different datasets, obtained by light scattering and by ThT fluorescence. It can also support other types of kinetic measurements. It is useful for labs studying kinetic processes, particularly protein aggregation.
Automates sequence searches for the discovery and clustering of ‘X-rich proteins’. KAPPA extracts and compares cysteine patterns by means of a quantitative similarity index called the K-score. It can detect proteins matching a given reference cysteine pattern by employing an ab initio sequence search function. This tool can identify any type of protein displaying a key amino acid pattern.
Filters putative homology clusters of amino acid sequences by using machine learning algorithms trained on annotated orthology clusters. OGCleaner is designed for homology cluster filtering that looks at a cluster as a whole rather than only at pairwise sequence comparisons. This tool can be used to remove low-quality putative homology clusters for higher-quality phylogenetic tree reconstruction and other bioinformatic analyses.
Detects spatially clustered substitutions in protein phylogenies. evoclust3d combines the P3D measure of spatial substitution clustering with ancestral sequence reconstruction to identify substitution clustering at specific branches of phylogenetic trees. To assess the utility of this approach, a large-scale screen of vertebrate protein families was performed to detect branch-specific substitution clustering, and the results were compared with predictions of positive selection by the branch-site test.
Efficiently clusters extremely large sequencing datasets for de novo operational taxonomic unit (OTU) picking. DACE is a scalable parallel DP-means algorithm that uses a distance-preserving random projection (LSH) method for data partitioning. DACE has been reported to outperform most state-of-the-art programs in terms of both accuracy and efficiency, and could be an ideal tool for clustering very large sequencing datasets.
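For readers unfamiliar with DP-means, the idea behind the algorithm named above can be sketched in a few lines: it behaves like k-means, except that a point whose squared distance to every existing centroid exceeds a penalty opens a new cluster, so the number of clusters is not fixed in advance. The sketch below is a serial toy version on numeric vectors and makes no attempt to reproduce DACE itself: there is no parallelism, no LSH partitioning, and no sequence distance. The names `dp_means`, `sq_dist`, and the parameter `lam` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_means(
    points: Sequence[Point], lam: float, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Serial DP-means: k-means with a penalty `lam` for opening clusters."""
    n = len(points)
    # Start from a single cluster located at the global mean.
    centroids = [[sum(col) / n for col in zip(*points)]]
    labels = [0] * n
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > lam:
                # Too far from every centroid: open a new cluster at p.
                centroids.append(list(p))
                j = len(centroids) - 1
            if labels[i] != j:
                labels[i] = j
                changed = True
        # Recompute each centroid as the mean of its current members.
        for k in range(len(centroids)):
            members = [points[i] for i in range(n) if labels[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    # Drop clusters that ended up empty and renumber the labels.
    used = sorted(set(labels))
    remap = {old: new for new, old in enumerate(used)}
    return [remap[l] for l in labels], [centroids[k] for k in used]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = dp_means(data, lam=4.0)
    print(labels, centroids)
```

A larger `lam` yields fewer, broader clusters; a smaller one yields more, tighter clusters, which is loosely analogous to choosing a similarity threshold for OTU picking.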
A clustering algorithm, based on the granular clustering (GC) paradigm, for obtaining partial protein models. The general principles of GC are as follows: primitive information granules are created from the input data elements; clustering is carried out by growing information granules; clustering is stopped when enough data condensation is achieved. GC is especially useful in applications where partial models covering different fragments of the protein sequence, possibly providing alternative conformations, are sufficient. The GC approach operates in a much larger search space, since the decoy data are initially granulated down to the level of a single residue.
Creates a graph representing the functional relevance of proteins, considering their known functional, proteomic, and transcriptional features. iGFP offers a method to determine the function of protein groups as well as the function of individual proteins by iteratively updating the grouping of proteins and the functional assignments. It operates using a single term that accounts for the coherence of gene ontology (GO) term annotations.
Clusters protein databases by aggregating near-full-length homologs. FastaHerder is an application that gathers sets of protein sequences and mines these clusters. This web app adds two clustering steps using a high threshold of sequence identity. This strategy also allows the “coclustering” of a query sequence with a preclustered database. FastaHerder is very restrictive in requiring the protein similarity to be full length.
Allows users to determine an informative partitioning of a given temporal phosphoproteomics dataset. CLUE is a knowledge-based method that relies on hypothesis testing. It uses known kinase-substrate annotations from curated phosphoproteomics databases to first estimate the optimal number of clusters within a dataset and then identify the enriched kinase(s) associated with each cluster.
Processes and prepares metagenomics, genomics, and population genomics nucleotide sequence data. VSEARCH is an alternative to the USEARCH tool. It includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching, clustering by similarity, chimera detection, dereplication, pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication.
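Two of the operations listed above, dereplication and reverse complementation, are simple enough to illustrate directly. The sketch below shows the general idea only and is not how VSEARCH or USEARCH implement them: `dereplicate` collapses exact full-length duplicates (case-insensitively, which is an assumption) and reports abundances, and `reverse_complement` handles only the four unambiguous bases, leaving any other character unchanged. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Translation table for the four unambiguous DNA bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate(seqs: Iterable[str]) -> List[Tuple[str, int]]:
    """Collapse identical sequences and count how often each occurs.

    Results are sorted by decreasing abundance, then alphabetically,
    so the output is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    reads = ["ACGT", "acgt", "TTTT", "ACGT"]
    for seq, abundance in dereplicate(reads):
        print(seq, abundance)
    print(reverse_complement("ATGC"))
```

Dereplicating before clustering or chimera detection reduces the number of sequences that the slower steps have to process.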
Identifies sequences from databases that share motifs similar to a query active site profile. DASP3 is a modification of previously published software, Deacon Active Site Profiler (DASP). DASP3 is significantly more efficient and versatile than DASP, a requirement for the iterative processes used to cluster proteins into functionally relevant groups. It produces better separation between true positives and false positives and shows improved ability to accurately and efficiently cluster the Peroxiredoxin (Prx) superfamily into functionally relevant groups using two recently developed iterative processes. As an automated algorithm, DASP3 identifies functional groups better than previous versions of the software and rivals expert manual curation in the Structure-Function Linkage Database (SFLD).
A web server for automatic classification of protein sequences. It uses an alignment-free approach to compute local similarities among sequences. This method is particularly useful for comparing multi-domain protein sequences, which are difficult to align. It can handle multiple sequences of varied lengths, producing clusters with high functional and domain-architecture similarity.