A widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques has been developed to allow efficient clustering of such datasets.
Offers search and clustering algorithms that can be orders of magnitude faster than BLAST. USEARCH is sequence analysis software that combines different algorithms into a single package. It searches a database for top global hits and provides several NGS read-processing features such as dereplication, paired-read overlapping, quality filtering, FASTQ file statistics and chimeric sequence filtering.
Provides assistance for long read clustering. CARNAC-LR is an implementation of a clustering algorithm in a pipeline. The process has two main steps: it first searches for an upper bound on the number of clusters k, then refines the boundaries of each disjoint community in order to fulfil the partition condition. It can also generate partitions and select community founding nodes.
An efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance.
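The hashing idea behind a spaced seed can be illustrated with a minimal sketch. This is a generic spaced-seed bucketing example, not SEED's block spaced seeds or its virtual-center logic; the function names `spaced_key` and `bucket_by_seed` and the mask string are hypothetical, and reads are assumed to be as long as the mask.

```python
from collections import defaultdict


def spaced_key(read: str, mask: str) -> str:
    """Keep only the bases at positions where the mask has '1'.

    Positions marked '0' are ignored, so two reads that differ only
    there produce the same key.
    """
    return "".join(base for base, m in zip(read, mask) if m == "1")


def bucket_by_seed(reads, mask):
    """Group reads into a hash table keyed by their spaced-seed key."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[spaced_key(read, mask)].append(read)
    return dict(buckets)


if __name__ == "__main__":
    reads = ["ACGTAC", "ACTTAC", "GGGTAC"]
    # The first two reads differ only at the masked-out third position.
    print(bucket_by_seed(reads, "110111"))
```

Reads that land in the same bucket are only candidates; a real tool would still verify them against its mismatch and overhang limits.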
An exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision.
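The matching-then-merging idea can be sketched with a brute-force version. This is not starcode's poucet search: it compares all pairs with a plain dynamic-programming Levenshtein distance and merges matched pairs with union-find (single linkage), which is an assumption made for illustration. The names `levenshtein` and `cluster_by_distance` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cluster_by_distance(seqs, max_dist):
    """Merge every pair within max_dist into the same cluster (union-find)."""
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if levenshtein(seqs[i], seqs[j]) <= max_dist:
            parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    barcodes = ["ACGT", "ACGA", "TTTT", "TTTA", "GGGGGG"]
    print(cluster_by_distance(barcodes, max_dist=1))
```

The all-pairs loop is quadratic in the number of sequences, which is exactly the cost a trie-based search is meant to avoid on large barcode sets.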
Provides an ultra-fast and memory-efficient solution to clustering and assembling short reads produced by RAD-seq. First, Rainbow clusters reads using a spaced seed method. Then, it applies a heterozygote-calling-like strategy to divide potential groups into haplotypes in a top-down manner. Finally, along a guided tree, it iteratively merges sibling leaves in a bottom-up manner if they are similar enough.
Enumerates all similar pairs from a string pool in terms of edit distance. SlideSort is a method based on a pattern growth algorithm that can effectively narrow down the search by finding chains of common k-mers. A main application of the software is hierarchical sequence clustering, which can be used, for instance, in correcting errors in short reads and preprocessing for metagenome mapping. It was evaluated on large datasets of short reads.
Aligns two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. The tool exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5’-end processing and 3’-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain.
Automatically combines transcriptomes from different sources, such as assembly and annotation, into a compact and unified reference. necklace is applicable to any species with an incomplete reference genome. It aligns and counts reads in preparation for testing for differential gene expression and differential transcript usage. This tool provides the following steps: initial assembly, clustering transcripts into gene groupings, reassembly to build the superTranscriptome, and finally alignment and counting of mapped reads in preparation for differential expression testing.
A collection of command line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencing machines usually produce FASTA or FASTQ files containing multiple short-read sequences (possibly with quality information). The main processing of such FASTA/FASTQ files is mapping (also known as aligning) the sequences to reference genomes or other databases using specialized programs. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others.
A toolkit for processing and analysing RAD sequencing data. The tools are designed to process de novo RAD data, that is, data from species without a reference genome. RADtools integrates RADpools for separating raw Illumina reads into separate pools, RADtags for clustering the reads for each pool into candidate RAD tags for that pool, RADmarkers for clustering tags across all pools into candidate loci with alleles, and RADMIDs for designing a set of MIDs for use in RAD adapters. These RAD methods have great potential for creating genomic scaffolds to assist in genome assembly and for identifying thousands of sequence variants to aid in detection of major as well as minor quantitative traits.
Investigates sequences to generate gene-oriented clusters of expressed sequence tags (ESTs) or full-length (FL) cDNAs. EasyCluster helps users estimate the effects of adding or removing specific ESTs, allows graphical browsing of the created clusters and can also be used to identify splicing isoforms. This application is aimed at users who are not accustomed to command-line software.
Discovers SNPs and characterizes plant germplasm. GBS-SNP-CROP adopts a clustering strategy to build a population-tailored “Mock Reference” from the same GBS data used for downstream SNP calling and genotyping. It may be used to augment the results of alternative analyses, whether or not a reference is available. The tool may complement other reference-based pipelines by extracting more information per sequencing dollar spent. Even when a reference is available, GBS-SNP-CROP may be useful, as it can detect large numbers of additional high-quality SNPs missed by the tag-based and read length-restricted approach of TASSEL-GBS.
Analyses sequences. Afcluster allows assembly to be performed with reduced resources and a minimal loss of quality. It supports soft expectation maximization (EM) clustering, in which each sequence is assigned to each cluster only with some probability. This method gives some estimate of the clustering accuracy without the overhead of consensus clustering. The ability to assign the same sequence to several clusters simultaneously is also useful when splitting a sample before performing assembly.
Consists of a clustering algorithm suitable for small sample sizes and high-dimensional datasets. 2D-EM consists of two steps: it (i) reformats a feature vector into matrix form and (ii) conducts the clustering. The software can perform clustering in a high-dimensional space (relative to the number of samples) by effectively incorporating data distribution information via its covariance matrix. It avoids the singularity issue by folding a feature vector into a feature matrix.
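Soft assignment can be illustrated with a short sketch. It turns a sequence's distances to each cluster into membership probabilities using a softmax over negative distances; this is only one possible weighting and is an assumption here, not Afcluster's actual EM update. The function name `soft_assign` and the `temperature` parameter are hypothetical.

```python
import math


def soft_assign(distances, temperature=1.0):
    """Convert per-cluster distances into membership probabilities.

    Smaller distances give larger probabilities; the result sums to 1.
    """
    weights = [math.exp(-d / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # One sequence, three candidate clusters at increasing distance.
    print(soft_assign([0.5, 1.0, 4.0]))
```

A sequence whose probabilities are nearly equal for two clusters could be written to both when a sample is split before assembly.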
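The folding step (i) can be pictured with a minimal sketch. It uses a plain row-major reshape, which is an assumption; the entry does not say how 2D-EM orders or selects features before folding. The function name `fold` and the `n_cols` parameter are hypothetical.

```python
def fold(vector, n_cols):
    """Reshape a flat feature vector into rows of n_cols values each."""
    if n_cols <= 0 or len(vector) % n_cols != 0:
        raise ValueError("vector length must be a positive multiple of n_cols")
    return [list(vector[i:i + n_cols]) for i in range(0, len(vector), n_cols)]


if __name__ == "__main__":
    features = [0.1, 0.4, 0.3, 0.9, 0.7, 0.2]
    for row in fold(features, n_cols=3):
        print(row)
```

After folding, a covariance matrix estimated over the columns has only `n_cols` by `n_cols` entries, far smaller than one over the full feature vector.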
Consists of a divisive hierarchical maximum likelihood clustering method. DRAGON is a top-down procedure which does not find pairs, but instead takes out one sample at a time, maximally increasing the likelihood function. The software can correctly cluster data with distinct topologies. It was validated by performing analyses using synthetic and biological data.
Refines the read clustering using information on shared splice sites. EasyCluster2 can resolve potential mapping errors at exon-exon junctions using a dynamic programming approach in the regions surrounding splice sites. This second version of EasyCluster allows researchers to manage genome-scale transcriptome data and produce reliable gene-oriented clusters from more than 450 reads.
Enables deconvolution of mouse and human sequence reads from xenograft sequence data. XenofilteR offers a solution to the problem of intermingled murine host and human cells in tumor xenografts. The software calculates the edit distance for each read that maps to both the human and mouse reference genomes. It can be applied to both DNA and RNA sequencing.
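How an edit-distance comparison could separate the two species is sketched below. The decision rule (assign the read to the genome with the lower edit distance, and call ties ambiguous) is an assumption for illustration, and the function name `classify_read` is hypothetical; it is not XenofilteR's interface.

```python
from typing import Optional


def classify_read(human_edits: Optional[int], mouse_edits: Optional[int]) -> str:
    """Assign a read to the species whose alignment has fewer edits.

    None means the read did not map to that reference genome.
    """
    if human_edits is None and mouse_edits is None:
        return "unmapped"
    if mouse_edits is None:
        return "human"
    if human_edits is None:
        return "mouse"
    if human_edits < mouse_edits:
        return "human"
    if mouse_edits < human_edits:
        return "mouse"
    return "ambiguous"


if __name__ == "__main__":
    # (edit distance to human, edit distance to mouse) for three reads
    for pair in [(0, 4), (5, 1), (2, 2)]:
        print(pair, classify_read(*pair))
```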
A family of alignment-free measures, called Dq-type, that incorporate quality value information and k-mers counts for the comparison of reads data. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. All these measures are implemented in a software tool called QCluster.
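One way to combine quality values with k-mer counts is sketched below. Each k-mer occurrence is weighted by the probability that all of its bases are correct, derived from Phred scores. This weighting is an assumption made for illustration and is not presented as the exact Dq-type definition; the function names are hypothetical.

```python
from collections import defaultdict


def phred_to_p_correct(q: int) -> float:
    """Probability that a base call is correct, given its Phred score."""
    return 1.0 - 10 ** (-q / 10)


def quality_weighted_kmers(seq: str, quals, k: int):
    """Sum, for each k-mer, the probability that each occurrence is error-free."""
    counts = defaultdict(float)
    for i in range(len(seq) - k + 1):
        weight = 1.0
        for q in quals[i:i + k]:
            weight *= phred_to_p_correct(q)
        counts[seq[i:i + k]] += weight
    return dict(counts)


if __name__ == "__main__":
    # Low-quality bases at the end contribute less to their k-mers.
    print(quality_weighted_kmers("ACGTAC", [40, 40, 40, 40, 10, 10], k=2))
```

With uniformly high quality scores the weights approach 1, so the result approaches ordinary k-mer counts.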
Discovers tight clusters of low density. clusterdv is based on the density valleys between data points and gives more importance to distance than to gaps between clusters. It produces a hierarchical tree of “putative” cluster centres and employs an intuitive metric, the separability index (SI), to rank their importance. This tool can recognize groupings in real-world data that correspond to real natural phenomena.