Cluster analysis software tools | Population genetics data analysis
A series of methods in population genetics use multilocus genotype data to assign individuals membership in latent clusters. These methods belong to a broad class of mixed-membership models, which also includes latent Dirichlet allocation, a model used to analyze text corpora. Inference from mixed-membership models can produce different output matrices when repeatedly applied to the same inputs, and the number of latent clusters is a parameter that is often varied in the analysis pipeline. For these reasons, quantifying, visualizing, and annotating the output of mixed-membership models are bottlenecks for investigators in disciplines ranging from ecology to text data mining.
A program that deals with label switching and multimodality problems in population-genetic cluster analyses. CLUMPP permutes the clusters output by independent runs of clustering programs such as structure, so that they match up as closely as possible. The user can choose among three algorithms for aligning replicates, which trade speed against closeness to the optimal alignment.
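The label-switching problem described above can be illustrated with a minimal sketch. It is not CLUMPP's code or one of its three algorithms: it simply tries every permutation of cluster columns for a single pair of runs and keeps the one with the smallest squared difference, which is only practical for small K. The function name `align_clusters` and the toy matrices are assumptions made for illustration.

```python
from itertools import permutations


def align_clusters(reference, run):
    """Exhaustively find the column permutation of `run` closest to `reference`.

    Both arguments are membership matrices given as lists of rows
    (one row per individual, one column per cluster). Hypothetical helper.
    """
    k = len(reference[0])
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        # Sum of squared differences after placing run column perm[c] at position c.
        cost = sum(
            (ref_row[c] - row[perm[c]]) ** 2
            for ref_row, row in zip(reference, run)
            for c in range(k)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    aligned = [[row[j] for j in best_perm] for row in run]
    return best_perm, aligned


# Toy data: `run` holds the same solution as `reference` with relabeled clusters.
reference = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
run = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]
perm, aligned = align_clusters(reference, run)
print(perm)
print(aligned)
```

Exhaustive search costs K! evaluations per pair of runs, which is why faster approximate strategies become attractive as K grows.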
Automates the post-processing of results of model-based population structure analyses. For analyzing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, uses a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software CLUMPP. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in CLUMPP and simplifying the comparison of clustering results across different K values. Clumpak also includes additional features, such as implementations of methods for choosing K and for comparing solutions obtained by different programs, models, or data subsets. Together, these features simplify the use of model-based analyses of population structure in population genetics and molecular ecology.
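The idea of separating runs into modes and averaging each mode into a consensus can be sketched as follows. This is a simplified stand-in, not Clumpak's method: it replaces Markov clustering with threshold-based connected components, assumes the runs have already been aligned so that cluster columns correspond, and uses a simple similarity score (one minus half the mean per-individual L1 distance) chosen for illustration. All function names and the threshold are assumptions.

```python
def similarity(q_a, q_b):
    """Return 1.0 for identical membership matrices, lower values otherwise.

    Illustrative score, not the one computed by CLUMPP.
    """
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(q_a, q_b)
        for a, b in zip(row_a, row_b)
    )
    return 1.0 - total / (2.0 * len(q_a))


def group_runs(runs, threshold=0.9):
    """Group run indices whose similarity reaches `threshold` (union-find)."""
    parent = list(range(len(runs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            if similarity(runs[i], runs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(runs)):
        groups.setdefault(find(i), []).append(i)
    # Largest group first, so the major mode leads the list.
    return sorted(groups.values(), key=len, reverse=True)


def consensus(runs, members):
    """Average the membership matrices of the runs listed in `members`."""
    n_individuals, k = len(runs[0]), len(runs[0][0])
    return [
        [sum(runs[m][i][c] for m in members) / len(members) for c in range(k)]
        for i in range(n_individuals)
    ]


# Toy data: runs 0 and 1 agree closely, run 2 is a different solution.
runs = [
    [[0.90, 0.10], [0.20, 0.80]],
    [[0.88, 0.12], [0.22, 0.78]],
    [[0.50, 0.50], [0.50, 0.50]],
]
modes = group_runs(runs)
major = consensus(runs, modes[0])
print(modes)
print(major)
```

Connected components are sensitive to the chosen threshold and can chain loosely related runs together, which is one reason a dedicated graph clustering method may be preferable in practice.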
Provides a general method for visualizing estimated membership coefficients. Subpopulations are represented as colors, and individuals are depicted as bars partitioned into colored segments that correspond to membership coefficients in the subgroups. Distruct can also be used to display subpopulation assignment probabilities when individuals are assumed to have ancestry in only one group. Various options enable the user to control the left-to-right printing order of populations, the bottom-to-top printing order of clusters, the colors, and other graphical details.
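The bar representation described above can be sketched in plain text, with one character symbol standing in for each cluster color. This is an illustrative approximation and not Distruct's output format; the function name `render_bar`, the symbols, and the bar width are assumptions.

```python
def render_bar(memberships, symbols, width=20):
    """Render one individual's membership coefficients as a text bar.

    Segment ends are computed from the cumulative sum, so rounding in one
    segment cannot change the total bar length. Hypothetical helper.
    """
    parts = []
    cumulative = 0.0
    drawn = 0
    for coefficient, symbol in zip(memberships, symbols):
        cumulative += coefficient
        end = round(cumulative * width)
        parts.append(symbol * (end - drawn))
        drawn = end
    return "".join(parts)


# Toy data: three individuals with membership in three clusters A, B, and C.
individuals = [
    [0.5, 0.3, 0.2],
    [1.0, 0.0, 0.0],
    [0.1, 0.1, 0.8],
]
bars = [render_bar(q, "ABC", width=10) for q in individuals]
for bar in bars:
    print(bar)
```

The order of `symbols` fixes the stacking order of clusters within each bar, and the order of `individuals` fixes the order of the bars, which mirrors the two ordering controls mentioned above.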
A freely available software package for post-processing output from clustering inference using population genetic data. pong combines a network-graphical approach for analyzing and visualizing membership in latent clusters with an interactive D3.js-based visualization. Its authors report that pong runs more than an order of magnitude faster than existing solutions while providing a user-friendly, interactive visualization of population structure that is more accurate than those produced by other tools. pong is thus intended to improve the scale and accuracy of population structure analyses based on multilocus genotype data.
Allows users to cluster genomic and transcriptomic data with no limitation on genome size or phylogenetic distance. Cnidaria aims to distinguish different types of biological specimens. For instance, it has been used to cluster more than 160 genomic and transcriptomic datasets.
Assists users with model selection in model-based clustering of mixed data with missing values. VarSelLCM was developed for biological problems such as the clustering of cytological diagnoses or human population genomics. It can analyze continuous, categorical, integer, or mixed data. It also includes a Shiny application that simplifies interpretation of the results.
Finds the mean of a distribution of partitions. mean_partition is based on an adapted, dynamic version of the Hungarian algorithm, and it relies on Monte Carlo techniques and the dynamic matrix inverse. The tool computes and investigates a consensus for a large set of partitions, even when the number of elements, the number of clusters, or both are high.