Microsatellite identification software tools | Whole-genome sequencing data analysis
Short tandem repeats (STRs), or microsatellites, are a class of repeat sequences in which repeat units of up to 6 bp lie directly adjacent to each other. STRs are generally more polymorphic than other kinds of variation, such as sequence copy number variation and single-nucleotide polymorphisms. The length variability of STRs is associated with phenotypic variation in many species and with a number of disorders, which are commonly caused by repeat expansion. Analysing the variation of STRs, and particularly long STRs, is an important step towards understanding their variability across individuals and the mechanisms that lead to their instability.
Allows genotyping, haplotyping, and phasing of short tandem repeats (STRs) from whole genome sequencing (WGS) data. HipSTR uses several inference techniques that integrate additional information about the haplotype on which the STR resides. The software enables specific detection of de novo STR variants. It is scalable, suited to the analysis of large-scale sequencing data, and a reliable tool for genotyping STRs from Illumina sequencing data.
To characterize the mutational spectrum of somatic SVs in cancer, it is important to identify both simple (e.g., deletion, insertion, and inversion) and complex SVs at base-pair resolution. Meerkat predicts both germline and somatic SVs directly from short read data, focusing on complex events.
Permits simple sequence repeat (SSR) discovery and comprehensive statistical analysis, especially for large genomes. GMATo can be used for microsatellite sequence identification from any given DNA sequences or genomes of any size.
Finds all perfect simple sequence repeats (SSRs) in a given sequence. SSRIT provides a web app and a standalone version. This searching routine can be used to identify SSRs in different types of genomic DNA sequences, varying in size from several hundred nucleotides (BAC-end reads) up to 1 Mb of long contigs assembled from fully sequenced Bacterial Artificial Chromosome (BAC) and P1-derived Artificial Chromosome (PAC). It needs a sequence in FASTA format.
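As an illustration of what a search for perfect SSRs involves, the following is a minimal Python sketch based on a back-referencing regular expression. It is not SSRIT's own code; the function name `find_perfect_ssrs` and the `max_unit` and `min_repeats` thresholds are assumptions chosen for the example.

```python
import re

def find_perfect_ssrs(seq, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each perfect SSR (0-based start).

    Hypothetical helper for illustration; thresholds are assumptions.
    """
    seq = seq.upper()
    # Group 1 is the repeat unit. The lazy quantifier prefers the shortest
    # unit, and the back-reference demands min_repeats - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        repeat_count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeat_count))
    return hits

print(find_perfect_ssrs("TTGACACACACAGG"))
```

Because the scan runs left to right, the repeat in this toy sequence is reported with the unit as it first appears (`AC` starting at position 3) rather than in a canonical form.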
Performs short tandem repeat (STR) profiling in whole-genome sequencing data sets. lobSTR is an algorithm that consists of three steps: it (1) scans genomic libraries, flags informative reads that fully encompass STR loci, and characterizes their STR sequence, (2) uses a divide-and-conquer strategy that anchors the nonrepetitive flanking regions of STR reads to the genome to reveal the STR position and length, and finally (3) allelotypes the STRs.
Allows users to model each variable number tandem repeat (VNTR), count repeat units, and detect sequence variation. For any target VNTR in a donor, adVNTR reports an estimate of repeat unit (RU) counts and point mutations within the RUs. It trains Hidden Markov Models (HMMs) for each target VNTR locus, which provides the following advantages: (1) matching any portions of the unique flanking regions for read alignment; (2) separating homopolymer runs from other indels, which helps with frameshift detection; and (3) modeling each VNTR individually.
Detects perfect microsatellites and compound microsatellites in nucleotide sequences. MISA can predict perfect compound microsatellites that contain multiple occurrences of more than one simple sequence motif. This software is based on two Perl scripts that serve as interface modules for the program-to-program data interchange used to design primers flanking the microsatellite loci. It can exploit the NCBI database to find sequences by defining the corresponding accession numbers as input.
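The notion of a compound microsatellite can be sketched as a grouping step over previously detected SSRs. The Python example below is only an illustration and does not reproduce MISA's Perl implementation; the function name `group_compound`, the tuple layout, and the `max_gap` parameter are assumptions made for the sketch.

```python
def group_compound(ssrs, max_gap=0):
    """Group neighbouring SSRs into compound microsatellites.

    ssrs: list of (start, end, unit) tuples with half-open coordinates.
    max_gap: largest number of bases allowed between two neighbours
             (an assumed parameter; 0 means directly adjacent).
    Only groups holding more than one distinct motif are returned.
    """
    def is_compound(group):
        return len({unit for _, _, unit in group}) > 1

    groups, current = [], []
    for ssr in sorted(ssrs):
        # Extend the current group while the gap to the previous SSR is small.
        if current and ssr[0] - current[-1][1] <= max_gap:
            current.append(ssr)
        else:
            if is_compound(current):
                groups.append(current)
            current = [ssr]
    if is_compound(current):
        groups.append(current)
    return groups

hits = [(0, 12, "AC"), (12, 24, "AG"), (100, 112, "T")]
print(group_compound(hits))
```

In this toy input the `AC` and `AG` arrays touch each other and form one compound microsatellite, while the isolated `T` run is left out.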
An approach that uses a 'kmer' strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SVs, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants.
Determines genotypes for microsatellite repeats in high-throughput sequencing data. RepeatSeq is based on Bayesian model selection that is built on an empirically derived error model including sequence and read properties. This enables the assignment of the most probable genotype and takes into account the reference length of the repeat, the repeat unit size and the average base quality of the mapped reads.
Identifies short tandem repeat (STR) variations in paired-end next generation sequencing (NGS) data. STRViper exploits Bayesian inference to detect STR variation by leveraging diverging fragment sizes and identifying their cause. This approach only requires the STR of interest to be contained in one or more fragments, so it can be used on STRs that are longer than the reads generated. It also predicts polymorphic repeats across a population of genomes.
Assists users in discovering short tandem repeats (STRs) through a scan of short read shotgun sequence data. BaitSTR consists of a set of programs that (i) collapse reads carrying STRs into a set of candidate loci and (ii) extend these candidates through a local assembly process to characterize the flanking sequence. Each step is executed using a separate module, for marker discovery and development.
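The fragment-size signal can be illustrated with a simple mean-shift calculation. The Python sketch below is not STRViper's Bayesian model; it is a plain difference of means under the assumption that read pairs spanning an expanded repeat map closer together on the reference than the library average. The function name and parameters are hypothetical.

```python
from math import sqrt
from statistics import mean

def estimate_length_change(spanning_insert_sizes, library_mean, library_sd):
    """Estimate the STR length change (bp) from pairs that span the repeat.

    Returns (delta, z): delta > 0 suggests an expansion relative to the
    reference, delta < 0 a contraction; z is delta divided by the standard
    error of the mean under the library's insert-size spread.
    """
    n = len(spanning_insert_sizes)
    if n == 0:
        raise ValueError("no read pairs span the repeat")
    # Extra bases in the sample are absent from the reference, so mapped
    # insert sizes shrink by roughly the size of the expansion.
    delta = library_mean - mean(spanning_insert_sizes)
    z = delta / (library_sd / sqrt(n))
    return delta, z

print(estimate_length_change([470, 480, 490, 480], library_mean=500, library_sd=20))
```

Since only the fragment has to span the repeat, an estimate of this kind does not require any single read to cover the whole STR.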
Permits users to automatically discover structural variations (SVs). Tardis is a toolkit that integrates read pair, read depth, and split read (using soft clipped mappings) sequence signatures to discover several types of SV, while resolving ambiguities among different putative SVs. This application is suitable for cloud use as the memory footprint is low. It is also capable of characterizing deletions, small novel insertions, tandem duplications, inversions, and mobile element retrotransposition.
Detects microsatellite arrays, designs primers, and tags primers using an automated routine. msatcommander locates microsatellite arrays within user-selected repeat classes by matching the corresponding regular expression pattern within each DNA sequence. It uses the alphabetical, noncomplementary designation of each repeat class when searching for repeat sequences. This tool considers primer pairs only when they lie at least 10 bp from the start and stop positions of the detected array.
A method that identifies SVs and their precise breakpoints from whole-genome resequencing data. PRISM uses a split-alignment approach informed by the mapping of paired-end reads, hence enabling breakpoint identification of multiple SV types, including arbitrary-sized inversions, deletions and tandem duplications.
Permits genotyping of deletions and tandem duplications from paired-end whole genome sequencing (WGS) data. SV2 consists of a supervised support vector machine (SVM) classifier that employs read depth, discordant paired-ends, and split-reads. It integrates variant calls from multiple structural variant discovery algorithms into a unified call set with low rates of false discoveries. This tool aims to ease genotyping, likelihood estimation and analysis of structural variation (SV) association.
Allows simple sequence repeats (SSR) discovery and locus development from 454-generated raw data. HighSSR is a microsatellite prediction framework that can facilitate: the recognition of SSR motifs, the parsing of multiplex identifier (MID) tagged sequences for identification of multiplexed samples, the identification of unique SSR loci within a sample and the development of polymerase chain reaction (PCR) primers for the recovered loci. It can be applied to cluster reads made with platforms such as Illumina HiSeq 2000/2500 and Ion Torrent PGM.
Identifies inherited alleles of microsatellite loci from next-generation sequencing (NGS) data, using a discretized Gaussian mixture model combined with a rules-based approach. GenoTan is a program that also employs a homopolymer decomposition to estimate error bias toward deletion or insertion in homopolymer runs. The software was designed to detect microsatellite variants shorter than read lengths.
A method for targeted profiling of short tandem repeats (STRs) that reports a full spectrum of all observed genomic variants along with their respective abundance. TSSV can accurately profile and characterize STRs without the use of a complete reference genome, and therefore minimizes biases introduced during the alignment and downstream analysis. TSSV scans sequencing data for reads that fully or partially encompass loci of interest based on the detection of unique flanking sequences. Subsequently, TSSV characterizes the sequence between a pair of non-repetitive flanking regions and reports statistics on known and novel alleles for each locus of interest.
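The flank-based idea can be sketched in a few lines of Python. This is a simplified illustration rather than TSSV's implementation: it uses exact string matching for the flanks and handles only reads that contain both flanks, and the function name `profile_locus` and the toy flank sequences are assumptions.

```python
from collections import Counter

def profile_locus(reads, left_flank, right_flank):
    """Count the sequences observed between two unique flanks.

    Reads lacking either flank are skipped in this simplified sketch.
    """
    counts = Counter()
    for read in reads:
        start = read.find(left_flank)
        if start == -1:
            continue
        start += len(left_flank)
        # Search for the right flank only downstream of the left one.
        end = read.find(right_flank, start)
        if end == -1:
            continue
        counts[read[start:end]] += 1
    return counts

reads = [
    "GGTCA" + "AGAT" * 3 + "CCGTA",
    "GGTCA" + "AGAT" * 3 + "CCGTA",
    "GGTCA" + "AGAT" * 4 + "CCGTA",
    "GGTCA" + "AGAT" * 2,  # right flank missing, so the read is skipped
]
for allele, n in profile_locus(reads, "GGTCA", "CCGTA").most_common():
    print(len(allele) // 4, "repeats:", n)
```

No reference genome is involved: each allele is defined purely by what lies between the two flanks, and the counter gives its abundance.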
Studies microsatellite variation within all individuals of a population simultaneously. PopSTR starts by determining a set of informative reads for each marker/individual pair and computing various attributes for the reads. It allows users to specify a minimum number of flanking bases needed on each side of the repeat. This tool can map read pairs to a graph reference, which allows both reads of a pair to be aligned, including the read containing the microsatellite.
Extracts perfect, imperfect and compound SSRs/microsatellites from DNA sequences. IMEx is based on a simple algorithm that scans the entire DNA sequence and reports the microsatellites in a single run. This software is able to produce data such as nucleotide composition, coding/non-coding information, number of iterations, imperfection and more about the microsatellites, along with the alignments.
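A minimum-flank requirement of this kind amounts to a simple coordinate check. The Python sketch below shows one way to express it; it is not taken from PopSTR, and the function name, the half-open coordinate convention, and the default of 10 bases are assumptions.

```python
def has_enough_flank(read_start, read_end, repeat_start, repeat_end, min_flank=10):
    """Return True if the read covers the repeat plus min_flank bases per side.

    All coordinates are half-open reference positions (an assumed convention).
    """
    left_flank = repeat_start - read_start
    right_flank = read_end - repeat_end
    return left_flank >= min_flank and right_flank >= min_flank

print(has_enough_flank(100, 200, 120, 160))  # 20 bp left, 40 bp right
print(has_enough_flank(100, 200, 105, 160))  # only 5 bp on the left
```

Raising the threshold makes the repeat boundaries more trustworthy at the cost of discarding more reads.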