Allows users to analyze T-cell receptor (TCR) repertoire sequences produced by deep sequencing. Decombinator is a program that employs a string matching algorithm to search the FASTQ files produced by high-throughput sequencing (HTS) machines for rearranged TCR sequences. The pipeline's central algorithm searches for 'tag' sequences, whose presence indicates the inclusion of particular V or J genes in a recombination.
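The tag-matching idea can be sketched as a simple substring scan over each read (a toy illustration: the tag sequences below are invented, and Decombinator's real tag set and matching algorithm are more sophisticated):

```python
# Hypothetical V-gene tags; real Decombinator tags are curated from IMGT gene sequences.
V_TAGS = {"TRBV5": "TGTGCCAGCAGC", "TRBV7": "TGCGCCAGCAGT"}

def find_tags(read, tags):
    """Return the names of all tags whose sequence occurs in the read."""
    return [name for name, seq in tags.items() if seq in read]

read = "AAATGTGCCAGCAGCTTAGG"
print(find_tags(read, V_TAGS))  # ['TRBV5']
```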
Identifies and delimits the variable (V), diversity (D) and joining (J) genes and alleles within immunoglobulin (Ig) or T-cell receptor (TR) nucleotide rearranged sequences. IMGT/V-QUEST delimits framework and complementarity determining regions, then displays a graphical 2D representation of the variable region. It can also interact with additional IMGT tools: an option on the Search page launches a more detailed analysis, and a link on the result page provides phylogenetic analysis of the variable region of the input sequence.
Allows users to analyze T-cell antigen receptor (TCR) sequencing data. MiTCR is a program permitting the study of hundreds of millions of raw high-throughput sequencing reads containing sequences encoding human or mouse α or β TCR chains. It also allows the extraction of T-cell clones from next generation sequencing (NGS) data.
Allows users to perform analysis of antibody variable domain repertoires. VDJFasta is a tool suited to mammalian repertoire sequences obtained by either Sanger or deep sequencing.
Produces high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5 is a program that automates all the steps to generate bacterial genome assemblies from raw Illumina data. It consists of five steps: (1) read cleaning; (2) contig assembly; (3) crude scaffolding; (4) misassembly correction; and (5) final scaffolding.
ABySS / Assembly By Short Sequences
Enables the assembly of a human genome, using short reads from a high-throughput sequencing platform. ABySS consists of a parallelized sequence assembler that allows parallel computation of the assembly algorithm across a network of commodity computers. This algorithm proceeds in two stages: (1) it generates all possible substrings of length k (termed k-mers) from the sequence reads; and (2) it uses mate-pair information to extend contigs by resolving ambiguities in contig overlaps.
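Stage (1), k-mer generation, can be sketched in a few lines (a toy illustration, not ABySS's distributed implementation):

```python
def kmers(read, k):
    """Enumerate all substrings of length k (k-mers) of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# A read of length L yields L - k + 1 k-mers.
print(kmers("ACGTAC", 4))  # ['ACGT', 'CGTA', 'GTAC']
```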
Provides a whole-genome shotgun assembler that can generate high-quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers. The ALLPATHS-LG assemblies are not necessarily linear, but instead are presented in the form of a graph. This graph representation retains ambiguities, such as those arising from polymorphism, uncorrected read errors, and unresolved repeats, thereby providing information that has been absent from previous genome assemblies. ALLPATHS-LG requires high sequence coverage of the genome in order to compensate for the shortness of the reads. The precise coverage required depends on the length and quality of the paired reads, but is typically on the order of 100x or above.
Provides tools and class interfaces for the assembly of DNA reads. AMOS includes modular assembly pipelines, as well as tools for overlapping, consensus generation, contigging, and assembly manipulation. The AMOS pipeline config file can be modified by users to add additional processing steps. The software includes a number of conversion utilities allowing users to process data from a variety of input sources and to output the data in commonly used assembly formats.
Identifies allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences. Celera assembler is an algorithm to produce a set of haploid consensus sequences rather than a single consensus sequence. It uses a dynamic windowing approach and detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation. Celera assembler also assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele.
Detects structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph. CloudBrush is a conservative assembler that nevertheless can generate precise contigs that avoid error propagation in downstream analysis with moderate N50 contig lengths. It can be implemented as a modular pipeline, allowing it to be easily extended as improved algorithms are developed.
Designed for Next-Generation Sequencing (NGS) technology. CongrPE is a de novo assembly algorithm based on a contig-growing mechanism for paired-end reads. The assembly proceeds in four steps: pre-processing, seed contig generation, seed contig growing, and scaffolding. The current version of CongrPE is a draft with no parameter settings other than insert size. The resulting contigs are written to the file result.fa, and the program reports the N50 statistic.
Uses the Hadoop/MapReduce distributed computing framework to enable de novo assembly of large genomes. Similar to other leading short read assemblers, Contrail relies on the graph-theoretic framework of de Bruijn graphs (DBG). However, unlike these programs, Contrail uses Hadoop to parallelize the assembly across many tens or hundreds of computers, effectively removing memory concerns and making assembly feasible for even the largest genomes. It has extensions to efficiently compute a traditional overlap-graph based assembly of large genomes within Hadoop.
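The MapReduce decomposition that Contrail exploits can be illustrated with a toy in-process map and reduce over k-mers (the function names are illustrative; Hadoop would distribute these phases across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(read, k=3):
    """Map step: emit (k-mer, 1) pairs from one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each k-mer."""
    counts = defaultdict(int)
    for kmer, c in pairs:
        counts[kmer] += c
    return dict(counts)

reads = ["ACGTA", "CGTAC"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in reads))
print(counts)  # {'ACG': 1, 'CGT': 2, 'GTA': 2, 'TAC': 1}
```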
DSP / Denovo Solid Pipeline
Optimizes the number of read-correction runs and computational resources via dynamic programming. Denovo Solid Pipeline is a semi-automated pipeline for short genome assembly using SOLiD sequencing data. This package has the advantage over the currently available pipeline of generating more contigs suitable for further assembly steps, increasing the chances of detecting sequencing errors and/or polymorphic sites.
Exploits the pairing information issued from inserts of potentially any length. Edena determines suitable overlap cutoffs according to the contextual coverage, thus reducing the need for manual parameterization. It relies on simultaneously exploiting the information provided by short and long paired-end sequence reads to discover paths in the assembly graph. The tool works for inserts of potentially any length by constraining possible paths.
Corrects errors in short reads and assembles them. The EULER-SR assembler replaces the higher-measure tree optimization of the A-Bruijn graph with the maximum branching optimization on de Bruijn graphs. It provides the contigs and the repeat graph of the assembled genome, links the contigs by repeats, and directs finishing efforts. It also improves the use of mate-pairs when they become available, as tested on 454 Life Sciences and Illumina data.
Conducts next generation sequencing (NGS) investigation. Geneious provides visual sequence alignment and editing, sequence assembly, comprehensive molecular cloning and phylogenetic analysis. It increases process efficiency and improves data organization. This tool enables the importation and conversion of a vast range of data types and offers a solution to customize researchers’ algorithms.
Assembles genomes de novo from fragments of DNA, specifically addressing the question of scalability. Gossamer is an extension of a prototype based on the succinct representation of de Bruijn assembly graphs as a bitmap or set of integers. It assembles base-space paired reads such as those from an Illumina sequencing platform. The tool operates in a series of explicit passes to give the user control of the assembly process.
Assists in assembling de novo short read sequencing data. JR-Assembler is a program that permits users to extend a read by other whole reads. It utilizes a dynamic back trimming process to avoid extension termination due to sequencing errors. The tool proceeds in five steps: raw read processing, seed selection, seed extension, repeat detection, and contig merging.
Assembles short reads of next generation sequencing (NGS) technologies at low coverage. LOCAS uses a mismatch sensitive overlap-layout-consensus approach. It assembles homologous regions in a homology-guided manner, while performing de novo assemblies of insertions and highly polymorphic target regions following an alignment-consensus approach. The tool contains an extension, SUPERLOCAS, which provides some additional features for resequencing projects.
MaSuRCA / Maryland Super-Read Celera Assembler
Allows variable read lengths while tolerating a significant level of sequencing error. MaSuRCA is a whole genome assembly software that can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads. It also combines two methods, the de Bruijn graph and Overlap-Layout-Consensus (OLC). It is able to transform large numbers of paired-end reads into a much smaller number of longer ‘super-reads’.
Consists of an assembler that relies on a traversal of a subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high-quality extensions. Meraculous incorporates a low-memory hash structure to access the de Bruijn graph, allowing a small memory footprint compared with other short-read assemblers. Moreover, this tool avoids an explicit error correction step, instead relying on base quality scores.
Assembles a human genome on a desktop computer in a day. Minia is based on a de Bruijn graph. It combines the Bloom filter, the critical false positives structure and the marking structure. The tool performs a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours. It constructs a set of contigs to determine neighbors of each node. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers.
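The Bloom filter at the heart of Minia's memory savings can be sketched as follows (a minimal pure-Python version; Minia's actual structure, including its critical-false-positives set, is more involved):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: membership tests may yield false positives,
    never false negatives."""
    def __init__(self, size=1024, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted MD5 digests.
        for i in range(self.n_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGTA", "CGTAC", "GTACG"):
    bf.add(kmer)
print("ACGTA" in bf)  # True
```

Because the bit array replaces explicit k-mer storage, memory use is fixed by the filter size rather than by the number of distinct k-mers, at the cost of occasional false positives.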
MIRA / Mimicking Intelligent Read Assembly
Consists of an iterative multiple-pass system that focuses on observed data. MIRA is a program that enables users to utilize basic algorithms for both branches of the assembly system. This tool searches for patterns on a symbolic level in an alignment to identify differences in repetitive sequences in a genome assembly. It subsequently tags the bases, allowing discrimination of repeats.
Takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. PASHA is a parallelized short read assembler using de Bruijn graphs (DBG) that is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. It employs a sorted vector data structure, instead of a hash-map, to store k-mers and their graph-related information.
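The sorted-vector idea trades the per-entry overhead of a hash map for binary-search lookups; a minimal sketch:

```python
import bisect

# k-mers kept as a sorted list rather than a hash map, saving the
# per-bucket overhead at the cost of O(log n) lookups.
kmer_table = sorted(["ACG", "CGT", "GTA", "TAC"])

def has_kmer(table, kmer):
    """Membership test via binary search on the sorted vector."""
    i = bisect.bisect_left(table, kmer)
    return i < len(table) and table[i] == kmer

print(has_kmer(kmer_table, "CGT"))  # True
print(has_kmer(kmer_table, "AAA"))  # False
```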
Integrates features for managing large datasets and produces contiguous assemblies. PE-Assembler is a method based on an extension approach. It aims to reconstruct the sample genome from a paired-end read library and can accept multiple paired-end read libraries of different insert sizes. These reads can assist in solving ambiguities that cannot be resolved using a single paired-end read library.
QSRA / Quality-value guided Short Read Assembler
Consists of a quality-value guided de novo short read assembler. QSRA is a program built upon the VCAKE algorithm. The software includes the option to output suspected repeated regions to a separate file, aiding in repeat-related analysis.
Assembles reads obtained with new sequencing technologies (Illumina, 454, SOLiD) using MPI 2.2. Ray reduces the number of contigs and the number of errors, and can serve as a basis for developing a universally applicable assembler. The tool can compute assemblies in parallel using the message passing interface. Ray performs very well on mixed datasets and helps to assemble genomes from high-throughput sequencing data.
SGA / String Graph Assembler
Assembles large genomes from high coverage short read data. SGA implements a set of assembly algorithms based on the FM-index. It corrects base calling errors in the reads, assembles contigs from the corrected reads, and uses paired-end and/or mate-pair data to build scaffolds from the contigs. This tool returns a visual report that displays the properties of the genome and the quality of the data.
Assembles short-read (25–40-mer) data with high accuracy and speed. SHARCGS exploits novel sequencing technologies by assembling sequence contigs de novo with high confidence, outperforming existing assembly algorithms in terms of speed and accuracy. Its efficiency was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli).
Addresses the problem of a higher margin of sequencing errors and other artifacts. SHORTY is targeted at de novo assembly of microreads with mate-pair information and sequencing errors, and offers features addressing the short-read assembly problem. Header file descriptions, paths, and other issues in the code have been corrected to fix errors when running the tool on sample data.
Constructs de novo draft assembly for the human-sized genomes. SOAPdenovo is specially designed to assemble Illumina GA short reads and is able to resolve longer repeat regions in contig assembly. SOAPdenovo is made up of six modules that handle read error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) reads mapping, scaffold construction, and gap closure. It was used as a basis for the MEGAHIT software.
Provides a sparse-graph approach to de novo genome assembly. SparseAssembler consistently produces results comparable to current state-of-the-art de Bruijn graph-based assemblers while demanding considerably less computer memory, on both simulated and real data. The approach can also be extended to a sparse string graph by selecting a sparse subset of the reads when constructing the overlap graph.
Provides a de novo assembler for short DNA sequence reads. SSAKE is designed to help leverage the information from short sequence reads by assembling them into contigs and scaffolds that can be used to characterize novel sequencing targets. SSAKE assembles whole reads (not k-mers) and as such, is well-suited for structural variant assembly/detection. SSAKE is written in PERL and runs on Linux. SSAKE cycles through short sequence reads stored in a hash table and progressively searches through a prefix tree for extension candidates. The algorithm assembled 25 to 300 bp (genome, transcriptome, amplicon) reads from viral, bacterial and fungal genomes. SSAKE is lightweight, simple to set up and run, and robust.
SUTTA / Scoring-and-Unfolding Trimmed Tree Assembler
Provides a de novo DNA sequence assembler based on global search-methods in order to contain the complexity of the assembly problem. The SUTTA algorithm was developed to address the following issues: developing better ways to dynamically evaluate and validate layouts, formulating the assembly problem more faithfully, devising superior and accurate algorithms, taming the complexity of the algorithms and, finally, providing a theoretical framework for further studies along with practical tools for future sequencing technologies. Because of the generality and flexibility of the scheme, SUTTA is capable of agnostically adapting to various rapidly evolving technologies. It also allows concurrent assembly and validation of multiple layouts, thus providing a flexible framework that combines short- and long-range information from different technologies. SUTTA's binaries are freely available to non-profit institutions for research and educational purposes.
Implements a read fragment assembly algorithm for de novo genome assembly of short reads generated by sequencing machines. Taipan uses a combination of greedy extension and the overlap graph method. A performance evaluation using real Illumina datasets shows that Taipan can achieve assembly qualities comparable to the graph-based approaches within a reasonable execution time. The Taipan source code is freely available for download.
Extends long paths through a series of read-overlap graphs and evaluates them with a likelihood-based framework. Telescoper uses short- and long-insert libraries in an integrated way throughout the assembly process. It produces more continuous assemblies than the other algorithms considered. The tool can be used as a finishing algorithm to extend contigs into repetitive regions and produce better assemblies for telomeres.
Assembles genetic sequences. VCAKE is able to assemble millions of small nucleotide reads even in the presence of sequencing error. It extends the seed sequence one base at a time using the most commonly represented base from these matching reads, provided that the set of reads passes certain conditions. The tool may be a step towards the assembly of larger bacterial genomes from short reads, particularly with the development of superior base calling or paired end technology.
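The base-at-a-time extension can be sketched as a vote among reads matching the seed's suffix (a simplified illustration; VCAKE's real conditions on coverage and conflicting bases are stricter):

```python
from collections import Counter

def extend_seed(seed, reads, overlap=3, min_votes=2, max_len=200):
    """Greedily extend a seed one base at a time: reads containing the seed's
    suffix vote on the next base; stop when support drops below min_votes."""
    while len(seed) < max_len:
        suffix = seed[-overlap:]
        votes = Counter()
        for read in reads:
            idx = read.find(suffix)
            while idx != -1:
                nxt = idx + overlap
                if nxt < len(read):
                    votes[read[nxt]] += 1  # base following the suffix match
                idx = read.find(suffix, idx + 1)
        if not votes or votes.most_common(1)[0][1] < min_votes:
            break
        seed += votes.most_common(1)[0][0]
    return seed

print(extend_seed("ACGT", ["ACGTA", "CGTAC", "GTACG"]))  # ACGTAC
```

Extension halts when no base reaches the vote threshold, which is where sequencing errors or low coverage would otherwise derail the contig.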
Manipulates de Bruijn graphs (DBG) for genomic sequence assembly. Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. It takes in short read sequences, removes errors, and produces high quality unique contigs, then uses paired-end read and long read information, when available, to resolve the repeated areas between contigs. Velvet represents an approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
AGORA / Assembly Guided by Optical Restriction Alignment
Uses optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided.
The software implements a greedy algorithm and uses graph theory to link and orient assembled existing contigs quickly and accurately using mate pair information.
FPSAC / Fast Phylogenetic Scaffolding of Ancient Contigs
Sequencing ancient genomes raises specific problems, among them the decay and fragmentation of ancient DNA, that make the scaffolding of ancient contigs challenging. FPSAC is a general method that combines sequencing with computational reconstruction.
Closes gaps with a preassembled contig set or a long read set (i.e., error-corrected PacBio reads). GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs, and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. An accompanying package, GMvalue, determines misassembly sites in contigs and scaffolds.
GRASS / GeneRic ASsembly Scaffolder
A generic algorithm for scaffolding next-generation sequencing assemblies.
A program for scaffolding contigs produced by fragment assemblers using mate pair data such as those generated by ABI SOLiD or Illumina Genome Analyzer.
Opera / Optimal Paired-End Read Assembler
A scalable, exact algorithm for the scaffold assembly of large, repeat-rich genomes, with consistent improvement over state-of-the-art programs for scaffold correctness and contiguity. OPERA provides a rigorous framework for scaffolding of repetitive sequences and a systematic approach for combining data from different second-generation (Illumina, Ion Torrent) and third-generation (PacBio, ONT) sequencing technologies. OPERA efficiently scaffolds large genomes with provable scaffold properties, providing an avenue for systematic augmentation and improvement of 1000s of existing draft eukaryotic genome assemblies.
The abundance of repeat elements in genomes can impede the assembly of a single sequence. The tool Scaffold_builder was designed to generate scaffolds (super contigs of sequences joined by N-bases) using the homology provided by a closely related reference sequence. Scaffold_builder is an advanced wrapper for Nucmer, written in Python, that resolves several situations that may arise when mapping contigs to the reference genome.
A stand-alone scaffolding tool for NGS data. It can be used together with virtually any genome assembler and any NGS read mapper that supports SAM format.
Predicts relative positions and orientations of the contigs, yielding a directed contig graph. SLIQ provides a set of simple linear inequalities derived from the geometry of contigs on the line. It produces a reduced subset of reliable mate pairs and thus a sparser graph which results in a simpler optimization problem for the scaffolding algorithm. The output of this scaffolder can either be used as draft scaffolds or as a reasonable starting point for refinement with more complex optimization procedures used in other scaffolders.
Assembler for mate pair/paired-end reads from high throughput sequencing platforms, e.g. Illumina and SOLiD.
A stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data. It is unique in offering the possibility to manually control the scaffolding process. By using the distance information of paired-end and/or mate-pair data, SSPACE is able to assess the order, distance and orientation of the contigs and combine them into scaffolds.
A next-generation sequencing suite of variant analysis tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in Whole Exome Capture Sequencing (WECS) data. SNPs may be called using the Atlas-SNP2 application and indels may be called using the Atlas-Indel2 application.
Performs genotype calling, genotype phasing, imputation of ungenotyped markers, and identity-by-descent segment detection. Beagle can be applied to thousands of samples across genome-wide single nucleotide polymorphism (SNP) data. It can retrieve short tracts of identity by descent (IBD). This tool utilizes composite reference haplotypes to model large genomic regions with a parsimonious statistical model.
Allows users to perform single nucleotide polymorphism (SNP) calling for both Illumina and SOLiD data. ComB aims to find different information from short color or nucleotide reads and can be used with large datasets. It includes several features: a Bayesian model assisting researchers in determining genome variants in color space; the inclusion of ambiguous reads; and sensitivity to block nucleotide polymorphisms (BNPs).
Provides a probabilistic tool for the discovery of single nucleotide variants in whole genome shotgun sequencing (WGSS) data. CoNAn-SNV integrates copy number data into a Bayesian mixture model framework by employing a reduced copy number space with six states. Its main goal is to infer single nucleotide variants (SNVs) that overlap copy number alterations. It is built on a method that models the notion that genomic regions of segmental duplication and amplification generate an extended genotype space.
A computational approach that analyzes the depth-of-coverage of high-throughput DNA sequencing reads and can integrate CNV-analysis approaches based on paired-end and breakpoint-junction analysis to infer locus copy-number genotypes. CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing.
Allows de novo genome assembly and multisample variant calling. Cortex is a modular set of multi-threaded programs for manipulating assembly graphs. Linked de Bruijn Graph (LdBG) data structure and associated algorithms are implemented as part of the software. It was used for two tasks where long-range information is likely to be beneficial: finding large differences from a reference and analysis of genomic context for drug resistance genes, which was validated using a PacBio reference assembled for the sample.
CRISP / Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms
Detects SNPs and short indels from high-throughput sequencing of pooled DNA samples. CRISP has been primarily developed to analyze data from "artificial" DNA pools, i.e. pools generated by equi-molar pooling of DNA from multiple individual samples. CRISP leverages sequence data from multiple such pools to detect both rare and common variants. Note that the method is not designed for variant detection from a single pool. CRISP was developed for targeted disease association studies in humans but may work well for other applications.
FamSeq / Family-based Sequencing program
A computational tool for calculating probability of variants in family-based sequencing data. It is still challenging to call rare variants. In family-based sequencing studies, information from all family members should be utilized to more accurately identify new germline mutations. FamSeq serves this purpose by providing the probability of an individual carrying a variant given his/her entire family’s raw measurements. FamSeq accommodates de novo mutations and can perform variant calling at chrX.
A Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment.
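The Bayesian core of such a caller, the posterior probability of each genotype given the observed reads, can be sketched for a single biallelic site (the priors and error rate below are arbitrary illustrative values, not FreeBayes's model):

```python
def genotype_posterior(ref, alt, err=0.01, priors=(0.45, 0.1, 0.45)):
    """Posterior over genotypes (hom-ref, het, hom-alt) at a biallelic site.
    Each read supports the alt allele with probability err, 0.5, or 1 - err,
    depending on the true genotype."""
    p_alt = (err, 0.5, 1 - err)
    liks = [priors[g] * p_alt[g] ** alt * (1 - p_alt[g]) ** ref for g in range(3)]
    total = sum(liks)
    return [l / total for l in liks]

# 12 reference reads and 11 alternate reads strongly favor a heterozygote.
post = genotype_posterior(ref=12, alt=11)
print(max(range(3), key=lambda g: post[g]))  # 1 (heterozygous)
```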
GAMES / Genomic Analysis of Mutations Extracted by Sequencing
A pipeline aiming to serve as an efficient middleman between data deluge and investigators. GAMES attains multiple levels of filtering and annotation, such as aligning the reads to a reference genome, performing quality control and mutational analysis, integrating results with genome annotations and sorting each mismatch/deletion according to a range of parameters. Variations are matched to known polymorphisms. The prediction of functional mutations is achieved by using different approaches. Overall GAMES enables an effective complexity reduction in large-scale DNA-sequencing projects.
Detects nucleotide polymorphism. glfMultiples is a variant caller for next-generation sequencing data (NGS). The software considers, for each possible position, a series of potential polymorphisms that include transitions and transversions from the reference base and bi-allelic polymorphisms where neither of the alleles present in the sample is the reference base. Detection of polymorphic sites takes into account the maximized likelihood but also an overall prior for each type of polymorphism.
A computer program for phasing observed genotypes and imputing missing genotypes. IMPUTE increases accuracy and combines information across multiple reference panels while remaining computationally feasible. IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, although the size of the panel constrains the improvements that can be made.
A sensitive and robust approach for calling single-nucleotide variants (SNVs) from high-coverage sequencing datasets, based on a formal model for biases in sequencing error rates. LoFreq adapts automatically to sequencing run and position-specific sequencing biases and can call SNVs at a frequency lower than the average sequencing error rate in a dataset. LoFreq’s robustness, sensitivity and specificity were validated using several simulated and real datasets (viral, bacterial and human) and on two experimental platforms (Fluidigm and Sequenom).
MACH can resolve long haplotypes or infer missing genotypes in samples of unrelated individuals. Specifically, it can estimate haplotypes; impute missing genotypes in a variety of populations, using the HapMap sample or another set of densely genotyped individuals as a reference; analyze shotgun re-sequencing data from high-throughput technologies now being developed; and carry out simple tests of association.
Serves for prioritizing candidate variants in family-based studies of inherited disease. MendelScan can perform variant scoring, linkage mapping for family exome sequencing, shared identity-by-descent (IBD) mapping and rare-heterozygote-rule-out (RHRO) mapping.
Finds medium sized (10-50bp) indels from high throughput sequencing datasets. MoDIL is able to identify small variants by using high clone coverage of short-read sequencing technologies. This software proceeds by comparing the distribution of insert sizes in the sequenced library to the distribution of the observed mapped distance at a certain genomic location.
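The underlying signal, a shift between the library's insert-size distribution and the mapped distances observed at a locus, can be illustrated with a simple mean-shift estimate (MoDIL itself compares full distributions, not just means):

```python
from statistics import mean

def indel_size_estimate(library_inserts, observed_spans):
    """Estimate indel size as the shift between expected and observed mapped
    distances. Positive values suggest a deletion (mates map farther apart
    on the reference); negative values suggest an insertion."""
    return mean(observed_spans) - mean(library_inserts)

library = [200, 210, 190, 205, 195]  # insert sizes from the sequenced library
locus = [230, 240, 226, 236]         # mapped distances spanning one locus
print(indel_size_estimate(library, locus))  # positive shift: likely a deletion
```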
NGSEP / Next Generation Sequencing Eclipse Plugin
Permits analysis of high throughput sequencing (HTS) data. NGSEP is an integrated framework whose main functionality is the variants detector, allowing researchers to make integrated discovery of single nucleotide variants (SNVs), small and large indels and regions with copy number variation (CNVs). The software also provides modules for read alignment, sorting, merging, functional annotation of variants, filtering and quality statistics.
A tool designed for efficient and accurate variant-detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data.
Calls single nucleotide polymorphisms (SNPs) and short indels for both Ion Torrent and 454 resequencing data. PyroHMMvar is a method that has two distinct features: (i) an HMM to formulate homopolymer errors and which can distinguish real signals from sequencing errors and thus improve the alignment of reads against the reference and (ii) a graph data structure that merges multiple aligned reads at a given locus into a weighted alignment graph. PyroHMMvar is also available as part of the toolkit PyroTools.
Detects and allows interactive visualization of single-nucleotide polymorphisms (SNPs). QualitySNPng combines SNP detection and genotyping with interactive visualization of the results. This software provides a graphical user interface with pre-set filter options that is configurable for specific needs. It is appropriate to use in marker SNP identification or to analyze RNA-seq data with up to several million reads per transcript to genotype a mixture of a hundred accessions.
Provides assistance for estimating allele frequencies and calling single nucleotide polymorphisms (SNPs). realSFS is built on a fast hack of the samtools software that enables sending the genotype likelihoods directly in memory to the program part of the code.
Allows integrated analysis of next-generation sequencing (NGS) data. RUbioSeq is a multi-platform application that uses well established tools to implement pipelines for DNA-seq, CNAseq, bisulfite-seq and ChIP-seq experiments. The software incorporates a graphical user interface (GUI), designed for interdisciplinary research groups where bioinformaticians and biomedical researchers work together. The modular structure permits easy adaptation and extension.
Estimates allele frequency and calls variants in heterogeneous samples. RVD2 improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele fraction. It is able to use multiple cores in parallel, which can significantly improve time efficiency. The tool does not address identification of indels, structural variants (SV) or copy number variants (CNV); those mutations typically require specific data analysis models and tests that are different from those for single nucleotide variants.
Performs genotype calling using next-generation sequencing (NGS) data from multiple unrelated individuals. SeqEM is an adaptive approach that does not require prior estimates of genotype frequencies or nucleotide read error but rather is driven by the data. The software leverages information from NGS data for multiple individuals by using the Expectation-Maximization (EM) algorithm to numerically maximize the observed data likelihood with respect to genotype frequencies and the nucleotide-read error rate.
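The EM iteration can be sketched for a single biallelic site across individuals (a simplified model with one shared error parameter; SeqEM's actual likelihood is more detailed):

```python
from math import comb

def binom_lik(k, n, p):
    """Binomial likelihood of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def seqem(counts, n_iter=50):
    """counts: list of (ref_reads, alt_reads) per individual.
    Returns (genotype_freqs, error_rate) estimated by EM.
    Genotypes are indexed RR, RA, AA."""
    freqs = [1 / 3, 1 / 3, 1 / 3]
    err = 0.01
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities per individual.
        post = []
        for r, a in counts:
            p_alt = (err, 0.5, 1 - err)
            liks = [freqs[g] * binom_lik(a, r + a, p_alt[g]) for g in range(3)]
            s = sum(liks)
            post.append([l / s for l in liks])
        # M-step: update genotype frequencies and the error rate.
        freqs = [sum(p[g] for p in post) / len(post) for g in range(3)]
        num = den = 0.0
        for (r, a), p in zip(counts, post):
            num += p[0] * a + p[2] * r      # reads mismatching a homozygous genotype
            den += (p[0] + p[2]) * (r + a)  # all reads in homozygous genotypes
        if den > 0:
            err = min(max(num / den, 1e-6), 0.5)
    return freqs, err

# Three hom-ref, one het, one hom-alt individual (one read is an error).
freqs, err = seqem([(10, 0), (9, 1), (5, 5), (0, 10), (10, 0)])
```

No prior genotype frequencies or error rate are supplied; both are re-estimated from the data each iteration, which is the point of the adaptive approach.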
Permits integrative single nucleotide polymorphism (SNP) analysis in next generation sequencing (NGS) data with large cohorts. SNPTools is a pipeline that can perform variant site discovery, genotype likelihood estimation, and genotype/haplotype inference from population NGS data. The software is composed of four modules: (1) effective base depth (EBD) calculation, (2) SNP site discovery, (3) genotype likelihood (GL) estimation and (4) genotype/haplotype imputation. It is flexible in using inputs from other GL generation and imputation engines.
A statistical tool for calling common and rare variants in analysis of pooled or individual next-generation sequencing data. SNVer reports a single overall p-value for evaluating the significance of a candidate locus being a variant, on the basis of which multiplicity control can be achieved. Loci with any (even low) coverage can be tested, and depth of coverage is quantitatively factored into the final significance calculation. SNVer runs very fast, making it feasible for analysis of whole-exome or even whole-genome sequencing data.
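As a toy illustration of per-locus significance testing (SNVer's actual model additionally handles pooled samples and error-rate uncertainty; the fixed error rate here is an assumption), a one-sided binomial test quantifies how surprising k non-reference reads out of n would be if mismatches arose from sequencing error alone:

```python
from math import comb

def snv_pvalue(k, n, err=0.01):
    # One-sided binomial test: P(X >= k) for X ~ Binomial(n, err),
    # i.e. the chance of seeing at least k non-reference reads among
    # n reads if non-reference calls come only from sequencing error.
    return sum(comb(n, i) * err**i * (1 - err)**(n - i)
               for i in range(k, n + 1))
```

Note how depth enters the calculation directly: 5 alt reads out of 10 are far more significant than 2 out of 10, even though both are "covered" loci.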
Detects somatic single nucleotide variants (SNVs) using next generation sequencing (NGS) data. SNVMix is a probabilistic approach based on a Binomial mixture model that can infer SNVs from aligned NGS data obtained from tumors. The software incorporates base and mapping qualities by using them to probabilistically weight the contribution of each nucleotide to the posterior probability of a SNV call.
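The quality-weighted posterior idea can be sketched as follows. This is a minimal stand-in, not SNVMix's actual model; the genotype priors and allele fractions are assumed values. Each read's contribution to the genotype posterior is tempered by the probability that its base call is correct:

```python
def snvmix_posterior(reads, priors=(0.8, 0.15, 0.05)):
    # reads: list of (is_alt, p_correct), where p_correct = 1 - 10**(-Q/10)
    # for base quality Q. Genotype alt-allele fractions (assumed values):
    mu = (0.001, 0.5, 0.999)                   # ref/ref, ref/alt, alt/alt
    post = []
    for g in range(3):
        lik = priors[g]
        for is_alt, p in reads:
            # Probability this read shows the alt base under genotype g,
            # mixing the true allele fraction with the chance of a miscall.
            p_alt = mu[g] * p + (1 - mu[g]) * (1 - p)
            lik *= p_alt if is_alt else (1 - p_alt)
        post.append(lik)
    z = sum(post)
    return [x / z for x in post]               # normalized genotype posterior
```

A low-quality read barely moves the posterior, while a stack of high-quality alt reads pushes it decisively toward a variant genotype.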
A method based on Bayes’ theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment, and recurring experimental errors.
Allows users to detect single nucleotide polymorphisms (SNPs). SolSNP is a variant calling tool exploiting a modified Kolmogorov-Smirnov statistic and data filtering to analyse next-generation sequencing (NGS) alignment data. This program can be used for extracting phylogenetically informative SNPs from alignment of both coding and non-coding regions.
A program to screen for sequence variants (SNPs, deletions) in sequence data generated by high-throughput-sequencing platforms.
A software tool for analyzing de novo mutations from familial and somatic tissue sequencing data. DeNovoGear uses likelihood-based error modeling to reduce the false positive rate of mutation discovery in exome analysis and fragment information to identify the parental origin of germ-line mutations.
EBCall / Empirical Bayesian mutation Calling
Detects somatic mutations. EBCall is a statistical framework for massively parallel sequencing data from cancer genomes that explicitly takes into account prior information on sequencing errors. The software is able to detect somatic mutations with allele frequencies of <10% with a high degree of accuracy, thereby identifying sub-clonal structures of cancer cells that cannot otherwise be found.
Enables detection of somatic mutations in paired sequence data. JointSNVMix implements a probabilistic graphical model to analyze sequence data from tumor/normal pairs. The software provides four analysis methods, among them the JointSNVMix1 and JointSNVMix2 classifiers. These classifiers are generative probabilistic models that describe the joint emission of the allelic count data observed at a given position in the normal and tumor samples.
Enables somatic single nucleotide variant (SNV) detection using next generation sequencing (NGS) data from matched tumor/normal samples. mutationSeq implements four standard machine learning algorithms: random forest, Bayesian additive regression tree, support vector machine and logistic regression.
Identifies somatic mutations with very low allele fractions in impure and heterogeneous cancer samples. MuTect is built on a Bayesian classifier that requires only a few supporting reads and applies tuned filters to ensure high specificity. Its sensitivity to low-allele-fraction events makes the tool particularly suited to analyzing samples with low purity or complex subclonal structure.
Polymutt / Polymorphism and Mutation discovery
Implements a likelihood-based framework for calling single nucleotide variants and detecting de novo point mutation events in families from next-generation sequencing (NGS) data. Polymutt simplifies the detection and genotyping of single nucleotide polymorphisms (SNPs). It aims to facilitate the study of families, rare variants and de novo mutation events, and to transform sequence data into accurate genotypes.
Allows heuristic calling of somatic and germline single nucleotide variants (SNVs) from next-generation sequencing (NGS) data. qSNP is a heuristics-based single nucleotide variant caller that can perform somatic mutation calling in samples with low tumor content. Its performance was assessed in samples of varying purity, generated by mixing a tumor cell line with its matched normal sample at varying proportions and sequencing each mixture.
Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting SAM/BAM/CRAM format. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
The Somatic Indel Detector can be run in two modes: single sample and paired sample. In the former mode, exactly one input bam file should be given, and indels in that sample are called. In the paired mode, the calls are made in the tumor sample, but in addition to that the differential signal is sought between the two samples (e.g. somatic indels present in tumor cell DNA but not in the normal tissue DNA). In the paired mode, the genotyper makes an initial call in the tumor sample in the same way as it would in the single sample mode; the call, however, is then compared to the normal sample.
Discovers somatic single nucleotide variations (SNVs) in tumors. SomaticSniper performs a Bayesian comparison of the genotype likelihoods in the tumor and normal samples, computing the probability that the two samples share the same genotype. The tool benefits from false positive reduction techniques, such as base quality recalibration, and provides statistical and empirical filters to increase the validation rate.
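The comparison can be sketched as below. This is a simplified illustration in the spirit of SomaticSniper's somatic score; the genotype posteriors are assumed inputs here rather than being computed from reads. The probability that tumor and normal share a genotype is summed over genotypes, then Phred-scaled so that a high score means strong evidence the genotypes differ:

```python
from math import log10

def somatic_score(post_tumor, post_normal):
    # post_*: posterior genotype probabilities over (ref/ref, ref/alt,
    # alt/alt), computed independently for each sample (assumed inputs).
    p_same = sum(t * n for t, n in zip(post_tumor, post_normal))
    # Phred-scale the probability that the genotypes are identical;
    # the floor avoids log10(0) when the posteriors barely overlap.
    return -10.0 * log10(max(p_same, 1e-10))
```

A tumor that is confidently heterozygous where the normal is confidently homozygous-reference yields a double-digit score, while identical posteriors score zero.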
Provides analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. Strelka is a variant calling method designed to improve the efficiency of variant calling for both germline and somatic analysis. It includes a germline caller that supplies read-backed phasing, an alignment-based haplotyping approach at each variant locus, and the ability to analyze input sequencing data using a mixture-model indel error estimation method.
A platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments. The newest version, VarScan 2, is written in Java, so it runs on most operating systems. It can be used to detect different types of variation: 1) germline variants (SNPs and indels) in individual samples or pools of samples, 2) multi-sample variants (shared or private) in multi-sample datasets (with mpileup), 3) somatic mutations, LOH events, and germline variants in tumor-normal pairs and 4) somatic copy number alterations (CNAs) in tumor-normal exome data.
Estimates sample composition, or the level of contamination of a disease sample, accurately and without genotyping. Virmid is a probabilistic method for single nucleotide variation (SNV) calling. This application increases genotyping accuracy, especially in somatic mutation profiling, by rigorously integrating the sample composition parameter into the SNV calling model. Its robustness makes it applicable to identifying mutations in other challenging cases.
A Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequences to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing known sequence variants or candidates from other methods to be included. The probabilistic realignment model accounts for base-calling errors, mapping errors and, importantly, increased indel sequencing error rates in long homopolymer runs.
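The read-to-haplotype scoring at the heart of such realignment can be sketched as follows. This simplification ignores mapping error and within-read gaps; candidate indels are represented directly in the haplotype sequences, so scoring reduces to gapless comparison at each offset. Each read base contributes log(1 − e) on a match and log(e/3) on a mismatch, with e derived from the Phred quality:

```python
from math import log

def read_loglik(read, hap, offset, quals):
    # Log-likelihood of observing `read` if it truly derives from `hap`
    # starting at `offset`, given per-base Phred qualities `quals`.
    ll = 0.0
    for i, base in enumerate(read):
        e = 10 ** (-quals[i] / 10)             # base-calling error probability
        ll += log(1 - e) if hap[offset + i] == base else log(e / 3)
    return ll

def best_haplotype(read, haps, quals):
    # Realign the read against each candidate haplotype (e.g. the reference
    # vs. the reference with a candidate indel applied) at every offset and
    # return the index of the best-supported haplotype.
    scores = []
    for h in haps:
        best = max(read_loglik(read, h, o, quals)
                   for o in range(len(h) - len(read) + 1))
        scores.append(best)
    return max(range(len(haps)), key=lambda j: scores[j])
```

A read spanning a true deletion aligns cleanly to the deletion haplotype but accumulates mismatch penalties against the reference, which is what lets the model separate real indels from alignment noise.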
HeurAA / Heuristic Amplicon Aligner
Consists of a data processing program designed for the needs of diagnostic re-sequencing of amplicon reads generated by high-coverage next generation sequencers. HeurAA was developed for a specific clinical application, targeted resequencing of multiple genes in multiple patients, using a combination of exact string matching and exhaustive dynamic programming algorithms. It can detect indels practically without size limits, provided that the reads contain a part of the reference sequence (at least 30 bp) sufficient for accurate identification.
Permits users to call short indels in next generation sequencing data. Indelocator is a somatic variant caller that collects various read count and alignment quality-related statistics around putative indel events.
Calls indels from next-generation paired-end sequencing data. SOAPindel performs full local de novo assembly of regions where reads map poorly because of an excess of read pairs in which only one mate maps. It first gathers all unmapped reads at their expected genomic positions, then executes a local assembly of regions with a high proportion of such reads, and finally aligns these assemblies to the reference.
SPLINTER / Short indel Prediction by Large deviation Inference and Nonlinear True frequency Estimation by Recursion
Detects and quantifies short indels and substitutions in large pools. SPLINTER achieves accurate detection and quantification of short insertions, deletions and substitutions by integrating information from a synthetic DNA library, which is used to tune the algorithm and to quantify specificity and sensitivity for every experiment.
SV-M / Structural Variant Machine
Detects indel candidates using a discriminative classifier based on features of split-read alignment profiles and trained on true and false indel candidates validated by Sanger sequencing. The key benefit of the discriminative model is that it learns to distinguish true from false candidates against a Sanger-validated ground truth, thereby reducing the false positive rate among predicted indels.
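A minimal discriminative classifier over hypothetical split-read features (here, number of supporting split reads and a normalized mapping quality, both illustrative assumptions) can show the idea; logistic regression trained by stochastic gradient descent stands in for SV-M's support vector machine:

```python
from math import exp

def train_logreg(X, y, lr=0.1, epochs=200):
    # Toy discriminative classifier (logistic regression standing in for
    # SV-M's SVM). X: candidate-indel feature vectors; y: 1 for Sanger-
    # validated true candidates, 0 for validated false ones.
    w = [0.0] * (len(X[0]) + 1)                # weights plus bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + exp(-z))              # predicted P(true candidate)
            g = p - yi                         # gradient of the log-loss
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict(w, x):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z > 0 else 0
```

Once trained on the validated ground truth, the classifier filters new candidate indels by their alignment-profile features, which is how the false positive rate among predictions is reduced.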
Detects and displays approximate tandem repeats (ATRs) in genomic sequences. ATRHUNTER is based on a flexible statistical model that accommodates a wide range of definitions of approximate tandem repeats. The software uses a two-phase algorithm consisting of a screening step followed by a candidate verification step: the screening phase generates a list of candidate regions that may contain ATRs, which are subsequently verified.