
GATK / Genome Analysis ToolKit

Focuses on variant discovery and genotyping. GATK is a toolkit developed at the Broad Institute, composed of several tools and able to support projects of any size. The application bundles an assortment of command-line tools for analyzing high-throughput sequencing (HTS) data in various formats such as SAM, BAM, CRAM or VCF. The website includes extensive documentation to guide users.


Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between these formats. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.


Designed to process individually barcoded Restriction-site associated DNA sequencing (RADseq) data (with double cut sites) into informative single nucleotide polymorphisms (SNPs)/Indels for population-level analyses. dDocent uses data reduction techniques and other stand-alone software packages to perform quality trimming and adapter removal, de novo assembly of RAD loci, read mapping, SNP and Indel calling, and baseline data filtering. Double-digest RAD data from population pairings of three different marine fishes were used to compare dDocent with Stacks, the first generally available, widely used pipeline for analysis of RADseq data. dDocent consistently identified more SNPs shared across greater numbers of individuals and with higher levels of coverage.

GATK-Queue / Genome Analysis Toolkit-Queue

A command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs. For example, our BAM-to-VCF calling pipeline includes, among other things: local realignment around indels; emitting raw SNP calls; emitting indels; masking the SNPs at indels; annotating SNPs using chip data; labeling suspicious calls based on filters; and creating a summary report with statistics. Running these tools one by one in series may take weeks, or would require custom scripting to make use of parallel resources. With a Queue script, users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques, such as running multiple copies of the same program on different portions of the genome, to produce outputs faster.
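
The idea of running the same program on different portions of the genome in parallel, with retries for transient failures, can be sketched in a few lines of Python. This is only an illustration of the scatter-gather pattern and is not Queue's own scripting interface; `toy_caller`, `run_with_retry`, `scatter_gather` and the region tuples are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(job, region, retries=2):
    """Run one job, retrying when it raises a (transient) RuntimeError."""
    for attempt in range(retries + 1):
        try:
            return job(region)
        except RuntimeError:
            if attempt == retries:
                raise


def scatter_gather(regions, job, max_workers=4):
    """Run `job` on every region in parallel and merge results in region order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `regions` in its results.
        per_region = list(pool.map(lambda r: run_with_retry(job, r), regions))
    return [call for calls in per_region for call in calls]


def toy_caller(region):
    """Hypothetical stand-in for a variant caller: 'calls' every 50th position."""
    chrom, start, end = region
    return [(chrom, pos) for pos in range(start, end) if pos % 50 == 0]


REGIONS = [("chr1", 1, 101), ("chr1", 101, 201), ("chr2", 1, 101)]
CALLS = scatter_gather(REGIONS, toy_caller)
print(CALLS)
```

A real pipeline manager would also track dependencies between stages and dispatch jobs to a cluster scheduler; the sketch covers only the per-region parallelism and retry behaviour.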


Allows de novo genome assembly and multisample variant calling. Cortex is a modular set of multi-threaded programs for manipulating assembly graphs. The Linked de Bruijn Graph (LdBG) data structure and its associated algorithms are implemented as part of the software. It was used for two tasks where long-range information is likely to be beneficial: finding large differences from a reference and analyzing the genomic context of drug resistance genes, which was validated using a PacBio reference assembled for the sample.

MAQ / Mapping and Assembly with Quality

Builds mapping assemblies from short reads generated by next-generation sequencing machines. Maq is designed primarily for the Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Maq first aligns reads to reference sequences and then calls the consensus. At the mapping stage, Maq performs ungapped alignment. For single-end reads, Maq is able to find all hits with up to 2 or 3 mismatches, depending on a command-line option; for paired-end reads, it always finds all paired hits with one of the two reads containing up to 1 mismatch. At the assembling stage, Maq calls the consensus based on a statistical model.
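
To make "ungapped alignment with up to k mismatches" concrete, here is a minimal brute-force sketch in Python. It scans every reference position and counts mismatches directly, which is far slower than the indexing a real aligner relies on and is not Maq's implementation; the function names and the toy sequences are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ungapped_hits(read, reference, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement of `read`
    on `reference` with at most `max_mismatches` mismatches."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data: the read matches the reference at two places with one mismatch each.
print(ungapped_hits("ACGA", "ACGTACGTTAGC", max_mismatches=1))
```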


A platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments. The newest version, VarScan 2, is written in Java, so it runs on most operating systems. It can be used to detect different types of variation: 1) germline variants (SNPs and indels) in individual samples or pools of samples, 2) multi-sample variants (shared or private) in multi-sample datasets (with mpileup), 3) somatic mutations, LOH events, and germline variants in tumor-normal pairs and 4) somatic copy number alterations (CNAs) in tumor-normal exome data.


Calculates the probability that a given site is polymorphic. POLYBAYES identifies polymorphic locations by evaluating the likelihood of nucleotide heterogeneity within cross-sections of a multiple alignment. The anchored alignment, paralogue filtering and single nucleotide polymorphism (SNP) detection are accessed through a single program. The tool does not require base-perfect reference sequence to be effective and will work well with draft-quality sequences that have begun to dominate sequence production.


Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.


A variant caller and small genome assembler. The heart of DISCOVAR is a de novo genome assembler, one that is accurate enough to produce assemblies that can be used for variant calling given a reference sequence. DISCOVAR can also generate de novo assemblies for small genomes, but consider using DISCOVAR de novo instead, which can assemble genomes up to mammalian size. DISCOVAR provides a more complete inventory of an individual’s genetic variants than had been previously possible. As such, it adds to the tools that can be used to probe the genetic basis of disease. It may be particularly useful in cases where targeted or exome sequencing fails to find causal mutations.


Provides analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. Strelka is a variant calling method building upon the innovative Strelka somatic variant caller to improve upon aspects of variant calling for both germline and somatic analysis. The germline caller employs an efficient tiered haplotype model to improve accuracy and provide read-backed phasing, adaptively selecting between assembly and a faster alignment-based haplotyping approach at each variant locus. The germline caller also analyzes input sequencing data using a mixture-model indel error estimation method to improve robustness to indel noise.


Facilitates the study of high-dimensional genomic and proteomic data by offering a comprehensive set of procedures for false discovery rate (FDR) estimation. fdrtool provides readily interpretable graphical output, and can be applied to very large-scale (on the order of millions of hypotheses) multiple testing problems. It contains functions for non-parametric density estimation, for monotone regression, for computing the greatest convex minorant and the least concave majorant, for the half-normal and correlation distributions, and for computing empirical higher criticism scores and the corresponding decision threshold.
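
As a small illustration of what controlling the FDR means in practice, the Python sketch below computes Benjamini-Hochberg adjusted p-values. This is a deliberately simpler, classical procedure and not the estimation approach fdrtool itself implements; the function name and the example p-values are assumptions for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


PVALS = [0.01, 0.04, 0.03, 0.005]
ADJ = bh_adjust(PVALS)
# Hypotheses whose adjusted p-value is at most 0.05 are kept at a 5% FDR.
SIGNIFICANT = [i for i, q in enumerate(ADJ) if q <= 0.05]
print(ADJ, SIGNIFICANT)
```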

FamSeq / Family-based Sequencing program

A computational tool for calculating probability of variants in family-based sequencing data. It is still challenging to call rare variants. In family-based sequencing studies, information from all family members should be utilized to more accurately identify new germline mutations. FamSeq serves this purpose by providing the probability of an individual carrying a variant given his/her entire family’s raw measurements. FamSeq accommodates de novo mutations and can perform variant calling at chrX.


Allows read alignment as well as single nucleotide polymorphism (SNP) detection and annotation. MAQGene launches the MAQ software and assembles a customized summary of the location and specific features of sequence variants of the mutant genome compared to a wild-type reference genome. The software also provides the option to compare any input whole genome sequencing (WGS) reads to any available wild-type reference genome that has general feature format (GFF) coding exon annotation files.


A sensitive and robust approach for calling single-nucleotide variants (SNVs) from high-coverage sequencing datasets, based on a formal model for biases in sequencing error rates. LoFreq adapts automatically to sequencing run and position-specific sequencing biases and can call SNVs at a frequency lower than the average sequencing error rate in a dataset. LoFreq’s robustness, sensitivity and specificity were validated using several simulated and real datasets (viral, bacterial and human) and on two experimental platforms (Fluidigm and Sequenom).
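
One way to see how a variant can be called below the average error rate is to use per-base qualities rather than a single global rate. The Python sketch below assumes each base's Phred quality gives an independent error probability, so the number of erroneous bases in a column follows a Poisson-binomial distribution, and it computes the probability of seeing at least the observed number of non-reference bases by error alone. It is an illustration under those assumptions, not LoFreq's code; the function names and toy qualities are hypothetical.

```python
def phred_to_prob(quality):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-quality / 10)


def poisson_binomial_tail(error_probs, k):
    """P(at least k errors) when base i is wrong with probability error_probs[i]."""
    # dist[j] holds the probability of exactly j errors among the bases seen so far.
    dist = [1.0]
    for p in error_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1 - p)      # this base is correct
            new[j + 1] += mass * p        # this base is an error
        dist = new
    return 1.0 - sum(dist[:k])


# Toy column: 200 bases at Q30, of which 3 disagree with the reference.
QUALS = [30] * 200
PVALUE = poisson_binomial_tail([phred_to_prob(q) for q in QUALS], 3)
print(PVALUE)
```

A small p-value here means the non-reference bases are unlikely to be explained by sequencing error alone; a real caller would additionally correct for the number of sites tested.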


A variant detector and graphical alignment viewer for next-generation sequencing data in the SAM/BAM format, which is capable of pooling data from multiple source files. The variant detector takes advantage of SAM-specific annotations, and produces detailed output suitable for genotyping and identification of somatic mutations. The assembly viewer can display reads in the context of either a user-provided or automatically generated reference sequence, retrieve genome annotation features from a UCSC genome annotation database, display histograms of non-reference allele frequencies, and predict protein-coding changes caused by SNPs.


A method for quick and robust variant detection in low-mappability regions. We showed that whereas variant calls at individual sites can be uncertain, clusters of related sites can carry reliable information. In particular, clusters can give confidence in the presence of variants and also help to better estimate their allelic abundance. We showed that analysis of variant clusters in a human genome can reveal up to hundreds of thousands of elements that have hitherto been cumbersome and impractical to study. We also extend the thesaurus approach to enhance detection of DNA changes across matched samples. In other words, we implement a personalized filtering strategy taking thesaurus annotations into account. This contribution removes low mapping quality from the list of difficulties in the analysis of matched samples and thus enables, for the first time, the use of short-read sequencing data to describe the landscape of mutations in sequence-similar regions of the human genome. The implementation is designed to be general-purpose and extensible in order to accommodate several use cases, in particular the genomics of cancer and of familial diseases.


Extracts causative variants in familial and sporadic genetic diseases. VariantMaster implements a methodology to evaluate the status (presence or absence) of a variant in familial or case-control contexts. The software allows users to identify causative variants in familial, sporadic germline, and somatic genetic disorders, including cancers. It also allows for the search of causative variants in one or more recurrently mutated genes in a pool of unrelated individuals sharing the same phenotype.


A statistical method for both genotyping and single nucleotide polymorphism (SNP) detection using multi-sample next-generation sequencing (NGS) data. Instead of treating the multi-sample data as single-sample or pooled sequencing data, we build a statistical model that integrates information across different samples and genomic sites, in order to make the genotype call and identify SNPs at each locus for each sample. The performance of ebGenotyping is investigated via simulations and real data analysis. It is shown that our method makes fewer genotype-call errors and that, with the parameter estimates from the ECM algorithm, it attains high detection power with the false discovery rate (FDR) being well controlled.


A method for single nucleotide variant calling and genotyping of large cohorts that have been sequenced at low coverage. Reveel introduces a novel technique for leveraging linkage disequilibrium that deviates from previous Markov-based models, and which is aimed at computational efficiency as well as accuracy in capturing linkage disequilibrium patterns present in rare haplotypes. We evaluate Reveel's performance through extensive simulations as well as real data from the 1000 Genomes Project, and show that it achieves higher accuracy in low-frequency allele discovery and substantially lower computation cost than previous state-of-the-art methods.

CSI Phylogeny / Call SNPs & Infer Phylogeny

Identifies variations in whole genome sequencing (WGS) reads and conducts phylogenetic analysis of isolates. CSI Phylogeny is a webserver that calls and filters single nucleotide polymorphisms (SNPs), performs site validation and infers a phylogeny based on the concatenated alignment of the SNPs. The method was evaluated on three bacterial data sets sequenced on three different platforms (Illumina, 454, Ion Torrent) and overcomes the systematic biases caused by the sequencers.


Filters spurious variants caused by mouse reads in patient-derived xenografts (PDXs) and caused by paralogous sequences in primary tumors. Mapexr is an R package that implements MAPEX (the Mouse And Paralog EXterminator), a BLASTN-based algorithm for filtering variants. This algorithm is designed to fit into a standard tumor variant calling pipeline and flag variants which may arise from mis-alignment of mouse reads or from paralogous sequences. The software can be a useful component for many tumor variant-calling pipelines.


Presents methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. QCALL proposes two methods. In the first method, non-linkage disequilibrium analysis (NLDA), a dynamic programming algorithm was applied. In the second method, linkage disequilibrium analysis (LDA), shared haplotype structure was used to estimate posterior probabilities of SNPs and genotypes. QCALL with NLDA and LDA methods detects shared variants from multiple samples better than analyzing individual samples independently. In particular, the genotype accuracy is substantially improved.


A package for phylogenomic analyses of data collected from conserved genomic loci using targeted enrichment. PHYLUCE supports the assembly of raw read data into contigs, the identification of ultra-conserved element (UCE) contigs, parallel alignment generation, alignment trimming, and alignment data summary methods in preparation for analysis, as well as alignment and SNP calling using UCE or other types of raw-read data. As it stands, the PHYLUCE package is useful for analyzing both data collected from UCE loci and data collected from other types of loci for phylogenomic studies at the species, population, and individual levels.


Estimates allele frequency and calls variants in heterogeneous samples. RVD2 improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele fraction. It is able to use multiple cores in parallel, which can significantly improve time efficiency. The tool does not address identification of indels, structural variants (SV) or copy number variants (CNV). Those mutations typically require specific data analysis models and tests that are different from those for single nucleotide variants.


Detects single nucleotide polymorphisms (SNPs) from restriction-site associated DNA sequencing or genotyping-by-sequencing (RAD-seq/GBS) reads, or from whole genome sequencing (WGS) reads. Heap was designed to identify SNPs from short read sequences of diploid species. The software performs read filtering to obtain high-quality reads, determines each sample’s genotype at every site that passes quality filtering and then performs SNP calling by comparing the genotypes between sample and reference genome. It is applicable to SNP calling with both high and low read coverage.

CsSNP / Comparative segments SNP

A web tool based on the Blat, Blast, and Perl programs that detects comparative segments SNPs and shows detailed information about them. CsSNP contains the reference genomic sequences and coding sequences of 60 plant species, and makes it easy for users to detect SNPs. CsSNP provides a convenient way for nonprofessional users to find comparative segments SNPs in their own sequences, gives users information about and analysis of the SNPs, and displays these data in a dynamic map.


Performs genotype calling using next-generation sequencing (NGS) data from multiple unrelated individuals. SeqEM is an adaptive approach that does not require prior estimates of genotype frequencies or nucleotide read error but rather is driven by the data. The software leverages information from NGS data for multiple individuals by using the Expectation-Maximization (EM) algorithm to numerically maximize the observed data likelihood with respect to genotype frequencies and the nucleotide-read error rate.
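
The description above can be illustrated with a simplified EM loop in Python. The sketch assumes a single biallelic site, one (total reads, variant reads) pair per individual, and a symmetric read error rate; it alternates between computing genotype posteriors and re-estimating genotype frequencies and the error rate. It is a toy model written for this page, not SeqEM's implementation, and `em_genotypes` and the example counts are hypothetical.

```python
import math


def em_genotypes(counts, n_iter=50, err=0.01):
    """counts: list of (n_reads, n_variant_reads), one pair per individual.
    Returns (genotype calls as variant-allele counts, genotype freqs, error rate)."""
    freqs = [1 / 3, 1 / 3, 1 / 3]  # P(genotype has 0, 1, 2 variant alleles)
    post = []
    for _ in range(n_iter):
        # E-step: posterior over genotypes for each individual.
        p_var = (err, 0.5, 1.0 - err)  # P(read shows variant | genotype)
        post = []
        for n, k in counts:
            logs = [math.log(max(f, 1e-12)) + k * math.log(p) + (n - k) * math.log(1 - p)
                    for f, p in zip(freqs, p_var)]
            top = max(logs)
            weights = [math.exp(x - top) for x in logs]
            total = sum(weights)
            post.append([w / total for w in weights])
        # M-step: genotype frequencies are the mean posteriors.
        freqs = [sum(p[g] for p in post) / len(post) for g in range(3)]
        # M-step: error rate is the expected share of mismatching reads in homozygotes.
        mismatched = sum(p[0] * k + p[2] * (n - k) for p, (n, k) in zip(post, counts))
        homozygous_reads = sum((p[0] + p[2]) * n for p, (n, k) in zip(post, counts))
        if homozygous_reads > 0:
            err = min(max(mismatched / homozygous_reads, 1e-6), 0.49)
    calls = [max(range(3), key=lambda g: p[g]) for p in post]
    return calls, freqs, err


COUNTS = [(20, 0), (20, 1), (20, 10), (20, 9), (20, 20), (20, 19)]
CALLS, FREQS, ERR = em_genotypes(COUNTS)
print(CALLS, FREQS, ERR)
```

Neither the genotype frequencies nor the error rate is supplied as a fixed prior; both start from arbitrary values and are re-estimated from the read counts on every iteration.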


Measures the impact of mutations in antibiotic resistance (AR) genes and their potential effect on the AR of bacterial strains. MutaNET enables statistical analysis of different regions of the genome. This tool scores the potential impact of mutations on gene expression and protein function of a given genome. It compares the mutational impact on coding regions, promoters, and transcription factor binding sites (TFBS) using refined scoring schemes. It consists of several analysis steps: a mutation calling pipeline, a statistical comparison of mutations, and the generation of the underlying gene regulatory network (GRN).


A fast and easy desktop GUI tool for the identification of genomic variants from pooled sequencing and individual sequencing data. Using SNVerGUI, users can perform sophisticated variant detection by simply configuring several parameters in a friendly graphical user interface. Compared with other methods for variant calling, our approach is unique in that it is applicable to both individual and pooled sequencing data. SNVerGUI supports commonly used input and output file formats, which allows it to be seamlessly integrated into common NGS data analysis pipelines.