Focuses on variant discovery and genotyping. GATK is a toolkit, developed at the Broad Institute, composed of several tools and able to support projects of any size. The application bundles an assortment of command-line tools for analyzing high-throughput sequencing (HTS) data in various formats such as SAM, BAM, CRAM or VCF. The website includes extensive documentation to guide users.
Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between these formats. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It supports the creation and release of software in an open source spirit and integrates tools for sequence analysis into a seamless whole. It is free of charge and open source.
A software suite for the comparison, manipulation and annotation of genomic features in browser extensible data (BED) and general feature format (GFF) formats. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions about large genomic datasets.
Identifies and corrects errors in sequencing reads by using k-mer coverage. Quake differentiates k-mers trusted to be in the genome from k-mers that are untrustworthy artifacts of sequencing errors. The software exploits read quality values and determines the types of errors by estimating nucleotide-to-nucleotide error rates. It can be deployed on large datasets containing billions of reads, and it corrects a read when a set of corrections makes all of its k-mers trusted.
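The trusted/untrusted split described above can be illustrated with a minimal Python sketch. It is not Quake's implementation: it assumes plain, unweighted k-mer counts and a fixed, hand-picked coverage cutoff, whereas Quake also uses read quality values. The function names `count_kmers` and `split_trusted` are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_trusted(counts, cutoff):
    """Split k-mers into trusted (coverage >= cutoff) and untrusted sets.

    The fixed cutoff is an assumption made for illustration only.
    """
    trusted = {kmer for kmer, n in counts.items() if n >= cutoff}
    untrusted = set(counts) - trusted
    return trusted, untrusted

# Toy data: the third read ends with a substitution (T -> A).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
trusted, untrusted = split_trusted(counts, cutoff=2)
print(sorted(untrusted))  # ['ACGA']
```

In this toy input only the k-mer that spans the erroneous base falls below the cutoff, which is the signal an error corrector would act on.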
Simplifies variant annotation and filtering. Bystro is able to handle sequencing experiments on the scale of thousands of whole-genome samples and tens of millions of variants online in a web browser. It integrates a search engine for filtering variants and samples from these experiments, and it enables real-time (sub-second), nuanced variant filtering, both across all samples and per sample, using simple phrases and interactive, web-based filters. It helps users find alleles of interest in any sequencing experiment.
Annotates and filters variant files. VarAFT allows the comparison of several individuals and the collection of relevant information about the variations. It includes a coverage analysis module to easily visualize regions that are poorly covered through tables and dynamic charts. With VarAFT, users can annotate variant (VCF) files, combine multiple samples from various individuals, and prioritize lists of variants using multiple filtering parameters. Additionally, users can perform a coverage analysis and quality check from any BAM file.
Permits users to parse, analyze and manipulate VCF files. VCFtools is a software package composed of two modules: the first is a general API that allows various operations to be performed on VCF files, including format validation, merging, comparing, intersecting, making complements and basic overall statistics; the second module analyzes single-nucleotide polymorphism (SNP) data in VCF format, helping researchers estimate allele frequencies, levels of linkage disequilibrium and various quality control (QC) metrics.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command-line scripts. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. It also works with next-generation sequencing (NGS) data.
A statistical framework for calling SNPs, discovering somatic mutations, inferring population genetic parameters and performing association tests directly based on sequencing data. BCFtools can manipulate variant calls in the variant call format (VCF) and its binary counterpart BCF. It can also discover somatic and germline mutations with appropriate input data, efficiently estimate site allele frequency, allele frequency spectrum and linkage disequilibrium, and test Hardy–Weinberg equilibrium and association.
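As an illustration of what a Hardy–Weinberg equilibrium test measures, the following Python sketch computes the textbook chi-square statistic from genotype counts at a single biallelic site. This is a generic method chosen for clarity and is not necessarily the test BCFtools implements; the function name `hwe_chi_square` is hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (p, chi2) for genotype counts AA, AB and BB.

    p is the frequency of allele A; chi2 compares the observed genotype
    counts with those expected under Hardy-Weinberg proportions.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation to avoid division by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, chi2

# 100 samples with a deficit of heterozygotes relative to expectation.
p, chi2 = hwe_chi_square(30, 40, 30)
print(p, chi2)  # 0.5 4.0
```

With one degree of freedom, a statistic of 4.0 lies just above the conventional 5% critical value of about 3.84.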
A software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. The BamTools C++ API/library has been successfully integrated into a variety of applications. It provides the BAM file support for several utilities in the BEDtools suite.
A suite of software tools for manipulating data common to next-generation sequencing experiments, such as FASTQ, BED and BAM format files. With modules that operate from FASTQ pre-processing through BAM post-processing and RPKM calculations, NGSUtils complements existing tools and provides unique functionality that helps each step of an NGS data analysis pipeline. NGSUtils covers different aspects of NGS data analysis, including pre-processing, post-processing, filtering, format conversion and final result calculations. NGSUtils provides a stable and modular platform for data management and analysis.
Permits quality control of next-generation sequencing (NGS) tumor-normal experiments. NGS-Bits is organized into four steps: (1) gather information from raw reads, (2) map reads, (3) extract variant lists, and (4) combine results from the preceding steps and add quality control (QC) metrics for tumor-normal experiments. This tool includes all stages of single-sample NGS data analysis and adds special QC metrics for DNA sequencing of tumor-normal pairs.
Analyzes or annotates VCF files and organizes tools that perform diverse analyses using VCF files. VCF-kit adds essential utilities to process and analyze VCF files, including primer generation for variant validation, dendrogram production, genotype imputation from sequence data in linkage studies, and additional tools. It can be used to produce a phylogenetic tree from a VCF. The tool centralizes a collection of tools and scripts using variant call format.
Offers an assortment of tools suited for sequence analysis. Japsa is an open source package that gathers more than 20 tools, including a Java library and an API. The application provides a wide range of functionalities that allow users to split multi-sequence files, to perform real-time identification of antibiotic resistance genes with Oxford Nanopore sequencing, and to normalize the branch lengths of a phylogeny.
Provides the ability to filter variants based upon variant annotation. cyvcf is a high-performance library that provides researchers with an intuitive Python interface for manipulating VCF files. It lets users interrogate the details of each sample’s genotype information and rapidly compute both variant- and sample-level statistics. It also offers full programmatic flexibility with minimal performance penalties, owing to its careful design.
A C++ read filtering and profiling tool for use with BAM, CRAM and SAM sequencing files. VariantBam provides a flexible framework for extracting sequencing reads or read-pairs that satisfy combinations of rules, defined by any number of genomic intervals or variant sites. It implements filters based on alignment data, sequence motifs, regional coverage and base quality. VariantBam enables efficient storage of sequencing data while preserving the most relevant information for downstream analysis. It is easy to compile and run, and is extensively documented with a number of use cases and examples.
Allows users to reformat and filter bioinformatics files. JVARKIT aims to simplify the grammar used to filter bioinformatics files, making it possible to write a loop or a custom function. JVARKIT is a set of more than 100 Java-based tools for bioinformatics.
Provides a library written in the Nim programming language, which offers a simple, scripting-like syntax. Nim is garbage-collected, compiles to C and has a syntax similar to Python. hts-nim can be useful for parsing genomics data files.
Filters spurious variants caused by mouse reads in patient-derived xenografts (PDXs) and caused by paralogous sequences in primary tumors. Mapexr is an R package that implements MAPEX (the Mouse And Paralog EXterminator), a BLASTN-based algorithm for filtering variants. This algorithm is designed to fit into a standard tumor variant calling pipeline and flag variants which may arise from mis-alignment of mouse reads or from paralogous sequences. The software can be a useful component for many tumor variant-calling pipelines.
Manages large genomic variation data derived from next-generation sequencing (NGS) analyses or high-throughput genotyping. Gigwa is a web application that lets users filter data in real time, based on variant features and individuals’ genotypes. It also supplies the means to export filtered data in several popular formats, thus facilitating connectivity with many existing visualization engines.
A collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencing machines usually produce FASTA or FASTQ files containing multiple short-read sequences (possibly with quality information). The main processing of such FASTA/FASTQ files is mapping (i.e. aligning) the sequences to reference genomes or other databases using specialized programs. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others.
Allows users to analyze, filter, annotate or transform biological sequence data. FAST is able to perform automated sampling, permutations and bootstrapping of sequences and sites and to compute population genetic statistics. It can help empower non-biologist programmers to develop and communicate bioinformatics workflows for scientific investigations and publishing.
Analyzes raw sequencing data from several next-generation sequencing (NGS) platforms. MutAid is a pipeline whose steps include (i) quality control and filtering, (ii) mapping reads to a reference genome, (iii) variant detection, effect prediction and cross-referencing, and (iv) production of a summary of all information generated. It can be used to interpret mutational variants from various data generated by targeted gene-panel sequencing or whole-genome sequencing.
Improves the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample.
Allows users to filter, convert and combine multiple data files produced by high-throughput technologies. HTDP aims to aid global, real-time processing of large data sets through a graphical user interface (GUI). The software provides unlimited filtering and data reduction capabilities, and can also use itemized filtering conditions from external files. It can be used for conversion between different standard formats that are commonly used for high-throughput data.
Filters candidate variants according to the given criteria. FMFilter can handle compound heterozygous and de novo models properly. It offers options to filter by genotype quality, read depth, gene name, mutation type and custom annotated population frequency. This tool can identify compound heterozygous mutations. It provides an alternative for the analysis of next-generation sequencing data collected in Mendelian disease research.
Allows isolated testing of the effect of low complexity reads removal. RepeatSoaker is a post-alignment filtering tool that filters out reads overlapping with a user provided template file which contains genomic coordinates of low complexity regions. This application was designed to be aligner-independent. It processes the aligned data in BAM format, removes low complexity reads, and outputs a cleaned BAM file and filtering statistics.
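The core filtering idea, dropping aligned reads that overlap a set of low-complexity regions, can be sketched in Python as below. This is only an illustration on in-memory tuples, not RepeatSoaker's code: real use would read BAM records and a region file, and the half-open coordinate convention and the names `overlaps` and `filter_reads` are assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def filter_reads(reads, regions):
    """Split reads into (kept, removed) by overlap with masked regions.

    reads:   list of (name, chrom, start, end) tuples
    regions: dict mapping chrom -> list of (start, end) intervals
    """
    kept, removed = [], []
    for read in reads:
        _, chrom, start, end = read
        hit = any(overlaps(start, end, s, e) for s, e in regions.get(chrom, []))
        (removed if hit else kept).append(read)
    return kept, removed

regions = {"chr1": [(100, 200)]}
reads = [
    ("r1", "chr1", 50, 100),   # ends where the region starts: no overlap
    ("r2", "chr1", 150, 250),  # overlaps the region
    ("r3", "chr2", 150, 250),  # different chromosome
]
kept, removed = filter_reads(reads, regions)
print([r[0] for r in kept], [r[0] for r in removed])  # ['r1', 'r3'] ['r2']
```

The linear scan over regions keeps the sketch short; for genome-scale region files a sorted list or an interval tree would be the more usual choice.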
Facilitates the design, optimization, and tracking of barcoded oligonucleotides. XSTK is useful for projects that require highly multiplexed polymerase chain reaction (PCR) and DNA sequencing. It builds a list of all possible DNA sequences of a specified length and then progressively culls sequences that may interfere with primary PCR amplification and/or sequencing steps.
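The enumerate-then-cull strategy can be sketched in Python as follows. The two culling criteria used here (no long homopolymer runs, moderate GC content) and their thresholds are hypothetical stand-ins; the source does not specify which criteria XSTK applies.

```python
from itertools import product

def all_sequences(length):
    """Enumerate every DNA sequence of the given length (4**length of them)."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]

def has_homopolymer(seq, run_length):
    """True if seq contains a run of run_length identical bases."""
    return any(base * run_length in seq for base in "ACGT")

def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def cull(candidates, max_run=2, gc_min=0.25, gc_max=0.75):
    """Progressively discard candidates that fail each (assumed) criterion."""
    kept = [s for s in candidates if not has_homopolymer(s, max_run + 1)]
    kept = [s for s in kept if gc_min <= gc_fraction(s) <= gc_max]
    return kept

candidates = all_sequences(4)
barcodes = cull(candidates)
print(len(candidates), len(barcodes))
```

Because the candidate list grows as 4 to the power of the length, exhaustive enumeration of this kind is practical only for short barcodes.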
Assists users in managing, filtering, comparing and annotating genomic position (GP) files. PileLine is a flexible command-line toolbox that provides several functionalities, including (i) full standard annotation with human dbSNP, HGNC Gene Symbol and Ensembl IDs, (ii) custom annotation through standard files, (iii) generation of SIFT, Firestar and PolyPhen compatible outputs, and (iv) a genotyping quality control (QC) test for estimating performance metrics on detecting homo/heterozygote variants.
Provides a valuable alternative for the generation of a customized sequence dataset from a general feature format (GFF) file. gff2sequence is an open-source program which allows the extraction of gene features from an annotation file while controlling for several quality filters and maintaining a user-friendly graphical environment. This software works with gene annotation data and can also be used to extract generic features from any multi-FASTA nucleotide sequence.
Detects and filters misaligned reads in SAM-format files. Such filtration can reduce false positives in alignment and in the subsequent variant analysis. Cross-validation between two simulated datasets processed with SAMSVM yielded accuracies that ranged from 0.89 to 0.97 with F-scores ranging from 0.77 to 0.94 in 14 groups characterized by different mutation rates from 0.001 to 0.1, indicating that the model built using SAMSVM was accurate in misalignment detection. Application of SAMSVM to actual sequencing data resulted in filtration of misaligned reads and correction of variant calling.
Filters out false positives from a set of single nucleotide polymorphism (SNP) calls. SWEEP uses the ubiquitous false-positive SNP calls and transforms them from a weakness to a strength by using their information to pull out the true SNPs that are polymorphic between genotypes of interest. Users only need to supply sorted and indexed BAM files and the reference genome used to map sequence reads. SWEEP is also applicable to other allopolyploid crops.
Returns a reformatted file if the input file violates the user-defined format requirements. Fasta-O-Matic can be employed as a general pre-processing tool in bioinformatics workflows. It consists of a quality control script useful for a variety of downstream bioinformatics tools. This tool represents a sanity check for bioinformatics core facilities that tend to repeat common analysis steps on FASTA files received from disparate sources.
Performs quality control, reformatting, filtering, and trimming of FASTQ formatted sequence datasets. fastQ_brew does not rely on any modules that are not currently contained within the Perl Core. It was evaluated on its execution time in trimming FASTQ data.
Produces improved results for variable-length amplicons from high-throughput amplicon sequencing (HTAS). AMPtk is a bioinformatic pipeline developed specifically to address the quality issues identified by using spike-in mock communities. It analyzes variable-length amplicon studies such as those of the fungal ITS1 or ITS2 molecular barcodes. This method provides the scientific community with a necessary tool to study fungal community diversity.
Provides several programs allowing users to perform both common and uncommon tasks with FASTQ files. fastq-tools is a toolkit that provides tools for (1) finding reads matching a regular expression, (2) counting k-mer occurrences, (3) performing local alignment against every FASTQ sequence, (4) sampling reads with or without replacement, (5) sorting FASTQ files and (6) filtering reads with identical sequences.
Provides several simple Perl scripts for high-throughput genomic and transcriptomic data. NGS-TOOLBOX can, among other things, calculate the total nucleotide composition and the sequence length distribution. It can remove identical sequences, specified adapter sequences and low-complexity sequences from a dataset. This software can also generate reverse complementary sequences and split large sequence files into smaller parts specified by different parameters.
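Three of the operations listed above (nucleotide composition, removal of identical sequences and reverse complementation) are simple enough to sketch. The example below is written in Python for illustration, whereas NGS-TOOLBOX itself consists of Perl scripts; the function names are hypothetical and do not correspond to its script names.

```python
from collections import Counter

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def nucleotide_composition(seqs):
    """Count each nucleotide across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq.upper())
    return counts

def remove_identical(seqs):
    """Drop repeated sequences while keeping first-seen order."""
    return list(dict.fromkeys(seqs))

seqs = ["AACG", "AACG", "TT"]
print(reverse_complement("AACG"))  # CGTT
print(remove_identical(seqs))      # ['AACG', 'TT']
print(nucleotide_composition(seqs)["A"])  # 4
```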
A quick and extremely permissive method to read and write VCF files. vcflib provides a variety of functions for VCF manipulation: comparison, format conversion, filtering and subsetting, annotation, samples, ordering, variant representation, genotype manipulation, interpretation and classification of variants. Piping provides a convenient method to interface with other libraries (vcf-tools, BedTools, GATK, htslib, bcftools, freebayes) which interface via VCF files, allowing the composition of an immense variety of processing functions.
Provides utility modules for bioinformatics. UBU permits users to translate from genome to transcriptome coordinates, to filter reads from a paired end SAM or BAM file, to convert a SAM/BAM file content to FASTQ, to format a single FASTQ file or to count splice junctions in a SAM or BAM file. It also outputs summary statistics per reference for a SAM/BAM file.
Allows users to interact with files associated with next-generation sequencing (NGS). qMule is composed of three modules: Aligner Compare compares two BAMs aligned from the same FASTQ and separates out reads that differ between the BAMs; BamMismatchCounts provides a tally of how many mismatches were in each read for reads that mapped full-length; and MafFilter searches for QCMG-annotated MAF files.
Produces combined genotype calls from data obtained through the GATK UnifiedGenotyper software. MultiVCFAnalyzer can filter numerous files and can be applied to various domains such as single nucleotide polymorphism (SNP) effect analyses or phylogenetic reconstruction. In addition, the program generates outputs that aim to be easy to manipulate in further analysis steps, including checking or publication.
Assists in processing FASTA files containing DNA and protein sequences. SEDA is an application that allows users to (i) filter sequences based on different criteria (including text patterns), (ii) translate nucleic acid sequences into amino acid sequences, (iii) execute Blast analyses, (iv) remove duplicated sequences, and (v) sort, merge, split or reformat files.
Provides access to high-throughput sequencing (HTS) data formats. Htsjdk does not support the latest versions of the Variant Call Format specification, for example VCFv4.3 and BCFv2.2. It can be useful for manipulating HTS data.
A little utility to expose the file format conversion in BioPython in a convenient way. Seqmagick can be used to query information about sequence files, convert between types, and modify sequence files. All functions are accessed through subcommands. Most commands support gzip (files ending in .gz) and bzip (files ending in .bz2 or .bz) compressed inputs and outputs.
Selects reads from BAM files based on a user-supplied query. qbamfilter is an application that is also incorporated into the majority of AdamaJava tools as a library to provide filtering of BAM records. Reads that match the query are written to a new BAM file, and reads that do not are dropped or optionally written to a different BAM file. In the library use case, only BAM records that pass the query string are accepted for further processing.
Provides a grammar of data manipulation. dplyr offers a consistent set of verbs that help solve the most common data manipulation challenges. It was developed to: (i) identify the most important data manipulation verbs and make them easy to use from R, (ii) provide blazing performance for in-memory data, and (iii) use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or a database.