Duplicate read removal software tools | High-throughput sequencing data analysis
The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed.
Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting SAM/BAM/CRAM format. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
Builds genetic maps and conducts population genomics and phylogeography. Stacks is a software system developed to work with restriction enzyme-based data, such as RAD-seq. The software produces core population genomic summary statistics and single nucleotide polymorphism (SNP)-by-SNP statistical tests. It aims to be a key resource to empower researchers to efficiently perform ecological and evolutionary genomic studies in model organisms and particularly in organisms with minimal or no genomic resources.
A flexible and easy to use interface that programmers of many levels of experience can use to access information in the popular and common SAM/BAM format. bio-samtools 2 provides new classes for describing genomic regions and genetic variants, allows the easy addition of newly developed SAMtools features and can produce publication-quality visualizations of data with minimal effort by the coder.
Demonstrates the value of properly accounting for errors in unique molecular identifiers (UMIs). UMI-tools removes PCR duplicates and implements a number of different UMI deduplication schemes. It can extract, remove and append UMI sequences from FASTQ reads. Compared with previous methods, it is superior at estimating the true number of unique molecules. The simulations provide insight into the impact on quantification accuracy and indicate that applying an error-aware method is even more important at higher sequencing depth.
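To illustrate what an error-aware UMI method does, the following minimal sketch groups UMIs observed at a single mapping position so that a rare UMI one mismatch away from a much more abundant one is treated as a sequencing error rather than a new molecule. It is not UMI-tools code: the function names, the equal-length UMI assumption and the count threshold are assumptions chosen for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis):
    """Group UMIs seen at one mapping position into putative molecules.

    Illustrative rule (an assumption, not a library API): a UMI joins a
    group when it is one mismatch from a group member whose count is at
    least 2 * its own count - 1.
    """
    counts = Counter(umis)
    # Visit abundant UMIs first so likely true UMIs absorb their errors.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = set()
    groups = []
    for seed in ordered:
        if seed in assigned:
            continue
        assigned.add(seed)
        group = [seed]
        frontier = [seed]
        while frontier:
            parent = frontier.pop()
            for cand in ordered:
                if cand in assigned:
                    continue
                close = hamming(parent, cand) == 1
                if close and counts[parent] >= 2 * counts[cand] - 1:
                    assigned.add(cand)
                    group.append(cand)
                    frontier.append(cand)
        groups.append(group)
    return groups


if __name__ == "__main__":
    observed = ["ATAT"] * 10 + ["ATAA"] + ["GGCC"] * 5
    print(len(set(observed)), "distinct UMIs,", len(group_umis(observed)), "groups")
```

Counting distinct UMI strings would treat every sequencing error as a new molecule; grouping by edit distance and abundance is one way to avoid that overestimate.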
Permits next-generation sequencing (NGS) analysis to reconstruct ancient genomes. EAGER is able to perform several raw read pre-processing steps, including the initial analysis of raw sequencing reads using FastQC to assess the basic quality of the generated NGS data. It can be used to generate summary reports with the most important statistics including mapping and genotyping of all processed samples.
Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command-line scripts. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. It also works with next-generation sequencing (NGS) data.
Removes duplicated and near-duplicated reads with support for both single-end and paired-end datasets. ParDRe uses a bitwise approach to compare DNA strings, and employs both multithreading and the Message Passing Interface (MPI).
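The idea of comparing DNA strings bitwise can be sketched as follows: each base is packed into two bits, two reads are XORed, and the number of non-zero 2-bit pairs gives the mismatch count. This is a generic illustration, not ParDRe's implementation; the encoding table, the function names and the `max_mismatches` parameter are assumptions, and ambiguous bases such as N are not handled.

```python
# Two bits per base; this particular encoding is an assumption.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]  # KeyError on N or other symbols
    return value


def mismatches(a, b):
    """Count differing bases between two equal-length reads with bit operations."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    if not a:
        return 0
    diff = pack(a) ^ pack(b)
    # A 2-bit pair is non-zero exactly when the bases differ, so fold
    # each pair into its low bit and count the set bits.
    low_mask = int("01" * len(a), 2)
    folded = (diff | (diff >> 1)) & low_mask
    return bin(folded).count("1")


def is_near_duplicate(a, b, max_mismatches=1):
    """True when two reads have equal length and differ at few positions."""
    return len(a) == len(b) and mismatches(a, b) <= max_mismatches


if __name__ == "__main__":
    print(mismatches("ACGT", "ACGA"), is_near_duplicate("ACGT", "TCGA"))
```

In a compiled language the same comparison would typically run on fixed-width machine words with a hardware population count; Python integers are used here only to keep the sketch short.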
A package for input, quality assessment, manipulation and output of high-throughput sequencing data. ShortRead extends Bioconductor with tools useful in the initial stages of short-read DNA sequence analysis. Main functionalities include data input, quality assessment, data transformation and access to downstream analysis opportunities. It is an important gateway to use of Bioconductor for processing high-throughput DNA sequence data. ShortRead data structures allow convenient manipulation of data, such as filtering reads based on sequence characteristics.
A software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first publicly available C++ API for BAM file support and a command-line toolkit. The BamTools C++ API/library has been successfully integrated into a variety of applications. It provides the BAM file support for several utilities in the BEDtools suite.
Marks duplicates and extracts discordant and split reads from SAM files. SAMBLASTER is able to mark duplicates in a single pass over a SAM file in which all alignments for the same read-id are grouped together. The software can extract reads directly from the SAM output of an aligner, such as BWA-MEM.
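Single-pass duplicate marking on read-id-grouped SAM input can be sketched as follows. The sketch groups consecutive records by read name, builds a signature from the chromosome, position and strand of each primary alignment, and sets the SAM duplicate flag (0x400) when the signature has been seen before. It is a simplified illustration rather than SAMBLASTER's algorithm: it uses the reported POS field directly and ignores clipping, and all function names are assumptions.

```python
from itertools import groupby

DUP_FLAG = 0x400                 # SAM flag bit: PCR or optical duplicate
REVERSE_FLAG = 0x10              # SAM flag bit: read is on the reverse strand
NON_PRIMARY = 0x100 | 0x800      # secondary or supplementary alignment


def qname(line):
    """Read name of a SAM record, or None for header lines."""
    return None if line.startswith("@") else line.split("\t", 1)[0]


def mark_duplicates(sam_lines):
    """Yield SAM lines, flagging records whose signature was already seen.

    Assumes all alignments of one read-id are adjacent in the input.
    """
    seen = set()
    stripped = (line.rstrip("\n") for line in sam_lines)
    for name, group in groupby(stripped, key=qname):
        lines = list(group)
        if name is None:         # header lines pass through unchanged
            yield from lines
            continue
        fields = [line.split("\t") for line in lines]
        # Signature: sorted (chromosome, position, is_reverse) per mapped primary alignment.
        signature = tuple(sorted(
            (f[2], int(f[3]), bool(int(f[1]) & REVERSE_FLAG))
            for f in fields
            if not int(f[1]) & NON_PRIMARY and f[2] != "*"
        ))
        is_dup = bool(signature) and signature in seen
        if signature:
            seen.add(signature)
        for f in fields:
            if is_dup:
                f[1] = str(int(f[1]) | DUP_FLAG)
            yield "\t".join(f)
```

Because only the set of signatures is kept in memory, the input never needs to be coordinate-sorted, which is what makes it possible to process aligner output as a stream.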
Provides a tool for reading and analyzing long mate pair (LMP) libraries. NextClip consists of two main components: the first processes mate pair files, produces summary statistics and prepares reads for use in scaffolding. The second is a pipeline component for use when a partially complete assembly or a close reference is available. The software can also create a report that gives an overview of library quality and a simple separation of reads suitable for scaffolding.
Identifies and groups identical and near-identical reads. Fulcrum is a read collapser that returns a single consensus sequence. The software aims to simplify the problem of comparing N reads in a dataset to every other read in the set. It was designed to speed up de novo sequencing and assembly efforts in which an N × N comparison of reads is necessary, and can also be used as a first step in read mapping for polymorphism detection.
Estimates the amount of sampling-induced read duplication to evaluate whether a dataset is amenable to de-duplication and to amend overcorrection. DupRecover is a maximum likelihood (ML) estimator suited to sampling-induced read duplication in deep sequencing experiments. This quantitative method facilitates accurate estimation of variant allele fraction and copy number variation (CNV).
Identifies duplicate reads based on the flowgram. The distance calculation in JATAC is a more robust way of finding duplicates, as it first identifies read pairs with different homopolymer lengths at low distances. This behaviour closely models the 454 sequencing chemistry, where substitution errors are less common than indels. JATAC’s improved duplicate identification comes at a computational price, and its speed depends on the number of reads and the degree of duplication.
Automates and standardizes the analyses of RAD-seq data for phylogenetic inference. Users of RADIS can let their raw Illumina data be processed up to phylogenetic tree inference, or stop (and restart) the process at some point. Different values for key parameters can be explored in a single analysis (e.g. loci building, sample/loci selection), making possible a thorough exploration of data. RADIS relies on Stacks for demultiplexing of data, removing PCR duplicates and building individual and catalog loci. Scripts have been specifically written for trimming of reads and loci/sample selection. Finally, RAxML is used for phylogenetic inferences, though other software may be utilised.
Permits quality control of next-generation sequencing (NGS) tumor-normal experiments. NGS-Bits is separated into four steps: (1) gather information from raw reads, (2) map reads, (3) extract variant lists, and (4) combine results from the preceding steps and add quality control (QC) metrics for tumor-normal experiments. This tool includes all stages of single-sample NGS data analysis and adds special QC metrics for DNA sequencing of tumor-normal pairs.
Processes single-end and paired-end reads from FASTQ/FASTA datasets. MarDRe removes near-duplicate reads by using a de novo MapReduce method. This avoids the analysis of unnecessary reads and reduces the time of subsequent procedures with the dataset. The software was compared to ParDRe, and the results show that MarDRe can scale reasonably well using all the available cores.
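A MapReduce formulation of near-duplicate removal can be sketched in a few lines. In this toy version, the map phase emits each read keyed by its prefix, the shuffle step collects reads with the same key, and the reduce phase keeps one representative per cluster of reads within a mismatch threshold. The prefix-keyed clustering, the function names and the parameters `prefix_len` and `max_mismatches` are assumptions for illustration, not MarDRe's actual interface, and the sketch runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(reads, prefix_len):
    """Emit (prefix, read) pairs so similar reads share a key."""
    for read in reads:
        yield read[:prefix_len], read


def count_mismatches(a, b):
    """Hamming distance between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def reduce_phase(group, max_mismatches):
    """Keep the first read of each near-duplicate cluster within one key."""
    kept = []
    for read in group:
        duplicate = any(
            len(read) == len(k) and count_mismatches(read, k) <= max_mismatches
            for k in kept
        )
        if not duplicate:
            kept.append(read)
    return kept


def remove_near_duplicates(reads, prefix_len=4, max_mismatches=1):
    """Run map, shuffle and reduce in-process and return the surviving reads."""
    buckets = defaultdict(list)
    for key, read in map_phase(reads, prefix_len):
        buckets[key].append(read)          # shuffle: group values by key
    survivors = []
    for key in buckets:                    # each key could be reduced in parallel
        survivors.extend(reduce_phase(buckets[key], max_mismatches))
    return survivors


if __name__ == "__main__":
    reads = ["ACGTAAAA", "ACGTAAAT", "ACGTCCCC", "TTTTAAAA"]
    print(remove_near_duplicates(reads))
```

Keying on the prefix means reads that differ inside the prefix are never compared, which trades some sensitivity for a large reduction in the number of pairwise comparisons.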
Provides an approach for integrating transcript assemblies. Mikado is a pipeline that prepares, serializes and picks annotation files to allow users to obtain a set of gene models. These models are filtered according to individual requirements by using different algorithms for defining loci and scoring transcripts, to ultimately determine a representative transcript for each locus. This software aims to improve precision over the original assemblies, with minimal drops in recall.
Enables genotyping and variant annotation of resequencing data produced by next-generation sequencing (NGS) technologies. CoVaCS is an automated system that provides tools for variant calling and annotation along with a pipeline for the analysis of whole genome shotgun (WGS), whole exome sequencing (WES) and targeted resequencing (TGS) data. The software allows non-specialists to perform all steps from quality trimming to variant annotation.