Error correction software tools | High-throughput sequencing data analysis
Characterizing the errors generated by common high-throughput sequencing platforms and distinguishing true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies.
Allows correction of errors inherent in long, single-molecule sequences using short, high-identity sequences. PBcR is a program that trims and corrects individual long-read sequences by first mapping short-read sequences to them and then computing an accurate hybrid consensus sequence. The corrected PBcR reads can then be exported for other applications, or can be de novo assembled alone or in combination with other data.
Removes stereotypical background artifacts from high-throughput sequencing (HTS) data. iDES is intended to assist users in recovering cell-free DNA (cfDNA) molecules while limiting losses of haploid genomic equivalents (hGEs). The program merges a molecular barcoding approach with in silico removal of highly stereotypical background artifacts, with the aim of increasing the efficiency of capture-based sequencing for circulating tumor DNA (ctDNA) detection.
Assists users with de novo sequencing, consensus calling and variant calling on data from Oxford Nanopore Technologies' MinION platform. PoreSeq is an open-source program and Python library. It provides features such as: (i) reference-free de novo error correction using overlap alignment, (ii) reference-based error correction, (iii) scoring of known sequence variants against a given dataset and (iv) straightforward subdivision of processing for cluster and parallel tasks.
A package for error-correcting DNA barcodes. Hamming allows one run of a massively parallel pyrosequencer to process up to 1,544 samples simultaneously. The tagged barcoding strategy can be used to obtain sequences from hundreds of samples in a single sequencing run, and to perform phylogenetic analyses of microbial communities from pyrosequencing data. The combination of error-correcting barcodes and massively parallel sequencing is rapidly advancing our understanding of microbial habitats located throughout the biosphere, as well as those associated with the human body.
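The core idea of error-correcting barcodes is that sample tags are chosen far enough apart (in Hamming distance) that a sequencing error in the tag can still be uniquely resolved. A minimal sketch of that decoding step, with hypothetical barcodes (not the actual Hamming package code or its barcode set):

```python
# Minimal sketch of error-correcting barcode demultiplexing.
# The barcode set and sample names below are hypothetical.

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read_barcode, barcode_to_sample, max_dist=1):
    """Assign a read to a sample, tolerating up to max_dist substitutions.

    Returns None when the barcode is ambiguous or too distant, so that
    sequencing errors in the tag do not mis-assign the read."""
    hits = sorted((hamming_distance(read_barcode, bc), sample)
                  for bc, sample in barcode_to_sample.items())
    best_dist, best_sample = hits[0]
    if best_dist > max_dist:
        return None
    # Require a unique best match within the correction radius.
    if len(hits) > 1 and hits[1][0] == best_dist:
        return None
    return best_sample

barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical barcodes
print(assign_sample("ACGA", barcodes))  # one substitution -> "sample_1"
```

In practice the barcodes are designed as codewords of a Hamming code, which guarantees a minimum pairwise distance rather than relying on an ad hoc set.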
Computes an improved consensus sequence for an assembly. LQS uses accurate short-read data and/or Pacific Biosciences circular consensus reads to correct error-prone long reads sufficiently for assembly. The approach uses three steps: (i) overlaps between reads are detected and the reads are corrected with a multiple alignment process, (ii) the corrected reads are assembled using the Celera Assembler, and (iii) the assembly is improved using a probabilistic model of the signal-level data.
Aligns long sequences with or without local alignment. MECAT is a program based on a pseudolinear alignment scoring algorithm that exploits distance difference factors (DDFs). The software is composed of four modules: (i) a single-molecule real-time (SMRT) read pairwise mapper, (ii) an SMRT read reference mapper, (iii) a noise corrector and (iv) a hierarchical assembly pipeline that extends the Canu pipeline.
Implements iterative error correction by using modules from String Graph Assembler (SGA). SGA-ICE is an iterative error correction pipeline that runs SGA in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total number of erroneous reads.
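The iterative idea can be illustrated with a toy k-mer corrector run at increasing k: each round builds a trusted k-mer set from coverage and fixes bases whose k-mer becomes trusted after a single substitution. This is a simplified stand-in for the rounds of `sga correct`, not SGA's actual algorithm:

```python
from collections import Counter

# Toy illustration of iterative k-mer correction with increasing k,
# the idea behind SGA-ICE (a sketch, not SGA's implementation).

def kmer_counts(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_once(reads, k, cutoff=2):
    """One round: replace a base when a single-base change turns an
    untrusted k-mer (coverage < cutoff) into a trusted one."""
    trusted = {km for km, n in kmer_counts(reads, k).items() if n >= cutoff}
    fixed = []
    for read in reads:
        r = list(read)
        for i in range(len(r) - k + 1):
            km = "".join(r[i:i + k])
            if km in trusted:
                continue
            for j in range(k):
                for b in "ACGT":
                    cand = km[:j] + b + km[j + 1:]
                    if cand in trusted:
                        r[i:i + k] = cand
                        break
                else:
                    continue
                break
        fixed.append("".join(r))
    return fixed

reads = ["ACGTACGT", "ACGAACGT", "ACGTACGT"]  # middle read has one error
for k in (4, 6):  # increasing k per round, as in SGA-ICE
    reads = correct_once(reads, k)
print(reads)  # all three reads now agree: ['ACGTACGT', 'ACGTACGT', 'ACGTACGT']
```

Small k gives enough coverage per k-mer to spot errors; the later, larger k resolves repeats that short k-mers cannot distinguish.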
Allows alignment of long reads for consensus and assembly. FALCON is a set of tools built around a Hierarchical Genome Assembly Process, which generates a genome assembly from a set of sequencing reads in several steps. Each step is accomplished with different command-line tools implementing different sets of algorithms.
Assists users in assembling noisy single-molecule sequences. Canu introduces several features, including computational resource discovery, adaptive k-mer weighting, automated error-rate estimation, sparse graph construction and graphical fragment assembly (GFA) outputs. The pipeline consists of three stages: correction, trimming and assembly. Moreover, the tool can auto-detect available resources and configure itself to maximize resource utilization.
Assembles large genomes from high-coverage short-read data. SGA implements a set of assembly algorithms based on the FM-index. It corrects base-calling errors in the reads, assembles contigs from the corrected reads, and uses paired-end and/or mate-pair data to build scaffolds from the contigs. The tool returns a visual report that displays the properties of the genome and the quality of the data.
Serves for constructing assembly graphs from polished contigs. ABruijn assembler can assemble long error-prone reads using de Bruijn graphs. It generates inaccurate overlapping contigs (i.e., contigs with potential assembly errors) and combines these initial contigs into an accurate assembly graph that encodes all possible assemblies consistent with the reads.
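The underlying data structure is a de Bruijn graph, in which each k-mer contributes an edge between its (k-1)-mer prefix and suffix. A tiny construction sketch (toy sequence and k, not ABruijn's implementation, which works on noisy reads with additional machinery):

```python
from collections import defaultdict

# Tiny de Bruijn graph construction: each k-mer becomes an edge from
# its (k-1)-mer prefix to its (k-1)-mer suffix. Toy data for illustration.

def de_bruijn(seq, k):
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return dict(graph)

print(de_bruijn("ACGTACG", 3))
# {'AC': ['CG', 'CG'], 'CG': ['GT'], 'GT': ['TA'], 'TA': ['AC']}
```

Assemblies then correspond to paths through this graph; repeated k-mers (here the doubled `AC -> CG` edge) are exactly where alternative assemblies arise.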
A user-friendly way to inspect NGS datasets obtained from the sequencing of genetic markers in microbial communities. The error-calculation functionality enables evaluation of the overall sequencing quality and can further be used to assess the outcome of NGS data-processing pipelines. The interactive plots in NGS-eval quickly illustrate the read coordinates at which errors occur. A high frequency of errors at specific positions can be useful for detecting novel (common) sequence variants and for identifying differences between the strains present in the sample and those used as reference sequences.
Identifies and corrects errors in sequencing reads by using k-mer coverage. Quake differentiates between k-mers trusted to be in the genome and k-mers that are untrustworthy artifacts of sequencing errors. The software exploits read quality values and determines the types of errors by generating nucleotide-to-nucleotide error rates. It can be deployed on large datasets containing billions of reads, accepting a set of corrections once it makes all of a read's k-mers trusted.
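The trusted/untrusted split rests on a simple observation: genomic k-mers recur across overlapping reads, while error-induced k-mers are rare. A minimal sketch of that classification (toy reads and cutoff; Quake additionally weights counts by quality values):

```python
from collections import Counter

# Coverage-based k-mer classification in the spirit of Quake.
# The reads and cutoff are illustrative, not real parameters.

def kmer_counts(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def trusted_kmers(counts, cutoff):
    """K-mers seen at least `cutoff` times are trusted as genomic;
    rarer k-mers are likely sequencing-error artifacts."""
    return {kmer for kmer, c in counts.items() if c >= cutoff}

reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]  # last base of read 2 is an error
counts = kmer_counts(reads, k=4)
print(sorted(trusted_kmers(counts, cutoff=2)))
# ['ACGT', 'CGTA', 'GTAC', 'TACG'] -- the error k-mer 'ACGA' is excluded
```

Quake chooses the cutoff from the bimodal k-mer coverage distribution rather than a fixed constant, separating the error peak near coverage 1 from the genomic peak.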
Corrects errors in high-throughput sequencing reads using a Bloom filter. BLESS employs the quality-score distribution of the input reads to fix errors relative to solid k-mers, i.e. k-mers that occur multiple times in the reads, and uses a histogram of k-mer multiplicities to determine the threshold for solid k-mers. It is also useful for investigating quality scores or counting k-mers with the k-mer counter (KMC).
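A Bloom filter stores set membership in a fixed-size bit array: adding an item sets a few hashed bit positions, and a query checks them all. This is what makes holding billions of solid k-mers in memory feasible. A minimal sketch (sizes and hash scheme are illustrative, not the BLESS implementation):

```python
import hashlib

# Minimal Bloom filter sketch: compact, probabilistic set membership.
# May return false positives, never false negatives.

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive independent-ish hash positions by seeding the digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

solid = BloomFilter()
solid.add("ACGTACGT")  # record a solid k-mer
print("ACGTACGT" in solid, "TTTTTTTT" in solid)
```

The false-positive rate is tuned via the bit-array size and hash count; for error correction an occasional false "solid" k-mer merely leaves one error uncorrected.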
A de novo transcriptome assembler that takes advantage of techniques employed in Cufflinks to overcome limitations of existing de novo assemblers. When tested on dog, human and mouse RNA-seq data, Bridger assembled more full-length reference transcripts while reporting considerably fewer candidate transcripts, greatly reducing false-positive transcripts in comparison with state-of-the-art assemblers. It runs substantially faster and requires much less memory than most assemblers, while reaching a level of sensitivity and accuracy comparable to Cufflinks.
Demonstrates the value of properly accounting for errors in unique molecular identifiers (UMIs). UMI-tools removes PCR duplicates and implements a number of different UMI deduplication schemes. It can extract, remove and append UMI sequences from FASTQ reads. Compared with previous methods, it is superior at estimating the true number of unique molecules. Simulations provide insight into the impact on quantification accuracy and indicate that applying an error-aware method becomes even more important at higher sequencing depths.
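The error-aware schemes collapse UMIs that plausibly arose from sequencing errors in a more abundant UMI, rather than counting every distinct tag as a separate molecule. A simplified sketch in the spirit of the "directional" method (toy UMIs; UMI-tools also conditions on mapping position and builds a full adjacency network):

```python
from collections import Counter

# Simplified error-aware UMI deduplication: merge a low-count UMI into a
# one-mismatch neighbour when counts[parent] >= 2 * counts[child] - 1,
# the heuristic used by directional-style methods.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def count_unique_molecules(umis):
    counts = Counter(umis)
    ordered = sorted(counts, key=counts.get, reverse=True)
    absorbed = set()
    for i, parent in enumerate(ordered):
        if parent in absorbed:
            continue
        for child in ordered[i + 1:]:
            if child in absorbed:
                continue
            if hamming(parent, child) == 1 and \
                    counts[parent] >= 2 * counts[child] - 1:
                absorbed.add(child)  # child is likely an error of parent
    return len(counts) - len(absorbed)

umis = ["AACG"] * 10 + ["AACT"] + ["GGTT"] * 5
print(count_unique_molecules(umis))  # 'AACT' merged into 'AACG' -> 2
```

A naive unique-UMI count would report 3 molecules here; the error-aware count of 2 avoids inflating molecule numbers, and the inflation grows with sequencing depth because more error UMIs get observed.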
Serves as a memory-efficient sequencing-error corrector. Lighter uses sampling rather than counting to obtain solid k-mers, and requires only two Bloom filters. By exploiting this sampled set of k-mers, the software corrects reads containing sequencing errors. It supports parallelization and uses no secondary storage.
Corrects short reads from next-generation sequencing (NGS) platforms such as Illumina's Genome Analyzer II. ECHO is built on a probabilistic framework and assigns a quality score to each corrected base for downstream analysis. It retains the information contained in the reads and performs error correction by finding overlaps between reads and applying a maximum a posteriori procedure.
Provides a toolkit for improving the quality of genome assemblies created with assembly software. PAGIT comprises four tools: (i) ABACAS, which orders and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within scaffolds; (iii) ICORN, which identifies and corrects small errors in consensus sequences; and (iv) RATT, which transfers annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.
Performs error correction on RNA-seq data. SEECER is a method based on profile hidden Markov models (HMMs) and does not require a reference genome, which makes it applicable to de novo RNA-seq studies. It can handle non-uniform coverage and alternative splicing, two key challenges in RNA-seq analysis.
Topics (11): De novo sequencing analysis, Paeniclostridium sordellii, Homo sapiens, Clostridioides difficile, Clostridium novyi, Clostridium perfringens, Infection, Wounds and Injuries, Drug-Related Side Effects and Adverse Reactions, Sterols, Steroids