Error correction software tools | High-throughput sequencing data analysis
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies.
Provides a de novo transcriptome assembler for short RNA-seq reads. Oases assembles unmapped RNA-seq reads into full-length transcripts. It enables reconstruction with different k-values via dynamic cutoffs. Its features include the use of an array of hash lengths, dynamic filtering of noise, resolution of alternative splicing (AS) events and merging of multiple assemblies.
Identifies allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences. Celera assembler is an algorithm to produce a set of haploid consensus sequences rather than a single consensus sequence. It uses a dynamic windowing approach and detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation. Celera assembler also assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele.
Identifies and corrects errors in sequencing reads by using k-mer coverage. Quake differentiates k-mers trusted to be in the genome from k-mers that are untrustworthy artifacts of sequencing errors. The software exploits read quality values and determines the types of errors by learning nucleotide-to-nucleotide error rates. For each read, it searches for a set of corrections that makes all k-mers trusted. It can be deployed on large datasets containing billions of reads.
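The trusted/untrusted split described above can be illustrated with a minimal Python sketch. It assumes plain integer k-mer counts and a fixed, hand-picked coverage cutoff; both are simplifications chosen for illustration and are not Quake's actual procedure, which also takes read quality values into account. The function names and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_trusted(counts, cutoff):
    """Split k-mers into trusted (count >= cutoff) and untrusted sets.

    The fixed cutoff is an assumption made for this sketch.
    """
    trusted = {kmer for kmer, n in counts.items() if n >= cutoff}
    untrusted = set(counts) - trusted
    return trusted, untrusted


def untrusted_positions(read, trusted, k):
    """Return start offsets of the k-mers in `read` that are not trusted."""
    return [i for i in range(len(read) - k + 1)
            if read[i:i + k] not in trusted]


# Toy data: three identical reads and one read with a substitution at index 4.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
trusted, untrusted = split_trusted(counts, cutoff=2)
suspect = untrusted_positions("ACGTTCGT", trusted, k=4)
```

In this toy case every untrusted k-mer of the last read overlaps the substituted base, which is the kind of signal a corrector can use to localize an error before searching for a fix.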
Provides a toolkit for improving the quality of genome assemblies created via an assembly software. PAGIT compiles four tools: (i) ABACAS, which orders and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within the scaffolds; (iii) ICORN, which identifies and corrects small errors in consensus sequences; and (iv) RATT, which helps with annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.
Aligns deep coverage of short sequencing reads to correct errors in reference genome sequences and evaluate their accuracy. The latest version of iCORN is based on SMALT (mapper), samtools, GATK, snp-o-matic and Perl scripts. It was shown that, after very few iterations, iCORN is efficient at correcting homopolymer errors that are often present in 454 data, thus potentially improving the ability to combine assemblies constructed using different sequencing technologies.
Helps users to decrease the curation time from months with manual curation to minutes with automated curation. AutoCurE is an automated tool for bacterial database curation in Excel. It facilitates checks between the downloaded genome folders, files and the genome reports to flag any inconsistencies in the metadata, including genome names, BioProject/UID, RefSeq accession numbers, and sequence file descriptions. Moreover, it allows users to identify and flag archaeal genomes.
Corrects sequencing errors in high-throughput short-read sequencing data. SHREC constructs and traverses a generalized suffix tree built from the read data. It enables users to deal with low coverage and uneven error distribution. The corrected reads it returns can then be used in subsequent de novo short-read assembly.
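The intuition behind looking for errors in a tree of read substrings can be sketched in Python as follows. This is a simplification and not SHREC's algorithm: instead of a generalized suffix tree it uses a plain dictionary that records, for each fixed-length context, how often each base follows it (roughly one level of a suffix trie). A rarely seen branch next to a dominant sibling is flagged as a candidate error. The function names, the `depth` and `min_ratio` parameters and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict


def extension_counts(reads, depth):
    """For each context of length `depth`, count the bases that follow it."""
    ext = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - depth):
            ext[read[i:i + depth]][read[i + depth]] += 1
    return ext


def suspicious_branches(ext, min_ratio=3):
    """Flag rare branches that sit next to a much more frequent sibling.

    Returns (context, rare_base, dominant_base) tuples. The ratio test is
    an assumption made for this sketch.
    """
    flagged = []
    for context, counter in ext.items():
        if len(counter) < 2:
            continue  # a single branch gives nothing to compare against
        best, best_n = counter.most_common(1)[0]
        for base, n in counter.items():
            if base != best and best_n >= min_ratio * n:
                flagged.append((context, base, best))
    return flagged


# Toy data: four identical reads and one read with a substitution (A -> T).
reads = ["ACGTAC"] * 4 + ["ACGTTC"]
ext = extension_counts(reads, depth=3)
flagged = suspicious_branches(ext)
```

Here the context `CGT` is followed by `A` four times and by `T` once, so the `T` branch is reported with `A` as the likely intended base.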