Error correction software tools | High-throughput sequencing data analysis
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies.
Identifies allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences. Celera assembler produces a set of haploid consensus sequences rather than a single consensus sequence. It uses a dynamic windowing approach and detects alleles by simultaneously processing the portions of aligned reads that span a region of sequence variation. Celera assembler also assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence for each confirmed allele.
Provides a de novo transcriptome assembler for short RNA-seq reads. Oases assembles unmapped RNA-seq reads into full-length transcripts. It supports reconstruction with different k values by using dynamic cutoffs. Its features include an array of hash lengths, dynamic filtering of noise, resolution of alternative splicing (AS) events and merging of multiple assemblies.
Assembles noisy single-molecule sequences. Canu introduces several features, including computational resource discovery, adaptive k-mer weighting, automated error rate estimation, sparse graph construction, and graphical fragment assembly (GFA) outputs. The pipeline consists of three stages: correction, trimming, and assembly. The tool can also auto-detect available resources and configure itself to maximize resource utilization.
Provides a toolkit for improving the quality of genome assemblies produced by assembly software. PAGIT bundles four tools: (i) ABACAS, which orders and orientates contigs and estimates the sizes of the gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within scaffolds; (iii) ICORN, which identifies and corrects small errors in consensus sequences; and (iv) RATT, which assists with annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.
Maps deep-coverage short sequencing reads to a reference genome sequence to correct errors in it and evaluate its accuracy. The latest version of iCORN is based on SMALT (mapper), samtools, GATK, snp-o-matic and Perl scripts. After very few iterations, iCORN was shown to be efficient at correcting the homopolymer errors that are often present in 454 data, which potentially makes it easier to combine assemblies constructed with different sequencing technologies.
Aims to find the best value of k for assembly. KmerGenie uses a sampling approach to build approximate k-mer abundance histograms, with a performance gain of several orders of magnitude. It works in several steps: (i) it computes the k-mer abundance histogram for each k value, (ii) estimates the number of distinct genomic k-mers in the dataset, and (iii) outputs the k-mer length that maximizes that number.
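The three steps above can be illustrated with a minimal Python sketch. It is not KmerGenie's implementation: it counts every k-mer exactly instead of sampling, and it replaces the tool's statistical estimate of genomic k-mers with a simple abundance threshold. The function names and the `min_abundance` parameter are hypothetical, chosen only for this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer of length k across all reads (exact, no sampling)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def estimate_genomic_kmers(histogram, min_abundance=2):
    """Crude stand-in for a statistical model (assumption of this sketch):
    treat k-mers seen at least min_abundance times as genomic."""
    return sum(n for abundance, n in histogram.items() if abundance >= min_abundance)


def best_k(reads, k_values, min_abundance=2):
    """Return the k that maximizes the estimated number of distinct genomic k-mers,
    together with the score obtained for every k."""
    scores = {}
    for k in k_values:
        histogram = abundance_histogram(kmer_counts(reads, k))
        scores[k] = estimate_genomic_kmers(histogram, min_abundance)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Two error-free copies of a toy sequence plus one read with a substitution.
    reads = ["AAAAACCCCC", "AAAAACCCCC", "AAAAACCGCC"]
    k, scores = best_k(reads, range(2, 8))
    print(k, scores)
```

In this toy input, small k values collapse the repeated runs into few distinct k-mers and large k values leave fewer k-mers per read, so the score peaks at an intermediate k.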
Identifies and corrects errors in sequencing reads by using k-mer coverage. Quake differentiates k-mers trusted to be in the genome from untrusted k-mers that are artifacts of sequencing errors. The software exploits read quality values and learns the types of errors by estimating nucleotide-to-nucleotide error rates. For each read, it searches for a set of corrections that makes all of its k-mers trusted. It can be deployed on large datasets containing billions of reads.
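The trusted/untrusted distinction can be sketched in a few lines of Python. This is a simplified illustration and not Quake's algorithm: it uses a fixed coverage cutoff, ignores quality values and error-rate models, and only tries single-base substitutions. All function names and the `cutoff` parameter are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trusted_kmers(counts, cutoff):
    """k-mers seen at least `cutoff` times are treated as trusted (fixed cutoff
    is an assumption of this sketch)."""
    return {kmer for kmer, count in counts.items() if count >= cutoff}


def all_trusted(read, k, trusted):
    """True if every k-mer in the read is trusted."""
    return all(read[i:i + k] in trusted for i in range(len(read) - k + 1))


def correct_read(read, k, trusted):
    """Try every single-base substitution and return the corrected read if
    exactly one substitution makes all k-mers trusted; otherwise return None.
    A read that is already fully trusted is returned unchanged."""
    if all_trusted(read, k, trusted):
        return read
    candidates = set()
    for pos in range(len(read)):
        for base in "ACGT":
            if base != read[pos]:
                candidate = read[:pos] + base + read[pos + 1:]
                if all_trusted(candidate, k, trusted):
                    candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else None


if __name__ == "__main__":
    reads = ["ACGTTGCA"] * 3 + ["ACGATGCA"]  # last read carries one substitution
    trusted = trusted_kmers(count_kmers(reads, 3), cutoff=2)
    print(correct_read("ACGATGCA", 3, trusted))
```

Returning None for ambiguous or uncorrectable reads is a deliberate choice in this sketch: it avoids introducing a wrong base when the k-mer evidence does not single out one correction.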