Standalone hamming

A package for error-correcting DNA barcodes. Hamming allows one run of a massively parallel pyrosequencer to process up to 1,544 samples simultaneously. The tagged barcoding strategy can be used to obtain sequences from hundreds of samples in a single sequencing run and to perform phylogenetic analyses of microbial communities from pyrosequencing data. The combination of error-correcting barcodes and massively parallel sequencing is rapidly revolutionizing our understanding of microbial habitats located throughout our biosphere, as well as those associated with the human body.
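
To illustrate the idea, here is a minimal sketch of Hamming-distance decoding of DNA barcodes. The codeword set and reads are made up; real designs use Hamming codes over the 4-letter alphabet with a guaranteed minimum pairwise distance:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_barcode(observed, codewords, max_dist=1):
    """Assign an observed barcode to the unique codeword within max_dist
    mismatches; return None if no codeword is close enough or the
    assignment is ambiguous."""
    hits = [c for c in codewords if hamming_distance(observed, c) <= max_dist]
    return hits[0] if len(hits) == 1 else None

# Toy codeword set with pairwise Hamming distance >= 3, so any single
# sequencing error is corrected unambiguously.
CODEWORDS = ["ACGTACGT", "TTGACCAA", "GGCATTGC"]
print(decode_barcode("ACGAACGT", CODEWORDS))  # -> ACGTACGT (1 error fixed)
```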

PBcR / PacBio Corrected Reads

An approach that utilizes short, high-identity sequences to correct the errors inherent in long, single-molecule sequences. PBcR, implemented as part of the Celera Assembler, trims and corrects individual long-read sequences by first mapping short-read sequences to them and then computing a highly accurate hybrid consensus sequence, improving read accuracy from as low as 80% to over 99.9%. The corrected, “hybrid” PBcR reads may then be assembled de novo on their own, in combination with other data, or exported for other applications.
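
The core idea of short reads voting on each long-read column can be sketched as follows. This is a simplified majority-vote consensus, not the actual PBcR algorithm, and the placements are assumed to be gap-free for brevity:

```python
from collections import Counter

def hybrid_consensus(long_read, mapped_short_reads):
    """mapped_short_reads: list of (start, sequence) pairs giving gap-free
    placements of short reads on the long read. Each covered column is
    replaced by the majority base among the short reads; uncovered columns
    keep the original long-read base."""
    piles = [Counter() for _ in long_read]
    for start, seq in mapped_short_reads:
        for offset, base in enumerate(seq):
            if 0 <= start + offset < len(long_read):
                piles[start + offset][base] += 1
    return "".join(p.most_common(1)[0][0] if p else orig
                   for p, orig in zip(piles, long_read))

noisy = "ACGTTXGTAC"
shorts = [(0, "ACGTT"), (3, "TTAGT"), (5, "AGTAC")]
print(hybrid_consensus(noisy, shorts))  # -> ACGTTAGTAC (the X is corrected)
```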

BLESS / BLoom-filter-based Error correction Solution for high-throughput Sequencing reads

A memory-efficient error correction method that uses a Bloom filter as its main data structure. A new version, BLESS 2, improves runtime and accuracy while maintaining low memory usage. BLESS 2 has a more accurate error correction algorithm than the original BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. When compared with five top-performing tools, BLESS 2 was the fastest when executed on two twelve-core computing nodes using MPI. It also showed at least 11% higher error-correction gain while retaining the memory efficiency of the previous version on large genomes.
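
To illustrate the underlying data structure, here is a minimal Bloom filter holding "solid" (trusted) k-mers. The hash choices and sizes are arbitrary; real implementations such as BLESS tune these parameters to the data:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash positions
    derived from blake2b with different salts."""
    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for salt in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=bytes([salt])).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):  # may yield false positives, never false negatives
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

solid = BloomFilter()
for kmer in ("ACGTA", "CGTAC", "GTACG"):   # k-mers seen often enough to trust
    solid.add(kmer)
print("ACGTA" in solid, "TTTTT" in solid)  # True False (with high probability)
```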


Canu

Is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes. The program can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets.

PAGIT / Post-Assembly Genome-Improvement Toolkit

Provides a toolkit for improving the quality of genome assemblies produced by assembly software. PAGIT combines four tools: (i) ABACAS, which orders and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within scaffolds; (iii) ICORN, which identifies and corrects small errors in consensus sequences; and (iv) RATT, which transfers annotation from a reference genome. The software was mainly created to analyze parasite genomes of up to about 300 Mb.


Reptile

Allows users to correct errors in short-read data from next-generation sequencing (NGS). Reptile is a scalable short-read error correction method broadly based on the k-mer spectrum approach, but one that also uses adjoining k-mer information (called tiles) to reliably suggest error corrections. The software was evaluated on several Illumina/Solexa datasets. A distributed-memory approach to parallelizing Reptile allows hardware with any memory size per node to be employed for error correction of any dataset using Reptile’s algorithm.
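
A stripped-down version of k-mer-spectrum correction (without Reptile's tile machinery) might look like the sketch below; `SOLID` stands for the set of trusted k-mers, and only single-base substitutions are tried:

```python
def correct_read(read, solid_kmers, k=5):
    """Try to make every k-mer in the read 'solid' by substituting one
    base at a time; greedy left-to-right, simplified from the tile-based
    strategy Reptile actually uses."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if kmer in solid_kmers:
            continue
        for j in range(k):                    # position within the k-mer
            for base in "ACGT":
                trial = kmer[:j] + base + kmer[j + 1:]
                if trial in solid_kmers:
                    read[i + j] = base
                    break
            else:
                continue
            break
    return "".join(read)

SOLID = {"ACGTA", "CGTAC", "GTACG", "TACGT"}
print(correct_read("ACGTACGX", SOLID))  # -> ACGTACGT
```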

ShoRAH / Short Reads Assembly into Haplotypes

A computational method for quantifying genetic diversity in a mixed sample and for identifying the individual clones in the population, while accounting for sequencing errors. The approach also provides the user with an estimate of the quality of the reconstruction. Further, ShoRAH can reconstruct the global haplotypes and estimate their frequencies. ShoRAH was run on simulated data and on real data obtained in wet-lab experiments to assess its reliability.

TRAPLINE / Transparent Reproducible and Automated PipeLINE

Serves for RNA-seq data processing, evaluation and prediction. TRAPLINE guides researchers through the NGS data analysis process in a transparent and automated state-of-the-art pipeline. It can detect protein-protein interactions (PPIs), miRNA targets and alternative splicing variants or promoter-enriched sites. This tool includes different modules for several functions: (1) it scans the list of differentially expressed genes; (2) it includes modules for miRNA target prediction; and (3) a module is implemented to identify verified interactions between proteins of significantly upregulated and downregulated mRNAs.

UMI-tools / Unique Molecular Identifiers-tools

Demonstrates the value of properly accounting for errors in unique molecular identifiers (UMIs). UMI-tools removes PCR duplicates and implements a number of different UMI deduplication schemes. It can extract, remove and append UMI sequences from FASTQ reads. Compared with previous methods, it is superior at estimating the true number of unique molecules. The simulations provide insight into the impact on quantification accuracy and indicate that applying an error-aware method becomes even more important at higher sequencing depths.
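
A minimal sketch of error-aware UMI deduplication in the spirit of UMI-tools' "directional" method is shown below; this is simplified (real input would come from a BAM file, grouped by mapping position), though the counts(parent) >= 2*counts(child) - 1 threshold follows the published heuristic:

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dedup_umis(umi_counts):
    """Collapse UMIs that are 1 mismatch away from a sufficiently more
    abundant UMI (directional adjacency). Returns the estimated set of
    true molecules at one genomic position."""
    survivors = []
    for umi, count in sorted(umi_counts.items(), key=lambda x: -x[1]):
        parent = next((s for s in survivors
                       if hamming(s, umi) == 1
                       and umi_counts[s] >= 2 * count - 1), None)
        if parent is None:
            survivors.append(umi)
    return survivors

# AAAA and CCGG are real molecules; AAAT looks like a PCR/sequencing error.
counts = Counter({"AAAA": 100, "AAAT": 3, "CCGG": 50})
print(dedup_umis(counts))  # -> ['AAAA', 'CCGG']
```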


LoRMA

A method for correcting long and highly erroneous sequencing reads. LoRMA shows that efficient alignment-free methods can be applied to highly erroneous long-read data, whereas earlier approaches needed alignments to take the global context of errors into account. Reads corrected by the new method have an error rate less than half that of reads corrected by previous self-correction methods. Furthermore, the throughput of the new method is 20% higher than that of previous self-correction methods on read sets with coverage of at least 75×.

SGA-ICE / SGA-Iteratively Correcting Errors

Implements iterative error correction using modules from the String Graph Assembler (SGA). SGA-ICE is an iterative error correction pipeline that runs SGA in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the number of erroneous reads.
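
Conceptually, the pipeline is just a driver loop over increasing k-mer sizes; the sketch below shows the shape of such a loop using `subprocess`. The exact `sga` command lines and flags are assumptions here (consult the SGA documentation for the real invocations):

```python
import subprocess

def iterative_correction(reads_fastq, kmer_sizes=(40, 60, 80)):
    """Run several rounds of k-mer-based correction with increasing k,
    feeding each round's output into the next. Command-line flags are
    illustrative, not verified against a specific SGA version."""
    current = reads_fastq
    for round_no, k in enumerate(kmer_sizes, 1):
        corrected = f"corrected.round{round_no}.k{k}.fastq"
        subprocess.run(["sga", "index", "-t", "8", current], check=True)
        subprocess.run(["sga", "correct", "-k", str(k), "-t", "8",
                        "-o", corrected, current], check=True)
        current = corrected
    return current  # input for the final overlap-based round

# iterative_correction("reads.fastq")
```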

debarcer / De-Barcoding and Error Correction

Facilitates the use of barcoded data generated by SiMSen-seq. Debarcer is a package for working with next-generation sequencing (NGS) data that contains molecular barcodes. It processes raw .fastq files containing SiMSen-seq barcoded adaptor regions using a combination of standard bioinformatic tools such as bwa, Perl and R, as well as Bio-SamTools, to extract information from alignment files. Debarcer collects the read data for each amplicon and barcode (a ‘sequence family’), and then, based on the alignment extracted from the .bam file, each base is indexed by genomic position.
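
For illustration, extracting a barcode from each read and recording it in the header might look like the sketch below. The barcode position and length are made up; the real SiMSen-seq adaptor layout differs:

```python
def extract_barcodes(fastq_path, barcode_len=12):
    """Yield (barcode, trimmed record) pairs from a FASTQ file, moving the
    first barcode_len bases of each read into the record header.
    Simplified: assumes the barcode sits at the very start of the read."""
    with open(fastq_path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            barcode = seq[:barcode_len]
            yield barcode, (f"{header}_BC:{barcode}",
                            seq[barcode_len:], plus, qual[barcode_len:])
```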


NGS-eval

A user-friendly way to inspect NGS datasets obtained from the sequencing of genetic markers in microbial communities. The error calculation functionality enables evaluation of the overall sequencing quality and can further be used to assess the outcome of NGS data processing pipelines. The interactive plots in NGS-eval quickly illustrate the read coordinates where errors occur. A high frequency of errors at specific positions can be useful for detecting novel (common) sequence variants and for identifying differences between the strains present in the sample and those used as reference sequences.


Trowel

A massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short-read sequencing applications. The latency in the data path has been significantly reduced through efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size, read length and fragment size.


Jabba

A hybrid method that corrects long third-generation reads by mapping them on a corrected de Bruijn graph constructed from second-generation data. Unique to Jabba is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo-alignment with MEMs is a fast and reliable method to map long, highly erroneous sequences on a de Bruijn graph.
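
The seed-and-extend idea can be sketched in a few lines. In this simplification, seeds are exact k-mer matches rather than true maximal exact matches, and the "reference" is a plain string rather than a de Bruijn graph:

```python
def seed_and_extend(read, reference, k=8):
    """Find exact k-mer seeds shared by read and reference, then extend
    each seed left and right while the bases keep matching.
    Returns (read_start, ref_start, length) triples."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    hits = []
    for r in range(len(read) - k + 1):
        for g in index.get(read[r:r + k], []):
            lo_r, lo_g = r, g
            while lo_r > 0 and lo_g > 0 and read[lo_r - 1] == reference[lo_g - 1]:
                lo_r, lo_g = lo_r - 1, lo_g - 1
            hi_r, hi_g = r + k, g + k
            while (hi_r < len(read) and hi_g < len(reference)
                   and read[hi_r] == reference[hi_g]):
                hi_r, hi_g = hi_r + 1, hi_g + 1
            hits.append((lo_r, lo_g, hi_r - lo_r))
    return sorted(set(hits))

print(seed_and_extend("ACGTACGTTT", "GGACGTACGTAA"))  # -> [(0, 2, 8)]
```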


CoLoRMap

Corrects noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. CoLoRMap is based on two novel ideas: using a classical shortest-path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
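
The shortest-path formulation can be sketched on a toy instance: nodes are mapped short reads, edges connect overlapping reads, and Dijkstra finds the chain whose reads have the lowest summed edit score against the long-read windows they cover. This is a simplification of CoLoRMap's actual graph construction:

```python
import heapq

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def best_chain(long_read, mapped):
    """mapped: list of (start, seq) short-read placements, sorted by start.
    Dijkstra over chains of overlapping reads spanning the long read,
    minimizing summed edit score. Simplified: each node's cost is the
    edit distance of the read to the window it covers."""
    n = len(mapped)
    cost = [edit_distance(seq, long_read[s:s + len(seq)]) for s, seq in mapped]
    pq = [(cost[i], i, (i,)) for i in range(n) if mapped[i][0] == 0]
    seen = set()
    while pq:
        c, i, path = heapq.heappop(pq)
        s, seq = mapped[i]
        if s + len(seq) >= len(long_read):
            return c, path                     # chain covers the long read
        if i in seen:
            continue
        seen.add(i)
        for j in range(i + 1, n):              # overlapping successors
            if mapped[j][0] <= s + len(seq):
                heapq.heappush(pq, (c + cost[j], j, path + (j,)))
    return None

long_read = "ACGTTXGTAC"
shorts = [(0, "ACGTT"), (4, "TAGTA"), (5, "AGTAC")]
print(best_chain(long_read, shorts))  # -> (1, (0, 2)): ACGTT then AGTAC
```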

HECIL / Hybrid Error Correction with Iterative Learning

Allows decomposition of the workload into independent data-parallel tasks that can be executed simultaneously. HECIL is a hybrid correction framework that corrects erroneous long reads based on optimal combinations of base quality and mapping identity of aligned short reads. This method performs significantly better on an overwhelming majority of evaluation metrics, even with limited amounts of short reads available for correction.
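
A minimal sketch of choosing a correction by combining base quality and mapping identity is given below; the weighting scheme is illustrative, not HECIL's exact objective:

```python
def choose_base(candidates):
    """candidates: list of (base, base_quality_prob, mapping_identity)
    tuples from short reads aligned over one long-read position.
    Score each observed base by the summed product of its base-call
    confidence and the identity of the alignment it came from."""
    scores = {}
    for base, qual_prob, identity in candidates:
        scores[base] = scores.get(base, 0.0) + qual_prob * identity
    return max(scores, key=scores.get)

pileup = [("A", 0.999, 0.98), ("A", 0.99, 0.95), ("G", 0.60, 0.80)]
print(choose_base(pileup))  # -> A
```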

HALC / High Throughput Algorithm for Long Read Error Correction

Allows long read error correction. HALC aligns the long reads to short-read contigs from the same species with a relatively low identity requirement, so that a long-read region can be aligned to at least one contig region. It constructs a contig graph and, for each long read, draws on the other long reads’ alignments to find the most accurate alignment, then corrects the read with the aligned contig regions. The tool can correct more bases in the long reads than existing error correction algorithms while achieving comparable or higher accuracy.

UMI-Reducer / Unique Molecular Identifiers Reducer

Processes and differentiates polymerase chain reaction (PCR) duplicates from biological duplicates. UMI-Reducer uses unique molecular identifiers (UMIs) and the mapping position of the read to identify and collapse reads that are technical duplicates. The remaining true biological reads are then used for a bias-free estimate of mRNA abundance in the original lysate. This strategy is of particular use for libraries made from low amounts of starting material, which typically require additional cycles of PCR and are therefore most prone to PCR duplicate bias.

iCORN / Iterative Correction of Reference Nucleotides

Aligns deep-coverage short sequencing reads to correct errors in reference genome sequences and evaluate their accuracy. The latest version of iCORN is based on SMALT (a mapper), samtools, GATK, snp-o-matic and Perl scripts. It was shown that, after very few iterations, iCORN is efficient at correcting the homopolymer errors that are often present in 454 data, thus potentially improving the ability to combine assemblies constructed using different sequencing technologies.


Frame-Pro

A profile homology search tool for PacBio reads. Frame-Pro uses hidden Markov models (HMMs) and a directed acyclic graph to correct errors in DNA sequencing reads. It can also output the profile alignments of the corrected sequences against characterized protein families. Results showed that this method enables more sensitive homology search and corrects more errors compared to a popular error correction tool that does not rely on hybrid sequencing.


ntHash

A hashing algorithm tuned for processing DNA/RNA sequences. ntHash provides a fast way to compute multiple hash values for a given k-mer without repeating the whole procedure for each value. To do so, a single hash value is computed from a given k-mer, and then each extra hash value is computed with a few more multiplication, shift and XOR operations on the initial hash value. This is very useful for certain bioinformatics applications, such as those that utilize the Bloom filter data structure. Experimental results demonstrate a substantial speed improvement over conventional approaches, while retaining a near-ideal hash value distribution.
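
The trick of deriving extra hash values from one base hash can be sketched as below; the constants and the mixing step are illustrative stand-ins, not ntHash's actual functions:

```python
def base_hash(kmer):
    """A simple polynomial rolling hash over the 4-letter alphabet
    (a stand-in for ntHash's rolling hash of nucleotide seeds)."""
    h = 0
    for c in kmer:
        h = (h * 4 + "ACGT".index(c)) & 0xFFFFFFFFFFFFFFFF
    return h

def multi_hash(kmer, count):
    """Derive `count` hash values from a single base hash with a few
    extra multiply/shift/XOR operations each, instead of rehashing
    the whole k-mer for every value."""
    h0 = base_hash(kmer)
    values = []
    for i in range(count):
        h = (h0 * (0x9E3779B97F4A7C15 + 2 * i + 1)) & 0xFFFFFFFFFFFFFFFF
        h ^= h >> 31                      # cheap avalanche mixing
        values.append(h)
    return values

print(multi_hash("ACGTACG", 3))  # three values, e.g. for a Bloom filter
```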

Karect / KAUST Assembly Read Error Correction Tool

An error correction technique based on multiple alignment. Karect supports substitution, insertion and deletion errors. It can handle non-uniform coverage as well as moderately covered areas of the sequenced genome. Experiments with data from Illumina, 454 FLX and Ion Torrent sequencing machines demonstrate that Karect is more accurate than previous methods, both in terms of correcting individual-base errors (up to 10% increase in accuracy gain) and in terms of post-de novo assembly quality (up to 10% increase in NGA50).