
DNA sequence data compression software tools | Whole-genome sequencing data analysis

An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful…
CaBLAST/CaBLAT
Desktop
Quartz QUAlity score Reduction at Terabyte scale
Desktop

A de novo quality score compression tool based on traversing the k-mer landscape of next-generation sequencing read datasets. Quartz preserves quality scores for probable variant locations and…

DNAzip
Desktop

A package for compressing DNA sequences using a reference genome. DNAzip uses a series of compression techniques that, taken together, reduce the size of a single genome by orders of magnitude. It…

CRAM
Desktop

A framework technology comprising file format and toolkit in which we combine highly efficient and tunable reference-based compression of sequence data with a data format that is directly available…

DeeZ DeeNA-Zip
Desktop

A tool for compressing SAM/BAM files, or more formally, a tool which does reference-based compression by local assembly. DeeZ was compared to other tools on bacterial RNA-seq data as well as human…

GDC Genome Differential Compressor
Desktop

A utility designed for compression of genome collections from the same species. The size of such collections can be huge, e.g., a few (or tens of) gigabytes, so a need for a robust data compression…

BEETL Burrows-Wheeler Extended Tool Library
Desktop

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. A tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid…

DSRC DNA Sequence Reads Compression
Desktop

An application designed for compression of data files containing reads from DNA sequencing in FASTQ format. The size of such files can be huge, e.g., a few (or tens of) gigabytes, so a need for a…

BEDOPS
Desktop

A software suite for common genomic analysis tasks which offers improved flexibility, scalability and execution time characteristics over previously published packages. The suite includes a utility…

Khmer
Desktop

Allows k-mer-based dataset analysis and transformations. Khmer is a set of command-line tools for working with DNA shotgun sequencing (SGS) data from genomes, transcriptomes, metagenomes, and single…

Compressed SAM format CSAM
Desktop

A compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files, as well as supporting fast access to the…

TwoPaCo
Desktop

A scalable low memory algorithm for constructing de Bruijn graphs from whole genome sequences. TwoPaCo is based on identifying the positions of the genome which correspond to vertices of the…

GeneCodeq
Desktop

A Bayesian method inspired by coding theory for adjusting quality scores to improve the compressibility of quality scores without adversely impacting genotyping accuracy. GeneCodeq leverages a corpus…

Quip
Desktop

Compresses next-generation sequencing data in the FASTQ and SAM/BAM formats with extreme prejudice.

SECRAM
Desktop
GReEn Genome Resequencing Encoding
Desktop

A compression tool recently proposed for compressing genome resequencing data using a reference genome sequence.

NGC
Desktop

A compressor for aligned HTS sequencing data that enables the complete lossless and lossy compression of mapped alignment data stored in SAM/BAM files.

CARGO Compressed ARchiving for GenOmics
Desktop

A high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our…

Boiler
Desktop

A software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions…

GTC GenoType Compressor
Desktop

Provides a compressed representation of genotypes that supports fast queries. The main feature of GTC is to offer much better compression and much faster queries in order to represent a collection…

ARSDA Analyzing RNA-Seq Data
Desktop

Alleviates the problem associated with storage, transmission and analysis of high-throughput sequencing (HTS) data. ARSDA can take as input .SRA files or .fastq files of many gigabytes, build an…

AFRESh
Desktop

Targets the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a…

SCALCE Sequence Compression Algorithm using Locally Consistent Encoding
Desktop

A tool for compressing FASTQ files. SCALCE is designed specifically for the Illumina-generated FASTQ files, but supports any valid FASTQ with consistent read lengths. The SCALCE algorithm provides a…

GTRAC GenoType Random Access Compressor
Desktop

Allows for fast access of information of a specific variant or the genotype of a sample/group of samples over the compressed Variant Call Format (VCF) file. GTRAC achieves compression rates…

NRGC Novel Referential Genome Compressor
Desktop

A referential genome compression algorithm to effectively and efficiently compress the genomic sequences. We employ a scoring based placement technique to quantify large variations among the genomic…

smallWig
Desktop

A lossless compression method for WIG data offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics…

LFQC
Desktop

A lossless non-reference based FASTQ compression algorithm that can elegantly run on commodity machines. LFQC is provisioned to run in in-core as well as out-of-core settings. The implementations are…

msbwt
Desktop

A package for combining strings from sequencing into a data structure known as the multi-string BWT (MSBWT).

MFCompress
Desktop

A package for FASTA and multi-FASTA files compression. MFCompress provides additional average compression gains of almost 50%, i.e. it potentially doubles the available storage, although at the cost…

Samcomp
Desktop

An algorithm for compression of FASTQ files. Samcomp performs reference based compression but requires previously aligned data in the SAM format instead. The tool is compared against existing…

fastqz
Desktop

A compressor for the most common (Sanger format) FASTQ files, produced by DNA sequencing machines. fastqz breaks the FASTQ file into three separate streams and uses a compression method designed to…

fqzcomp
Desktop

Compresses FASTQ files. fqzcomp uses a public domain byte-wise arithmetic coder. The software splits FASTQ data into sequence identifiers, basecalls and quality scores, compressing the streams…
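
The stream-splitting idea described for fqzcomp can be illustrated with a minimal Python sketch. This is not fqzcomp's code: the function names are hypothetical, and zlib is used only as a stand-in for the arithmetic coder mentioned above. It assumes well-formed four-line FASTQ records and drops any text after the "+" separator.

```python
import zlib


def split_fastq_streams(fastq_text):
    """Split FASTQ text into identifier, basecall and quality streams."""
    lines = fastq_text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        ids.append(header[1:])   # identifier without the leading '@'
        seqs.append(seq)         # basecalls
        quals.append(qual)       # quality string
    return "\n".join(ids), "\n".join(seqs), "\n".join(quals)


def compress_streams(fastq_text):
    """Compress each stream on its own (zlib is a stand-in coder)."""
    return [zlib.compress(s.encode("ascii"))
            for s in split_fastq_streams(fastq_text)]


def decompress_streams(blobs):
    """Rebuild FASTQ text from the three compressed streams."""
    ids, seqs, quals = (zlib.decompress(b).decode("ascii").split("\n")
                        for b in blobs)
    return "".join(f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals))
```

Keeping the streams apart lets each one be modelled separately, since identifiers, bases and qualities have very different statistics.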

GenomeZip
Desktop

Compresses human genomes based on entropy coding, using a reference genome and known Single Nucleotide Polymorphisms (SNPs). GenomeZip also explores several intrinsic features of genomes and…

Genomedata
Desktop

A format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space…

HARC HAsh-based Read Compressor
Desktop

Reorders reads approximately according to their genome position and encodes them to remove the redundancy between consecutive reads. HARC is an algorithm for read compression that does not require a…

FaStore
Desktop

Compresses DNA sequences, quality scores and read identifiers. FaStore is a method that compresses quality scores and can also be used for compressing alignments in SAM format. This…

HiRGC
Desktop

Searches for maximum matches from the integer sequence in a hash table using an advanced greedy matching strategy. HiRGC is a high-performance referential genome compression algorithm. Its speed and…

Quark
Desktop

Compresses high-throughput RNA-seq reads. Quark locates regions of interest in the references that are specific to the particular RNA-seq experiment being compressed, and stores…

E²FM
Desktop

Represents a full-text index optimized for compressing and encrypting entire collections of genomic sequences and for performing fast pattern-search queries. E²FM simplifies to perform operations…

ChIPWig
Desktop

Serves to compress ChIP-seq Wig data. Wig is a standard file format, which in this setting contains relevant read density information crucial for visualization and downstream processing. ChIPWig may be…

GQT Genotype Query Tools
Desktop

A command-line tool and a C API for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes.…

ERGC Efficient Referential Genome Compressor
Desktop

A genome compression tool. ERGC compresses a target genome using a reference genome. It employs a divide-and-conquer strategy. At first it divides both the target and reference sequences into some…

QVZ Quality Value Zip
Desktop

A lossy compressor for the quality values presented in genomic data files (e.g., FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression…

MINCE
Desktop

A technique to boost the compression of sequencing data that is based on the concept of bucketing similar reads so that they appear nearby in the file. MINCE is a technique for encoding collections…

ORCOM Overlapping Reads COmpression with Minimizers
Desktop

A compressor of sequencing reads. ORCOM takes as an input FASTQ files (possibly gzipped) and stores the DNA symbols of each read in a highly-compressed form. Id and quality fields are not stored.…

iDoComp
Desktop

A compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression…

CompMap
Desktop

A reference-based compression program to speed up read mapping to related reference sequences. It is designed to eliminate repeat subsequences based on reference-based compression in the input…

SAMZIP
Desktop

Encodes Sequence Alignment/Map (SAM) files. SAMZIP is an encoding scheme specifically designed to work on SAM files. The scheme exploits two important characteristics of SAM files to improve…

MetaCRAM
Desktop

A de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. MetaCRAM integrates algorithms for taxonomy identification and…

KIC K-mer Index Compressor
Desktop

A FASTQ compressor based on a new integer-mapped k-mer indexing method. KIC offers high compression ratio on sequence data, outstanding user-friendliness with graphical user interfaces, and proven…

MAFCO
Desktop

A lossless compression tool specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain from ≈ 34% to ≈ 57%, depending…

CODOC
Desktop

An open file format and API for the lossless and lossy compression of depth-of-coverage (DOC) signals stemming from high-throughput sequencing (HTS) experiments.

CWig Compressed representation of Wiggle
Desktop

A format and toolkit for storing and analysing genome-wide density signal data.

LEON
Desktop

All-in-one software for FASTQ file compression that handles DNA, headers and quality scores. LEON uses the same data structure for both DNA and quality score compression, a de Bruijn Graph…

ALAPY Compressor
Desktop

Uses a lossless compression algorithm for files and is optimized for the latest sequencing machines from Illumina. ALAPY Compressor is a cross-platform software tool used for efficient compression of…

CoGI Compressing Genomes as an Image
Desktop

An approach for genome compression, which transforms the genomic sequences to a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary…

bzip2
Desktop

Compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. bzip2 is a data compressor able to replace each file by a compressed version of itself, with…

TGC Thousands Genomes Compressor
Desktop

Estimates the boundaries of compression ratio for human genome compression. TGC can also be used as a very effective tool for compressing Variant Call Format (VCF) files. The success of our algorithm…

G-SQZ
Desktop
Web

A Huffman coding-based sequencing-reads-specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets,…

BARCODE
Desktop

Achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. BARCODE runs an order…

KungFq
Desktop

Compresses FASTQ files, decompresses them, and accesses single reads in the compressed files. KungFq divides the reads into blocks and superblocks and computes statistics over each superblock…

DNA-COMPACT DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique
Desktop

Takes advantage of dictionary-based and statistics-based algorithms to deal with the genome compression for scenarios with and without reference sequences. DNA-COMPACT is a framework and a two-pass…

HUGO Hierarchical mUlti-reference Genome cOmpression
Desktop

A compression algorithm for aligned reads in the sorted Sequence Alignment/Map format. HUGO first aligns short reads against a reference genome and stores exactly mapped reads for compression. For…

QualComp
Desktop

A lossy compression algorithm for the quality scores presented in a FASTQ file. QualComp allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to…

AQUa
Desktop

Compresses the quality scores of a FASTQ file. AQUa is based on the AFRESh framework, which supports many features: (i) single-pass encoding, (ii) Context-Adaptive Binary Arithmetic Coding (CABAC), (iii)…

SOLiDzipper
Desktop

A fast encoding method that can efficiently encode and decode NGS data. The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both the sequence and non-sequence…

cbc
Desktop

A program for compression and decompression of aligned reads presented in a SAM file. Note that the purpose of this algorithm is to compress the necessary information to reconstruct the reads…

SACO
Desktop

A lossless compression tool for the sequence alignments found in MAF files. SACO is based on a mixture of finite-context models. Contrary to a recent approach, it addresses both the DNA bases and…

Gzip GNU zip
Desktop

Reduces the size of the named files using Lempel-Ziv coding. Gzip is designed to be a replacement for compress. The software automatically detects the input format and replaces each file by one with…

Path encoding
Desktop

A technique for compressing short-read sequence files. It uses a reference (any gzipped multi-FASTA file) to build a statistical model of the sequences, which is adaptively updated during compression.

LW-FQZip
Desktop

Compresses FASTQ data. LW-FQZip is a lossless light-weight reference-based compression algorithm. The data are first split into metadata, short reads and quality scores, respectively and…

GRS
Desktop

A novel compression tool for efficient storage of Genome Re-Sequencing data.

DELIMINATE
Desktop

A practical implementation of a novel compression approach that can rapidly compress FASTA files containing genomic sequence data in a loss-less fashion.

SNPack
Desktop

An algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies.

paraDSRC
Desktop

A high-performance tool for compressing next generation sequencing data using memory-distributed clusters. paraDSRC uses domain decomposition and message passing interface (MPI) to distribute data…

FQC
Desktop

A fastq compression method that, in addition to providing significantly higher compression gains over GZIP, incorporates features necessary for universal adoption by data repositories/end-users. FQC…

COMRAD COMpression using RedundAncy of Dna
Desktop

Finds repeats over multiple passes through the data so already-compressed regions are extended, leading to detection and compression of long repeated substrings.

RLZ Relative Lempel-Ziv
Desktop

An algorithm that compresses a collection of genomes or sequences from the same species with respect to the reference sequence for that species using a simple greedy technique, akin to LZ77 parsing…
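
A greedy parse against a reference, in the spirit of the description above, might look like the following Python sketch. It is an illustrative toy, not the RLZ implementation: the function names are hypothetical, the longest match is found by brute-force scanning (quadratic time) rather than with an index, and symbols absent from the reference are emitted as single-character literals.

```python
def rlz_parse(reference, target):
    """Greedily factor target into (position, length) copies from reference."""
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Brute-force search for the longest reference substring
        # that matches target starting at position i.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            factors.append(target[i])        # literal: symbol not in reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(reference, factors):
    """Rebuild the target from the reference and the factor list."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

For example, with reference `ACGTACGGTA` and target `ACGGTACGTN`, the parse yields two copy factors followed by one literal for `N`, and decoding those factors against the same reference restores the target.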
