
Quartz / QUAlity score Reduction at Terabyte scale

A de novo quality score compression tool based on traversing the k-mer landscape of next-generation sequencing read datasets. Quartz compresses quality scores by resetting those of concordant bases to a default value, while preserving the scores at locations that potentially differ from the consensus, i.e., probable variant positions. Quartz will benefit any researchers who generate, store, map, or analyze large amounts of DNA, RNA, ChIP-seq, or exome sequencing data.
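The reset-or-preserve rule can be sketched in a few lines. This is an illustrative toy only, not Quartz's actual algorithm: the trusted k-mer set, the default quality character, and the "any untrusted k-mer preserves the bases it covers" rule are simplifying assumptions.

```python
# Toy sketch of Quartz-style quality smoothing (illustrative assumptions):
# positions covered only by "trusted" k-mers (found in a k-mer dictionary)
# get their quality reset to a default; positions touched by any untrusted
# k-mer keep their original quality, since they may indicate a variant.

DEFAULT_Q = "I"  # hypothetical default quality character

def smooth_qualities(read, quals, trusted_kmers, k):
    kept = [False] * len(read)           # True -> preserve original quality
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer not in trusted_kmers:
            for j in range(i, i + k):    # an untrusted k-mer flags every
                kept[j] = True           # base it covers as a potential variant
    return "".join(q if keep else DEFAULT_Q
                   for q, keep in zip(quals, kept))

trusted = {"ACGT", "CGTA", "GTAC"}
print(smooth_qualities("ACGTAC", "FFFFFF", trusted, 4))  # all k-mers trusted
print(smooth_qualities("ACGTTC", "FFFFFF", trusted, 4))  # mismatch at pos 4
```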


A framework technology comprising a file format and toolkit that combines highly efficient and tunable reference-based compression of sequence data with a data format directly available for computational use. The compression is tunable: the storage of quality scores and unaligned sequences may be adjusted per experiment to conserve information or to minimize storage costs, offering one way to address the threat that growing DNA sequence volumes will outstrip our ability to store them.
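The core of reference-based compression, with quality retention as a tunable knob, can be sketched as follows. The record layout and field names here are invented for illustration, not the actual file format.

```python
# Illustrative sketch of tunable reference-based read compression: each
# aligned read is stored as its mapping position plus only the bases that
# differ from the reference; whether quality scores are kept is a per-run
# setting. (Hypothetical record layout, not the real format.)

def compress_read(ref, read, pos, keep_quals, quals=None):
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    record = {"pos": pos, "len": len(read), "diffs": diffs}
    if keep_quals:                     # tunable: keep or drop quality scores
        record["quals"] = quals
    return record

def decompress_read(ref, record):
    bases = list(ref[record["pos"]:record["pos"] + record["len"]])
    for i, b in record["diffs"]:       # re-apply the stored differences
        bases[i] = b
    return "".join(bases)

ref = "ACGTACGT"
rec = compress_read(ref, "ACGA", 0, keep_quals=False)
print(rec, decompress_read(ref, rec))
```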

GDC / Genome Differential Compressor

A utility designed for compressing collections of genomes from the same species. Such collections can be huge, e.g., a few (or tens of) gigabytes, so the need for a robust data compression tool is clear. Universal compressors like gzip or bzip2 can be used for this purpose, but a specialized tool can do much better, since a universal compressor does not exploit the properties of such data sets, e.g., long approximate repetitions at long distances.
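The idea of exploiting long repeats against a reference can be illustrated with a naive LZ-style referential parser. This is a brute-force sketch of the general technique, not GDC's algorithm, which handles approximate matches and is far more efficient.

```python
# Minimal sketch of referential (LZ-style) compression: the target genome
# is parsed into (match position, length) pairs against the reference plus
# literal bases. Greedy brute-force matching; for illustration only.

def ref_compress(reference, target, min_match=4):
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):           # brute-force match search
            l = 0
            while (p + l < len(reference) and i + l < len(target)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        if best_len >= min_match:
            out.append(("match", best_pos, best_len))
            i += best_len
        else:
            out.append(("literal", target[i]))
            i += 1
    return out

def ref_decompress(reference, ops):
    out = []
    for op in ops:
        if op[0] == "match":
            _, pos, length = op
            out.append(reference[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)

ref, target = "ACGTACGTTT", "ACGTTTACGX"
ops = ref_compress(ref, target)
print(ops)
```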

CSAM / Compressed SAM format

A compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files, as well as supporting fast access to the compressed information. They generate more compact lossless representations than BAM, which is currently the preferred lossless compressed SAM-equivalent format; and are self-contained, that is, they do not depend on any external resources to compress or decompress SAM files.


TwoPaCo

A scalable, low-memory algorithm for constructing de Bruijn graphs from whole-genome sequences. TwoPaCo identifies the positions in the genome that correspond to vertices of the compacted graph. It first narrows down the set of candidate positions with a probabilistic data structure, making the deterministic, memory-intensive verification step feasible. TwoPaCo can construct the graph for 100 simulated human genomes in less than a day, and for eight real primates in less than two hours, on a typical shared-memory machine.
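The filter-then-verify pattern can be sketched with a small Bloom filter. This is a toy illustration of the two-pass idea (candidate narrowing, then exact verification), not TwoPaCo's actual junction-finding algorithm; the filter sizes and hashing scheme are arbitrary.

```python
# Sketch of two-pass candidate narrowing: pass 1 pushes k-mers through a
# Bloom filter, and only k-mers whose bits were already all set become
# candidates for being repeated; pass 2 verifies just the candidates
# exactly, instead of hash-tabling every k-mer.

import hashlib
from collections import Counter

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for seed in range(self.hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        """Insert item; return True if it MAY have been present already."""
        present = True
        for p in self._positions(item):
            if not (self.array[p // 8] >> (p % 8)) & 1:
                present = False
                self.array[p // 8] |= 1 << (p % 8)
        return present

def repeated_kmers(genome, k):
    bloom, candidates = BloomFilter(), set()
    for i in range(len(genome) - k + 1):        # pass 1: probabilistic filter
        if bloom.add(genome[i:i + k]):
            candidates.add(genome[i:i + k])
    counts = Counter(genome[i:i + k]            # pass 2: exact check on the
                     for i in range(len(genome) - k + 1))  # candidates only
    return {km for km in candidates if counts[km] > 1}

print(repeated_kmers("ACGTACGTTT", 4))
```

A truly repeated k-mer always lands in the candidate set (its second insertion finds all bits set), so false positives from the filter only cost extra verification work, never correctness.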

SECRAM / Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map

A privacy-preserving solution for the secure storage of compressed aligned genomic data. SECRAM enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared to BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared to CRAM, SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage.


Boiler

A software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Because most per-read data is discarded, its storage footprint is often much smaller than that achieved by other compression tools. Despite this, the most relevant per-read data can be recovered; Boiler compression has only a slight negative impact on results given by downstream tools for isoform assembly and quantitation. Boiler also allows the user to pose fast and useful queries without decompressing the entire file. It is not a general-purpose substitute for RNA-seq SAM/BAM files, but it is an extremely space-efficient alternative that works well with tools like Cufflinks and StringTie.
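The central data reduction, collapsing alignments into a coverage vector, can be sketched with a standard difference-array trick. The interval representation of alignments here is a simplifying assumption.

```python
# Sketch of reducing per-read alignments to a genomic coverage vector:
# each alignment contributes +1 at its start and -1 at its end in a
# difference array, and a prefix sum yields per-base coverage. The reads
# themselves are then discarded.

def coverage_vector(alignments, genome_len):
    diff = [0] * (genome_len + 1)
    for start, end in alignments:      # half-open interval [start, end)
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for d in diff[:genome_len]:
        running += d
        cov.append(running)
    return cov

print(coverage_vector([(0, 4), (2, 6)], 8))
```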


AFRESh

Targets the effective representation of raw genomic symbol streams, for both reads and assembled sequences. AFRESh uses a configurable set of prediction and encoding tools, extended by a context-adaptive binary arithmetic coding (CABAC) scheme, to compress raw genomic data without reference files. It splits the genomic data stream into blocks and selects, for each block, the most effective tool from the set of encoding and prediction tools. Compared to generic compressors, AFRESh achieves compression gains of up to 41% over GNU Gzip and 22% over 7-Zip at the Ultra setting.
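Per-block tool selection can be sketched by trying each encoder on a block and keeping the smallest output. The toolset here (run-length, 2-bit packing, raw) is a stand-in for AFRESh's actual prediction tools and CABAC stage.

```python
# Sketch of per-block encoder selection: the symbol stream is split into
# fixed-size blocks, and for each block the encoder producing the fewest
# bytes wins. RLE and 2-bit packing are illustrative stand-in tools.

def rle_encode(block):
    out, i = [], 0
    while i < len(block):
        j = i
        while j < len(block) and block[j] == block[i]:
            j += 1
        out.append(f"{j - i}{block[i]}")   # run length followed by symbol
        i = j
    return "".join(out)

def twobit_encode(block):
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    packed = 0
    for b in block:
        packed = (packed << 2) | codes[b]  # 2 bits per base
    return packed.to_bytes((2 * len(block) + 7) // 8, "big")

def encode_blocks(stream, block_size=8):
    tools = {"rle": lambda b: rle_encode(b).encode(),
             "2bit": twobit_encode,
             "raw": lambda b: b.encode()}
    chosen = []
    for i in range(0, len(stream), block_size):
        block = stream[i:i + block_size]
        name, payload = min(((n, f(block)) for n, f in tools.items()),
                            key=lambda t: len(t[1]))
        chosen.append((name, payload))
    return chosen

for name, payload in encode_blocks("AAAAAAAACGTACGTA"):
    print(name, len(payload))
```

A homopolymer block favors run-length encoding, while a mixed block favors dense 2-bit packing; selecting per block captures the best of both.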

GTRAC / GenoType Random Access Compressor

Allows fast access to information about a specific variant, or to the genotype of a sample or group of samples, directly over the compressed Variant Call Format (VCF) file. GTRAC achieves compression rates comparable to state-of-the-art compressors while allowing ultra-fast querying in the compressed domain: it can quickly retrieve all individuals/samples that possess certain variants, and all variants of a group of individuals/samples. GTRAC thus lets researchers work efficiently with a highly compressed database containing the genotype information of a collection of samples.
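The two query directions can be illustrated on a toy genotype matrix. The per-variant sample sets below stand in for GTRAC's compressed, randomly accessible representation; the data is hypothetical.

```python
# Sketch of the two query directions over genotype data: which samples
# carry a variant, and which variants a sample carries. Variant rows are
# kept as sample-index sets (a stand-in for the compressed representation).

# rows: variants, columns: samples; 1 = variant present (hypothetical data)
genotype_matrix = [
    [1, 0, 1, 0],   # variant 0
    [0, 1, 1, 0],   # variant 1
    [1, 1, 0, 1],   # variant 2
]

variant_index = {v: {s for s, g in enumerate(row) if g}
                 for v, row in enumerate(genotype_matrix)}

def samples_with_variant(v):
    return sorted(variant_index[v])

def variants_of_sample(s):
    return sorted(v for v, samples in variant_index.items() if s in samples)

print(samples_with_variant(2), variants_of_sample(2))
```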

NRGC / Novel Referential Genome Compressor

A referential genome compression algorithm for effectively and efficiently compressing genomic sequences. NRGC employs a scoring-based placement technique to quantify large variations among the sequences, and runs in three stages. First, the target genome is divided into segments, and each segment is placed onto the reference genome. After finding the most suitable placement, each segment is further divided into non-overlapping parts, and the corresponding segments of the reference genome are divided into the same number of parts. Each part of the target is then compressed with respect to the corresponding part of the reference. Rigorous experiments on a set of real human genomes show that NRGC outperforms the best-known algorithms in most cases; compression and decompression times are also very competitive.


smallWig

A lossless compression method for WIG data offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis, and fast queries from the compressed files. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate.
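The block-based random-access layout can be sketched with a generic compressor as a stand-in for smallWig's actual source coders: values are compressed in fixed-size blocks, and an offset index lets a query decompress only the block it needs.

```python
# Sketch of block-based compression with random access: each block of
# values is compressed independently, an index records byte offsets, and
# a point query decompresses only the covering block (minor rate overhead
# in exchange for fast queries). zlib/JSON are illustrative stand-ins.

import json
import zlib

def compress_blocks(values, block_size=4):
    blocks, index, offset = bytearray(), [], 0
    for i in range(0, len(values), block_size):
        chunk = zlib.compress(json.dumps(values[i:i + block_size]).encode())
        index.append(offset)
        blocks += chunk
        offset += len(chunk)
    index.append(offset)              # sentinel end offset
    return bytes(blocks), index, block_size

def query(blocks, index, block_size, pos):
    b = pos // block_size             # decompress only the covering block
    chunk = blocks[index[b]:index[b + 1]]
    return json.loads(zlib.decompress(chunk))[pos % block_size]

vals = [0, 0, 3, 5, 5, 5, 2, 0, 0, 1]
blocks, index, bs = compress_blocks(vals)
print(query(blocks, index, bs, 6))
```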


CALQ

Assists users in exploiting the alignment information contained in SAM/BAM files. CALQ is a lossy compressor for quality values: it computes a genotype certainty level per genomic locus to determine the acceptable coarseness of quality value quantization for all quality values associated with that locus. By using the alignment information to bound the acceptable level of distortion in this way, subsequent downstream analyses are presumably not affected.
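The locus-adaptive idea can be sketched crudely: where the pileup leaves the genotype uncertain, quality values keep fine resolution; where it is near-certain, a coarse quantizer suffices. The majority-base fraction used here is a simplistic stand-in for CALQ's genotype model, and the step sizes and threshold are arbitrary.

```python
# Rough sketch of locus-adaptive quality quantization: certainty at a
# locus is approximated by the majority-base fraction of its pileup, and
# the quantization step is coarse only where the genotype is near-certain.
# (Stand-in for an actual genotype certainty model; parameters arbitrary.)

def quantization_step(pileup_bases, fine=1, coarse=16, threshold=0.9):
    top = max(pileup_bases.count(b) for b in set(pileup_bases))
    certainty = top / len(pileup_bases)
    return fine if certainty < threshold else coarse

def quantize_quality(q, step):
    return (q // step) * step          # snap to the quantizer grid

# certain locus is coarsened, ambiguous locus keeps full resolution
print(quantization_step("AAAAAAAA"), quantization_step("AATTATAT"))
```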

HARC / HAsh-based Read Compressor

HARC is a read compression algorithm that does not require a reference genome, so it can be used for unsequenced species and metagenomics. It reorders reads approximately according to their genome position and encodes them to remove the redundancy between consecutive reads. While reordering can improve compression, the original read order in general, and the read-pairing information in particular, can be useful in downstream analysis; HARC therefore allows compression both with and without preserving the read order.
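The reorder-then-delta idea can be sketched with a greedy overlap chain. HARC finds overlapping reads via hashing; this toy brute-forces the overlaps and invents a simple (overlap length, suffix) encoding for illustration.

```python
# Sketch of reorder-then-encode: reads are greedily chained so consecutive
# reads overlap as much as possible, and each read after the first stores
# only its non-overlapping suffix. (Brute-force overlap search; toy
# encoding, not HARC's actual scheme.)

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for l in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def reorder_and_encode(reads):
    remaining = list(reads)
    current = remaining.pop(0)
    encoded = [("full", current)]
    while remaining:
        best = max(remaining, key=lambda r: overlap(current, r))
        remaining.remove(best)
        l = overlap(current, best)
        encoded.append((l, best[l:]))   # store only the novel suffix
        current = best
    return encoded

def decode(encoded):
    reads, prev = [], ""
    for head, payload in encoded:
        read = payload if head == "full" else prev[len(prev) - head:] + payload
        reads.append(read)
        prev = read
    return reads

enc = reorder_and_encode(["GTAC", "ACGT", "TACG"])
print(enc, decode(enc))
```

Decoding preserves the reordered sequence exactly; recovering the original order would require storing a permutation, which is the trade-off the paragraph above describes.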

GQT / Genotype Query Tools

A command-line tool and C API for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduce the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard "variant-centric" indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries over data sets with millions of individuals.
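Why bitmaps speed up such queries can be shown in miniature with Python integers as bitmaps (GQT additionally compresses them). The genotype and phenotype data below are hypothetical.

```python
# Minimal sketch of bitmap-indexed genotype queries: each variant row is
# a bitmap over samples, so "how many samples in this group carry variant
# v" becomes a single AND plus a popcount instead of a genotype scan.
# (Uncompressed toy bitmaps; data is hypothetical.)

# bit i set -> sample i carries the variant
variant_bitmaps = [0b1010, 0b0110, 0b1101]
cases = 0b0101                      # phenotype bitmap: samples 0 and 2

def carriers_among(group, variant):
    return bin(variant_bitmaps[variant] & group).count("1")

print([carriers_among(cases, v) for v in range(3)])
```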

QVZ / Quality Value Zip

A lossy compressor for the quality values present in genomic data files (e.g., FASTQ and SAM files), which occupy roughly half of the storage space in the uncompressed domain. Lossy compression allows data to be compressed beyond its lossless limit. QVZ exhibits better rate-distortion performance than previously proposed algorithms, for several distortion metrics as well as in the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by previous algorithms.
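Lossy quantization of quality values under a distortion objective can be sketched with a 1-D Lloyd (k-means) quantizer minimizing mean squared error. This is the generic technique only; QVZ additionally conditions on neighboring quality values and supports arbitrary quasi-convex distortions.

```python
# Sketch of a 1-D Lloyd quantizer for quality values: a few reproduction
# levels are iteratively fit to minimize mean squared error, then each
# quality value is snapped to its nearest level. (Generic MSE quantizer,
# not QVZ's full model.)

def lloyd_quantizer(values, levels, iters=20):
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (levels - 1) for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in values:                        # assign to nearest center
            buckets[min(range(levels),
                        key=lambda c: abs(v - centers[c]))].append(v)
        centers = [sum(b) / len(b) if b else c  # recenter on bucket means
                   for b, c in zip(buckets, centers)]
    return centers

def quantize(v, centers):
    return min(centers, key=lambda c: abs(v - c))

quals = [2, 2, 3, 30, 31, 32, 40, 41]
centers = lloyd_quantizer(quals, levels=3)
print(sorted(round(c, 2) for c in centers))
```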

KIC / K-mer Index Compressor

A FASTQ compressor based on a new integer-mapped k-mer indexing method. KIC offers a high compression ratio on sequence data, outstanding user-friendliness through its graphical user interface, and proven reliability. Evaluated on multiple large RNA-seq data sets from both human and plants, KIC's compression ratio exceeded that of all major generic compressors and was comparable to the latest dedicated compressors. KIC enables researchers with minimal informatics training to take advantage of the latest sequence compression technologies, easily manage large FASTQ data sets, and reduce storage and transmission costs.
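The general idea behind integer-mapped k-mer indexing can be sketched with the standard 2-bit encoding (KIC's actual mapping is not specified here): each base maps to a 2-bit code, so a k-mer becomes a single integer, and successive k-mers update in O(1) with a rolling shift.

```python
# Sketch of integer-mapped k-mers: bases map to 2-bit codes, a k-mer packs
# into one integer usable as an array/index key, and the next k-mer's
# integer is derived from the previous one with a shift and mask.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_int(kmer):
    x = 0
    for b in kmer:
        x = (x << 2) | CODE[b]
    return x

def rolling_kmer_ints(seq, k):
    mask = (1 << (2 * k)) - 1          # keeps only the newest k bases
    x = kmer_to_int(seq[:k])
    ints = [x]
    for b in seq[k:]:                  # drop oldest base, append new one
        x = ((x << 2) | CODE[b]) & mask
        ints.append(x)
    return ints

print(rolling_kmer_ints("ACGT", 2))
```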