DNA sequence data compression software tools | Whole-genome sequencing analysis
An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data.
Executes computationally-aware compression that not only diminishes storage space but also accelerates the analysis process. CaBLAST/CaBLAT aims to exploit the redundancy that arises because most newly sequenced genomes are similar to ones already collected. The data are compressed in a way that allows direct computation on the compressed data. This approach curtails the computational cost of operating on many highly similar genomes.
Compresses quality scores by capitalizing on sequence redundancy. Quartz is a program that performs compression by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in the read sequences. Moreover, it includes features for preserving quality scores at locations that potentially differ from a consensus genome.
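The smoothing idea can be illustrated with a minimal sketch. It is not Quartz's implementation: the function name, the trusted k-mer set, the exact-match rule and the smoothed value of 40 are all assumptions made for illustration.

```python
def smooth_qualities(read, quals, trusted_kmers, k, smoothed=40):
    """Replace quality scores at positions covered by a trusted k-mer.

    Assumption (hypothetical, simplified): a k-mer is "trusted" only when it
    matches an entry of `trusted_kmers` exactly. Positions not covered by any
    trusted k-mer keep their original quality score.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in trusted_kmers:
            for j in range(i, i + k):
                covered[j] = True
    return [smoothed if c else q for c, q in zip(covered, quals)]


# The first four bases are covered by the trusted k-mer "ACGT".
print(smooth_qualities("ACGTTT", [30, 31, 32, 33, 10, 12], {"ACGT"}, k=4))
# prints [40, 40, 40, 40, 10, 12]
```

Long runs of one identical value compress much better than noisy scores, while the untouched positions retain the information at sites that may be true variants.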
A command line software and a C API for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduce the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard "variant-centric" indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals.
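As a rough illustration of a sample-centric bitmap index, the sketch below stores one bitmap per sample and genotype code and answers a query with bitwise AND. It is not the GQT API: the function names and the 0/1/2 genotype coding are assumptions, and plain Python integers stand in for compressed bitmaps.

```python
def build_index(matrix):
    """Build per-sample genotype bitmaps.

    matrix[v][s] is the genotype code of sample s at variant v
    (assumed coding: 0 = hom-ref, 1 = het, 2 = hom-alt).
    index[s][code] is an integer whose bit v is set when sample s
    carries `code` at variant v.
    """
    n_samples = len(matrix[0])
    index = [[0, 0, 0] for _ in range(n_samples)]
    for v, row in enumerate(matrix):
        for s, code in enumerate(row):
            index[s][code] |= 1 << v
    return index


def variants_where_all(index, samples, code):
    """Return variants at which every listed sample has genotype `code`."""
    if not samples:
        return []
    acc = index[samples[0]][code]
    for s in samples[1:]:
        acc &= index[s][code]  # one bitwise AND per additional sample
    return [v for v in range(acc.bit_length()) if (acc >> v) & 1]


genotypes = [
    [0, 1, 1],
    [1, 1, 2],
    [2, 1, 1],
    [1, 0, 1],
]
idx = build_index(genotypes)
print(variants_where_all(idx, [1, 2], code=1))  # prints [0, 2]
```

The query touches one bitmap per selected sample rather than scanning every variant record, which is the property a sample-centric layout is meant to provide.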
Consists of a SAM and BAM file compression tool. DeeZ can encode the positional information of each read within only the relevant contig. This program uses a unique compression method for each field of the SAM record to exploit its specific properties.
A package to compress a DNA sequence using a reference genome. DNAzip uses a series of compression techniques that, taken together, reduce the size of a single genome by orders of magnitude. It permits sending a genome by e-mail, for example. This technique does require a reference human genome (∼3 GB) and a reference SNP map (∼1.2 GB), but this cost is amortized over the total number of genomes. DNAzip runs in time that scales only linearly with the number of variations.
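The core idea of reference-based compression, storing only the differences from a shared reference, can be sketched as follows. This is an illustration restricted to single-base substitutions between equal-length sequences; the function names are hypothetical, and the sketch omits the insertions, deletions and SNP map that the description mentions.

```python
def encode_snps(reference, genome):
    """Return (position, base) pairs where `genome` differs from `reference`.

    Assumption: both sequences have the same length, so only substitutions
    need to be recorded.
    """
    if len(reference) != len(genome):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]


def decode_snps(reference, variants):
    """Rebuild the genome by applying the recorded substitutions."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)


ref = "ACGTACGT"
sample = "ACGAACGC"
diff = encode_snps(ref, sample)
print(diff)  # prints [(3, 'A'), (7, 'C')]
assert decode_snps(ref, diff) == sample
```

Both encoding and decoding walk the list of differences once, so the stored size grows with the number of variations rather than with the genome length.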
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. A tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences.
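To show why a Burrows-Wheeler transform supports both compression and k-mer search, here is a minimal sketch. It uses naive rotation sorting, which is only suitable for short strings; a tool working at database scale would use far more efficient construction. The function names are illustrative assumptions, not part of any tool's interface.

```python
def bwt(text):
    """Burrows-Wheeler transform of `text`, using '$' as the end marker."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of `pattern` in the original text by backward search."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + bwt_str[:lo].count(ch)
        hi = first[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt("banana"))  # prints annb$aa
print(count_occurrences(bwt("ACGTACGTAC"), "ACG"))  # prints 2
```

The transform tends to group identical characters into runs, which helps a downstream compressor, and the same string answers k-mer queries without decompressing the text.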
Serves as a technique to compress short-read sequence files. Path encoding proceeds by drawing a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. This software aims to minimize the complications of managing large-scale sequencing data. It exploits a reference to construct a statistical model of the sequences that is adaptively updated during compression.
A framework technology comprising file format and toolkit in which we combine highly efficient and tunable reference-based compression of sequence data with a data format that is directly available for computational use. This compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
An application designed for compression of data files containing reads from DNA sequencing in FASTQ format. Such files can be huge, e.g., a few (or tens) of gigabytes, so the need for a robust data compression tool is clear. Usually universal compression programs like gzip or bzip2 are used for this purpose, but it is obvious that a specialized tool can work better.
A compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files, as well as supporting fast access to the compressed information. They generate more compact lossless representations than BAM, which is currently the preferred lossless compressed SAM-equivalent format; and are self-contained, that is, they do not depend on any external resources to compress or decompress SAM files.
Compresses FASTQ data. LW-FQZip is a lossless, light-weight, reference-based compression algorithm. The data are first split into metadata, short reads and quality scores, which are then processed independently with different schemes. The software is equipped with a lightweight mapping model, bitwise prediction by partial matching (PPM), arithmetic coding, and multi-threading parallelism. It shows good compatibility with long-read sequencing data and is hoped to provide insights into the storage problems of new sequencing data.
A privacy-preserving solution for the secure storage of compressed aligned genomic data. SECRAM enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared to BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared to CRAM, SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage.
A utility designed for compression of genome collections from the same species. Such collections can be huge, e.g., a few (or tens) of gigabytes, so the need for a robust data compression tool is clear. Universal compression programs like gzip or bzip2 might be used for this purpose, but it is obvious that a specialized tool can work much better, since a universal compressor does not use the properties of such data sets, e.g., long approximate repetitions at long distances.
Compresses next-generation sequencing (NGS) data in the FASTQ and SAM/BAM formats with extreme prejudice. Quip is a lossless compression algorithm that employs several different compression techniques to achieve high compression over sequencing data of many types. This program is based on the notion of statistical compression of read identifiers, quality scores and nucleotide sequences by using arithmetic coding.
Compresses genome resequencing data using a reference genome sequence. GReEn is a compression tool, based on arithmetic coding, that handles arbitrary alphabets. The software does not pose any restrictions or requirements on the sequences to compress. Its running time depends only on the size of the sequence being compressed.
Compresses the data stored in SAM/BAM files. NGC provides a lossless compression mode that returns a file that is semantically equal to the original file in the sense that it contains the exact same information. For quality values, it can also map all q-values that lie within an interval to some single value within this interval. This tool can distinguish between different categories of q-values and can preserve the original qualities of bases in selected columns.
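Mapping every q-value in an interval to a single representative can be sketched as below. The interval width, the choice of the interval midpoint as representative, and the `keep` parameter are illustrative assumptions, not NGC's actual settings or interface.

```python
def bin_qualities(quals, width=10, keep=frozenset()):
    """Map each q-value to the midpoint of its interval.

    Intervals are [0, width), [width, 2*width), and so on (assumed layout).
    Positions listed in `keep` retain their original value, mimicking the
    preservation of qualities in selected columns.
    """
    binned = []
    for pos, q in enumerate(quals):
        if pos in keep:
            binned.append(q)
        else:
            binned.append((q // width) * width + width // 2)
    return binned


print(bin_qualities([2, 9, 10, 37, 40]))            # prints [5, 5, 15, 35, 45]
print(bin_qualities([2, 9, 10, 37, 40], keep={3}))  # prints [5, 5, 15, 37, 45]
```

Binning is lossy: it shrinks the alphabet of quality values, so the resulting stream compresses better, but the original scores cannot be recovered except at the preserved positions.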
A high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors and scale well to multi-TB datasets.
Assists users in compressing RNA-seq alignments. Boiler is an application that discards unnecessary BAM attributes and most of the data that ties individual reads to their aligned positions and shapes. Instead, this method stores coverage vectors and read- and outer-distance tallies, permitting a shift from the ‘alignment domain’ to the ‘coverage domain.’
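The move from the alignment domain to the coverage domain can be illustrated with a small sketch. It is not Boiler's code: the function names, the half-open `(start, end)` read intervals and the run-length encoding step are assumptions chosen for illustration.

```python
from collections import Counter


def coverage_vector(reads, length):
    """Per-base coverage over a region of `length` bases.

    `reads` is a list of half-open (start, end) alignment intervals.
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    coverage, current = [], 0
    for d in diff[:length]:
        current += d
        coverage.append(current)
    return coverage


def run_length(values):
    """Run-length encode a list as (value, run_length) pairs."""
    runs = []
    for x in values:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [tuple(r) for r in runs]


def length_tally(reads):
    """Tally how many reads have each aligned length."""
    return Counter(end - start for start, end in reads)


reads = [(0, 4), (2, 6), (2, 6)]
cov = coverage_vector(reads, 8)
print(cov)                  # prints [1, 1, 3, 3, 2, 2, 0, 0]
print(run_length(cov))      # prints [(1, 2), (3, 2), (2, 2), (0, 2)]
print(length_tally(reads))  # prints Counter({4: 3})
```

The coverage vector and the length tally no longer say which read started where, which is exactly the information that is given up in exchange for a much smaller representation.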
A software suite for common genomic analysis tasks which offers improved flexibility, scalability and execution time characteristics over previously published packages. The suite includes a utility to compress large inputs into a lossless format that can provide greater space savings and faster data extractions than alternatives.
Allows k-mer-based dataset analysis and transformations. Khmer is a set of command-line tools for working with DNA shotgun sequencing (SGS) data from genomes, transcriptomes, metagenomes, and single cells. The software relies on a probabilistic data structure, a Count-Min Sketch, which permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms.
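A Count-Min Sketch for k-mer counting can be sketched in a few lines. This is a generic illustration of the data structure, not khmer's implementation or API: the class name, the table dimensions and the use of SHA-256 as the hash family are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never fall below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by prefixing the row number (assumed scheme).
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def count(self, item):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(item))


def count_kmers(sequence, k, width=1024, depth=4):
    """Stream every k-mer of `sequence` into a new sketch and return it."""
    sketch = CountMinSketch(width, depth)
    for i in range(len(sequence) - k + 1):
        sketch.add(sequence[i:i + k])
    return sketch


sketch = count_kmers("ACGTACGT", k=4)
print(sketch.count("ACGT"))  # at least 2, the true count
```

Memory use is fixed by `width` and `depth` regardless of how many distinct k-mers are seen, and counts can be updated and queried while data are still streaming in; the price is that hash collisions may overestimate a count.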