DNA sequence data compression software tools | Whole-genome sequencing analysis
An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data.
A software suite for common genomic analysis tasks that offers improved flexibility, scalability and execution time over previously published packages. The suite includes a utility that compresses large inputs into a lossless format, which can provide greater space savings and faster data extraction than alternatives.
Manipulates files in several formats, including FASTA and FASTQ. Fastaq automatically recognizes the format of its input files. It operates on sequences and, where present, quality scores; any annotation in the input is ignored. Input and output files can optionally be gzip-compressed.
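Format recognition of this kind can be illustrated with a minimal sketch. It is not Fastaq's own detection logic: the function name `sniff_format` is hypothetical, and the rule used (a FASTA record starts with ">", a FASTQ record with "@", a gzip stream with the bytes 0x1f 0x8b) covers only these two formats.

```python
import gzip


def sniff_format(data: bytes) -> str:
    """Guess whether raw bytes hold FASTA or FASTQ records (hypothetical helper)."""
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    for line in data.decode("ascii", errors="replace").splitlines():
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        return "unknown"
    return "unknown"


fasta = b">seq1 some annotation\nACGT\n"
fastq = b"@read1\nACGT\n+\nIIII\n"
print(sniff_format(fasta))                 # plain FASTA input
print(sniff_format(gzip.compress(fastq)))  # gzip-compressed FASTQ input
```

Looking only at the first non-empty line keeps the check cheap, at the cost of not validating the rest of the file.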
A framework technology, comprising a file format and a toolkit, that combines highly efficient and tunable reference-based compression of sequence data with a data format that is directly available for computational use. The compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
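The idea behind reference-based compression can be shown with a toy sketch: an aligned read is stored as its start position, its length, and only the bases that differ from the reference. This is an illustration under strong assumptions, not the format described above. It handles substitutions only (no insertions, deletions, quality scores or unaligned reads), and the names `encode_read` and `decode_read` are hypothetical.

```python
def encode_read(reference, position, read):
    """Store a read as (start, length, substitutions) relative to the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the offsets where the read disagrees with the reference.
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), diffs


def decode_read(reference, record):
    """Rebuild the original read from the reference and its stored differences."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCT"
read = "CGTTAGACGA"  # aligns at position 5 with one substitution
record = encode_read(reference, 5, read)
print(record)
print(decode_read(reference, record) == read)
```

A read that matches the reference exactly costs only a position and a length, which is where the space saving comes from when most reads agree with the reference.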
Reduces the size of the named files using Lempel-Ziv coding (LZ77). Gzip is designed to be a replacement for compress. The software automatically detects the input format and replaces each file by one with the extension .gz, while keeping the same ownership modes, access and modification times. Compressed files can be restored to their original form. The amount of compression obtained depends on the size of the input and the distribution of common substrings.
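The dependence on common substrings is easy to observe with Python's standard `gzip` module, which reads and writes the same format. The sketch below compresses two inputs of equal size, one highly repetitive and one random; the variable names are illustrative only.

```python
import gzip
import os

repetitive = b"ACGT" * 25_000      # 100,000 bytes built from one repeated substring
random_like = os.urandom(100_000)  # 100,000 bytes with almost no repeated substrings

for label, payload in (("repetitive", repetitive), ("random", random_like)):
    packed = gzip.compress(payload)
    # Decompression restores the input byte for byte (the format is lossless).
    assert gzip.decompress(packed) == payload
    print(label, len(payload), "->", len(packed), "bytes")
```

The repetitive input shrinks to a small fraction of its size, while the random input stays at roughly its original size because there are no repeated substrings to exploit.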
Software developed to facilitate application of the Haseman-Elston regression method and to estimate complex trait heritability. GEAR has been demonstrated to function in the following situations: (i) it can estimate the genetic relatedness of unrelated individuals from whole-genome markers; (ii) it can estimate the effective number of markers from a genetic-relatedness matrix; (iii) it can estimate heritability with Haseman-Elston regression.
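As a rough illustration of step (iii), the cross-product form of Haseman-Elston regression regresses the product of standardized phenotypes for each pair of individuals on their genetic relatedness; the slope serves as the heritability estimate. The sketch below assumes this cross-product form and a precomputed relatedness matrix; other forms of the regression exist, and the function names `ols_slope` and `he_regression` are hypothetical rather than part of GEAR.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("relatedness values have no variance")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def he_regression(phenotypes, grm):
    """Cross-product Haseman-Elston estimate from phenotypes and a relatedness matrix."""
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    sd = (sum((y - mean) ** 2 for y in phenotypes) / (n - 1)) ** 0.5
    z = [(y - mean) / sd for y in phenotypes]  # standardized phenotypes
    xs, ys = [], []
    # Use each pair of distinct individuals once (upper triangle of the matrix).
    for i in range(n):
        for j in range(i + 1, n):
            xs.append(grm[i][j])
            ys.append(z[i] * z[j])
    return ols_slope(xs, ys)
```

Here `phenotypes` is a list of trait values and `grm` a square list of lists holding pairwise relatedness. The estimate is only meaningful with many individuals, since the off-diagonal relatedness of unrelated individuals varies very little.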
Allows k-mer-based dataset analysis and transformations. Khmer is a set of command-line tools for working with DNA shotgun sequencing (SGS) data from genomes, transcriptomes, metagenomes, and single cells. The software relies on a probabilistic data structure, a Count-Min Sketch, which permits online updating and retrieval of k-mer counts in memory, as required to support online k-mer analysis algorithms.
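A Count-Min Sketch keeps several rows of counters; each k-mer increments one counter per row, and its count is read back as the minimum over those counters. The minimal sketch below shows the general technique, not Khmer's implementation: the class name, table sizes and the use of BLAKE2b as the hash function are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not Khmer's code)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._cells(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)  # counts are updated online, one k-mer at a time
print(sketch.count("ACGT"))
```

The estimate never falls below the true count and may exceed it when k-mers collide; wider or deeper tables lower that error at the cost of memory.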
Provides a format for storing multiple tracks of numeric data anchored to a genome. Genomedata is suited to genome-scale numerical data and uses an HDF5 (Hierarchical Data Format) container for random access to large genomic datasets. It serves as an intermediate format intended to spare the analysis programmer the trouble of parsing and validating the data. Within a workflow, data can be parsed, validated and converted into the binary format once, which eliminates the computational expense of repeating those steps.