Decipher sequencing data with file formats for NGS

Next-generation sequencing (NGS) technologies generate tremendous amounts of data. A typical human genome consists of ~3 billion base pairs to be sequenced. A critical step in NGS is to extract this information, store it, and transmit it in an easy-to-use and lightweight way.

Why do we need file formats?

Encoded in bits, a human genome would “only” weigh about 700 MB (Reid Robison). However, sequencers generate short reads that redundantly span the sequence and then need to be aligned to a reference genome. Moreover, since sequencing is not perfect, every base has a score attached to it that reflects the quality of the base call. File formats have therefore been developed to encode as much of this information as possible in as little space as possible. Here …
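The figure of about 700 MB can be checked with a back-of-the-envelope calculation. The sketch below assumes a simple 2-bit-per-base encoding (four symbols A, C, G, T, no ambiguous bases such as N, no quality scores), which is an illustrative assumption and not the layout of any particular file format. The function names `packed_size_bytes`, `pack`, and `unpack` are hypothetical helpers written for this example.

```python
# Rough size of a genome stored at 2 bits per base.
# Assumption: only A, C, G, T occur, so each base fits in 2 bits.
GENOME_LENGTH = 3_000_000_000  # ~3 billion base pairs (approximate)
BITS_PER_BASE = 2

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def packed_size_bytes(n_bases, bits_per_base=BITS_PER_BASE):
    """Number of bytes needed to store n_bases, rounded up to a whole byte."""
    return (n_bases * bits_per_base + 7) // 8


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray(packed_size_bytes(len(seq)))
    for i, base in enumerate(seq):
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n_bases):
    """Recover the DNA string; n_bases is needed because the last byte may be padded."""
    return "".join(
        DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


size = packed_size_bytes(GENOME_LENGTH)
print(f"{size / 1e6:.0f} MB")      # 750 MB (decimal megabytes)
print(f"{size / 2**20:.0f} MiB")   # 715 MiB (binary mebibytes)
```

Under these assumptions the packed genome takes 750 million bytes, or roughly 715 MiB, which matches the "about 700 MB" order of magnitude. Real sequencing output is much larger because reads overlap and each base carries a quality score.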

