Decipher sequencing data with file formats for NGS
Next-generation sequencing (NGS) technologies generate tremendous amount of data. A typical human genome consists of ~3 billion base pairs to be sequenced. A critical step in NGS is to extract the information, stock it and transmit it in an easy-to-use and lightweight way.
Why need file format?
Coded in bits, a human genome would “only” weight about 700 MB (Reid Robison). However, sequencers generate short reads that redundantly span the sequence and then need to be aligned to a reference genome. Moreover, since sequencing is not perfect, every base has a score attached to it to evaluate the quality of sequencing. Thus, file formats have been developed to code a maximum of this information in a minimum of space.
Here are some of the most used file formats in NGS.
Developed at the Wellcome Trust Sanger Institute, the FASTQ format is a text-based format for storing a nucleotide sequence and its corresponding quality scores. Each base and its corresponding quality score are coded using a single ASCII character. Thus, quality scores ranging from 0 to 93 can be encoded (not all ASCII character are printable).
A FASTQ file is normally constituted of four lines per sequence:
Line 1 begins with a “@” and is followed by a sequence identifier.
Line 2 is the raw sequence letters.
Line 3 begins with a “+”.
Line 4 encodes the quality value for each base in Line 2 and must contain the same number of characters as in the sequence.
This format has now become the standard for storing the output of high-throughput sequencing machines.
The SAM (Sequence Alignment/Map) format was developed by the 1000 Genomes Project and aims at storing large nucleotide sequence alignments. This format is flexible enough to store all the alignment information generated by various alignment programs, is simple enough to be generated by alignment programs, and is compact in file size.
The SAM format consists of a header and an alignment section, which has 11 mandatory fields and a variable number of optional fields.
Example of header lines:
Example of alignment lines:
The 11 mandatory fields of the alignment section include information on mapping quality, fragment position, quality control, sequence, etc.
The SAM format can be compressed to take less space in the Binary Alignment Map (BAM) format.
VCF is a text file format to store gene sequence variations, developed by the 1000 Genomes Project. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
There is an option whether to contain genotype information on samples for each position or not.
In this format, header lines start with “#”, and the body containing sequence information has 8 mandatory columns separated by tabs.
A great number of other file formats have been developed to store biological data from various types, including:
- mzML: Provides a standard output format for mass spectrometry (MS) data that facilitates data sharing and analysis.
- MacroMolecular Transmission Format: Allows to transmit and store biomolecular structures for fast 3D visualization and analysis.
- BGT: Permits to separates sample phenotypes, site annotations and genotypes into individual files.
Reid Robison (2014). How big is the human genome. Medium.
Pierre Lindenbaum (2017). Next Generation Sequencing File Formats. Slideshare.