Quality score recalibration is trivial if the status of every base (correct or erroneous) is known: the fraction of bases with a given reported quality score that are sequencing errors gives the empirical (recalibrated) quality for that score. For real sequencing data, however, the erroneous bases are not known in advance. Intriguingly, all current recalibrators (Li et al., 2009; DePristo et al., 2011; Zook et al., 2012; Cabanski et al., 2012) nevertheless rely heavily on the assumption that erroneous bases are known: sequencing errors are identified as mismatches to a reference genome, excluding sites of known variants (e.g., dbSNP (Sherry et al., 2001) for humans). This assumption would be tenable if variant databases were complete, but they are not (The 1000 Genomes Project Consortium, 2010), and the purpose of sequencing is often to discover variants absent from existing databases. Furthermore, outside of humans and several model organisms, variant databases are not available, and thus recalibration is often not done.
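To make the conventional scheme concrete, the following is a minimal Python sketch of reference-based recalibration as described above. It is an illustration under stated assumptions, not the implementation of any cited tool: the function names, the tuple layout `(position, read_base, ref_base, reported_q)`, and the add-one pseudocount are all hypothetical choices. Only the Phred relation Q = -10 log10(p) is standard.

```python
import math
from collections import defaultdict


def label_errors(aligned_bases, known_variant_sites):
    """Yield (reported_q, is_error) for each aligned base.

    aligned_bases: iterable of (position, read_base, ref_base, reported_q)
        tuples (hypothetical layout chosen for this sketch).
    known_variant_sites: set of positions to skip, standing in for a
        variant database such as dbSNP.
    """
    for position, read_base, ref_base, reported_q in aligned_bases:
        if position in known_variant_sites:
            continue  # a mismatch here may be a real variant, so ignore it
        # Every remaining mismatch is assumed to be a sequencing error.
        yield reported_q, read_base != ref_base


def empirical_qualities(observations):
    """Map each reported quality score to its empirical Phred quality.

    observations: iterable of (reported_q, is_error) pairs.
    A pseudocount (assumption of this sketch) avoids log10(0) when a
    quality bin contains no errors.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        if is_error:
            errors[reported_q] += 1

    recalibrated = {}
    for reported_q, total in totals.items():
        error_rate = (errors[reported_q] + 1) / (total + 2)
        recalibrated[reported_q] = -10 * math.log10(error_rate)
    return recalibrated
```

Note that any true variant missing from `known_variant_sites` is counted as an error by `label_errors`, which inflates the error rate and lowers the recalibrated quality. This is the weakness of the known-errors assumption discussed above.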
(Cabanski et al., 2012) ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data. BMC Bioinformatics.
(DePristo et al., 2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
(Li et al., 2009) SNP detection for massively parallel whole-genome resequencing. Genome Res.
(Sherry et al., 2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res.
(The 1000 Genomes Project Consortium, 2010) A map of human genome variation from population-scale sequencing. Nature.
(Zook et al., 2012) Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One.
(Chung and Chen, 2017) Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism. bioRxiv.