Computational protocol: Fixing Formalin: A Method to Recover Genomic-Scale DNA Sequence Data from Formalin-Fixed Museum Specimens Using High-Throughput Sequencing

Similar protocols

Protocol publication

[…] Upon receipt of raw data, pre-processing and alignment largely followed [] as outlined below, with the following exceptions: Bowtie2 [] was used instead of Bowtie [] for contaminant filtration; Bowtie2 and Novoalign ( were used for alignment to the Anolis carolinensis genome (Anocar2.0, downloaded from the UCSC Genome Browser,; and updated versions of in-house scripts ( were employed throughout the process.DNA recovered from ancient and museum historic specimens is often characterized by various types of postmortem nucleotide damage (e.g. [,,]), and formalin-fixation can cause base modifications in the DNA fragment []. The specimens used in this study had been fixed in formalin for at least 30 years, thus damage to the DNA could be the result of post-mortem denaturation or subsequent exposure to formalin. To inspect potential base misincorporation in sequence reads, we first aligned the untrimmed raw paired-end reads against the Anolis carolinensis reference genome with Bowtie2. By parsing the SAM output, we generated base mismatch frequency plots by plotting the frequency of all 12 possible mismatches against distance from 5′ and 3′ ends of reads, respectively. We observed a sharp increase in mismatch frequencies of almost all types at both ends, and particularly at the 3’ end of reads (). We then performed multiple rounds of trimming from both 5′ and 3′ ends of reads, until the frequencies of all 12 types of mismatches were relatively constant and similar along post-trimmed reads (). As a final trimming step, we removed 36 bp from the forward reads (6 bp from 5′ and 30 bp from 3′ end) and 47 bp from the reverse reads (17 bp from 5′ end and 30 bp from 3′ end). To evaluate the quality of sequence reads before and after cleaning and trimming, SAMtools [] and an in-house script were used to estimate empirical error rates, measured as the percentage of mismatched bases out of the total number of aligned bases in the mitochondrial genome [].The hard-trimmed raw sequence data were then re-processed to remove exact duplicate reads, adaptors, and low-quality sequences, and to merge overlapping paired-end reads following [] and [] using in-house scripts. To remove reads that might result from contamination by organisms other than Anolis, we aligned all adaptor-trimmed reads to the human (hg19) and Escherichia coli (NCBI st. 536) genomes using Bowtie2 []. We assumed that reads aligning to these genomes represented contamination and removed them from our data. After cleanup, we mapped the resulting paired-end reads to the Anolis reference genome using Novoalign, then applied SAMtools to check mapping efficiency and depth. All cleaned data, including paired-end and unpaired reads, were de novo assembled using ABySS [] and individual assemblies were generated under a wide range of k-mers as in []. We used cd-hit-est [], Blat [], and CAP3 [] to merge raw assemblies and reduce redundancy in our libraries. Contiguous sequences (contigs) less than 200 bp were removed. The resulting contigs were mapped to the Anolis reference genome using the BLASTn program []. To evaluate coverage of the mitochondrial genome, we mapped cleaned reads to the mitochondrial reference genome of Anolis carolinensis and used SAMtools to reconstruct the coding sequence of the mitochondrial genome from the MVZ 214979 library.To evaluate if we had accurately recovered phylogenetically useful sequence data, we extracted the consensus sequence from reads mapping to the mitochondrial genome. We aligned our inferred complete mitochondrial sequence to the Anocar2.0 reference genome to assess sequence similarity. We then aligned the NADH dehydrogenase subunit 2 (ND2) sequence recovered from our formalin fixed sample to that available for the Anolis genome, as well as to ND2 data from NCBI for an outgroup, Oplurus cyclurus, and eight additional Anolis species, including the putative sister taxon of A. carolinensis, A. porcatus, and three other close relatives, A. brunneus, A. allisoni, and A. smaragdinus (NCBI IDs: OCU39585, AB218960, AY263042, KJ954109, AF337807, AY902412, AY296151, AY902417, and AY296195). We based this analysis on ND2 alone because this gene is widely available for Anolis species. We then estimated the phylogeny for this alignment using maximum likelihood under the GTR+I+G model in PAUP (version 4.0a142) [] and calculated bootstrap values using maximum likelihood with 100 replicates, also under the GTR+I+G model in Garli []. […]

Pipeline specifications

Software tools Bowtie2, NovoAlign, SAMtools, ABySS, CD-HIT, BLAT, CAP3, BLASTN, PAUP*, GARLI
Databases UCSC Genome Browser
Applications Phylogenetics, Genome data visualization
Organisms Anolis carolinensis
Chemicals Ethanol, Formaldehyde