Computational protocol: The paradox of HBV evolution as revealed from a 16th century mummy

Similar protocols

Protocol publication

[…] Reads were demultiplexed using CASAVA-1.8.2 (Illumina, San Diego, California), then adapters were trimmed and reads merged using leeHom [] with aDNA specific settings (—ancientdna). These processed reads were mapped to an appropriate reference genome (HBV genotype D3, GenBank accession X65257; revised Cambridge Reference Sequence for human mtDNA, GenBank accession number, NC_012920) using a network-aware version of the Burrows-Wheeler Aligner [] ( with distance, gap and seed parameters as previously described []. Duplicates were removed based on 5’ and 3’ positions ( Reads shorter than 30 base pairs and with mapping quality less than 30 were removed using Samtools []. The resulting BAM files were processed using mapDamage 2.0 on default settings with plotting and statistical estimation []. Haplogrep v2.1.0 [] using PhyloTree Build 17 [] was used to identify the haplogroup of the mtDNA as U5a1b. The complete genome sequence of NASD24 has been submitted to GenBank and assigned accession number MG585269. [...] We analyzed the ancient HBV sequenced in this study in the context of modern whole-genomes of HBV. To this end we downloaded all human HBV genomes from GenBank that were over 3,000 nt in length and for which the year of sampling was available (all date information, including month and day, if available, was converted into decimal format). This initial GenBank data set comprised 3,696 sequences sampled between 1963 and 2015 (). Sequences were aligned with the MAFFT v7 program using the FFT-NS-1 routine [] to visually check for obvious errors in database labeling.For initial genotypic subtyping, one representative of each HBV subtype was selected (). This data set included the purportedly ancient HBV sequence previously obtained from a 17th century Korean mummy (GenBank accession number JN315779; radiocarbon-dated to 1682 with an error range of 1612–1752 []). The subtype of NASD24SEQ was inferred from maximum likelihood (ML) phylogenetic trees estimated using PhyML v3.0 [] with the GTR+Г4 model of nucleotide substitution and employing SPR branch-swapping, with nodal support assessed by conducting 1000 non-parametric bootstrap replicates.Following this, a random subsample of HBV sequences was taken using the Ape package in R []. Specifically, we sampled five representatives of each genotype and subtype, or the maximum number available if this was not five. This produced a data set of 135 sequences sampled between 1963–2013 which we refer to as subset a (, ). We then added the ancient Italian HBV sequence NASD24SEQ to subset a to make data set a-i with n = 136. A third subset was built by adding the ancient Korean HBV sequence JN315779 to a-i to make subset a-ii with n = 137. We next randomly sampled only D genotype sequences from the initial GenBank data set to form subset b with n = 64 (). To this we added NASD24SEQ, generating subset b-i with n = 65. We aligned the nucleotide sequences in each subset using the L-INS-i routine in MAFFT v7. [...] We conducted a range of analyses to assess the extent of temporal structure in the data sets and to estimate the rate and time-scale of HBV evolution. To initially verify the temporal structure in the data we conducted regressions of root-to-tip genetic distance as a function of the sampling time (year) using TempEst v0.1 []. We then conducted a date-randomization test []. This involved analyzing the data using the Bayesian method implemented in BEAST v1.8.3 [] under a lognormal relaxed clock model [] and assuming a constant population size, with 20 replicates in which the sampling dates were randomized among the sequences. The HBV data were considered to have temporal structure if the mean rate estimate and 95% HPD intervals were not contained within the 95% HPD of any of estimates resulting from the randomized data sets []. All analyses were run with an Markov chain Monte Carlo chain length of 107 steps with samples from the posterior distribution drawn every 2 × 103 steps. After discarding the first 10% of steps as burn-in, we assessed sufficient sampling from the posterior by visually inspecting the trace file and ensuring that the effective sample sizes for all parameters were at least 200.We also performed more detailed Bayesian estimates of evolutionary dynamics. One method of validating the ages of ancient samples is to specify uninformative prior distributions for these and test if the tip dates or an informative rate give the postulated ages. Specifically, if the sequence data and calibrations are informative, the posterior should consist of a narrow distribution that includes the true sampling time of the ancient samples []. However, if the prior and posterior distributions for the ages of the ancient samples are the same, then the data are considered to have insufficient information to estimate the ages of these samples. We conducted these analyses by setting uniform distributions for the two ancient samples between 0 and 107 with a mean of 105. For the Korean sample we set a uniform prior with upper and lower bounds of 400 and 0, whereas for the Italian sample we used 507 and 0, with the maximum values in both cases reflecting their presumed sampling date. We also conducted analyses in which, for the Italian sample, we set a normal truncated prior distribution with the upper and lower values at 507 and 387, and mean of 447 and standard deviation of 10, whereas for the Korean sample we used 400 and 260 for the bounds, and 330 and 35 for the mean and standard deviation, respectively. These numbers are based on the radiocarbon dating analysis, with the upper and lower values reflecting the error margins.We employed three calibration strategies for these analyses. (i) First, we used the sampling times of the modern samples, but with a uniform prior for the rate bounded between 0 and 1. (ii) Second, we assumed that all the modern samples were contemporaneous, and specified internal node calibrations, corresponding to those used by Paraskevis et al. (2015) and Llamas et al. (2016) [, ] and assuming that the spread of HBV corresponds to that of early human populations [, ]. These consisted of setting normal priors for the time of the most recent common ancestor (tMRCA) of F and H genotypes at 16,000 years, with a standard deviation of 1000, setting the tMRCA for subgenotype B6 at 3500 years with a standard deviation of 3000, and setting the tMRCA for subgenotype D4 at 8500 years with a standard deviation of 3500. (iii) Finally, we calibrated the molecular clock by specifying an informative prior for the mean (long-term) substitution rate of HBV based on the estimate by Paraskevis et al. (2013) []. Accordingly, the prior was in the form of a normal truncated prior with mean of 2.2 × 10−6 sub/site/year, standard deviation of 5 × 10−7, and lower and upper bound of 1.5 × 10−6 and 3 × 10−6, respectively. We again used the GTR+Γ4 nucleotide substitution model for these analyses, although a codon-partitioned HKY model was also considered within the internally calibrated analysis. […]

Pipeline specifications

Software tools BaseSpace, leeHom, BWA, SAMtools, mapDamage, HaploGrep, MAFFT, PhyML, APE, TempEst, BEAST
Databases nextstrain
Application Phylogenetics
Organisms Homo sapiens, Variola virus
Diseases Hepatitis B, HIV Infections
Chemicals Cytosine