Computational protocol: Adaptive evolution during the establishment of European avian‐like H1N1 influenza A virus in swine

Protocol publication

[…] Over 2000 nucleotide sequences of all H1NX (for HA), HXN1 (for NA) and HXNX (for all internal gene segments) isolated from avian and swine hosts before the year 2000, together with their associated metadata, were downloaded from the NCBI Influenza Virus Resource. Sequences shorter than 500 bp were excluded from the analysis. For each gene segment, the large data sets were aligned using MAFFT (Katoh, Misawa, Kuma, & Miyata, ), followed by manual adjustment of the alignment. The sequences were trimmed to include only coding regions for subsequent analysis. Preliminary maximum‐likelihood (ML) analysis was conducted for the large individual gene data sets (PB2, PB1, PA, HA, NP, NA, MP and NS) using RAxML v9.0 (Stamatakis, ). To focus on the adaptation and evolution of the EA‐swine lineage, we used the resulting large ML trees (Figure ; Figs ) to select the EA‐swine lineage and its closest avian lineage (denoted by blue branches in Figure  and Fig. ). The latter included all available avian sequences that exclusively form a sister group to the entire EA‐swine lineage. Distantly related avian lineages, whose heterogeneous evolutionary rates may affect phylogenetic analyses, were therefore excluded from subsequent analyses. All duplicate viruses were manually removed from the data sets, and isolates with 100% identical residues were removed using the webserver of the program CD‐HIT (Huang, Niu, Gao, Fu, & Li, ; Li & Godzik, ). The program TempEst v1.5 was also used to plot root‐to‐tip divergence against sampling time and to remove outliers from the sequence data sets that may have resulted from mislabelled isolation dates. To account for uncertainty in isolation date, isolates were assigned a mid‐year date if the exact date of sampling was unknown. 
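The filtering steps above (length cut-off, removal of identical sequences, mid-year dating) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: they used the CD-HIT webserver and TempEst, whereas this sketch only drops exact duplicate strings. The function names `filter_and_deduplicate` and `decimal_date` and the `(name, sequence)` record layout are assumptions made for illustration.

```python
from datetime import date

MIN_LENGTH = 500  # bp threshold stated in the protocol


def filter_and_deduplicate(records, min_length=MIN_LENGTH):
    """Drop short sequences and exact duplicates.

    records: list of (name, sequence) tuples (hypothetical layout).
    Alignment gaps ('-') are ignored when measuring length.
    Only exact string matches are treated as duplicates; this is a
    simplification of what a clustering tool such as CD-HIT does.
    """
    seen = set()
    kept = []
    for name, seq in records:
        ungapped = seq.upper().replace("-", "")
        if len(ungapped) < min_length:
            continue  # shorter than the cut-off
        if ungapped in seen:
            continue  # identical to a sequence already kept
        seen.add(ungapped)
        kept.append((name, ungapped))
    return kept


def decimal_date(year, month=None, day=None):
    """Convert a sampling date to a decimal year.

    If the exact date is unknown (month or day missing), the isolate
    is placed at mid-year, as described in the protocol.
    """
    if month is None or day is None:
        return year + 0.5
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    return year + (date(year, month, day) - start).days / days_in_year
```

For example, `decimal_date(1979)` places an isolate with only a known year at 1979.5, which is the kind of tip date a root-to-tip regression or a BEAST analysis would then use.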
For each segment, the reduced data sets (comprising the EA‐swine lineage and its closely related avian lineages) were used to reconstruct ML trees using a generalized time reversible nucleotide substitution model plus gamma distributed rates among sites (GTR+Γ) in PhyML v3.0 (Guindon et al., ). [...] For each gene segment, evolutionary rates and temporal phylogenies were estimated in BEAST v1.8.2 (Drummond, Suchard, Xie, & Rambaut, ). An uncorrelated lognormal relaxed clock model within a Bayesian Markov chain Monte Carlo (MCMC) framework was used, with a Gaussian Markov random field coalescent tree prior. At least two independent MCMC runs of 100 million steps were performed and combined to ensure adequate sampling of all parameters, with a 10%–20% “burn‐in” removed from each run. The relevant statistics and values were obtained directly from the combined log files using the program Tracer v1.6. Bayes factors (BF) for statistical support of differences between estimated time to most recent common ancestor (TMRCA) and nucleotide substitution rate values were calculated as described previously (Bahl, Vijaykrishna, Holmes, Smith, & Guan, ), where BF ≥ 150 indicates very strong support, 150 > BF ≥ 20 indicates strong support, and 20 > BF ≥ 3 indicates support. [...] The degree of natural selection was estimated as previously described (Joseph et al., ). Briefly, the ratio of nonsynonymous to synonymous substitutions per codon (dN/dS ratio) was estimated for each segment data set using the single‐likelihood ancestor counting (SLAC) method (Kosakovsky Pond & Frost, ) run through the Datamonkey webserver (Delport, Poon, Frost, & Kosakovsky Pond, ) with user‐supplied ML trees (as above). Specific amino acid sites under selection were identified using the Tdg09 program (Tamuri, Dos Reis, Hay, & Goldstein, ), with the statistical cut‐off set at a false discovery rate (FDR) of 0.20. 
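The burn-in removal and the Bayes factor thresholds described above can be sketched in Python as follows. This is only an illustration of the stated rules, not the procedure used to compute the Bayes factors themselves (that follows Bahl et al. and is not reproduced here). The function names and the "not supported" label for BF < 3 are assumptions.

```python
def combine_runs(runs, burn_in_fraction=0.1):
    """Discard the leading burn-in samples of each run, then pool the rest.

    runs: list of lists of sampled parameter values, one list per
    independent MCMC run (hypothetical in-memory representation of
    one column of a BEAST log file).
    """
    combined = []
    for samples in runs:
        n_drop = int(len(samples) * burn_in_fraction)
        combined.extend(samples[n_drop:])
    return combined


def classify_bayes_factor(bf):
    """Map a Bayes factor to the support categories given in the text."""
    if bf >= 150:
        return "very strong"
    if bf >= 20:
        return "strong"
    if bf >= 3:
        return "supported"
    return "not supported"  # label assumed; the text defines no category below 3
```

With these thresholds, a Bayes factor of 20 falls in the "strong" category and a value just below 20 falls in "supported", matching the half-open intervals given in the protocol.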
Ancestral codon substitutions at nodes were determined using the baseml program of the PAML suite v4.7 (Yang, ) and transcribed onto trees generated with RAxML v8.1.6 (Stamatakis, ) using the treesub program (Tamuri, ), as described previously (Su et al., ). [...] To identify intralineage reassortment events in the EA‐swine virus lineage, we used the software Dendroscope v3.0 (Huson & Scornavacca, ) to generate tanglegrams from the ML phylogenies. The ML phylogenies for individual gene segments of 38 EA‐swine isolates were reconstructed using IQ‐Tree v1.3.0 (Nguyen, Schmidt, von Haeseler, & Minh, ). The runs were performed using 10,000 ultrafast bootstrap replicates and automatic selection of the best‐fit substitution model. The resulting ML phylogenies were rooted using the earliest EA‐swine strain (A/swine/Arnsberg/1979). To infer and visualize the reassortment events, auxiliary lines were then drawn between the same set of virus isolates in the phylogenies of two gene segments (i.e., between HA and non‐HA phylogenies). […]
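The idea behind the auxiliary lines can be illustrated with a small Python sketch: each isolate present in both trees is joined by a connector between its tip positions, and connectors that cross point to isolates whose placement differs between the two segments. This is not Dendroscope's algorithm, and a crossing is only a prompt for inspection, since rotating branches around a node can introduce or remove crossings without any change in topology. The function names and the representation of each tree as an ordered list of tip labels are assumptions.

```python
from itertools import combinations


def tanglegram_connectors(order_a, order_b):
    """Pair up tips shared by two trees.

    order_a, order_b: tip labels listed top to bottom as drawn in each
    tree (hypothetical input; a real analysis would read the tip order
    from the tree files). Returns (label, position_in_a, position_in_b).
    """
    pos_b = {name: i for i, name in enumerate(order_b)}
    return [(name, i, pos_b[name])
            for i, name in enumerate(order_a) if name in pos_b]


def count_crossings(connectors):
    """Count pairs of connectors that cross each other.

    Two connectors cross when the relative order of their endpoints
    differs between the two trees.
    """
    crossings = 0
    for (_, a1, b1), (_, a2, b2) in combinations(connectors, 2):
        if (a1 - a2) * (b1 - b2) < 0:
            crossings += 1
    return crossings
```

A usage example with placeholder labels: for an HA tip order `["iso1", "iso2", "iso3", "iso4"]` and a second segment ordered `["iso1", "iso3", "iso2", "iso4"]`, the connectors for `iso2` and `iso3` cross, flagging that pair for closer inspection.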

Pipeline specifications

Software tools: MAFFT, RAxML, CD-HIT, TempEst, PhyML, BEAST, Datamonkey, PAML, Dendroscope, Tanglegrams, IQ-TREE
Databases: NCBI Influenza Virus Resource
Applications: Phylogenetics, Population genetic analysis
Organisms: Sus scrofa, Anas platyrhynchos, Gallus gallus, Homo sapiens, Human poliovirus 1 Mahoney, Viruses
Diseases: HIV Infections
Chemicals: Amino Acids