Computational protocol: Influenza A virus evolution and spatio-temporal dynamics in Eurasian wild birds: a phylogenetic and phylogeographical study of whole-genome sequence data

Protocol publication

[…] Over a period of 15 years, 186 054 samples from 440 different bird species were analysed for the presence of AIVs. Positive isolates were subtyped and sequenced. In collaboration with the National Institutes of Health and the J. Craig Venter Institute, ∼83 full or nearly full genomes and 30 partial genomes of AIVs have been submitted to GenBank. The coding-complete genomes of the influenza viruses were sequenced using a high-throughput next-generation sequencing pipeline at the J. Craig Venter Institute, which included the 454/Roche GS-FLX and the Illumina HiSeq 2000 platforms. Viral RNA was isolated using a ZR 96 Viral RNA kit (Zymo Research). The influenza A genomic RNA segments were simultaneously amplified from 3 μl purified RNA using a multisegment reverse transcription (M-RT)-PCR strategy. The influenza M-RT-PCR amplicons were barcoded and amplified using an optimized SISPA (sequence-independent single primer amplification) protocol. Subsequently, the SISPA amplicons were purified, pooled and size selected (∼800 or ∼200 bp), and the pools were used for both Roche 454 (Roche Diagnostics) and Illumina (Illumina) library construction. Libraries were prepared for sequencing on the 454/Roche GS-FLX platform using Titanium chemistry or for sequencing on the Illumina HiSeq 2000. Samples were sequenced on the 454/Roche GS-FLX and Illumina HiSeq 2000 platforms. The sequence reads were sorted by barcode, trimmed and searched by tblastx against custom nucleotide databases of full-length influenza A segments downloaded from GenBank, to filter out both chimeric influenza sequences and non-influenza sequences amplified during the random hexamer-primed amplification. The reads were binned by segment, and the 454/Roche GS-FLX reads were de novo assembled using the clc_novo_assemble program (CLC Bio).
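The barcode sorting, trimming and segment binning described above can be illustrated with a minimal Python sketch. It is not the J. Craig Venter Institute pipeline: the function names, the barcode-to-sample mapping and the input formats are hypothetical, barcodes are assumed to sit at the 5′ end of each read, and the tblastx search itself is assumed to have been run separately, with only each read's best-hit segment passed in.

```python
from collections import defaultdict


def sort_by_barcode(reads, barcodes):
    """Group reads by their leading barcode and trim the barcode off.

    reads: iterable of (read_id, sequence) pairs.
    barcodes: dict mapping barcode sequence -> sample name (hypothetical).
    Reads matching no barcode are collected under the key None, untrimmed.
    """
    binned = defaultdict(list)
    for read_id, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Trim the barcode so only the insert sequence remains.
                binned[sample].append((read_id, seq[len(barcode):]))
                break
        else:
            binned[None].append((read_id, seq))
    return dict(binned)


def bin_by_segment(best_hits):
    """Bin read IDs by the genome segment of their best database hit.

    best_hits: dict mapping read_id -> segment name (e.g. "HA"), or None
    when the read had no influenza hit. Reads without a hit are dropped,
    mirroring the filtering of non-influenza sequences.
    """
    bins = defaultdict(list)
    for read_id, segment in best_hits.items():
        if segment is not None:
            bins[segment].append(read_id)
    return dict(bins)
```

For example, `sort_by_barcode([("r1", "ACGTTTGA")], {"ACGT": "sampleA"})` places the trimmed read `TTGA` under `sampleA`. The sketch assumes that no barcode is a prefix of another and that barcodes match exactly; a real demultiplexer would typically tolerate sequencing errors.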
The resulting contigs were searched against the corresponding custom full-length influenza segment nucleotide database to find the closest reference sequence for each segment. Both 454/Roche GS-FLX and Illumina HiSeq 2000 reads were then mapped to the selected reference influenza A virus segments using the clc_ref_assemble_long program (CLC Bio). At loci where both the 454/Roche GS-FLX and Illumina HiSeq 2000 sequence data agreed on a variant relative to the reference sequence, the reference sequence was updated to reflect the difference. A final mapping of all next-generation sequences to the updated reference sequences was then performed. Any regions of the viral genomes that were poorly covered or ambiguous after next-generation sequencing were amplified and sequenced using the standard Sanger sequencing approach. These viruses were isolated from different wild bird species, and included different subtypes and sampling locations within West Eurasia throughout the time period of the study. In addition, all full-genome sequences from AIV genomes containing NA1–NA9 and HA1–HA12 available from GenBank were retrieved. All sequences from domestic birds and all sequences related to poultry outbreaks, particularly HPAI H5N1, H7 and H9, were excluded. Our final datasets of matched genome sequences for PB2 (2266 nt), PB1 (2259 nt), PA (2142 nt), HA (1716 nt), NP (1482 nt), NA (1374 nt), MP (979 nt) and NS (838 nt) were aligned with BioEdit version 7.1 (a total of 211 complete genomes; see Table for GenBank accession numbers). [...] Phylogenetic trees for each segment were reconstructed with PhyML version 3.0, using the general time reversible (GTR) nucleotide substitution model with a proportion of invariant sites and a Γ distribution of among-site rate variation, all estimated from the data (ModelTest identified this as the appropriate nucleotide substitution model).
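The reference-update rule described above (change the reference only where both sequencing platforms agree on a variant) can be sketched in Python as follows. This is an illustration under stated assumptions, not the CLC Bio workflow: the function name and the per-platform input format (a dict from 0-based position to the consensus base called from that platform's mapped reads) are hypothetical, and only single-base substitutions are handled, not insertions or deletions.

```python
def update_reference(reference, calls_454, calls_illumina):
    """Return a copy of `reference` updated where both platforms agree.

    reference: str, the selected reference segment sequence.
    calls_454, calls_illumina: dict mapping 0-based position -> consensus
        base called from that platform's reads (hypothetical format).
    A position is changed only if both platforms call the same base and
    that base differs from the reference.
    """
    updated = list(reference)
    for pos, base in calls_454.items():
        agrees = calls_illumina.get(pos) == base
        if agrees and base != reference[pos]:
            updated[pos] = base
    return "".join(updated)
```

With the reference `ACGTACGT`, calls of `{1: "T", 4: "G"}` from one platform and `{1: "T", 4: "C"}` from the other update only position 1, because the platforms disagree at position 4. The updated sequence would then serve as the target for the final mapping round.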
GARLI version 0.96 was run on the best tree from PhyML for 1 million generations to optimize tree topology and branch lengths. [...] To identify potential errors in sequence data annotation that might have affected the clock estimation, we used the reconstructed ML nucleotide trees in Path-O-Gen version 1.3 (http://tree.bio.ed.ac.uk/software/pathogen) to generate linear regression plots of the years of sampling versus root-to-tip distance. We did not observe any anomalies in the eight segment datasets, which all exhibited clock-like behaviour. We estimated rates of evolutionary change (nucleotide substitutions per site per year) and times of circulation of the MRCA (years) with BEAST version 1.7.3 using time-stamped sequence data with a relaxed-clock Bayesian Markov chain Monte Carlo (MCMC) method. For all analyses, the uncorrelated log-normal relaxed molecular clock and a Γ site heterogeneity model with four Γ categories were used in combination with the GTR nucleotide substitution model. A normal rate prior with a mean of 0.0033 substitutions per site per year (sd = 0.0016) was used. These analyses were conducted with a Bayesian Skyline coalescent model, a random starting tree and a constant rate of migration. We performed at least three independent MCMC analyses of at least 100 million steps each to ensure convergence, and combined these analyses after removal of a 10 % burn-in using LogCombiner version 1.7.3. Finally, the MCMC chains were summarized to reconstruct the MCC trees using TreeAnnotator version 1.7.3. Trees were visualized and coloured with FigTree version 1.4.0 (http://tree.bio.ed.ac.uk/software/figtree/). [...] To visualize similarities and differences between the phylogenies, and to investigate reassortment, tanglegrams were generated from the nucleotide substitution MCC trees produced by BEAST, using TreeMap version 1.0 (http://taxonomy.zoology.gla.ac.uk/rod/treemap.html).
These tanglegrams consisted of two rooted phylogenetic trees in which taxa that corresponded to each other in the two trees were connected by lines. In the absence of reassortment, one would expect to see nearly horizontal lines connecting each taxon to its counterpart. […]
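The expectation of nearly horizontal links can be made concrete with a small Python sketch that counts how many pairs of connecting lines cross, given the top-to-bottom tip order of each tree. This is not how TreeMap works internally: the function name and inputs are hypothetical, and the tip orders are assumed to be fixed as drawn, whereas tanglegram software usually rotates branches to reduce crossings first.

```python
from itertools import combinations


def count_crossings(order_left, order_right):
    """Count pairs of connecting lines that cross in a tanglegram.

    order_left, order_right: lists of taxon names in the top-to-bottom
    tip order of the left and right trees. Taxa absent from the right
    tree are ignored. Zero crossings means every link can be drawn
    without intersecting another, as expected without reassortment.
    """
    pos_right = {taxon: i for i, taxon in enumerate(order_right)}
    # Rank of each left-tree taxon in the right tree, in left-tree order.
    ranks = [pos_right[t] for t in order_left if t in pos_right]
    # Two links cross when their relative order differs between the trees.
    return sum(1 for a, b in combinations(ranks, 2) if a > b)
```

For instance, comparing the tip orders `["A", "B", "C", "D"]` and `["A", "C", "B", "D"]` gives one crossing, between the links for B and C. A high count for a pair of segment trees would only suggest incongruence worth inspecting; it is not by itself a test for reassortment.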

Pipeline specifications

Software tools: TBLASTX, BioEdit, PhyML, ModelTest-NG, GARLI, TempEst, BEAST, FigTree, Tanglegrams, TreeMap
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Influenza A virus