Computational protocol: Reassortment between Influenza B Lineages and the Emergence of a Coadapted PB1–PB2–HA Gene Complex

Similar protocols

Protocol publication

[…] We compiled a primary data set of 452 complete influenza B genomes from GISAID () dating from 1984 to 2012. The longest protein-coding region of each segment was extracted and used for all further analyses. We thus assume that homologous recombination has not taken place and that the evolutionary history of the whole segment can be inferred from the longest coding sequence in the segment. To date, there has been little evidence of homologous recombination in influenza viruses (; ; ). The segments of each strain were assigned to either Vic or Yam lineage by making maximum-likelihood trees of each segment using PhyML () and identifying whether the isolate was more closely related to B/Victoria/2/87 or B/Yamagata/16/88 sequences in that segment, with the exception of the NS segment as B/Victoria/2/87 was a reassortant and possessed a Yam lineage NS (). B/Czechoslovakia/69/1990 was considered as being representative of Vic lineage for the NS segment. Every segment in each genome thus received either a Vic or a Yam lineage designation, for example, the strain B/Victoria/2/87 received V–V–V–V–V–V–V–Y, as its NS segment is derived from the Yam lineage and the rest of the genome is Vic.We also collated a secondary data set from all complete influenza B virus genomes available on GenBank as of May 5, 2014. After removing isolates that had considerable portions of any sequence missing, were isolated prior to 1980, or were suspected of having a contaminant sequence in any segment, we were left with 1,603 sequences. This data set only became available after all primary analyses were performed, are mainly from Australia, New Zealand and the United States, and are too numerous to analyze in BEAST (). PhyML () was used to produce phylogenies of each segment, and the lineage of each isolate was determined based on grouping with either B/Victoria/2/87 or B/Yamagata/16/88 sequences, as described above. By associating strains with lineage identity of each of their segments, we reconstructed the most parsimonious interlineage reassortment history for the secondary data set. The secondary data set was used to check how representative the primary data set was, to estimate LD, and to broadly confirm our results. All analyses pertain to the primary data set unless stated otherwise.Temporally calibrated phylogenies were recovered for each segment in the primary data set using Markov chain Monte Carlo (MCMC) methods in the BEAST software package (). We modeled the substitution process using the Hasegawa–Kishino–Yano model of nucleotide substitution (), with separate transition models for each of the three codon partitions, and additionally estimated realized synonymous and nonsynonymous substitution counts (). We used a flexible Bayesian skyride demographic model (). We accounted for incomplete sampling dates for 94 sequences (of which 93 had only year and 1 had only year and month of isolation) whereby tip date is estimated as a latent variable in the MCMC integration. A relaxed molecular clock was used, where branch rates are drawn from a lognormal distribution (). We ran three independent MCMC chains, each with 200 million states, sampled every 20,000 steps and discarded the first 10% of the MCMC states as burn-in. After assessing convergence of all three MCMC chains by visual inspection using Tracer (), we combined samples across chains to give a total of 27,000 samples from the posterior distribution of trees.Every sequence was assigned seven discrete traits in BEAUti corresponding to the lineages of all other segments with which a strain was isolated, for example, PB1 tree had PB2, PA, HA, NP, NA, MP, and NS as traits and V or Y as values for each trait. We inferred the ancestral state of lineages in each segment by modeling transitions between these discrete states using an asymmetric transition matrix () with Bayesian stochastic search variable selection to estimate significant rates. Because the posterior set of trees for a single segment has branches labeled with the inferred lineage in the remaining seven segments, we can detect interlineage reassortments between pairs of segments by observing state transitions, that is, Yam to Vic or Vic to Yam (). In addition, by reconstructing the ancestral state of all other genomic segments jointly we can infer coreassortment events when more than one trait transition occurs on the same node in a tree. Interphylogeny labeling approaches have been extensively used in the past to investigate reticulate evolution in influenza A viruses and HIV (; ; ). [...] We express the normalized distance ΔTMRCA between trees belonging to two segments A and B for a particular posterior sample i, following (1)ΔTMRCA(Ai,Bi)=δTMRCA(Ai,Ai′)+δTMRCA(Bi,Bi′)2 δTMRCA(Ai,Bi), where δTMRCA(Ai,Bi)=1n∑j=1ng(Aij,Bij) and n is the total number of pairwise comparisons available between sets of tips. Thus, g(Aij,Bij) is the absolute difference in TMRCA of a pair of tips j, where the pair is drawn from the ith posterior sample of tree A and the ith posterior sample of tree B. Additionally, δTMRCA(Ai,A′i) is calculated from the ith posterior sample of tree A and ith posterior sample of an independent analysis of tree A (which we refer to as A′), which is used in the normalization procedure to control for variability in tree topology stability over the course of the MCMC chain (see supplementary figs. S6 and S7, Supplementary Material online). We had three replicate analyses of each segment and in order to calculate δTMRCA(Ai,A′i) we used analyses numbered 1, 2 and 3 as A and analyses numbered 2, 3 and 1 as A′, in that order. We subsampled our combined posterior distribution of trees to give a total of 2,700 trees on which to analyze ΔTMRCA.Calculating the normalized ΔTMRCA(Ai,Bi) for each MCMC state provides us with a posterior distribution of this statistic allowing specific hypotheses regarding similarities between the trees of different segments to be tested. Our approach exploits the branch scaling used by BEAST (), as the trees are scaled in absolute time and insensitive to variation in nucleotide substitution rates between segments, allowing for direct comparisons between TMRCAs in different trees. In the absence of reassortment we expect the tree of every segment to recapitulate the “virus tree,” a concept analogous to “species trees” in population genetics. Our method operates under the assumption that the segment trees capture this “virus tree” of influenza B viruses quite well. It is not an unreasonable assumption, given the seasonal bottlenecks influenza viruses experience. This makes it almost certain that influenza viruses circulating at any given time point are derived from a single genome that existed in the recent past. The δTMRCA statistic essentially quantifies the temporal distance between admixture events and nodes in the virus tree (see supplementary fig. S17, Supplementary Material online). We normalize δTMRCA values to get ΔTMRCA, a measure which quantifies the extent to which the similarity of two independent trees resembles phylogenetic noise. The δTMRCA statistic is an extension of patristic distance methods and has previously been used to tackle a wide variety of problems, as phylogenetic distance in predicting viral titer in Drosophila infected with viruses from closely related species (), and to assess temporal incongruence in a phylogenetic tree of amphibian species induced by using different calibrations (). […]

Pipeline specifications

Software tools PhyML, BEAST, PATRISTIC
Applications Phylogenetics, Population genetic analysis
Diseases Influenza, Human, Lymphoma, B-Cell