Computational protocol: Quantifying Pathogen Surveillance Using Temporal Genomic Data

Protocol publication

[…] One may wonder whether alternative methods that account for pathogen evolution would suffice to characterize the genetic surveillance of a pathogen. Phylogenetics has been used in many studies to characterize pathogen surveillance qualitatively, without producing a quantitative measure of sampling completeness. A possible phylogenetic analogue to the q2 coefficient might entail reconstructing a tree from the available sequences and measuring the distribution of branch lengths. The true distance between two isolates, A and B, is represented by the sum of their patristic distances, dA and dB, which are the branch lengths from each respective sequence to their common ancestral node. Sequences are time ordered, however, and if we assume an approximate molecular clock, then dA < dB given that sequence A occurs before sequence B. An estimate of the distance is then the larger patristic distance, dB. A parallel to our q2 coefficient, the p2 coefficient, would predict high surveillance to correspond to a maximal number (#) of patristic distances d to their closest ancestor in the past that are less than 2 years. Moreover, homogeneity of surveillance can be confirmed if the branch lengths d have low variance.

Phylogenies can be divided into those that are distance based and those that are character based. Since the q2 coefficient readily incorporates different genetic distance methods, it is equivalent to any p2 coefficient calculated from distance-based trees. On the other hand, character-based trees, including maximum-likelihood and Bayesian inference methods, incorporate site heterogeneity by considering one character (a site in the alignment) at a time to reconstruct a tree; moreover, Markov chain Monte Carlo (MCMC) methods like BEAST can incorporate relaxed clock rates. The q2 coefficient does not take into account either site or clock rate heterogeneity.

To determine the impact of site and clock rate heterogeneity on quantifying surveillance completeness, we calculated the p2 coefficient of the human H5N1 HA data set of 158 sequences by using BEAST (see Materials and Methods). We accounted for site heterogeneity by using the gamma model and reconstructed trees under both strict and relaxed molecular clocks. We calculated the p2 coefficients to be 0.848 (0.740 to 0.917) and 0.860 (0.721 to 0.911) for the strict and relaxed clocks, respectively. Given our q2 coefficient of 0.821 (0.795 to 0.848) for human H5N1 HA, we concluded that incorporating site heterogeneity and a relaxed molecular clock did not make a significant difference.

While these phylogenetic techniques can examine the fit of a number of evolutionary models, they suffer from problems of robustness. For example, tree topology can be highly unstable; the addition or deletion of a single sequence can radically restructure the tree. Moreover, different methods of phylogenetic inference, such as maximum likelihood or Bayesian inference, can lead to variable results, which complicates the interpretation of surveillance. Finally, computation, particularly with BEAST, can be very expensive; for data sets of more than 1,000 sequences, several weeks may be needed for the MCMC to converge to a stable tree solution. In our analysis of 158 human H5N1 HA sequences, p2 coefficients needed days of computation to complete, whereas q2 coefficient analysis was finished in a matter of seconds. […]
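The p2 idea described in this excerpt can be illustrated with a minimal sketch. The excerpt does not give the formula for p2, so the code below assumes it is the fraction of patristic distances d (to the closest ancestor in the past, in years on a time-scaled tree) that fall below 2 years. The function names, the normalization by the total number of distances, and the example values are all hypothetical and are not taken from the protocol.

```python
from statistics import pvariance


def p2_coefficient(distances_years, threshold_years=2.0):
    """Assumed definition: fraction of patristic distances below the threshold.

    distances_years holds one patristic distance per sequence, measured in
    years to its closest ancestor in the past (hypothetical input format).
    """
    if not distances_years:
        raise ValueError("need at least one patristic distance")
    close = sum(1 for d in distances_years if d < threshold_years)
    return close / len(distances_years)


def branch_length_variance(distances_years):
    """Population variance of the distances; low values suggest homogeneous sampling."""
    return pvariance(distances_years)


# Made-up distances for illustration only, not data from the study.
example_distances = [0.5, 1.2, 3.0, 0.8, 1.9]
print("p2 =", p2_coefficient(example_distances))
print("variance =", branch_length_variance(example_distances))
```

In practice the distances would be read from a time-scaled tree (for example one produced with BEAST and measured with a tool such as PATRISTIC); that extraction step is not shown here because the excerpt does not describe it.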

Pipeline specifications

Software tools: PATRISTIC, BEAST
Application: Phylogenetics
Organisms: Sus scrofa, Influenza A virus, Homo sapiens
Diseases: Dengue