Computational protocol: Comparative analysis of hepatitis C virus phylogenies from coding and non-coding regions: the 5' untranslated region (UTR) fails to classify subtypes

Protocol publication

[…] We used multiple methods for phylogenetic inference, including neighbor joining (NJ), maximum parsimony (MP), and maximum likelihood (ML), to evaluate whether the inferential technique influences the ability of the resulting phylogenies to resolve subtypes into clades. All phylogenetic inference was performed in PAUP*, version 4.0b10. Neighbor-joining trees were constructed with the F84 distance metric and the BioNJ algorithm. For parsimony analyses, uninformative invariant characters were excluded and gaps were treated as a fifth character state.
To select an appropriate nucleotide substitution model, we used FindModel, an independent, online implementation of ModelTest. This approach uses an information-based goodness-of-fit criterion, in the sense that the best model minimizes the number of bits required to encode both the model and the model-encoded data for electronic transmission. Such an approach includes a penalty term for the number of parameters, and thus facilitates comparison of models with different numbers of parameters. The fit of each model to the data was evaluated both with and without a four-category discrete approximation to a gamma distribution of substitution rates per site. Because FindModel does not test models with invariant sites, we also used ModelTest (version 3.6) to evaluate nucleotide substitution models with invariant sites. Akaike's information criterion (AIC) was used to quantify how well alternative models with different numbers of parameters fit the data.
[...] To better understand phylogenetic inconsistencies across the HCV genome, we computed the character consistency index (CI) for each site in PAUP* on the whole-genome phylogeny, and summarized CI with a moving-window (running) average using window sizes of 100, 300, and 500 nt. The 100 nt window was used subsequently because it allows clear visualization of the 342 nucleotides that constitute the 5' UTR.
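The moving-window summary of per-site CI described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name running_mean and the toy input values are hypothetical, and the per-site CI values are assumed to have already been exported from PAUP* as a plain list of numbers, one per alignment position.

```python
def running_mean(values, window):
    """Return the moving-window (running) average of `values`.

    One average is produced per full window, so the result has
    len(values) - window + 1 entries. `values` is assumed to hold
    one CI value per alignment site (hypothetical input format).
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")

    # Sum of the first window, then slide one site at a time:
    # add the site entering the window, subtract the one leaving it.
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


# Toy per-site CI values (made up for illustration only).
per_site_ci = [1.0, 0.5, 1.0, 0.5]
smoothed = running_mean(per_site_ci, window=2)
```

With real data the same call would be repeated for the three window sizes named in the text (100, 300, and 500 nt). How sites with an undefined CI should be handled is not stated in the source and is left out of this sketch.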
Because the consistency and homoplasy indices (HI) are complementary (CI + HI = 1), character consistency is high when homoplasy is low, and vice versa. We therefore expect regions with fewer informative sites to show lower homoplasy. Further, homoplasy decreases rapidly with decreasing substitution rates. To control for variation in the number of informative sites across the genome, we rescaled the homoplasy index against the square of the proportion of informative sites in the window region. This was done because, in the limit of short branch lengths, the number of informative sites should be proportional to the substitution rate r, while the number of homoplasies should be proportional to r². The result was subsequently normalized against its maximum to facilitate comparison with the proportion of informative sites. As a result, if all parts of the HCV genome are equally informative, the rescaled homoplasy index is expected to be roughly constant over the viral genome. […]
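The rescaling argument can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: "rescaled against" is read here as division, so each window's homoplasy index HI = 1 - CI is divided by p², where p is the proportion of informative sites in that window. With HI proportional to r² and p proportional to r, the ratio HI / p² would then be roughly constant. The function name and the toy inputs are hypothetical.

```python
def rescaled_homoplasy(ci_by_window, informative_fraction):
    """Rescale windowed homoplasy by the squared informative-site fraction.

    ci_by_window: mean consistency index (CI) per window.
    informative_fraction: proportion of informative sites per window.
    Assumption: "rescaled against" means division by p**2.
    Windows with no informative sites yield NaN.
    """
    rescaled = []
    for ci, p in zip(ci_by_window, informative_fraction):
        hi = 1.0 - ci  # CI + HI = 1
        rescaled.append(hi / p ** 2 if p > 0 else float("nan"))

    # Normalize against the maximum finite value, ignoring NaN entries.
    finite = [x for x in rescaled if x == x]
    peak = max(finite, default=0.0)
    if peak > 0:
        rescaled = [x / peak for x in rescaled]
    return rescaled


# Toy values (made up): two windows with equal informative-site fractions.
example = rescaled_homoplasy([0.5, 0.75], [0.5, 0.5])
```

After normalization the largest finite value is 1, which puts the rescaled index on the same 0 to 1 scale as the proportion of informative sites.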

Pipeline specifications

Software tools: BIONJ, ModelTest-NG, PAUP*
Application: Phylogenetics
Organisms: Hepacivirus C, Human poliovirus 1 Mahoney
Diseases: Hepatitis C, Infection