Computational protocol: The Effect of Nonreversibility on Inferring Rooted Phylogenies

Protocol publication

[…] We next investigated the performance of the NR and NR2 models on a real biological data set for which there is broad biological consensus on the root position. The lineage leading to Saccharomyces cerevisiae (brewer’s yeast) and its relatives underwent a conserved whole-genome duplication (WGD) about 100 million years ago. Evidence for this WGD, in the form of duplicated genes and genomic regions, is shared by all post-WGD yeasts and defines the group as a clade from which the root of the Saccharomycetales is excluded. The root inferred through outgroup analysis separates a clade comprising Eremothecium gossypii, Eremothecium cymbalariae, Kluyveromyces lactis, Lachancea kluyveri, Lachancea thermotolerans, and Lachancea waltii from the other species. We analyzed an alignment of concatenated large and small subunit ribosomal DNA sequences for 20 yeast species, with a combined length of 4,460 bp. The sequences were aligned with MUSCLE, and poorly aligned regions were detected and removed using trimAl. The alignment is available online. We analyzed this data set with the NR and NR2 models, using both the Yule prior and the structured uniform prior. In the analysis with the structured uniform prior, the root split supported by outgroup rooting has the highest posterior probability (root 1) for both models. However, there is substantial uncertainty, reflected in the nonnegligible posterior probabilities of the other root splits; for example, the second most plausible root is located within the post-WGD clade (root 2). This posterior uncertainty is also reflected in the sensitivity of the analysis to the topological prior: although the structured uniform prior recovered the root supported by the outgroup analysis with the highest posterior support, the Yule prior instead recovered this root with the second-highest support.
The most plausible root inferred with the Yule prior is placed within the post-WGD clade (root 2), contradicting the WGD analysis. The posterior for Huelsenbeck’s I statistic is suggestive of a nonnegligible degree of nonreversibility in the data (the posterior mean is 0.2 for the analysis with the NR model and 0.14 for the analysis with the NR2 model). In our simulations, larger values of I were generally required to infer the true root with high posterior probability. However, the support offered to the widely accepted outgroup root in this analysis shows that it is possible to extract useful root information even though the data suggest only a modest degree of nonreversibility.
The unrooted topologies of the rooted majority rule consensus trees from the analyses with the two topological priors differ from that supported by the WGD analysis in the placement of Vanderwaltozyma polyspora. Although the WGD analysis places it within the post-WGD clade, in our analysis this taxon is located within the pre-WGD clade. This result is consistent with our posterior inferences from fitting the HKY85 and GTR models. Interestingly, it is also consistent with the analysis performed with the site-heterogeneous CAT-GTR model, where V. polyspora is, again, excluded from the post-WGD clade (not shown). The placement of V. polyspora outside the WGD clade is surprising given that the genome of V. polyspora preserves evidence of having undergone WGD. Although this result requires further investigation, the similarity between the consensus trees obtained with the CAT-GTR model and our nonreversible models suggests that the nonreversible models can not only extract meaningful information about the root position, but also capture information for inferring the unrooted topology.
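To make the I statistic quoted above more concrete, the following is a minimal sketch that assumes the form usually attributed to Huelsenbeck and colleagues, I = sum over i < j of |pi_i q_ij - pi_j q_ji|, where pi is the stationary distribution of the rate matrix Q. That formula, the function names, and both example matrices are assumptions made for illustration; they are not taken from the analysis described here, and the value of I depends on how Q is scaled.

```python
def complete_rate_matrix(rates):
    """Set each diagonal entry so that every row of the rate matrix sums to zero."""
    n = len(rates)
    return [[-(sum(rates[i]) - rates[i][i]) if i == j else rates[i][j]
             for j in range(n)] for i in range(n)]


def stationary_distribution(Q, iterations=5000):
    """Approximate the stationary distribution of Q by power iteration."""
    n = len(Q)
    # Uniformization: P = I + Q / rate is a stochastic matrix that shares
    # its stationary vector with Q.
    rate = 1.1 * max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(pi)
        pi = [p / total for p in pi]  # guard against rounding drift
    return pi


def irreversibility(Q):
    """Assumed form of I: summed absolute violations of detailed balance."""
    pi = stationary_distribution(Q)
    n = len(Q)
    return sum(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i])
               for i in range(n) for j in range(i + 1, n))


# Hypothetical reversible matrix: q_ij = freq_j satisfies detailed balance.
freqs = [0.1, 0.2, 0.3, 0.4]
reversible = complete_rate_matrix(
    [[0.0 if i == j else freqs[j] for j in range(4)] for i in range(4)])

# Hypothetical strongly nonreversible matrix: states change only in a cycle.
cyclic = complete_rate_matrix(
    [[1.0 if j == (i + 1) % 4 else 0.0 for j in range(4)] for i in range(4)])

print(irreversibility(reversible), irreversibility(cyclic))
```

Under this assumed definition, a reversible matrix gives a value of essentially zero, while the cyclic matrix, in which probability flows in one direction only, gives a much larger value.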
However, the minor mismatch of the topologies inferred in our analysis with that supported by WGD and outgroup analyses confirms the presence of some features of the data that our models do not account for. For example, ribosomal RNA function depends on the molecule folding into a complex three-dimensional shape. Interactions among sites that are distant in the primary sequence, but close in the three-dimensional structure, are likely to induce site-specific selective constraints that are not accounted for in our models. Thus, further refinement of the models, for instance by allowing compositional heterogeneity across sites, might be necessary for them to provide better insight into the evolution of paleopolyploid yeasts.
It is worth noting that the root split on the majority rule consensus tree does not match the marginal posterior modal root split. This happens because the consensus tree is a conditional summary, computed recursively from the leaves to the root, which depends upon the plausibility of subclades. On the other hand, the posterior over root splits is a marginal summary that averages over the relationships expressed elsewhere in the tree; see Appendix B for an illustrative example. […]
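The difference between the two summaries can be reproduced on a toy posterior sample. The sketch below is a hypothetical four-taxon example, not the yeast data and not the example from Appendix B: trees are written as nested tuples, and all taxon labels, sample counts, and function names are invented for illustration. It computes the marginal posterior over root splits and, separately, the root of a majority rule consensus built from clades present in more than half of the samples.

```python
from collections import Counter


def leaf_set(node):
    """Return the taxon labels below a node (a label or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def proper_clades(tree):
    """Collect the leaf sets of all internal nodes other than the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def root_split(tree):
    """The partition of the taxa induced by the children of the root."""
    return frozenset(leaf_set(child) for child in tree)


def root_split_posterior(samples):
    """Marginal summary: relative frequency of each root split."""
    counts = Counter(root_split(tree) for tree in samples)
    return {split: n / len(samples) for split, n in counts.items()}


def consensus_root_children(samples):
    """Children of the root in a majority rule (> 50%) consensus tree."""
    counts = Counter(c for tree in samples for c in proper_clades(tree))
    majority = {c for c, n in counts.items() if n / len(samples) > 0.5}
    maximal = {c for c in majority if not any(c < other for other in majority)}
    covered = frozenset().union(*maximal) if maximal else frozenset()
    singletons = {frozenset([t]) for t in leaf_set(samples[0]) - covered}
    return frozenset(maximal | singletons)


# Hypothetical posterior sample of 20 rooted trees on taxa a, b, c, d.
balanced = (("a", "b"), ("c", "d"))
c_outside = ("c", ("d", ("a", "b")))
a_outside = ("a", ("b", ("c", "d")))
samples = [balanced] * 4 + [c_outside] * 9 + [a_outside] * 7

posterior = root_split_posterior(samples)
modal_split = max(posterior, key=posterior.get)
print("modal root split:", modal_split, posterior[modal_split])
print("consensus root:", consensus_root_children(samples))
```

In this toy sample the clades {a, b} and {c, d} each appear in more than half of the trees, so the consensus tree is rooted between them, yet only 4 of the 20 trees actually have that root split. The modal root split, which places c alone on one side, appears in 9 of 20 trees.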

Pipeline specifications

Software tools: MUSCLE, trimAl
Application: Nucleotide sequence alignment
Organisms: Saccharomyces cerevisiae
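
The two tools listed above could be chained as in the following sketch. The protocol does not report program versions, trimming mode, or file names, so everything here is an assumption: the commands use the MUSCLE v3-style `-in`/`-out` options, trimAl's `-automated1` heuristic, and hypothetical FASTA file names. By default the function only returns the planned commands; it executes them only when `dry_run=False` and both programs are installed.

```python
import subprocess


def muscle_command(raw_fasta, aligned_fasta):
    """Build a MUSCLE v3-style call (MUSCLE v5 uses -align / -output instead)."""
    return ["muscle", "-in", raw_fasta, "-out", aligned_fasta]


def trimal_command(aligned_fasta, trimmed_fasta, mode="-automated1"):
    """Build a trimAl call; the trimming mode is an assumption, not from the source."""
    return ["trimal", "-in", aligned_fasta, "-out", trimmed_fasta, mode]


def align_and_trim(raw_fasta, aligned_fasta, trimmed_fasta, dry_run=True):
    """Return the two commands in order, running them unless dry_run is set."""
    commands = [
        muscle_command(raw_fasta, aligned_fasta),
        trimal_command(aligned_fasta, trimmed_fasta),
    ]
    if not dry_run:
        for command in commands:
            # check=True stops the pipeline if either tool fails.
            subprocess.run(command, check=True)
    return commands


# Hypothetical file names for the concatenated rDNA sequences.
for command in align_and_trim("rdna_raw.fasta", "rdna_aligned.fasta",
                              "rdna_trimmed.fasta"):
    print(" ".join(command))
```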