Computational protocol: Phylogenetic Tree Reconstruction Accuracy and Model Fit when Proportions of Variable Sites Change across the Tree

Protocol publication

[…] We generated data using our newly developed simulator LineageSpecificSeqgen (), an extension to the Seq-Gen program () that allows generation of sequences with both changes in the proportions of variable sites (Pvar) and changes in the variable/invariable switch rate of the covarion model (). One hundred DNA data sets of 10,000 nucleotides each were generated along the 4-, 6-, 8-, and 16-taxon trees depicted in . We used the default option of LineageSpecificSeqgen, in which branch lengths are defined as the expected number of substitutions per variable site, as opposed to the expected number of substitutions per site (which is averaged over all sites, including invariable sites). The advantage of this setting is that it is more intuitive: the input branch lengths are used directly, and the rate of the variable sites is not increased (rescaled) to compensate for the invariable sites when the data are generated. This results in the simulation of more moderate rates than under the alternative setting, in which branch lengths are the expected number of substitutions per site (see for further detail). The setting used does not affect tree estimation, as the expected number of substitutions per site will be estimated from the data.
The Jukes–Cantor (JC) model () of nucleotide substitution was used both with and without the covarion model of ; the proportion of sites that are variable under the covarion model was set to 0.6, and the rate of change from variable to invariable and vice versa was set to 0.1. As illustrated in , a site can be invariable in a given section of the tree if 1) it is part of the proportion of sites that are invariable (Pinv) or 2) it is part of the proportion of sites that are variable (Pvar) but is invariable (“off”) under the covarion model. At the root, 80% of the sites were set as invariable (i.e., Pinv=0.8 and Pvar=0.2).
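The relationship between the two branch-length conventions can be illustrated with a minimal sketch. It assumes a plain invariable-sites setting with no covarion switching, so that the per-site expectation is simply the per-variable-site value scaled by Pvar. The function names and the example branch length of 0.1 are hypothetical and are not taken from LineageSpecificSeqgen.

```python
import math


def per_site_from_per_variable_site(branch_length, p_var):
    """Average substitutions per site when only a fraction p_var of sites can change.

    Assumption: no covarion on/off switching, so invariable sites contribute zero.
    """
    if not 0.0 <= p_var <= 1.0:
        raise ValueError("p_var must lie in [0, 1]")
    return branch_length * p_var


def jc_expected_difference(branch_length, p_var=1.0):
    """Expected proportion of sites that differ across one branch under JC.

    branch_length is in expected substitutions per variable site.
    """
    p_diff_variable = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    return p_var * p_diff_variable


if __name__ == "__main__":
    root_p_var = 0.2      # Pvar at the root, as stated in the text
    branch_length = 0.1   # hypothetical input branch length
    print(per_site_from_per_variable_site(branch_length, root_p_var))
    print(jc_expected_difference(branch_length, root_p_var))
```

Under these assumptions, an input branch length is left untouched for the variable sites, and the smaller per-site average is only a derived quantity, which matches the description of rates not being rescaled.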
Changes in the proportion of variable sites (Pvar), termed “events”, were introduced at 2 positions on the trees, marked “1st_event” and “2nd_event” (); in each of these 2 events, Pvar+=(0,5,10,15,20,25,30,35,40,45,50) percent of the invariable sites were reset to be variable. Unless otherwise stated, the 2 events were set to be correlated, so that the positions of the sites that switch state are identical.
Although the simulation tree used is very specific, we believe that the parameters used are of great relevance to phylogenetic studies. By placing the 2 events on nonsister branches, we deliberately selected a situation that we expect to be problematic for phylogenetic methods; it seems more important to focus attention on cases where phylogenetic methods may be misled than on situations (e.g., events on sister taxa) where there may be a positive bias toward recovering the correct tree. We chose a high proportion of sites to be invariant at the root of the tree based on the suggestions of who found that (in the case of mammalian cytochrome c) when a single species is considered, more than 90% of the codons are invariant. We considered both fully correlated and uncorrelated events to demonstrate the effect this setting has on the results (accuracy still decreases with uncorrelated events, although more slowly than with correlated events). Of course, many other interesting settings are possible. [...]
For each simulated data set, we conducted a Bayesian analysis using MrBayes version 3.1 () under 5 different models: JC, JC with invariable sites (JC + I), JC with a gamma distribution of rates across sites (JC + G), JC with invariable sites and a gamma distribution (JC + I + G), and JC with the covarion model (JC + Cov). Four chains (3 heated) were run for 2,000,000 generations with the default settings.
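The Pvar events described above can be sketched as follows. This is an illustration only, not the LineageSpecificSeqgen implementation: the function names are hypothetical, the first sites are arbitrarily treated as the variable ones, and it is assumed that each event draws its switching sites from the sites that are invariable at the root.

```python
import random


def p_var_after_event(p_var, p_var_plus_percent):
    """Pvar after an event that resets a percentage of the invariable sites to variable."""
    p_inv = 1.0 - p_var
    return p_var + p_inv * p_var_plus_percent / 100.0


def pick_switching_sites(invariable_sites, p_var_plus_percent, rng):
    """Choose which invariable site positions become variable in one event."""
    n_switch = round(len(invariable_sites) * p_var_plus_percent / 100.0)
    # sorted() gives random.sample the sequence it requires.
    return set(rng.sample(sorted(invariable_sites), n_switch))


def two_events(n_sites, p_var, p_var_plus_percent, correlated=True, seed=0):
    """Return the site positions switched in the 1st and 2nd events."""
    rng = random.Random(seed)
    n_variable = round(n_sites * p_var)
    # Assumption: sites 0..n_variable-1 are variable at the root, the rest invariable.
    invariable = set(range(n_variable, n_sites))
    first = pick_switching_sites(invariable, p_var_plus_percent, rng)
    if correlated:
        second = set(first)  # identical positions in both events
    else:
        second = pick_switching_sites(invariable, p_var_plus_percent, rng)
    return first, second


if __name__ == "__main__":
    print(p_var_after_event(0.2, 50))
    first, second = two_events(10_000, 0.2, 50, correlated=True)
    print(len(first), first == second)
```

With the root values from the text (Pvar=0.2, Pinv=0.8), the largest setting of Pvar+=50 moves half of the invariable sites, i.e. 40% of all sites, into the variable class on the affected lineage.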
Pilot runs using the more complex models (JC + I + G and JC + Cov) were examined for convergence in Tracer version 1.4 () and used to choose an appropriate burn-in (sump and sumt burn-in=5000; this equals 50,000 generations). Maximum parsimony (MP) analysis was conducted using PAUP* version 4.0b10 (with default settings except for HSearch NBest=1).
For the model incorporating covarion evolution (JC + Cov), we used the covarion model of . described an extension to this model with underlying variable rates across sites (a rate for each site is first drawn from a gamma distribution) and an overlying covarion process. Under this model, a site can be variable, in which case its rate is taken from the gamma distribution, or invariable; an invariable site can become variable and vice versa. This model is implemented in a Bayesian framework in MrBayes. However, we encountered problems when combining JC with variable rates across sites and the covarion model (JC + Hue). In many cases, applying both of these models to our data resulted in convergence on positive log likelihoods! Similar problems with MCMC using parameter-rich models have been previously reported (). We reported these problems in April 2008 using the MrBayes bug report tool (http://sourceforge.net/tracker/index.php?func=detail&aid=1945304&group_id=129302&atid=714418). […]
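The five Bayesian analyses described above could be set up with a small helper that writes a MrBayes command block per model. This is a hedged reconstruction, not the command file used in the study: the mapping of model names to `lset` options, the fixed equal state frequencies used to obtain JC, and the `set autoclose=yes nowarn=yes` line are assumptions, and the helper name `mrbayes_block` is hypothetical. Only the run length, the number of chains, and the burn-in value come from the text.

```python
# Assumed mapping from the model labels in the text to MrBayes lset options.
MODELS = {
    "JC": "lset nst=1 rates=equal;",
    "JC+I": "lset nst=1 rates=propinv;",
    "JC+G": "lset nst=1 rates=gamma;",
    "JC+I+G": "lset nst=1 rates=invgamma;",
    "JC+Cov": "lset nst=1 rates=equal covarion=yes;",
}


def mrbayes_block(model, ngen=2_000_000, nchains=4, burnin=5000):
    """Build a MrBayes block for one model as a single string."""
    lines = [
        "begin mrbayes;",
        "  set autoclose=yes nowarn=yes;",      # assumption: non-interactive run
        "  " + MODELS[model],
        "  prset statefreqpr=fixed(equal);",    # equal base frequencies, as JC requires
        f"  mcmc ngen={ngen} nchains={nchains};",
        f"  sump burnin={burnin};",
        f"  sumt burnin={burnin};",
        "end;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    for name in MODELS:
        print(f"[{name}]")
        print(mrbayes_block(name))
        print()
```

Each block would be appended to the NEXUS file holding one simulated alignment; all other MCMC settings are left at the program defaults, as in the text.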

Pipeline specifications

Software tools: Seq-Gen, MrBayes
Application: Phylogenetics