Computational protocol: Covariation of Branch Lengths in Phylogenies of Functionally Related Genes

Protocol publication

[…] As the dataset consists of prokaryotes, gene tree topologies can differ from the species topology as a result of horizontal gene transfer (HGT). To filter out genes whose relationships may not reflect the underlying species relationships, MCMC analyses were performed using MrBayes. For each gene, we computed two runs, each with one cold chain and three heated chains, under a mixed amino-acid model with four gamma (γ) rate categories and allowing invariable sites (i). The prior distributions of the tree branch lengths and of the gamma shape parameter were set to exponential distributions with λ = 10, and the starting tree was set to random. The chains were run for 1,100,000 steps and sampled every 200 steps, with the first 500 trees discarded.

The posterior distributions were used to determine the relationships amongst the species. For each gene, the probability of each tree topology in the 95% credible set of trees was taken. For each topology, these probabilities were multiplied across genes to give the joint posterior probability of that topology over all genes, assuming independence of genes. The tree with the highest joint posterior probability was chosen as the best estimate of the phylogeny.

This procedure is justified by the fact that, if the tree priors for each gene are assumed to be equal and the genes are unlinked, the calculation is monotonic with the joint posterior probability, as follows. The posterior probability of a given tree, τ, over all genes, D_i (i = 1, …, n), is:

$$P(\tau \mid D_1, \ldots, D_n) \propto P(\tau) \prod_{i=1}^{n} P(D_i \mid \tau) \tag{1}$$

If the posterior probabilities are obtained separately for each gene, then:

$$\prod_{i=1}^{n} P(\tau \mid D_i) \propto P(\tau)^{n} \prod_{i=1}^{n} P(D_i \mid \tau) \tag{2}$$

As can be seen, Eqn (2) is monotonically (but non-linearly) related to Eqn (1).

When a particular topology is not found in a gene, it is assigned a minimum probability equal to one divided by the number of samples taken in the MCMC analysis. According to this criterion, the most probable tree topology yielded a log probability of −2289.62.
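The multiplication of per-gene topology probabilities described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary-based input format and the `n_samples` argument are assumptions made for the example. Probabilities are summed in log space to avoid numerical underflow when many genes are combined, and a topology absent from a gene receives the floor probability of one over the number of MCMC samples.

```python
import math


def joint_log_posterior(per_gene_probs, n_samples):
    """Sum log posterior probabilities of each topology across genes.

    per_gene_probs: list of dicts, one per gene, mapping a topology label
        to its posterior probability for that gene (hypothetical format).
    n_samples: number of MCMC samples per gene; a topology missing from
        a gene is given the minimum probability 1 / n_samples.
    """
    floor = 1.0 / n_samples

    # Collect every topology seen in at least one gene.
    topologies = set()
    for probs in per_gene_probs:
        topologies.update(probs)

    # Independence of genes: the joint probability is a product,
    # i.e. a sum of logs.
    scores = {}
    for topo in topologies:
        scores[topo] = sum(
            math.log(probs.get(topo, floor)) for probs in per_gene_probs
        )
    return scores


def best_topology(scores):
    """Return the topology with the highest joint log probability."""
    return max(scores, key=scores.get)


# Toy example with three genes and illustrative values only.
genes = [
    {"A": 0.7, "B": 0.3},
    {"A": 0.6, "C": 0.4},
    {"B": 0.9, "A": 0.1},
]
scores = joint_log_posterior(genes, n_samples=5000)
print(best_topology(scores))
```

In this toy input, topology "A" is present in every gene and therefore never incurs the floor penalty, which is why it ranks first even though "B" has the highest single-gene probability.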
The second most probable tree, by contrast, had a log probability of −2814.34. The most probable species topology found from our MCMC analysis concurs with the one used in Pazos et al.'s study, which is derived from neighbor-joining trees of distances in the 16S rRNA gene.

As the issue of HGT needed to be addressed, any gene with significant uncertainty as to whether it had the species topology was filtered from the dataset. A gene was excluded if the 95% credible set of trees from its MrBayes analysis did not contain the species topology we found to be the most probable. As a result, 222 of the 471 genes were excluded from the dataset, leaving 249. [...] Our program was written in Java 1.5 and utilizes some of the functions and classes from the Phylogenetic Analysis Library (PAL) package version 1.5. [...] Each gene tree was constructed by maximum likelihood with PhyML 3.0. Gene tree topologies were constrained to the species topology that we found previously. A Dayhoff + γ + i model with 8 relative substitution rate categories was used. Equilibrium amino-acid frequencies, the proportion of invariable sites and the distribution shape were estimated from the sequence data of each gene. […]
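The credible-set filter described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names and input format are hypothetical, and the 95% credible set is built in the usual way, by ranking topologies by posterior probability and adding them until the cumulative probability reaches the chosen level.

```python
def credible_set(topology_probs, level=0.95):
    """Return the smallest set of topologies whose cumulative
    posterior probability reaches `level`, taken in rank order."""
    ranked = sorted(topology_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    cumulative = 0.0
    for topo, prob in ranked:
        if cumulative >= level:
            break
        selected.append(topo)
        cumulative += prob
    return selected


def filter_genes(gene_posteriors, species_topology, level=0.95):
    """Keep only genes whose credible set contains the species topology.

    gene_posteriors: dict mapping a gene name to a dict of
        topology label -> posterior probability (hypothetical format).
    """
    return [
        gene
        for gene, probs in gene_posteriors.items()
        if species_topology in credible_set(probs, level)
    ]


# Toy example with illustrative values only.
gene_posteriors = {
    "geneA": {"T1": 0.80, "T2": 0.17, "T3": 0.03},
    "geneB": {"T2": 0.90, "T3": 0.06, "T1": 0.04},
}
kept = filter_genes(gene_posteriors, species_topology="T1")
print(kept)
```

Here "geneB" is dropped because the species topology "T1" falls outside its credible set, which is the situation the text treats as possible evidence of HGT.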

Pipeline specifications

Software tools: MrBayes, PAL, PhyML
Application: Phylogenetics