Computational protocol: Factors affecting the concordance between orthologous gene trees and species tree in bacteria

Protocol publication

[…] As B. quintana strain Toulouse has the smallest proteome of all the species considered herein, this strain was used as the reference genome to establish an RBH approach. Each of the 1142 proteins of B. quintana strain Toulouse was compared with the proteomes of the other strains, using BLAST [] with an E-value cutoff of < 1.0e-12. We retained all cases where a protein of B. quintana strain Toulouse had a bidirectional best hit in each of the other proteomes and the proteins aligned along at least 50% of their lengths. The above analysis yielded 469 groups (potential orthologs). Each of these possible orthologous groups was aligned using MUSCLE [] with the default parameters. The best model of amino acid substitution for each alignment was determined using ProtTest [], and the most likely phylogeny was constructed using PHYML [] with 100 non-parametric bootstrap replicates. The gamma shape parameter and the proportion of invariable sites were estimated by maximizing the likelihood of the phylogeny. Likelihood mapping analysis was carried out to determine the phylogenetic content of every individual alignment, using PUZZLE [,]. [...] Two filters were used to eliminate false positives. The first filter used confidence sets to assess whether the differences in topology between the probable species trees (see below) and the individual gene trees exceeded those expected to occur by chance. We used expected likelihood weighting [], which provides a simple and intuitive method for making multiple comparisons of models and constructing the corresponding confidence sets. This test has the benefit of being less conservative than the SH test []. The topologies tested included the superalignment Bayesian topology and the consensus tree topology (see below). PUZZLE was used to carry out this test for each of the 469 alignments, as well as for the superalignment (see below).
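The bidirectional best hit step described at the start of this section can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the function names, and the use of bit score to rank hits are assumptions made for illustration, and the 50% coverage rule is assumed here to apply to both proteins of a pair. The E-value cutoff of 1.0e-12 and the requirement of a best hit in every other proteome are taken from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One pairwise hit (hypothetical record, e.g. parsed from tabular BLAST output)."""
    query: str
    subject: str
    evalue: float
    bitscore: float
    aln_len: int


def best_hits(hits: Iterable[Hit], lengths: Dict[str, int],
              max_evalue: float = 1e-12, min_cov: float = 0.5) -> Dict[str, str]:
    """Return the best-scoring subject per query among hits passing both filters."""
    best: Dict[str, Hit] = {}
    for h in hits:
        # E-value cutoff from the text: keep only hits with E < 1.0e-12.
        if h.evalue >= max_evalue:
            continue
        # Assumption: the alignment must span >= 50% of BOTH proteins.
        if (h.aln_len < min_cov * lengths[h.query]
                or h.aln_len < min_cov * lengths[h.subject]):
            continue
        current = best.get(h.query)
        # Assumption: "best" means highest bit score.
        if current is None or h.bitscore > current.bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                         lengths: Dict[str, int]) -> Set[Tuple[str, str]]:
    """Pairs (reference, other) that are each other's best hit."""
    fwd = best_hits(forward, lengths)
    rev = best_hits(reverse, lengths)
    return {(q, s) for q, s in fwd.items() if rev.get(s) == q}


def conserved_in_all(rbh_by_proteome: Dict[str, Set[Tuple[str, str]]]) -> Set[str]:
    """Reference proteins that have a reciprocal best hit in every other proteome."""
    ref_sets = [{ref for ref, _ in pairs} for pairs in rbh_by_proteome.values()]
    return set.intersection(*ref_sets) if ref_sets else set()
```

Each reference protein returned by `conserved_in_all` would correspond to one candidate orthologous group; the forward and reverse hit lists would come from searching the reference proteome against each other proteome and vice versa.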
The 469SBP topology (see Figure ) contained a sister group relationship between the group comprising Sinorhizobium meliloti, Brucella abortus 9–941 and the genera Rhizobium, Agrobacterium, Mesorhizobium, and Bartonella and that comprising the genera Rhodopseudomonas, Nitrobacter, and Bradyrhizobium japonicum. The presence of this sister group relationship was used as the second filter; we used PAUP* 4.01 b10 [] to check whether the phylogeny of each of the 432 potential orthologous genes that passed the first filter showed the two sister groups. We then used likelihood mapping analysis (applied through PUZZLE) to determine the phylogenetic content of each of the remaining orthologous genes; the number of resolved quartets was counted for each gene, and a mean and SE were then calculated for the entire set. [...] A superalignment was created by concatenating the 469 individual alignments. Two phylogenies were derived. The first was inferred by maximum parsimony, using PAUP* 4.01 b10 [] with random addition of sequences and tree bisection reconnection. The second phylogeny was created using MrBayes v3.1.2 [], allowing the MCMC sampler to explore all of the fixed-rate amino acid models included in MrBayes. The number of rate categories for the gamma distribution was set to four, with an allowance for a proportion of sites to be invariable. Due to the computational burden, we performed a single run with four chains for 500,000 generations. Trees were sampled every 500 generations, 25% of all generations were removed as burn-in, and a consensus was taken. Once the candidate orthologous genes had been filtered to remove false positives, we generated a second Bayesian phylogeny from the remaining 370 genes, using the same specifications as above. Because we performed only a single run for each Bayesian phylogeny, we could not use the standard deviation of the split frequencies; instead, we examined the log likelihood values.
For both superalignments, these values stabilized early and then fluctuated within a very narrow range. The log likelihood values of the second phylogeny are plotted in the additional file. [...] The number of different topologies in the confidence set of orthologous groups was deduced using the Robinson-Foulds distance (RFd), as calculated with TREEDIST []. The RFd is the number of bipartitions that are unique to one of the two phylogenies being compared; the RFd equals zero when the two phylogenies have the same topology. The number and proportion of total bipartitions were determined using an ad hoc Perl script that takes as input the consensus file generated by CONSENSE []. […]
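The Robinson-Foulds distance defined above can be sketched in a few lines of Python. This is an illustrative reimplementation of the definition given in the text, not the TREEDIST code: trees are assumed to be given as nested tuples of leaf names rather than Newick files, and the function names are hypothetical. Each internal branch is reduced to one side of its bipartition, and the distance is the number of bipartitions found in only one of the two trees.

```python
from typing import FrozenSet, Set, Tuple, Union

# Assumption: a tree is a leaf name or a tuple of subtrees, e.g. (("A", "B"), ("C", "D")).
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """All leaf names under this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions of the unrooted tree, one canonical side per split."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # canonical side = the one NOT containing this taxon
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf versus the rest, or the whole tree).
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(a: Tree, b: Tree) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    if leaves(a) != leaves(b):
        raise ValueError("trees must share the same set of taxa")
    return len(bipartitions(a) ^ bipartitions(b))
```

Because splits are compared on the unrooted tree, two trees that differ only in root placement give a distance of zero, which matches the statement that the RFd is zero when the topologies are identical.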

Pipeline specifications