Computational protocol: A Rapid and Scalable Method for Multilocus Species Delimitation Using Bayesian Model Comparison and Rooted Triplets

Similar protocols

Protocol publication

[…] In order to test the performance of the delimitation model, we first conducted a simple 3-species simulation and assessed the error rates of the model. In this simulation, we assume gene trees are known without error. Gene genealogies were simulated within a species tree with 3 tips and fixed branch lengths, T1=4000 and T0=8000 generations (a). The number of samples per species was set to 10, totalling 30 individual samples. The effective population sizes were set to 1/2*T1 to 8*T1 for all species (T1=1/8Ne – 2Ne generations). Coalescent trees within the species tree were simulated using SIMCOAL () assuming that 1 species represents 1 population and populations merge on speciation events. Custom scripts were used to generate input files for SIMCOAL from species trees. Twenty-five independent loci were simulated 100 times, which resulted in 2500 gene trees in total. The posterior probability for a global delimitation W (all individuals are from a single species) and the 3 alternative models representing correct delimitation (a), under-splitting (b) and over-splitting (c) () were calculated with the increasing numbers of loci between 5 and 25 with step 5. Error rates, that is, the frequency of choosing an incorrect model as the best model, were recorded for each iteration. Figure 2. [...] The ten-species simulation considers more realistic conditions. Species trees with 10 tips were simulated under the Yule model with a constant speciation rate. The total depth of species trees, T, was rescaled to 20,000 generations, and the effective population size (Ne) of species was set as 1, 1/2, 1/4, 1/8, and 1/16 times of T (Ne = 1250–20,000). These parameter settings cover speciation rates and effective population sizes observed in various eukaryotic groups (; ) including extreme cases of rapid radiations. Gene trees with 10 samples per species (100 total samples) for 40 independent loci were simulated using SIMCOAL and the custom scripts. Simulations were replicated 100 times.In the first simulation, hereafter simulation A, we assume that the topology of the guide tree and assignment of terminals to species groups is known. This simulation tests whether the method can correctly find the positions of nodes which define species from multiple competing combinations on a guide tree. The tr2 program was run with the species tree as a guide and simulated gene trees as inputs. The number of loci used ranged from 5 to 40.In the second simulation, B, delimitation was conducted solely from sets of gene trees (species discovery approach). A consensus tree was built from gene trees from multiple loci using the rooted triple consensus (). Then, the consensus tree was used as the guide tree in the delimitation step. This guide tree contains all possible hierarchical delimitations, from each individual representing a separate species to all individuals representing a single species. Polytomies on consensus trees were randomly resolved by the “multi2di” function in the “ape” package (). In addition, we performed a set of simulations to assess the effect of increasing numbers of loci and individual samples. Gene trees were simulated within the same species trees as above with Ne = T/4, but the total number of samples was reduced to 50 (5 per species) and the number of loci was doubled, keeping the total sample size (number of loci × number of samples) constant. Delimitation with tr2 was conducted in the same procedure as simulation B.The third simulation, C, considers conditions where gene trees and species trees are estimated from DNA sequences. Sequences were simulated along the branches of the gene trees simulated above using Seq-Gen () assuming HKY+G model (Ts/Tv = 2.5 and α=0.1) and 3% of overall genetic variations. These parameters were chosen to be comparable to the case studies described in the next sections. Sequence length was set to a constant length of 750 bp. Gene trees were reconstructed from the simulated sequences using RAxML with a GTR+G model () and rooted by the “-I f” option of RAxML. Guide trees were estimated from the reconstructed gene trees with the rooted triple consensus, and delimitation was conducted with tr2. Under the parameter settings above, within-species genetic variation of simulated sequences ranged from 0.3% to 1.4% depending on Ne, and between-species variation was 3.0%.The number of estimated species and the number of exact matches between estimated and true species were measured as the accuracy of delimitation. The elapsed time for each trial was also recorded. The numbers of nonmonophyletic species were counted to measure the degree of incomplete lineage sorting. The effects of Ne, the number of loci, simulation type (A, B, and C) and their second interaction terms on the proportion of exact matches were tested using GLM. For simulation B, the effect of the 2 sampling strategies was also tested. Simulations and delimitations were run on a Linux personal computer with a 2.3 GHz Intel i5 quad-core processor and 4 GB memory. [...] We tested the applicability of the tr2 method to bacterial species using a multilocus sequence typing (MLST) data set of the B. cereus complex. MLST is a typing scheme for bacterial species/subspecies using a few (typically 7) loci (; ). It is widely used in clinically relevant bacteria and occasionally in environmental prokaryotes to delimit species (e.g., ). Although bacterial reproduction is largely clonal, in many bacteria including B. cereus, genetic exchange also occurs (). If there was frequent gene exchange within a group of closely related individuals, but none between distantly related groups, this could lead to units equivalent to reproductively isolated species in sexual eukaryotes (; ).The tr2 method should be able to delimit such a group as a putative species. However, in clonal bacteria without any recombination, the delimitation method based on gene tree congruence would delineate all individuals as separate species because the true genealogy of each locus would be identical. Another complication is that horizontal transfer might occur rarely between otherwise distinct species. This could introduce additional incongruence among loci between otherwise separate species. We were interested to see how the method coped with a prokaryotic clade that might display these complications.Our sample comprised 144 isolates originally collected from evenly spaced quadrats in the walled garden at Silwood Park for the study by . In brief, freezer isolates were regrown on B. cereus selective agar and DNA extracted using Chelex Instagene matrix method. The 7 house-keeping genes used for standard B. cereus MLST () were PCR amplified and Sanger sequenced using primers and conditions at the MLST database ( Sequences were edited in Geneious and trimmed to the lengths used at the MLST database. Full details are provided elsewhere (; Barraclough et al. in preparation), and sequences are available at GenBank (Accession: KT806485-KT807462).Alignment lengths of the MLST sequences ranged from 348 bp to 504 bp, and there were 29–55 unique haplotypes at each locus (maximum of 55 for purH and minimum of 29 for glpF and gmk). The complete data matrix excluding missing loci contained 2806 bp from each of 114 isolates, which included 99 unique multilocus sequence types. Overall genetic variation was 4.0%. Sequences from the 7 loci were separately aligned with MUSCLE 3.8 (). Gene genealogies of the 7 loci were estimated using BEAST 1.80 (). Ten million generations of MCMC sampling were run with a GTR+G substitution model and the log-normal relax clock model (Drummond et al. 2006). Twenty percent of the MCMC samples were discarded as burn-in. The convergence of the parameters was checked by effective sampling size using Tracer (), and the maximum clade credibility trees (MCC trees) were extracted from the MCMC runs using TreeAnnotator.Two methods were used to obtain guide trees for delimitation of the Bacillus group. First, a consensus tree was constructed using the rooted triple consensus from the MCC trees of 7 loci. Second, in order to account for the effects of horizontal transfer on the guide tree estimation, we ran ClonalFrame () on the concatenated alignment. ClonalFrame estimates the most likely clonal genealogy by removing putative horizontally transferred regions. An MCMC of ClonalFrame was run with 800 thousand generations, and 50% of the chain was discarded as burn-in. Convergence of parameters was examined by checking effective sample size using Tracer. The 50% majority consensus from the ClonalFrame MCMC was used as a second guide tree. Resampling of loci was conducted 50 times using these 2 guide trees. To further account for the uncertainty of tree building, 100 trees were sampled from the MCMC chain from BEAST for each locus and from the chain of ClonalFrame, and delimitation was repeated with these 100 sets. The frequency for each pair of samples to be grouped in the same species was recorded. (Sequence alignments and trees are available at Dryad: […]

Pipeline specifications