Computational protocol: A performance study of the impact of recombination on species tree analysis

Similar protocols

Protocol publication

[…] Simulations under the coalescent model with uniform recombination rate across loci followed the general protocol in []. Species trees with 8, 15, and 25 taxa were generated under a uniform speciation model using Mesquite []. To further validate our findings, we also included alternative model trees which consisted of the 10 8-taxon model trees from the simulation study of [] and an empirical species tree based upon the consensus Mus phylogeny reported by Guénet and Bonhomme (see Fig. 1 in []). The former can be downloaded as part of the supplementary data provided in [], and the latter is listed in Additional file . For each model condition, 20 replicates were generated. Each species tree had a total depth of 1N. For each species tree, coalescent gene trees were generated by ms [] under the multi-species coalescent with a finite-sites model of recombination. We used 3 different choices for the population recombination rate ρ: 100, 200, and 1000. For the simulated sequence length used in our study (10 Mb) and effective population size of 2500, a ρ value of 1000 corresponds to a per-generation crossover probability between adjacent sites of 10−8. These values are within the range of estimates for mouse, rat, and human [] (e.g., an empirical study of human demography estimated a population recombination rate of 13560 for use in related simulations involving a finite-sites model of recombination and sequence length of 30 Mb [].) Sequence evolution was then simulated using the resulting gene trees as input. We used Seq-Gen [] to simulate DNA sequence evolution under an HKY85 +Γ substitution model with α=0.8. The simulated sequence length for each replicate dataset was 10 Mb.For each replicate, we ran four different phylogenomic inference pipelines. The pipelines differed based upon the set of loci and gene trees used as input to species tree analysis, where one of the following five options were used: The LD-based sequence preprocessing approach discussed above with locus length of 1000 bp, which we refer to as “LD1000”. For each sequence alignment, we estimated an empirical cutoff based on the LD decay plot using r 2 to measure LD (equation 7.13 in []). (See Additional file for LD decay plots and empirical cutoffs.) Loci were then sampled at an interval equal to the empirical cutoff. The sequence length of each sampled locus was 1000 bp, which was identical to the locus length used by []. FastTree [, ] was used to infer a gene tree on each locus under the GTR +Γ substitution model. The LD-based sequence preprocessing approach discussed above with locus length of 100 bp, which we refer to as “LD100”. The LD100 method was otherwise identical to the LD1000 method. An inferred breakpoints/inferred gene trees approach, which we refer to as “IBIG”. The sequence was partitioned into blocks using the LRScan algorithm [, ] with each block satisfying the Four-Gamete test to rule out historical recombination []. We used a custom implementation of the LRScan algorithm which is provided as open-source software at the URL given in Additional file . To reduce computational burden at the potential expense of downstream phylogenomic inference accuracy, we chose to concatenate every 1000 blocks into a single locus, rather than letting each block correspond to a locus for the purpose of phylogenomic inference. (See Additional file for an experiment that explores different settings for the concatenation step.) For this reason as well as the simple FGT-based approach, IBIG’s accuracy can be interpreted as a lower bound on the accuracy of phylogenomic pipelines which incorporate explicit breakpoint analysis. The lower bound suffices for the purposes of our study. (Recall also the findings of [], which suggest that state-of-the-art phylogenomic inference pipelines are largely robust to violations of the assumption of zero intra-locus recombination.) A gene tree was then estimated on each locus using FastTree, similar to the above methods. A true breakpoints/inferred gene trees approach, which we refer to as “TBIG”. This approach made use of the true recombination breakpoints. Each recombination-free interval between a pair of neighboring breakpoints served as a locus in downstream analyses. Gene trees were inferred on loci using FastTree [, ], similar to the above methods. A true breakpoints/true gene trees approach, which we refer to as “TBTG”. This approach used the set of true gene trees (and, implicitly, the set of true recombination breakpoints) for each replicate dataset as input for downstream analysis. The LD-based sequence preprocessing approach discussed above with locus length of 1000 bp, which we refer to as “LD1000”. For each sequence alignment, we estimated an empirical cutoff based on the LD decay plot using r 2 to measure LD (equation 7.13 in []). (See Additional file for LD decay plots and empirical cutoffs.) Loci were then sampled at an interval equal to the empirical cutoff. The sequence length of each sampled locus was 1000 bp, which was identical to the locus length used by []. FastTree [, ] was used to infer a gene tree on each locus under the GTR +Γ substitution model. The LD-based sequence preprocessing approach discussed above with locus length of 100 bp, which we refer to as “LD100”. The LD100 method was otherwise identical to the LD1000 method. An inferred breakpoints/inferred gene trees approach, which we refer to as “IBIG”. The sequence was partitioned into blocks using the LRScan algorithm [, ] with each block satisfying the Four-Gamete test to rule out historical recombination []. We used a custom implementation of the LRScan algorithm which is provided as open-source software at the URL given in Additional file . To reduce computational burden at the potential expense of downstream phylogenomic inference accuracy, we chose to concatenate every 1000 blocks into a single locus, rather than letting each block correspond to a locus for the purpose of phylogenomic inference. (See Additional file for an experiment that explores different settings for the concatenation step.) For this reason as well as the simple FGT-based approach, IBIG’s accuracy can be interpreted as a lower bound on the accuracy of phylogenomic pipelines which incorporate explicit breakpoint analysis. The lower bound suffices for the purposes of our study. (Recall also the findings of [], which suggest that state-of-the-art phylogenomic inference pipelines are largely robust to violations of the assumption of zero intra-locus recombination.) A gene tree was then estimated on each locus using FastTree, similar to the above methods. A true breakpoints/inferred gene trees approach, which we refer to as “TBIG”. This approach made use of the true recombination breakpoints. Each recombination-free interval between a pair of neighboring breakpoints served as a locus in downstream analyses. Gene trees were inferred on loci using FastTree [, ], similar to the above methods. A true breakpoints/true gene trees approach, which we refer to as “TBTG”. This approach used the set of true gene trees (and, implicitly, the set of true recombination breakpoints) for each replicate dataset as input for downstream analysis.The main motivation behind the use of ground truth in the TBIG and TBTG methods was for theoretical comparison with the other methods, which did not make use of ground truth. Thus, the accuracy of TBIG and TBTG serves to bound the potential accuracy of the other methods.Given a set of gene trees inferred using one of the four approaches described above, each pipeline utilized ASTRAL-II [, ] to perform species tree inference. Our choice was motivated by prior studies which have shown ASTRAL-II to be among the most accurate state-of-the-art methods while offering much improved computational efficiency [, ].As an alternative to the modeling assumption of uniform rate of recombination across loci, we also used the msHOT simulation tool [] to perform coalescent simulations incorporating recombination hotspots. The simulations utilized the 8-taxon species trees that we generated using Mesquite. The recombination hotspots were simulated using two different approaches: The procedure used by [ ], where the number and length of hotspot regions were chosen deterministically. The locations of 10 hotspots were chosen uniformly at random within an alignment. The 10 hotspot lengths were: two hotspots with length 1 kb each, two with 2 kb length, two with 3 kb length, two with 4 kb length, and two with 5 kb length. Each hotspot had local recombination rate that was 10 times the background recombination rate used outside of hotspots. The procedure used by [ ], where the number and length of hotspot regions were chosen non-deterministically. The number of hotspots was drawn from a Poisson distribution parameterized so that the average distance separating neighboring hotspots was 500 kb. The width of each hotspot (in kb) was draw uniformly in the open interval (1,2). The intensity above background for each hotspot was drawn uniformly from the open interval (1,10). The procedure used by [ ], where the number and length of hotspot regions were chosen deterministically. The locations of 10 hotspots were chosen uniformly at random within an alignment. The 10 hotspot lengths were: two hotspots with length 1 kb each, two with 2 kb length, two with 3 kb length, two with 4 kb length, and two with 5 kb length. Each hotspot had local recombination rate that was 10 times the background recombination rate used outside of hotspots. The procedure used by [ ], where the number and length of hotspot regions were chosen non-deterministically. The number of hotspots was drawn from a Poisson distribution parameterized so that the average distance separating neighboring hotspots was 500 kb. The width of each hotspot (in kb) was draw uniformly in the open interval (1,2). The intensity above background for each hotspot was drawn uniformly from the open interval (1,10).In both approaches, the background recombination rate was 100. Otherwise, simulations incorporating recombination hotspots utilized a procedure that was identical to simulations elsewhere in our study: msHOT was used to simulate gene trees and locus lengths, and seq-gen was then used to simulate sequence evolution using the procedure described above.For each dataset, the topological distance between an estimated species tree and the true species tree was measured using normalized Robinson-Foulds (RF) distance []. We used a routine implemented in the PhyloNet software package for this purpose []. […]

Pipeline specifications