Computational protocol: multi‐dice: r package for comparative population genomic inference under hierarchical co‐demographic models of independent single‐population size changes

Similar protocols

Protocol publication

[…] Our hierarchical co‐demographic model consists of n taxa, which refer to independent panmictic populations with no assumption of or requirement for recent shared ancestry (Mazet, Rodríguez, Grusea, Boitard, & Chikhi, ), randomly assigned to Ψ instantaneous expansion (Figure a) or contraction (Figure b) times. Of the Ψ times, there are ψ times corresponding to synchronous pulse events that involve at least two taxa, and σ times corresponding to idiosyncratic events ungrouped from any pulses with only a single taxon, such that Ψ = ψ + σ (Table ). The proportion of n taxa assigned to any of the ψ pulses is represented by ζT, the proportion of n taxa belonging to each of the ψ pulses is described by the associated hyperparameter vector ζs = {ζ1, …, ζψ}, and the proportions of n taxa across all Ψ events are indexed by the vector ζ = {ζs, ζi,1, …, ζi,σ}. Here, ζT is a single proportion value that ranges from 0.0 to 1.0 and equals the total sum of ζs (i.e., ζT = ∑j=1ψζj when ψ > 0), and both ζs and ζ are hyperparameter vectors that index proportion values across events. Specifically, each of ψ elements within the vector ζs ranges from 2/n to 1.0, and ζ comprises of ζs as well as each ζi element = 1/n. The proportion ζT and proportions within the vector ζs may be converted to numbers of taxa S T = ζT × n and S = ζs × n, respectively. Synchronous pulse times are indexed in the vector τs = {τs,1, …, τs,ψ}, whereas idiosyncratic times are indexed in the vector τi = {τi,1, …, τi,σ}, with both vectors arranged in ascending order from most recent to oldest. To clarify, synchronous pulses are indexed by the temporal order established by τs = {τs,1, …, τs,ψ}, which thus determines the order of ζs such that ζ1 pertains to the most recent pulse and ζψ reflects the most ancient. In the case of Ψ = ψ and σ = 0, accordingly ζT = 1.0 such that all taxa are assigned to one of ψ synchronous pulses with no temporally idiosyncratic taxa. On the other extreme, when Ψ = σ = n and ψ = 0, accordingly ζT = 0.0 with zero elements in the associated ζs vector such that there are no synchronous pulses with all taxa idiosyncratically experiencing population size change across σ different times. Other taxon‐specific demographic parameters include each taxon's ratio of size change from the ancestral effective population size to current effective population size is indexed by the vector ε = {ε1, …, εn} and each taxon's current effective population size indexed by the vector N = {N 1, …, N n}. Additionally, population size change times may be indexed to coincide with the taxa arrangement of ε and N such that τ = {τ1, …, τn} (Table ).When implementing this co‐demographic model for comparative demographic inference, there exists flexibility in the hierarchical parameterization, with several options available in multi‐dice. One such option, similar to the approach described in Chan, Schanzenbach, and Hickerson () and Xue and Hickerson (), is to constrain the hyperparameter ψ to the values within the set {0, 1} and condition Ψ and σ on the hyperparameter ζT, which freely varies according to the hyperprior distribution P(ζT). This allows scenarios of complete idiosyncrasy, absolute synchrony within a single pulse, and intermediate degrees of synchronicity belonging to one pulse with remaining taxa temporally idiosyncratic. Here, ζ1 is the only element possible in ζs whereby ζT = ζ1 when ψ = 1 and ζT = 0.0 when ψ = 0, resulting in the joint posterior distribution P(ζT, τ, ε, N | Data) ∝ P(Data | ζT, τ, ε, N) P(ε, N) P(τ | Ψ, σ, ζT) P(Ψ, σ | ζT) P(ζT | ψ < 1). The values for Ψ and σ are then determined by Ψ = 1 + n − S T (when ψ = 1) and σ = n − S T, respectively.An alternative scheme is to randomly assign the proportions of n taxa to Ψ times according to the hyperprior distribution for the vector ζ, which is conditional on the hyperprior distribution of Ψ, with ψ and σ accordingly conditional on P(ζ | Ψ) and P(Ψ). This leads to the joint posterior distribution P(Ψ, ζ, τ, ε, N | Data) ∝ P(Data | Ψ, ζ, τ, ε, N) P(ε, N) P(τ | Ψ, ζ, ψ, σ) P(ψ, σ | Ψ, ζ) P(ζ | Ψ) P(Ψ). The values for ψ and σ are then determined by the number of Ψ draws for ζ that are above and equal to 1/n, respectively, yielding the so‐called Chinese restaurant process (Aldous, ; Blei, Griffiths, Jordan, & Tenenbaum, ) that is similarly applied in msBayes (Hickerson et al., ; Huang, Takebayashi, Qi, & Hickerson, ). Similarly, a third scheme is to condition the hyperprior distribution for the vector ζs, which must have a lower bound greater than 1/n, on the hyperprior distribution of ψ, with Ψ and σ accordingly conditional on P(ζs | ψ) and P(ψ), such that the joint posterior distribution is P(ψ, ζs, τ, ε, N | Data) ∝ P(Data | ψ, ζs, τ, ε, N) P(ε, N) P(τ | ψ, ζs, Ψ, σ) P(Ψ, σ | ψ, ζs) P(ζs | ψ) P(ψ). The values for Ψ and σ are then determined by Ψ = ψ + n − S T and σ = n − S T, respectively. Optionally, for each possible value in the ψ hyperprior, the associated ζs, Ψ and σ values may be fixed to specified values rather than allowed to vary. [...] We conducted a series of in silico experiments to quantify accuracy and bias for various inferential frameworks and hierarchical co‐demographic modelling variants. Data were simulated under known hyperparameter and parameter values with the coalescent simulator fastsimcoal version 2.5 (Excoffier et al., ). To directly generate single‐population folded SFS, the FREQ setting was enabled assuming a set number of independent genealogies per SFS, which was treated as an approximation for the number of SNPs sampled and differed between experiments. Each SFS contained 20 haploid samples, only polymorphic bins and proportional SNP frequencies rather than total SNP counts. Per individual simulation, a set of 10 SFS corresponding to n = 10 populations was converted to a single aSFS summary vector following Xue and Hickerson (). Simulation reference tables composed of hyperparameter and parameter values randomly drawn from their respective hyperprior and prior distributions and their corresponding aSFS summaries were separately produced for each hierarchical co‐demographic model variant and read into the R environment with the r package bigmemory to perform hierarchical RF regression (hRF) and hierarchical ABC (hABC) under the simple rejection algorithm against pseudo‐observed data sets (PODs). PODs were produced under one of two methods, either independently from the reference table or using the “leave‐one‐out” cross‐validation procedure. In brief, the “leave‐one‐out” procedure involves iteratively treating a single randomly selected simulation from a reference table as a POD and conducting inference using the remaining simulations (Csilléry, François, & Blum, ). For each inferential application, Pearson's r correlation and root mean squared error (RMSE) were calculated from estimated values against true POD values. [...] In addition to hRF and hABC, we coupled these frameworks with transformation of the aSFS by PLS as well as evaluated the performance of hierarchical CL (hCL). To compare these inferential strategies, per each of the two hierarchical co‐demographic models of co‐expansion and co‐contraction (Figure ), 100 aSFS PODs were simulated under the hyperprior distribution of ψ ~ U{0, 5} while permitting idiosyncratic taxa such that ζT was allowed to vary from 0.0 to 1.0. These PODs were consistently utilized to independently estimate ψ across each inferential approach. A reference table of 1,000,000 simulated aSFS was likewise produced per model under the same specification as the PODs (). For hRF, using the r package randomforest (Liaw & Wiener, ), a total of 1,000 decision trees, with the default maximum of 33 variables randomly sampled as candidates at each tree split and from 10 trees per each of 100 cycles of randomly subsampling 1,000 simulations per ψ (for a total of 6,000 simulations) with replacement after each cycle, were built per reference table to capture variation in ψ and leveraged to predict ψ for each corresponding POD using the predict() function. For hABC, using the function abc() from the r package abc (Csilléry et al., ), accepted tolerance levels of 0.0050, 0.0010 and 0.0005 were executed per POD against the corresponding reference table, and the mean, median and mode of the according posterior distributions were calculated for point estimates of ψ.For PLS, the plsr() function in the r package pls (Mevik & Wehrens, ) was applied to a random subset of 10,000 simulations against variation in ψ per reference table. The PLS for each reference table was subsequently utilized to transform the remaining 990,000 simulations and corresponding PODs into as many component values as needed to explain ≥95% of the total variance in the original summary statistics. The same hRF and hABC protocols were then executed on the remaining transformed reference tables. For hCL, a custom pipeline that calls dadi to calculate the expected SFS (Gutenkunst et al., ) and incorporates the multinomially distributed CL equation utilized in fastsimcoal2 (Excoffier et al., ) and the BFGS optimization algorithm (Liu & Nocedal, ) was implemented (). […]

Pipeline specifications