Computational protocol: Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data

Similar protocols

Protocol publication

[…] Solving the multi-population diffusion equation is substantially more demanding than the single-population case . This is primarily because the boundary conditions are more complex, and the numerical grid of population frequencies must be much coarser to be computationally tractable, because it is of dimensions. For example, a previous single-population study used a uniform grid of order values between 0 and 1. Extending this grid to a three-population simulation would require an infeasible array of size . Instead, we use a nonuniform grid and extrapolation to enable accurate computation using of order 100 values along each dimension, for a final array size of order .We solve the diffusion equation on a regular nonuniform grid, using a finite difference scheme inspired by the method of Chang and Cooper (). Mutations in population arise at frequency . The diffusion approximation applies when , but the minimum frequency in our numerical simulation is that of the first grid point, denoted . To overcome this, we extrapolate our results to an infinitely fine grid. We use a quadratic extrapolation on the logarithm of the AFS entry, modeling the bias introduced by the finite initial grid point as(5)Here is an AFS element calculated at grid size and is the extrapolated value. Given three evaluations at different grid sizes , we solve for and use this value when calculating likelihoods. This vastly increases both the speed and accuracy of our calculation (Supplementary Figure 3 in ). While higher-order extrapolations may improve accuracy in some cases, they may also be more sensitive to numerical noise. Our empirical experience is that a quadratic approximation provides a good compromise between accuracy, efficiency, and robustness.The computational cost for a single likelihood evaluation scales as where is the number of grid points used. In our experience, for stability and accuracy should somewhat larger than the largest population sample size. Although our theoretical framework extends to an arbitrary number of populations, the exponential scaling of computation with limits our current applications to three simultaneous populations. Importantly, our likelihood calculation is deterministic and numerically smooth, so numerical derivatives can be used in optimization. We use the the quasi-Newton BFGS method , which converges in order steps, where is the number of free parameters.Our implementation of these methods, , is written in cross-platform Python and C, making use of the NumPy , Scipy , and Matplotlib libraries . It is distributed under the open-source BSD license. All calculations herein were performed with version 1.1.0.We estimated parameter uncertainties by both conventional bootstrap (fitting data sets resampled over loci) and parametric bootstrap (fitting simulated data sets). To generate simulated data we used the coalescent program ms , a region-specific recombination rate, and the detailed EGP sequencing strategy ().The confidence intervals reported in and derive from a normal approximation to the bootstrap results. For the conventional bootstrap, confidence intervals were calculated as . For the parametric bootstrap, biased-corrected intervals were calculated as . The maximum-likelihood value is denoted , while and denote the mean and standard deviation of the bootstrap results. Aside from the growth rates , all our model parameters are positive by definition, so in those cases we used their logarithms when calculating confidence intervals.Pearson's goodness-of-fit test was performed using all 213−2 = 9259 bins in the AFS. Results are similar if we restrict our analysis to entries in which the expected value is greater than 1 or greater than 5. […]

Pipeline specifications

Software tools Numpy, matplotlib
Application Miscellaneous
Organisms Martes pennanti, Homo sapiens
Chemicals Amino Acids