Computational protocol: Amount of Information Needed for Model Choice in Approximate Bayesian Computation

Similar protocols

Protocol publication

[…] We test the power of ABC by comparing two simple population genetic models, each simulated under a coalescent model. The coalescent is a backward in time simulator of a population of gametes that can be subjected to a number of evolutionary forces. The genetic variation in a population of gametes is determined by the mutation rate per generation (), and the effective population size (). The level of genetic variation in a population is then defined as the product of these two parameters: . Two coalescent models were considered, the first of which consisted of a population of effective size (haploid individuals) that remain constant through time (SNM). For the second scenario, a bottleneck model (BNM) was considered where an instantaneous reduction in the population size occurs at 0.2 coalescent time units () in the past (measured in generations), and persists for a period of before returning to its original size. For each demographic model, 1000 datasets were simulated whereby the levels of nucleotide diversity (assuming an infinite sites model) and number of samples and loci were varied. For the majority of analyses, we considered a sample size () of 10 or 20 individuals (where a population consists of gametes), with 15 or 30 loci sequenced (750 bp each in length) and two levels of genetic variation with per base pair scaled mutation rates of or (where ). For the majority of simulations in the BNM, the relative population size during the bottleneck () was , although we also varied this parameter (, , , and ) to assess the performance of ABC in detecting bottlenecks of varying severity. The population scaled recombination rate, , was set to 0.01/bp in each model.For each of these simulated datasets the mean, standard deviation and 5% and 95% quantiles across loci were calculated for Watterson's (, , , ; ), nucleotide diversity (, , , ; ), Fay & Wu's estimate of (, , , ; ), haplotype diversity (, , , ; ), Tajima's D (, , , ; ), Fay & Wu's non-standardized H (, , , ; ) and the number of segregating sites (, , , ). The relative site frequency spectrum (SFS) was also summarized by the average proportion of segregating sites that occur within each of three or five evenly sized frequency classes (s1, s2, s3, s4, s5). These represent population genetic statistics that are thought to summarize population sequence data in the most informative way (see for example, , , ). For the analysis of the power and false positive rate of ABC we combined a number of summary statistics: TPH (, , ), SFS3 (folded site frequency spectrum in 3 bins), T+SFS3 (, folded site frequency spectrum in 3 bins), SFS5 (folded site frequency spectrum in 5 bins) and TPH+DH (, , , , ). When analyzing the relationships among summary statistics, the calculation of correlation coefficients and the performance of Principal Component Analysis (PCA) was implemented using the python library NumPy. [...] Joint posterior densities were simulated for each of the two ABC models using draws from uniformly distributed priors. The prior bounds for the SNM and BNM were (0, 0.01) and (0, 0.02) for and , respectively. Time was measured on a scale of generation. For the BNM, the time () of the bottleneck (looking backwards in time) was sampled from a prior with bounds (0, 1.5), with the relative population size during the bottleneck () having bounds of (0, 1). The relative ancestral population size and the duration of the bottleneck were fixed as 1 and respectively. ABC was performed using the python library EggLib , which implements the Euclidean distance-based, local linear regression method described in . Model choice was performed using the rejection-based method implemented in EggLib, with model probabilities being defined as the proportion of simulations belonging to each model after the ABC rejection step. Bayes factors were calculated as the ratio of the model probabilities, with a Bayes factor 3 (unless stated otherwise) being considered an acceptable level of significance . Power () was defined as the probability of correctly rejecting the SNM and was assessed by calculating the proportion of replicates with a Bayes factor when the true model was the BNM. False positives were considered as instances where the SNM was falsely rejected and was given by the proportion of replicates with Bayes factors when the SNM was the correct model. Parameter estimation was carried out using the local linear regression method of . The accuracy of parameter estimation was assessed by comparing the true parameter value with that estimated in ABC using the relative bias, , and the relative mean square error, . The tolerance level () for both model choice and parameter estimation was fixed at 0.001 unless otherwise stated. Coalescent simulations, ABC analyses and calculation of summary statistics were performed using EggLib. Any additional custom code is provided in the Github repository: […]

Pipeline specifications

Software tools Numpy, EggLib
Applications Miscellaneous, Population genetic analysis