## Protocol publication

[…] All microsatellite loci were tested for within-population deviations from Hardy-Weinberg equilibrium and for linkage disequilibrium among pairs of loci using the online version of **GENEPOP**, with 10,000 dememorizations, 1,000 batches, and 10,000 iterations per batch for both tests. To correct for multiple testing, a Bonferroni correction was applied at the α = 0.05 significance level.

Observed heterozygosity (HO) and expected heterozygosity (HE) were calculated using **GenAlEx** 6.5, and allelic richness was estimated by rarefaction (N = 30) using HP-RARE. [...]

To identify likely genetic clusters and possible origins for each cluster, we used the Bayesian clustering method implemented in STRUCTURE v. 2.3.4. STRUCTURE identifies K genetic clusters and estimates the proportion of each individual’s ancestry attributable to each cluster, without a priori information about the individuals' locations. Twenty independent runs were conducted at each K = 1–15 for the full set of California (CA) and North American reference populations and at each K = 1–12 for the subset of CA populations only. Each run lasted 600,000 generations, with the first 100,000 discarded as burn-in, and assumed an **admixture** model with correlated allele frequencies. The optimal number of clusters (K) was chosen following the guidelines of Pritchard et al. and the Delta K method. The results were visualized using **DISTRUCT** v. 1.1.

To further explore population structure, discriminant analyses of principal components (DAPC), principal component analyses (PCA), and plots illustrating FST values were produced using the **Adegenet** package v. 2.0.2 in R v. 3.2.4 and RStudio v. 0.99.893. DAPC maximizes variation between clusters while minimizing variation within them: the data are first transformed using a PCA, and clusters are then identified using discriminant analysis.
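
The Delta K method mentioned above for choosing K can be illustrated with a short sketch. It assumes one common formulation, in which the absolute second-order difference of the mean log-likelihood across replicate runs is divided by the standard deviation of the log-likelihood at that K. The function name `delta_k`, the input layout, and the example values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev

def delta_k(lnp_by_k):
    """Return {K: Delta K} for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps K -> list of ln P(X|K) values, one per replicate run.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K is undefined at the ends of the K range
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # avoid dividing by zero when all runs agree exactly
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result

# Hypothetical log-likelihoods, three runs per K (not data from the study)
example = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4400.0, -4410.0, -4390.0],
    4: [-4380.0, -4420.0, -4400.0],
}
scores = delta_k(example)
best_k = max(scores, key=scores.get)
```

Because the statistic needs both neighbouring values of K, it cannot be evaluated at the smallest or largest K tested, which is one reason it is typically combined with other guidelines.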
We assessed genetic differentiation among population pairs by calculating FST values with **GenoDive** v. 2.0b27. We also ran isolation by distance (IBD) analyses for all populations and for all CA populations using GenoDive v. 2.0b27.

Although the SNP chip has 50,000 probes, only 27,674 passed the initial stringent testing, which required unambiguous genotyping, biallelic and polymorphic markers, and Mendelian inheritance. Further filtering was done using **PLINK** v. 1.9 to exclude alleles present in <1% of samples, as these could be genotyping errors; loci not conforming to Hardy-Weinberg expectations (threshold of 0.00001); and loci genotyped in <98% of the samples. These filtering parameters are standard for SNP chip data. The dataset contained 15,698 SNPs after this filtering.

For the SNP data, we performed four runs of the Bayesian program fastSTRUCTURE to estimate the number of genetic clusters and to calculate ancestry fractions for each individual given K genetic clusters. The results were visualized using DISTRUCT v. 1.1. For comparison, we also used the maximum likelihood software ADMIXTURE 1.3.0 and the cross-validation (CV) error method described in the software’s manual to estimate the number of genetic clusters and visualize the ancestry fractions calculated for each individual. PCAs were conducted in both Adegenet and PLINK and plotted in R v. 3.2.4. [...]

We inferred demographic history and estimated relevant parameters using microsatellite data and Approximate Bayesian Computation (ABC) methods as implemented in the program **DIYABC**. Four colonization scenarios were tested to determine whether the current Californian populations are more likely to be the result of one or two introduction events. In the first scenario, Northern California populations originate from an invasion from the South Central US, and Southern California populations from an invasion from the Southwest US.
In the second scenario, the origins are reversed: Northern California populations originate from the Southwest US and Southern California populations from the South Central US. The third scenario depicts a single invasion into California, and the fourth scenario is a neutral model in which all four populations branch from a common ancestor at the same time.

We ran the analysis using two different datasets. First, to reduce the excess noise that can result from grouping disparate populations together, we ran the analyses with regional groups made up of 133 randomly chosen individuals from populations that were representative of the geographic regions, as identified in STRUCTURE. For example, San Mateo, Madera, and Fresno always clustered together, so they were chosen to represent Northern California. To assess whether excluding some populations in this first test biased the analysis, we ran the analysis again, this time forming the four regional groups from 217 randomly selected individuals drawn from all populations in the respective geographic region. In both cases, the number of individuals was set by the number of individuals in the smallest of the four groups.

For the DIYABC analyses, we first simulated 1,000,000 datasets for each scenario, resulting in 4,000,000 total simulated datasets. To determine which scenario was most supported by the data, we evaluated the relative posterior probability using a logistic regression on the 4,000 (1%) simulated datasets closest to the observed dataset. To estimate demographic parameters, we chose scenario 1 and estimated posterior distributions of parameters taking the 1,000 (1%) closest simulated datasets, after applying a logit transformation of parameter values. To evaluate confidence in the posterior probability of scenarios (in the form of Type I and Type II errors), we used a logistic regression on 250 test datasets simulated for each scenario with the same values that produced the original dataset.
Priors and parameters are provided in the publication's tables. […]
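
The idea of retaining the simulated datasets closest to the observed one can be illustrated with a minimal rejection-style ABC sketch. This is not DIYABC's procedure, which compares many summary statistics and applies regression steps on top of the retained simulations. The toy model, the single summary statistic, and all names (`abc_rejection`, `toy_prior`, `toy_simulate`) are assumptions made purely for illustration.

```python
import random

def abc_rejection(observed, simulate, prior, n_sims, keep_fraction, rng):
    """Keep the parameter draws whose simulated summary is closest to `observed`."""
    draws = []
    for _ in range(n_sims):
        theta = prior(rng)                # draw a parameter from the prior
        summary = simulate(theta, rng)    # simulate a dataset, reduce to a summary
        draws.append((abs(summary - observed), theta))
    draws.sort(key=lambda pair: pair[0])  # smallest distance first
    n_keep = max(1, int(n_sims * keep_fraction))
    return [theta for _, theta in draws[:n_keep]]

# Toy model (hypothetical): the parameter is the mean of a normal distribution,
# and the summary statistic is the mean of 50 simulated observations.
def toy_prior(rng):
    return rng.uniform(0.0, 10.0)

def toy_simulate(theta, rng):
    return sum(rng.gauss(theta, 1.0) for _ in range(50)) / 50

rng = random.Random(1)
accepted = abc_rejection(observed=4.0, simulate=toy_simulate, prior=toy_prior,
                         n_sims=5000, keep_fraction=0.01, rng=rng)
posterior_mean = sum(accepted) / len(accepted)
```

The retained draws in `accepted` approximate the posterior distribution of the parameter. A smaller `keep_fraction` tightens the approximation but leaves fewer draws to summarize.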