## Similar protocols

## Protocol publication

[…] Genetic relationship among accessions was assessed by a cluster analysis (CA) using the Neighbor-Joining (NJ) algorithm as implemented in the **MEGA** 6.0 software [] using a distance matrix assembled based on the proportion of alleles shared between two accession for all possible pair-wise combinations []. The bootstrap interior branch test [] was used to test reliability of interior braches on the tree. The principal components analysis (PCA) was performed on the multilocus genotype data using the R package **adegenet** []. The accessions were projected onto a two dimensional space bound by the first two principal axes to elucidate the genetic relationships within and among geographic groups.The genotypic data were subjected to a Bayesian model-based CA using the software package STRUCTURE 2.3.1 [] to determine the optimum number of groups reflecting the genetic structure. STRUCURE allocates individuals into clusters (K) based on multilocus genotype data, so as to minimize deviations from Hardy-Weinberg and linkage equilibrium. The program uses a Markov Chain Monte Carlo (MCMC) procedure to estimate P(X|K), the posterior probability that the data (X) fit the hypothesis of K clusters. The analysis assigns individuals into each of the K clusters based on the membership coefficient (Q-value) which sums to unity over the number of clusters (K) assumed. STRUCTURE was set to ignore population information, and to use an admixture model with correlated allele frequencies, as it is considered the best option for subtle population structure []. The degree of admixture (α) was allowed to be inferred from the data. α is close to zero when most individuals are from one population or another, while it is greater than one when most individuals are admixed []. The allele frequency parameter (λ) was set to one as suggested in the STRUCTURE manual. From a pilot study, we found that burn-in and MCMC simulation lengths of 100,000 replicate runs were optimum to achieve accurate parameter estimates. We let the number of clusters (K) vary between 2 and 18 with 20 replicate runs to quantify the variation of the likelihood for each K. The K value that provides the maximum likelihood (Ln P(D) in STRUCTURE) across runs is generally inferred as the most probable number of clusters. However, the interpretation of K should be treated with care as it merely provides an ad hoc approximation [] and sometimes genuine and subtle population structure may be missed by STRUCTURE. Therefore we used an ad hoc statistic ΔK to choose the optimum number of clusters (K) based on the second order rate of change in the log probability of data between successive K values as proposed by Evanno et al. []. [...] The multilocus genotype data were pooled into five geographic groups matching the results of the CA and subjected to analysis of total and within-group genetic diversity measures such as mean number of alleles per locus (A), observed (Ho) and expected (He) levels of heterozygosity, and fixation index (F) for different loci. Allelic richness (Ar) and private allelic richness (PAr) for each population were estimated using the rarefaction method [], which compensates for differences in sample size (i.e. rarified allelic richness) among populations as implemented in hp-rare 1.1 []. The estimates of Ar and PAr were geographically projected using an inverse distance weighted (IDW) interpolation tool implemented in the ArcMap 10.1 (ESRI, Redlands, CA USA). Gene diversity analysis was performed on the allele frequency data from the five geographic groups by following the method suggested by Nei []. The total gene diversity (HT) was partitioned into gene diversity due to variation within groups (HG), and the component due to variation between groups (DGT). Differentiation between groups was calculated as GGT = DGT/HT, where GGT can vary between 0 (when HG = HT) and 1 (when HG = 0), i.e. groups fixed for different alleles.The group-wise microsatellite data were also analyzed using the analysis of molecular variance (AMOVA) as implemented in the software package **ARLEQUIN** version 3.6 []. The total variance was partitioned into variation within and among groups. The variance components from AMOVA were used to estimate the population subdivisions within and among groups. Contingency χ2 analysis was performed to determine the heterogeneity among groups before performing AMOVA. A population pair-wise FST matrix was computed to assess genetic differentiation among different geographic groups. [...] The unfiltered data with 237 data points and two filtered data with 136 and 112 occurrence points, were subjected to model tuning using an R package **ENMeval** [] to identify the optimum data set for modeling current distribution of walnut. The ENMevaluate function in the ENMeval package performs tuning and evaluation of models by automatically implementing MaxEnt with a range of user-defined settings. It executes a series of tasks: (1) partitions occurrence and background data points into spatially independent evaluation bins using six different methods for k-fold cross validation [, ]; (2) builds a series of models with different user-specified feature classes (FCs) and regularization multipliers (RMs); and (3) computes five different evaluation metrics to aid in selecting optimum model settings. The evaluation metrics include: (i) the area under the curve (AUC) of the receiver operating characteristic (ROC) for the test data (AUCTEST []); (ii) AUCDIFF which is the difference between AUCTRAIN and AUCTEST []; (iii) minimum training presence omission rate (ORMTP []); (iv) 10% training omission rate (OR10 []); and (v) the Akaike information criterion (AICc []). AUCTEST measures the model’s ability to discriminate conditions at test occurrence localities from those at background localities averaged over k iterations, with higher values indicating better discrimination. AUCDIFF is positively associated with the degree of overfitting. Omission rates provide information regarding the ability to discriminate between suitable and unsuitable sites as well as quantify model overfitting by comparing threshold-dependent omission rates with theoretically anticipated levels of omission. ORMTP indicates the proportion of test localities with suitability values lower than those associated with the lowest-ranking training locality with values greater than zero typically indicating model overfitting. OR10 indicates the proportion of test localities with suitability values (relative occurrence rate corresponding to MaxEnt’s raw output) lower than those excluding the 10% of training localities with the lowest predicted suitability. Under either threshold rule, pixels with values equal to or higher than the threshold are considered suitable. Omission rates greater than the theoretical expectation for a given threshold typically indicate model overfitting. The AIC corrected for small sample size (AICc) reflects both model goodness-of-fit and complexity, where the best model has the lowest value (i.e. ΔAICc = 0).We applied the “block” method to partition both occurrence and background data, which splits data along the latitude and longitude lines, and allocates equally into four bins for cross validation. It is the best method for studies involving model transfer across space and time []. We built models with the RMs ranging from 1.0 to 5.0 at increments of 0.5 and six FC combinations: Linear (L); Linear and Quadratic (LQ); Hinge (H), Linear, Quadratic, and Hinge (LQH); Linear, Quadratic, Hinge, and Product (LQHP); and Linear, Quadratic, Hinge, Product, and Threshold (LQHPT) with 10000 background points. The RM imposes a penalty on model complexity and FC determines the shape of response curves, both act in concert with each other to reduce complexity of models. Computation of all evaluation metrics used MaxEnt raw output values, which is interpreted as relative occurrence rate (ROR) []. The model with ΔAICc equal to zero is considered the best model []. We computed Schoener’s D statistic that considers the geographic variability pixel-by-pixel to quantify pair-wise similarity among different models. Based on model tuning for different data sets, we selected the data set filtered at 10 km with 137 occurrence points as the best for hindcasting LGM and LIG distributions of walnut. We ran MaxEnt modeling with settings identified as optimum by model tuning to produce the current climatic projection and to hindcast past distributions of walnut with the Gaussian kernel density bias grid file to account for any residual sampling bias in the data set. Predicted habitat suitability maps for the current, LGM, and LIG distributions of walnut showing the relative rate of occurrence were generated in ArcMap 10.1. […]

## Pipeline specifications

Software tools | MEGA, adegenet, Arlequin, ENMeval |
---|---|

Applications | Phylogenetics, Population genetic analysis |