Computational protocol: RAD sequencing reveals genomewide divergence between independent invasions of the European green crab (Carcinus maenas) in the Northwest Atlantic

Similar protocols

Protocol publication

[…] RAD data for 242 individuals were cleaned and demultiplexed using process_radtags in Stacks v.1.21 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, ). We used ustacks for de novo loci formation with a minimum depth of 5 to create a stack and a maximum distance of 4 between stacks. One individual from Kejimkujik (KJI) was removed due to low sequence coverage, and so, we retained 241 individuals overall. We then used cstacks to construct a locus catalog based on sequence identity with 1 mismatch allowed between stacks. We used settings of M = 4 as the maximum distance between two loci to merge and n = 1 to allow two catalog loci to merge in case they are differently fixed versions of the same locus. All RAD tags were 80 bp in length. These were then filtered in the Stacks populations module to include only RAD tags present in each population and in ≥75% of individuals and a minor allele frequency (MAF) greater than 5%. This dataset was filtered for missing data and deviations from Hardy–Weinberg equilibrium using PLINK (Purcell et al., ), where individuals missing >25% of loci and loci missing >5% of genotypes were removed. Finally, only one SNP per RAD tag was retained to remove the effects of physical linkage. The resulting panel was converted for subsequent analyses using PGDSpider v2.0.8.3 (Lischer & Excoffier, ) and the R package genepopedit (Stanley, Jeffery, Wringe, DiBacco, & Bradbury, 2017). [...] Multiple approaches were used to identify loci putatively experiencing adaptive selection (outliers). We first employed BayeScan v2.1 (Foll & Gaggiotti, ) with 50,000 burn‐in and 100,000 iterations, with prior odds set to 10. Second, we used a nonhierarchical island model in Arlequin v. (Excoffier & Lischer, ) to detect outlier loci using 200,000 simulations and 100 demes simulated with default expected heterozygosity settings. Third, we used the hierarchical island model in Arlequin with the same settings as the nonhierarchical model and 10 simulated groups to further test for outlier loci. While these methods have been shown to produce higher numbers of false positive outliers in some demographic scenarios, they also show increased power and perform similarly to other methods such as FLK (Bonhomme et al., ) under recent demographic expansion models (Lotterhos & Whitlock, ). Only loci identified as outliers by all three methods were considered to reduce the number of false positives, and our putatively neutral panel was comprised only of SNPs that were not detected as outliers by any of these methods (Figure ). Outlier loci were subset from the full RAD panel using the subset_genepop function in genepopedit (Stanley et al., ). Observed (H o) and expected (H e) heterozygosities were calculated in the package hierfstat (Goudet, ) in R (R Core Team ) for each population within each SNP panel, as well as the average H o and H e for each panel. Locus‐specific heterozygosities and F ST values were computed in hierfstat and diveRsity (Keenan, McGinnity, Cross, Crozier, & Prodöhl, ), respectively, with 1,000 bootstrap replicates. Heatmaps of linkage disequilibrium r 2 values and standardized allele frequencies, where the highest major allele frequency per SNP among populations is given a value of 1.0 and the lowest major allele frequency a value of 0, for outlier and neutral SNPs, were created using custom R scripts. [...] Spatial structuring in neutral and outlier SNPs was explored using several methods. We conducted a discriminant analysis of principal components (DAPC) using adegenet 2.0.0 (Jombart, ; Jombart & Ahmed, ) in R to first find the number of clusters to best represent our 11 populations using the find.clusters function. All principal components were retained at this step. This function defines the best number of clusters using a Bayesian information criterion (BIC). We then conducted the DAPC using the defined clusters as prior information and retained the first 50 principal components to determine the best value of K for our data. The DAPC was conducted on the full SNP panel, as well as on the neutral and outlier datasets separately. Second, we used principal components analysis on the outlier and neutral SNP datasets to account for intrapopulation variance and compare with the results of our DAPC. Finally, Cavalli‐Sforza and Edwards’ () chord distances were calculated from each panel of SNP data in POPULATIONS 1.2.33 (Langella, ). Neighbor‐joining (NJ) trees based on chord distances were constructed with 1,000 bootstrap replicates for all three SNP panels, and these were midpoint‐rooted and visualized in FigTree v1.4 ( Geographic visualizations of F ST similarities were built using GenGIS version 2.5 (PMID 23922841). Maps used in GenGIS were obtained from the Natural Earth project ( structuring in the SNP and COI datasets were also directly compared. We used Arlequin v3.5.2 (Excoffier & Lischer, ) to calculate pairwise F ST values between each sampling location for the COI data and each of the three SNP panels. We then used a Mantel test with 1,000 permutations in the R package ade4 (Dray & Dufour, ) to compare pairwise F ST values between the COI data and the full SNP panel. Additionally, we conducted a Mantel test with 1,000 permutations on linearized pairwise F ST (Slatkin, ) values and geographic distance in ade4 for the full SNP panel and the COI dataset as a test of isolation by distance. Least‐cost distances between each pair of sampling locations were estimated in marmap (Pante & Benoit, ) implemented in R. An analysis of molecular variance (AMOVA) with 1,000 permutations was conducted for each dataset in Arlequin based on our optimal K = 2, where each sampled population was considered part of either a northern or southern group. […]

Pipeline specifications