Computational protocol: Oceanographic connectivity and environmental correlates of genetic structuring in Atlantic herring in the Baltic Sea

Protocol publication

[…] Eight samples, each from a different site, were run through PCR and genotyping twice at all loci as controls for error checking. In addition, all samples from the sites FI-VAASA, EE-NARVANLAHTI and DE-RUGEN were run twice through PCR and genotyping to check the results. No genotyping errors were detected. The Microsatellite Toolkit () was used to summarise the alleles at each site for each locus. These data were checked manually to identify alleles whose sizes did not match the repeat motif length, such as one base-pair differences. The remaining loci were run through Microchecker (Van Oosterhout et al. ) to check for the presence of null alleles, large allele dropout, and scoring errors caused by stutter patterns, and were tested for linkage disequilibrium (LD) using Genepop on the Web (), testing all pairs of loci in each site. Six loci (Her53, Her54, Her61, Her100, Her111, CPA102) were rejected during manual checking of Microsatellite Toolkit outputs on the basis of inconsistent and unreliable scoring, leaving 62 loci in the dataset. Two loci (Her28 and CHA1005) showed null alleles in all sites using Microchecker, and these were all significant at 5% after sequential Bonferroni correction. These two loci were removed, leaving 60 loci. No locus pairs were consistently in significant LD across the majority of sites, so no further loci were removed, leaving a final dataset of 60 loci (45 transcriptome-derived markers and 15 genome-derived markers). Individuals that were successfully genotyped at fewer than 40 of the 60 loci were then removed (11 individuals), leaving a final dataset of 694 individuals. [...] Two selection test methods were used: Fdist () implemented in Lositan (), and BayeScan (). BayeScan has the advantage of allowing estimation of population-specific FST values, thus allowing for different demographic histories and amounts of drift between populations.
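The sequential Bonferroni correction applied to the null-allele tests above can be illustrated with a short sketch. This is a minimal Python version of the step-down (Holm-type) procedure, not the authors' own script; the function name `holm_bonferroni` and the example p-values are hypothetical and serve only as illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where a test stays significant
    after the sequential (step-down) Bonferroni correction."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared
        # with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant


# Hypothetical p-values for one locus tested at five sites (illustrative only).
example_p = [0.001, 0.04, 0.012, 0.30, 0.009]
print(holm_bonferroni(example_p))
```

The procedure is less conservative than a plain Bonferroni correction because the threshold relaxes after each rejected hypothesis, while the family-wise error rate is still controlled at `alpha`.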
Lositan was run using 100 000 simulations, a ‘neutral’ mean FST (potentially non-neutral loci are removed before calculating the initial mean FST), confidence intervals of 95%, a false discovery rate (FDR) of 0.05 (implemented within Lositan), with both the Infinite Alleles and the Stepwise mutation models. BayeScan was run with burn in = 50 000, thinning = 50, sample size = 1000, number of iterations = 300 000, number of pilot runs = 20, length of pilot runs = 5000, and an FDR of 0.05 (implemented within BayeScan). Outlier loci were considered to be those identified as significant outliers by both methods. Based on the results of the outlier testing, three datasets were created consisting of: (i) all 60 loci, (ii) 59 loci, excluding the most severe outlier (Her14), and (iii) the most severe outlier, Her14, alone. BLAST searches were performed for loci Her14 and CPA107 using the NCBI website (). We used blastn searches against the nucleotide collection (nr/nt), reference RNA sequences (refseq_rna), and the three-spined stickleback (Gasterosteus aculeatus) genome, and blastx searches against non-redundant protein sequences (nr) and reference proteins (refseq_protein). [...] Allelic richness (AR, corrected for sample size) and expected heterozygosity (HE) were calculated for the three different datasets and for each sampling site using FSTAT version (). A one-way ANOVA was performed using the ‘Analysis ToolPak’ add-in in Excel to test for differences in AR and HE between sites. Departures from HWE (heterozygote deficiency) were tested using FIS values for each locus in each site using GenePop on the Web (). The average FIS with 95% confidence intervals was calculated for each site using Genetix version 4.05 (). Matrices of pairwise FST values based on the theta estimator of FST () were calculated for the three datasets and tested using the exact G-test in Genepop 4.1.0 ().
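For readers unfamiliar with the expected heterozygosity (HE) statistic mentioned above, the following Python sketch computes it for one locus at one site. It assumes Nei's unbiased gene-diversity estimator, HE = 2n/(2n − 1) · (1 − Σ p_i²), which is not necessarily the exact estimator FSTAT uses; the function name, input format, and example genotypes are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(genotypes):
    """Unbiased expected heterozygosity for one locus at one site.

    genotypes: list of (allele_a, allele_b) tuples for diploid individuals;
    None marks an individual with a missing genotype (assumed convention).
    """
    # Flatten the genotyped individuals into a list of allele copies.
    alleles = [a for g in genotypes if g is not None for a in g]
    n_genes = len(alleles)  # 2 * number of genotyped individuals
    if n_genes < 2:
        raise ValueError("need at least one genotyped individual")
    counts = Counter(alleles)
    # Sum of squared allele frequencies (expected homozygosity).
    sum_p2 = sum((c / n_genes) ** 2 for c in counts.values())
    # Small-sample correction factor 2n / (2n - 1).
    return n_genes / (n_genes - 1) * (1.0 - sum_p2)


# Hypothetical microsatellite genotypes (allele sizes in base pairs).
example = [(100, 100), (100, 104), (104, 108), None]
print(round(expected_heterozygosity(example), 4))
```

Per-site values across loci could then be averaged before comparing sites, although the averaging scheme used in the study is not stated in this excerpt.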
Overall FST was calculated for each locus, and for all loci over all sites, using GenePop version 4.1.0 (). To visualise the pairwise estimates of FST, multidimensional scaling (MDS) plots were generated for the different datasets using the cmdscale function in R (). Structure version 2.3.1 () was used to assign individuals to clusters without prior information on which sites the individuals belonged to. Structure was run using 800 000 iterations, with a burn in of 400 000, for 10 independent runs for K = 2–10, using correlated allele frequencies, and for both an admixture model and a no-admixture model. These settings were used for the three different datasets. A no-admixture model was thought to be more appropriate because only spring-spawning herring were sampled, and little mixing is thought to occur between spawning regions (; ). These analyses were run on the University of Oslo Bioportal cluster (). The results files were then used in Structure Harvester () to estimate the uppermost optimal number of clusters in the datasets using Evanno's delta K method (). The Structure Harvester outputs for the optimal value of K were then run through CLUMPP () using the ‘greedy’ algorithm to generate consensus data for the 10 independent runs. Finally, DISTRUCT was used to visualise the data ().
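The delta K statistic used above to choose the number of clusters can be sketched as follows. This Python example computes ΔK(K) = |mean L(K+1) − 2 · mean L(K) + mean L(K−1)| / sd(L(K)) from replicate Ln P(D) values, working from the run means; whether Structure Harvester performs the calculation in exactly this way is an assumption here. The function name, input layout, and likelihood values are hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Compute delta K from replicate Structure runs.

    lnp_by_k: dict mapping K -> list of Ln P(D) values, one per run.
    Delta K is undefined for the smallest and largest K tested.
    """
    ks = sorted(lnp_by_k)
    means = {k: mean(lnp_by_k[k]) for k in ks}
    delta = {}
    for k in ks[1:-1]:
        # Both neighbouring K values are required for the second difference.
        if k - 1 not in means or k + 1 not in means:
            continue
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        sd = stdev(lnp_by_k[k])
        if sd > 0:  # skip K values with no variation between runs
            delta[k] = second_diff / sd
    return delta


# Hypothetical Ln P(D) values for three runs at each K (illustrative only).
runs = {
    1: [-1000, -1002, -998],
    2: [-900, -904, -896],
    3: [-890, -895, -885],
    4: [-888, -892, -884],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
print(delta_k, best_k)
```

Because ΔK cannot be evaluated at the lowest K tested, the method cannot by itself distinguish K = 1 from higher values, which is one reason the likelihood curves are usually inspected alongside it.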
As one strong cluster was identified (DE-RUGEN, SE-STROMSTAD, DK-FREDRIKSHAVN, and LV-LIEPAJA), the analyses were repeated excluding this cluster in order to detect any hierarchical structure. To check for the impact of outlier loci on the results, shorter Structure runs (iterations = 40 000, burn in = 40 000) were performed with and without the detected cluster for K = 2–10 for the following datasets: (i) all loci detected as outliers under putative positive selection by either Lositan or BayeScan, (ii) all loci detected as outliers under putative balancing selection, (iii) all loci detected as outliers under either putative positive or putative balancing selection, (iv) all loci except those detected as outliers under putative positive selection, (v) all loci detected as outliers under putative balancing selection, and (vi) all loci except those detected as outliers. Markers derived from the transcriptome, which are gene-associated and potentially affected by directional or purifying selection, may show a different level of differentiation from loci derived from genomic libraries, which are putatively selectively neutral. To test this, we performed Welch two-sample t-tests (unpaired, not assuming equal variance, two-tailed) for differences between the marker types in mean locus-specific FST, expected heterozygosity, and allelic richness. […]
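The Welch comparison between marker types described above can be illustrated with a short Python sketch that computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. It does not compute the p-value, which would require the t distribution (for example from a statistics package), and it is not the authors' own code; the function name and the locus-specific FST values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and its approximate
    (Welch-Satterthwaite) degrees of freedom for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical locus-specific FST values for the two marker types.
transcriptome_fst = [0.010, 0.020, 0.030, 0.020]
genomic_fst = [0.020, 0.050, 0.080]
t_stat, dof = welch_t(transcriptome_fst, genomic_fst)
print(round(t_stat, 3), round(dof, 3))
```

A two-tailed p-value would then be read from the t distribution with `dof` degrees of freedom; because the degrees of freedom are estimated from the data, they are generally not an integer.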

Pipeline specifications

Software tools: Genepop, BayeScan, BLASTN, BLASTX, Structure Harvester, CLUMPP, DISTRUCT
Databases: BioPortal
Application: Population genetic analysis
Organisms: Clupea harengus