Computational protocol: Genomic population structure of freshwater‐resident and anadromous ide (Leuciscus idus) in north‐western Europe

Protocol publication

[…] Population genomic data were generated using the Genotyping‐by‐Sequencing (GBS) approach (Elshire et al. ), a method that is both economical and yields a relatively large number of single nucleotide polymorphisms (SNPs) distributed across the genome. Extracted DNA was processed by the GBS service provided by Cornell University's Institute of Biotechnology following their standard pipeline (Elshire et al. ). Initial sample optimization indicated that the six‐base cutter restriction enzyme EcoT22I (target site ATGCA|T) fragmented the genome effectively, so this enzyme was used for GBS library generation. All samples were sequenced on an Illumina HiSeq 2000 using single‐read 64 bp chemistry (including library barcodes). Raw data are available from NCBI, accession SRP067014.

Initial data analysis used the zebrafish (Danio rerio, Cyprinidae) genome (NCBI assembly number GRCz10) as the reference for the Tassel 4.3 pipeline (Bradbury et al. ). However, as only ca. 4% of the reads mapped, this approach was abandoned in favor of the UNEAK3 pipeline (Lu et al. ), which does not require a reference genome. Tags were defined as groups of more than five identical reads in the UMergeTaxaTagCountPlugin. To minimize the number of false SNPs included in the dataset, we set an Error Tolerance Rate (ETR) of 0.01 in the UTagCountToTagPairPlugin and a minimum minor allele frequency (MAF) of 0.02 in the UMapInfoToHapMapPlugin. SNPs with more than two alleles were removed before all downstream analyses. Finally, all SNPs with more than 5% missing data were removed using Plink 1.9 (Purcell et al. ). [...]

GenoDive 2.0 (Meirmans and Van Tienderen ) was used to calculate summary statistics, including the observed frequency of heterozygotes within sampling sites (HO), the expected frequency of heterozygotes within sites (HS), also known as gene diversity, and the expected frequency of heterozygotes over all populations (HT).
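
To make the heterozygosity statistics named above more concrete, the following is a minimal Python sketch that computes HO and an uncorrected HS for a single sampling site, together with a ratio-based inbreeding coefficient, 1 - HO/HS. It is an illustration only and not GenoDive's implementation: it assumes biallelic genotypes coded as the number of copies of one allele (0, 1 or 2, with None for a missing call), uses the plain 2pq estimator without any sample-size correction, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed coding: copies of one allele (0, 1 or 2); None marks a missing call.
Genotype = Optional[int]


def site_heterozygosity(loci: List[List[Genotype]]) -> Tuple[float, float]:
    """Return (HO, HS) for one site, averaged over loci with at least one call.

    loci[i] holds the genotype calls of all individuals at locus i.
    """
    ho_values, hs_values = [], []
    for calls in loci:
        called = [g for g in calls if g is not None]
        if not called:
            continue  # locus entirely missing at this site
        n = len(called)
        p = sum(called) / (2 * n)  # frequency of the counted allele
        ho_values.append(sum(1 for g in called if g == 1) / n)
        hs_values.append(2 * p * (1 - p))  # uncorrected expected heterozygosity
    if not ho_values:
        raise ValueError("no called genotypes at this site")
    return sum(ho_values) / len(ho_values), sum(hs_values) / len(hs_values)


def inbreeding_coefficient(ho: float, hs: float) -> float:
    """Return 1 - HO/HS: positive = heterozygote deficiency, negative = excess."""
    if hs == 0:
        raise ValueError("HS is zero; the coefficient is undefined")
    return 1.0 - ho / hs


# Toy data: two loci scored in five individuals (one missing call).
example_site = [
    [0, 1, 1, 2, None],
    [0, 0, 1, 2, 2],
]
example_ho, example_hs = site_heterozygosity(example_site)
print(f"HO={example_ho:.2f} HS={example_hs:.2f} "
      f"GIS={inbreeding_coefficient(example_ho, example_hs):.2f}")
```

In this toy example HO is lower than HS, so the coefficient is positive, which corresponds to a heterozygote deficiency.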
Other general statistics included the number of alleles, the number of effective alleles, the fixation index (FST), and deviations from Hardy‐Weinberg equilibrium expressed as inbreeding coefficients (GIS; 10,000 permutations), where positive values indicate a heterozygote deficiency and negative values an excess of heterozygotes. Isolation by distance was assessed for all sample sites using the Isolation‐By‐Distance Web Service (IBDWS) (Jensen et al. ), comparing FST values first with Euclidean geographic distances and then with waterway distances (Table S1). The IBD analyses were undertaken for both the full dataset and a dataset excluding the geographically distant KRO sample site.

Admixture version 1.23 was used to estimate ancestral relations of the sample sites (Alexander et al. ). The analysis was performed for 2–14 clusters using default settings, and convergence was assessed by running the algorithm until the log‐likelihood difference between iterations was less than 10^-4. Admixture output was plotted using an in‐house script, available from the authors upon request. The ancestral fractions for the most likely number of clusters, according to the cross‐validation (CV) error, were summed for each sample site and plotted on a geographic map (Fig. ) using ArcMap 10.3. The map also includes mean salinity data for the Baltic Sea (1999–2009) and the North Sea (2007–2008) (www.myocean.eu). Also displayed on the map were the 10 × 10 km squares in which ide have been observed since 1995. These data were provided by the Natural History Museum of Denmark's extensive national fish atlas database (Carl & Møller, unpublished data). Also included were coastal areas of Sweden where ide have been reported (Kullander et al. ).

In order to assess the connectivity of the samples and sample sites, a principal component analysis of the SNP dataset was undertaken using the SmartPCA software of the Eigensoft package (Patterson et al. ). The dataset was reduced to 10 eigenvectors, and eigenvectors 1 versus 2 and 1 versus 3 were plotted using the Perl script Ploteig, also included in the Eigensoft package. […]
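
The principal component analysis described above can be illustrated with a small pure-Python sketch. This is not SmartPCA and makes several assumptions: genotypes are complete (no missing calls) and coded 0/1/2, each SNP is centred on its mean and scaled by sqrt(p(1 - p)), and the leading eigenvectors of the individual-by-individual similarity matrix are obtained by power iteration with deflation rather than a full eigendecomposition. All function names are hypothetical.

```python
import math
import random
from typing import List


def normalise(genotypes: List[List[int]]) -> List[List[float]]:
    """Centre and scale each SNP (rows = individuals, columns = SNPs).

    Monomorphic SNPs carry no information and are dropped.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0  # allele frequency
        if p <= 0.0 or p >= 1.0:
            continue
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / scale for g in col])
    if not columns:
        raise ValueError("no polymorphic SNPs")
    return [[c[i] for c in columns] for i in range(n)]


def similarity_matrix(x: List[List[float]]) -> List[List[float]]:
    """Individual-by-individual matrix X X^T / m for m retained SNPs."""
    m = len(x[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / m for rj in x] for ri in x]


def leading_eigenvectors(matrix: List[List[float]], k: int,
                         iterations: int = 500, seed: int = 1) -> List[List[float]]:
    """Top-k eigenvectors of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]
    vectors = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(a * b for a, b in zip(row, v)) for row in work]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break  # nothing left to extract
            v = [c / norm for c in w]
        wv = [sum(a * b for a, b in zip(row, v)) for row in work]
        eigenvalue = sum(a * b for a, b in zip(v, wv))
        vectors.append(v)
        # Deflate: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return vectors


def pca_coordinates(genotypes: List[List[int]], k: int = 2) -> List[List[float]]:
    """Return k eigenvectors; entry i of each is the coordinate of individual i."""
    return leading_eigenvectors(similarity_matrix(normalise(genotypes)), k)


# Toy data: six individuals from two differentiated groups, five SNPs.
example = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 0],
    [0, 0, 2, 2, 1],
    [0, 0, 2, 1, 1],
    [0, 1, 2, 2, 0],
]
pc1, pc2 = pca_coordinates(example, k=2)
for i, (a, b) in enumerate(zip(pc1, pc2)):
    print(f"individual {i}: PC1={a:+.3f} PC2={b:+.3f}")
```

Plotting the first returned vector against the second gives the kind of scatter plot referred to in the text. In the toy data, the two groups fall on opposite sides of zero along the first axis; the sign of an eigenvector is arbitrary, so only the separation is meaningful.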

Pipeline specifications

Software tools: TASSEL, PLINK, Genodive, IBDWS, ADMIXTURE, EIGENSOFT
Applications: Population genetic analysis, GBS analysis, GWAS
Organisms: Danio rerio, Homo sapiens