Computational protocol: Genomic diversity and differentiation of a managed island wild boar population

Protocol publication

[…] To examine genetic differentiation between the Sardinian population and the other wild and domestic populations, we performed Principal Component Analysis (PCA) using FLASHPCA. To avoid the possible confounding effect of loci in linkage disequilibrium, the calculation was performed on a subset of 30 127 SNPs (referred to as 30 K) obtained by removing SNPs in linkage disequilibrium (r² > 0.5) with PLINK 1.09. On the basis of the PCA results and in concordance with previous findings, we grouped the populations into four main groups: domestic breeds (DP), Sardinian WB (WSar), Italian WB (WIta) and the remaining European wild populations pooled together (WEur). The 50 K data set was used to assess variability levels of the populations by calculating minor allele frequencies (which indicate the abundance of rare alleles across the genome), as well as expected (HE) and observed (HO) heterozygosity within populations, in PLINK. FIS and Hardy–Weinberg equilibrium were tested in the Sardinian population using GENEPOP 4.1, adjusting significance levels by the sequential Bonferroni procedure. Genetic differentiation among populations was estimated by calculating pairwise FST values in GENEPOP using the 30 K subset to avoid bias owing to physical linkage between loci.

To explore the genetic structure across populations and to assess the source of putative immigrants or introgressed individuals in the Sardinian population, Bayesian clustering assignment was performed on the 30 K data set in STRUCTURE 2.2, run on the Bioportal cluster at the University of Oslo. Ten independent runs of 100 000 iterations, following a burn-in of 80 000 iterations, were performed for each value of K between 1 and 20, without prior population information, assuming independence among loci and allowing admixture. The Evanno method was applied to infer the most reliable number of genetic clusters (K).
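As an illustration of the Evanno step, the following minimal Python sketch computes ΔK from the log-likelihoods of replicate STRUCTURE runs. It is not the authors' code: the function name `evanno_delta_k`, the input layout (a dictionary mapping K to a list of Ln P(D) values, one per run) and the example numbers are assumptions made for illustration. It uses the convention of taking the second-order difference of the across-run mean log-likelihoods and dividing by the across-run standard deviation at K; other implementations may average run-wise differences instead.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} from replicate log-likelihoods.

    lnp_by_k: dict mapping K to a list of Ln P(D) values, one per
    independent run (hypothetical input layout).
    Delta K is only defined where both K-1 and K+1 were also run.
    """
    ks = sorted(lnp_by_k)
    mean_l = {k: mean(lnp_by_k[k]) for k in ks}
    sd_l = {k: stdev(lnp_by_k[k]) for k in ks}

    delta = {}
    for k in ks:
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue  # first and last K have no second-order difference
        if sd_l[k] == 0:
            continue  # ratio undefined when all runs agree exactly
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        delta[k] = second_diff / sd_l[k]
    return delta


# Made-up log-likelihoods for K = 1..4, three runs each (illustration only).
example = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -795.0, -785.0],
    4: [-788.0, -780.0, -796.0],
}
delta_k = evanno_delta_k(example)
best_k = max(delta_k, key=delta_k.get)  # K with the largest delta K
```

With ten runs per K, as in the protocol, each list would hold ten values; ΔK cannot be evaluated at the smallest and largest K tested.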
Within the identified K value, among the 10 runs, we selected the one with the highest log-likelihood and calculated, for every individual, the proportion of membership to each of the K clusters (qi).

To evaluate the extent and nature of genetic introgression in the Sardinian WB, we first identified, at each value of K, which of the K clusters was associated with the Sardinian sample, that is, the cluster to which the WSar population showed the highest average membership (QSar). We then studied the variation of QSar at increasing values of K in the cumulative sample including all other populations in the data set, and selected the value of K at which this contribution levelled off (that is, K=12). In doing so, we minimised the possible noise due to an incomplete partitioning of Sardinian vs non-Sardinian populations. We subsequently investigated the distribution of qSar among WSar individuals at the selected K (K=12) and ordered individuals by qSar (from the purest to the most introgressed).

As both diversity measures and detected levels of differentiation can be affected by the number and nature of the individuals under study, we recalculated minor allele frequencies, HE, HO and pairwise FST and repeated the PCA using a subset of 15 individuals (equalling the smallest population) for each group. In so doing, we used three different subsamples of the Sardinian population: a random subsample of 15 individuals (Random), the 15 individuals at the upper extreme (Top) and the 15 at the lower extreme (Bottom) of the distribution of qSar. Similarly, a possible bias in the Bayesian analysis due to unequal sample size among populations was addressed by re-running STRUCTURE with the same parameters described above (K=1–10) on the random subset of 15 individuals per population.

Runs of homozygosity (ROH) were identified across the genome of Sardinian WB and compared with reference populations.
Sliding windows of 10 Kbp (20 SNPs), allowing for one heterozygous call per window, were scanned across the genome using PLINK. To avoid overestimation of ROHs owing to rare allele removal, no additional filtering for low allele frequencies was applied. As ROH analysis is sensitive to sample size and within-population heterogeneity, we repeated it with the same parameters considering only homogeneous clusters on the basis of STRUCTURE results at K=12 and an even sample size (N=8, that is, the size of the smallest homogeneous cluster in our data set), achieved by random sampling of individuals within clusters. For the WSar population, we also calculated ROH for the eight individuals at each of the two extremes of the qSar distribution (Top and Bottom). Differences between the normalised distributions of ROH values from the complete WSar data set and the reduced ones were tested using two-sample Kolmogorov–Smirnov tests in R 3.0.2.

To assess population-specific differences in the spatial distribution of ROHs across the genome, ‘ROH hotspots’ were identified as locations of the genome for which the proportion of individuals showing a ROH (fROH) exceeded the 99th percentile of the distribution of fROH in the population. In so doing, we grouped individuals into the four previously used populations (that is, WSar, WIta, WEur and DP). We then checked the number of ROH hotspots shared by different populations. Finally, for each individual, the number and cumulative size of ROHs in the genome were calculated and the results averaged by population. As the chip methodology is likely to underestimate the number of small ROHs in the genome, only ROHs > 10 Mbp were considered in this calculation. […]
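The ROH hotspot definition above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' pipeline: it assumes that fROH is evaluated at each SNP position of one chromosome, that ROH segments per individual are already available as (start, end) pairs (for example parsed from PLINK output), and that the 99th percentile is taken with the nearest-rank rule. The function names and the toy data are hypothetical.

```python
import math


def froh_per_snp(snp_positions, roh_by_individual):
    """Proportion of individuals whose ROH covers each SNP position."""
    n = len(roh_by_individual)
    froh = []
    for pos in snp_positions:
        carriers = sum(
            any(start <= pos <= end for start, end in segments)
            for segments in roh_by_individual.values()
        )
        froh.append(carriers / n)
    return froh


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def roh_hotspots(snp_positions, roh_by_individual, pct=99):
    """SNP positions whose fROH strictly exceeds the pct-th percentile."""
    froh = froh_per_snp(snp_positions, roh_by_individual)
    threshold = percentile_nearest_rank(froh, pct)
    return [pos for pos, f in zip(snp_positions, froh) if f > threshold]


# Toy data: 200 SNPs on one chromosome, four individuals (illustration only).
snps = list(range(1, 201))
rohs = {
    "ind_A": [(10, 60)],
    "ind_B": [(50, 55)],
    "ind_C": [(50, 120)],
    "ind_D": [(50, 51), (150, 160)],
}
hotspots = roh_hotspots(snps, rohs)
```

Running the function once per population (WSar, WIta, WEur and DP) and intersecting the returned positions would give the hotspots shared between populations.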

Pipeline specifications

Software tools: FlashPCA, PLINK, Genepop
Applications: Population genetic analysis, GWAS
Organisms: Sus scrofa, Homo sapiens, Sus scrofa domesticus