Computational protocol: Population differentiation or species formation across the Indian and the Pacific Oceans? An example from the brooding marine hydrozoan Macrorhynchia phoenicea

Protocol publication

[…] One representative of each multilocus genotype (MLG) per site was used for further analyses. All tests in this study were corrected for multiple testing by controlling the false discovery rate (FDR; Benjamini & Hochberg, ). We used Micro‐Checker v.2.3 (van Oosterhout et al., ) to check for scoring errors and to estimate null allele frequencies. Linkage disequilibrium (LD) was tested using Arlequin v.3.5 (Excoffier & Lischer, ) among all pairs of loci within each site with 10³ permutations. Observed (H_O) and expected (H_E) heterozygosities and tests for Hardy–Weinberg equilibrium (HWE) were computed using Arlequin v.3.5 (Excoffier & Lischer, ) within all sites and over all loci. Average allelic richness and private allelic richness were compared among sites using HP‐RARE (Kalinowski, ), which corrects for uneven sample sizes by rarefaction. The software sampled 11 individuals at random from each site to match the smallest sample size (i.e., MAY1; Table ). [...] We investigated population differentiation and structure using four different approaches: pairwise comparisons among sites, discriminant analysis of principal components (individual level), Bayesian clustering (individual level), and network construction (site and individual levels). First, the geographic origin of individuals (i.e., site) was treated as an a priori defined population, except in clustering analyses. Pairwise F_ST (Wright, ) comparisons among sites were conducted with Arlequin v.3.5 (Excoffier & Lischer, ); the significance of the observed F_ST statistics was tested using the null distribution generated from 5 × 10³ nonparametric random permutations. Jost's D (Jost, ) comparisons among sites were conducted with GENODIVE v.2.0 (Meirmans & van Tienderen, ); the significance of the observed Jost's D statistics was tested with DEMEtics (Gerlach et al. ), which uses a bootstrap method (1,000 bootstrap repeats) to estimate p‐values.
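The FDR correction applied to all tests above can be illustrated with a minimal sketch of the Benjamini–Hochberg step‐up procedure. This is not the code used in the study: the function name and the example p‐values are hypothetical, chosen only to show how raw p‐values are converted into adjusted values before comparison with a significance level.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from five pairwise tests (illustration only).
p_example = [0.001, 0.008, 0.039, 0.041, 0.20]
q_example = benjamini_hochberg(p_example)
# A test is retained as significant when its adjusted value is <= 0.05.
significant = [q <= 0.05 for q in q_example]
```

With these made‐up inputs, the third and fourth tests have raw p‐values below 0.05 but are no longer significant after adjustment, which is the behaviour the correction is meant to provide.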
Fisher's exact tests of site differentiation based on genic frequencies (Raymond & Rousset, ) were performed in Genepop v.4.6 (Raymond & Rousset, ). To understand the mechanisms that may be responsible for the observed patterns of population structure, we compared estimates of genetic differentiation to geographic distances among sites. We used a Mantel test (Mantel, ) to evaluate the correlation between linearized genetic differentiation [Slatkin's distance: F_ST/(1 − F_ST)] and the log‐transformed straight‐line geographic distance [ln(distance)] among sites (Table ). This relationship is expected to be positive and linear in the context of a two‐dimensional isolation by distance (IBD) model (Rousset, ). All Mantel tests were performed using the program GENODIVE v.2.0 (Meirmans & van Tienderen, ) with 10⁴ random permutations to assess significance. [...] Population structuring was also assessed without a priori stratification of samples. We first performed a discriminant analysis of principal components (DAPC) using the package adegenet (Jombart, ; Jombart et al., ) in R v.3.2.3 (R Development Core Team ). DAPC is a non‐model‐based method that maximizes the differences among groups while minimizing variation within groups without prior information on individuals’ origin. In addition, the method does not assume HWE or absence of LD. We used the function find.clusters() to assess the optimal number of groups with the Bayesian information criterion (BIC) method (i.e., the K with the lowest BIC value is ideally the optimal number of clusters). Note that BIC values may keep decreasing after the true K value in case of genetic clines and hierarchical structure (Jombart et al., ) and that retaining too many discriminant functions with respect to the number of populations may lead to overfitting the discriminant functions, resulting in spurious discrimination of any set of clusters.
Therefore, the rate of decrease in BIC values was visually examined to identify values of K after which BIC values decreased only subtly (Jombart et al., ); we tested values of K = 1–30. The dapc() function was then executed using the best grouping, retaining enough PCA axes to explain ≥70% of the total variance in the data, and coloring individuals according to their sampling site. Population clustering was also explored using the software Structure v.2.3.2 (Pritchard, Wen, & Falush, ; Pritchard et al., ), with the admixture model and correlated allele frequencies (Falush & Pritchard, ). This analysis assumes that the dataset comprises K populations and assigns individuals probabilistically to each population so as to maximize HWE and minimize LD. Because of the large size of our dataset and following the recommendations of Rosenberg et al. () and Jakobsson et al. (), we studied our dataset using a hierarchical approach. For each group of sites (Figure ) and each tested value of K (K varying from 1 to 10), three independent runs were conducted with a burn‐in period of 5 × 10⁴ steps followed by 5 × 10⁵ Markov chain Monte Carlo iterations. We used the statistic proposed by Evanno, Regnaut, and Goudet (), implemented in Structure Harvester v.1.0 (Earl & vonHoldt, ), to estimate the best value of K for each group of sites. The software CLUMPP v.1.0 (Jakobsson & Rosenberg, ) was used to summarize results, and the output was formatted with DISTRUCT v.1.1 (Rosenberg, ). The software Arlequin v.3.5 (Excoffier & Lischer, ) was then used to perform hierarchical analyses of molecular variance using clusters identified by Structure as populations, which mostly corresponded to islands/archipelagoes, and provinces as groups. Finally, network analyses were performed on individuals and sites.
The pattern of genetic relationships among individuals was illustrated by networks built with two measures integrating genetic information in terms of time and divergence history: the Rozenfeld Distance index (RD) and the Shared Allele Distance index (SAD). RD was developed from the Goldstein distance index. It provides a parsimonious representation of the genetic distance between individuals based on the difference in microsatellite allele lengths (Rozenfeld et al., ). SAD, on the other hand, provides the genetic distance between individuals based on the proportion of shared alleles (Chakraborty & Jin, ). RD helps to resolve ancestral polymorphism through allele lengths, which are shaped by slow evolutionary processes, while SAD helps to understand recent gene flow characterized by direct allelic exchange. The global pattern of genetic relationships among sites was illustrated by networks built with two different measures: the Goldstein distance index (GD) and the F_ST fixation index. GD groups sites according to their historical origin, while F_ST takes into account the site structure. Once the matrices of genetic distances between individuals or sites were estimated, networks were built with individuals/sites as nodes and genetic distances as the links between them. For the network construction, links were included for all distances and were then removed in decreasing order of distance until the percolation threshold (Dpe) was reached (Rozenfeld et al., ), that is, the threshold below which the network fragments into small clusters. The average clustering coefficient <C> of the whole network was estimated for each of the four networks. These analyses were performed using EDENetworks software (Kivelä, Arnaud‐Haond, & Saramäki, ). […]
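The link‐removal procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the EDENetworks implementation: the passage does not state how fragmentation is detected, so the sketch assumes that the threshold is the longest link that cannot be removed without splitting the network into more than one connected component. The function names and the five‐site distance values are hypothetical.

```python
from itertools import combinations


def is_connected(n_nodes, links):
    """Return True if the links join all n_nodes into one component."""
    neighbours = {node: set() for node in range(n_nodes)}
    for i, j in links:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:  # depth-first traversal starting from node 0
        for nb in neighbours[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == n_nodes


def percolate(dist):
    """Remove links from longest to shortest; stop before fragmentation.

    Returns (threshold distance, links kept at that threshold).
    Assumption: "fragmentation" means loss of full connectivity.
    """
    n_nodes = len(dist)
    ranked = sorted(
        ((dist[i][j], i, j) for i, j in combinations(range(n_nodes), 2)),
        reverse=True,
    )
    for pos, (d, _, _) in enumerate(ranked):
        shorter = [(i, j) for _, i, j in ranked[pos + 1:]]
        if not is_connected(n_nodes, shorter):
            # Dropping this link would split the network, so it is kept.
            return d, [(i, j) for _, i, j in ranked[pos:]]
    raise ValueError("need at least two nodes")


def average_clustering(n_nodes, links):
    """Mean local clustering coefficient; nodes with < 2 links count as 0."""
    neighbours = {node: set() for node in range(n_nodes)}
    for i, j in links:
        neighbours[i].add(j)
        neighbours[j].add(i)
    total = 0.0
    for nbs in neighbours.values():
        k = len(nbs)
        if k < 2:
            continue
        closed = sum(1 for a, b in combinations(sorted(nbs), 2)
                     if b in neighbours[a])
        total += closed / (k * (k - 1) / 2)
    return total / n_nodes


# Hypothetical pairwise distances among five sites (illustration only).
pairs = {(0, 1): 0.10, (0, 2): 0.12, (1, 2): 0.15, (3, 4): 0.11,
         (2, 3): 0.50, (2, 4): 0.60, (1, 3): 0.65, (0, 3): 0.70,
         (1, 4): 0.80, (0, 4): 0.90}
dist = [[0.0] * 5 for _ in range(5)]
for (i, j), d in pairs.items():
    dist[i][j] = dist[j][i] = d

threshold, kept_links = percolate(dist)
mean_c = average_clustering(5, kept_links)
```

In this toy example, sites 0 to 2 and sites 3 to 4 form two tight groups joined by a single longer link, so that link sets the threshold; removing it would separate the two groups.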

Pipeline specifications