Computational protocol: Insight into the roles of selection in speciation from genomic patterns of divergence and introgression in secondary contact in venomous rattlesnakes

Similar protocols

Protocol publication

[…] In a previous study (Schield et al., ), we generated restriction site‐associated DNA sequencing (RADseq) data for 43 samples from throughout the range of C. atrox. To increase sampling, we generated new RADseq data from 32 additional individuals including: 15 samples west of the Continental Divide, seven samples east of the west Texas mountains, and 10 samples from the Chihuahuan region (Figure a; ). Sampling of these populations enabled the comparative approach illustrated in Figure b. We also generated RADseq data for six individuals of the Mohave Rattlesnake (C. scutulatus; ), a close relative of C. atrox. These were used as the outgroup comparison for lineage delimitation analyses. New data generation followed Schield et al. (), which used a slightly modified version of the double‐digest RADseq approach of Peterson, Weber, Kay, Fisher, and Hoekstra (), targeting ~20,000 genomic loci per individual (see for detailed methods). We removed PCR clones from raw sequencing reads using the Stacks clone_filter module (Catchen, Hohenlohe, Bassham, Amores, & Cresko, ), and trimmed the adapter sequence and UMI bases from filtered reads (see ). Reads were demulitplexed to individuals using the Stacks process_radtags module. Trimmed paired reads were assembled into a consensus “pseudo‐reference” genome using the de novo assembly option in dDocent (Puritz, Hollenbeck, & Gold, ) to form contigs, using program defaults. This yielded 24,115 de novo C. atrox loci with an average length of 192 bases (consistent with our paired‐read length). Individual read files were mapped to the pseudo‐reference using the BWA‐mem algorithm (Li & Durbin, ), specifying paired‐read inputs, an open gap penalty of 3, and a mismatch penalty of 2. SAMtools v1.3 and BCFtools v1.3 (Li et al., ) were used to process mapping files and to call single nucleotide polymorphisms (SNPs). Variants were filtered using VCFtools (Danecek et al., ) to retain SNPs with quality scores above 30; to reduce linkage among variant sites, we sampled one variant site from each locus. We also filtered to remove SNPs with minor allele frequencies below 0.05, as these lack useful ancestry information for analyses of differentiation and introgression (Gompert et al., ). We performed analyses of genes in proximity (i.e., genetically linked) to SNPs by inferring putative orthology between our pseudo‐genome contigs and the King Cobra genome (Vonk et al., ) using a BLAST search (Altschul, Gish, Miller, Myers, & Lipman, ). BLAST results were filtered by e‐value, and a single alignment with the lowest e‐value per variable contig was retained. Here, we assumed that genetic variation consistent with natural selection observed at RADseq loci is due to physical linkage to genic targets of selection. We used the King Cobra genome as a reference because it is currently the closest relative to C. atrox with a fully annotated genome sequence. [...] We used three measures to summarize nucleotide diversity and to estimate genetic differentiation among loci between eastern and western parental populations. To understand the relative allelic diversity within each population, we first estimated Nei and Li's () π across loci and then calculated the difference between πeastern and πwestern (referred to as Δπeastwest hereafter). Our expectation for Δπeastwest is that a locus with low diversity in both populations will yield a value near 0, while a locus with greater diversity in the eastern population than western will yield a positive value, and one with greater diversity in the western population than the eastern will return a negative value. We calculated relative population genetic differentiation between populations using Weir and Cockerham's F ST (Weir & Cockerham, ). F ST is a relative measure of divergence between populations, and may be influenced by differences in nucleotide diversity within populations (Cruickshank & Hahn, ); therefore, we also used an absolute measure of population differentiation, dxy, to characterize divergence between eastern and western populations, using the calculation described in Irwin, Alcaide, Delmore, Irwin, and Owens (). Weir and Cockerham's correction sometimes yields negative values of F ST, and we converted any negative estimates to zero (the lower bound indicative of panmixia at a given locus).We designated outlier loci with greater‐than‐expected values of population differentiation using a multivariate approach, in the program MINOTAUR (Verity et al., ), using information from distributions of F ST, dxy, and Δπeastwest to identify loci that differ significantly from the genomic background. An advantage of this approach is that it makes no specific assumptions based on a single summary statistic or the demographic history of the populations, but rather uses the combined information provided by multiple statistics and is robust for detecting genomic outliers derived from a wide variety of evolutionary scenarios. We specifically used the Mahalanobis distance (MD, defined as the distance from the multivariate centroid; Mahalanobis, ) measure to delineate outlier loci where differentiation between parental populations has likely been driven by selection. Specifically, loci were considered statistical outliers if the locus‐specific MD exceeded the 95th quantile of the genomic distribution, and were considered strong outliers if they exceeded the 99th quantile.We also wished to determine whether statistical outlier loci had estimates of allelic differentiation that would not be expected under a model of neutral evolution. To test this, we designed a software program, GppFst (Adams et al., ;, which conducts posterior predictive simulations (PPS) of F ST and dxy under a strict model of evolution in the absence of selection. To conduct PPS analysis, we estimated divergence time and population parameters for the parental eastern and western populations using the 7,031 SNPs detailed above (τeastwest, θwest, θeast, and θeastwest) via Markov chain Monte Carlo (MCMC) sampling implemented in the program SNAPP (Bryant, Bouckaert, Felsenstein, Rosenberg, & RoyChoudhury, ). We stress that parameterization in SNAPP for GppFst was done using only structured parental populations and did not include admixed individuals. Additionally, we found very strong Bayes factor (BF) support for two distinct C. atrox lineages (BF = −336; see section ); we would not expect strong support for this model in the BFD* framework if gene flow were pervasive between the eastern and western populations (Zhang, Zhang, Zhu, & Yang, ). The mean posterior estimates of population genetic parameters used for GppFst simulations from SNAPP were as follows: θeast = 0.217, θwest = 0.0592, θeastwest = 0.0615, and τeastwest = 0.00578. We ran MCMC for a total of 500,000 generations, sampling every 1,000 generations. We assessed posterior convergence and stationarity using Tracer (for all parameters ESS > 500; Drummond & Rambaut, ) and discarded the first 125,000 (25%) steps as burn‐in, leaving total of 375 MCMC samples used to generate a PPS F ST distribution. For each MCMC step, we then simulated 100 independent loci with a length of 190 base pairs (equal to the average length of variant loci) under a JC69 model using the R package phybase (Liu & Yu, ) with random sampling of individuals from the empirical distributions of locus coverage for eastern and western samples, respectively.We calculated F ST and dxy for a randomly sampled polymorphic site for each simulated locus to generate a theoretical distribution of values, which allowed us to compare empirical distributions of F ST and dxy to identify loci with levels of divergence that are poorly explained by the neutral model of divergence. Specifically, we calculated the probability that the proportion of empirical loci with a given value of allelic differentiation is observed in the posterior predicted distribution, and thus were able to reject a strict model of neutral evolution where this probability is very low. Additionally, the use of simulations allowed us to account for the potential influence of unevenness in sample sizes across loci and differences in effective population size between the two populations. We considered statistical outliers from multivariate outlier detection “positives” for selection if they were also poorly explained by a neutral model of evolution based on their comparison to simulated F ST and dxy distributions in GppFst. [...] We conducted Bayesian estimation of genomic clines using the program bgc (Gompert & Buerkle, ) to identify loci that defy neutral expectations of admixture compared to the genomic background. We segregated samples into western, eastern, and admixed populations (Figure ), and calculated locus‐specific allele frequencies for each population. bgc estimates the hybrid index (h) for each individual in an admixed population, representing the proportion of an individual's genome inherited from one parental population. Hybrid index and two locus‐specific genomic cline parameters (α, the genomic cline center parameter; and β, the genomic cline rate parameter) are used to estimate the posterior probability of inheritance from one parental population (Φ) at a given locus within the admixed population. Under this model, if both α and β are equal to 0, h and Φ will be equivalent (Gompert & Buerkle, ), and will match the neutral genomic background expectation (dashed line in the genomic cline panel of the “study design” in Figure b). Loci enriched for selection can be identified based on each cline parameter and classified based on their locus‐specific cline parameters relative to a genomewide distribution (Gompert & Buerkle, ; Gompert et al., ).We used two methods to identify outlier loci that exhibit exceptional introgression compared to the neutral genomic background expectation. First, we considered loci outliers if they had evidence of excess ancestry from one parental population (i.e., the 95% confidence interval of the α parameter did not include 0; see Gompert & Buerkle, ) and also had a median α within the tails of the 95th quantile of the median α distribution. Second, “strong” statistical outliers were determined using the locus‐effect quantiles for α (γ‐quantile) relative to a genomewide distribution (Gompert & Buerkle, ); loci were considered strong outliers if their γ‐quantile fell outside of the interval q n, where 1−n2

Pipeline specifications

Software tools GppFst, PhyloNet, bgc, ADMIXTURE
Application Population genetic analysis