Computational protocol: Absence of population structure across elevational gradients despite large phenotypic variation in mountain chickadees (Poecile gambeli)

Similar protocols

Protocol publication

[…] DNA was extracted from blood samples using a Qiagen DNeasy Blood and Tissue kit (Qiagen Inc., Valencia, CA, USA) and quantified using spectrophotometry on a QIAxpert machine (Qiagen, Inc). Reduced representation libraries for Illumina sequencing were constructed using a genotyping-by-sequencing (GBS) approach used in multiple previous studies [,–]. Specifically, DNA fragments were cut using two restriction enzymes, EcoRI and MseI, which have cut sites evenly distributed throughout avian genomes []. Barcoded adaptors were ligated to the EcoRI cut site, which included an Illumina adaptor, an 8–10 bp barcode unique to each individual bird, and the bases matching the cut site. An MseI adaptor consisting of the opposite Illumina adaptor was ligated to the opposite ends (MseI cutsites) of the fragments. DNA from all individual reactions were pooled and PCR-amplified using Illumina primers. Finally, fragments between 350 and 450 bases in length were size selected using a BluePippin quantitative electrophoresis unit (Sage Science, Beverly, MA, USA). Single end, 100 base reads were generated on one lane of an Illumina HiSeq 2500 at the University of Texas Genomic Sequencing and Analysis Facility (UTGSAF, Austin, TX, USA).The 100 bp reads generated on the HiSeq run were first filtered to remove contaminant DNA (e.g. Escherichia coli; PhiX) and low quality reads. A perl script was then used to identify individual barcodes, correct barcodes with errors, and remove reads containing sequences associated with Illumina adaptors or PCR primers. After this step, fragments were 86–88 bases in length. A random subset of 25 million reads was assembled de novo using the SeqMan ngen software (DNASTAR Inc.), specifying a minimum match percentage of 95 and a gap penalty of 30 (full details of parameter settings are available from the authors by request). Contigs were removed from the reference if they contained fewer than 10 reads, were over-assembled, or were not 84–90 bp in length. This step produced a reference of genomic regions sampled with our GBS approach, providing a template for subsequent reference guided assembly. DNA sequences from each chickadee were subsequently aligned to the reference with bwa v7.5 []using the aln and samse algorithms and an edit distance of 4. Because all sampled genomic regions begin with the EcoRI cut site and all HiSeq reads contained 100 bases of sequence, these alignments produced consistently rectangular contigs with even positional coverage.Variant sites (i.e. SNPs) were called and quantified using samtools v.0.1.19 and bcftools v.0.1.19 [,]. SNPs were considered if at least 90% of individual birds had at least one read at the position, the site was biallelic, and the minor allele frequency was greater than 5%. For reference contigs containing multiple SNPs, a single SNP was randomly selected to increase independence of SNPs and to decrease the effect of linkage disequilibrium on subsequent analyses. For each bird, genotype likelihoods were calculated for each SNP using bcftools. Genotype likelihoods were initially stored in Variant Call Format (.vcf) and then converted to a composite genotype likelihood format. Genotype likelihood matrices and assembly related files are available at Dryad and additional information regarding parameter settings is available from the authors upon request.To account for uncertainty associated with variation in coverage depth, a hierarchical Bayesian model [] was employed to estimate genotype probabilities based on the genotype likelihoods estimated above. This model treats population allele frequencies as priors and simultaneously estimates both allele frequencies and genotype probabilities after accounting for variation in coverage. Essentially, individuals with low coverage for a given locus will have genotype probabilities more heavily informed by the prior (i.e. the allele frequency), while high coverage loci will have genotype probabilities with higher certainty of the homozygous or heterozygous genotypes. The model was run for 10 000 Markov chain Monte Carlo steps (thinning every other step) with a 6000 step burn-in to obtain the posterior estimates of genotype probabilities, which were subsequently stored as convenient composite genotype values (ranging from 0 to 2; 0 and 2 for homozygous and 1 for heterozygous genotypes) for use in all downstream analyses.The distribution of genetic variation among individuals was first summarized with principal component analysis (PCA) using the prcomp function in R []. As a complementary approach to PCA, we performed discriminant analysis of principle components (DAPC) [] using the adegenet package [,] in R. While PCA is inherently constrained to maximize the total variance explained in the data, DAPC maximizes the proportion of variance explained among groups of individuals []. Additionally, DAPC is useful for GBS datasets because it finds the most likely number of clusters within a dataset using k-means clustering and calculates individual assignment probabilities much faster than more computationally intensive Bayesian clustering algorithms commonly used in molecular ecology studies (e.g. Structure []). The most likely number of clusters for DAPC was calculated using the find.clusters function and the appropriate number of retained PCs was estimated using the optim.a.score function. For both PCA and DAPC, permutational multivariate analysis of variance (PERMANOVA []) implemented in the vegan package [] of R was used as a post hoc test for genetic differentiation among sampling sites based on Euclidean distances of the first two ordination axes.We summarized the degree of genetic differentiation among sampling sites by calculating Nei's genetic distance (D []) for each pairwise combination of sites. We also calculated pairwise genome-wide FST using Hudson's estimator [] as an additional metric of differentiation among sites, and tested the significance of these estimates using a permutation-based approach. To test for the effects of geographical distance and elevational distance on genetic distances among sampling sites, we used a multiple regression on distance matrices (MRM []) using the ecodist package [] in R, with pairwise genetic distance (D) as the response variable and pairwise elevational and geographical distance as the predictor variables. We estimated geographical distances among sampling sites by calculating Haversine distances from latitude and longitude coordinates using the fossil package in R []. Finally, we examined how hierarchical variation was distributed among individuals, among transects, and among the high and low elevation sites within each transect using AMOVA [] as implemented in the R poppr package [,]. For this analysis we only used genotype probabilities with relatively high certainty of either the homozygous or the heterozygous genotype (point estimates within 0.1 from 0, 1 or 2), and treated others as missing data. We tested for significance of ϕ statistics with a permutation-based approach using the randtest function in the ade4 R package [].To test for patterns of parallel locus specific allele frequency shifts across low and high elevation groups, we quantified locus specific FST estimates for all loci for the high and low elevation contrasts at each of the three geographical areas. We considered loci residing above the 97th, 98th and 99th quantiles of the FST distribution for each elevational contrast as outliers potentially residing in genomic regions experiencing directional selection. We evaluated potential parallel differentiation using a permutation-based approach to assess significance of the counts of loci that had extreme FST values in more than one elevational contrast. More specifically, the loci in all three elevational comparisons were permuted and then we counted the number of loci that resided in the upper quantile in multiple elevational comparisons; permutations were conducted 10 000 times to create a null distribution. […]

Pipeline specifications

Software tools BWA, SAMtools, bcftools, adegenet
Application Population genetic analysis
Organisms Ilex paraguariensis
Chemicals Nucleotides