Computational protocol: Genetic Population Structure Analysis in New Hampshire Reveals Eastern European Ancestry

Protocol publication

[…] Controls less than 65 years of age were selected using population lists obtained from the New Hampshire Department of Transportation. Controls 65 years of age and older were chosen from data files provided by the Centers for Medicare & Medicaid Services (CMS) of New Hampshire. We interviewed a total of 1191 controls throughout the state, of whom 70% were confirmed to be eligible for the study. Informed consent was obtained from each participant, and all procedures and study materials were approved by the Committee for the Protection of Human Subjects at Dartmouth College. Consenting participants underwent a detailed in-person interview, usually at their home. Subjects were asked to provide a blood sample (a buccal sample was requested if a blood sample could not be drawn). Genotyping was performed on all DNA samples of sufficient concentration (864 control individuals) using the GoldenGate Assay system by Illumina's Custom Genetic Analysis service (Illumina, Inc., San Diego, CA). Samples repeated on multiple plates yielded the same call for 99.9% of SNPs, and 99.5% of samples submitted were successfully genotyped. Genotype calls were 99% concordant between genotyping platforms (TaqMan). We obtained genotype information for 1529 single nucleotide polymorphisms (SNPs) in suspected cancer susceptibility genes scattered throughout the genome. After filtering out SNPs in Hardy-Weinberg disequilibrium, we used the tagSNP software within Haploview to select tag SNPs (r² = 0.8), to be sure that the clustering was not driven by LD. The 960 remaining tag SNPs were then used in the structure analysis. Only control individuals (no history of bladder cancer) were used in this study to prevent case/control status from confounding the analysis. [...] To determine whether genetic subpopulations are present in the New Hampshire population, we used Bayesian clustering as implemented in the structure program to cluster individuals using the remaining 960 SNPs.
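The Hardy-Weinberg filtering step mentioned above can be illustrated with a minimal sketch. The excerpt does not say which test or significance threshold the authors used, so the 1-df chi-square test, the function names `hwe_chi_square` and `passes_hwe`, and the cutoff `alpha=0.001` are all assumptions made purely for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi2, p_value) for a 1-df chi-square test of Hardy-Weinberg
    proportions, given counts of the three genotypes at a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_hwe(n_aa, n_ab, n_bb, alpha=0.001):
    """True if the SNP shows no significant departure from HWE.
    alpha is a hypothetical cutoff, not one taken from the protocol."""
    _, p_value = hwe_chi_square(n_aa, n_ab, n_bb)
    return p_value >= alpha
```

A SNP with genotype counts in exact Hardy-Weinberg proportions (for example 25, 50, 25) is retained, while one with no heterozygotes at intermediate allele frequency (for example 50, 0, 50) is dropped. For SNPs with low minor-allele counts, an exact test is often preferred over the chi-square approximation.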
Structure iteratively clusters individuals based on a user-supplied number of populations, K. The genotype data were analyzed using the structure (v. 2.2.3) admixture model, without population data assigned (burn-in of length 30,000, followed by 100,000 iterations), for 10 repetitions of each K from 2 to 10. This is far beyond the default number of iterations for structure, but high consistency between runs, even at large values of K, was observed at values higher than the default. We concurrently ran random genotype data and the sample data from the structure software website as negative and positive controls, respectively. CLUMPP (v. 1.1.1) was used to align the repetitions for each K, using the G′ statistic. The output from CLUMPP was used for both the ancestry and LD analyses. [...] Using a subset of the data with high LD removed, we were able to find genetic clustering using Bayesian clustering. A subsequent question was whether distinct patterns of LD could be discerned within subpopulations using the full dataset. Patterns within individual genes would lend further support or explanation to our model, as LD is known to be highly influenced by personal ancestry. The genotyped SNPs were distributed evenly throughout the genome, focusing on suspected cancer susceptibility genes. The genes with the most assayed SNPs (CYP19A1, GHR, GSK3B, KRAS, PGR, PMS1, TNKS) were used to compare LD between the clusters. D′ was calculated using PowerMarker. Individuals had to have a q value of at least 0.5001 to be included as part of a cluster for the LD analysis. Other genes were entirely in LD for all populations or did not differentiate between populations (data not shown). To statistically compare LD among these four populations, an association analysis between haplotypes and population membership was conducted between each pair of populations and between each population and all the individuals in the other populations.
The analysis was conducted in R using the haplo.stats package, which tests for association between traits and haplotypes using score statistics, with haplotype frequencies estimated by an expectation-maximization algorithm. […]
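As a rough illustration of the per-cluster LD comparison described above, the sketch below assigns each individual to a cluster using the stated q cutoff of 0.5001 and then computes the absolute value of Lewontin's D′ for a pair of biallelic SNPs within each cluster. It is not PowerMarker's implementation. It assumes haplotypes are already phased and coded 0/1, whereas the protocol works from unphased genotypes. The function names and the input layout are hypothetical.

```python
from collections import defaultdict

Q_CUTOFF = 0.5001  # membership cutoff stated in the protocol text


def assign_cluster(q_values, cutoff=Q_CUTOFF):
    """Return the index of the cluster whose membership coefficient q
    meets the cutoff, or None if no cluster does."""
    best = max(range(len(q_values)), key=lambda k: q_values[k])
    return best if q_values[best] >= cutoff else None


def d_prime(haplotypes):
    """Absolute Lewontin's D' for two biallelic SNPs.

    haplotypes: iterable of (allele_at_snp1, allele_at_snp2), coded 0/1.
    Returns None if there are no haplotypes or either SNP is monomorphic.
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        return None
    p_a = sum(a for a, _ in haps) / n               # freq of allele 1 at SNP 1
    p_b = sum(b for _, b in haps) / n               # freq of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haps if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                            # raw disequilibrium D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None
    return abs(d) / d_max


def d_prime_by_cluster(individuals):
    """individuals: iterable of (q_values, [haplotype_1, haplotype_2]).
    Pools haplotypes per assigned cluster and returns {cluster: D'}."""
    pooled = defaultdict(list)
    for q_values, haps in individuals:
        cluster = assign_cluster(q_values)
        if cluster is not None:  # individuals below the cutoff are excluded
            pooled[cluster].extend(haps)
    return {cluster: d_prime(haps) for cluster, haps in pooled.items()}
```

With a cutoff just above 0.5, an individual can belong to at most one cluster, so the pooled haplotype sets never overlap.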

Pipeline specifications

Software tools: SNPinfo, Haploview, CLUMPP, PowerMarker, haplo.stats
Applications: Population genetic analysis, GWAS
Organisms: Homo sapiens