Computational protocol: Discovery and replication of SNP-SNP interactions for quantitative lipid traits in over 60,000 individuals

Protocol publication

[…] Individuals were genotyped on the ITMAT-Broad-CARe (IBC) array, which consists of ~50,000 SNPs across ~2100 loci. The selection criteria for SNPs included on the IBC array have been described in detail previously []. Quality control filters were applied after the cohorts were merged into the full discovery dataset. A summary of the full quality control and analysis pipeline is shown in Fig. 1. All quality control procedures were implemented with the PLINK software package [] unless otherwise specified. SNPs with a genotype missing rate > 95% or that were not in Hardy-Weinberg equilibrium (p < 1.0 × 10⁻⁷) were removed from the analysis. After SNP genotyping quality control, 44,750 markers remained. Individuals with SNP genotype missing rates > 90% were excluded from the analysis. For cohorts that contained known trios, non-founders (i.e. offspring) were removed. To address unknown or cryptic relatedness, identity-by-descent (IBD) estimates were calculated, and one individual from each pair with pi-hat > 0.3 was removed. The TG values were log transformed to improve normality. Four new datasets were created, one for each of the quantitative lipid traits: HDL-C (n = 13,030), LDL-C (n = 12,853), TC (n = 16,849), and TG (n = 13,031).
Additional quality control metrics were applied to the individual lipid datasets for each of the statistical analyses. For both the main effect filter and Biofilter analyses, individuals with missing phenotypes were removed, along with variants with minor allele frequency (MAF) < 0.05 or missing genotype rate > 5%. For the main effect filter analysis, SNPs were pruned to reduce high levels of SNP correlation, or linkage disequilibrium (LD), in the data. This was performed by removing one SNP from every pair of SNPs with r² > 0.6 using PLINK. No LD pruning was done for the Biofilter interaction analyses, as these models are specifically generated using SNPs that are in different genes.
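The per-trait SNP filters (MAF < 0.05, missing genotype rate > 5%) and the r² > 0.6 pruning step can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study. It assumes genotypes are coded as allele counts (0, 1, 2) with `None` for a missing call, and it replaces PLINK's pruning with a simplified greedy comparison of all pairs. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Genotype = Optional[int]  # allele count (0, 1, 2), or None when the call is missing


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of individuals without a genotype call at this SNP."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Frequency of the less common allele among called genotypes."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def passes_snp_filters(calls: Sequence[Genotype],
                       min_maf: float = 0.05,
                       max_missing: float = 0.05) -> bool:
    """Drop a SNP when MAF < min_maf or missing rate > max_missing."""
    return (missing_rate(calls) <= max_missing
            and minor_allele_frequency(calls) >= min_maf)


def r_squared(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Squared Pearson correlation over individuals called at both SNPs."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes: Dict[str, Sequence[Genotype]],
             max_r2: float = 0.6) -> List[str]:
    """Greedy pruning: keep a SNP only if r^2 <= max_r2 with every SNP kept so far."""
    kept: List[str] = []
    for name, calls in genotypes.items():
        if all(r_squared(calls, genotypes[k]) <= max_r2 for k in kept):
            kept.append(name)
    return kept


# Toy data for 20 hypothetical individuals.
base = [0, 1, 2, 1, 0, 2, 1, 0, 1, 1] * 2
toy = {
    "snp_a": base,
    "snp_a_copy": list(base),                           # perfectly correlated with snp_a
    "snp_c": [1, 1, 0, 2, 2, 0, 1, 1, 0, 2] * 2,        # weakly correlated with snp_a
    "snp_rare": [0] * 19 + [1],                         # MAF = 0.025
    "snp_sparse": [None, None, None] + base[3:],        # 15% missing calls
}

filtered = {name: calls for name, calls in toy.items() if passes_snp_filters(calls)}
pruned = ld_prune(filtered)
print(pruned)
```

In this sketch the order of the input decides which member of a correlated pair survives. PLINK's own pruning works within sliding windows, so its output on real data would generally differ.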
SNPs with a main effect p < 0.001 in a previous GWAS regression analysis were selected for interaction testing []. We had two specific motivations for selecting this threshold: (1) to allow for interactions that may be present in the absence of large, genome-wide significant main effects, and (2) to reduce the SNP set to a size that allowed for a manageable exhaustive SNP-SNP interaction analysis. SNP-SNP models were generated by creating an exhaustive list of all SNP pairs.
Importantly, we did not LD prune for the Biofilter analysis because of the method used to generate its SNP-SNP models. Biofilter 2.0 is a software package that identifies SNP-SNP models based on probable gene-gene interactions documented in various online sources, including Gene Ontology (GO) and KEGG. The Biofilter method has previously been described in greater detail [, ]. Briefly, SNPs are mapped to genes using a 50 kb upstream or downstream inclusion criterion. Gene pairs that may be more likely to interact are then identified in various curated biological knowledge databases. A score is given based on the number of sources that indicate a possible interaction. For this analysis, models were included if at least five knowledge sources identified the gene-gene interaction model. The SNPs are then mapped back to the genes to create the SNP-SNP models for statistical testing.
To test for SNP-SNP interactions, we used an R script that automatically tests the models according to user input parameters []. We tested for significant interactions using a linear regression framework. We adjusted for age, sex, smoking status, type 2 diabetes status, BMI, medication use (use or no use of lipid lowering drugs), and potential population substructure (top 10 principal components) by including these as covariate terms in the linear regression models for each of the four lipid traits.
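The two model-generation strategies described above can be sketched in Python as follows. This is an illustration only and does not reproduce Biofilter 2.0 or its data sources. The SNP identifiers, gene names, coordinates, and source labels are hypothetical placeholders, and counting the sources that support a gene pair is a simplified stand-in for the Biofilter score.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

SnpPositions = Dict[str, Tuple[str, int]]   # SNP id -> (chromosome, base-pair position)
Genes = Dict[str, Tuple[str, int, int]]     # gene -> (chromosome, start, end)
Model = Tuple[str, str]


def exhaustive_pairs(snps: List[str]) -> List[Model]:
    """Main effect filter: every unordered pair of the selected SNPs."""
    return list(combinations(sorted(snps), 2))


def snps_in_gene(snp_pos: SnpPositions, gene: Tuple[str, int, int],
                 window: int = 50_000) -> List[str]:
    """SNPs on the gene's chromosome within `window` bp of its boundaries."""
    chrom, start, end = gene
    return sorted(s for s, (c, pos) in snp_pos.items()
                  if c == chrom and start - window <= pos <= end + window)


def knowledge_models(snp_pos: SnpPositions, genes: Genes,
                     gene_pair_sources: Dict[Tuple[str, str], Set[str]],
                     min_sources: int = 5) -> List[Model]:
    """SNP-SNP models from gene pairs supported by at least `min_sources` sources."""
    models: Set[Model] = set()
    for (g1, g2), sources in gene_pair_sources.items():
        if len(sources) < min_sources:
            continue
        for s1, s2 in product(snps_in_gene(snp_pos, genes[g1]),
                              snps_in_gene(snp_pos, genes[g2])):
            if s1 != s2:  # a SNP near both genes cannot pair with itself
                models.add((s1, s2) if s1 < s2 else (s2, s1))
    return sorted(models)


# Hypothetical toy inputs.
snp_pos = {"rsA": ("1", 120_000), "rsB": ("1", 900_000),
           "rsC": ("2", 55_000), "rsD": ("2", 5_000_000)}
genes = {"GENE1": ("1", 100_000, 150_000),
         "GENE2": ("2", 100_000, 200_000),
         "GENE3": ("1", 880_000, 890_000)}
gene_pair_sources = {
    ("GENE1", "GENE2"): {"src1", "src2", "src3", "src4", "src5"},  # meets the cutoff
    ("GENE1", "GENE3"): {"src1", "src2"},                          # too few sources
}

print(exhaustive_pairs(["rsA", "rsB", "rsC"]))
print(knowledge_models(snp_pos, genes, gene_pair_sources))
```

The exhaustive list grows quadratically with the number of SNPs, which is why the main effect threshold matters for keeping that analysis manageable. The knowledge-driven list is bounded by the number of supported gene pairs instead.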
We included these covariates to control for non-genetic factors that may affect lipid levels and to remain consistent with the previous GWAS from which the SNPs for the main effect filter analysis were chosen. In the previous study that used the same lipid measurements for a gene-centric meta-analysis of main effects [], an additional adjustment for medication was made by multiplying by a constant percentage to account for lipid lowering medication. The two adjustment methods (covariate and multiplication) gave similar results; therefore, we only included the covariate adjustment results in this manuscript. We chose to include the top 10 principal components to remain consistent with the previous GWAS and to control for any residual variation, as we were performing these analyses in a combined cohort that included individuals from various parts of the country. Models with likelihood ratio test p-values < 0.001 (comparing the full and reduced linear regression models, Eqs. 1 and 2) were selected for replication testing. We adjusted the threshold using a Bonferroni correction based on the total number of models that were tested for each filtering method. We considered these models to be independent because of the LD pruning in the main effect filter analysis and because the SNPs are in different genes for the Biofilter analysis (Fig ).
$$\text{reduced: } y = \alpha + \beta_1\,\mathrm{SNP}_1 + \beta_2\,\mathrm{SNP}_2 + \beta_3\,\mathrm{age} + \beta_4\,\mathrm{BMI} + \beta_5\,\mathrm{med.} + \beta_6\,\mathrm{T2D} + \beta_7\,\mathrm{smoking} + \beta_8\,\mathrm{sex} + \beta_{9\text{-}18}\,\mathrm{PC}_{1\text{-}10} \tag{1}$$
$$\text{full: } y = \alpha + \beta_1\,\mathrm{SNP}_1 + \beta_2\,\mathrm{SNP}_2 + \beta_3\,\mathrm{age} + \beta_4\,\mathrm{BMI} + \beta_5\,\mathrm{med.} + \beta_6\,\mathrm{T2D} + \beta_7\,\mathrm{smoking} + \beta_8\,\mathrm{sex} + \beta_{9\text{-}18}\,\mathrm{PC}_{1\text{-}10} + \beta_{19}\,\mathrm{SNP}_1 \times \mathrm{SNP}_2 \tag{2}$$
The full model consisted of the same SNP and covariate terms as the reduced model, but with an additional multiplicative interaction term for the SNP-SNP model.
We generated “proxy” models by identifying SNPs in high linkage disequilibrium (LD) (r² > 0.8) with model SNPs based on the HapMap European CEPH (CEU) population in 1000 Genomes Project Pilot 1 data (2010 release) using SNAP [].
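The likelihood ratio comparison of the reduced and full regression models can be sketched with the Python standard library alone. This is not the authors' R script. It fits ordinary least squares through the normal equations, uses the Gaussian likelihood ratio statistic n·ln(RSS_reduced / RSS_full), and converts it to a p-value with the 1-degree-of-freedom chi-square tail, which equals erfc(√(stat/2)). The simulated data, effect sizes, and the single standardized covariate standing in for the full covariate set are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(design: List[List[float]], y: Sequence[float]) -> float:
    """Residual sum of squares of the least-squares fit of y on the design matrix."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def interaction_lrt(snp1: Sequence[int], snp2: Sequence[int],
                    covariates: Sequence[Sequence[float]],
                    y: Sequence[float]) -> Tuple[float, float]:
    """Return (LRT statistic, p-value) for adding the SNP1*SNP2 term."""
    reduced = [[1.0, float(a), float(b), *cov]
               for a, b, cov in zip(snp1, snp2, covariates)]
    full = [row + [float(a * b)] for row, a, b in zip(reduced, snp1, snp2)]
    stat = len(y) * math.log(rss(reduced, y) / rss(full, y))
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Simulated example with a true interaction effect (all values hypothetical).
rng = random.Random(0)
n = 300
snp1 = [rng.choice((0, 1, 2)) for _ in range(n)]
snp2 = [rng.choice((0, 1, 2)) for _ in range(n)]
covariates = [[rng.gauss(0.0, 1.0)] for _ in range(n)]
y = [1.0 + 0.2 * a + 0.1 * b + 0.3 * c[0] + 1.0 * a * b + rng.gauss(0.0, 0.5)
     for a, b, c in zip(snp1, snp2, covariates)]

stat, p_value = interaction_lrt(snp1, snp2, covariates, y)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.3g}")
```

In a real analysis the covariate rows would hold age, BMI, medication use, T2D status, smoking status, sex, and the ten principal components, and a numerically more robust solver (for example a QR-based least-squares routine) would be preferable to the normal equations.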
We generated a list of proxy SNP-SNP models using the SNPs in high LD with the original model SNPs to represent the original model from the discovery set. The purpose of these models was to capture signals in the replication data that may have been missed due to allele frequency differences between the discovery and replication cohorts. The original and proxy models from the discovery analysis were tested in each of the replication cohorts. […]
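One plausible reading of the proxy-model step is sketched below in Python: each SNP in a discovery model may be swapped for any SNP in high LD with it, and every resulting combination other than the original pair becomes a proxy model. The source does not spell out the exact construction, so this is an assumption. The `proxies` mapping stands in for an LD lookup (such as the r² > 0.8 results described above), and all SNP identifiers are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Model = Tuple[str, str]


def proxy_models(model: Model, proxies: Dict[str, Set[str]]) -> List[Model]:
    """All SNP pairs obtained by substituting either or both SNPs with an LD proxy."""
    snp1, snp2 = model
    left = [snp1, *sorted(proxies.get(snp1, ()))]
    right = [snp2, *sorted(proxies.get(snp2, ()))]
    return [(a, b) for a, b in product(left, right)
            if (a, b) != model and a != b]  # skip the original model and self-pairs


# Hypothetical LD proxy lookup.
proxies = {"rsA": {"rsA1", "rsA2"}, "rsB": {"rsB1"}}
original = ("rsA", "rsB")

models_to_replicate = [original, *proxy_models(original, proxies)]
for m in models_to_replicate:
    print(m)
```

With two proxies for the first SNP and one for the second, the original model expands to six candidates for replication testing (the original plus five proxy models).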

Pipeline specifications

Software tools: PLINK, Biofilter
Application: GWAS
Organisms: Homo sapiens
Chemicals: Cholesterol