Computational protocol: Exhaustive Genome-Wide Search for SNP-SNP Interactions Across 10 Human Diseases

Protocol publication

[…] At the level of marginal (individual-SNP) effects, test statistic inflation due to population stratification or other sources was assessed by calculating the genomic inflation factor λ. Values of λ greater than 1.05 were considered evidence of test statistic inflation. Marginal test statistics were derived from genome-wide single-SNP logistic regression analyses performed within all 10 sets of discovery and replication datasets. The cardiac disease (λ = 1.14), dermatophytosis (λ = 1.23), type 2 diabetes (λ = 1.08), dyslipidemia (λ = 1.11), and hypertensive disease (λ = 1.30) data showed evidence of inflation (Table S4). For cardiac disease and dermatophytosis, this inflation was largely removed after adjusting for the first two principal components. For the other conditions, adjustment for the first two components either had no effect (type 2 diabetes) or reduced but did not remove the inflation [dyslipidemia (λ = 1.07) and hypertensive disease (λ = 1.10)]. Adjustment for the full 10 principal components yielded no further improvement in inflation (data not shown). All principal components were those provided with the GERA dataset. Since subjects were genotyped using one of two different kits, we also sought to determine whether the type of kit had any effect on test statistic inflation. In our final analyses (after all exclusions), 1.5% of all subjects were genotyped on the less commonly used kit; when these subjects were excluded, the observed values of λ decreased by 0.02 or less, which we considered negligible (data not shown). As a result, no exclusions based on kit type were made in the final analyses.

Interaction test statistic inflation was visually evaluated using quantile-quantile plots.
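The excerpt does not spell out how λ was computed. A minimal sketch, assuming the conventional median-based definition (median observed 1-df chi-square statistic divided by the median of the null 1-df chi-square distribution), is given here. The function names are hypothetical and are not taken from the authors' pipeline; only the 1.05 threshold comes from the text.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2_1df(p):
    """Convert a two-sided P-value (0 < p <= 1) to a 1-df chi-square statistic."""
    # Work in the lower tail to stay accurate for very small P-values.
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def genomic_inflation_factor(p_values):
    """Median-based lambda (assumed definition, not stated in the source)."""
    stats = [p_to_chi2_1df(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN


def is_inflated(lam, threshold=1.05):
    """Apply the threshold quoted in the text: lambda > 1.05 counts as inflation."""
    return lam > threshold


if __name__ == "__main__":
    # Evenly spaced P-values mimic a well-calibrated null distribution.
    n = 1001
    null_like = [(i + 0.5) / n for i in range(n)]
    print(genomic_inflation_factor(null_like))
    print(is_inflated(1.14), is_inflated(1.02))
```

In practice the input would be the marginal P-values from the single-SNP logistic regressions, computed once without adjustment and once after adjusting for principal components, so the two λ values can be compared. The interaction statistics, by contrast, were assessed visually with quantile-quantile plots.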
To do this, a random selection of approximately 10,000 SNPs was obtained from each condition-specific dataset, and all possible pairs of SNPs (approximately 5 × 10⁷ pairs) were tested for interaction using the FastEpistasis analytical approach (described below). Observed vs. expected-under-the-null −log P-values were plotted, and no evidence of inflation was observed (Figure S1; shown only for the discovery datasets). […] All SNPs were diallelic. The analytical referent allele was the major allele (by allele frequency in the discovery dataset), while the nonreferent allele (which may also be known as the “alternative” or “coded” allele) was the minor allele. To assess marginal effects, logistic regression models were used, with SNPs coded additively (based on the number of copies of the nonreferent allele). Unadjusted and adjusted (for birth year category, age, and the first two principal components) analyses were performed for all condition-specific datasets. For ranking of SNPs by marginal effect P-value, the adjusted analyses in the discovery datasets were used.

To assess interactions on a genome-wide scale, all pairwise interactions among all post-QC SNPs were assessed in each condition-specific discovery dataset using two different approaches: FastEpistasis and BOOST (as implemented in the version of PLINK listed below). To avoid spurious evidence of interaction due to linkage disequilibrium between SNPs, an interaction was excluded if its two SNPs were located within 1 Mb of each other. From these genome-wide analyses, interactions with an interaction P-value < 10⁻⁷ were selected for follow-up. For interactions selected from FastEpistasis analyses, the follow-up consisted of reanalysis of the selected interactions using logistic regression modeling, in both the discovery and replication datasets. Unadjusted and adjusted (for birth year category, age, and the first two principal components) logistic regression analyses were performed.
In these models, SNPs were coded additively, and the nonreferent allele was the minor allele in the discovery dataset. Models included main effects for each SNP and an interaction term for the SNPs; statistical significance was based on an explicit test of the interaction term. For interactions selected from BOOST analyses, the follow-up consisted of analysis of the selected interactions in the replication dataset, also using BOOST. For the final ranking of interactions by P-value, results from the discovery datasets were used: the adjusted logistic regression analyses for interactions selected by FastEpistasis, and the BOOST analyses for interactions selected by BOOST. […]
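The follow-up model described above (additive coding, two main effects, one interaction term, and a test of that term) can be illustrated with a small self-contained sketch. It makes several assumptions not stated in the excerpt: the interaction term is tested with a Wald test, the model is fitted by Newton-Raphson, and the adjustment covariates (birth year category, age, principal components) are left out for brevity. All function names are hypothetical. This is not the PLINK or FastEpistasis implementation.

```python
import math


def additive_code(genotype, minor_allele):
    """Number of copies (0, 1, or 2) of the nonreferent (minor) allele."""
    return sum(1 for allele in genotype if allele == minor_allele)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def _score_and_information(rows, y, beta):
    """Gradient and Fisher information of the logistic log-likelihood."""
    k = len(beta)
    grad = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x, yi in zip(rows, y):
        eta = sum(bj * xj for bj, xj in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        w = p * (1.0 - p)
        for i in range(k):
            grad[i] += (yi - p) * x[i]
            for j in range(k):
                info[i][j] += w * x[i] * x[j]
    return grad, info


def fit_interaction_model(g1, g2, y, max_iter=25, tol=1e-8):
    """Fit logit P(y=1) = b0 + b1*g1 + b2*g2 + b3*g1*g2; Wald-test b3."""
    rows = [[1.0, a, b, a * b] for a, b in zip(g1, g2)]
    beta = [0.0] * 4
    for _ in range(max_iter):
        grad, info = _score_and_information(rows, y, beta)
        step = solve(info, grad)
        beta = [bj + s for bj, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Variance of b3 is the last diagonal entry of the inverse information.
    _, info = _score_and_information(rows, y, beta)
    var_b3 = solve(info, [0.0, 0.0, 0.0, 1.0])[3]
    z = beta[3] / math.sqrt(var_b3)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return {"beta_interaction": beta[3], "z": z, "p_value": p_value}
```

Here `g1` and `g2` are per-subject minor-allele counts (for example from `additive_code("AG", "G")`) and `y` is the 0/1 case status. A likelihood-ratio test of the interaction term would be an equally reasonable reading of "explicit test"; the Wald form is used only because it keeps the sketch short.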

Pipeline specifications

Software tools: FastEpistasis, PLINK
Application: GWAS
Organisms: Homo sapiens