Computational protocol: Causal associations between risk factors and common diseases inferred from GWAS summary data

Protocol publication

[…] We applied the methods to test for causal associations between seven health risk factors and common diseases using data from multiple large studies. The risk factors are BMI, waist-to-hip ratio adjusted for BMI (WHRadjBMI), HDL cholesterol (HDL-c), LDL-c, triglycerides (TG), systolic blood pressure (SBP), and diastolic blood pressure (DBP). We chose these risk factors because of the availability of summary-level GWAS data from large samples (n = 108,039–322,154) (Supplementary Table ). We accessed data for BMI, WHRadjBMI, HDL-c, LDL-c and TG from published GWAS, and data for SBP and DBP from the subgroup of UK Biobank (UKB) with genotyped data released in 2015. We selected SNPs at a genome-wide significance level (P_GWAS < 5 × 10⁻⁸) using the clumping algorithm (r² threshold = 0.05 and window size = 1 Mb) implemented in PLINK (Methods). Note that the GSMR method accounts for the remaining LD not removed by the clumping analysis. After clumping, there were m = 84, 43, 159, 141, 101, 28, and 29 SNPs for BMI, WHRadjBMI, HDL-c, LDL-c, TG, SBP and DBP, respectively. These SNP instruments are nearly independent, as demonstrated by the distribution of LD scores computed from the instruments for each trait (Supplementary Fig. ). We included only the near-independent SNPs in the analysis so that the results from GSMR could be compared directly with those from other methods that do not account for LD (e.g., MR-Egger). Our simulation results suggest that the gain in power from including SNPs in LD is limited (Supplementary Fig. ). Moreover, although the GSMR approach accounts for LD, including many SNPs in moderate to high LD often results in the V matrix being non-invertible (Methods).

The summary-level GWAS data for the diseases were computed from two independent community-based studies with individual-level SNP genotypes, i.e., the Genetic Epidemiology Research on Adult Health and Aging (GERA) (n = 53,991) and the subgroup of UKB (n = 108,039).
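The instrument selection described above (a significance threshold followed by LD clumping) can be illustrated with a minimal sketch. This is a simplified greedy procedure for a single chromosome, not PLINK's implementation; the function name `clump`, its arguments, and the toy values in the usage note are hypothetical and chosen only for illustration.

```python
import numpy as np

def clump(pvals, positions, r2, p_threshold=5e-8, r2_threshold=0.05,
          window_bp=1_000_000):
    """Greedy LD clumping sketch for SNPs on one chromosome.

    pvals:     GWAS P-values, one per SNP
    positions: base-pair positions, one per SNP
    r2:        square matrix of pairwise LD r^2 values (assumed precomputed)
    Returns the indices of the retained index SNPs, most significant first.
    """
    pvals = np.asarray(pvals, dtype=float)
    positions = np.asarray(positions)
    r2 = np.asarray(r2, dtype=float)

    # Only genome-wide significant SNPs are candidates, visited from the
    # smallest P-value to the largest.
    order = [int(i) for i in np.argsort(pvals, kind="stable")
             if pvals[i] < p_threshold]

    kept = []
    for i in order:
        # Drop SNP i if it lies within the window of, and is in LD with,
        # a more significant SNP that has already been retained.
        in_ld = any(
            abs(int(positions[i]) - int(positions[k])) <= window_bp
            and r2[i, k] > r2_threshold
            for k in kept
        )
        if not in_ld:
            kept.append(i)
    return kept
```

With four toy SNPs in which the second is in strong LD with the first and the fourth is not genome-wide significant, `clump([1e-20, 1e-10, 1e-9, 1e-3], [100_000, 150_000, 5_000_000, 200_000], r2)` retains the first and third SNPs when `r2` has 0.8 between the first two SNPs and 0 elsewhere off the diagonal.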
We included in the analysis 22 common diseases as defined in the GERA data, and added a phenotype related to comorbidity by counting the number of diseases affecting each individual (i.e., disease count) as a crude index of the general health status of an individual (Supplementary Table ). We performed genome-wide association analyses of the 23 disease phenotypes in GERA and UKB separately (Methods). We assessed the genetic heterogeneity of each disease between the two cohorts by a genetic correlation (r_g) analysis using the bivariate LD score regression (LDSC) approach. The estimates of r_g across all diseases varied from 0.75 to 0.99 with a mean of 0.91 (Supplementary Table ), suggesting strong genetic overlap for the diseases between the two cohorts. We therefore meta-analyzed the data of the two cohorts using the inverse-variance meta-analysis approach to maximize power. Because the odds ratio (OR) is free of ascertainment bias in a case–control study, the effect size (logOR) of a SNP on disease in the general population can be approximated by that from a case–control study, assuming that the disease is defined similarly in the case–control study and in the general population. Therefore, GSMR can be applied to data with SNP effects on the risk factor from a population-based study and SNP effects on the disease from an ascertained case–control study, and the estimated causative effect of risk factor on disease should be interpreted as that in the general population. On this basis, we also included in the analysis summary data for 11 diseases from published case–control studies (n = 18,759–184,305) (Supplementary Table ).
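The inverse-variance meta-analysis of the two cohorts mentioned above can be sketched for a single SNP as follows. This is the standard fixed-effect formula, shown under the assumption that each cohort supplies a logOR estimate and its standard error; the function name `ivw_meta` and the numbers in the usage note are hypothetical.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of one SNP.

    betas: per-cohort effect estimates (e.g. logOR)
    ses:   per-cohort standard errors
    Returns (combined effect, combined standard error, z-statistic).
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se
```

For two cohorts with illustrative estimates 0.10 and 0.20 and a common standard error of 0.05, `ivw_meta([0.10, 0.20], [0.05, 0.05])` gives a combined effect of 0.15 with a standard error of 0.05/√2, smaller than either cohort's own standard error.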
The estimated SNP effects and standard errors (SE) for age-related macular degeneration (AMD) were not available in the summary data, so we estimated them from z-statistics using an approximate approach (Supplementary Note ).

We applied the HEIDI-outlier approach to remove SNPs that showed pleiotropic effects on both risk factor and disease and therefore deviated significantly from a causal model (Methods). The LD correlations between pairwise SNPs were estimated from the Atherosclerosis Risk in Communities (ARIC) data (n = 7,703 unrelated individuals) imputed to 1000 Genomes (1000G). Using the large data sets described above, we identified from GSMR analyses 45 significant causative associations between risk factors and diseases (Supplementary Data ; Fig. ). We controlled the family-wise error rate (FWER) at 0.05 by Bonferroni correction for 231 tests (P_GSMR threshold = 2.2 × 10⁻⁴). For method comparison, we also performed the analyses with MR-Egger and the methods in Pickrell et al. (Supplementary Data ).

Fig. 2 […]
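As a worked check of the multiple-testing threshold quoted above, the Bonferroni correction simply divides the target FWER by the number of tests. The helper names below are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that controls the FWER at alpha."""
    return alpha / n_tests

def is_significant(p_value, alpha=0.05, n_tests=231):
    """True if a single test passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)

threshold = bonferroni_threshold(0.05, 231)
print(f"{threshold:.1e}")  # 2.2e-04
```

The result, 0.05 / 231 ≈ 2.16 × 10⁻⁴, rounds to the 2.2 × 10⁻⁴ threshold reported in the text.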

Pipeline specifications

Software tools: PLINK, LDSC
Databases: UK Biobank pGWAS
Application: GWAS
Diseases: Alzheimer Disease; Diabetes Mellitus, Type 2
Chemicals: Cholesterol