Computational protocol: A Comprehensive Analysis of Shared Loci between Systemic Lupus Erythematosus (SLE) and Sixteen Autoimmune Diseases Reveals Limited Genetic Overlap

Protocol publication

[…] Genotypes from all subjects were imputed with IMPUTE version 0.5 for SNPs that were not genotyped or were poorly genotyped. Imputation used high-quality genotype data from the corresponding study (SLEGEN, MN and UCSF) and phased HapMap Phase II (NCBI B35 assembly) genotype data from 60 CEU HapMap founders. We used SNPs that met the following quality criteria: 1) no statistically significant difference in the proportion of missing genotype data between cases and controls (i.e., P>0.05); 2) overall <10% missing genotype data; 3) Hardy-Weinberg Expectations (HWE) P>0.01 in controls and P>0.0001 in cases; and 4) minor allele frequencies (MAFs) of controls within a 95% or 99.99% confidence interval of ethnicity-matched HapMap MAFs, for genotyped and imputed SNPs, respectively. Retained SNPs had an estimated MAF>0.01 in the control samples, an information score >0.50 and a confidence score >0.90. Imputed SNPs were analyzed using SNPTEST with probabilistic genotypes.

We combined the genotyped and imputed data from the three cohorts described above and performed both a joint analysis and a meta-analysis. The results tables report which SNPs and cohorts were imputed vs. directly genotyped. We used SNPs that met the same quality criteria as described above. To account for potential population stratification, we computed Principal Components (PCs) and adjusted these analyses for four PCs, as described. The genome-wide inflation factor in the joint analysis was λ = 1.15. We include the joint analysis of these loci, after applying quality control to each individual cohort, because the joint analysis can provide increased power for some genetic models at more modest allele frequencies (e.g., the recessive model). Of the 446 autoimmune SNPs on our list, 424 unique SNPs were genotyped or imputed in our SLE cohorts.
Of these, 237 (55.9%) met our QC thresholds, while 187 (44.1%) failed as follows: 6 (1.4%) had 10–20% missing genotype data, 1 (0.2%) had MAF<0.01 in controls, 3 (0.7%) failed Hardy-Weinberg Equilibrium thresholds, 91 (21.5%) had >20% missing genotype data and/or significant differences in missingness between cases and controls, and 86 (20.3%) did not meet imputation QC thresholds. We report uncorrected P-values, though we also corrected for multiple comparisons using a False Discovery Rate (FDR) procedure applied to the 237 SNPs that passed QC. Our multiple comparison strategy therefore consisted of selecting only those variants that met FDR significance, that is, an FDR-adjusted P-value<0.05. Although we computed the FDR-adjusted P-value from the smallest P-value (under the additive, dominant or recessive model), this smallest P-value is virtually always within one order of magnitude of the additive P-value, so the result is comparable to computing the FDR for P-values under a single model.

We performed a weighted Z-score meta-analysis as implemented in METAL (www.sph.umich.edu/csg/abecasis/metal), with weights equal to the square root of the sample size of each dataset; thus, the meta-analysis incorporates direction, magnitude of association and sample size. We report the minimum P-value from hypothesis tests considering additive, dominant and recessive modes of inheritance. However, because these tests can be affected by low genotype counts, we required at least 30 homozygotes for the minor allele to consider the recessive model and 15 to consider the additive model; otherwise, the results under the dominant model are reported. All genetic models were defined relative to the minor allele. Associations with SLE susceptibility were considered statistically significant if they met an FDR-adjusted threshold of P<0.05.

We used Quanto (http://hydra.usc.edu/gxe/) to calculate the power of our sample size.
We assumed an additive genetic model, a population risk of 0.1%, and α = 0.001.

To examine the global similarity between ADs based on their reported risk loci (defined based on LD, as described above), we performed a hierarchical clustering analysis of ADs with at least 10 reported loci, with each locus coded as a binary (yes/no) indicator. ADs with fewer than 10 reported loci were excluded because their lower count of reported loci may reflect a less intensive assessment of genetic risk factors (i.e., fewer genome-wide investigations, often with smaller sample sizes). We restricted the analysis to associations reported from populations of European ancestry, and excluded those reported for MS severity or age of onset. So as not to inform the clustering of ADs based on the presence of joint analyses, we excluded associations from studies of pooled phenotypes, including IBD, RA with CelD, CD with CelD, and CD with SA. This produced a final dataset of 330 loci reported across nine ADs. We also included loci reported in the GWAS Catalog for four control diseases (height, breast cancer, coronary heart disease, and bipolar disorder), similarly using LD to define specific genomic loci. We computed the dissimilarity between ADs and the control diseases using a distance metric appropriate for binary data, and performed hierarchical clustering using the hclust function in the R statistical programming language. We evaluated the uncertainty in the clustering analyses using a multiscale bootstrap resampling approach implemented in the pvclust package for R. […]
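
The weighted Z-score meta-analysis described in the protocol can be illustrated with a short sketch. It follows the scheme stated in the text (signed Z-scores combined with weights equal to the square root of each dataset's sample size), but it is not METAL's code. The function name, the input layout, and the example P-values, effect directions and sample sizes are all hypothetical and are not values from the study.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def weighted_z_meta(studies):
    """Combine per-study results with sqrt(sample size) weights.

    studies: iterable of (two_sided_p, effect_sign, sample_size), where
    effect_sign is +1 or -1 relative to a common reference allele.
    Returns (combined Z-score, two-sided combined P-value).
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p, sign, n in studies:
        # Convert the two-sided P-value to a signed Z-score.
        z = -_STD_NORMAL.inv_cdf(p / 2.0) * sign
        w = sqrt(n)
        numerator += w * z
        sum_sq_weights += w * w
    z_meta = numerator / sqrt(sum_sq_weights)
    p_meta = 2.0 * _STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta


# Hypothetical inputs for three cohorts (not results from the study).
example = [(0.01, +1, 1500), (0.20, +1, 800), (0.05, -1, 600)]
z_combined, p_combined = weighted_z_meta(example)
print(f"Z = {z_combined:.3f}, P = {p_combined:.4f}")
```

Because each Z-score carries the sign of the effect, cohorts that disagree in direction partially cancel, while larger cohorts contribute more through their weights.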
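
The clustering step described above can be sketched on toy data. The protocol itself used hclust and pvclust in R and does not name the exact distance metric or linkage method, so the Jaccard distance and average linkage used here are assumptions chosen for illustration. The disease and locus labels are placeholders, and the bootstrap assessment performed by pvclust is not reproduced.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_distance(c1, c2, profiles):
    """Average linkage: mean pairwise distance between members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    total = sum(jaccard_distance(profiles[x], profiles[y]) for x, y in pairs)
    return total / len(pairs)


def average_linkage(profiles):
    """Naive agglomerative clustering of binary locus profiles.

    profiles: dict mapping a trait name to the set of loci reported for it.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [frozenset([name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], profiles),
        )
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2, profiles)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical toy profiles (placeholder labels, not loci from the study).
toy_profiles = {
    "disease_A": {"locus1", "locus2", "locus3"},
    "disease_B": {"locus2", "locus3", "locus4"},
    "control_trait": {"locus9"},
}
merges = average_linkage(toy_profiles)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

In this toy example the two diseases that share loci merge first, and the control trait, which shares no loci with either, joins last at the maximum distance.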

Pipeline specifications

Software tools: IMPUTE, SNPTEST, Hclust, Pvclust
Databases: GWAS Catalog
Application: GWAS
Diseases: Arthritis, Rheumatoid; Autoimmune Diseases; Colitis, Ulcerative; Crohn Disease; Diabetes Mellitus; Lupus Erythematosus, Systemic