Computational protocol: Risk Alleles for Systemic Lupus Erythematosus in a Large Case-Control Collection and Associations with Clinical Subphenotypes

Protocol publication

[…] Prior to merging, individual datasets were filtered to remove individuals with <90% genotyping and SNPs with <90% genotyping, minor allele frequency (MAF) <1%, or HWE p-value <0.00001. SNPs in the 550K but not the 317K platform were imputed in the parent study 2 (SLEGEN) dataset using IMPUTE, retaining SNPs with >90% confidence, >90% concordance in two test datasets (500 cases and 500 controls from parent study 1 with known genotypes removed), and >90% imputed genotype rate. In the final merged dataset of genotyped and imputed SNPs, SNPs were again filtered for >90% genotyping (using typed or imputed values). From this dataset, SLE risk SNPs or their proxies were obtained. Out of 22 loci selected for inclusion based on p<5×10⁻⁸ in a previous study, 16 were directly genotyped in all of our subjects. Three SNPs were imputed in the SLEGEN dataset, and a proxy SNP (r²>0.8) was found for 6 SNPs using the HapMap (http://www.hapmap.org) CEU population (with one overlap, a proxy SNP imputed in the SLEGEN dataset). Imputed and proxy SNPs are shown in .

Principal components analysis using EIGENSTRAT was performed on the above merged dataset of directly genotyped SNPs, restricted to SNPs having at least 90% genotyping (thus on both the 317K and 550K platforms). SNPs in regions of known high LD (chr 5: 44–51.5 Mb, chr 6: 25–33.5 Mb, chr 8: 8–12 Mb, chr 11: 45–57 Mb, and chr 17: 40–43 Mb) were removed prior to analysis. Individuals with values more than 6 standard deviations away from the mean of any of the first 10 PCs (n = 21) were considered genetic outliers and were removed. Four PCs were used for ancestry adjustment, based on leveling off of the PCA scree plot and on significant differences between cases and controls for the first 4 PCs.

The GRS was defined as the sum over loci of the number of risk alleles at each locus multiplied by the OR for SLE susceptibility in our dataset. For example, two STAT4 risk alleles contribute 2*1.5 = 3 to the GRS.
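The weighted score defined above can be illustrated with a minimal Python sketch. This is not the authors' code (the protocol used Stata, PLINK and R); the function name `genetic_risk_score` and the dictionary layout are assumptions made for illustration, and only the STAT4 odds ratio of 1.5 comes from the text.

```python
from typing import Mapping


def genetic_risk_score(
    risk_allele_counts: Mapping[str, int],
    odds_ratios: Mapping[str, float],
) -> float:
    """Sum, over loci, of (number of risk alleles) * (odds ratio).

    Hypothetical helper: both mappings are keyed by locus name, and
    each count is the number of risk alleles (0, 1 or 2) a subject
    carries at that locus.
    """
    total = 0.0
    for locus, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{locus}: risk allele count must be 0, 1 or 2")
        total += count * odds_ratios[locus]
    return total


# The worked example from the text: two STAT4 risk alleles with OR 1.5.
stat4_contribution = genetic_risk_score({"STAT4": 2}, {"STAT4": 1.5})
print(stat4_contribution)
```

A subject's full GRS would pass all 22 loci at once; the single-locus call simply isolates the STAT4 contribution described in the example.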
For a protective SNP, the risk alleles are the major alleles. Since not counting sporadic missing data would underestimate the number of risk alleles, the GRS, and the sub-GRS, we used best-guess imputed missing genotypes (using IMPUTE version 2) for these calculations. The GRS was analyzed both continuously and by comparing the highest and lowest tertiles to aid in interpretation, with comparison of tertiles being a compromise between more extreme tails of the distribution (having less power) and dichotomizing (having less differentiation).

Subphenotypes and covariates studied are shown in . In each study, subphenotype status was confirmed by chart review. Autoantibody status was determined by chart review and/or serologic testing; subjects were considered autoantibody positive if any positive test was indicated in the reviewed medical records or serologic tests. Negative status required that at least one negative test and no positive tests be documented. Positive anti-dsDNA status is a subset of the immunologic criteria; other qualifiers are anti-Sm antibodies or the presence of anti-phospholipid antibodies. Where appropriate, e.g. for logistic regression and bar graphs, the age at diagnosis was dichotomized into high-low halves or split into tertiles. For regression, the covariates in addition to the ancestry principal components described above were sex, disease duration, and study (two parent studies or eight sources, see ). All subphenotypes were heterogeneous by study source (data not shown).

We first looked at the adjusted association between each outcome and the continuous GRS (). As we have a high percentage of missing data for disease duration (18.5%, see ), adjustment was done two ways: a) using only the subset of subjects having disease duration, and b) using multiple imputation of the missing disease duration values. Multiple imputation was performed using Stata ICE with predictive matching.
The covariates age at diagnosis, study source, and sex were used in the imputation. Differences in results between these methods were very slight for subphenotypes associated with the GRS. We used actual data without imputation in subsequent GRS analyses. For the sub-GRS computations (below), we used single imputation based on the same variables as above.

In subphenotype associations, the SLE GRS may have less power than a risk score which utilizes the SNPs and effect sizes appropriate for that subphenotype. Thus we also tested a subphenotype-specific sub-GRS for each subphenotype, defined via a discovery-replication approach. First, for each subphenotype we used the associations in parent study 1 (the “discovery” study for this analysis) to determine the rank and OR of each risk SNP association with the subphenotype. Then a series of 22 candidate sub-GRS(n) scores was computed by incrementally adding in the OR weights in rank order, where n is the number of SNPs included. (The first candidate, sub-GRS(1), is equal to the top SNP weights; the second candidate, sub-GRS(2), adds in the second SNP weights; and so on.) The associations in the discovery set for the resulting sub-GRS(n) candidates are shown in ; p-values are for the likelihood ratio test of differences between models with the sub-GRS(n) plus covariates versus a model with only covariates. This method can accumulate random associations as well, as illustrated for comparison purposes by sample “null” subphenotypes with 50–50 random associations (highest and lowest associations out of ten samples are shown); hence the importance of a discovery-replication approach. Finally, the peak-association sub-GRS(n) candidate with the minimum number of SNPs across the replication and discovery sets was used as the final sub-GRS for each subphenotype; this assumes that post-peak SNPs in either set are likely to be false positive associations.

Stata 9.2 was used for regressions and ROC curve analyses.
PLINK was used for quality control filters, regressions and tests for 2×2 interactions. HelixTree SVS Version 7.2.3 (www.goldenhelix.com) was used for likelihood-ratio tests of logistic regressions of the sub-GRS(n) series. The R programming environment Version 2.11.1 was used for GRS density curves. […]
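The incremental sub-GRS(n) construction described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' implementation: the function names, the input layout (genotype counts and OR weights already sorted by discovery-set rank), and the tie-breaking rule are assumptions. In particular, taking the smaller of the two peak positions is one interpretation of "the peak association candidate with the minimum number of SNPs", and the p-values are assumed to come from likelihood-ratio tests run elsewhere.

```python
from typing import List, Sequence


def candidate_sub_grs(
    counts_ranked: Sequence[int],
    ors_ranked: Sequence[float],
) -> List[float]:
    """Return [sub-GRS(1), ..., sub-GRS(N)] for one subject.

    Both inputs are assumed to be ordered by discovery-set rank,
    top-ranked SNP first. sub-GRS(n) is the running sum of
    (risk allele count * OR weight) over the first n SNPs.
    """
    if len(counts_ranked) != len(ors_ranked):
        raise ValueError("counts and weights must have the same length")
    scores: List[float] = []
    running = 0.0
    for count, odds_ratio in zip(counts_ranked, ors_ranked):
        running += count * odds_ratio
        scores.append(running)
    return scores


def peak_n(p_values: Sequence[float]) -> int:
    """1-based n with the smallest p-value; ties go to the smaller n."""
    return min(range(len(p_values)), key=lambda i: p_values[i]) + 1


def final_n(discovery_p: Sequence[float], replication_p: Sequence[float]) -> int:
    """Smaller of the two peak positions (assumed reading of the text)."""
    return min(peak_n(discovery_p), peak_n(replication_p))


# Made-up three-SNP illustration; the study used 22 candidate scores.
series = candidate_sub_grs([2, 1, 0], [1.5, 1.25, 2.0])
chosen = final_n([0.05, 0.001, 0.01], [0.02, 0.03, 0.004])
print(series, chosen)
```

Truncating at the earlier peak reflects the stated assumption that SNPs added after the peak in either set are likely false positive associations.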

Pipeline specifications

Software tools: IMPUTE, PLINK, HelixTree
Application: GWAS
Diseases: Arthritis, Hematologic Diseases, Kidney Diseases, Lupus Erythematosus, Systemic