Computational protocol: Fourteen sequence variants that associate with multiple sclerosis discovered by meta-analysis informed by genetic correlations

Protocol publication

[…] Icelandic samples were genotyped on Illumina HumanHap300, HumanCNV370, HumanHap610, HumanHap1M, HumanHap660, Omni-1, Omni2.5 or Omni Express bead chips at deCODE Genetics. Prior to imputation, samples with a call rate <97% were excluded, as were all SNPs with genotyping yield <95% or MAF <1%. Some samples were genotyped on more than one chip; in those cases, all SNPs with a substantial difference in call rate between chip types were excluded. Further, all SNPs showing P < 0.001 for deviation from Hardy–Weinberg equilibrium or an inheritance error rate >0.1% were removed. Subjects were long-range phased, and imputation into both chip-typed individuals and their close relatives was based on a panel of 8453 whole-genome-sequenced Icelanders. This process has been described in greater detail elsewhere. Briefly, regions of identity by descent are identified and used to phase haplotypes with great certainty. Using genealogy information, it is possible to deduce haplotypes for individuals who have not been genotyped, provided some of their relatives have been genotyped. Association testing was performed using logistic regression, adjusting for age and county of birth.

Genotyping of the Swedish cohort was carried out at deCODE using Illumina Omni chips. Phasing was performed using SHAPEIT2, and imputation was carried out using IMPUTE2, based on the 1000 Genomes phase I integrated haplotypes generated using SHAPEIT2. Prior to imputation, SNPs with yield <95%, Hardy–Weinberg equilibrium P-values <1 × 10⁻⁵, or either A/T or G/C allele combinations were removed. Samples with <95% genotyping yield were excluded, as were samples showing evidence of non-European ancestry and one of each pair of duplicate samples. Ancestry was assessed with Structure, using European (CEU), Chinese and Japanese (CHB + JPT) and Nigerian (YRI) individuals from the HapMap project as reference samples.
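The Swedish pre-imputation SNP filters described above (yield, Hardy–Weinberg P-value, and removal of A/T and G/C allele pairs) can be illustrated with a minimal sketch. The `Variant` record, its field names and the example SNP identifiers are hypothetical and chosen only for illustration; the protocol does not describe the actual scripts or file formats used.

```python
from dataclasses import dataclass

# Thresholds as stated for the Swedish cohort in the text above.
MIN_YIELD = 0.95   # SNPs with yield <95% are removed
MIN_HWE_P = 1e-5   # SNPs with HWE P-value <1e-5 are removed

# Allele pairs that read the same on both strands, so strand cannot be inferred.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Variant:
    """Hypothetical per-SNP summary record (not a real file format)."""
    rsid: str
    allele1: str
    allele2: str
    call_yield: float  # fraction of samples with a genotype call
    hwe_p: float       # P-value for deviation from Hardy-Weinberg equilibrium


def is_strand_ambiguous(v: Variant) -> bool:
    """True for A/T and G/C SNPs."""
    return frozenset((v.allele1.upper(), v.allele2.upper())) in STRAND_AMBIGUOUS


def passes_qc(v: Variant) -> bool:
    """Keep a SNP only if it passes all three filters."""
    return (
        v.call_yield >= MIN_YIELD
        and v.hwe_p >= MIN_HWE_P
        and not is_strand_ambiguous(v)
    )


# Made-up records, one failing each filter and one passing all of them.
variants = [
    Variant("snp1", "A", "G", 0.99, 0.50),   # passes
    Variant("snp2", "A", "T", 0.99, 0.50),   # strand-ambiguous
    Variant("snp3", "C", "T", 0.90, 0.50),   # low yield
    Variant("snp4", "G", "T", 0.99, 1e-7),   # fails HWE
]
kept = [v.rsid for v in variants if passes_qc(v)]
print(kept)
```

The Icelandic filters follow the same pattern with different thresholds (HWE P < 0.001, MAF < 1%), so the constants would need to be adjusted per cohort.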
Association analysis was carried out using SNPTEST2 with 20 principal components included as covariates. Principal components were calculated using EIGENSOFT.

Genotyping of the Norwegian controls was carried out at deCODE using the Omni series of Illumina bead chips, whereas Norwegian cases were genotyped on the Human660-Quad chip at the Sanger Institute in a collaboration with the International MS Genetics Consortium and the Wellcome Trust Case Control Consortium. Samples were phased and imputed together based on the SNPs found on both chip types, using the same methods and quality control as for the Swedish cohort. Association analysis was carried out using SNPTEST2 with ten principal components included as covariates.

[...] We carried out an inverse-variance weighted meta-analysis under the assumption of fixed effects using the METAL software in two steps. First, we combined publicly available summary statistics from the largest study of MS to date, referred to as the IMSGC study, with summary statistics from our three Nordic cohorts. This resulted in combined statistics for 117,990 SNPs that survived quality control in the IMSGC data and in two or more of the Nordic cohorts. In the second step, we added to the Swedish cohort the 1670 cases and 1534 controls that had been excluded from the first analysis because of overlap with the IMSGC study. Imputation and principal component calculations were repeated for the Swedish cohort after adding these samples. This resulted in combined summary statistics for 6,694,339 SNPs that survived quality control in all three Nordic cohorts.

Conditional analysis of the IMSGC data was performed using GCTA, with the genotypes of 6500 randomly selected Icelanders as LD reference. For some of the candidate markers, for example rs175126, the adjusted P-value was reported in a supplementary table of the IMSGC paper.
Where this information was available, we used the P-value provided by the IMSGC, as it more accurately reflects the LD structure of all the study cohorts. Locus plots were generated using LocusZoom, displaying only variants surviving quality control in all cohorts (all Nordic cohorts only for rs1801133) and unconditioned P-values.

[...] PRSs were calculated based on the summary statistics of the training sets previously listed (Supplementary Table), excluding the extended MHC region (25–35 Mb of chromosome 6, build hg38) to ensure that no variants in LD with the MHC region were included in the score. Markers found in the training data were matched with a set of in-house SNPs, and only autosomal, biallelic SNPs with MAF > 1% and info > 0.9 in Iceland were included. We furthermore excluded AT/GC SNPs to avoid strand-matching issues.

As variants within the MHC region show very strong association with all the diseases studied here, the exclusion of the MHC might be a source of controversy. However, its exclusion is critical to the study. PRSs can be used to establish biological pleiotropy by testing a score composed of a set of genetic variants contributing to the risk of a given trait for association with another trait. In this way, genetic overlap between traits can be detected, even in the absence of significantly associating signals. An underlying assumption is that the effect of a variant represents the effect of a single biological process common to both traits, and variants are pruned so that only the variant showing the strongest evidence of association within an LD block is retained. However, due to extensive LD within the MHC region, the effect of a variant within that region is likely to be composed of the combined effects of several different genes on the disease. Some of these genes might contribute to both diseases while others will not.
Excluding the MHC is therefore critical for avoiding the detection of spurious pleiotropy.

PLINK 1.9 was used to prune SNPs in a sliding window of 500 kb, retaining the SNP that showed the strongest evidence of association with the phenotype in the training data and removing SNPs having r² > 0.1 with that SNP. A set of 960 whole-genome-sequenced Icelanders, unrelated at six meioses, served as LD reference. We calculated a polygenic score for each individual, j, in the target data at ten different P-value inclusion thresholds using the formula

$$\mathrm{PRS}_j = \sum_{i \in S} \beta_i \times G_{ij} \tag{1}$$

where $S$ is the set of SNPs retained after pruning that have P-values below the inclusion threshold, $\beta_i$ is the effect of SNP $i$ and $G_{ij}$ is the sum of the probabilities of the effect allele of SNP $i$ being found on either of individual $j$'s chromosomes. The pipeline used for calculating the PRS shares many features with the PRSice software. Each PRS, except those calculated for JIA and Cel, was tested for association with its corresponding disease in Iceland using generalized additive regression with smoothed age, sex and the first five principal components as covariates. The best P-value inclusion threshold was identified for each disease, and the score at this threshold was calibrated so that a unit increase in the score represented a doubling in risk of its corresponding phenotype. This can be written as follows:

$$\mathrm{PRS}' = \mathrm{PRS} \times \frac{\beta_{\mathrm{PRS}}}{\log 2} \tag{2}$$

where PRS and PRS′ are the uncalibrated and calibrated polygenic risk scores, respectively, and $\beta_{\mathrm{PRS}}$ represents the log odds of the disease corresponding to the score in a logistic regression.

The calibrated score was then tested for association with disease status in each of the other target cohorts, using the same model as described above. Models were compared against null models that consisted of the covariates only, and results were considered significant if P < 5.0 × 10⁻⁴.

For JIA and Cel, PRSs were normalized to have a mean of 0 and a standard deviation of 1 in our sample of 150,656 Icelanders.
As the most predictive threshold could not be determined, we arbitrarily selected the P-value inclusion threshold 0.001 and tested those scores for association with disease status in the same manner as described above. The uncalibrated PRSs were always normally distributed in the population and remained normally distributed after calibration; the use of these models is therefore justified. The ratio of the variance in PRS between cases and controls never exceeded 1.13 for case lists with >500 cases and never exceeded 1.43 for case lists with <500 cases.

Population stratification was estimated by randomly selecting 10,000 variants with minor allele frequency >5% from across the genome and testing them for association with disease in each of the target cohorts. We calculated the genomic inflation factor λ for the target phenotype in Iceland and adjusted the P-values of the PRS-disease associations accordingly. Nagelkerke's pseudo-R² was used as a measure of the variance explained.

We tested the PRSs for MS, PBC and T1D for association with the MSSS using generalized additive regression with smoothed year of birth, gender and the first 20 principal components as covariates. The PRSs were standardized so that a unit increase corresponded to a doubling in risk of the respective disease in Iceland. MSSS data were available for 5173 Swedish cases, 1466 of whom overlapped with the IMSGC study. When testing the PBC-PRS and T1D-PRS for association with the MSSS, all subjects were included, but the overlapping samples were excluded before testing the association of the MS-PRS with the MSSS. […]
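The polygenic score and its calibration, as described in this section, can be sketched in a few lines. This is only an illustration of the two formulas under stated assumptions: the dictionary layout, function names, SNP identifiers and numbers are hypothetical, the SNPs are assumed to be LD-pruned already, and the logarithm in the calibration is taken to be the natural logarithm because the coefficient comes from a logistic regression.

```python
import math


def polygenic_score(dosages, betas, pvalues, threshold):
    """Sum of beta_i * G_ij over SNPs whose training P-value is below threshold.

    dosages: expected effect-allele count (0 to 2) per SNP for one individual
    betas:   per-SNP effect estimates from the training summary statistics
    pvalues: per-SNP training P-values
    All three are dicts keyed by SNP id (a hypothetical layout).
    """
    return sum(
        betas[snp] * dosages[snp]
        for snp in betas
        if pvalues[snp] < threshold and snp in dosages
    )


def calibrate(score, beta_prs):
    """Rescale a score so that one unit corresponds to a doubling of the odds.

    beta_prs is the log odds ratio per unit of the uncalibrated score,
    as estimated in a logistic regression of disease on the score.
    """
    return score * beta_prs / math.log(2)


# Made-up training statistics for three pruned SNPs and one individual.
betas = {"snpA": 0.10, "snpB": -0.05, "snpC": 0.30}
pvalues = {"snpA": 1e-6, "snpB": 0.01, "snpC": 0.20}
dosages = {"snpA": 2.0, "snpB": 1.0, "snpC": 0.5}

raw = polygenic_score(dosages, betas, pvalues, threshold=0.05)  # snpC is excluded
calibrated = calibrate(raw, beta_prs=2 * math.log(2))
print(raw, calibrated)
```

In practice the score would be computed at each of the ten inclusion thresholds, and `beta_prs` would come from the regression at the best-performing threshold.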

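The inverse-variance weighted fixed-effect meta-analysis described earlier in this protocol was run with METAL. The sketch below shows only the standard textbook formula for a single SNP, not METAL's code; the function name and the example effect sizes are hypothetical.

```python
import math


def ivw_fixed_effect(betas, ses):
    """Combine per-cohort effect estimates for one SNP.

    betas: per-cohort effect estimates (e.g. log odds ratios)
    ses:   matching standard errors
    Returns the pooled effect, its standard error, the Z score and a
    two-sided P-value from the normal distribution.
    """
    weights = [1.0 / se ** 2 for se in ses]          # inverse-variance weights
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p


# Made-up estimates for one SNP in two cohorts with equal precision.
beta, se, z, p = ivw_fixed_effect([0.2, 0.4], [0.1, 0.1])
print(beta, se, z, p)
```

Cohorts with smaller standard errors receive proportionally more weight, which is why a large cohort dominates the pooled estimate under the fixed-effect assumption.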
Pipeline specifications

Software tools: SHAPEIT, IMPUTE, SNPTEST, EIGENSOFT, GCTA, LocusZoom, PLINK, PRSice
Applications: Population genetic analysis, GWAS
Diseases: Autoimmune Diseases, Brain Diseases, Liver Cirrhosis, Biliary, Multiple Sclerosis
Chemicals: Methionine