Computational protocol: Genomic prediction of celiac disease targeting HLA-positive individuals

Protocol publication

[…] While SNP arrays produce only unphased alleles, statistical methods can exploit patterns of linkage disequilibrium to impute haplotypes (phased alleles) and other forms of genetic variation, based on a reference panel in which both the SNPs and the variation to be imputed are known (a training dataset). This procedure is conceptually the same as that used when imputing, for example, millions of 1000 Genomes genotypes from a smaller set of hundreds of thousands of assayed genotypes.
As the majority of CD-related variation is found within the HLA, we imputed HLA variation, employing two complementary methods. First, we imputed HLA-DQA1 and HLA-DQB1 alleles from the SNPs using the R package HIBAG 1.2.3 [], with the European hg18 HLA4 reference dataset. Based on the imputed HLA alleles, we inferred each individual’s heterodimer type as one of DQ2.5-heterozygous, DQ2.5-homozygous, DQ2.2, or DQ8, according to the mapping in []. Following [], the HLA risk score was assigned as low for individuals who did not carry any of the CD risk heterodimers (DQ2.2, DQ8, DQ2.5-heterozygous, and DQ2.5-homozygous). High risk was assigned to individuals who were DQ2.5-homozygous or who carried both DQ2.5-heterozygous and DQ2.2. Medium risk was assigned to all remaining individuals. The HLA risk profiles were coded as 0 for low, 1 for medium, and 2 for high risk. We did not examine the 57 non-HLA SNPs used in [], as these are present only on Immunochip arrays and were not well tagged by the SNPs on the genome-wide arrays.
Another imputation approach, SNP2HLA [], has proven particularly useful for fine-mapping the HLA region with regard to association with several autoimmune diseases, including CD [], and for explaining more of the heritability of disease.
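The three-level HLA risk coding described above can be written as a small lookup. The following Python sketch is illustrative only: the label strings (`"DQ2.5-het"`, `"DQ2.5-hom"`, `"DQ2.2"`, `"DQ8"`) and the function name are assumptions made for this example, and the mapping from imputed HLA-DQA1/HLA-DQB1 alleles to heterodimers (taken from the cited reference) is not reproduced here.

```python
from enum import IntEnum


class HlaRisk(IntEnum):
    """Numeric coding stated in the text: 0 = low, 1 = medium, 2 = high."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Hypothetical labels for the four CD risk heterodimer types named in the text.
RISK_HETERODIMERS = frozenset({"DQ2.5-het", "DQ2.5-hom", "DQ2.2", "DQ8"})


def hla_risk_score(heterodimers):
    """Return the HLA risk level for the set of risk heterodimers one person carries."""
    carried = set(heterodimers)
    unknown = carried - RISK_HETERODIMERS
    if unknown:
        raise ValueError(f"unrecognised heterodimer labels: {sorted(unknown)}")
    # Low risk: none of the CD risk heterodimers.
    if not carried:
        return HlaRisk.LOW
    # High risk: DQ2.5-homozygous, or DQ2.5-heterozygous together with DQ2.2.
    if "DQ2.5-hom" in carried or {"DQ2.5-het", "DQ2.2"} <= carried:
        return HlaRisk.HIGH
    # Medium risk: everyone else carrying at least one risk heterodimer.
    return HlaRisk.MEDIUM
```

For example, `hla_risk_score({"DQ2.5-het", "DQ2.2"})` returns `HlaRisk.HIGH`, whereas `hla_risk_score({"DQ8"})` returns `HlaRisk.MEDIUM`.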
SNP2HLA employs the assayed SNPs, together with a reference panel, to impute HLA SNPs (if they were not already on the genotyping array), HLA alleles (HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, and HLA-DRB1), and known amino acid polymorphisms within those HLA genes. These imputed variants can then be analyzed in the same way as assayed genotypes.
Hence, in addition to using HIBAG, we employed SNP2HLA v1.0.2 [] to impute 8,961 HLA SNPs, alleles at eight HLA genes, and amino acid polymorphisms, based on the T1DGC reference panel, in the European GWA, Immunochip, and North American datasets. The non-SNP imputed markers were coded as present/absent. Quality control for the combined SNP + imputed marker data included: (1) removal of imputed SNPs that were already assayed on the array; (2) within each dataset (UK2, NL, Finn, IT), removal of SNPs/markers with MAF <1 %, missingness >10 %, deviation from Hardy-Weinberg equilibrium (HWE) in controls P < 5 × 10⁻⁶, or differential case/control missingness P < 10⁻³, and removal of individuals with >10 % missingness; (3) removal of SNPs/markers that were not present in all four European datasets (UK2, NL, Finn, IT); (4) removal of SNPs/markers with differential case/control missingness P < 10⁻³ across the combined data; and (5) removal of SNPs/markers not in the North American imputed dataset. For the Immunochip data, QC included: (1) removal of imputed SNPs already assayed; (2) removal of SNPs/markers with MAF <0.5 %, missingness >10 %, or deviation from HWE in controls P < 10⁻³; and (3) removal of SNPs/markers with differential case/control missingness P < 10⁻³ and of individuals with missingness >10 %. We verified that the imputed markers included in genomic risk scores had high imputation accuracy (r² > 0.8) in the training data.
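The per-marker QC thresholds listed above can be summarised as a filter over precomputed marker statistics. The Python sketch below is a minimal illustration, not the authors' pipeline: it assumes that MAF, missingness, the HWE P-value in controls, and the differential-missingness P-value have already been computed by some other tool, and all class, field, and function names are hypothetical. Only the threshold values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerStats:
    """Precomputed per-marker statistics (field names are hypothetical)."""
    name: str
    maf: float             # minor allele frequency
    missingness: float     # fraction of missing genotype calls
    hwe_p_controls: float  # HWE test P-value in controls
    diff_missing_p: float  # case/control differential-missingness P-value


@dataclass(frozen=True)
class QcThresholds:
    min_maf: float
    max_missingness: float
    min_hwe_p: float
    min_diff_missing_p: float


# Threshold values as stated in the text for the two array types.
GWA_QC = QcThresholds(min_maf=0.01, max_missingness=0.10,
                      min_hwe_p=5e-6, min_diff_missing_p=1e-3)
IMMUNOCHIP_QC = QcThresholds(min_maf=0.005, max_missingness=0.10,
                             min_hwe_p=1e-3, min_diff_missing_p=1e-3)


def passes_qc(marker, thresholds):
    """Keep a marker only if it violates none of the removal criteria."""
    return (marker.maf >= thresholds.min_maf
            and marker.missingness <= thresholds.max_missingness
            and marker.hwe_p_controls >= thresholds.min_hwe_p
            and marker.diff_missing_p >= thresholds.min_diff_missing_p)


def shared_markers(per_dataset, thresholds):
    """Names of markers that pass QC in every dataset (step 3 in the text)."""
    surviving = [
        {m.name for m in markers if passes_qc(m, thresholds)}
        for markers in per_dataset.values()
    ]
    return set.intersection(*surviving) if surviving else set()
```

Calling `shared_markers({"UK2": [...], "NL": [...], "Finn": [...], "IT": [...]}, GWA_QC)` would then give the marker set retained across the four European datasets, before the combined-data and North American filters are applied.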
The final SNP + imputed marker European data consisted of 507,321 markers (500,821 assayed SNPs and 6,500 imputed markers) across 11,912 individuals (5,552 of whom were DQ2.5+). After removal of individuals assayed in the UK GWA data, the Immunochip dataset had 7,803 individuals, of whom 4,732 were DQ2.5+, with 24,555 SNPs/markers (approximately 17,800 assayed SNPs and approximately 6,700 imputed markers) in common with the other datasets. For the DQ2.5-specific GRS (GRS-DQ2.5), only SNPs present in the North American dataset were used in cross-validation, so that all SNPs present in the model could be used to determine the score in external validation. SNP2HLA had approximately 100 % concordance with HIBAG’s DQ2.5+ classification. […]
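The text does not state how concordance between the SNP2HLA and HIBAG DQ2.5+ classifications was computed. One simple reading is the fraction of individuals for whom the two methods give the same call, sketched below in Python. The function name, the use of sample IDs as dictionary keys, and the restriction to individuals called by both methods are assumptions made for this illustration.

```python
def concordance(calls_a, calls_b):
    """Fraction of shared individuals on which two classifications agree.

    Each argument maps a sample ID to True (DQ2.5+) or False (DQ2.5-).
    Only individuals present in both mappings are compared.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two call sets have no individuals in common")
    agreeing = sum(calls_a[sample] == calls_b[sample] for sample in shared)
    return agreeing / len(shared)
```

With hypothetical inputs such as `{"s1": True, "s2": False}` for both methods, the function returns 1.0; any disagreement lowers the value proportionally.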

Pipeline specifications

Software tools HIBAG, SNP2HLA
Application Drug design
Diseases Autoimmune Diseases, Celiac Disease