Computational protocol: Exon-focused genome-wide association study of obsessive-compulsive disorder and shared polygenic risk with schizophrenia

Protocol publication

[…] Extensive quality control filtering of the obtained genotypes was performed using standard procedures in PLINK. SNPs were retained only if they passed all of the following filters: (i) genotyping call rate above 95%; (ii) no significant departure from Hardy–Weinberg equilibrium in control samples (P > 1 × 10⁻³); and (iii) no significant difference in call rate between cases and controls (P > 1 × 10⁻³). Samples meeting any of the following conditions were removed: (i) call rate below 95%; (ii) discordance between the sex recorded in our database and the sex inferred from the genotypes; (iii) heterozygosity more than three standard deviations from the mean; (iv) cryptic relatedness, detected as a proportion of identity-by-descent (PI_HAT) greater than 0.05, as recommended by PLINK; in that case, the sample of the pair with the lower call rate was removed. Finally, the genotype data of 3410 ancestry informative markers present in our samples were used to identify individuals with less than 90% Spanish ancestry using Structure v2.3.3, with the HapMap samples of European (CEU), African (YRI) and Asian (JPT+CHB) ancestry as reference sets. Structure was run under the admixture model with 100 000 replications for the burn-in period and 100 000 replications after burn-in for parameter estimation. [...]
Imputation was performed for autosomal chromosomes following a stepwise pre-phasing/imputation approach with IMPUTE2/SHAPEIT, using default parameters. As recommended, the chromosomes were divided into chunks of 5 Mb for imputation, and all SNPs with minor allele frequency (MAF) over 0.01 were included as input (94 096 variants). The 1000 Genomes Project data set was used as the reference. After imputation, only genotypes with an imputation info score >0.9 were considered for further analysis. Any SNP with imputation data for less than 95% of the samples was removed from further analysis. [...]
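The marker filters, the heterozygosity filter and the 5 Mb chunking described above can be sketched as follows. This is a minimal Python illustration of the stated thresholds, not the authors' PLINK or IMPUTE2 commands. The function names and input shapes (precomputed per-marker summary values, a dictionary of per-sample heterozygosity, 1-based inclusive chunk coordinates) are assumptions made for the example.

```python
from statistics import mean, stdev

# Thresholds taken from the text above.
CALL_RATE_MIN = 0.95
HWE_P_MIN = 1e-3
DIFF_CALL_RATE_P_MIN = 1e-3


def marker_passes_qc(call_rate, hwe_p_controls, diff_call_rate_p):
    """Return True if a SNP passes all three marker-level filters.

    The three inputs are assumed to be precomputed summaries:
    genotyping call rate, Hardy-Weinberg P-value in controls, and the
    P-value for a call-rate difference between cases and controls.
    """
    return (
        call_rate > CALL_RATE_MIN
        and hwe_p_controls > HWE_P_MIN
        and diff_call_rate_p > DIFF_CALL_RATE_P_MIN
    )


def heterozygosity_outliers(het_by_sample, n_sd=3.0):
    """Return the sample ids whose heterozygosity lies more than n_sd
    sample standard deviations from the mean."""
    values = list(het_by_sample.values())
    mu = mean(values)
    sd = stdev(values)
    return {s for s, h in het_by_sample.items() if abs(h - mu) > n_sd * sd}


def imputation_chunks(chrom_length_bp, chunk_bp=5_000_000):
    """Split a chromosome into consecutive (start, end) intervals of at
    most chunk_bp base pairs, using 1-based inclusive coordinates."""
    return [
        (start, min(start + chunk_bp - 1, chrom_length_bp))
        for start in range(1, chrom_length_bp + 1, chunk_bp)
    ]
```

For a hypothetical 12 Mb chromosome, `imputation_chunks(12_000_000)` yields three intervals, the last one shorter than 5 Mb. In practice the P-values would come from PLINK's own Hardy–Weinberg and differential-missingness tests rather than being supplied by hand.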
After imputation, association analysis at the SNP level was performed using logistic regression under an additive model, considering SNPs with MAF>5%. The first 10 dimensions of multidimensional scaling, calculated from genotyped data at MAF>5%, were included as covariates to control for population stratification. The analysis was performed as implemented in PLINK 1.9. Manhattan plots and quantile-quantile plots were created with the R package qqman (https://github.com/stephenturner/qqman). Meta-analysis with previous GWAS data was performed using METAL. [...]
Polygenic risk analysis was performed as previously described. Briefly, a polygenic risk model was constructed from GWAS data on a discovery sample. The model included the associated allele and its effect, measured as the logarithm of the odds ratio (OR), at each SNP with an association P-value below a specific threshold (Pthreshold). SNPs with alleles A/T or C/G were excluded to avoid strand ambiguity. Several Pthreshold values, from 0.01 to 1 (that is, inclusion of all SNPs), were considered. SNPs correlated through linkage disequilibrium were pruned using the clumping algorithm of PLINK, with r2=0.2 and a window size of 500 kb. The polygenic risk model was then tested on a target sample, obtaining a polygenic risk score for each sample as the sum of the number of risk alleles carried by that sample, each weighted by its effect. The significance of the results, based on a Wald test for the coefficient of the score, was tested by comparison of two logistic regression models: one considering only the first 10 dimensions of multidimensional scaling to control for stratification, and another additionally considering the polygenic risk score. Nagelkerke's pseudo-R2 was calculated as a measure of the variance explained on the observed scale. Two different analyses were performed using PRSice.
The first used the discovery phase of the second Psychiatric Genomics Consortium schizophrenia case–control mega-analysis (PGC-SCZ2) as the discovery sample and our OCD case–control samples as the target sample. In this way, the existence of common genetic susceptibility to both disorders was tested. Only genotyped SNPs or imputed SNPs with an imputation info score >0.9 were selected from the schizophrenia data. As an internal control, 100 permutations of the case–control labels in the OCD data were used as target samples.
The second analysis used the OCD data generated in this work as the discovery sample and our previous data from a schizophrenia–control study of exonic SNPs, genotyped with the Affymetrix 20k cSNPs array, as the target sample, to test for shared polygenic risk in an additional sample. The OCD data in the polygenic risk analysis were restricted to genotyped autosomal SNPs.
Power calculation for the polygenic risk analysis was performed by the method of Palla and Dudbridge (2015), as implemented in AVENGEME. These authors estimated several parameters relevant to our calculation: an additive genetic variance in schizophrenia explained by SNPs on common GWAS arrays of 0.3, and an additive genetic covariance, explained by SNPs on common GWAS arrays, of 0.165 between schizophrenia and major depression and 0.185 between schizophrenia and bipolar disorder. For the power calculation, we assumed an additive genetic covariance between schizophrenia and OCD of 0.15. […]
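The polygenic scoring steps described above (P-value thresholding, exclusion of strand-ambiguous SNPs, weighting by log(OR), Nagelkerke's pseudo-R2 and permuted case–control labels) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not PRSice's implementation: the row layout of the discovery data and all function names are hypothetical, LD clumping is assumed to have been done beforehand, and the log-likelihoods passed to `nagelkerke_r2` are assumed to come from logistic models fitted elsewhere.

```python
import math
import random

# Allele pairs whose strand cannot be resolved from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def build_risk_model(discovery_rows, p_threshold):
    """Map SNP id -> (effect allele, log odds ratio) for SNPs with
    P <= p_threshold, skipping strand-ambiguous SNPs.

    discovery_rows uses a hypothetical layout: dicts with the keys
    'snp', 'a1' (effect allele), 'a2', 'odds_ratio' and 'p'.
    """
    model = {}
    for row in discovery_rows:
        if row["p"] > p_threshold or is_strand_ambiguous(row["a1"], row["a2"]):
            continue
        model[row["snp"]] = (row["a1"], math.log(row["odds_ratio"]))
    return model


def polygenic_score(model, effect_allele_counts):
    """Sum of effect-allele counts (0, 1 or 2) weighted by log(OR).
    SNPs absent from the sample's genotypes are skipped."""
    return sum(
        weight * effect_allele_counts[snp]
        for snp, (_allele, weight) in model.items()
        if snp in effect_allele_counts
    )


def nagelkerke_r2(loglik_intercept_only, loglik_model, n_samples):
    """Nagelkerke's pseudo-R2: the Cox-Snell R2 divided by its maximum
    attainable value for the given intercept-only log-likelihood."""
    cox_snell = 1.0 - math.exp(
        2.0 * (loglik_intercept_only - loglik_model) / n_samples
    )
    upper_bound = 1.0 - math.exp(2.0 * loglik_intercept_only / n_samples)
    return cox_snell / upper_bound


def permuted_labels(labels, n_permutations=100, seed=0):
    """Return n_permutations shuffled copies of the case-control labels,
    leaving the original list untouched."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        permutations.append(shuffled)
    return permutations
```

One common convention, which may or may not match what was done here, is to report the variance attributable to the score as the difference between the pseudo-R2 of the model with covariates plus score and that of the covariates-only model. Each set of permuted labels would be analysed in the same way as the real labels, giving a reference distribution for the variance explained by chance.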

Pipeline specifications

Software tools: PLINK, ADMIXTURE, IMPUTE, SHAPEIT, qqman, PRSice, AVENGEME
Applications: Population genetic analysis, GWAS
Organisms: Homo sapiens
Diseases: Genetic Diseases, Inborn