Computational protocol: Identification of 15 genetic loci associated with risk of major depression in individuals of European descent

Protocol publication

[…] DNA extraction and genotyping were performed on saliva samples by National Genetics Institute (NGI), a CLIA-licensed clinical laboratory and a subsidiary of Laboratory Corporation of America. Samples were genotyped on one of four genotyping platforms. The V1 and V2 platforms were variants of the Illumina HumanHap550+ BeadChip, including about 25,000 custom SNPs selected by 23andMe. The V3 platform was based on the Illumina OmniExpress+ BeadChip, with custom content to improve the overlap with the V2 array. The V4 platform, the one in most recent use, is a fully custom array, including a lower-redundancy subset of V2 and V3 SNPs with additional coverage of lower-frequency coding variation. The platforms contained 586,916; 584,942; 1,008,948; and 570,000 SNPs, respectively. Samples that failed to reach a 98.5% call rate were re-analyzed. Individuals whose analyses failed repeatedly were re-contacted by 23andMe customer service to provide additional samples, as is done for all 23andMe customers.

Participant genotype data were imputed against the September 2013 release of 1000 Genomes Phase 1 reference haplotypes, phased with ShapeIt2. We phased participant genotypes using an internally developed phasing tool, Finch, which implements the Beagle haplotype graph-based phasing algorithm, modified to separate the haplotype graph construction and phasing steps. Finch extends the Beagle model to accommodate genotyping error and recombination, in order to handle cases where there are no consistent paths through the haplotype graph for the individual being phased. We constructed haplotype graphs for European samples on each 23andMe genotyping platform from a representative sample of genotyped individuals, and then performed out-of-sample phasing of all genotyped individuals against the appropriate graph.

In preparation for imputation, we split phased chromosomes into segments of no more than 10,000 genotyped SNPs, with overlaps of 200 SNPs.
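The segment-splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_segments`, the use of half-open index ranges, and the rule that each new segment starts 200 SNPs before the previous one ends are all assumptions, since the text gives only the maximum segment size and the overlap.

```python
def split_into_segments(n_snps, max_size=10_000, overlap=200):
    """Return half-open (start, end) index ranges covering n_snps SNPs.

    Hypothetical helper: each segment holds at most `max_size` SNPs and
    shares `overlap` SNPs with the segment before it.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be non-negative and smaller than max_size")
    if n_snps <= 0:
        return []

    step = max_size - overlap  # how far each new segment start advances
    segments = []
    start = 0
    while True:
        end = min(start + max_size, n_snps)
        segments.append((start, end))
        if end == n_snps:  # the last segment reaches the chromosome end
            break
        start += step
    return segments


# A chromosome with 25,000 genotyped SNPs yields three segments:
# (0, 10000), (9800, 19800), (19600, 25000)
segments = split_into_segments(25_000)
```

Under this scheme the final segment may be much shorter than the others; how short trailing segments were actually handled is not stated in the text.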
We excluded SNPs with Hardy-Weinberg equilibrium P < 10^-20, call rate < 95%, or large allele frequency discrepancies compared to European 1000 Genomes reference data. Frequency discrepancies were identified by computing a 2×2 table of allele counts for European 1000 Genomes samples and 2,000 randomly sampled 23andMe customers with European ancestry, and identifying SNPs with a chi-squared P < 10^-15. We imputed each phased segment against all-ethnicity 1000 Genomes haplotypes (excluding monomorphic and singleton sites) using Minimac2, with 5 rounds and 200 states for parameter estimation.

For the X chromosome, we built separate haplotype graphs for the non-pseudoautosomal region and each pseudoautosomal region, and these regions were phased separately. We then imputed males and females together using Minimac2, as with the autosomes, treating males as homozygous pseudo-diploids for the non-pseudoautosomal region. [...]

In the GWAS and replication analysis, we computed association test results by logistic regression assuming additive allelic effects. For tests using imputed data, we used the imputed dosages rather than best-guess genotypes. We included covariates for age, gender, and the top 5 principal components to account for residual population structure. While we could justify the choice of 5 PCs based on the preceding ancestry analysis, we actually chose 5 based on computational considerations, and others have noted this to be a reasonable choice.

For quality control of genotyped GWAS results, we removed SNPs that were genotyped only on our “V1” and/or “V2” platforms, due to small sample size, and SNPs on chrM or chrY, because many of these are not genotyped reliably. Using trio data, we flagged SNPs that failed a test for parent-offspring transmission; specifically, we regressed the child’s allele count against the mean parental allele count and flagged SNPs with fitted β < 0.6 and P < 10^-20 for a test of β < 1.
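The parent-offspring transmission check described above can be sketched as an ordinary least-squares fit of child allele count on mean parental allele count, where correct Mendelian transmission gives an expected slope of 1. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the one-sided P value uses a normal approximation to the slope's sampling distribution, which the text does not specify.

```python
import math


def transmission_test(child_counts, mean_parent_counts):
    """Fit child ~ alpha + beta * mean_parent and test H1: beta < 1.

    Returns (beta, p), where p is a one-sided P value from a normal
    approximation (an assumption; reasonable for large numbers of trios).
    """
    n = len(child_counts)
    if n < 3 or n != len(mean_parent_counts):
        raise ValueError("need at least 3 trios and equal-length inputs")

    mean_x = sum(mean_parent_counts) / n
    mean_y = sum(child_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in mean_parent_counts)
    if sxx == 0:
        raise ValueError("mean parental allele count has no variation")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(mean_parent_counts, child_counts))

    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    rss = sum((y - alpha - beta * x) ** 2
              for x, y in zip(mean_parent_counts, child_counts))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:  # perfect fit: the test is degenerate
        return beta, (0.0 if beta < 1 else 1.0)

    z = (beta - 1.0) / se
    # Lower-tail normal probability; erfc stays accurate far into the tail.
    p = 0.5 * math.erfc(-z / math.sqrt(2))
    return beta, p


def is_flagged(beta, p, beta_max=0.6, p_max=1e-20):
    """Apply the thresholds from the text: beta < 0.6 and P < 10^-20."""
    return beta < beta_max and p < p_max
```

Using `math.erfc` rather than `1 - cdf` matters here, because a threshold of 10^-20 lies far below the precision that a subtraction from 1 can resolve in double-precision arithmetic.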
We removed SNPs with a Hardy-Weinberg P < 10^-20 in Europeans, or a call rate of < 90%. We also tested genotyped SNPs for genotype date effects, and removed SNPs with P < 10^-50 by ANOVA of SNP genotypes against a factor dividing genotyping date into 20 roughly equal-sized buckets. For imputed GWAS results, we removed SNPs with average r^2 < 0.5 or minimum r^2 < 0.3 in any imputation batch, as well as SNPs that had strong evidence of an imputation batch effect. The batch effect test is an F test from an ANOVA of the SNP dosages against a factor representing imputation batch; we removed results with P < 10^-50. Prior to GWAS, we identified, for each SNP, the largest subset of the data passing these criteria, based on original genotyping platform (V2+V3+V4, V3+V4, V3, or V4 only), and computed association test results for the largest passing set. After quality control, the 23andMe discovery GWAS included results for 13,474,321 imputed variants, and 60,949 genotyped variants that did not have imputed results passing our filters, for a total of 13,535,270 variants. Of these, 15,774 test results could not be computed due to logistic regression fitting problems, leaving 13,519,496 tests. HWE and batch-effect P values are presented in […].

Results from 23andMe were adjusted for variance inflation by multiplying the variance (i.e., the square of the standard error) of each genetic effect estimate by the intercept of 1.0598 as calculated by LD score regression. Meta-analysis with PGC was conducted by inverse-variance fixed-effects meta-analysis on overlapping SNPs, after adjusting the standard errors of each individual analysis for its own lambda (the LD score regression intercept in PGC was 1.0243). Final results from the meta-analysis were further adjusted for the overall LD score regression intercept of 1.0025 (for more details on LD score regression methods, see the section on LD score regression). […]
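The inflation adjustment and meta-analysis steps can be sketched for a single SNP as follows. The three intercepts (1.0598, 1.0243, 1.0025) come from the text; the function names and the example effect sizes are hypothetical, and the two-sided P value assumes a normally distributed test statistic.

```python
import math

# LD score regression intercepts quoted in the text.
INTERCEPT_23ANDME = 1.0598
INTERCEPT_PGC = 1.0243
INTERCEPT_META = 1.0025


def inflate_se(se, intercept):
    """Multiply the variance (se squared) by the intercept."""
    return se * math.sqrt(intercept)


def fixed_effects_meta(estimates):
    """Inverse-variance fixed-effects meta-analysis of (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)


def two_sided_p(beta, se):
    """Two-sided P value for beta / se under a normal approximation."""
    return math.erfc(abs(beta / se) / math.sqrt(2))


def meta_analyze_snp(beta_a, se_a, beta_b, se_b):
    """Adjust each study's SE, combine, then adjust the combined SE."""
    beta, se = fixed_effects_meta([
        (beta_a, inflate_se(se_a, INTERCEPT_23ANDME)),
        (beta_b, inflate_se(se_b, INTERCEPT_PGC)),
    ])
    se = inflate_se(se, INTERCEPT_META)
    return beta, se, two_sided_p(beta, se)


# Hypothetical log-odds-ratio estimates for one SNP in the two studies.
beta, se, p = meta_analyze_snp(0.030, 0.006, 0.025, 0.012)
```

Because each intercept exceeds 1, every adjustment widens the standard error, so the adjusted P value is never smaller than the unadjusted one for the same effect estimate.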

Pipeline specifications

Software tools: SHAPEIT, BEAGLE, Minimac2, LDSC, Overall LD
Application: GWAS