Computational protocol: Variants near CHRNA3/5 and APOE have age- and sex-related effects on human lifespan

Protocol publication

[…] Our study closely followed standard methods for a GWAS meta-analysis (GWAMA) of a quantitative trait, with human lifespan as the phenotype, with the following exceptions. First, the phenotyped subjects were the parents of UKB participants. Parental phenotypes were regressed on offspring genotypes, in effect imputing parental genotype from offspring. The effects measured are thus those of offspring genotype on parental phenotype, and the expected allelic doses in the parental generation are half the measured doses in offspring. Effect estimates per parental allele are therefore twice those per offspring allele, and it is the parental-allele estimates that we report unless otherwise stated. Second, as all subjects were genotyped on only two highly overlapping arrays, the focus was on array rather than imputed genotypes. [...]

UKB participants were genotyped on two slightly different arrays, and quality control was performed by UK Biobank. In brief, ∼50,000 samples were genotyped as part of the UK BiLEVE study using a newly designed array, and the remaining samples (∼100,000) were genotyped on an updated version (the UK Biobank Axiom array). Both arrays were manufactured by Affymetrix, and 96% of SNPs overlap between them. Samples were processed and genotyped in batches, and batch was used as a covariate to control for confounding due to batch effects. SNPs or samples with high missingness, multi-allelic SNPs and SNPs with batchwise departures from Hardy–Weinberg equilibrium were removed from the data set. Analysis was restricted to the 753,488 autosomal SNPs passing these filters. After quality control, genotypes were available for 152,732 subjects. UKB provided 15 principal components of genetic relatedness (UK Biobank field ID 22009) and a binary assessment of whether subjects were genomically British (UK Biobank field ID 22006), based on principal components analysis of their genetic data.

Imputed data were prepared by UKB and used for fine-mapping signals in two genomic regions of interest.
In summary, autosomal phasing was carried out using a version of SHAPEIT2 modified to allow for very large sample sizes. Imputation was carried out using IMPUTE2 with the merged UK10K and 1000 Genomes Phase 3 reference panels, to yield higher imputation accuracy for British haplotypes. The imputation process yields a data set of 73,355,667 SNPs, short indels and large structural variants. Our analysis included 54,767 variants for the Chr15 region (77,826,182–79,826,157 bp) and 54,474 for the Chr19 region (44,422,982–46,422,870 bp).

The Estonian Biobank genotypes were from the Illumina OmniExpress array. A total of 647,357 autosomal SNPs were available for 7,950 subjects. Standard quality control practices were followed, including removing SNPs and individuals with high missingness, samples with mismatching genomic and self-declared gender, ethnic outliers, duplicates, cryptic relatives and SNPs out of Hardy–Weinberg equilibrium. Imputation followed a similar workflow to UKB, using SHAPEIT2, IMPUTE2 and the 1000 Genomes Phase 3 reference panel. [...]

Although our analysis followed the broad strategy of best-practice GWAMA, the unique characteristics of our data set and, to a lesser extent, of the trait meant that we modified the design slightly. All subjects were genotyped on only two very similar arrays. This meant we could perform GWAS directly on array SNPs without losing the majority of SNPs to missingness, an issue usually addressed through imputation. Fitting a full Cox model one SNP at a time across all array SNPs was computationally infeasible for 116,279 subjects. We therefore first calculated Martingale residuals for survivorship under a Cox proportional hazards model (CPHM) excluding genotype, using the R package survival, and then used PLINK 1.9 to perform a GWAS scan of these residuals at array SNPs. Manhattan plots were created in R using the package qqman.
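To make the scan step concrete, the sketch below regresses a residual phenotype on allele count for a single SNP by ordinary least squares. It is an illustration only: PLINK 1.9 was the tool actually used, and the function name `snp_linear_scan`, the toy data and the normal approximation for the P value are assumptions introduced here.

```python
import math


def snp_linear_scan(phenotype, genotypes):
    """Simple linear regression of a phenotype on allele count (0/1/2).

    Missing genotypes are given as None and are dropped.
    Returns (beta, standard error, two-sided P value).
    """
    pairs = [(g, y) for g, y in zip(genotypes, phenotype) if g is not None]
    n = len(pairs)
    mean_g = sum(g for g, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in pairs)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in pairs)
    beta = sxy / sxx
    # Residual sum of squares around the fitted line.
    rss = sum((y - mean_y - beta * (g - mean_g)) ** 2 for g, y in pairs)
    se = math.sqrt(rss / (n - 2) / sxx)
    # Normal approximation to the t distribution (reasonable for large n).
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p_value


# Toy data (made up): survival scores and offspring allele counts.
scores = [-0.9, 0.1, 1.2, -1.1, -0.2, 0.8, 0.5, 0.1]
allele_counts = [0, 1, 2, 0, 1, 2, None, 1]
beta, se, p_value = snp_linear_scan(scores, allele_counts)
```

In a genome-wide scan the same regression is repeated for every array SNP, which is why a linear model on precomputed residuals is so much cheaper than refitting a Cox model per SNP.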
For the two clearly associated regions, we then created locus zoom plots using imputed genotypes and the Martingale residuals (discovery phase). The top array SNP for mothers and fathers combined in each unambiguously associated region was then analysed using a full CPHM that simultaneously modelled array genotype and the other covariates, allowing us to determine P values directly and to calculate allelic effects on the more natural scales of hazard ratio and life years (verification phase). Finally, the association between the top array SNPs and lifespan was tested in three other populations (replication phase).

Association testing was conducted under the following CPHM:

$$h(x) = h_0(x)\,\exp\left(\beta X + \gamma_1 Z_1 + \dots + \gamma_k Z_k\right)$$

with $h_0(x)$ being the baseline hazard given the parental age and dead/alive status, $X$ the non-reference allele count of the marker (with effect $\beta$) and $Z_1, \dots, Z_k$ the other covariates in the model (with coefficients $\gamma_1, \dots, \gamma_k$): subjects' (not parental) sex, indicators of assessment centre and genotyping batch, Townsend deprivation index (a measure of socio-economic status), and 15 principal components of genetic relatedness for all 152,732 considered together.

Where shown, results for fathers and mothers combined were calculated using inverse-variance meta-analysis on the log hazard ratios.

In the discovery phase (using the unambiguously genetically British), Martingale residuals of the Cox model (without genotype) were calculated using the R package survival. The Martingale residuals are defined as

$$\hat{M}_i = \delta_i - \hat{\Lambda}_0(\tau_i)\,\exp\left(\hat{\gamma}_1 Z_{i1} + \dots + \hat{\gamma}_k Z_{ik}\right)$$

where $\delta_i$ and $\tau_i$ are the event indicator (1 = died, 0 = survived at the end of follow-up) and follow-up time of the $i$th individual, $\hat{\Lambda}_0$ is the estimated baseline cumulative hazard, and $\hat{\gamma}_1, \dots, \hat{\gamma}_k$ are the estimates of the coefficients of the covariates $Z_1, \dots, Z_k$ in the Cox model above, omitting the genetic markers. If $\beta$ is non-zero, the residuals should show a linear association with the marker score $X_i$. These residuals were subsequently rank-normal transformed to give a normally distributed survival score for each subject. These scores were then tested for association with genotype using PLINK.
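The two transformations described above can be sketched as follows, under a strong simplifying assumption: the Cox model has no covariates, so the baseline cumulative hazard reduces to the Nelson–Aalen estimator. The study itself fitted covariates with the R package survival. The function names, the toy data and the (rank − 0.5)/n offset in the rank-normal transform are illustrative choices, not taken from the source.

```python
from statistics import NormalDist


def null_martingale_residuals(times, events):
    """Martingale residuals for a covariate-free model.

    Each residual is delta_i minus the Nelson-Aalen cumulative hazard
    evaluated at the subject's follow-up time tau_i.
    """
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cumulative, running = [], 0.0
    for t in event_times:
        deaths = sum(1 for tt, d in zip(times, events) if tt == t and d == 1)
        at_risk = sum(1 for tt in times if tt >= t)
        running += deaths / at_risk
        cumulative.append(running)
    residuals = []
    for t, d in zip(times, events):
        hazard = 0.0
        for event_time, h in zip(event_times, cumulative):
            if event_time <= t:
                hazard = h
        residuals.append(d - hazard)
    return residuals


def rank_normal_transform(values):
    """Map values to normal quantiles by rank; ties share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank averaged over the tie
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Toy data (made up): age at death or last follow-up, and death indicator.
times = [62, 70, 70, 81, 85, 90]
events = [1, 1, 0, 1, 0, 1]
residuals = null_martingale_residuals(times, events)
scores = rank_normal_transform(residuals)
```

Subjects who die early receive large positive residuals and long-lived survivors receive negative ones, so the transformed scores act as a quantitative phenotype that can be passed to a linear association scan.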
The residual-based scan was then repeated, conditioning on the two top SNPs (rs429358 and rs10519203), to determine whether there was evidence of additional allelic effects local to their regions.

In the verification phase, the two top SNPs were analysed singly using the full Cox model (including genotype) in R to give hazard ratios for the additive allelic effect, with Kaplan–Meier survival curves used to calculate expected lifetime for each genotype. The bivariate analysis of rs4420638 simultaneously fitted the three SNPs.

In the replication phase, rs429358 and rs10519203 were analysed under the full Cox model. In UKB, each sub-population (self-declared British but not unambiguously genomically British; African; Caribbean; Chinese; Indian; Irish; Pakistani) was analysed separately for the effect of allele count on log hazard ratio. Results for UKB non-self-declared British were combined using a two-sided test and inverse-variance meta-analysis.

We performed an enhanced replication in the Estonian Biobank using participant lifespans, which simultaneously tested whether the effects on parental lifespan translated into an influence on survival in a prospective cohort. All-cause mortality was analysed using Cox modelling in 5,196 individuals, including 1,499 deaths at ages 40 to 103, over a mean follow-up time of 6 years, using the first three principal components of genetic relatedness as covariates in addition to the effect allele count of the sentinel SNPs. Three models were fitted: a sex-stratified model and sex-specific models for males and females. Given the data structure, in which an excess of deaths was included in the genotyped set compared with the full cohort, baseline hazard and survival estimates are not valid, but hazard ratios are.
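Several steps in this protocol combine log hazard ratios by inverse-variance meta-analysis. A minimal fixed-effect sketch is given below; the function name `inverse_variance_meta` and the input values are hypothetical and are not results from the study.

```python
import math


def inverse_variance_meta(estimates, std_errors):
    """Fixed-effect inverse-variance combination of effect estimates.

    Returns (pooled estimate, pooled standard error, two-sided P value).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Two-sided P value from a normal approximation.
    p_value = math.erfc(abs(pooled / pooled_se) / math.sqrt(2))
    return pooled, pooled_se, p_value


# Hypothetical log hazard ratios for fathers and mothers (made-up numbers).
log_hr = [0.10, 0.06]
se = [0.02, 0.02]
pooled, pooled_se, p_value = inverse_variance_meta(log_hr, se)
pooled_hazard_ratio = math.exp(pooled)
```

Because each estimate is weighted by the reciprocal of its variance, more precisely estimated cohorts contribute more to the pooled log hazard ratio.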
As a sensitivity analysis of the Estonian Biobank data, a case–cohort analysis was performed, resulting in identical effect estimates and P values.

The replication cohorts' results (Estonian Biobank, self-declared British and self-declared non-British) were then combined using inverse-variance meta-analysis.

Finally, for the unambiguously genomically British samples only (the discovery phase), we looked at the effect of the risk SNPs over two separate age ranges, 40–75 and >75 (75 being approximately the mean age at death). To do this, the same Cox models were rerun. For the first age range, age was truncated at 75, and subjects who lived past age 75 were recorded as survivors. For the second age range, only subjects who had survived to age 75 were entered into the analysis; survival to age 75 was thus complete, and the analysis examined only survival beyond age 75.

The expected allelic effect of one allele in offspring is 0.5 alleles in parents. This can be seen, for an allele (A1) with frequency p (and q = 1 − p), because (i) if the offspring is homozygous for the other allele (A0), the expected dosage of A1 in a parent is p (that is, the probability that the non-inherited allele is A1); (ii) if the offspring is homozygous A1, the expected dosage in the parent is 2 − q = 1 + p; and (iii) if the offspring is heterozygous, the expected dosage in a parent is the average of the previous two values. Measured effects of offspring genotypes, other than the variances explained, have therefore all been doubled to give measurements on the natural scale: the effect on parents' lifespan of one allele in parents. […]
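The dosage argument above can be checked numerically. The sketch below assumes random mating, so that the allele a parent did not transmit is A1 with probability p; the function name, the allele frequency and the offspring effect size are hypothetical.

```python
def expected_parent_dosage(offspring_dosage, p):
    """Expected A1 count in one parent, given the offspring's A1 count (0/1/2).

    The allele this parent transmitted is A1 with probability
    offspring_dosage / 2; the allele it did not transmit is A1 with
    probability p (the population frequency), assuming random mating.
    """
    transmitted = offspring_dosage / 2
    non_transmitted = p
    return transmitted + non_transmitted


p = 0.3  # hypothetical A1 frequency
dosages = [expected_parent_dosage(g, p) for g in (0, 1, 2)]

# Each extra offspring allele adds half an allele to the expected parental dosage.
slope = (dosages[2] - dosages[0]) / 2

# An effect measured per offspring allele is therefore doubled to express it
# per parental allele.
offspring_beta = 0.05  # hypothetical effect per offspring allele
parental_beta = offspring_beta / slope
```

The three expected dosages are p, p + 0.5 and p + 1, so the slope of 0.5 holds for any allele frequency, which is what justifies doubling the offspring-based estimates.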

Pipeline specifications

Software tools: GWAMA, SHAPEIT, IMPUTE, PLINK, qqman
Databases: UK10K, UK Biobank
Application: GWAS
Organisms: Homo sapiens
Chemicals: Acetylcholine