Computational protocol: Novel Bayes Factors That Capture Expert Uncertainty in Prior Density Specification in Genetic Association Studies

Protocol publication

[…] BFs compare the probability of the observed data under two models or hypotheses. For our purposes, the BF can be defined as

$$\mathrm{BF} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}. \tag{1}$$

BFs are also used to update prior odds ($\delta/(1-\delta)$) to posterior odds ($\Delta/(1-\Delta)$) via

$$\frac{\Delta}{1-\Delta} = \frac{\delta}{1-\delta} \times \mathrm{BF},$$

where in our case $\Delta$ and $\delta$ are the posterior and prior probabilities of “true” association, respectively. By “true association,” we mean causally linked to disease risk rather than being associated through LD or sampling variation. Here, a BF greater than one indicates that the data are more likely under the alternative than the null hypothesis. BFs require the specification of a likelihood and a prior on all model parameters. Both BFs and posterior probabilities can be used to fine‐map genomic regions in case‐control studies by using the likelihood from a logistic regression model.

For SNP $i$ in single‐SNP logistic regression models, the probability ($y_{ij}$) of subject $j$, with $x_{ij}$ copies of the minor allele, being a case is

$$y_{ij} = \frac{e^{\beta_{0i} + \beta_{1i} x_{ij}}}{1 + e^{\beta_{0i} + \beta_{1i} x_{ij}}}. \tag{2}$$

With this definition, $\beta_{1i}$ can be interpreted as the SNP‐specific per‐allele natural logarithm of the OR comparing the minor to the major allele. For SNP $i$, $\mathrm{BF}_i$ is calculated comparing the hypotheses $H_0\colon \beta_{1i} = 0$ and $H_1\colon \beta_{1i} \neq 0$ [Stephens and Balding].

The BF, as given in Equation (1), is the ratio of marginal likelihoods, which can lead to intractable integrals for many prior densities. For intractable BFs, it is common to use a Laplace approximation [Kass and Raftery]. The Laplace approximation is implemented in software packages, including snptest2 [Marchini et al.]. Wakefield [2008, 2009] derived a tractable approximation to the BF (which we abbreviate as WBF). We found excellent agreement between the WBF and Laplace approximations from snptest2 for sample sizes ≥10,000 for a variety of ORs and MAFs (data not shown).
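As a minimal sketch of Equations (1) and (2), the odds update and the logistic case probability can be written in a few lines of Python. The function names `posterior_probability` and `case_probability` are illustrative and do not come from the paper or from snptest2, and the prior probability and BF in the usage lines are assumed values chosen only to show the calculation.

```python
import math


def posterior_probability(prior_prob, bayes_factor):
    """Turn a prior probability of true association into a posterior one.

    Implements posterior odds = prior odds x BF, with the BF taken as
    H1 versus H0 (Equation (1)).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def case_probability(beta0, beta1, x):
    """Equation (2): probability of being a case given x minor-allele copies."""
    eta = beta0 + beta1 * x
    # 1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta)).
    return 1.0 / (1.0 + math.exp(-eta))


# Assumed, illustrative inputs: a prior of 1 in 1,000 and a BF of 500.
print(posterior_probability(0.001, 500.0))
# Assumed intercept and a per-allele OR of 1.14, for a subject with 2 copies.
print(case_probability(-0.1, math.log(1.14), 2))
```

Because $\beta_{1i}$ is a per‐allele log OR, the odds of being a case are multiplied by $e^{\beta_{1i}}$ for each additional copy of the minor allele.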
Both the Laplace approximation and the WBF are based on asymptotic approximations and, given the large sample sizes in the types of dataset we consider, should provide accurate approximations to the true BF.

Using the definition of the BF in Equation (1), the Wakefield approximate BF is

$$\mathrm{WBF} = \sqrt{\frac{V}{V+W}} \, \exp\!\left(\frac{\hat{\beta}_1^{2} W}{2V(V+W)}\right). \tag{3}$$

In Equation (3), $\beta_1$, the logOR of causal SNPs in the genomic region under consideration, is assumed to follow a normal distribution, $\beta_1 \sim N(0, W)$, and $\hat{\beta}_1$ is the maximum likelihood estimator (MLE) of $\beta_1$. Rather than consider the logistic likelihood, Wakefield used the asymptotic distribution of the MLE, $\hat{\beta}_1 \sim N(\beta_1, V)$, which leads to the WBF given in Equation (3). Note that the WBF we specify in Equation (3), and use in the rest of this paper, is the reciprocal of the WBF given by Wakefield [2009]. [...]

Three out of four forms of BF use the genotype data to inform the prior through $V$, the asymptotic variance of the estimate of the logOR. $V$ will be different for each SNP since it depends on, among other quantities, the MAF. We used simulated data to generate realistic values of $V$ and examined their effect on the prior density of $W$ for values of $V$ corresponding to SNPs with an MAF of at least 0.005, as these are the SNPs that we might have sufficient power to detect with current sample sizes. These datasets were simulated using hapgen2 [Spencer et al.] and the European haplotypes of the August 2010 release of the 1000 Genomes data, with large sample sizes reflecting those now being generated by disease‐specific consortia.

There has been some suggestion that the effect size of causal SNPs may increase with decreasing MAF [Wang et al.]. We investigate whether the three empirical forms of prior implicitly have this property. To assess this, we examine how $E(W)$ changes with $V$, over a support relevant to studies with sample sizes of 2,000 or more.
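The WBF in Equation (3) depends only on the MLE, its asymptotic variance $V$, and the prior variance $W$, so it can be evaluated directly. The following is a minimal sketch for a fixed $W$, not the authors' implementation; the function names and the numerical inputs are assumptions made for illustration. The computation is done on the log scale because the exponent can become large with the sample sizes discussed here.

```python
import math


def log_wbf(beta_hat, v, w):
    """Natural log of the WBF in Equation (3), for H1 versus H0.

    beta_hat: MLE of the per-allele log OR
    v:        asymptotic variance of beta_hat
    w:        prior variance of the log OR, beta_1 ~ N(0, w)
    """
    if v <= 0.0 or w <= 0.0:
        raise ValueError("v and w must be positive variances")
    shrinkage_term = 0.5 * math.log(v / (v + w))
    evidence_term = (beta_hat ** 2) * w / (2.0 * v * (v + w))
    return shrinkage_term + evidence_term


def wbf(beta_hat, v, w):
    """WBF on the natural scale; may overflow for very strong signals."""
    return math.exp(log_wbf(beta_hat, v, w))


# Assumed, illustrative inputs (not values from the paper).
beta_hat = math.log(1.14)   # estimated per-allele log OR
v = 0.001                   # variance of the estimate
w = 0.04                    # prior variance
print(wbf(beta_hat, v, w))
```

When the estimate is exactly zero the exponential term equals one, so the WBF reduces to $\sqrt{V/(V+W)}$, which is below one and therefore favors the null.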
Since SNPs with lower MAFs have larger $V$ [Slager and Schaid], an appropriate prior would possess the property that $E(W)$ is a nondecreasing function of $V$. Then, as the MAF decreases, $V$ increases and rarer SNPs have a priori larger effects on average.

We also tested the use of the novel BFs as a method for filtering (narrowing down the set of candidate causal variables) in a fine‐mapping study by carrying out such an analysis on simulated datasets with known causal SNPs. We give results for scenarios in which the causal SNP has an MAF of 0.08, for ORs of 1.10, 1.14, and 1.18 and for total sample sizes of 2,000, 4,000, and 20,000 (with an equal number of cases and controls). We simulated 1,000 datasets for this scenario, and illustrate the results using receiver operating characteristic (ROC) curves. Fawcett outlines several ways to determine ROC curves when they are used to represent a summary of multiple analyses (in this case those on each of the datasets). Our ROC curves present the true and mean false‐positive rates (FPRs) over the multiple analyses, but give no indication as to the variation in FPRs between datasets. Fawcett calls the method that we use threshold averaging. [...]

The Collaborative Oncological Gene‐environment Study (COGS) Consortium have recently carried out a number of studies using a specially developed Illumina array, known as the iCOGS array [Michailidou et al.]. This was designed to fine‐map regions that had been previously identified by GWAS, by concentrating a large number of SNPs in regions of interest where there is already thought to be a causal association with breast, ovarian, or prostate cancer. One such region comprises base positions 201500074–202569992 of chromosome 2, including the CASP8 gene. In this region, 585 SNPs were originally genotyped on breast cancer case and control samples from the Breast Cancer Association Consortium and 501 passed quality control checks. A further 1,232 were successfully imputed using impute2 [Marchini and Howie], resulting in genotypes for 1,733 SNPs in 46,450 cases and 42,600 controls (total sample size: 89,050). We used both the full dataset and a subset of 5,238 individuals (2,721 cases and 2,517 controls) to assess the impact of our priors on both smaller and larger studies.

Prior to receiving the data, we carried out elicitation with a breast cancer genetics expert who had previously been involved in studies into the CASP8 region. We then determined the prior distribution that best matched their beliefs and used it to calculate BFs and carry out filtering on the genotype data from iCOGS. […]
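The threshold averaging used to summarize the simulation study can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the filtering rule (retain a SNP when its BF is at least the threshold), the function name `threshold_averaged_roc`, and the toy BF values are all hypothetical. At each threshold, the true‐ and false‐positive rates are computed within every simulated dataset and then averaged across datasets.

```python
def threshold_averaged_roc(datasets, thresholds):
    """Average ROC points across datasets at fixed BF thresholds.

    datasets:   list of (bfs, is_causal) pairs, where bfs is a list of BFs
                and is_causal is a parallel list of booleans; each dataset
                is assumed to hold at least one causal and one non-causal SNP
    thresholds: BF cut-offs; a SNP is retained when its BF >= the cut-off
    Returns one (mean FPR, mean TPR) pair per threshold.
    """
    curve = []
    for cutoff in thresholds:
        tprs, fprs = [], []
        for bfs, is_causal in datasets:
            retained = [bf >= cutoff for bf in bfs]
            n_causal = sum(is_causal)
            n_null = len(is_causal) - n_causal
            true_pos = sum(r and c for r, c in zip(retained, is_causal))
            false_pos = sum(r and not c for r, c in zip(retained, is_causal))
            tprs.append(true_pos / n_causal)
            fprs.append(false_pos / n_null)
        curve.append((sum(fprs) / len(fprs), sum(tprs) / len(tprs)))
    return curve


# Toy input: two hypothetical datasets of three SNPs, the first SNP causal.
example_datasets = [
    ([5.0, 0.5, 2.0], [True, False, False]),
    ([0.8, 3.0, 0.2], [True, False, False]),
]
for point in threshold_averaged_roc(example_datasets, [0.0, 1.0, 10.0]):
    print(point)
```

Averaging at fixed thresholds yields a single curve but, as noted above, carries no information about how much the FPR varies from one dataset to another.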

Pipeline specifications

Software tools: SNPTEST, HAPGEN, IMPUTE
Application: GWAS
Diseases: Breast Neoplasms