Computational protocol: Determining Genetic Causal Variants Through Multivariate Regression Using Mixture Model Penalty

Protocol publication

[…] The cost function shown in Equation (6) is conceptually similar to the penalty function proposed by Ročková and George (). Those authors use a mixture of Laplace pdfs and estimate the prior probability of the effects with a coordinate-wise optimization routine. We instead specify a different prior probability for each variant, so that information on LD or functional annotations can be incorporated. Furthermore, our motivation for using a mixture model stems from our understanding of the genetic architecture of the human genome.

The likelihood ratio and the cost function are weighted equally, so that the minimum-error solution is sparse. Unequal weights can also be specified: for example, a higher weight for L if the genetic architecture is known to be highly polygenic, and a lower weight if only a few genetic causal variants influence the phenotype under consideration.

Variants of the proposed method can be obtained by changing the pdfs used to construct the mixture model; for example, Laplace or non-local pdfs (Johnson and Rossell, ) could be used instead of two normal pdfs. These, however, are minor modifications, and our main contribution lies in proposing an explicit mixture model pdf as a penalty function.

Instead of considering the cost associated with individual SNPs, the SNPs can be clustered by specifying a suitable correlation structure. This could capture the underlying LD information, but it requires incorporating LD metrics such as r² into the covariance structure of the clustered SNPs. Such studies have not been pursued in the present work and will likely be part of future efforts.

Higher prior probabilities can be assigned to a cluster of SNPs in a given LD block that is expected to contain the causal signal. SNPs outside this LD block may be given a lower or zero prior probability.

SNPs that belong to certain functional annotation categories have a higher likelihood of being causal.
Hence, SNPs in these regions are deemed to be enriched, i.e., to have a higher probability of influencing a particular phenotype. Typically, SNPs tagging regulatory and coding regions are considered to be enriched in comparison with intronic and intergenic SNPs (Schork et al., ). SNPs in the MHC region can be considered to be enriched when studying immune-related diseases (Ellinghaus et al., ).

Using existing packages such as CAVIAR, DAP, PAINTOR, RiVIERA, and S-LDSC (Finucane et al., ), one can obtain a quantitative assessment of the causal nature of individual SNPs. These results can be used directly as prior probabilities (π₁) in the proposed optimization routine. Probabilities could also be based on GWAS p-values; however, these values tend to change as statistical power increases.

As mentioned earlier, for each regression-based method there exists a Bayesian equivalent. In the Bayesian methods, a prior pdf is assumed and samples are drawn from the posterior distribution using MCMC. The proposed method avoids sampling from the prior and posterior pdfs of the effect sizes by specifying the prior information explicitly as a penalty function. This distinguishes the method from the Bayesian LASSO and BSLMM.

Fine-mapping methods typically require data from dense genotyping arrays, which are further imputed using reference panels such as 1000 Genomes (1000 Genomes Project Consortium et al., ). The mixture-model method, on the other hand, uses genome-wide data to locate the causal signal. In this respect, genotype data preferred for fine-mapping studies may be unsuitable for the proposed method.

[...] Hapgen2 (Su et al., ) and 1000 Genomes (1000 Genomes Project Consortium et al., ) are used to simulate realistic genotypes for a European population of size 100,000, considering all 22 chromosomes (80,378,054 SNPs). True effect sizes are simulated based on the understanding that a proportion of the SNPs are causal, with effect sizes distributed as N(0, 1). […]
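
The penalty and the effect-size simulation described above can be illustrated with a minimal Python sketch. Equation (6) is not reproduced in this excerpt, so the sketch assumes the penalty is the negative log of a two-component normal mixture, with a narrow "null" component and a wide "causal" component weighted by a per-SNP prior probability. The function names, the standard deviations `sigma0` and `sigma1`, the weights `w_lik` and `w_pen`, the unit-variance Gaussian likelihood, and the binomial toy genotypes are all assumptions made for illustration; they are not taken from the protocol and do not stand in for Hapgen2 output.

```python
import numpy as np


def simulate_effects(n_snps, causal_fraction, rng):
    """Draw true effects: a fraction of SNPs is causal with N(0, 1) effects, the rest are 0."""
    beta = np.zeros(n_snps)
    n_causal = int(round(causal_fraction * n_snps))
    causal_idx = rng.choice(n_snps, size=n_causal, replace=False)
    beta[causal_idx] = rng.normal(0.0, 1.0, size=n_causal)
    return beta, causal_idx


def mixture_penalty(beta, pi1, sigma0=0.01, sigma1=1.0):
    """Negative log of a two-normal mixture pdf, summed over SNPs.

    pi1 holds the per-SNP prior probability of being causal (assumed form;
    sigma0 and sigma1 are hypothetical component standard deviations).
    """
    beta = np.asarray(beta, dtype=float)
    pi1 = np.broadcast_to(np.asarray(pi1, dtype=float), beta.shape)

    def log_normal_pdf(x, sd):
        return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * (x / sd) ** 2

    # log(0) is allowed here: a prior of exactly 0 or 1 switches one component off.
    with np.errstate(divide="ignore"):
        log_null = np.log1p(-pi1) + log_normal_pdf(beta, sigma0)
        log_causal = np.log(pi1) + log_normal_pdf(beta, sigma1)
    return -np.sum(np.logaddexp(log_null, log_causal))


def objective(beta, X, y, pi1, w_lik=1.0, w_pen=1.0):
    """Weighted sum of a Gaussian negative log-likelihood (unit noise variance
    assumed) and the mixture penalty; equal weights by default."""
    resid = y - X @ beta
    neg_log_lik = 0.5 * float(resid @ resid)
    return w_lik * neg_log_lik + w_pen * mixture_penalty(beta, pi1)


# Toy usage on a small problem (not the genome-wide setting of the protocol).
rng = np.random.default_rng(0)
beta_true, causal_idx = simulate_effects(n_snps=200, causal_fraction=0.05, rng=rng)
X = rng.binomial(2, 0.3, size=(500, 200)).astype(float)  # toy genotype counts
y = X @ beta_true + rng.normal(size=500)

pi1 = np.full(200, 0.01)   # low default prior for every SNP
pi1[causal_idx] = 0.5      # pretend LD or annotation information flagged these SNPs
print(objective(beta_true, X, y, pi1))
```

Raising `pi1` for a SNP lowers the penalty on a large effect at that SNP, which is how per-variant prior information from LD blocks or functional annotations would enter an objective of this assumed form.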

Pipeline specifications

Application: GWAS