Computational protocol: Assessment of Whole-Genome Regression for Type II Diabetes

Protocol publication

[…] Covariates baseline model (CBM). The CBM included non-genetic covariates only. Its linear predictor is

$$\eta_i = \alpha_0 + \alpha_1 s_i + \alpha_2 c_i + \alpha_3 l_i,$$

where $\eta_i$ is the sum of an intercept ($\alpha_0$) plus regressions on the ‘fixed effects’ of sex ($s_i$, a dummy variable), cohort ($c_i$, a dummy variable indicating whether the participant was from the Original or the Offspring cohort), and age at last contact or death ($l_i$, ranging from 34 to 104), which controls for different exposure times or observational periods; $\alpha = (\alpha_1, \ldots, \alpha_3)'$ are the corresponding regression coefficients. The FHS sample includes subjects from two cohorts that started in different years, so a few of the measures could have been collected under different protocols and by different data collection teams. We included cohort information in the models to correct for these factors.

65-SNP Model (M-65SNP). The CBM was first extended by adding 2 marker-derived principal components (PC1 and PC2) and 65 SNPs that have been consistently associated with T2DM. The PCs were derived from 1,000 European-ancestry informative markers reported by []. This model serves as a second benchmark, or baseline, against which to compare the WGRs. Its linear predictor is

$$\eta_i = \alpha_0 + \alpha_1 s_i + \alpha_2 c_i + \alpha_3 l_i + \alpha_4 \mathrm{PC1}_i + \alpha_5 \mathrm{PC2}_i + \sum_{j=1}^{65} \gamma_j x_{ij},$$

where $\alpha_4$ and $\alpha_5$ are the regression coefficients associated with PCs 1 and 2, respectively; $x_{ij}$ is the genotype of the $i$th individual ($i = 1, \ldots, 5{,}245$) at the $j$th marker ($j = 1, \ldots, 65$), expressed as the count of one of the two alleles, $x_{ij} \in \{0, 1, 2\}$, or, for imputed SNPs, as a real number $x_{ij} \in [0, 2]$; and $\gamma = \{\gamma_j\}$ are the marker effects. SNPs absent from the platform were imputed with IMPUTE2 using 1,092 subjects from the 1,000 Genomes data, as reported previously [–]. [...]

Subsequently, we evaluated several WGRs that used the CBM as the base and added genomic effects ($u_i$) modeled with whole-genome markers.
The whole-genome markers comprised a high-density array of $p = 249{,}798$ SNPs, on which a function of the phenotype evaluated in this study was regressed. SNP effects were included in the models using Bayes A, Bayes Cπ, the Bayesian LASSO (BL), or G-BLUP. With the CBM as the base, the linear predictor for these models can be written in general as

$$\eta_i = \alpha_0 + \alpha_1 s_i + \alpha_2 c_i + \alpha_3 l_i + u_i.$$

In addition to the joint conditional probability of the data given the unknown coefficients, a Bayesian model requires a prior density for the unknowns. The prior for $\alpha$ was flat, i.e. $p(\alpha) \propto 1$, which yields estimates of these effects comparable to those obtained with maximum likelihood. The genomic effect term $u_i$ differs among the Bayesian models evaluated; the definition of $u_i$ and its prior probability complete each model. We describe them below for each Bayesian model evaluated.

Bayes A (BA). In Bayes A models [], $u_i = \sum_{j=1}^{p} x_{ij}\beta_j$ and the prior density of the SNP effects is assumed to follow a t distribution, $T(\beta_j \mid df_\beta, S_\beta)$ ($j = 1, \ldots, p$), which can be re-expressed as

$$\int N(\beta_j \mid 0, \sigma_{\beta_j}^2)\, \chi^{-2}(\sigma_{\beta_j}^2 \mid df_\beta, S_\beta)\, d\sigma_{\beta_j}^2,$$

where $\sigma_{\beta_j}^2$ is the variance of the effect of the $j$th marker; see []. Thus, the conditional distribution of the marker effect $\beta_j$ is normal with mean 0 and variance $\sigma_{\beta_j}^2$, and at the next level of the hierarchy this variance is assigned a scaled-inverse chi-squared distribution. The hyper-parameters of the scaled-inverse chi-squared distribution were set according to the rules given in [] and implemented in the BGLR package.

Bayes Cπ (BC). In Bayes C models [], $u_i = \sum_{j=1}^{p} x_{ij}\beta_j$ and the prior for the marker effects is a two-component mixture: one component is a point mass at zero and the other is a normal distribution. The prior for the marker effects in this model can be written as

$$p(\beta_j \mid \pi, \sigma_\beta^2) = (1 - \pi) \times 1(\beta_j = 0) + \pi \times N(\beta_j \mid 0, \sigma_\beta^2),$$

where $\pi$ is the proportion of markers with non-null effects, and the prior assigned to $\pi$ was $Beta(p_0, \pi_0)$ (see []).
We assigned a scaled-inverse chi-squared distribution to $\sigma_\beta^2$; the corresponding hyper-parameters were set using the rules given in de los Campos et al. (2013).

Bayesian LASSO (B-LASSO). In the Bayesian LASSO [], $u_i = \sum_{j=1}^{p} x_{ij}\beta_j$ and the prior density of the SNP effects can be expressed as $N(\beta_j \mid 0, \tau_j^2)$, where the prior distribution of $\tau_j^2$ is exponential, i.e. $Exp(\tau_j^2 \mid \lambda^2)$, and the prior density assigned to $\lambda^2$ is a gamma distribution, $G(\lambda^2 \mid \delta_1, \delta_2)$, with the rate hyper-parameter set to 0.0001 and the shape to 0.55 (for further details on the priors for this model see []).

G-BLUP. In this model [], $u = \{u_i\}$ is a random effect in the regression whose distribution is $N(u \mid 0, G\sigma_u^2)$, where $G = \{G_{ii'}\}$ is an $n \times n$ matrix of relationships based on the $p$ SNP genotypes such that

$$G_{ii'} = \frac{1}{n} \times \frac{\sum_{j=1}^{p} (x_{ij} - 2q_j)(x_{i'j} - 2q_j)}{2\sum_{j=1}^{p} q_j(1 - q_j)},$$

where $q_j$ is the estimated allele frequency at the $j$th marker and $\sigma_u^2$ is an ‘additive’ genetic variance parameter. We assigned a scaled-inverse chi-squared distribution to $\sigma_\beta^2$; the corresponding hyper-parameters were set using the rules given in de los Campos et al. (2013). The marker effects were obtained with the equivalent Bayesian Ridge Regression model, as described elsewhere [].

The parameters of the models described above were estimated in a Bayesian framework using the BGLR package [,] in R []. The priors used were relatively non-influential []. We ran 40,000 MCMC iterations, of which the first 15,000 were discarded as burn-in. Convergence was assessed by visual inspection of the trace plots. [...]

The covariates included in all models were selected based on significance, evaluated with a generalized linear model (the glm function) in base R []. Models were compared based on effect estimates from published GWAS. We present the distribution of effects and scatter plots of these estimates for each model. Additionally, we assessed the models' prediction accuracy using 10-fold cross-validation.
Since Framingham is a family-based study, we randomly assigned entire families, according to the pedigree, to folds, so that when a model is trained neither the subject to be predicted nor any other subject in the same fold (which includes every member of that subject's family) is used to fit the predictive model. The testing sets of the cross-validation yielded predicted risk scores $\{\hat{\eta}_i\}$, each derived without using the $i$th observation or any relative of the $i$th observation. Since the realization of diabetes ($y_i$) is a binary response (0/1), it is more appropriate to report results in terms of the false positive rate and the Area Under the Receiver Operating Characteristic Curve (AUC, see []). The AUC was computed from the pairs $\{y_i, \hat{\eta}_i\}$, formed by the observed presence/absence of diabetes and the risk score predicted in cross-validation. We estimated these statistics using the R package ROCR []. […]
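
The family-wise cross-validation and the AUC described above can be illustrated with a minimal Python sketch. It is not the authors' code (they worked in R with ROCR). The function names `assign_family_folds` and `auc`, the toy family labels, and the choice to shuffle families and deal them round-robin into folds are all assumptions made for illustration. The AUC is computed through its rank (Mann-Whitney) form, which is one standard way to obtain the area under the ROC curve.

```python
import random


def assign_family_folds(family_ids, n_folds=10, seed=0):
    """Return one fold label per subject; a family never spans two folds.

    family_ids: one (hypothetical) family label per subject.
    """
    families = sorted(set(family_ids))
    rng = random.Random(seed)
    rng.shuffle(families)
    # Assumption: deal the shuffled families round-robin into the folds.
    fold_of_family = {fam: k % n_folds for k, fam in enumerate(families)}
    return [fold_of_family[fam] for fam in family_ids]


def auc(y, scores):
    """AUC as the probability that a case outscores a control.

    y: 0/1 disease status; scores: predicted risk scores.
    Tied scores receive the average rank, so a tie counts as 1/2.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of subjects sharing the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    rank_sum_pos = sum(r for r, yi in zip(ranks, y) if yi == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data: seven subjects from four families, split into three folds.
    toy_families = ["F1", "F1", "F2", "F3", "F3", "F3", "F4"]
    print(assign_family_folds(toy_families, n_folds=3))
    # Three of the four case-control pairs are ordered correctly: AUC = 0.75.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In a full analysis, each fold would in turn be held out, the model fitted on the remaining folds, and the held-out risk scores collected before a single call to `auc` on all pairs of status and predicted score.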

Pipeline specifications

Software tools: IMPUTE, BGLR, ROCR
Applications: Miscellaneous, GWAS
Diseases: Diabetes Mellitus, Type 2