Computational protocol: Gene-based partial least-squares approaches for detecting rare variant associations with complex traits

Protocol publication

[…] The general idea behind the proposed GBPLS approach is summarized in the diagram in the accompanying figure. The approach specifies two sets of relationships: (1) the outer model, which links the SNPs within a given gene to a latent variable (LV; the Z variables in the figure) by aggregating them through a projection; and (2) the inner model, which specifies the relationships between the predictors (LVs, non-SNP covariates, and the first 10 PCs) and the trait $Q_j$ (j = 1, 2). The coefficients of the outer and inner models are called outer and inner coefficients, respectively. We consider two specific GBPLS algorithms, GBPLS1 and GBPLS2. Both algorithms consist of two stages, stage 1 and stage 2, which calculate the outer and inner weights, respectively; they differ in the methods used for these calculations, as described below.

In the GBPLS1 algorithm, partial least-squares path modeling (PLSPM) is used to calculate the outer coefficients. PLSPM is a statistical method developed for the analysis of structural equation models with latent variables. A formal presentation of the partial least-squares approach to latent variable path models is given by Wold []; a more recent reference is Tenenhaus et al. []. In the PLSPM used for the GBPLS1 algorithm, genotype information from the SNPs within the same gene is combined into a single LV (i.e., a gene score) by constructing a linear function of the SNPs in gene i (i = 1, 2, …, 100). The resulting coefficients of the SNPs are the outer weights (coefficients).

In the second stage of the GBPLS1 approach, the inner model coefficients are estimated by ordinary least squares in the multiple regression model given by:

$$Q_j = \beta_0 + \sum_{i=1}^{t} \beta_i Z_i + \sum_{k} \gamma_k C_k + \sum_{l=1}^{10} \delta_l \mathrm{PC}_l + \varepsilon \tag{1}$$

for j = 1, 2, where $Z_i$ is the LV for gene i, t is the number of genes (t = 100 here), $C_k$ are the non-SNP covariates, and $\mathrm{PC}_l$ are the first 10 PCs.
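
As a rough illustration of the second stage of GBPLS1, the sketch below forms gene scores as linear combinations of SNP genotypes and fits the inner model by ordinary least squares. It is written in plain Python rather than the R packages used in the protocol; the function names, the toy genotypes, the outer weights, and the single covariate `age` are all assumptions made for the example, and the outer weights are taken as given rather than estimated by PLSPM.

```python
# Illustrative sketch only: names and data are hypothetical, not from the protocol.

def gene_score(genotypes, outer_weights):
    """Gene score per individual: weighted sum of that gene's SNP genotypes."""
    return [sum(g * w for g, w in zip(row, outer_weights)) for row in genotypes]


def ols(columns, y):
    """Ordinary least squares with an intercept, via the normal equations.

    columns: list of predictor columns (each a list of n values).
    Returns [intercept, coef_1, ..., coef_p].
    """
    n = len(y)
    design = [[1.0] + [col[r] for col in columns] for r in range(n)]
    p = len(design[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(design[r][i] * design[r][j] for r in range(n)) for j in range(p)]
         + [sum(design[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            raise ValueError("singular design: too many predictors for the data")
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Toy data: 6 individuals, two genes with two SNPs each, one non-SNP covariate.
gene1 = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [0, 2]]
gene2 = [[1, 0], [0, 0], [0, 1], [1, 1], [2, 0], [0, 1]]
z1 = gene_score(gene1, [0.6, 0.8])  # outer weights assumed already estimated
z2 = gene_score(gene2, [1.0, 0.5])
age = [30.0, 45.0, 50.0, 38.0, 60.0, 41.0]
# Noise-free trait built from known coefficients, so the fit can be checked.
trait = [1.0 + 2.0 * a - 0.5 * b + 0.3 * c for a, b, c in zip(z1, z2, age)]

coef = ols([z1, z2, age], trait)  # [intercept, beta_1, beta_2, covariate coef]
```

The `ValueError` branch reflects the limitation noted in the text: once the number of gene scores approaches or exceeds what the sample can support, the normal equations become singular and least squares breaks down.
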
We then order the absolute values of the inner model coefficients for the gene scores (i.e., $\beta_i$, i = 1, 2, …, t) to identify the q most important gene scores in explaining the trait $Q_j$, where q is taken to be 25, 35, and 50 in the analysis. To determine the relative importance of the SNPs in the construction of the q most important gene scores, we order the absolute values of the outer weights and record the corresponding ranks (called SNP ranks) for each SNP.

Although the GBPLS1 approach can handle 100 genes, it runs into computational problems with more than 100 genes because the algorithm relies on the ordinary least-squares estimator, which fails when the dimensionality is large. To remedy this problem, we propose a similar algorithm, GBPLS2, which uses partial least-squares regression and penalized regression to calculate the outer and inner weights, respectively, and can therefore handle higher dimensions.

Partial least-squares regression aims to derive orthogonal latent components using the cross-covariance matrix between the response variable and the explanatory variables []. We calculate the outer weights $w_i$ and gene scores $T_i$ by solving the following maximization problem:

$$w_i = \arg\max_{w:\, w^\top w = 1} \operatorname{Cov}^2(X_i w, y) \tag{2}$$

where y is the trait and the gene score is taken to be the projection of $X_i$, the genotype matrix of the SNPs in gene i, on the outer coefficient vector $w_i$, that is,

$$T_i = X_i w_i \tag{3}$$

for t = 100, 250, 500, 750 and i = 1, 2, …, t.

In the second stage, we regress the traits on the gene scores $T_i$, the non-SNP covariates, and the first 10 PCs using the LASSO (least absolute shrinkage and selection operator) penalty [], with only the gene scores penalized. The penalty parameter is determined for each replicate by 10-fold cross-validation.
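
The two stages of GBPLS2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the plsgenomics/glmnet code used in the protocol: the outer weights are taken as the first PLS direction (the unit vector proportional to the covariance between each SNP and the trait), the LASSO is solved by a simple coordinate descent without glmnet's standardization, and the penalty value `lam` is fixed by hand rather than chosen by 10-fold cross-validation. All names and data are hypothetical.

```python
# Illustrative sketch only: names, data, and the fixed penalty are assumptions.

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def pls_outer_weights(genotypes, y):
    """First PLS direction: unit-norm w proportional to X'y (centered data)."""
    cols = [center(list(c)) for c in zip(*genotypes)]
    yc = center(y)
    w = [sum(x * t for x, t in zip(col, yc)) for col in cols]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w] if norm > 0 else [0.0] * len(w)


def gene_score(genotypes, w):
    """Projection of the gene's genotype rows on the outer weight vector."""
    return [sum(g * v for g, v in zip(row, w)) for row in genotypes]


def soft_threshold(z, g):
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(columns, y, lam, penalty_factor, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum(pf_j * |b_j|).

    Columns and y are centered, so the intercept is implicit.
    penalty_factor[j] = 0 leaves coefficient j unpenalized.
    """
    n = len(y)
    xs = [center(c) for c in columns]
    resid = center(y)
    beta = [0.0] * len(xs)
    for _ in range(n_iter):
        for j, x in enumerate(xs):
            ss = sum(v * v for v in x) / n
            if ss == 0.0:
                continue
            rho = sum(v * r for v, r in zip(x, resid)) / n + ss * beta[j]
            new = soft_threshold(rho, lam * penalty_factor[j]) / ss
            if new != beta[j]:
                resid = [r - (new - beta[j]) * v for r, v in zip(resid, x)]
                beta[j] = new
    return beta


# Toy data: 6 individuals, two genes with two SNPs each, one covariate.
gene_a = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [0, 2]]
gene_b = [[1, 0], [0, 0], [0, 1], [1, 1], [2, 0], [0, 1]]
trait = [1.2, 0.9, 3.1, 0.2, 2.0, 2.5]
age = [30.0, 45.0, 50.0, 38.0, 60.0, 41.0]

w_a = pls_outer_weights(gene_a, trait)           # stage 1: outer weights
w_b = pls_outer_weights(gene_b, trait)
t_a, t_b = gene_score(gene_a, w_a), gene_score(gene_b, w_b)

pf = [1.0, 1.0, 0.0]                             # penalize gene scores only
beta = lasso_cd([t_a, t_b, age], trait, lam=0.1, penalty_factor=pf)
beta_heavy = lasso_cd([t_a, t_b, age], trait, lam=1e6, penalty_factor=pf)
```

With a very large penalty (`beta_heavy`), both gene-score coefficients are shrunk to exactly zero while the unpenalized covariate keeps its coefficient, which is the behavior the selective penalty is meant to produce.
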
The outcomes of the GBPLS2 algorithm are the genes with nonzero inner coefficients and the rankings of the corresponding outer coefficients for the SNPs within these genes.

We carried out our GBPLS analyses using the R packages plspm (with the factor scheme), glmnet, and plsgenomics, which were downloaded from [http://cran.r-project.org/]. […]
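
The selection rules of the two algorithms (the q genes with the largest absolute inner coefficients for GBPLS1, the genes with nonzero LASSO coefficients for GBPLS2, and SNP ranks by absolute outer weight in both) are simple enough to sketch directly. The helper names, gene and SNP labels, and coefficient values below are hypothetical and serve only to illustrate the bookkeeping.

```python
# Illustrative sketch only: labels and values are hypothetical.

def top_q_genes(inner_coefs, q):
    """GBPLS1 rule: the q genes with the largest |inner coefficient|."""
    ordered = sorted(inner_coefs, key=lambda g: abs(inner_coefs[g]), reverse=True)
    return ordered[:q]


def nonzero_genes(inner_coefs):
    """GBPLS2 rule: genes whose LASSO inner coefficient is nonzero."""
    return [g for g, b in inner_coefs.items() if b != 0.0]


def snp_ranks(outer_weights):
    """Rank SNPs within a gene by decreasing |outer weight| (rank 1 = largest)."""
    ordered = sorted(outer_weights, key=lambda s: abs(outer_weights[s]), reverse=True)
    return {snp: rank for rank, snp in enumerate(ordered, start=1)}


inner = {"geneA": 0.0, "geneB": -1.3, "geneC": 0.4}
outer = {"geneB": {"snp1": 0.2, "snp2": -0.9, "snp3": 0.4}}

selected_q = top_q_genes(inner, 2)
selected_lasso = nonzero_genes(inner)
ranks_b = snp_ranks(outer["geneB"])
```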

Pipeline specifications