Computational protocol: Multivariate Methods for Genetic Variants Selection and Risk Prediction in Cardiovascular Diseases

Similar protocols

Protocol publication

[…] The analysis of binary traits offers several alternatives that draw from both frequentist and Bayesian methods (Table ). In order to identify informative sets of genetic and non-genetic variables expected to jointly affect a disease phenotype, stepwise logistic regression is one of the most consolidated approaches. The first step of this approach consists in testing simultaneously an initial set of SNPs in a logistic regression model as predictors of disease status which is represented by the binary-dependent variable. Then, different models are subsequently compared with the initial model to estimate whether a different set of predictors improved the fit, which is measured by goodness of fit metrics such as deviance or log-likelihood (). Identifying the optimal model can be performed by a forward search strategy (the selection starts with the intercept of the regression, and then sequentially adds into the model the predictor that most improves the fit), a backward search strategy (it starts by including all variables, and sequentially deletes the predictor that has the lowest impact on the fit), or a combination of both (). However, it is important to consider that this approach may prove computationally intensive when large sets of variables need to be analyzed, making the task of feature selection difficult.The Least Absolute Shrinkage and Selection Operator (LASSO) () is a shrinkage method that represents a sound alternative to stepwise regression for the identification of informative genetic variants. The LASSO approach silences non-informative variables by setting their regression coefficient to 0 through a penalty parameter called lambda (λ). The optimal value to be assigned to λ can be learned by a resampling strategy performed on the data: the value guaranteeing the lowest average classification error on the test sets will be applied to the regression model. Vaarhorst and colleagues () used LASSO to identify predictors of coronary heart disease (CHD), starting from a set of candidate variants, whereas Hughes and colleagues () applied the algorithm to the identification of genetic variants to define a risk score for coronary risk prediction. The elastic net () is an extension of the LASSO that is robust to extreme correlations among predictors, which also provides a more efficient, effective system for handling the analysis of unbalanced datasets.Bayesian methods, such as the binary outcome stochastic search (BOSS) () and bags of naive Bayes (BoNB) () algorithms, also provide alternative approaches. BOSS is a feature selection approach deriving from the method described in Ref. () based on a latent variable model that links the observed outcome to the underlying genetic variants mapping to candidate regions of interest. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor in determining the latent variable profile (). A latent variable profile is defined as a stochastic vector of same size of the number of SNPs; the vector may assume 0/1 values, thus expressing the fact that a marker is considered (value equal to 1) or not (value equal to 0) as a predictor of the outcome. The model estimates the posterior probability of such latent variable; as a consequence, the most likely latent variable will determine the set of SNPs with the highest risk prediction potential for developing a disease. BoNB () is an algorithm for genetic biomarkers selection from the simultaneous analysis of genome-wide SNP data based on the naive Bayes (NB) () classification framework. The predictive value (marginal utility) of each genetic variant is assessed by a resampling strategy. By randomly shuffling the genotypes of an informative variant, an overall decrease in terms of classification accuracy will be observed, and if an uninformative variant is permuted, no substantial loss will be observed. This strategy, coupled with appropriate statistical tests, allows BoNB to identify informative sets of SNPs. These methods have been tested on real datasets on type 1 (, ) and type 2 diabetes (), respectively.Classification and regression trees (RTs) methods () fall under the category of decision tree learning. In these tree structures, leaves represent the predicted phenotypic outcome, whereas nodes and branches represent the set of genetic variants and clinical covariates that predict the phenotypic outcome. These methods recursively partition data into subsets according to the variables’ values: each partition corresponds with a “split” based on the set of variables being considered, defining a tree-like structure (). Classification trees (CTs) are designed to analyze categorical traits and facilitate the identification of informative interactions between variables and stratifications in the data starting from a limited numbers of predictors.Random forests (RFs) () are based on CTs, as they aggregate a large collection of de-correlated trees, and then average them (). RFs generate a multivariate ranking of the analyzed variables according to their predictive importance with respect to the outcome. Even more, they can be easily applied to analyze unbalanced datasets, and they are able to account for correlation and informative interactions among features. Such characteristics make this approach particularly appealing for high-dimensional genomic data analysis (). RFs have been applied to identify genetic variants influencing coronary artery calcification in hypertensive subjects (), bicuspid aortic valve condition (), and high-density lipoprotein (HDL) cholesterol level (). Maenner and colleagues () applied RFs to identify SNPs involved in gene-by-smoking interactions related to the early-onset of CHD using the Framingham Heart Study data.ABACUS is an Algorithm based on a BivAriate CUmulative Statistic, which allows identifying combinations of common and rare genetic variants associated with a disease by focusing on predefined SNPs-sets (e.g., belonging to specific pathways) (). ABACUS calculates a statistic for each pair of SNPs within each SNPs set and generates an aggregated score measuring the cumulative evidence of association of the SNPs annotated in the SNP set. This method has been tested on GWAS on type 1 and type 2 diabetes ().Specific implementations of LASSO, elastic net, CTs, RFs, and stepwise Cox proportional hazard regression () have been also proposed for the identification of SNPs associated with time to event outcomes (Table ). […]

Pipeline specifications

Software tools Lasso, ABACUS
Application GWAS
Diseases Cardiovascular Diseases