Computational protocol: Kernel canonical correlation analysis for assessing gene–gene interactions and application to ovarian cancer

Protocol publication

[…] To assess the properties of our KCCA procedure for gene–gene interaction testing, we conducted simulation studies that evaluate type-I error control and power. We generated a population of haplotypes for two genes using real genotype data from our case study. fastPHASE was applied to the genotypes from the controls to estimate haplotypes for two genes of comparable size (25.9 and 30.6 kb), and HapSim was then used to simulate 10 000 haplotypes for each gene. The two genes contained 79 and 92 polymorphic sites, respectively. Genotype data for a hypothetical individual were simulated by combining pairs of randomly selected haplotypes for each gene.
Let […] represent the derived minor allele frequency (MAF) from our simulated haplotype populations for marker j in gene i. Next, we randomly selected a fixed number of common ([…]) markers to be causal for each gene. We then defined effect sizes using an approach similar to that of Wu et al., in which effects are a function of the MAF. Let […] for an interaction effect, such that […]. Here τ defines the maximum possible interaction effect. For example, for two markers with MAF=0.20 and τ=5, the interaction effect is […].
Given the difficulty of genome-wide detection of gene–gene interactions in the absence of marginal effects, we only considered disease models with solely epistatic effects. We defined the probability of being a case conditional on genotype via a logistic regression framework, such that […], where Ω1 and Ω2 define the subsets of causal markers and […] represents the disease prevalence. Cases and controls were then sampled from a sufficiently large number of simulated genotype-phenotype pairs.
To comparatively evaluate the performance of our KCCA method, we included additional methods based on similar analysis principles: the original CCA-based approach, PC-based logistic regression (PC-LR), and a composite-LD method (CLD).
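The genotype simulation and case-control sampling steps described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the random toy haplotype pools (standing in for HapSim output), the causal marker indices, the interaction effect matrix `beta`, the use of the logit of the prevalence as the intercept, and the 500/500 case-control split are all hypothetical choices made for the example. The MAF-based effect size formula is not reproduced here, so the effects are passed in as a plain matrix.

```python
import numpy as np


def simulate_genotypes(hap_pool, n, rng):
    """Draw n individuals; each genotype is the sum of two randomly selected haplotypes."""
    idx = rng.integers(0, hap_pool.shape[0], size=(n, 2))
    return hap_pool[idx[:, 0]] + hap_pool[idx[:, 1]]


def case_probability(g1, g2, causal1, causal2, beta, intercept):
    """Logistic disease model with interaction (epistatic) terms only.

    beta[a, b] is the assumed interaction effect between causal marker
    causal1[a] in gene 1 and causal marker causal2[b] in gene 2.
    """
    x1 = g1[:, causal1].astype(float)
    x2 = g2[:, causal2].astype(float)
    # Linear predictor: intercept + sum over causal pairs of beta * x1 * x2.
    eta = intercept + np.einsum("na,ab,nb->n", x1, beta, x2)
    return 1.0 / (1.0 + np.exp(-eta))


def sample_case_control(prob, n_cases, n_controls, rng):
    """Assign phenotypes by Bernoulli draws, then sample cases and controls."""
    status = rng.random(prob.shape[0]) < prob
    cases = rng.choice(np.flatnonzero(status), size=n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(~status), size=n_controls, replace=False)
    return cases, controls


rng = np.random.default_rng(0)

# Toy haplotype pools (10 000 haplotypes; 79 and 92 sites), NOT real HapSim output.
pool1 = (rng.random((10_000, 79)) < 0.2).astype(np.int8)
pool2 = (rng.random((10_000, 92)) < 0.2).astype(np.int8)

n_population = 20_000
g1 = simulate_genotypes(pool1, n_population, rng)
g2 = simulate_genotypes(pool2, n_population, rng)

causal1 = np.array([3, 10])   # hypothetical causal markers in gene 1
causal2 = np.array([5, 40])   # hypothetical causal markers in gene 2
beta = np.full((2, 2), 0.5)   # hypothetical interaction effects

prevalence = 0.10             # assumed baseline disease prevalence
intercept = np.log(prevalence / (1.0 - prevalence))

prob = case_probability(g1, g2, causal1, causal2, beta, intercept)
cases, controls = sample_case_control(prob, 500, 500, rng)
```

Setting every entry of `beta` to zero gives a null model in which case status is independent of genotype, which is one simple way to generate data for a type-I error check.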
PC-LR obtains principal components from the SNP measurements for each gene, which are then fit in a simple logistic regression model. The CLD method is a covariance-based approach that evaluates the difference between block interactions across case-control status. Additional details for each approach can be found in their respective publications. For PC-LR, we evaluate the significance of the interaction coefficient between the first principal components of the two genes; for the CLD approach, we use 5000 permutations to characterize the reference distribution of the test statistic. All declarations of statistical significance are made at an α-level of 0.05. For both type-I error and power simulations, we consider whether or not explicit marginal effects are included in the disease model. Each simulation scenario is conducted with case-control status sample sizes of 500, 1000, and 1500, with a total of 1000 iterations each. […]
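The PC-LR comparison and the summary of results across iterations can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the logistic model is fit with a hand-written Newton-Raphson routine, and the use of a two-sided Wald test for the interaction coefficient is an assumption, since the text does not state which test was applied. Standardizing the principal component scores is likewise a convenience choice for numerical stability.

```python
import math

import numpy as np


def first_pc(genotypes):
    """Scores on the first principal component of a column-centred genotype matrix."""
    centred = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0] * s[0]


def logistic_fit(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson logistic regression; returns coefficients and standard errors."""
    y = np.asarray(y, dtype=float)
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ coef))
        w = p * (1.0 - p)
        information = X.T @ (X * w[:, None])
        step = np.linalg.solve(information, X.T @ (y - p))
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-X @ coef))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return coef, np.sqrt(np.diag(cov))


def pc_lr_interaction_pvalue(g1, g2, y):
    """Two-sided Wald p-value (assumed test) for the PC1 x PC1 interaction term."""
    z1 = first_pc(g1)
    z2 = first_pc(g2)
    z1 = z1 / z1.std()
    z2 = z2 / z2.std()
    X = np.column_stack([np.ones(len(z1)), z1, z2, z1 * z2])
    coef, se = logistic_fit(X, y)
    wald_z = coef[3] / se[3]
    return math.erfc(abs(wald_z) / math.sqrt(2.0))


def rejection_rate(pvalues, alpha=0.05):
    """Proportion of iterations declared significant at level alpha."""
    return float(np.mean(np.asarray(pvalues) < alpha))
```

Applied to p-values collected over the simulation iterations, `rejection_rate` estimates the type-I error rate when the data are generated under the null and the power when they are generated under an alternative.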

Pipeline specifications

Software tools: fastPHASE, HapSim
Applications: Population genetic analysis, GWAS
Diseases: Ovarian Neoplasms