Computational protocol: Multivariate Dimensionality Reduction Approaches to Identify Gene-Gene and Gene-Environment Interactions Underlying Multiple Complex Traits

Protocol publication

[…] The datasets used for the analyses described in this article were obtained from the database of Genotypes and Phenotypes (dbGaP). The study protocol and all forms and procedures were approved by the Institutional Review Board of the University of Alabama at Birmingham.

In the original MDR method for a case-control study, a set of m attributes, either genes or discrete environmental factors, is chosen to span an m-dimensional contingency table. Each subject is allocated to a cell in this m-dimensional space based on the observed values of these attributes, and every nonempty cell is labeled “high-risk” if the ratio of cases to controls in the cell is larger than a pre-specified threshold, or “low-risk” otherwise. A new dichotomous attribute (i.e., a classification model) is formed by pooling the high-risk cells into a high-risk group and the low-risk cells into a low-risk group, thus reducing the data from m dimensions to one. The resulting model is evaluated on its ability to classify the phenotype; accuracy, defined as the proportion of correct classifications (i.e., cases in the high-risk group and controls in the low-risk group), is a commonly used measure. Cross-validation and/or permutation testing can be integrated into this process to evaluate the model, and the optimal subset(s) of features can be selected according to classification ability, as measured by accuracy or a derived quantity such as a p-value.

While sharing the same variable construction algorithm as MDR, GMDR uses a general statistic, instead of affection status, to classify the two divergent groups. The statistic of an individual corresponding to a certain cell in a given contingency table can generally be expressed as the product of the individual's membership coefficient for that cell and the individual's residual under the null hypothesis; each is elaborated in the following subsections. [...] 
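
The cell-labeling and pooling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GMDR software's implementation: the function names, the default case-to-control threshold of 1.0, and the zero cutoff on summed scores in the score-based variant are all hypothetical choices made for the example.

```python
from collections import defaultdict


def mdr_label_cells(genotypes, is_case, threshold=1.0):
    """Label every nonempty cell True (high-risk) or False (low-risk).

    genotypes: one tuple per subject, e.g. (0, 2) for a two-attribute cell.
    is_case:   one truthy/falsy value per subject.
    threshold: assumed case-to-control ratio cutoff (illustrative default).
    """
    cases = defaultdict(int)
    controls = defaultdict(int)
    for cell, case in zip(genotypes, is_case):
        if case:
            cases[cell] += 1
        else:
            controls[cell] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        # "ratio > threshold" written without division, so a cell with
        # cases but no controls is labeled high-risk instead of failing.
        labels[cell] = cases[cell] > threshold * controls[cell]
    return labels


def gmdr_label_cells(genotypes, scores, cutoff=0.0):
    """Score-based variant: a cell is high-risk if its summed score exceeds
    the cutoff. The cutoff of 0 is an assumption made for this sketch."""
    totals = defaultdict(float)
    for cell, score in zip(genotypes, scores):
        totals[cell] += score
    return {cell: total > cutoff for cell, total in totals.items()}


def classification_accuracy(genotypes, is_case, labels):
    """Proportion of subjects whose pooled group matches their status:
    cases in high-risk cells plus controls in low-risk cells."""
    correct = sum(
        labels[cell] == bool(case) for cell, case in zip(genotypes, is_case)
    )
    return correct / len(is_case)
```

In this sketch the accuracy is computed on the same subjects used to label the cells; in a cross-validated analysis the labels would come from the training folds and the accuracy from the held-out fold, where cells unseen in training would need separate handling.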
To illustrate the use of the GEE-GMDR approach proposed here, a real data set from the Study of Addiction: Genetics and Environment (SAGE) was analyzed to identify interactions among genes. Most SAGE samples were unrelated, apart from a few siblings. After quality control, a total of 3,897 individuals were obtained from three subsamples: the Collaborative Study on the Genetics of Alcoholism (COGA) (1,178 individuals), the Collaborative Study on the Genetics of Nicotine Dependence (COGEND) (1,427 individuals), and the Family Study of Cocaine Dependence (FSCD) (1,292 individuals). Using the Illumina Human 1M platform, 1,069,796 SNP markers were genotyped for each participant. Self-reported ethnicities indicate that about 35% of the participants are black and 65% are white. Detailed genotype information and demographic characteristics of the SAGE cohort can be obtained from the database of Genotypes and Phenotypes (dbGaP) through dbGaP accession number phs000092.v1.p. Three common measures of nicotine dependence (ND) were selected from the recorded traits: the lifetime score on the Fagerström Test for Nicotine Dependence (FTND), the DSM4 Nicotine Dependence (DSM4ND), and the largest number of cigarettes smoked in 24 hours (MC).

We excluded SNPs with a missing genotype rate > 0.1, a minor allele frequency < 0.05, or a Hardy-Weinberg equilibrium test p < 10^-7 using PLINK software. In total, 744,511 SNP markers were left after quality control. A total of 2,082 individuals had data for the phenotypic traits and also passed quality control. We also used PLINK software to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. 
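
To make the three SNP-level exclusion thresholds concrete, the following Python sketch applies them to precomputed per-SNP summaries. It is not how PLINK is invoked and does not reproduce the authors' commands; the `SnpSummary` record, the function names, and the SNP identifiers are hypothetical, and the sketch assumes a SNP is dropped if it fails any one of the three criteria.

```python
from dataclasses import dataclass


@dataclass
class SnpSummary:
    """Hypothetical per-SNP summary record used only for this sketch."""
    snp_id: str
    missing_rate: float  # fraction of samples with a missing genotype
    maf: float           # minor allele frequency
    hwe_p: float         # Hardy-Weinberg equilibrium test p-value


def passes_qc(snp, max_missing=0.1, min_maf=0.05, min_hwe_p=1e-7):
    """Keep a SNP only if it clears all three thresholds from the text.

    Assumption: failing any single criterion excludes the SNP.
    """
    return (
        snp.missing_rate <= max_missing
        and snp.maf >= min_maf
        and snp.hwe_p >= min_hwe_p
    )


def filter_snps(snps, **thresholds):
    """Return the SNPs that survive quality control, in input order."""
    return [snp for snp in snps if passes_qc(snp, **thresholds)]
```

For example, `filter_snps(snps)` on a list of `SnpSummary` records returns only those with a missing rate of at most 0.1, a minor allele frequency of at least 0.05, and a Hardy-Weinberg p-value of at least 10^-7.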
Using SNP information from dbSNP (Build 135) and the SNPs remaining after quality control, 5 SNP markers in the nicotinic acetylcholine receptor (nAChR) α4 subunit (CHRNA4), 3 in the nAChR β2 subunit (CHRNB2), 56 in the neurotrophic tyrosine kinase receptor 2 (NTRK2, also known as the tyrosine kinase receptor gene, TrkB), and 18 in the brain-derived neurotrophic factor (BDNF) were chosen to detect gene-gene interactions among the four genes. In total, 15,120 (5×3×56×18) tetragenic interactions, each with one SNP from each of the four genes, were examined.

Because self-identified ethnicity often reflects one’s genetic ancestral origins only partially, especially for populations with complicated migration or admixture histories, principal components analysis was performed with GCTA software on the SAGE data, which include both unrelated samples and relatives, to identify the population structure. The residual score statistics of GEE-GMDR were computed using the methods described in the above subsection, with gender and the top five principal components as covariates. Permutation testing was conducted to obtain the empirical distribution of test accuracy based on 1,000 shuffles. According to the central limit theorem, the p-value can be approximated under the null distribution from an approximate Z score. Because of the computational burden of permutations, only tetragenic models that passed the sign test for test accuracy implemented in the GMDR software at the significance level of 0.05 were evaluated using permutation testing. For comparison, we also used GMDR to analyze the three traits individually. […]
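
The enumeration of tetragenic models and the permutation-based Z approximation described above can be sketched as follows. The exact Z formula is not given in this excerpt, so the sketch assumes the usual standardization of the observed test accuracy against the mean and standard deviation of the permuted accuracies, with a one-sided upper-tail normal p-value. The function names and SNP labels are placeholders, not identifiers from the GMDR software.

```python
import itertools
import math
import statistics


def tetragenic_models(snps_by_gene):
    """Yield every combination that takes exactly one SNP from each gene."""
    return itertools.product(*snps_by_gene.values())


def permutation_z_pvalue(observed_accuracy, permuted_accuracies):
    """Approximate a p-value from permutation results via a Z score.

    Assumed form (not stated in the source):
        Z = (observed - mean(permuted)) / sd(permuted)
    with a one-sided upper-tail p-value from the standard normal.
    """
    mu = statistics.mean(permuted_accuracies)
    sd = statistics.stdev(permuted_accuracies)  # sample standard deviation
    z = (observed_accuracy - mu) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(N(0, 1) >= z)
    return z, p


# Placeholder SNP labels; only the per-gene counts come from the text.
snps_by_gene = {
    "CHRNA4": [f"CHRNA4_snp{i}" for i in range(5)],
    "CHRNB2": [f"CHRNB2_snp{i}" for i in range(3)],
    "NTRK2": [f"NTRK2_snp{i}" for i in range(56)],
    "BDNF": [f"BDNF_snp{i}" for i in range(18)],
}
n_models = sum(1 for _ in tetragenic_models(snps_by_gene))
```

Here `n_models` equals the 15,120 combinations quoted in the text. In practice `permuted_accuracies` would hold the 1,000 test accuracies obtained from the shuffled data for a given model.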

Pipeline specifications

Software tools: GMDR, PLINK, GCTA
Databases: dbGaP, dbSNP
Application: GWAS
Organisms: Homo sapiens