Computational protocol: Effects of normalization on quantitative traits in association test

Protocol publication

[…] The normalization methods investigated were: logarithm, Box-Cox, inverse-logarithm (i.e. log [-[xi-min(x)]], where xi is the quantitative trait for sample i), and rank-based transformation []. The ranking of SNPs and their p-values from association tests after multiple-testing correction were used for performance assessment. The motivation for the inverse-logarithm transformation came from one of our quantitative traits, whose data contained negative values and were highly skewed to the left, so that neither logarithm nor Box-Cox would normalize it appropriately. All the transformations were provided or coded in Matlab.

HAPSIMU simulates from a model where the effects of the allele are additive, based on the phenotypic variance explained (PV) and the frequency (f) of the disease susceptibility allele. It utilizes informative marker loci from the ENCODE regions genotyped in CEPH and YRI. A normally distributed trait is generated for each of the genotypes (AA, AB, BB). To avoid confounding factors such as polygenic effects and population admixture, one causal SNP was generated out of 100 SNPs in a homogeneous YRI population for our simulations. We note that 100 SNPs is not representative of GWAS data, nor is a single causal SNP a realistic scenario. However, given that most complex diseases follow a polygenic model probably involving thousands of causal SNPs, the scenario of 1 causal SNP out of 100 can reasonably be extrapolated to a GWAS dataset with a few thousand causal SNPs, a reasonable representation of a polygenic, additive disease model. The settings for HAPSIMU are shown in Figure . 120 sets of data were generated to evaluate the sensitivity and ranking of the causal SNP. We also investigated the effects of normalization versus sample size and PV of the causal SNP, generating datasets for 1000, 2000, 4000, and 8000 subjects, with PV of 0.01, 0.02, and 0.2. These PV values can be seen as representing small to large effects on the quantitative traits.
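
To make the logarithm, Box-Cox, and rank-based transformations named in this section more concrete, the following is a minimal Python sketch rather than the authors' Matlab code. Several points are assumptions made for illustration: the function names are hypothetical, the Box-Cox parameter `lam` is supplied by the caller rather than estimated, and the rank-based transformation is taken to be a rank-based inverse normal transform with an offset of 0.5, which the text does not specify. The inverse-logarithm transformation is left out because its exact form is given only by the expression quoted above.

```python
import math
from statistics import NormalDist


def log_transform(x):
    """Natural logarithm; only defined for strictly positive trait values."""
    if min(x) <= 0:
        raise ValueError("logarithm needs strictly positive trait values")
    return [math.log(v) for v in x]


def box_cox(x, lam):
    """Box-Cox power transform for a fixed, caller-supplied lambda."""
    if min(x) <= 0:
        raise ValueError("Box-Cox needs strictly positive trait values")
    if lam == 0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]


def rank_based(x, offset=0.5):
    """Rank-based inverse normal transform (assumed form).

    Each value is replaced by the standard normal quantile of
    (rank - offset) / n; tied values share their average rank.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and x[order[end + 1]] == x[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0  # average 1-based rank of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - offset) / n) for r in ranks]


# Toy right-skewed trait (made-up values, not data from the study).
trait = [0.5, 1.0, 2.0, 4.0, 50.0]
scores = rank_based(trait)
```

The rank-based scores depend only on the ordering of the trait values, so the outlying value 50.0 is pulled in to the same distance from the centre as the smallest value, whereas the logarithm and Box-Cox transforms keep part of the original spacing.
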
In total, 1440 datasets (120 replicates × 4 sample sizes × 3 PV values), each comprising 100 SNPs (1 causal), were generated.

Quantitative traits generated by HAPSIMU are normally distributed, with mean and standard deviation determined by PV and f. To investigate other trait distributions, such as left-skewed, right-skewed, and bimodal, traits were log and beta transformed. Four traits were obtained after HAPSIMU (i.e. normal, left-skew, right-skew, bimodal), and each of these traits was transformed using logarithm, inverse-logarithm, Box-Cox, and rank-based transformations, giving a total of 20 traits (the 4 untransformed traits plus 16 transformed ones). Quantitative traits were tested using the common GWAS software PLINK [], which implements the likelihood ratio test and Wald test. It generates an output file with extension .qassoc that comprises the estimated regression coefficient, standard error, and asymptotic p-value. The p-value was used for assessment.

Criteria for performance assessment were based on (1) the rank displacement of the causal SNP among the 100 SNPs, where the expected rank was 1 (i.e. displacement 0), and (2) significant association at a Bonferroni-corrected p-value of <5 × 10^-4 (corresponding to 0.05/100). Significant causal SNPs were considered true positives, while significant non-causal SNPs were false positives. Sensitivity, or True Positive Rate (TPR), was computed for the 120 datasets as TP/(TP+FN)*100, where TP and FN were the true positives and false negatives, respectively, from the confusion table tabulated over the 120 datasets. False Positive Rate (FPR), or Type I error, was computed as FP/(FP+TN)*100, where FP and TN were the false positives and true negatives from the confusion table. Since there was only one causal SNP, the false negative count was either 1 or 0 for each simulation, so sensitivity was synonymous with power. […]
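
The assessment criteria described above can be illustrated with a short Python sketch. It is not the authors' code: the function names, the input layout (one list of p-values plus the index of the causal SNP per dataset), and the toy p-values are all assumptions made for illustration, and ties in p-value are resolved in favour of the causal SNP, which the text does not address.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def rank_displacement(p_values, causal_index):
    """Rank of the causal SNP by ascending p-value, minus the expected rank 1."""
    causal_p = p_values[causal_index]
    rank = 1 + sum(1 for p in p_values if p < causal_p)
    return rank - 1


def confusion_counts(datasets, threshold):
    """Tabulate TP, FN, FP, TN over datasets of (p_values, causal_index)."""
    tp = fn = fp = tn = 0
    for p_values, causal_index in datasets:
        for i, p in enumerate(p_values):
            significant = p < threshold
            if i == causal_index:
                tp += int(significant)
                fn += int(not significant)
            else:
                fp += int(significant)
                tn += int(not significant)
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """TPR and FPR in percent, as TP/(TP+FN)*100 and FP/(FP+TN)*100."""
    return 100.0 * tp / (tp + fn), 100.0 * fp / (fp + tn)


# Toy example: 2 datasets of 4 SNPs each, causal SNP at index 0 (made-up p-values).
toy = [
    ([1e-6, 0.2, 0.6, 0.03], 0),
    ([0.02, 1e-5, 0.4, 0.9], 0),
]
toy_threshold = bonferroni_threshold(0.05, 4)  # 4 tests per toy dataset
tp, fn, fp, tn = confusion_counts(toy, toy_threshold)
tpr, fpr = rates(tp, fn, fp, tn)
displacements = [rank_displacement(p, c) for p, c in toy]
```

With 100 SNPs per dataset, `bonferroni_threshold(0.05, 100)` gives the 5 × 10^-4 threshold used in the text. In the toy data the second dataset has a non-causal SNP ranked ahead of the causal one, so its displacement is 1 and it contributes one false negative and one false positive.
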

Pipeline specifications

Software tools: HAPSIMU, PLINK
Applications: Population genetic analysis, GWAS