Computational protocol: Genome-Wide Association Study and Pathway-Level Analysis of Tocochromanol Levels in Maize Grain

Protocol publication

[…] The diversity panel was previously genotyped with the Illumina MaizeSNP50 BeadChip and various other SNP genotyping assays. The MaizeSNP50 genotypic data set for the maize association panel is available for download from the Panzea database. In this study, the diversity panel was also genotyped with a genotyping-by-sequencing (GBS) protocol to further enhance genome-wide marker density. The GBS marker data set (partially imputed genotypes; January 10, 2012 version) is also available for download from the Panzea database. Removal of monomorphic and low-quality SNPs generated a data set with 591,822 SNPs, of which 294,092 had minor allele frequencies (MAFs) greater than or equal to 0.05 in the panel. Of the 294,092 SNPs used for association analysis, 7964 were redundant because loci overlapped among the three SNP data sets. These redundant SNPs were retained in the association analysis because missing data patterns and error rates vary among the genotyping technologies. Before association analysis, missing SNP genotypes were conservatively imputed with the major allele.

With BLUPs of each tocochromanol trait, we conducted a GWAS with the 294,092 genome-wide SNPs using a univariate unified mixed linear model that eliminated the need to recompute variance components (i.e., population parameters previously determined, or P3D), as implemented in the Genome Association and Prediction Integrated Tool (GAPIT) package. To control for population structure and familial relatedness, the mixed model included principal components and a kinship (coancestry) matrix, both calculated from the 34,368 nonindustry SNPs on the Illumina MaizeSNP50 BeadChip. For the BLUPs of each tocochromanol trait, the Bayesian information criterion was used to determine the optimal number of principal components to include as covariates in the mixed model.
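The MAF filter and major-allele imputation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the genotype coding (alternate-allele counts 0/1/2, with `None` for a missing call), the function names, and the toy data are all assumptions made for illustration, and the removal of low-quality SNPs is not modeled.

```python
from typing import List, Optional, Tuple

# Assumed coding: alternate-allele count 0/1/2, None = missing call.
Genotype = Optional[int]


def allele_frequencies(calls: List[Genotype]) -> Tuple[float, float]:
    """Return (MAF, alternate-allele frequency) over the non-missing calls of one SNP."""
    called = [g for g in calls if g is not None]
    if not called:
        return 0.0, 0.0  # no data at all: treated as uninformative
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq), alt_freq


def filter_and_impute(snps: List[List[Genotype]], maf_min: float = 0.05):
    """Keep SNPs with MAF >= maf_min, then fill missing calls with the
    homozygous major-allele genotype.

    `snps` holds one list of calls per SNP (hypothetical layout).
    Returns (kept_indices, imputed_snps).
    """
    kept, imputed = [], []
    for index, calls in enumerate(snps):
        maf, alt_freq = allele_frequencies(calls)
        if maf < maf_min:  # also drops monomorphic SNPs (MAF = 0)
            continue
        # Ties at frequency 0.5 fall back to the reference allele (arbitrary choice).
        major = 2 if alt_freq > 0.5 else 0
        kept.append(index)
        imputed.append([major if g is None else g for g in calls])
    return kept, imputed


# Toy data: four SNPs scored on four lines.
toy_snps = [
    [0, 0, 2, 2],      # polymorphic, complete
    [2, 2, None, 0],   # alternate allele is the major allele
    [0, 0, 0, 0],      # monomorphic, removed
    [None, 0, 0, 2],   # reference allele is the major allele
]
kept, imputed = filter_and_impute(toy_snps)
print(kept)
print(imputed)
```

Imputing with the major allele pulls missing calls toward the most common genotype, which is why the text calls the choice conservative.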
A likelihood-ratio-based R² statistic, denoted R²LR, was used to assess the amount of phenotypic variation explained by the model. A false-discovery rate (FDR) procedure was used to control for multiple testing at FDRs of 5% and 10%.

In addition to the unified mixed linear model, a multilocus mixed model (MLMM) was implemented to clarify complex association signals that involved a major effect locus. The MLMM employs stepwise mixed-model regression with forward inclusion and backward elimination, thus allowing for a more exhaustive search of a large model space. In contrast to the unified mixed model with P3D, the MLMM re-estimates the variance components of the model at each step. Specifically, all SNPs on a chromosome that carried a major effect locus (i.e., chromosome-wide) were considered for inclusion in the final model. The optimal model was selected using the extended Bayesian information criterion. We then repeated the GWAS for each trait with the MLMM-identified SNPs included as covariates in the unified mixed linear model, for better control of major effect loci.

[…] LD between pairs of SNPs was estimated by using squared allele-frequency correlations (r²) in TASSEL version 3.0. Local LD (r²) and common haplotype patterns were also assessed in Haploview version 4.2. Haplotype blocks were defined with the confidence interval method. Only SNPs with a MAF ≥0.05 and a missing-data rate below 0.10 were used to estimate LD. In contrast to the GWAS, missing SNP genotypes were not imputed with the major allele before LD analysis. […]
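As a rough illustration of the LD step, the sketch below applies the stated SNP filter (MAF ≥0.05, missing-data rate below 0.10) and computes r² as the squared Pearson correlation of allele counts, using only lines where both SNPs are called, so nothing is imputed. For fully homozygous inbred lines this matches the usual r² = D²/(pA(1 − pA)pB(1 − pB)). It is not necessarily what TASSEL or Haploview computes internally. The genotype coding (0/1/2, `None` for missing) and all function names are assumptions.

```python
from typing import List, Optional

Genotype = Optional[int]  # assumed coding: alternate-allele count 0/1/2, None = missing


def passes_ld_filter(calls: List[Genotype], maf_min: float = 0.05,
                     max_missing: float = 0.10) -> bool:
    """True if the SNP has MAF >= maf_min and a missing-data rate below max_missing."""
    called = [g for g in calls if g is not None]
    if not calls or not called:
        return False
    if (len(calls) - len(called)) / len(calls) >= max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq) >= maf_min


def ld_r2(snp_a: List[Genotype], snp_b: List[Genotype]) -> Optional[float]:
    """Squared correlation of allele counts over lines called at both SNPs.

    Returns None when r² is undefined (too few shared calls, or one SNP
    shows no variation among the shared calls).
    """
    pairs = [(a, b) for a, b in zip(snp_a, snp_b) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return (cov * cov) / (var_a * var_b)


# Toy example: the second SNP has one missing call, which is skipped rather than imputed.
snp_1 = [0, 0, 2, 2, 0]
snp_2 = [2, 2, 0, 0, None]
print(ld_r2(snp_1, snp_2))
```

Skipping missing calls pairwise, rather than filling them with the major allele, avoids inflating apparent correlation between SNPs that happen to share missing lines.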

Pipeline specifications

Software tools: GAPIT, TASSEL, Haploview
Databases: Panzea
Applications: GBS analysis, GWAS
Organisms: Zea mays
Chemicals: Vitamin E, Tocopherols, Tocotrienols