Computational protocol: Phenome-wide Association Study Relating Pretreatment Laboratory Parameters With Human Genetic Variants in AIDS Clinical Trials Group Protocols

Protocol publication

[…] The PLINK program and R statistical programming language were used for QC procedures [, ] (summarized in Figure ). Polymorphisms were censored for call rates <98%. We excluded 22 samples in which genetically inferred sex differed from clinical data or in which sex status was missing and could not be inferred. We excluded 30 samples from Phase II for specimen/data labeling inconsistency. We excluded 54 samples with overall genotyping call rates <98%. We excluded 4 samples with cryptic relatedness based on identity-by-descent estimates >0.3 from approximately 100 000 pruned SNPs. This yielded 2811 samples for imputation.
Post-QC data from each phase were imputed to 1000 Genomes [] after converting to genome build 37 using liftOver [] and stratifying by chromosome to parallelize imputation processing. SHAPEIT2 [] was used to check strand alignment and to phase data. The IMPUTE2 algorithm [] was used to impute additional genotypes that were available in the 1000 Genomes reference panel but not directly genotyped. Each chromosome was segmented into 6 Mb regions with at least 3500 reference variants in each region. Imputed genotypes were included if posterior probabilities exceeded 0.9.
Quality of imputed data was assessed following the Electronic Medical Records and Genomics (eMERGE) protocol []. Each chromosome from each phase was checked for 100% concordance with genotyped data. No batch effects were found. We dropped imputed SNPs with imputation scores <0.3, genotyping call rates <98%, or minor allele frequencies <0.01. [...]
Phase III represented many more subjects than Phase I or Phase II. Therefore, to seek replication, we divided the data into 2 comparable-sized groups: Dataset I (Phase III representing protocol A5202) and Dataset II (Phases I and II representing protocols ACTG384, A5095, and A5142). When linked with available clinical laboratory data, final datasets included 1181 subjects for Dataset I, 1366 subjects for Dataset II, and 5 954 294 SNPs for each dataset.
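The post-imputation variant filter described above (imputation score, call rate, and minor allele frequency) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used PLINK and R; the `ImputedVariant` record layout, the field names, and the function names are assumptions introduced here for illustration. Only the three thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: variants below any of these were dropped.
MIN_INFO_SCORE = 0.3   # imputation score
MIN_CALL_RATE = 0.98   # genotyping call rate
MIN_MAF = 0.01         # minor allele frequency


@dataclass(frozen=True)
class ImputedVariant:
    """Per-variant QC summary (hypothetical layout, not from the source)."""
    variant_id: str
    info_score: float       # imputation quality score
    call_rate: float        # fraction of samples with a genotype call
    alt_allele_freq: float  # frequency of the alternate allele, in [0, 1]


def minor_allele_frequency(alt_allele_freq: float) -> float:
    """The minor allele is whichever of the two alleles is rarer."""
    return min(alt_allele_freq, 1.0 - alt_allele_freq)


def passes_qc(variant: ImputedVariant) -> bool:
    """Keep a variant only if it clears all three thresholds."""
    return (
        variant.info_score >= MIN_INFO_SCORE
        and variant.call_rate >= MIN_CALL_RATE
        and minor_allele_frequency(variant.alt_allele_freq) >= MIN_MAF
    )


def filter_variants(variants):
    """Return the variants that survive QC, preserving input order."""
    return [v for v in variants if passes_qc(v)]
```

Given made-up records, a variant with a score of 0.25, one with a call rate of 0.97, and one with an alternate allele frequency of 0.995 (minor allele frequency 0.005) would each be dropped, while a variant that clears all three thresholds would be kept.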
Statistical analyses were limited to genetic loci shared by all genotyping phases after imputation. Using the R statistical package, continuous traits were modeled with linear regression and dichotomous traits with logistic regression []. The first 5 principal components, calculated using EIGENSOFT [], were used to adjust for global ancestry. Each analysis was also adjusted for sex and age. In the secondary analyses, we also adjusted for CD4 T-cell counts (square-root transformed), a marker of HIV-1 disease progression. All results presented herein are for PheWAS associations adjusted for square-root transformed CD4 T-cell counts. Results were not substantially different when not adjusted for this covariate (data not shown).
We first identified SNP-phenotype associations with P values < .01 and with the same direction of association in both datasets, using replication to reduce the impact of multiple testing []. In addition to seeking internal replication across the 2 datasets (instead of P-value correction to control type-I error), for external replication we leveraged SNP associations posted to the GWAS Catalog []. [...]
Biofilter [] was used to annotate PheWAS results with previously reported associations from the GWAS Catalog through October 2013, with GWAS Catalog P values < 1 × 10⁻⁵ []. Biofilter was also used to annotate SNPs with (1) gene information and (2) biological pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) []. [...]
Many SNPs were correlated with each other due to extensive genotypic coverage provided by imputation. Using all SNPs that passed our PheWAS P-value filter threshold for associations across Dataset I and Dataset II, we estimated SNP haplotype blocks with Haploview [] as implemented in PLINK [], separately for each dataset. We grouped these SNPs into linkage disequilibrium (LD) haplotype blocks using 10 000 kb windows. For Dataset I, 4246 SNPs collapsed into 668 LD blocks, and 6338 SNPs did not collapse into an LD block. For Dataset II, 4428 SNPs collapsed into 694 LD blocks, and 6156 SNPs did not collapse into an LD block. Haplotype blocks and association results were imported into a MySQL database. This process allowed results to be collapsed based on LD, streamlining exploration of association signals across correlated SNPs and facilitating evaluation of groups of SNPs in LD associated consistently with multiple phenotypes.
We provide all association results with haplotype block information for each dataset, as well as the nearest gene(s) and any known GWAS Catalog associations, at (Supplementary Table 1). [...]
Synthesis-View [] was used to visualize results that replicated GWAS Catalog associations. Phenogram [] was used to visualize results for potentially pleiotropic SNPs. ggplot2 [] was used to generate Manhattan plots. […]
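The internal replication filter (P < .01 with the same direction of association in both datasets) and the LD-based collapsing of results described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Association` record, the function names, and the in-memory dictionary standing in for the MySQL database are all hypothetical. The sketch also uses a single SNP-to-block mapping, whereas the text estimates blocks separately for each dataset.

```python
from collections import defaultdict
from dataclasses import dataclass

P_THRESHOLD = 0.01  # per-dataset P-value filter quoted in the text


@dataclass(frozen=True)
class Association:
    """One SNP-phenotype regression result (hypothetical layout)."""
    snp: str
    phenotype: str
    beta: float     # regression coefficient; its sign is the direction
    p_value: float


def replicated(dataset_1, dataset_2, p_threshold=P_THRESHOLD):
    """Return (snp, phenotype) pairs with P < threshold and the same
    sign of beta in both datasets."""
    second = {(a.snp, a.phenotype): a for a in dataset_2}
    hits = []
    for a in dataset_1:
        b = second.get((a.snp, a.phenotype))
        if b is None:
            continue  # pair was not tested in the second dataset
        both_pass = a.p_value < p_threshold and b.p_value < p_threshold
        same_direction = a.beta * b.beta > 0
        if both_pass and same_direction:
            hits.append((a.snp, a.phenotype))
    return hits


def collapse_by_block(hits, snp_to_block):
    """Group replicated hits by LD block. SNPs that fall in no block
    are kept as their own singleton group."""
    groups = defaultdict(list)
    for snp, phenotype in hits:
        block = snp_to_block.get(snp)
        key = block if block is not None else f"singleton:{snp}"
        groups[key].append((snp, phenotype))
    return dict(groups)
```

Grouping by block makes it straightforward to ask which blocks carry replicated signals for more than one phenotype, which is the kind of exploration the text attributes to the database step.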

Pipeline specifications