Computational protocol: A Genome-Wide Scan for Breast Cancer Risk Haplotypes among African American Women

Protocol publication

[…] The inferred haplotype dosage estimates can be used individually in a 1-degree-of-freedom (d.f.) test of haplotype-specific association with the disease using model (3), or in a global test that simultaneously fits all haplotypes within the haplotype block defined by a sliding window using model (4). The d.f. of the global test in model (4) are determined by the total number of possible haplotypes within that block. In both models, X is the vector of covariates, including age, study, and the top ten eigenvectors of ancestral information estimated by principal components analysis. The eigenvectors are included to adjust for global ancestry differences and thereby control for potential confounding due to population stratification and admixture.

In haplotype association analysis, a large fraction of the inferred haplotypes can be very rare, with frequency close to zero. It is customary to discard rare haplotypes with frequency below 1% to reduce the total d.f. of the model, so that the power to detect risk effects of relatively common haplotypes is preserved. The number of haplotypes with frequency above 1% is in many cases much smaller than the total number of possible haplotypes, and the d.f. of the global test are reduced accordingly, as indicated in model (5).

We first applied the global test across the whole genome to search agnostically for haplotype effects within the 5-SNP sliding window framework; the 1 d.f. test of individual haplotype-specific effects was performed only when the global test detected a potentially significant region. For visualization, haplotype effects were compared with the effects of the constituent SNPs in the same chromosomal region using an overlaid Manhattan plot showing the statistical significance, presented as -log10(p-value), of both haplotypes and SNPs.
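The 5-SNP sliding window and the 1% frequency filter can be illustrated with a minimal Python sketch. It assumes phased haplotypes coded as 0/1 strings, whereas the analysis described here works with inferred haplotype dosage estimates; the function names and toy data are hypothetical and are not taken from PLINK or any other tool named in this protocol.

```python
from collections import Counter


def sliding_windows(n_snps, width=5):
    """Yield (start, end) index pairs for each window of `width` consecutive SNPs."""
    for start in range(n_snps - width + 1):
        yield start, start + width


def common_haplotypes(haplotypes, start, end, min_freq=0.01):
    """Return {haplotype: frequency} for window haplotypes with frequency >= min_freq."""
    counts = Counter(h[start:end] for h in haplotypes)
    total = len(haplotypes)
    return {hap: n / total for hap, n in counts.items() if n / total >= min_freq}


# Toy data (hypothetical): 200 phased chromosomes typed at 6 SNPs, alleles coded 0/1.
toy_haplotypes = ["000000"] * 120 + ["110100"] * 79 + ["111111"] * 1

for start, end in sliding_windows(n_snps=6, width=5):
    kept = common_haplotypes(toy_haplotypes, start, end)
    print(f"window SNPs {start}-{end - 1}: {len(kept)} haplotypes kept", kept)
```

In this toy example the haplotype carried by a single chromosome (0.5% frequency) is dropped from every window, so only the two common haplotypes would enter a global test.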
Haplotype effects were considered of interest only if a noticeable haplotype effect peak was not accompanied by a similar significance peak among the constituent SNPs. Regions exhibiting considerable haplotype effects were extended both upstream and downstream by half of the original width to include more flanking SNPs and haplotypes, making the extended regions twice as long. All possible individual haplotypes composed of 2 up to 10 SNPs (or the maximum number of genotyped SNPs contained in the extended region, whichever was smaller) with haplotype frequency >1% were investigated exhaustively to single out the particular haplotype(s) explaining the significant global test. The top individual haplotypes were further verified by a likelihood ratio (LR) test comparing the model with both the top haplotype and the best single SNP contained (model 6) to the nested model with the same best SNP only (model 7), where g_i denotes the genotype of the SNP carried by individual i and an additive effect of each risk allele on the disease is assumed. The novelty of the haplotype effects relative to the SNP effects was thus assessed using an LR test with 1 d.f.

We were also interested in whether the haplotype effects could otherwise be captured by genotype imputation in the same region. Genotype imputation was performed with Mendel-GPU using the 1000 Genomes Project (1 KGP) data as the reference panel. The much denser 1 KGP panel has better genomic coverage of rare and low-frequency markers and is reported to provide more statistical power to identify the underlying associations. The advantage of haplotype analysis over SNP imputation would be highlighted by the presence of haplotype signals where significant genotyped or imputed SNPs are absent.

In regions with the strongest haplotype effects, we also inferred and adjusted for the local ancestry of each marker residing near the haplotypes of interest (±250 kb).
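The region extension and the 1 d.f. LR test described above can be sketched in a few lines of Python. This is an illustration only: it assumes the two maximized log-likelihoods have already been obtained from a model-fitting tool, and the function names and numeric values are hypothetical.

```python
import math


def extend_region(start_bp, end_bp):
    """Extend a region by half its width on each side, doubling its length."""
    half = (end_bp - start_bp) // 2
    return start_bp - half, end_bp + half


def lr_test_1df(loglik_full, loglik_nested):
    """Return the LR statistic and p-value for one extra parameter in the full model."""
    # Guard against tiny negative values caused by numerical noise in the fits.
    stat = max(2.0 * (loglik_full - loglik_nested), 0.0)
    # Chi-square survival function with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods: haplotype + best SNP (full) versus best SNP only (nested).
stat, p = lr_test_1df(loglik_full=-1520.3, loglik_nested=-1526.9)
print(f"LR statistic = {stat:.2f}, p = {p:.2e}, -log10(p) = {-math.log10(p):.2f}")
print(extend_region(1_000_000, 1_200_000))
```

The closed form via `math.erfc` holds only for 1 d.f.; a global test with more d.f. would need a general chi-square survival function.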
The local ancestry characterizes the proportions of European and African ancestry, represented by the posterior probabilities of carrying 0, 1, and 2 copies of a European allele at each SNP. Local ancestry was computed by HAPMIX with 240 HapMap EUR+YRI phased founder haplotypes per chromosome as input. The top haplotype effect was further adjusted for the inferred local ancestry in addition to the adjustment for global ancestry (i.e., using the leading principal components), age, and study as described above. This additional adjustment could help eliminate false-positive haplotype effects that were confounded by local ancestry.

In addition, haplotype effects in the neighborhood of known breast cancer risk SNPs identified predominantly in European populations were investigated especially carefully. Twenty-one regions (1p11, 2q35, 3p24, 5p12, 5q11, 6q14, 6q25, 8q24, 9p21, 9q31, 10p15, 10q21, 10q22, 10q26, 11p15, 11q13, 14q24, 16q12, 17q22, 19p13, and 20q11) and their associated SNPs were of primary interest. Regions with the potential to harbor unknown haplotype effects were scrutinized by inferring all possible individual haplotypes of frequency >1% consisting of 2–10 consecutive SNPs within ±250 kb of known breast cancer risk hits (except for 8q24, where ±2 Mb was used). As before, the important haplotype effects were compared with the significance of genotyped as well as 1 KGP imputed SNPs in the same region. The independence of these haplotype-disease associations was further verified by LR tests adjusting for the SNP effects of both the regionally best SNP and the known breast cancer risk SNP. Notable haplotypes residing in proximity to the known breast cancer risk hits were again adjusted for local ancestry inferred from the same region to eliminate potential confounding due to local ancestry admixture.

PLINK was the primary software used to conduct the association analyses.
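The bookkeeping behind these steps can be sketched in Python under stated assumptions. Collapsing the three posterior probabilities into an expected number of European copies is one common way to build a local-ancestry covariate, but the text does not say that this is how the HAPMIX output was used; all function names and positions below are hypothetical.

```python
def european_dosage(p0, p1, p2):
    """Expected European copies from posterior probabilities of 0, 1, and 2 copies."""
    if abs((p0 + p1 + p2) - 1.0) > 1e-6:
        raise ValueError("posterior probabilities must sum to 1")
    return p1 + 2.0 * p2


def snps_in_flank(positions_bp, hit_bp, flank_bp=250_000):
    """Return SNP positions within +/- flank_bp of a known risk SNP."""
    return [pos for pos in positions_bp if abs(pos - hit_bp) <= flank_bp]


def consecutive_runs(n_snps, min_len=2, max_len=10):
    """Yield (start, end) index pairs for every run of min_len..max_len consecutive SNPs."""
    for length in range(min_len, min(max_len, n_snps) + 1):
        for start in range(n_snps - length + 1):
            yield start, start + length


# Hypothetical positions around a risk SNP at 500 kb.
positions = [100_000, 400_000, 640_000, 900_000]
nearby = snps_in_flank(positions, hit_bp=500_000)
print(nearby, len(list(consecutive_runs(len(nearby)))))
print(european_dosage(0.1, 0.3, 0.6))
```

For 8q24, the same helper would be called with `flank_bp=2_000_000` to match the wider ±2 Mb neighborhood.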
All regression models were adjusted for age, study, and global ancestry. For important haplotypes identified through association analyses, local ancestry was additionally adjusted for. […]
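As a reading aid, the covariate-adjusted models referred to above can be sketched in a generic logistic form. The logistic link and every symbol other than X and g_i are assumptions made for illustration, not the authors' notation: D_i is disease status, h_i and h_ij are haplotype dosages, m is the number of haplotype terms fitted, and α, β, γ, δ are coefficients.

```latex
% Single-haplotype test (in the spirit of model 3); 1 d.f. test of \beta = 0
\operatorname{logit} \Pr(D_i = 1) = \alpha + \beta\, h_i + \gamma^{\top} X_i

% Global test (in the spirit of models 4 and 5); one coefficient per fitted haplotype
\operatorname{logit} \Pr(D_i = 1) = \alpha + \sum_{j=1}^{m} \beta_j\, h_{ij} + \gamma^{\top} X_i

% Haplotype plus best SNP (in the spirit of model 6)
\operatorname{logit} \Pr(D_i = 1) = \alpha + \beta\, h_i + \delta\, g_i + \gamma^{\top} X_i

% Best SNP only, nested in the previous model (in the spirit of model 7)
\operatorname{logit} \Pr(D_i = 1) = \alpha + \delta\, g_i + \gamma^{\top} X_i
```

Under this sketch, the last two models differ only in the single coefficient β, which matches the 1 d.f. LR test described in the text. Adjustment for local ancestry would amount to adding one more term to X.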

Pipeline specifications

Software tools: HAPMIX, PLINK
Applications: Population genetic analysis, GWAS
Organisms: Homo sapiens
Diseases: Breast Neoplasms