Computational protocol: Genetic variation in the immunosuppression pathway genes and breast cancer susceptibility: a pooled analysis of 42,510 cases and 40,577 controls from the Breast Cancer Association Consortium

Protocol publication

[…] For the BCAC studies, genotyping was carried out using a custom Illumina iSelect array (iCOGS) designed for the Collaborative Oncological Gene-Environment Study (COGS) project (Michailidou et al.). Of the 211,155 SNPs on the array, 4246 were located within 50 kb of the selected candidate genes. Centralized quality control of genotype data led to the exclusion of 651 SNPs. The exclusion criteria were a call rate below 95 % in all samples genotyped with iCOGS, a minor allele frequency (MAF) below 0.05 in all samples, evidence of deviation from Hardy–Weinberg equilibrium (HWE) at p value < 10⁻⁷, and concordance in duplicate samples below 98 % (Michailidou et al.). A total of 3595 SNPs passed all quality controls and were analyzed.

Per-allele associations, modelled on the number of minor alleles, were assessed using multiple logistic regression models adjusted for study, age (at diagnosis for cases or at recruitment for controls) and nine principal components (PCs) derived from genotyped variants to account for European population substructure. We assessed the associations of SNPs with overall breast cancer risk as primary analyses, and then restricted the analyses to ER-positive (26,094 cases and 40,577 controls) and ER-negative subtypes (6870 cases and 40,577 controls) as secondary analyses. Differences in the associations between ER-positive and ER-negative disease were assessed by case-only analyses, using ER status as the dependent variable. To determine the number of “independent” SNPs for multiple-testing adjustment, we applied the option “--indep-pairwise” in PLINK (Purcell et al.). SNPs were pruned at a linkage disequilibrium (LD) threshold of r² = 0.2, so that retained SNPs had pairwise r² < 0.2, using a window size of 50 SNPs and a step size of 10 SNPs, yielding 689 “independent” SNPs.
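The SNP exclusion criteria and the multiple-testing arithmetic can be sketched in Python. This is a minimal illustration under stated assumptions, not the consortium's pipeline: the `SnpQc` record, its field names, and both function names are hypothetical, and only the cut-offs and the count of 689 pruned SNPs are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP quality metrics (hypothetical record, for illustration only)."""
    call_rate: float      # fraction of samples with a genotype call
    maf: float            # minor allele frequency
    hwe_p: float          # p value of the Hardy-Weinberg equilibrium test
    concordance: float    # genotype concordance in duplicate samples


def passes_qc(snp: SnpQc) -> bool:
    """Return True if the SNP meets none of the exclusion criteria in the text."""
    return (
        snp.call_rate >= 0.95
        and snp.maf >= 0.05
        and snp.hwe_p >= 1e-7
        and snp.concordance >= 0.98
    )


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold for a family-wise error rate alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# 689 "independent" SNPs remained after LD pruning.
threshold = bonferroni_threshold(0.05, 689)
print(f"{threshold:.1e}")  # 7.3e-05
```

Dividing by the number of LD-pruned SNPs rather than by all 3595 analyzed SNPs avoids over-correcting for tests that are strongly correlated with one another.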
The Bonferroni-corrected significance threshold corresponding to an alpha of 5 % was 7.3 × 10⁻⁵.

To identify more strongly associated variants, genotypes were imputed for SNPs at the locus showing the strongest evidence of association, via a two-stage procedure involving SHAPEIT (Howie et al.) and IMPUTEv2 (Howie et al.), using the 1000 Genomes Project data as the reference panel (Abecasis et al.). Details of the imputation procedure are described elsewhere (Michailidou et al.). Models assessing associations with imputed SNPs were adjusted for 16 PCs based on 1000 Genomes imputed data to further improve adjustment for population stratification. To determine independent signals among imputed SNPs at STAT3, we ran a forward stepwise multiple logistic regression model including the most significant genotyped SNP, rs1905339, and all imputed SNPs, adjusted for study, age and 16 PCs.

SNP association analyses and case-only analyses were all conducted using SAS 9.3 (Cary, NC, USA). All tests were two-sided.

For multiple associated SNPs located at the same gene, a Microsoft Excel SNP tool created by Chen et al. and the software HaploView 4.2 (Barrett et al.) were used to examine the LD structure between these SNPs. To inspect LD structures and to run gene-level analyses, allele dosages of imputed SNPs had to be converted into the most probable genotypes. We therefore categorized imputed allele dosages in [0, 0.5] as homozygous for the reference allele, those in [0.5, 1.5] as heterozygous, and those in [1.5, 2.0] as homozygous for the counted allele. The regional association plot was generated using the online tool LocusZoom (Pruim et al.).

[...] To examine whether potential causative genes influence RNA expression in breast tumor tissue, we downloaded RNA sequence level 3 data from The Cancer Genome Atlas (TCGA).
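The conversion of imputed allele dosages into most probable genotypes, described above, can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and because the stated intervals share the end points 0.5 and 1.5, the code assumes (the text does not say) that a dosage exactly on a boundary goes to the higher category.

```python
def dosage_to_genotype(dosage: float) -> int:
    """Map an imputed allele dosage in [0, 2] to a hard genotype call.

    Returns the number of copies of the counted allele:
    0 = homozygous reference, 1 = heterozygous, 2 = homozygous counted allele.
    Assumption: boundary values 0.5 and 1.5 fall into the higher category.
    """
    if not 0.0 <= dosage <= 2.0:
        raise ValueError(f"dosage out of range [0, 2]: {dosage}")
    if dosage < 0.5:
        return 0
    if dosage < 1.5:
        return 1
    return 2


# Toy dosages, not study data.
calls = [dosage_to_genotype(d) for d in (0.02, 0.97, 1.88)]
print(calls)  # [0, 1, 2]
```

Hard calls of this kind discard imputation uncertainty, which is why the text restricts them to LD inspection and gene-level analyses.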
We retrieved RNA expression levels in the form of RNA-Seq by Expectation–Maximization (RSEM) values from the IlluminaHiSeq_RNASeqV2 platform. Differences in RNA levels between 989 invasive breast cancer tissues and 113 matched normal tissues for four genes of interest (STAT3, PTRF, IL5, and GM-CSF) were analyzed using a two-sided Wilcoxon–Mann–Whitney test. In addition, data from 183 breast tissues in the publicly available GTEx (V6) online database were evaluated to determine whether the most interesting variants (rs1905339, rs8074296, rs146170568, chr17:40607850:I and rs77942990) were expression quantitative trait loci (eQTLs) for any gene. GTEx was also queried to determine whether the five variants were eQTLs for STAT3 or PTRF.

[...] To investigate potential regulatory functions of interesting polymorphisms, we used the Encyclopedia of DNA Elements (ENCODE) database through the UCSC Genome Browser as well as HaploReg v4 (Ward and Kellis). […]
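The two-sided Wilcoxon–Mann–Whitney comparison of tumor and normal expression levels can be illustrated with a small standard-library sketch. The study itself ran its association analyses in SAS and does not say how the expression test was computed; the function below is a hypothetical stand-in that uses the normal approximation with a tie correction and no continuity correction, which is reasonable for samples as large as 989 and 113 but only approximate for small ones.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon-Mann-Whitney test (normal approximation).

    Returns (U statistic for x, two-sided p value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Values at positions i..j-1 are tied and share the mid-rank.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0.0:  # all observations identical
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_two_sided


# Toy RSEM-like values, not TCGA data.
tumor = [812.4, 930.1, 1105.7, 990.3, 875.0]
normal = [640.2, 701.9, 655.5, 720.8]
u_stat, p_value = mann_whitney_u(tumor, normal)
```

In practice a library routine such as `scipy.stats.mannwhitneyu` with `alternative="two-sided"` would normally be preferred over a hand-written version; the sketch only makes the rank-based logic explicit.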

Pipeline specifications

Software tools: PLINK, SHAPEIT, Haploview, LocusZoom, RSEM, HaploReg
Databases: GTEx, UCSC Genome Browser
Applications: RNA-seq analysis, GWAS, Genome data visualization
Organisms: Homo sapiens
Diseases: Breast Neoplasms, Neoplasms
Chemicals: Estrogens