Computational protocol: Elucidating the genetic architecture of reproductive ageing in the Japanese population

Similar protocols

Protocol publication

[…] BBJ participants had DNA genotyped on more than 950,000 variants using either (a) a combination of Illumina Human OmniExpress BeadChip and Infinium HumanExome BeadChip or (b) Infinium OmniExpressExome BeadChip alone. Variants overlapping across these two sets of genotyping arrays were extracted. Variants were then excluded according to the following criteria: (i) monomorphic in any chip; (ii) call rate <99%; (iii) minor allele count in heterozygotes <5; (iv) Hardy-Weinberg Equilibrium P-value <1 × 10−6 in any chip. For sample quality control, we excluded samples with (i) call rate <98%; (ii) discordant phenotypic and genotypic sex; (iii) excess heterozygosity; (iv) cryptic relatedness assessed by pi_hat measurement (>0.2) for identity by descent; or if (v) not from mainland Japan identified by principal component analysis using all samples from the 1000 Genomes Project. After quality control, 532,488 autosomal variants were phased using Eagle2 and subsequently imputed up to the reference panel from the 1000 Genomes Project Phase 3 using minimac3.Variants with good imputation quality (minimac rsq >0.3) were tested for associations with two quantitative traits, age at menarche and age at menopause, and two dichotomous traits, early menarche and late menarche, assuming additive allelic effects. Associations with ages at menarche and menopause were tested in linear regression models using mach2qtl. Associations with early and late menarche (both vs. the median group), or each age year of age at menarche (age 9, 10, 11, 12, 13, 15, 16, 17, 18, 19 and 20) vs. the same median (age 14) group), were tested in logistic regression models using mach2dat. In each model, ten principal components were included to adjust for cryptic population structure, in addition to birth year as a covariate.Variance explained by genetic variants in the current study were estimated using the Restricted Estimate Maximum Likelihood (REML) method implemented in BOLT-LMM. We tested different SNP sets: (i) Directly genotyped variants within 250 kb up- or down-stream of the previously reported European lead variants,, and (ii) all directly genotyped variants which passed quality control. The same methodology and default parameters were applied to both the BBJ and UK Biobank study, using ancestry-specific LD scores generated from the 1000 Genomes project. Pathway enrichment was performed on 3216 pathways defined in Gene Ontology, PANTHER, KEGG and Ingenuity using MAGENTA. The same approach was used to test the set of 20 PTPR family genes. All default parameters were used, and significance was based on an FDR<0.05 for the 75th enrichment centile. [...] Tissue enrichment analyses were performed using MAGMA (implemented through FUMA) and LDSC-SEG. Software supplied default parameters were used for both approaches, and East-Asian specific LD scores and LD information were used where appropriate. Tissue enrichment was performed on tissues from the V6p GTEx release. Individual locus expression and methylation data were incorporated using Summary Mendelian Randomization (SMR). We used a subset of 1344 unrelated East Asian individuals (based on principle component analysis and self-reported ancestry) from the UK Biobank study (500 K HRC release) as an LD reference panel for SMR analysis. Analysis was performed using the pre-supplied McRae et al. methylation data and V7 GTEx gene expression data ( All analyses used a Bonferroni correction based on the number of tests performed (i.e 0.05/N genes). [...] Known menarche and menopause European loci were defined as those discovered in the two largest reported GWAS meta-analyses to date,. As effect estimates reported in discovery meta-analyses are potentially inflated due to the ‘winners curse’ phenomenon, we derived more robust European effect estimates in independent samples from the UK Biobank study. For menarche, this required us to restrict the number of known European loci to the largest discovery meta-analysis prior to inclusion of UK Biobank. A total of 73,397 women with genotype and age at menarche were available from UK Biobank. Age of natural menopause was available for 32,545 UK Biobank women, using the same inclusion/exclusion criteria applied to BBJ women. This analysis was performed using a linear mixed model implemented in BOLT. Trans-ethnic meta-analysis was performed using Han and Eskin’s Random Effects model, implemented in Metasoft. [...] Total RNA was extracted from the MBH tissues of female rats and rhesus monkeys at different developmental stages using the RNeasy mini kit (Qiagen, Valencia, CA) following the manufacturer’s protocol. DNA contamination was removed from the RNA samples by on-column digestion with DNAse using the Qiagen RNase-free DNase kit (Qiagen, Valencia, CA) according to the manufacturer’s instructions. RNA concentrations were determined by spectrophotometric trace (Nanodrop, ThermoScientific, Wilmington, DE). Five-hundred ng of total RNA were transcribed into cDNA in a volume of 20 µl using 4 U of Omniscript reverse transcriptase (Qiagen, Valencia, CA).Relative mRNA abundance was determined using the SYBR GreenER™ qPCR SuperMix system (Invitrogen, Carlsbad, CA). Suitable amplification primers were designed using the PrimerSelect tool of DNASTAR 14 software (Madison, WI) or the NCBI online Primer-Blast program (Supplementary Table ). PCR reactions were performed in a volume of 10 μl containing 1 μl of diluted cDNA, 5 μl of SYBR GreenER™ qPCR SuperMix and 4 μl of primers mix (1 µM of each gene specific primer). The PCR conditions used were as follows: 5 min at 95 °C, 40 cycles of 15 s at 95 °C and 60 s at 60 °C. To confirm the formation of a single SYBR Green-labelled PCR amplicon, each PCR reaction was followed by a three-step melting curve analysis consisting of 15 s at 95 °C, 1 min at 60 °C, ramping up to 95 °C at 0.5 °C per sec, detecting every 0.5 s and finishing for 15 s at 95 °C, as recommended by the manufacturer. All qPCR reactions were performed using a QuantStudio 12 K Real-Time PCR system (Thermo Fisher, Waltham, MA); threshold cycles (CTs) were detected by QuantStudio 12 K Flex software (Thermo Fisher, Waltham, MA). Relative standard curves were constructed as reported,. [...] RNA extracted from the MBH of prepubertal female rats was subjected to RNA-seq following a procedure we recently described. The existence of differential gene expression in the medial basal hypothalamus during pubertal development was determined by employing the edgeR analysis package. An initial trimming and adaptor removal step was carried out using Trimmomatic. Removal of the Illumina adapter sequences and default filtering parameters was performed as suggested in the program’s documentation, with exception of a hard clip of the first 13 bases of the reads; this latter step was based on FastQC visualization of the read qualities before trimming, which indicated significantly lower read qualities in those bases as compared to the remainder of the read. Trimmed reads that passed the Trimmomatic selection criteria were then aligned to the rn6 build of the rat genome with Bowtie2/Tophat2, using default parameters. Aligned reads were assigned to gene-level genomic features of the Ensembl 83 annotation set using the Rsubread featureCounts R function with countMultiMappingReads and allowMultipleOverlaps flags set to TRUE. Differential expression between time points was then analysed using the generalized linear modelling approach implemented in edgeR, based on the methods described in detail in that package’s documentation. Dispersion parameters for the regression model were estimated using sequential application of edgeR’s global and tagwise methods after normalization for library size. The design matrix used for the analysis included both indicator variables for time point and batch effects. The latter was employed to account for any variability arising from runs carried out on different dates and different flow cells; aside from time point and batch effects there were no other notable covariates of interest or potential confounders that we felt required consideration in the model. Because the model simultaneously considers the variance of all timepoints, in addition to potential batch effects, it effectively generates an integrated comparison of the entire data set and obviates the need to run repeated ANOVA tests on all ~34,000 individual genes; it also allowed us to more precisely identify differentially expressed genes based on significance values of pairwise comparisons of time points using the edgeR’s likelihood ratio test function glmLRT. In addition, we utilized the edgeR voom function to produce per-sample estimates of read counts per million for tabular visualization and to plot trends over time. Upon identification of the PTPR family as a gene set of interest, we used qRT-PCR to verify the changes detected by RNA-seq, and analysed the results by ANOVA followed by a multiple comparison test (see below). […]

Pipeline specifications

Software tools Minimac3, BOLT-LMM, MAGMA, FUMA, LDSC, METASOFT
Databases GTEx UK Biobank
Application GWAS
Organisms Homo sapiens, Rattus norvegicus
Chemicals Gonadotropin-Releasing Hormone, Tyrosine