Computational protocol: Large scale genomics unveil polygenic architecture of human cortical surface area

Protocol publication

[…] The studies were genotyped on different commercial arrays. Standard genome-wide association quality control measures were applied to each study individually using the Plink toolset. Samples missing >5% of SNPs were excluded, as were SNPs with a minor allele frequency of <1% or failing a test of Hardy–Weinberg equilibrium (P < 1 × 10⁻⁶). Individual samples showing an over- or underabundance of heterozygosity (>5 s.d. from the mean) were labelled as poor quality and also excluded from subsequent analyses. Population stratification and ancestry were assessed against a reference sample consisting of individuals from HapMap III and 1000 Genomes via principal component analysis implemented in the software package EIGENSOFT.

Furthermore, to ensure that all individuals were unrelated, functions available in the software package GCTA were used to estimate kinship values from SNP genotypes for all pairs of individuals in the combined cohort. One member of each pair of individuals with an estimated relatedness above a threshold of either 0.025 or 0.1 was excluded. Using the more stringent threshold of 0.025, 575 individuals were removed, leaving a total of 2,364 individuals for the subsequent analyses. In this combined cohort of European ancestry with minimal relatedness between subjects (GRM<0.025), 52% of the individuals were female; the subjects were aged 47±24 years (range=3–90 years); and 273, 128, 131, 147 and 66 subjects were diagnosed with mild cognitive impairment, Alzheimer's disease, schizophrenia, bipolar disorder and other psychotic disorders, respectively. For the less stringent relatedness threshold of 0.1, 241 individuals were removed, leaving a total of 2,698 individuals for the subsequent analyses (GRM<0.1).

To maximize the information present in the data and allow for comparison across multiple samples genotyped on different platforms, genotype imputation was performed using the software packages MaCH and Minimac.
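The relatedness filter described above (drop one member of every pair whose estimated relatedness exceeds a cutoff) can be illustrated with a small sketch. This is not GCTA's own procedure: the greedy rule used here (repeatedly remove the sample with the most remaining related partners), the function name `prune_related`, and the toy sample IDs and relatedness values are all assumptions made for illustration.

```python
def prune_related(sample_ids, relatedness, cutoff):
    """Greedily remove samples until no remaining pair exceeds `cutoff`.

    `relatedness` maps (sample_a, sample_b) tuples to an estimated
    relatedness value. The greedy rule is an illustrative assumption,
    not the algorithm implemented in GCTA.
    """
    # For each sample, record the partners it is "too related" to.
    related = {s: set() for s in sample_ids}
    for (a, b), value in relatedness.items():
        if value > cutoff:
            related[a].add(b)
            related[b].add(a)

    removed = set()
    while any(related.values()):
        # Remove the sample involved in the most offending pairs;
        # ties are broken by sample ID so the result is deterministic.
        worst = max(related, key=lambda s: (len(related[s]), s))
        removed.add(worst)
        for other in related.pop(worst):
            related[other].discard(worst)

    kept = [s for s in sample_ids if s not in removed]
    return kept, removed


# Hypothetical toy cohort: four samples, three estimated pairwise values.
samples = ["A", "B", "C", "D"]
pairs = {("A", "B"): 0.30, ("B", "C"): 0.06, ("C", "D"): 0.05}

kept_strict, removed_strict = prune_related(samples, pairs, cutoff=0.025)
kept_loose, removed_loose = prune_related(samples, pairs, cutoff=0.1)
print(kept_strict, kept_loose)
```

As in the protocol, the stricter cutoff removes more samples than the looser one. The reported counts are also mutually consistent: 2,364 + 575 and 2,698 + 241 both give a starting cohort of 2,939 individuals.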
A quality control metric (r²) was provided by Minimac, and a threshold of r²>0.5 was used to declare successful imputation. [...] We partitioned the variance explained by all of the SNPs into low- and high-conserved regions of the whole genome based on conservation annotation. We obtained a conservation annotation database from the UCSC Genome Browser hg19 assembly. The conservation scores were derived from alignments of placental mammals to the human genome. PhastCons is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignments. We assigned weights to the conservation scores based on the LD information by applying the pairwise LD matrix to the vector of phastCons scores. We expected SNPs with the LD-weighted conservation annotation to show more consistent and less noisy association signals. After the LD weighting, 48,523 of the ∼2.4 million SNPs had no scores and were eliminated from the subsequent analysis. We selected the median weighted score as the threshold, which partitions the remaining SNPs evenly into low- and high-conserved sets (∼50% each). We then estimated the proportion of variance explained by the low- and high-conserved genomic regions. The results are reported separately for the GRM<0.025 sample and the GRM<0.1 sample. […]
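The LD weighting and median split described above can be sketched as a matrix-vector product followed by a threshold. The excerpt does not specify how the LD matrix was defined, whether the weights were normalized, or how SNPs ended up without scores, so the choices below are assumptions: LD is given as a dense matrix of pairwise values, missing phastCons scores are represented as `None` and skipped, and a SNP has no weighted score when none of the SNPs it is in LD with carries a score. The SNP IDs and values are hypothetical.

```python
from statistics import median


def ld_weighted_scores(ld, scores):
    """Apply a pairwise LD matrix to a vector of conservation scores.

    Each weighted score is the sum over SNPs j of ld[i][j] * scores[j],
    skipping SNPs without a score (assumption). Returns None for a SNP
    when no SNP in LD with it has a score.
    """
    weighted = []
    for row in ld:
        terms = [r * s for r, s in zip(row, scores) if s is not None and r > 0]
        weighted.append(sum(terms) if terms else None)
    return weighted


def median_partition(snp_ids, weighted):
    """Drop unscored SNPs, then split the rest at the median weighted score."""
    scored = [(snp, w) for snp, w in zip(snp_ids, weighted) if w is not None]
    threshold = median(w for _, w in scored)
    low = [snp for snp, w in scored if w <= threshold]
    high = [snp for snp, w in scored if w > threshold]
    return low, high, threshold


# Hypothetical toy data: five SNPs in three LD blocks.
snps = ["rs1", "rs2", "rs3", "rs4", "rs5"]
phastcons = [0.75, 0.25, None, 0.5, None]
ld_matrix = [
    [1.0, 0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

weighted = ld_weighted_scores(ld_matrix, phastcons)
low, high, threshold = median_partition(snps, weighted)
print(weighted, low, high, threshold)
```

In this toy example rs3 has no score of its own but inherits one through LD with rs4, while rs5 has no scored neighbour and is dropped, mirroring the elimination of unscored SNPs before the median split.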

Pipeline specifications

Software tools: PLINK, GCTA, EIGENSOFT, Minimac, PHAST
Databases: UCSC Genome Browser
Applications: Population genetic analysis, GWAS, Genome data visualization
Organisms: Homo sapiens