Computational protocol: Impact of pre-imputation SNP-filtering on genotype imputation results

Protocol publication

[…] We studied 100 German individuals recruited from a geographically confined area in Saxony and Thuringia. The individuals are a subset of the cohort of an ongoing study on the genetics of dyslexia [],[]. Sixty-five individuals were male. Ethical approval was obtained from the Ethics Committee of the University of Leipzig. The regional school council Leipzig approved access to study participants in schools. Written informed consent was obtained from each parent. Individuals were genotyped using the Genome-Wide Human SNP Array 6.0 (Affymetrix, Inc., Santa Clara, California, USA). Genomic DNA from these individuals was extracted from blood and saliva using standard silica-based methods and extraction as described by the manufacturer (DNA Genotek, Ottawa, Ontario, Canada and Qiagen, Hilden, Germany), respectively. Integrity of genomic DNA was verified by agarose gel electrophoresis. Array processing was carried out as a service by the genome analysis centre (Helmholtz-Zentrum München, Munich, Germany). Genotypes were called using the Birdseed version 1 algorithm [] implemented in the Affymetrix Genotyping Console software version 4.0, with standard settings. Genotype calling was improved by including additional reference individuals. The overall call rate ranged from 94.6% to 99.3%, with a mean of 98.3% and a median of 98.45%. Included samples passed all technical array-wide quality control criteria as implemented in Genotyping Console (Bounds, Contrast QC, Contrast QC (Random), Contrast QC (Nsp), Contrast QC (Nsp/Sty Overlap), and Contrast QC (Sty) had to be larger than 0.4).
Only unrelated individuals were studied, i.e. the estimated proportion of identity by descent (PI_HAT) was below 0.05 for all pairs of individuals, as calculated by PLINK [] on the basis of our genome-wide data. Analysis of population stratification was based on 30,501 independent SNPs. Applying the EIGENSTRAT method [] revealed no evidence for population stratification.
Clustering of the first principal components of our samples resulted in a homogeneous distribution that partly overlaps with that of HapMap individuals of Caucasian descent (HapMap CEU, Additional file: Figure S1). Fst indicated a close relation between HapMap CEU and our sample (Fst = 0.00062, calculated with the software Arlequin 3.5.1.3 []).
[...] For imputation of masked SNPs, we applied the software tools MaCH1.0 [] and IMPUTE v2.1.2 [], following the best-practice guides of the authors. The genotype data formats required by MaCH and IMPUTE were created with “fcGENE”, a format-conversion tool developed by our group. This tool is written in C/C++ and is freely available on SourceForge [].
For imputation with MaCH1.0, 100 iterations of the Hidden Markov Model (HMM) sampler were applied with a maximum of 200 randomly chosen haplotype samples. MaCH commands are provided as supplemental material. For imputation with the HapMap reference (HapMap3 NCBI Build 36, CEU panel), we applied the recommended two-step imputation process [],[]. More precisely, model parameters of the underlying HMM were estimated by running the “greedy” algorithm. In the first step, both genotyping error rates and cross-over rates were estimated. The second step uses these parameters to impute all SNPs of the reference panel. When comparing imputation quality between the different filtering scenarios with the help of our newly proposed measures, we used the posterior probabilities contained in the MaCH output files with the extension “.mlgeno”.
As recommended by the IMPUTE developers [],[],[],[], we performed segmented imputation of chromosome 22 by defining genomic intervals of approximately 5 Mb. To avoid margin effects of chromosome segmentation, we applied the option --buffer <250>, so that IMPUTE2 uses an internal buffer region of 250 kb on either side of each analysis interval [].
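As a rough illustration of the segmentation described above, the following Python sketch splits a chromosome range into consecutive intervals of about 5 Mb and computes the 250 kb flanking region on each side of an interval. The function names and the example coordinates are hypothetical and are not taken from the original analysis; the commands actually used are those given in the supplemental material.

```python
def chunk_intervals(start_bp, end_bp, chunk_bp=5_000_000):
    """Split the closed range [start_bp, end_bp] into consecutive,
    non-overlapping intervals of at most chunk_bp base pairs."""
    if end_bp < start_bp:
        raise ValueError("end_bp must not be smaller than start_bp")
    intervals = []
    lower = start_bp
    while lower <= end_bp:
        upper = min(lower + chunk_bp - 1, end_bp)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


def add_buffer(interval, buffer_bp=250_000):
    """Return the interval extended by buffer_bp on both sides.
    The lower bound is clipped at position 1."""
    lower, upper = interval
    return (max(1, lower - buffer_bp), upper + buffer_bp)


# Illustrative coordinates only (assumption, not the study's values).
for interval in chunk_intervals(14_000_000, 49_000_000):
    print(interval, "buffered:", add_buffer(interval))
```

SNPs in the buffer inform the haplotype model for an interval but are reported only by the interval that actually contains them, which is why neighbouring intervals do not need to overlap in their output.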
CEU HapMap references (HapMap3 NCBI Build 36) downloaded from the official IMPUTE website [] were used for the imputation scenarios requiring a reference panel. More precisely, genetic recombination rates, reference haplotypes and the legend file were used as provided on the website.
Command options and parameters used to run MaCH and IMPUTE2 are provided in detail in the supplemental material. The settings described above were used in all scenarios considered. Reference files used for MaCH and IMPUTE2 contained exactly the same SNPs (M = 20,085). […]
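The statement that both reference sets contain exactly the same SNPs can be checked with a short script. The sketch below is not the authors' procedure. It assumes an IMPUTE-style legend file with one header line and the SNP identifier in the first column, and a plain SNP list for MaCH with one identifier per line; the function names and the example identifiers are hypothetical.

```python
def read_legend_ids(lines):
    """Return SNP identifiers from legend-style lines.
    Assumption: the first non-empty line is a header and the
    identifier is the first whitespace-separated column."""
    rows = [line.split() for line in lines if line.strip()]
    return [row[0] for row in rows[1:]]


def read_snp_list(lines):
    """Return SNP identifiers from a list with one identifier per line."""
    return [line.split()[0] for line in lines if line.strip()]


def compare_snp_sets(first_ids, second_ids):
    """Summarise the overlap between two collections of SNP identifiers."""
    first, second = set(first_ids), set(second_ids)
    return {
        "shared": len(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


# Toy input standing in for the real reference files (hypothetical data).
legend_lines = ["rsID position a0 a1", "rs1 100 A G", "rs2 200 C T"]
mach_lines = ["rs1", "rs2"]

summary = compare_snp_sets(read_legend_ids(legend_lines),
                           read_snp_list(mach_lines))
print(summary)
```

With real files, the same functions can be fed the result of `open(path)`, since iterating over a file object yields its lines. Both "only" lists being empty indicates identical SNP sets.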

Pipeline specifications

Software tools: PLINK, Arlequin, IMPUTE, fcGENE
Applications: Population genetic analysis, GWAS