Similar protocols

Protocol publication

[…] QC steps implemented in this pipeline. The steps automated here mostly follow the notes on QC developed by MikeWeale and also calls some R scripts described in his notes during the QC pipeline. It is assumed that input files are already in PLINK format. shows a complete QC pipeline that includes combining data from multiple chromosomes and studies and two portions of the QC pipeline. There are two scripts, QC.py which takes advantage of PLINK calls and also PCA.py that does principal component analysis to investigate population stratification.Gender mismatches    The optional first step in the automated pipeline is a check for gender mismatch using the PLINK ‘-- check-sex’ command. This command compares the sex reported in the .fam file and the sex imputed from the X chromosome inbreeding coefficients. This step automatically removes individuals where problems are identified. The step was made optional because our dataset of interest is matched with phenotype/clinical data of higher accuracy. This step can be turned off using the parameter file as described below in the Operation section.Thresholds    The next steps in this pipeline include checking and applying thresholds for minor allele frequency (MAF), missingness for each individual, and missingness of markers. Minor allele frequency filtering is important because rare genotypes will not show up as often and thus will have less evidence in a GWAS and the calls will be less certain and it is also difficult to detect associations with them. Missingness can lead to false associations if it is non-random with respect to phenotypes or genotypes. Single nucleotide polymorphism (SNP) missingness is the complement to individual missingness and is correlated with SNP quality from the original genotyping assay. Missingness is investigated using PLINK ‘--missing’ and plots are generated as described by Weale . All the plots that are generated during the process are compressed at the end in order to facilitate downloading them when the process is performed remotely on a cluster.    In order to attempt to retain the largest number of markers and individuals that pass QC there is an option to do a two-tiered missingness by individuals filtering. We noticed during testing that this could sometimes lead to final datasets with higher numbers of both. If a value is supplied to the #MIND1 parameter (described below in Operations), then this (expectedly non-stringent) threshold for PLINK ‘--mind’ is used first and the more stringent #MIND is applied in the same step as the PLINK ‘--geno’ and ‘--maf’ thresholds for missingness of markers and minor allele frequency, respectively. If a major reduction in the number of markers or individuals is found during these steps, investigation of the generated graphs can help adjust these thresholds. See notes for more information.    Some reasoning suggests that a minor allele frequency threshold should be set to 10/n where n is the number of markers. The #MAF parameter can be set to ‘na’ which will use 10/n as a threshold or a threshold value can be explicitly given, such as ‘.01’.Heterozygosity    Individuals resulting from random mating within a population should have predictable heterozygosity (H) values. H is a measure of the number of loci in an individual that are heterozygous. Departure from expected H values can signify DNA quality issues (high H) or samples from a different population (low H). This step can be turned off by not supplying the ‘#HET’ parameter in the parameter file. As long as the parameter is listed, this step will be done. H and the inversely related F (Method-of moments F coefficient estimate) are investigated using PLINK ‘--het’. F is calculated as the ([observed homozygous count] - [expected count])/([total observations] - [expected count])) where the expected count is calculated from an imputed MAF. A histogram of F values is generated for manual investigation.    If values for ‘#FMIN’ and ‘#FMAX’ are supplied in the parameter file then samples with an F value below ‘#FMIN’ and above ‘#FMAX’ are removed. If these values are not supplied then samples above or below three standard deviations of the mean H are removed, as suggested be Anderson et al. .Hardy-Weinberg equilibrium (HWE)    Markers out of HWE can indicate that there were genotyping errors. However, a strong association signal can also result in deviations from HWE. So here only variants from control samples are checked for deviations from HWE. PLINK ‘--hardy’ is used to generate HWE p-values and a Q-Q plot of the log-P-values of the markers for the controls is generated for manual investigation. A p-value threshold is supplied in the parameter file to remove markers with a p-value lower than expected.Cryptic relatedness    Cryptic relatedness (CR) is when pairs of individuals are closely related and can lead to false positive or negative correlations when subjects are treated as independent. The PLINK ‘--genome’ command can estimate relatedness, but is quite slow when there are a large number of markers in a dataset. Therefore, markers in high linkage disequilibrium (LD) are removed first to thin the data. This is done using PLINK ‘--indep-pairwise’ with parameters suggested by Weale . This creates a pruned data set that contains markers with a minimal LD (which is caused by limited recombination occurring between two or more loci and results in a non-random association between the loci). Furthermore, only assayed markers are used in this step (i.e. not imputed markers). Using a pruned data set is advantages because CR methods work best when no LD is assumed between markers and it also reduces the input size and in turn greatly reduces the computation time.    PLINK ‘--genome’ estimates relatedness of all pairs of samples and reports identify by decent (IBD, a measure of whether identical regions of two genomes were inherited from the same ancestry) in the PI_HAT (actually, proportional IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)) column of the result file. A PI_HAT value close to 1 would indicate a duplicate sample. The threshold 0.1875 represents the half-way point between 2nd and 3rd degree relatives and is a common cut-off to use. Of each pair of related individuals, the one with the greater proportion of missing SNPs is dropped from the final dataset.Principal component analysis (PCA)    Generally, PCA transforms a data matrix (such as a GWAS n x m matrix where n in is the number of individuals and m is the number of markers and each element in the matrix represents the scaled genotype for the particular individual at that particular marker) so that the successive principal components are not correlated. The number of PCs is less than or at most equal to the original number of columns and the first PC explains the largest variance in the genotype data. Traditionally, PCA is used to (1) screen the study population for heterogeneous ethnic backgrounds and (2) to correct for potential population stratification (the difference of allele frequencies in ancestral subpopulations). It can be seen in where HapMap , data with individuals with known ancestry are included in the PCA, when plotting the first two PCs subpopulations cluster together. HapMap is an international project that aims to identify genetic similarities and differences between populations.    As with the cryptic relatedness step, a thinned dataset created with PLINK and starting from assayed markers only is used to calculate PCs. The SMARTPCA tool is used to calculate PCs from this thinned dataset and identify outliers for removal. The PCs can then be used for further corrections in analysis models. Data Formats. The input GWAS data are expected to be in PLINK bfile format. The input data will have three files associated to it with .bed, .bim, and .fam file extensions. The .bed file is a binary file that contains the genotype information for all individuals ( https://www.cog-genomics.org/plink2/formats#bed). The .bim file is a mapping file giving information on each marker ( https://www.cog-genomics.org/plink2/formats#bim). The .fam file gives information on each individual ( https://www.cog-genomics.org/plink2/formats#fam).This pipeline utilizes information that was not provided in the original PLINK files and therefore the phenotype is always provided in an alternate phenotype file. PLINK ‘–pheno’ is used to provide the phenotype file and PLINK ‘–pheno-name’ is used to provide the phenotype name which also corresponds to the header of the column in the phenotype file. The first two columns in the phenotype file must have the column headers ‘FID’ and ‘IID’ respectively. ‘FID’ is the family ID or ‘0’ if not used and ‘IID’ is the individual ID that corresponds to the ‘IID’ values in the .fam file ( https://www.cog-genomics.org/plink2/input#pheno). […]

Pipeline specifications

Software tools PLINK, EIGENSOFT
Applications Population genetic analysis, GWAS