Computational protocol: Discovery and functional prioritization of Parkinson’s disease candidate genes from large-scale whole exome sequencing

Similar protocols

Protocol publication

[…] WES was performed on 1148 PD cases and 503 neurologically healthy controls of European descent. All participants provided written informed consent. Relevant local ethical committees for medical research approved participation in genetic studies. If PD patients were prescreened for known pathogenic mutations, they were excluded for exome sequencing when having such a variant. The cases were diagnosed with PD at a relatively young average age of 40.6 years (range, 6–56 years), of which approximately 37% reported a positive family history. The neurologically healthy controls are on average 48.2 years of age (range, 10–97 years). A more extensive overview of demographic information is reported in Additional file : Figure S8.Due to improvements of the exome sequencing protocol over time, the exome sample libraries were prepared with different capture kits. For this study, three different capture kits were used: Illumina TruSeq (San Diego, CA, USA) (62 Mb target); Roche (Basel, Switzerland) Nimblegen SeqCap (44.1 Mb target); and Agilent (Santa Clara, CA, USA) SureSelect (37.6 Mb target), which captured 96%, 81%, and 71% of the targeted exome at least ten times, respectively (Additional file : Table S12). Exome libraries were sequenced on a HiSeq 2000 (Illumina, San Diego, CA, USA). The Burrows Wheeler Aligner MEM v0.7.9.a [] was used to align the 100-bp paired-end reads to the human reference genome build hg19. We called the single nucleotide variants (SNVs) and insertions/deletions (indels) for all samples simultaneously using Genome Analysis Toolkit (GATK) 3.x [], followed by the exclusion of low-quality variant calls not passing the default GATK filters. Individual genotypes were removed with genotype quality Phred-scores below 40. ANNOVAR [] was applied to annotate the variants with information concerning variant type (valid annotations when Refseq in concordance with UCSC), MAF in the general population, and predictions of the variant’s effect on gene function, implementing CADD []. [...] Considering the worldwide prevalence of 0.041% for PD in the age range of 40–49 years [], we selected rare variants with a MAF < 1% (corresponding to a homozygous frequency of 0.01%) in the European population. Because the specified 0.041% of the population with young-onset Parkinson’s disease (YOPD) is not caused by one shared genetic factor, we expect a homozygous frequency of 0.01% to be an adequate cutoff, which would be able to determine variants present in approximately 25% of the YOPD population. As a comparison to the most common genetic cause of YOPD, parkin [], the most frequent mutation is an exon 3 deletion, which has been identified in 16.4% of YOPD patients []. Using ANNOVAR [], all variants were annotated with MAF information of ESP6500si (European American population) [], 1000 Genomes Project (European population of April 2012 version) [], and the ExAC browser (non-Finish European population) [, ]. When no public allele frequency was available for homozygous variants, the in-house control dataset of 503 individuals was used as a reference for the general population. Homozygous variants were excluded when being common (>1%) in controls or having a relative higher frequency in controls than in cases. KGGseq [] was used to count the number of homozygous variants for the cases versus controls.In addition to the population allele frequency filters, we only selected SNVs and indels affecting the position of the stop codon or located at a splice site (within 2 bp of splicing junction), which are variants expected to result in a loss of gene function. As the aim of this study was to validate our approach to identify high promising PD candidate genes, rather than discovering all putative PD genes present within our WES dataset, we set a conservative selection criteria by only including frameshifts that caused an immediate stopcodon at the position of the indel. Splice-site variants were only considered when being adjacently located to an exon that is coding for amino acids. As a final filter for the homozygous variants, we manually excluded variants that failed GATKVQSR and hard filtering. Quality predictions based on the ExAC database are more adequate, as it includes ~37× more samples than our dataset.For the putative compound heterozygous mutations, both variants should be located within the same transcript and at least one allele should contain a LoF variant. The second variant could be: (1) a LoF variant; or (2) a missense variant that is absent in dbSNP137 [] database and with a CADD score > 20 (predicted to belong to the 1% most deleterious variants of the total genome), indicating a pathogenic effect. The latter two filter criteria should decrease the chance of including benign missense variants. The putative compound heterozygous variants were identified by scoring the number of variants per sample per gene with PSEQ ( The reads of variants located within approximately 200 base pairs were visualized in IGV [] to judge the authenticity of the compound heterozygous variant. When the different variants are located on distinct alleles, the combination of variants was considered a true compound heterozygous mutation.All recessive variants that remained after the filtering procedures were Sanger sequenced to confirm the variant calls generated by the exome pipeline. [...] SKAT-c [] was used to analyze the burden of coding variants for each identified gene. Both rare variants only and the joint effect of common and rare variants were tested. Because variant aggregation tests are prone to coverage differences, capture usage and population stratification, we performed a more stringent individual and variant QC, resulting in a reduced dataset of 1540 samples (1062 cases and 478 controls) covering 268,038 variants. Individuals were excluded when failing gender test, showing evidence of relatedness, having dubious heterozygosity/genotype calls, or being a population outlier. Variants were removed when having a genotype missingness > 5%, a Hardy–Weinberg equilibrium p value < 1e−6 or a p value for non-random missingness by phenotype < 1e−5. Variants were only considered for association analyses if located in a region targeted by all different capture kits.Benign variants have the potential to dilute a true association signal of the combined effect of functional variants in a gene. We therefore annotated variants with ANNOVAR [] to group variants according to their type or predicted pathogenicity. Two subsets of variants were examined: (1) predicted pathogenic variants, including LoF variants and missense mutations that are predicted to be pathogenic by the CADD framework; and (2) missense variants, including amino-acid changing and LoF variants.As suggested by SKAT, we selected a MAF cutoff of 0.018, which is based on the total sample size and separates rare and common variants. Common variants (MAF > 0.018) were pruned using PLINK [] (indep settings 50 5 1.5). Due to confounding factors (usage different capture kits and multiple CEU populations), 20 principle components, 10× coverage, and gender were taken into account as covariates. Both a traditional one-sided burden (assuming all variants to have a harmful effect) and a two-sided SKAT test (allowing variants to be either damaging or protective) were performed. Empirical p values were calculated by comparison of the nominal p value to 10,000 permutations of affection status. Genes with an empirical p value < 0.05 were considered to be significantly associated to PD. [...] Approximately 70% of the participants included in this study have also been included in previous published GWAS [, , ]. To explore the possibility that our candidate genes might also contain common risk variants increasing the risk to develop PD, next to the identified LoF variants with assumed high penetrance, we searched for GWAS loci within 1 Mb upstream and downstream of the gene of interest using the recent PD meta-analysis through []. Significant associations and suggestive p values < 1e-4 were considered. To understand the underlying linkage disequilibrium structure, LocusZoom [] was applied to visualize the European 1000G recombination events for the candidate genes that were closely located to a GWAS locus. […]

Pipeline specifications

Software tools BWA, GATK, ANNOVAR, CADD, KGGSeq, IGV, SKAT, PLINK, LocusZoom
Databases PDGene
Applications WGS analysis, WES analysis, GWAS
Organisms Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans