Computational protocol: Genome wide Polygenic Burden of Rare Deleterious Variants in Sudden Unexpected Death in Epilepsy

Similar protocols

Protocol publication

[…] All epilepsy samples were sequenced using either Agilent's SureSelect Human All Exon V1 (38 Mb, n = 42) and SureSelect Human All Exon V5 (50 Mb, n = 56) or Illumina's Nextera Rapid Capture Exome kit (37 Mb, n = 16). For the disease control samples, NimbleGen's SeqCap EZ and Illumina's TruSeq Exome capture technology were also used. Sequencing was performed on Illumina HiSeq2500 or GAIIx sequencing systems.We used a multi-sample joint calling strategy across all SUDEP cases, epilepsy and disease control samples to mitigate problems caused by the heterogeneity of sequence capture kits. One major confound in case–control variant burden analyses can arise when either single-sample calling, or multi-sample calling in different batches, is used to generate the variant calls. Standard practice in single-sample calling is to call non-reference alleles only; calling of all sites is possible but impractical. In order to merge such single-sample calls into one dataset, variants not called in one sample need to be assumed as either homozygous reference, or set to missing. In contrast, multi-sample calling routinely calls homozygous reference genotypes, but only for variants with at least one non-reference allele in the entire sample. Our multi-sample joint calling strategy across all cases and controls as a set enabled us to distinguish between homozygous reference and missing genotypes (), and provides the basis for standardized QC across aggregated data, essential for case–control designs (). Details of the variant calling pipeline are given in and Supplementary Method 5. [...] Aiming to estimate the burden of mutations at genome-wide level, we chose thresholds for variant QC metrics to maximize specificity over sensitivity, accepting loss of power to detect a significant association in favour of a reduced type I error rate (Supplementary Method 6.1). Individual-level QC filtering generated samples with similar technical sequencing metrics, including overall call rate, singleton rate, and per-individual heterozygosity (Supplementary Method 6.2). After inspection of the population substructure by multidimensional scaling analysis, as implemented in PLINK (), only samples of clear European ancestry were retained (Supplementary Fig. 1).For the genome-wide burden analysis of variants in SUDEP, we focussed on variants with the highest likelihood to be pathogenic by selecting rare (minor allele frequency (MAF) ≤ 0.5%), protein-changing variants (Supplementary Method 8). The variant selection procedure for the genome-wide burden analysis is detailed in the Supplementary Method 8.1. We chose this strategy because variant pathogenicity is inversely correlated with the frequency of the non-reference allele in the general population (), with prediction of variant deleteriousness being more reliable for exonic and splice-site variants than for non-coding variants (). Using the selected variants, we then assigned to each individual an overall ‘burden score’, calculated by summing the scores for deleteriousness of every selected variant carried per individual, where the deleteriousness of each variant was determined using the Combined Annotation Dependent Depletion method (CADD v1.1) (, Supplementary Method 7). The CADD method has been proven to achieve high sensitivity in identifying known pathogenic variants. To minimize batch effects between the different WES samples and cohorts, only variants sequenced in more than 80% of the SUDEP cases and the two control cohorts were retained. This strategy was enabled by our joint calling strategy across all cases and controls, and ensured that only variants sequenced in the majority of each of the testing groups were used to calculate the per-individual burden scores. This batch correction method is equivalent to a cross-sample coverage-based correction method, and is not equivalent to the filtering of poorly genotyped variants aimed at removing unreliable genotypes. The threshold of 80% was selected to obtain the lowest variability of all observed per-individual burden scores (Supplementary Method 8.1 and Supplementary Table 5).We employed the two-tailed Wilcoxon rank-sum test, as implemented in Stata (, to compare per-individual burden scores and the number of variants per individual of the SUDEP cases against those of the two control cohorts, as well as epilepsy controls versus disease controls. The threshold for statistical significance was corrected for six tests using the Bonferroni method (two burden tests for three testing groups; α = 8.3 × 10− 3). [...] For the gene-based association analyses, we performed association tests based on comparison of the numbers of non-reference protein-changing variants alleles exclusive to cases versus those exclusive to controls, a well-established unique variant approach (). This approach, together with refining based on the predicted deleteriousness (scaled CADD score ≥ 15, Supplementary Method 7), maximizes the power of the gene-based association tests (). Variant and individual-level QC was performed as for the genome-wide burden analysis (Supplementary Method 6). To help dissect out genes more likely conferring risk to SUDEP than to epilepsy, we excluded variants present in the epilepsy control cohort. The variant selection procedure is detailed in the Supplementary Method 8.2.Empirical data show that the performance of rare variant association methods depends upon the underlying assumption of the relationship between rare variants and complex traits (). We employed a one-tailed burden test of an increased rare allele rate in cases (described in the supplementary information of ) and the two-tailed C-alpha test (), which allows for risk and protective variants, as implemented in PLINK/SEQ ( An adaptive permutation procedure was used to assess P-values for all association tests (swapping of phenotype label across individuals; genes dropped from further permutation if clearly not associated). We used the PLINK/SEQ estimate of the smallest achievable empirical P-value for a gene (I-value) to adopt an adjusted Bonferroni correction for multiple testing, by correcting only for the number of genes for which there was power to detect association (I < 10− 3) (). Based on the observed cumulative allele count in the SUDEP cohort for the tested genes, and a Bonferroni-corrected significance threshold, the epilepsy controls did not provide sufficient power to detect associations, and were not used in this component of the study. Confirmatory Sanger sequencing in the SUDEP samples was performed for variants in genes surpassing the adjusted threshold for significance. […]

Pipeline specifications

Software tools Variant Calling Pipeline, PLINK, CADD, Stata, PLINK/SEQ
Applications WGS analysis, WES analysis, GWAS
Organisms Homo sapiens
Diseases Epilepsy