Computational protocol: Rare variants of small effect size in neuronal excitability genes influence clinical outcome in Japanese cases of SCN1A truncation positive Dravet syndrome

Protocol publication

[…] Whole exome sequencing was performed by array capture of 50 Mb of exome target sequence using the Agilent SureSelectXT Human All Exon V5 enrichment kit, followed by paired-end sequencing (100 bases per read) on an Illumina HiSeq 2500. Sequences were trimmed with Trimmomatic v.0.32 (Bolger et al., 2014) and then aligned to the human genome (GRCh37) using Burrows-Wheeler Aligner v.0.7.9a (Li & Durbin, 2010). Base quality recalibration, indel realignment, and calling of SNVs and small indels were performed using the Genome Analysis Toolkit v.3.3-0, as previously described (McKenna et al., 2010). Variants were annotated using SnpEff v.3.4 (Cingolani et al., 2012) with gene annotations from Ensembl release 73. Previously known variants were annotated with their allele frequencies from the 1000 Genomes Project, the NHLBI GO Exome Sequencing Project (ESP) 6,500 samples release, and the Exome Aggregation Consortium (ExAC) release 0.3. [...]

We used two approaches to identify variants associated with clinical outcome. In the first approach, we focused on the genotypes of common (i.e., MAF > 1%) variants and performed Fisher exact tests on genotype counts in the mild and severe phenotypic categories. In the second approach, we searched for common and/or rare variants aggregated in and around genes throughout the genome in our exome sequence data. As described in Lee et al., we adaptively combined a burden test and SKAT (the SNP-set Sequence Kernel Association Test; Wu et al.) to perform gene-based association tests on the WES data, assessing the joint effects of multiple SNPs in a region on a binary outcome phenotype (mild or severe).
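The first approach above (a Fisher exact test on genotype counts per phenotypic category) can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code. The 2 x 3 layout (mild/severe by three genotype classes), the function names, and the example counts are all assumptions made for illustration. The p-value sums the probabilities of all tables with the same margins that are no more probable than the observed one.

```python
from math import factorial


def table_probability(table):
    """Probability of one contingency table given fixed row and column totals."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    numerator = 1
    for total in row_sums + col_sums:
        numerator *= factorial(total)
    denominator = factorial(sum(row_sums))
    for row in table:
        for cell in row:
            denominator *= factorial(cell)
    return numerator / denominator


def fisher_exact_2x3(mild, severe):
    """Two-sided exact p-value for genotype counts (hom-ref, het, hom-alt)
    observed in the mild and severe groups (layout is an assumption)."""
    mild, severe = list(mild), list(severe)
    col = [m + s for m, s in zip(mild, severe)]
    n_mild = sum(mild)
    p_obs = table_probability([mild, severe])
    p_value = 0.0
    # Enumerate every first row compatible with the fixed margins.
    for a in range(min(col[0], n_mild) + 1):
        for b in range(min(col[1], n_mild - a) + 1):
            c = n_mild - a - b
            if c > col[2]:
                continue
            p = table_probability(
                [[a, b, c], [col[0] - a, col[1] - b, col[2] - c]]
            )
            # Small tolerance so ties with the observed table are counted.
            if p <= p_obs * (1 + 1e-9):
                p_value += p
    return min(p_value, 1.0)


# Hypothetical genotype counts for one common variant.
print(fisher_exact_2x3(mild=[9, 4, 1], severe=[3, 6, 5]))
```

In practice one such test would be run per common variant, so the resulting p-values would still need a multiple-testing correction; the excerpt does not say which one was used.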
Adjusting for four covariates (age of seizure onset, time since onset [to present for mild cases, to first diagnosis of severe for severe cases], gender, and motor delay), we obtained parameter estimates and residuals for the null model, which assumes there is no association between genetic variables and outcome phenotypes. In addition, for each region, SKAT analytically calculates a p-value for association.

We also investigated patterns of rare variation in Klassen et al.'s set of 237 genes related to neurotransmission. We filtered our variants to retain those that were predicted to alter protein function (i.e., nonsynonymous, stop-gain, stop-loss, frameshift, and splice-junction mutations) and that were present at ≤1% frequency in public databases. To filter out apparent false positive heterozygous calls, as judged by a skew in allelic balance, we also applied a lower limit of 1% probability on the ratio of observed reads from the two alleles under a binomial model.

Choice of a pathogenicity predictor. Several pathogenicity prediction methods are available, each of which assesses the impact of a given substitution in a different way. Our approach here was to use a statistical procedure to determine the single best pathogenicity predictor in order to avoid issues associated with multiple testing procedures. In particular, because we assess pathogenicity across 237 genes (see below), we wanted to avoid the complications associated with combining results from multiple predictors and/or the significance corrections required when performing multiple tests. The Grantham score was one of the first implemented to predict the effect of an amino acid substitution. This score takes into account protein properties that correlate best with residue substitution frequencies. Subsequently, we considered a variety of strategies to predict pathogenicity.
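The allelic-balance filter described above can be illustrated with a short sketch. The excerpt does not specify exactly how the 1% limit was computed, so this example assumes a two-sided binomial test against an expected 50:50 read ratio for a true heterozygote; the function names and read counts are hypothetical.

```python
from math import comb


def allelic_balance_p(ref_reads, alt_reads):
    """Two-sided binomial probability of the observed read split,
    assuming each read comes from either allele with probability 0.5."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    # Exact probabilities comb(n, k) / 2**n for every possible alt count.
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    p_obs = pmf[alt_reads]
    # Sum all outcomes that are no more likely than the observed one.
    total = sum(p for p in pmf if p <= p_obs * (1 + 1e-9))
    return min(total, 1.0)


def passes_allelic_balance(ref_reads, alt_reads, min_prob=0.01):
    """Keep a heterozygous call only if its read split is not too skewed."""
    return allelic_balance_p(ref_reads, alt_reads) >= min_prob


# Hypothetical heterozygous calls as (reference reads, alternate reads).
for ref, alt in [(12, 8), (20, 1), (30, 30)]:
    print(ref, alt, passes_allelic_balance(ref, alt))
```

Under this assumed definition, a call supported by 20 reference reads and 1 alternate read would be discarded, while a 12:8 split would be kept.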
The phyloP score is derived from a statistical approach based on non-neutral substitution rates on mammalian phylogenies. SIFT is also based on protein conservation. Combined Annotation Dependent Depletion (CADD) uses a machine learning approach to integrate multiple annotations. Using protein features, PolyPhen-2 predicts damaging effects of missense mutations by comparing a property of the wild-type allele with the corresponding property of the mutant allele.

Because these pathogenicity predictors were assessed over the entire exome, the quality of their predictions may vary depending on the family of proteins under study. Indeed, from our previous work on Dravet syndrome, we found PolyPhen-2 to be the most reliable predictor. To test this formally, we compared the performance of the other aforementioned pathogenicity scores against PolyPhen-2 using the area under the curve (AUC) of the receiver-operating characteristic. Our data were taken from a set of individuals with epilepsy whose pathogenic mutations were in either the SCN1A gene (n = 122) or the SCN8A gene (n = 111). These were matched against the list of variants in the ExAC database (n = 346 and n = 248, respectively). Using the scores of the cases and controls for these sodium channel mutations, the AUC for PolyPhen-2 was 0.798, while the AUCs for the Grantham score and SIFT were much lower (< 0.6). For CADD and phyloP, the AUCs were 0.759 and 0.757, respectively. To determine the statistical significance of these differences, we used the DeLong test for paired data sets implemented in the pROC R package. The p-values for the one-sided tests comparing PolyPhen-2 with CADD and phyloP were 0.025 and 0.038, respectively. A bootstrap approach produced essentially the same p-values. Hence, in the following analyses we use PolyPhen-2 to predict whether variants were possibly or probably damaging or benign. […]
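The predictor comparison above can be sketched in a few lines. This is an illustration only: the authors used the DeLong test from the pROC R package, whereas the code below computes the AUC directly from case/control score pairs and uses a simple percentile-style paired bootstrap as a stand-in. All function names and scores are hypothetical, and higher scores are assumed to mean "more likely pathogenic".

```python
import random


def auc(case_scores, control_scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_auc_p(cases_a, controls_a, cases_b, controls_b,
                    n_boot=2000, seed=0):
    """One-sided bootstrap p-value for AUC(a) > AUC(b).

    Both predictors must score the same variants in the same order, so
    each resample draws the same indices for predictor a and predictor b.
    """
    if len(cases_a) != len(cases_b) or len(controls_a) != len(controls_b):
        raise ValueError("paired predictors must score the same variants")
    rng = random.Random(seed)
    n_case, n_ctrl = len(cases_a), len(controls_a)
    not_better = 0
    for _ in range(n_boot):
        ci = [rng.randrange(n_case) for _ in range(n_case)]
        ki = [rng.randrange(n_ctrl) for _ in range(n_ctrl)]
        diff = (auc([cases_a[i] for i in ci], [controls_a[i] for i in ki])
                - auc([cases_b[i] for i in ci], [controls_b[i] for i in ki]))
        if diff <= 0:
            not_better += 1
    return not_better / n_boot


# Hypothetical scores from two predictors on the same six variants.
cases_a, controls_a = [0.95, 0.80, 0.70], [0.10, 0.40, 0.75]
cases_b, controls_b = [0.60, 0.30, 0.90], [0.20, 0.50, 0.70]
print(auc(cases_a, controls_a), auc(cases_b, controls_b))
print(bootstrap_auc_p(cases_a, controls_a, cases_b, controls_b))
```

With data sets of the size reported above, the pairwise AUC loop stays cheap; for much larger sets a rank-based formulation would be the usual choice.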

Pipeline specifications

Software tools: Trimmomatic, BWA, GATK, SnpEff, SKAT, PHAST, SIFT, CADD, PolyPhen
Applications: Phylogenetics, WGS analysis, WES analysis, GWAS
Diseases: Epilepsy; Epilepsies, Myoclonic
Chemicals: Sodium