Computational protocol: Practical Experience of the Application of a Weighted Burden Test to Whole Exome Sequence Data for Obesity and Schizophrenia

[…] In order to assess the performance of the weighted burden test in a real-world example, it was applied to data produced by the UK10K project (The UK10K Consortium). Two cohorts of subjects were used, both selected from the UK10K exomes arm. The OB cohort consisted of 982 subjects from the Severe Childhood Onset Obesity Project (Wheeler et al.) and the SZ cohort consisted of 1392 subjects with schizophrenia recruited from five British centres. All subjects were British. Although a small proportion of the schizophrenia subjects came from multiply affected pedigrees, each contributing between two and five members, all subjects were treated as unrelated for the purposes of analysis. These two cohorts were used, rather than other subjects included in UK10K, primarily because they represented two groups that were each phenotypically fairly homogeneous and that had similar geographical origins. The "case–case" design for association studies has the advantage that no additional set of controls is required, but it may have the disadvantage that, if allele frequencies differ between the groups, it may be unclear which phenotype is the relevant one (Curtis et al.).
As described elsewhere (The UK10K Consortium), the exome was targeted with the Agilent SureSelect 50Mb V3 exome library, followed by Illumina next generation sequencing with 75 bp paired-end reads. An average read depth of 79x was achieved in the bait regions. Variants were called with samtools/bcftools version 0.1.19-3-g4b70907. GATK Unified Genotyper (v1.6-13-g91f02df) was used only to re-call genotypes at SNP sites discovered by samtools, in order to enable VQSR filtering of the SNP calls.
Three filters were applied to SNPs: LowQual, Description = "Low quality variant according to GATK (GATK)"; MinVQSLOD, Description = "Minimum VQSLOD score [SNPs:-1.9667, truth sensitivity 99.48]"; SnpGap, Description = "SNP within INT bp around a gap to be filtered [10]". All SNP sites that did not fail these filters were marked as PASS. For the purpose of the current analyses, a number of additional constraints were applied to the downloaded VCF files to exclude some variants from analysis. Only single nucleotide variants (SNVs), not indels, were considered. Variants were excluded if they did not have a PASS in the information field, if more than five genotypes were missing in either cohort, or if the heterozygote count was smaller than both homozygote counts in both cohorts. At the level of individual subjects, genotype calls with a genotype quality score below 30 were excluded. […]
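
The exclusion rules above can be sketched in Python as follows. This is only an illustration and not the code used in the study: the function names, the in-memory representation of a genotype call as an (alternate allele count, genotype quality) pair, and the decision to mask low-quality genotypes before counting missing ones are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation of one genotype call:
# (alt_allele_count, genotype_quality); alt_allele_count is 0, 1 or 2,
# or None when the genotype is missing.
Call = Tuple[Optional[int], Optional[float]]

MIN_GQ = 30      # from the text: genotype quality below 30 is rejected
MAX_MISSING = 5  # from the text: more than five missing genotypes excludes a variant


def mask_low_quality(calls: Sequence[Call], min_gq: float = MIN_GQ) -> List[Optional[int]]:
    """Return alt-allele counts, with calls below min_gq treated as missing."""
    masked: List[Optional[int]] = []
    for alt_count, gq in calls:
        if alt_count is None or gq is None or gq < min_gq:
            masked.append(None)
        else:
            masked.append(alt_count)
    return masked


def het_is_rarest(genotypes: Sequence[Optional[int]]) -> bool:
    """True if heterozygotes are fewer than both homozygote classes."""
    counts = Counter(g for g in genotypes if g is not None)
    return counts[1] < counts[0] and counts[1] < counts[2]


def keep_variant(filter_field: str, ref: str, alt: str,
                 ob_calls: Sequence[Call], sz_calls: Sequence[Call]) -> bool:
    """Apply the variant-level exclusion rules described in the text."""
    if filter_field != "PASS":
        return False
    # SNVs only: a single reference base and a single alternate base.
    if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
        return False
    # Assumption: low-quality genotypes are masked before missingness is counted.
    ob = mask_low_quality(ob_calls)
    sz = mask_low_quality(sz_calls)
    if ob.count(None) > MAX_MISSING or sz.count(None) > MAX_MISSING:
        return False
    # Excluded only when heterozygotes are the rarest class in both cohorts.
    if het_is_rarest(ob) and het_is_rarest(sz):
        return False
    return True
```

Masking first means that a genotype with quality below 30 contributes to the missing-genotype count for its cohort. The text does not state the order in which the two rules were applied, so the opposite order would be an equally plausible reading.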

Pipeline specifications

Software tools: SAMtools, bcftools, GATK, PASsiT
Databases: UK10K
Application: WES analysis
Diseases: Obesity