Computational protocol: An exome sequencing based approach for genome-wide association studies in the dog

Similar protocols

Protocol publication

[…] After capture, indexed libraries were pooled separately into groups of eight. Pools were run on an Illumina HiSeq2000 platform, with a target coverage of 60× depth to ensure maximal coverage of the 53 Mb targeted by the exome-1.0 WES design and high quality variant discovery. The reads were aligned to the CanFam3.1 reference genome using BWA 0.5.9. Duplicate reads were marked with Picard (http://broadinstitute.github.io/picard/), and the data were sorted. Using the GATK 3.5-0, local realignments were generated and base quality scores were recalibrated per GATK Best Practices. [...] GATK 3.5-0 was used to realign the reads around indels and HaplotypeCaller was used to call SNPs on the resulting BAM files. The resulting VCF files were converted to PLINK (ped and map) format file using vcftools_0.1.13 and bi-allelic SNPs were considered (indels were not retained). In total, 83,869 unique SNPs were further used for Poodles, 66,809 for Labrador retrievers and 66,000 for Golden retrievers. [...] Both the LD and power analyses were run from within R (version 3.3.1), using R-studio (version 0.99.902). The LD analysis was performed with Haploview 4.2. The following filters were used: MAF of 0.05, Hardy-weinberg p = 0, exclude markers with >0.5 missing genotypes and less than 0.5 nonzero genotypes. The missing genotypes cut-off reflects the 5th percentile of the % genotyped distribution of the 16 Poodles to account for the higher performance variability associated with sequencing relative to GWAS assays. This cut-off accurately removed outliers as all other tagSNPs had a success rate of >90%. An in-depth discussion on the rationale of the setting applied for the Hardy-Weinberg equilibrium test is provided in Suppl. Data S3–. The r² values between subsequent markers were retained and used in the downstream analysis to investigate the potential association with distance and chromosomal position. Minimum, Q1, mean, median, Q3 and maximum values were calculated for each chromosome per breed for cHD and WES separately. To allow an unbiased comparison of the r² between the three breeds, both raw r² values and r² values based on random subsampling (100 random combinations of each time 6 samples) were calculated for the Poodle. To evaluate the effect of the number of SNPs, random subsampling of the original 9541 cHD SNPs (on chromosome 1) was performed, reducing the number of SNPs from 9000 to 1500 in 6 steps of 1500. For each number of SNPs, the random subsampling and subsequent LD analysis was performed ten times. To determine the background LD, one SNP was randomly sampled from each chromosome, followed by calculating the r² between all pairwise combinations of these 38 unlinked SNPs. This random sampling was repeated 100 times and the median value of the obtained distribution was considered to represent the background r². [...] H1: simulation of a signal, followed by simulation under H0. To assess the power, two approaches are generally used: starting from phenotypes, genotypes can be simulated, followed by association analyses. This is however computationally demanding and requires a correct creation of the LD structure of the genome. A second option that starts from the genotype to simulate phenotypes has the advantage that it immediately can be used from existing (GWAS or WES) genotype data and that it uses as such the real LD pattern of the dog breed,.The second procedure was used: for each SNP, the genotype distribution was determined. Under the assumption of a fully penetrant autosomal recessive disease, one homozygous genotype is considered “affected”, while the two remaining ones are considered “healthy”. To mimic a balanced case-control design of 8 cases and 8 controls, only those SNPs were selected that followed this distribution (n = 91). At that moment, a perfect association between the genotypes for that specific SNP and the phenotype has been created. This causal SNP is next omitted from the association analysis itself. The reason is that a tagSNP is typically considered not to be the causal SNP itself: leaving it in would make it a direct method instead of an indirect method that has to rely on LD to detect the association. This would bias the power comparison as one method would have the causal SNP directly interrogated, while the other method would have to rely on LD to detect the association. This is repeated for all SNPs by walking over the chromosome and results in various but at the same time balanced combinations of the original samples. For each SNP that fulfilled the balanced criterion, the association analysis was conducted with PLINK (version 1.90b.45) by applying the Cochrane Armitage trend test to detect the association between phenotype and tagSNPs. After each association test, the case-control labels were randomly permutated and the association test was performed again to evaluate the number of significant results under H0. To control the family-wise error rate (FWER), p-values were calculated with permutation (mperm procedure, 500 permutations based on the observation that adaptive permutation stopped often at lower values and the narrow dispersion of power values). P-values ≤ 0.05 were considered significant. This principle is used in all power analyses with slight adaptations. […]

Pipeline specifications

Software tools BWA, Picard, GATK, PLINK, VCFtools, Haploview, SNPinfo
Applications WES analysis, GWAS
Organisms Canis lupus familiaris
Diseases Genetic Diseases, Inborn