Computational protocol: Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits

Protocol publication

[…] All AUS individuals and some of the DANZ bulls were directly genotyped with the Illumina BovineSNP50 chip []. The remaining DANZ bulls were imputed from ~15,000 SNP to the BovineSNP50 chip. All individuals were then either directly genotyped or had imputed genotypes for the Illumina 800 K BovineHD BeadChip. Further details of DANZ genotyping are published in [] and details for AUS are published in []. In addition to the HD 800 K SNP genotypes, we identified approximately two million sequence variants (SNP and indels) in gene coding regions, including variants within 5000 bp upstream and downstream of these genes (based on the annotation available for the University of Maryland UMD3.1 assembly of the bovine reference genome []). The discovery of sequence variants across these regions was carried out in Run 3.0 of the 1000 Bull Genomes project []. Beagle version 3 [] was used to impute these sequence variants in all animals. The reference sequences used for imputation were those of 136 Holstein and 27 Jersey bulls, combined, from the 1000 Bull Genomes project (Run 3.0). The combined HD SNP and imputed sequence variants brought the total number of genotypes per animal to 2,785,440.

All 2.785 M variants were then assigned to one of three broad categories based on the annotation of the reference genome UMD3.1 (details in Additional file : Table S1). The first category comprised variants predicted to cause a non-synonymous coding change, referred to as “NSC”. The majority were missense variants, but the NSC category also included splice site variants, inframe indels, frame shifts and stop gained/lost mutations. The second category included variants in regions predicted to have potential regulatory roles, loosely referred to as “REG”. The REG variants were mainly those within a 5000 bp region upstream or downstream of genes, in three or five prime untranslated genic regions, or non-coding exon variants.
All other variants were from the Illumina HD 800 K SNP array and were allocated to the third category, referred to here as “CHIP”: these were mainly intergenic, but included some intronic and synonymous coding variants.

We then combined all the AUS Holstein and Jersey genotypes and used this data set to pre-select a subset of the most informative sequence variants. First, we excluded variants with minor allele frequency (MAF) < 0.0002 using PLINK software []. We then excluded one of each pair of variants in complete LD (r² genotypic correlation > 0.999) across groups of 500 adjacent variants in sliding windows of 50 variants (using PLINK). LD pruning was first carried out independently within each variant group (NSC, REG and CHIP), and then any REG or CHIP variant in complete LD with an NSC variant was removed. Last, all CHIP variants in complete LD with a REG variant were removed. The remaining 994,019 variants, henceforth referred to as “SEQ”, were used for the analysis and included 45,026 NSC variants, 578,734 REG variants and 370,259 CHIP variants.

We also generated a standard set of SNP chip genotypes for each animal, consisting of the Illumina HD 800 K SNP array variants that were in common with the full set of imputed 2.785 M sequence variants (i.e. prior to pruning). This allowed the accuracy of genomic prediction to be compared between a standard 800 K genotype array and the SEQ genotypes. In total there were 600,641 SNP genotypes in this HD SNP array set, henceforth referred to as the “800 K” genotypes.

[...] An association study was conducted in the AUS dataset using ‘SNP Snappy’ []. This process fitted a model similar to Eq. , but replaced the term for all SNP genotypes (Wv) with a single SNP regression of phenotype on genotype, one SNP at a time. That is, in addition to the SNP regression, the model included the overall mean, fixed effects and a polygenic term, and phenotypes were weighted for heterogeneous error variance []. […]
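
The pre-selection of SEQ variants described above (MAF filter, LD pruning within each category, then removal of REG and CHIP variants tagged by a higher-priority category) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: they used PLINK, whose `--maf` and `--indep-pairwise` options provide this kind of filtering, and the exact commands are not given in this excerpt. All function names, the greedy pruning strategy, the all-pairs comparison between categories and the toy genotypes are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical data layout: variant id -> allele dosages (0, 1 or 2), one per animal.
Genotypes = Dict[str, List[int]]


def minor_allele_frequency(dosages: List[int]) -> float:
    """Frequency of the rarer allele, from 0/1/2 dosages."""
    p = sum(dosages) / (2 * len(dosages))
    return min(p, 1 - p)


def r_squared(a: List[int], b: List[int]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic variant: correlation undefined, treat as no LD
    return cov * cov / (var_a * var_b)


def prune_within(ids: List[str], genos: Genotypes,
                 threshold: float = 0.999, window: int = 500) -> List[str]:
    """Greedy pruning (an assumption, simpler than PLINK's algorithm): keep a
    variant unless it exceeds the r^2 threshold with one of the last `window`
    variants already kept. `ids` must be in map order."""
    kept: List[str] = []
    for vid in ids:
        if all(r_squared(genos[vid], genos[k]) <= threshold for k in kept[-window:]):
            kept.append(vid)
    return kept


def drop_if_tagged(ids: List[str], priority_ids: List[str], genos: Genotypes,
                   threshold: float = 0.999) -> List[str]:
    """Remove variants in complete LD with any higher-priority variant.
    All pairs are compared here; a real run would restrict to nearby variants."""
    return [v for v in ids
            if all(r_squared(genos[v], genos[p]) <= threshold for p in priority_ids)]


def preselect(groups: Dict[str, List[str]], genos: Genotypes,
              min_maf: float = 0.0002) -> Dict[str, List[str]]:
    """MAF filter, within-category pruning, then NSC > REG > CHIP priority."""
    passed = {g: [v for v in ids if minor_allele_frequency(genos[v]) >= min_maf]
              for g, ids in groups.items()}
    pruned = {g: prune_within(ids, genos) for g, ids in passed.items()}
    nsc = pruned["NSC"]
    reg = drop_if_tagged(pruned["REG"], nsc, genos)
    chip = drop_if_tagged(pruned["CHIP"], nsc, genos)
    chip = drop_if_tagged(chip, reg, genos)
    return {"NSC": nsc, "REG": reg, "CHIP": chip}


# Toy genotypes for six animals (made up, not from the study).
toy_genos: Genotypes = {
    "nsc1": [0, 1, 2, 1, 0, 2],
    "nsc2": [0, 1, 2, 1, 0, 2],   # duplicate of nsc1
    "reg1": [2, 1, 0, 1, 2, 0],   # mirror image of nsc1, so r^2 = 1
    "reg2": [0, 0, 1, 2, 1, 0],
    "chip1": [0, 0, 1, 2, 1, 0],  # duplicate of reg2
    "chip2": [1, 0, 0, 1, 2, 2],
    "chip3": [0, 0, 0, 0, 0, 0],  # monomorphic, fails the MAF filter
}
toy_groups = {"NSC": ["nsc1", "nsc2"],
              "REG": ["reg1", "reg2"],
              "CHIP": ["chip1", "chip2", "chip3"]}

result = preselect(toy_groups, toy_genos)
print(result)  # {'NSC': ['nsc1'], 'REG': ['reg2'], 'CHIP': ['chip2']}
```

In the toy data, one variant survives per category: `nsc2` is pruned within NSC, `reg1` is removed because it is in complete LD with `nsc1`, `chip1` is removed because it duplicates `reg2`, and `chip3` fails the MAF filter. The ordering of the steps mirrors the text, so that when variants are in complete LD the one with the stronger functional annotation is retained.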

Pipeline specifications

Software tools: PLINK, SNAPPY
Applications: De novo sequencing analysis, GWAS
Organisms: Bos taurus