Computational protocol: Use of partial least squares regression to impute SNP genotypes in Italian Cattle breeds

[…] Comparing imputation performance across publications is difficult because studies differ in many respects. TP size and the number of markers in the LDP heavily affect prediction accuracy. Moreover, the relationships between training and validation animals have an impact on imputation accuracies []. So, before applying the PLSR imputation method to our data, the method was tested on external data provided by Daetwyler et al. [], who used the ChromoPhase program [] to impute missing genotypes from low to high density SNP platforms. The data consisted of 1183 Holstein bulls genotyped with the Illumina 50K chip. Only the 2529 markers on chromosome 1 were available. A PP genotyped with the 3K chip (182 SNP) was simulated by masking the markers not present on the 3K chip. In particular, the PP was divided into non-founders (112 individuals with at least one genotyped parent) and founders (212 animals without a genotyped parent), and imputation accuracies were evaluated for both categories of animals. The PLSR method and the Beagle [] software were used to impute SNP genotypes in the PP, and results were compared with the accuracies obtained by Daetwyler et al. []. Neither population structure nor pedigree information was used with either method.

In our experimental data, PLSR was first applied to the Holstein breed. Animals were ranked by age and divided into TP = 1993 (the older bulls) and PP = 100 (the younger bulls), and both the 3K and 7K scenarios were investigated. The Beagle software was applied to the same data. No pedigree information was used for either PLSR or Beagle.

On simulated data, Dimauro et al. [] demonstrated that, for each chromosome, the PLSR imputation accuracy improved as the number of variables contained in X increased. The reason is that when many variables have to be predicted (the columns of the Y matrix), the number of extracted latent factors should be large. The maximum number of possible latent factors is, however, less than or equal to the number of variables in X. So, for chromosomes with a relatively low number of markers in X, a lower PLSR predictive ability is expected. This hypothesis can be easily tested by comparing the imputation accuracies obtained in the 3K and 7K scenarios. Moreover, a PLSR run using an X matrix obtained by combining SNPs belonging to chromosomes 26, 27 and 28 was carried out to test for a possible improvement in genotype imputation accuracy when X is artificially enlarged. […]
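The masking and prediction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy genotypes (individuals built from eight hypothetical founder haplotypes), the marker spacing, the population sizes, the helper names `pls_fit` and `impute`, and the use of genotype concordance as the accuracy measure are all assumptions made for illustration. X holds the markers retained on the low density panel, Y holds the masked markers, and the number of latent factors is capped at the number of columns of X.

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """PLS2 regression sketch: Y ~ (X - x_mean) @ B + y_mean."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xk, Yk = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    # The number of latent factors cannot exceed the number of variables in X.
    for _ in range(min(n_components, X.shape[1])):
        # Weight vector: dominant left singular vector of the X-Y cross-product.
        w = np.linalg.svd(Xk.T @ Yk, full_matrices=False)[0][:, 0]
        t = Xk @ w                      # scores of the current latent factor
        tt = t @ t
        if tt < 1e-10:                  # no variation left in X to extract
            break
        p = Xk.T @ t / tt               # X loadings
        q = Yk.T @ t / tt               # Y loadings
        Xk = Xk - np.outer(t, p)        # deflate X and Y
        Yk = Yk - np.outer(t, q)
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = (np.column_stack(m) for m in (W, P, Q))
    B = W @ np.linalg.solve(P.T @ W, Q.T)   # regression coefficients
    return x_mean, y_mean, B

def impute(x_mean, y_mean, B, X_new):
    """Predict masked genotypes and round to the nearest code 0, 1 or 2."""
    pred = (X_new - x_mean) @ B + y_mean
    return np.clip(np.rint(pred), 0, 2).astype(int)

# Toy data (assumption): genotypes are sums of two founder haplotypes.
rng = np.random.default_rng(0)
n_founder_haps, n_markers = 8, 120
founders = rng.integers(0, 2, size=(n_founder_haps, n_markers))

def simulate(n):
    pairs = rng.integers(0, n_founder_haps, size=(n, 2))
    return founders[pairs[:, 0]] + founders[pairs[:, 1]]

train, valid = simulate(200), simulate(50)          # stand-ins for TP and PP
ld = np.arange(0, n_markers, 6)                     # markers kept on the low density panel
masked = np.setdiff1d(np.arange(n_markers), ld)     # markers to impute

def accuracy(k):
    """Genotype concordance on the masked markers using k latent factors."""
    model = pls_fit(train[:, ld].astype(float), train[:, masked].astype(float), k)
    return float(np.mean(impute(*model, valid[:, ld]) == valid[:, masked]))

acc_one, acc_many = accuracy(1), accuracy(len(ld))
print("1 latent factor:", acc_one)
print("up to", len(ld), "latent factors:", acc_many)
```

In this toy setting the masked genotypes are an exact linear function of the retained ones, so the comparison only illustrates why allowing more latent factors can help; it says nothing about the accuracies obtained on real data.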

Pipeline specifications

Software tools: IMPUTE, BEAGLE
Application: GWAS
Organisms: Bos taurus