Computational protocol: Identification of selective sweeps reveals divergent selection between Chinese Holstein and Simmental cattle populations

Similar protocols

Protocol publication

[…] Data were obtained from 96 Holstein cattle, including 10 cows and 86 bulls, and 447 Simmental bulls. Genotyping was performed using the Illumina BovineHD BeadChip that includes 777, 962 SNPs. To ensure the high quality of the SNP data, a series of quality control measures was performed. The following criteria were applied for quality control: (1) an individual sample was removed when the missing genotype rate per individual was higher than 0.05; SNPs were removed (2) when the minor allele frequency (MAF) was lower than 0.05, (3) when there was no known autosomal genomic location (UMD 3.1), (4) when the SNP genotypes were not in Hardy–Weinberg equilibrium (P > 10−6). However, for the XP-EHH and FST tests, loci with a MAF lower than 0.05 were also included. Beagle (version 3.3.2) [] was used to impute missing genotypes and infer the haplotype phase. To ensure independence among the individuals collected in both populations, a relatedness test was performed using PLINK []. A set of approximately independent SNPs was extracted using the PLINK option: indep-pairwise 50 5 0.2, i.e., removal of one SNP of a pair of SNPs that have a pairwise r2 higher than 0.2 within a window of 50 SNPs, and shifting the window by steps of five SNPs. The pairwise IBD was estimated for pairs of individuals within each population. Individuals of a pair of individuals that had a pi-hat value greater than 0.2 were considered to be closely related, and thus, one individual was removed from the analysis. [...] To characterize the origins of the Chinese Holstein and Simmental populations, we conducted a principal component analysis (PCA) using EIGENSOFT 6.0.1 [] and breed assignment analyses using WIDDE []. For each population, 15 individuals were randomly selected to conduct analyses. In breed assignment analyses, the allele sharing distance (ASD) was calculated between test individuals and individuals in the world reference dataset included in WIDDE. For each test individual, the average ASD with all the individuals of each reference population was calculated and the top five genetically closest populations were summarized. As only populations with at least 15 individuals are included in the world reference dataset in WIDDE, the Simmental reference population, which included 10 individuals genotyped on the Illumina BovineHD BeadChip, although present in WIDDE was not included in the world reference dataset. Therefore we downloaded the world reference dataset and the Simmental reference population from WIDDE, and conducted PCA using EIGENSOFT rather than WIDDE. The downloaded dataset included 2513 individuals genotyped by either the Illumina BovineHD BeadChip or Illumina BovineSNP50 BeadChip. Before conducting PCA, the data was filtered. To merge our dataset and the dataset downloaded from WIDDE, we retained only common SNPs. Then, PLINK [] was used to exclude individuals with a missing genotype rate higher than 0.05, SNPs with a missing genotype rate higher than 0.25, and SNPs with a MAF lower than 0.01. [...] For the purpose of this study, we used the algorithm suggested by Gabriel et al. [], which defines a pair of SNPs to be in ‘strong LD’ if the one-sided upper 95 % confidence bound on D’ is higher than 0.98, and the lower bound is higher than 0.7. A haplotype block is defined as a region across which 95 % of the informative SNP pairs show strong LD. LD statistics were estimated, and haplotype blocks were constructed using HAPLOVIEW [] v4.2 based on the reconstructed haplotypes. [...] Gene contents in the candidate regions were retrieved from the UCSC Table Brower []. Before conducting functional annotation, we selected candidate genes within windows that presented selection signals according to the LRH and XP-EHH tests. For the LRH test, each gene overlapping with a candidate window was given a score corresponding to the FST average of SNPs localized within the boundary positions of the gene extended by 1 kb upstream and downstream; the gene with the maximum score in each window was selected for functional annotation. For the XP-EHH test, we considered only genes within the 100 kb region around the SNP with the maximum |XP-EHH| score, since XP-EHH peaked more narrowly around the candidate variant than other methods, to detect selection signals []. Given that a limited number of genes have been annotated in the bovine genome, first we converted the cattle RefSeq mRNA IDs to orthologous human Ensembl gene IDs from the Ensembl Genes 84 Database by BioMart []. Gene Ontology (GO) and KEGG pathway analyses were performed in the Holstein and Simmental populations separately using DAVID 6.8 []. The candidate genes were classified into categories by cellular component, molecular function, biological process and pathway and were compared to the human genome background supplied by DAVID. […]

Pipeline specifications

Software tools BEAGLE, IMPUTE, PLINK, EIGENSOFT, Haploview, BioMart, DAVID
Applications Genome annotation, Population genetic analysis, GWAS
Organisms Bos taurus