Computational protocol: Identifying artificial selection signals in the chicken genome

Similar protocols

Protocol publication

[…] We employed the program ADMIXTURE v1.23 [], which uses a block relaxation algorithm, to investigate the population structure with default parameters. The number of populations considered was from 2 to 12. To infer the phylogenetic tree, PLINK v1.07 [] was used to generate the IBS distance for all pairs of individuals and then a neighbour-joining tree was constructed with FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/) based on the distance matrix. Finally, principal component analysis (PCA) was performed using the flashpcaR package[] based on the whole genome genotype data that remained after quality control was assessed. After filtering SNPs to have a Pearson product-moment correlation of allele frequency (r2) less than 0.8 in 50 SNP windows, we ran ADMIXTURE and flashpcaR on all 231,545 remaining SNPs again. [...] In accordance with the population genetic structure, five breed pools (or breed) of genotype data were used to detect potential selection signals in this analysis: Jinghong and Jingfen (JH_JF), Pekin-Bantam (PB), Rheinlander and White Leghorn (RH_WL), Araucanas and Vorwerkhuhn (AR_VO), Gallus-Gallus-Spadiceus (GA). To identify the signals under artificial selection, GA was defined as the common reference population because it was associated with limited amounts of artificial selection for commercial traits.Multiple elementary selection signal statistics were employed to search for the evidence of artificial selection in two steps (hereafter termed the ‘two-step’ strategy). The first step is positive selection detection, which was performed using the composite likelihood ratio (CLR) [] and the integrated haplotype score (iHS) [] test. Among them, the CLR method calculates the likelihood ratio of selection signals by comparing the spatial distribution of allele frequencies in an observed window to the frequency spectrum of the whole genome. In this analysis, SweepFinder [] software was employed to calculate the CLR with a grid size of 10 kb. To explore ongoing selection signals, the iHS method was performed, which searches for haplotype structures that have ancestral and derived alleles. Single marker scores for unstandardized iHS were calculated using the rehh R package [] and then the |iHS| scores averaged across a non-overlapping 10 kb window across the genome. In the second step, we identified the underlying artificial selection signals from the above detected positive selection signals by calculating the FST statistic for pairwise sites between the observed populations and the common reference population. The unbiased FST estimate proposed by Weir and Cockerham was used to measure the population differentiation, with values ranging from 0 (no differentiation) to 1 (complete differentiation) []. To produce comparable CLR and iHS test results, the 10 kb grid size was also used for determining FST statistics. Finally, the selection signals were detected by using the |iHS|, CLR and FST scores in the genome level. The outliers for all methods were defined as signals with statistics surpassing the significance threshold (P<0.05) determined using 1,000 permutation tests. Accordingly, the chromosomal regions within a 200 kb window surrounding the outliers were defined as potential selection regions (PSR). This window size in this analysis was determined by the linkage disequilibrium decay in real data (). In addition, we further calculated the ‘observed heterozygosity’ (Het), which should be decreased in regions that are affected by a selection signal. As a by-product, the major (minor) allele frequency was also calculated by PLINK v1.07 []. Note that the artificial selection signals in this study were defined in the genomic region in which both positive selection signals and the FST statistical value were greater than the cut-off value at the genome level. [...] To characterize the core haplotype of underlying selection signals, haplotype blocks in the detected artificial selection regions were inferred using Haploview v4.2 [] software based on the solid spin algorithm. The parameters were set as follows: -blockoutput SPI -blockSpineDP 0.8. Core haplotypes were defined as haplotype blocks that have a remarkable signal, and a high frequency in an observed population suggests that the presence of a core haplotype in the potential selection region may be abnormal in neutral scenarios. In general, in sweep analysis, the remarkable signals should be significant and the frequency should be greater than 0.8, according to previous reports []. [...] In this analysis, bioinformatics analyses were performed to explore the potential biological functions of genes located in putative artificial selection regions. This analysis involved all the selected genes in the 200 kb window surrounding the significant signals, which was determined by the linkage disequilibrium decay in chicken populations. Detailed biological functions were matched for all highlighted candidate genes using the NCBI database (https://www.ncbi.nlm.nih.gov/gene). The gene annotation files for chickens that were downloaded from the database from ensemble (http://www.ensembl.org/info/data/ftp/index.html) were used to identify the exact position of markers in selected regions and highlight those that fell into selected regions using the textual mining R script. Gene-based annotations of Single nucleotide polymorphisms (SNPs) located in selected regions were performed using ANNOVAR []. Using ENSEMBL genes, we investigated where the SNPs are located in the regions of gene components. In addition, genes located in putatively selected regions were identified using the BioMart program (http://www.biomart.org/, Kasprzyk, 2011), and then an enrichment analysis, which included the terms cellular component, molecular function, and biological process, was performed for the identified genes using DAVID 6.7 (http://david.abcc.ncifcrf.gov/). […]

Pipeline specifications