Computational protocol: Global Patterns of Diversity and Selection in Human Tyrosinase Gene

Similar protocols

Protocol publication

[…] The following regions of TYR were sequenced: the 5’ and 3’ flanking regions, five complete exons including exon-intron junctions, and additional 12 intronic regions. The total amount of sequence produced is about 24.3 kb: 12.9 kb and 0.4 kb come from the 5’ and 3’ flanking regions, respectively, 1.6 kb are coding sequences and 9.4 kb are intronic sequences. A graphical overview of the sequenced regions is available in and additional detailed information is presented in . PCR conditions and primer information are available upon request from the authors. Nucleotide sequences are available through GenBank, accession numbers KC201427-KC201588.Aligned sequences were compared to the hg19, GRCh37 human reference sequence and chimpanzee genome sequence build 3.1. Individual haplotypes were further reconstructed using the PHASE algorithm [,] implemented in the DNASP 5.10 software package []. The phase of the singletons was, where possible, determined by a cloning procedure with CloneJET™ PCR Cloning Kit (Thermo Fisher Scientific Inc.).New samples were genotyped using the HumanHap650Y BeadChip, while some of the already published samples were genotyped either by the same chip or using the Illumina Human610-Quad BeadChip (see for details). The genotype data were processed and filtered with PLINK v 1.06 []. Our original master-file included only SNPs shared between the three Illumina genotyping arrays: HumanHap650Y, Human610-Quad and Human660W-Quad. We further excluded all SNPs with call rates less than 0.95. A total of 531,315 autosomal SNPs passed this criterion. We further applied additional filters, the filter choice depended upon the analysis performed: (a) for the whole-genome d i we used autosomal non-phased genotypes with call rates equal to 1.0, and SNPs with minor allele frequency over 0.05; (b) for iHS and XP-EHH, first, the haplotype reconstruction was performed using BEAGLE 3.1 [], and later the same filters as for the whole-genome d i were applied to the phased data. After all filtering steps our non-phased dataset included 415,334 and phased dataset 491,253 autosomal markers. [...] The Jukes-Cantor corrected nucleotide diversity (π) and Tajima’s D value [] were estimated using the DNASP 5.10 software []. The significance of Tajima’s D statistics was assessed by two different approaches. Firstly, empirical TYR values were compared to 10,000 coalescent simulations performed under the hypothesis of selective neutrality and population equilibrium in ARLEQUIN []. Secondly, Tajima’s D estimates of the longest consecutive sequence block (ca 13.8 kb), which comprises the 5’ flanking region and the first exon of TYR ( and ), were further compared to the average of D calculated from 10,000 neutral simulations using the COSI 1.2.1 software under the "best-fit" model []. COSI has been calibrated to simulate extant human genetic variation under neutrality. African, European and Asian sequences were generated using the parameters of the "best-fit" model. However, these parameters were not provided for American and Oceanian populations, for which we have chosen the closest proxy in terms of genetic similarity, namely the Asian model [,]. The model parameters were further adjusted to meet our re-sequencing project details: this included sequence length and mutation rate. Local TYR recombination rate from the HapMap GRCh37 genetic map was used in the coalescent simulations [].Within population differentiation was estimated using analysis of molecular variance (AMOVA) with ARLEQUIN []; statistical significance was assessed by 1,000 permutations. [...] The following whole-genome selection scans were performed: integrated haplotype score (iHS) [], cross-population extended haplotype homozygosity (XP-EHH) [] and pairwise FST-based d i []. All statistics were calculated using our Illumina genotyping sample set (n=351) for each of the six population groups separately. The dataset was filtered as explained above. In case of iHS and XP-EHH estimation we followed the previously described approach []. In XP-EHH analyses, we used Bantus as a reference group for all non-African populations and Europeans as a reference for the African population. Whole-genome autosomal pairwise FST values were estimated using GENEPOP 4.0 [] and the function of pairwise FST between population i and the remaining populations (d i) was calculated in R [].All iHS, XP-EHH and d i genome-wide results were normalized by chromosome, divided into 200-kb non-overlapping windows and binned by the number of SNPs in each window (bin size=20SNPs for XP-EHH and d i analysis, and bin size=15SNPs for iHS). We further used maximum value of XP-EHH, average value of d i and fraction of |iHS|>2 in each window. A two hundred kb long TYR window was centered in the middle of the gene and included about 41 kb of upstream and downstream sequence. Percentile rank of the TYR window was calculated using the genome-wide distribution of windows from the same bin. [...] We used two different approaches to reveal phylogenetic relationships among data. First, we used the median-joining (MJ) algorithm implemented in the NETWORK software [] to reconstruct phylogeny based on the alignment of phased sequences. The haplotype block structure of this alignment was established using HAPLOVIEW 4.2 []. We further reconstructed phylogenetic history and estimated time to the most recent common ancestor (TMRCA) in each haplotype block separately using local mutation rate estimates. Coalescent age was calculated using the rho statistics on the MJ network in the NETWORK software. Mutation rate calibration was performed using 7 MYA human-chimpanzee split [,] assuming a 29 year generation time []. In addition, the age of derived non-synonymous alleles of rs1042602 (20 chromosomes in the TYR re-sequencing sample) and rs1126809 (10 chromosomes in the TYR re-sequencing sample) were estimated by two different methods. Firstly, the rho statistics was calculated using the MJ network of the complete re-sequencing alignment as described above. Secondly, a Bayesian coalescent approach was applied as implemented in BEAST 1.7.5 []. Data was partitioned into coding and non-coding regions. Coding nucleotides were analyzed with the SRD06 model []. MODELGENERATOR [] was used to assess the substitution model of non-coding nucleotides independently in each dataset: the HKY model was chosen for sequences with the rs1042602 derived allele, and the HKY+I model was chosen for sequences with the rs1126809 derived allele. The analysis was performed using a local molecular clock of the TYR gene that was estimated for coding and non-coding regions separately, and using Extended Bayesian Skyline Plot as a coalescent tree prior []. Posterior distributions of parameters were estimated by Markov chain Monte Carlo simulation, with samples drawn every 1,000 steps over a total of 50,000,000 steps with the first 10% of samples discarded as burn-in. Three independent runs were conducted to check for convergence. Sufficient sampling of parameters was evaluated using TRACER 1.5.0 [] and convergent runs were combined; 95% Highest Posterior Density intervals (HPD) were used as confidence intervals.Second, we reconstructed the minimum recombination history of TYR haplotypes using phased Illumina genotype data for 1108 individuals. This was performed using only those TYR SNPs in the genotype data that were within the range of our re-sequencing alignment (n=7), or SNPs that were phylogenetically equivalent to them (n=8). Equivalency of a SNP missing from the re-sequencing project was assumed if it was in complete linkage disequilibrium (r2=1) with one of the re-sequenced SNPs among samples present in both re-sequencing and genotyping sample sets (). We applied filtering on our data prior to the analysis: individuals having one of the chromosomes belonging to the rare haplotype with total frequency of less than 1% in our sample were removed. In total, 942 individuals (1884 chromosomes) passed the filtering criteria. Further, ancestral recombination graphs (ARG) based on filtered and non-filtered data were reconstructed in BEAGLE and KWARG, respectively. Both programs implement a branch and bound method for determining the minimum number of recombinations in a dataset []. Initially, the draft ARG was reconstructed and compared to the median-joining network solution of the data. To detect fine-scale phylogeographic patterns we further genotyped eight additional sub-haplogroup defining SNPs among the individuals belonging to the particular lineage (see for details). […]

Pipeline specifications

Software tools DnaSP, PLINK, BEAGLE, Arlequin, Cosi, Genepop, Haploview, BEAST, ModelGenerator
Applications Phylogenetics, Population genetic analysis, GWAS
Organisms Homo sapiens
Chemicals Vitamin D