Computational protocol: Deep Phylogenetic Analysis of Haplogroup G1 Provides Estimates of SNP and STR Mutation Rates on the Human Y-Chromosome and Reveals Migrations of Iranic Speakers

Similar protocols

Protocol publication

[…] We genotyped the commonly-used SNP M285 which defines haplogroup G1 (YCC, 2002) in multiple Eurasian populations using the TaqMan technique (Applied Biosystems) and identified 367 M285-derived samples in 27 populations. All these samples were then genotyped at 17 Y-chromosomal STRs using the Y-filer genotyping kit (Applied Biosystems). All sample donors gave their written informed consent (the study was approved by the Ethics Committee of the Research Centre for Medical Genetics, Russian Academy of Medical Sciences). Data available from the literature were also incorporated (, ).Then we selected 19 samples for high-throughput sequencing of the Y-chromosome. To capture maximum phylogenetic diversity and thus increase the cost-effectiveness of the analyses, we applied three criteria for selecting samples. The geographic criterion led to samples from both steppe and mountain parts of the haplogroup’s area being included, particularly from populations where G1 frequency is high. The phylogenetic criterion led to samples from all clusters revealed on the STR network being included and represented by at least two samples for full sequencing, because STR-clusters might reflect real phylogenetic branches and a single sample would not allow us to distinguish phylogenetically-informative SNPs from private ones. The third criterion could be applied only to those populations where paternal clan structure is present: it led to representatives from different clans being included because members of the same clan have a high probability of sharing almost identical paternal lineages. As an outgroup for the 19 G1 samples, we also sequenced one sample from its brother haplogroup, G2.Y-chromosomal genotyping was performed using a custom enrichment design created for the commercially available “BigY” product offered by Gene By Gene, Ltd. In total, the target regions attempt to sequence around 20 million base pairs with 67,000 capture probes, on the Illumina HiSeq platform. This design captured 11,383,697 bp within the non-recombining male-specific Y-chromosome, consistent with regions genotyped by previous Y sequencing studies [] and the Y positions placed on the phylogenetic tree by the Y Chromosome Consortium []. Following BigY sequencing, and also as part of the product, downstream software analysis was performed using the Arpeggi Engine (AEngine) pipeline. This includes short read mapping, alignment post-processing, and variant calling. For quality control purposes, BigY samples are monitored for read totals, average coverage and average base quality, and should a sample fall below BigY standard thresholds, the sample is re-run. A regions file listing the genomic build 37 capture targets of BigY can be found at Variants found across the samples are classified as any deviation from the reference genome, and in addition, we reported genotypes for about 37,000 phylogenetically informative SNPs in the FamilyTreeDNA database ( addition to the genotyping per sample, we wanted to ensure for this study that SNP positions examined were adequately covered across all samples. This is a concern, because many variant calling methods in high-throughput sequencing are ambiguous when not reporting a variant as to whether there was not enough coverage to genotype, or if there was a legitimate homozygous reference genotype. To discern such cases, each BigY sample was given a “confidence” region list determined by genotype quality scores for each base. The genotype quality is computed as the probability that the genotype is correct, according to a phred score. This probability is derived from AEngine’s proprietary statistical model considering characteristics of read coverage, individual read mapping qualities, and base sequencing quality scored by the HiSeq. A base position is appended to the confidence regions for that sample if its genotype quality score is above 3.02. Thus, if there is no variant occurring at a base within confidence intervals for a sample, it can be assumed that the sample is reference genotype at that position. Variant calls were produced and handled as Variant Call Format (VCF) files, according to the established field standards ( To this effect, the intersection of confident regions covered by the 20 samples studied was also recorded, and can be found within the Supplementary Data. More details on the BigY capture sequencing method are available at, which features the methods and the capture probe list and target regions.To estimate the potential sequencing error rate, we applied a phylogenetic approach. We checked whether we found all SNPs in the BigY captured region which are known to be phylogenetically located between haplogroups A0 and G ( and thus should be present in our samples. The proportion of missed SNPs was the false negative rate. We also checked whether we see SNPs known to define other haplogroups, which are therefore not expected to be present in our haplogroup G samples. The proportion of these unexpected SNPs was considered as the false positive rate. We note that this approach overestimates the error rate, because it considers parallel mutations as errors and ignores potential inaccuracies in identifying the SNP ancestral states in ISOGG database. The (over)estimated rates were 0.008 for false negatives and 0.005 for false positives. (See details in ). [...] The frequency distribution map of haplogroup G1-M285 was created using data reported here for the first time (27 populations, ), data from the literature (33 populations, ) and published data on other 266 Eurasian populations where G1 frequency was zero. The map was created in the GeneGeo software as described previously [,] setting the weight function to 2 and radius of influence to 2500 km. The map is presented at two scales: a specific scale highlighting the distribution pattern of this haplogroup () and the “universal” scale routinely used in GeneGeo for mapping haplogroup frequencies ().An analysis of molecular variance (AMOVA) was performed using Arlequin [] on two groups of populations: those from the ancient area of Iranic speakers, compared to the group of all other Eurasian populations. We calculated variation among these two groups of populations using data on each haplogroup separately and identified haplogroups demonstrating highest differentiation between “Iranic” and “non-Iranic” populations ().Reduced median networks [] of STR haplotypes () were created in the programs Network and Network Publisher (Fluxus-Engineering, We applied the preprocessing star contraction algorithm [] and postprocessing Steiner maximum parsimony algorithm []. The allele sizes for locus DYS389II were determined with the subtraction of DYS389I. Loci DYS385a and DYS385b were excluded from network analyses. The main network was based on 15 STRs genotyped in 386 samples (). To include data from the Madjar subclan of the Kazakhs [], the second network restricted to the Argyn Kazakh population was based on 10 STRs ().Haplotype diversity was calculated according to [] as HD=1−∑pi2, where p i is the frequency of the i th haplotype. Data on 17 Y-STR haplotypes (DYS389II alleles were subtracted) belonging to G1-haplogroup from were used. Neighbouring populations with small numbers of G1 haplotypes were pooled to reach a minimum sample size of six; the average sample size was 30. The values of haplotype diversity were plotted on a map using the GeneGeo software with the weight function set to 4. [...] The BigY output VCF files () contained around 33,700–35,900 SNP calls in each of 20 samples. We combined these datasets into single table and filtered out (i) indels, (ii) SNPs with a call rate below 95% (i.e. not called in at least one out of 20 samples; BED files indicating called ranges for each sample are present in the ) and (iii) SNPs which demonstrated no polymorphism in our samples (i.e. all samples were either identical to or all were different from the reference at these positions). The resulted filtered dataset () consisted of 19 G1 samples, one G2 outgroup sample and 636 SNP positions with very little homoplasy.The parsimony trees were constructed from this dataset using TNT [] and Phylomurka ( software. Only one optimal topology was obtained although the states of internal nodes could be marked in different ways. presents the ages of branches estimated according to [].The same dataset was also subjected to analysis with BEAST software [] which can reconstruct phylogeny and estimate divergence time by a number of Markov chain Monte Carlo methods. For the test we chose the GTR nucleotide substitution model and Gamma-distributed site heterogeneity with default parameters. We tested both strict and lognormal relaxed clock models and finally preferred the latter due to a positive posterior value of the rate variance between tree branches (). For the tree prior we chose the Expansion Growth model assuming that the population grew exponentially since a relatively recent time. The prior for the mutation rate was set as a uniformly distributed value, initially equal to 1.2×10-5 per SNP () while the age of Kazakh cluster was forced to be normally distributed with the mean of 627 years (the value obtained using genealogical records, see below) and standard error of 50 years. Sufficient ESS values were achieved with the MCMC chain size of 20,000,000 and higher. The consensus of 10,000 trees produced by BEAST is the same as our parsimony tree, and Bayesian age estimates show less than 20% difference from those obtained with Rho statistics (). Bayesian methods such as BEAST assume random sampling from a population and interpretation of their output can be less straightforward when lineage-based sampling is used [], and our dataset was restricted to a single haplogroup. However, the coinciding topologies of the trees generated by the different methods in our study shows that the phylogenetic structure is robust to this concern. Note that in our analyses, choosing the model of population growth had the major influence on the results—probably larger than the influence of sampling from a particular lineage—but we report the general consensus of the results across all settings used.In an additional analysis we included two G1 samples from the 1000 Genomes Project (NA20858 and NA20870, Gujarati Indians sampled in Houston, Texas (GIH), 2-4X average coverage). Data were handled in the same way, although the lower coverage of the 1000 Genomes samples halved the number of SNP calls and the filtered dataset consisted of 22 samples and 393 SNPs (). The parsimony method yielded two optimal topologies, and the one supporting the monophyly of all non-Indian lineages was preferred as a more likely reconstruction. […]

Pipeline specifications

Software tools SAMtools, Arlequin, BEAST
Applications Phylogenetics, Population genetic analysis
Organisms Homo sapiens