Computational protocol: A genetic chronology for the Indian Subcontinent points to heavily sex-biased dispersals

Similar protocols

Protocol publication

[…] In order to clarify the phylogeny of haplogroups M, N and R in South Asia, we focused our study on the lineages with recognized or potential likely origin in the Subcontinent, belonging to macrohaplogroups M (M2, M3, M4’67, M5, M6, M13’46’61, M31, M32’56, M33, M34’57, M35, M36, M39, M40, M41, M42b, M44, M48, M49, M50, M52, M53, M58, M62), R (R5, R6, R7, R8, R30 and R31) and N (N1’5). We also studied U2 (excluding U2e due to its West Eurasian origin) in a complementary analysis. We obtained 381 whole-mtDNA sequences from the 1KGP [] (although we note that these were collected from caste families from India and lack tribal groups) and 51 from the HGDP []. In addition, we generated 13 new sequences (accession numbers: KY686204 -KY686216) belonging to the aforementioned haplogroups from Southeast Asia: seven from Myanmar, one from Vietnam, one from Thailand and four from Indonesia. We combined these with other published data from South Asia and neighbouring areas, including a total of 1478 samples (Additional file : Table S1). The additional sequences increased substantially the sample size particularly in the West of the Indian Subcontinent, necessitating a re-evaluation of previously inferred phylogeographic patterns [, ].In order to discern migrations into the Subcontinent at different time periods, we also performed a complementary analysis of several “non-autochthonous” N lineages present in South Asia (H2b, H7b, H13, H15a, H29, HV, I1, J1b, J1d, K1a, K2a, N1a, R0a, R1a, R2, T1a, T2, U1, U7, V2a, W and X2—all subclades of West Eurasian haplogroups), amounting to a total of 635 mtDNA sequences (Additional file : Table S2). We assigned haplogroups using HaploGrep [], in accordance with the nomenclature in PhyloTree (Build 17, February 2016) []. [...] We reconstructed the mitogenome phylogenetic tree manually, based on a preliminary reduced-median network analysis [] with Network v.4.611, checked considering the frequency of each mutation [] and the nomenclature of PhyloTree (Build 17) []. We estimated coalescence ages within haplogroups M and N using both the ρ statistic [] and maximum likelihood (ML). We calculated ρ estimates with standard errors estimated as in Saillard et al. [] using a synonymous clock of one substitution in every 7884 years and a mitogenome clock of one substitution every 3624 years further corrected for purifying selection []. We assessed ML estimations using PAML 4 and the same mitogenome clock assuming the REV mutation model with gamma-distributed rates (discrete distribution of 32 categories) and two partitions, in order to distinguish hypervariable segments I and II (HVS–I and HVS–II) from the rest of the molecule. We performed runs both assuming and not assuming a molecular clock, in order to perform likelihood ratio tests (LRT) [].Since haplogroup M displays a peculiar phylogeographic pattern in South Asia [], we additionally estimated node ages in different sub-regions of the Subcontinent (west, south, central and east) with two different approaches: (1) considering all samples from a given region, regardless of the putative geographical origin of the clade and (2) considering the most probable origin of each major haplogroup (by considering branching structure, number of main branches, and centre of gravity) and including only basal lineages of each region []. To evaluate the effective population size (N e) of haplogroup M in each region, we computed Bayesian Skyline Plots (BSPs) [] using BEAST 1.8.0 []. Although haplogroups do not equate to populations, BSPs applied to specific lineages can provide insights into the size variations of the populations that include them [–]. We used a relaxed molecular clock (lognormal in distribution across branches and uncorrected between them), a two-parameter nucleotide evolution model and a mutation rate of 2.514 x 10-8 mutations per site per year []. [...] We filtered a dataset comprising 1440 samples with 500,123 SNPs, combining data from the 1KGP and 8 independent studies (Additional file : Table S3) for linkage disequilibrium (LD) using PLINK v1.07 [] (r2 > 0.25, with a window size of 100 SNPs and step size of 1), yielding a subset containing 164,149 SNPs. We subjected these to principal component analysis (PCA) using the standard PCA tool provided in EIGENSOFT v6.0.1 [], with which we calculated the first 10 principal components (PCs), from which we calculated the fraction of variance. We included three additional 1KGP populations—Han Chinese from Beijing, China (CHB), Tuscans from Italy (TSI) and Yoruba from Nigeria (YRI)—for ADMIXTURE v1.23 [] and sNMF [] analyses for cross-checking. We performed runs for values of K between 2 and 10, with 5-fold cross-validation in ADMIXTURE, and complementary analyses including Yamnaya aDNA samples []. The filtered datasets used (r2 > 0.25, window size of 100 SNPs and step size of 1) included 66,245 SNPs, for ADMIXTURE analysis, and 64,926 SNPs for the PCA.In order to assess potential sex-biased gene flow into the region, we compared uniparental (mtDNA and Y-chromosome) and autosomal ancestry in the five 1KGP South Asian populations: Bengali from Bangladesh (BEB), Gujarati Indian from Houston (GIH), Indian Telugu from the UK (ITU), Punjabi from Lahore, Pakistan (PJL) and Sri Lankan Tamil from the UK (STU). For the autosomal ancestry variation, we considered the mean of each component for the highest likelihood value. The putative origin of the uniparental lineages present in the populations is shown in Additional file : Table S4. Y-chromosome phylogeny was based on Yfull tree v4.10 (https://www.yfull.com/tree/) []. We considered as South Asian the Y-chromosome lineages that most likely entered the Subcontinent before the Last Glacial Maximum (LGM): H [–], K2a1* [] (this attribution on the basis of the early-branching lineage, and therefore uncertain, but only concerns a single sample and does not affect the results in any way), and C5 []. Y-chromosome haplogroups G, J, L1, L3, Q, R1 and R2 seem to have entered South Asia more recently in the early to mid-Holocene from a West Eurasian source [, –]. C(xC5), O and N probably had a Holocene Eastern origin [, , , ]. […]

Pipeline specifications

Software tools HaploGrep, PAML, BEAST, PLINK, EIGENSOFT, ADMIXTURE, sNMF
Databases HGDP nextstrain
Applications Phylogenetics, Population genetic analysis, GWAS
Organisms Homo sapiens