Computational protocol: Genetic Ancestry of Hadza and Sandawe Peoples Reveals Ancient Population Structure in Africa

Protocol publication

[…] Semi-supervised and unsupervised clustering analyses were performed using ADMIXTURE version 1.22. Analyses were performed in triplicate with different starting seeds and five-fold cross-validation. Standard errors were estimated using 200 bootstrap replicates. For the semi-supervised analysis, we generated pseudo-samples to serve as training data by identifying the individuals with the highest proportions of each of the 19 previously defined ancestries in a global reference panel of 3,528 individuals. We required an ancestry proportion of ≥50%, regardless of the sample to which the individual belonged. For each ancestry, we sorted all individuals by ancestry proportion and took the top 20, except for Oceanian ancestry, for which only seven individuals met the minimum ancestry proportion criterion. The labeled training data set for the semi-supervised analysis thus comprised genotype data for 367 individuals (18 × 20 + 7). We then analyzed the Hadza and Sandawe sample data given these training data. From the individual estimates, sample means were computed using inverse-variance weights. Sample means not significantly different from zero were set to zero, and the remaining sample means were rescaled to sum to 1. Semi-supervised analysis is called supervised analysis in ADMIXTURE and can be performed by invoking the option --supervised. Supervised analysis based on predefined allele frequencies that are not allowed to be updated by the sample genotype data is called projection analysis in ADMIXTURE and can be performed by invoking the option -P. Supervised analysis is not recommended if there are ancestries missing from the panel of predefined allele frequencies.

For the unsupervised analysis, we filtered our reference set to exclude samples with Asian and/or European ancestry. This filtering step resulted in a data set of 881 individuals from 47 samples, including /Gui and //Gana, !Xun (two samples), Agaw, amaXhosa, Amhara (two samples), Angolan !Xun, Anuak, Ari Blacksmith, Ari Cultivator, Bamoun, Bantu from Kenya, Bantu from South Africa, Biaka Pygmy, Brong, Bulala, Ethiopian Jews, Fang, Gumuz, Hadza, Hausa, Ju/’hoansi (two samples), Kaba, Khwe, Kongo, Luhya, Maasai, Mada, Mandenka, Mbuti Pygmy, Mozabite, Oromo (two samples), Qatari Arab, Sahrawi, San, Sandawe, SEBantu (Sotho, Tswana, and Zulu), Somali (two samples), Sudanese, Tunisia, Wolayta, and Yoruba (two samples). Unsupervised analysis was performed in ADMIXTURE’s default mode. [...]

We reformatted the ancestry-specific allele frequencies from the unsupervised clustering analysis for migration analysis using TreeMix. To do this, we estimated the effective sample size for each ancestry by summing the mixture proportions across individuals from ADMIXTURE’s Q matrix. We then multiplied these effective sample sizes by two to estimate the effective number of alleles. Finally, we multiplied the effective number of alleles by the ancestry-specific allele frequencies to arrive at ancestry-specific allele counts. We defined a root by coding two copies of the ancestral allele at each position. We varied the number of migration events from 0 to 8. Conditional on the number of migration events, we generated 100 bootstrap replicates. Our stopping rule was the number of migration events at which the range of residuals stopped decreasing. [...]

Each pairwise distance estimated from TreeMix involves the distance from a terminal tip to an internal node plus the distance from that internal node to a second terminal tip and thus is an estimate of $2\hat{F}_{ST}$, assuming equal sample sizes. We estimated divergence time using the estimators $\hat{N}_e = \hat{\theta} / (4\hat{\mu})$ and $1 - \hat{F}_{ST} = \left(1 - \frac{1}{2\tilde{N}_e}\right)^{t}$, in which $t$ is the number of generations, $\hat{\mu}$ is the number of mutations per generation per site, $\hat{F}_{ST}$ is half of the pairwise distance from TreeMix, and $\tilde{N}_e$ is the harmonic mean of the estimated effective population sizes $\hat{N}_e$, assuming that $\hat{F}_{ST} = 0$ at $t = 0$.
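The conversion of ADMIXTURE output into ancestry-specific allele counts for TreeMix, described earlier in this section, can be sketched in a few lines. This is a minimal illustration and not the authors' script: the function names are hypothetical, Q is assumed to be an individuals × ancestries array, P is assumed to be a sites × ancestries array of allele frequencies, and rounding the counts to integers before writing them out is an assumption the protocol does not state. Coding the root (two copies of the ancestral allele per site) is left out because it requires knowing which allele is ancestral.

```python
import numpy as np


def ancestry_allele_counts(Q, P):
    """Return ancestry-specific counts for both alleles at every site.

    Q: (individuals x ancestries) mixture proportions.
    P: (sites x ancestries) ancestry-specific allele frequencies.
    """
    Q = np.asarray(Q, dtype=float)
    P = np.asarray(P, dtype=float)
    if Q.shape[1] != P.shape[1]:
        raise ValueError("Q and P must have the same number of ancestries")
    n_eff = Q.sum(axis=0)          # effective sample size per ancestry
    n_alleles = 2.0 * n_eff        # effective number of alleles (diploid)
    counts_a = P * n_alleles       # counts of the allele tracked in P
    counts_b = (1.0 - P) * n_alleles  # counts of the other allele
    return counts_a, counts_b


def format_treemix_lines(counts_a, counts_b, names):
    """Format counts as a header of names followed by one 'a,b a,b ...' line per site.

    Counts are rounded to the nearest integer (an assumption of this sketch).
    """
    a = np.rint(counts_a).astype(int)
    b = np.rint(counts_b).astype(int)
    lines = [" ".join(names)]
    for row_a, row_b in zip(a, b):
        lines.append(" ".join(f"{x},{y}" for x, y in zip(row_a, row_b)))
    return lines


if __name__ == "__main__":
    # Toy example: three individuals, two ancestries, one site.
    Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    P = [[0.5, 1.0]]
    counts_a, counts_b = ancestry_allele_counts(Q, P)
    print("\n".join(format_treemix_lines(counts_a, counts_b, ["anc1", "anc2"])))
```

Because the effective sample sizes are sums of fractional mixture proportions, the raw counts are generally not integers, which is why the rounding step is flagged as an assumption here.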
We estimated $\hat{N}_e$ using the mlRho autosomal heterozygosities $\hat{H}$ reported in the Simons Genome Diversity Project and the relationship $\hat{\theta} = \hat{H} / (1 - \hat{H})$. The Simons Genome Diversity Project did not include individuals representing Cushitic or Omotic ancestral majorities. We estimated $\hat{N}_e$ for Hadza and Sandawe by scaling the Western Pygmy estimate. For $\hat{\mu}$, we used the weighted average (% non-sub-Saharan ancestry) × $1.17 \times 10^{-8}$ mutations per generation per site + (% sub-Saharan African ancestry) × $0.97 \times 10^{-8}$ mutations per generation per site, based on supervised clustering analysis of the Simons Genome Diversity Project data and our reference panel of ancestries. To convert generations into years, we assumed a generation interval of 28 years. […]
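The divergence-time calculation above can be illustrated with a short sketch. It rearranges the drift relationship to t = ln(1 − F_ST) / ln(1 − 1/(2·N_e)), with N_e the harmonic mean of the two effective population sizes, and chains it with the estimators for θ, N_e, and μ given in the text. The function names and the input values in the usage example are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math

MU_NON_SSA = 1.17e-8    # mutations per generation per site, non-sub-Saharan ancestry
MU_SSA = 0.97e-8        # mutations per generation per site, sub-Saharan African ancestry
GENERATION_YEARS = 28   # generation interval assumed in the text


def mutation_rate(frac_ssa):
    """Ancestry-weighted mutation rate; frac_ssa is the sub-Saharan fraction (0 to 1)."""
    return frac_ssa * MU_SSA + (1.0 - frac_ssa) * MU_NON_SSA


def effective_size(heterozygosity, mu):
    """N_e = theta / (4 * mu), with theta = H / (1 - H)."""
    theta = heterozygosity / (1.0 - heterozygosity)
    return theta / (4.0 * mu)


def harmonic_mean(values):
    """Harmonic mean of a sequence of positive numbers."""
    return len(values) / sum(1.0 / v for v in values)


def divergence_generations(fst, ne_harmonic):
    """Solve 1 - F_ST = (1 - 1 / (2 * N_e)) ** t for t."""
    return math.log(1.0 - fst) / math.log(1.0 - 1.0 / (2.0 * ne_harmonic))


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    pairwise_distance = 0.04        # TreeMix tip-to-tip distance, about 2 * F_ST
    fst = pairwise_distance / 2.0
    mu = mutation_rate(frac_ssa=1.0)
    ne_a = effective_size(heterozygosity=0.0010, mu=mu)
    ne_b = effective_size(heterozygosity=0.0008, mu=mu)
    t = divergence_generations(fst, harmonic_mean([ne_a, ne_b]))
    print(f"{t:.0f} generations, about {t * GENERATION_YEARS:.0f} years")
```

Using the harmonic rather than the arithmetic mean gives more weight to the smaller of the two populations, in which drift accumulates faster.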

Pipeline specifications

Software tools: ADMIXTURE, TreeMix, mlRho
Application: Population genetic analysis
Organisms: Homo sapiens, Guizotia abyssinica