Computational protocol: Early Back-to-Africa Migration into the Horn of Africa

Protocol publication

[…] We merged genome-wide SNP data from the HOA with the new Yemeni data and other published data from the Middle East, North Africa, Qatar, southern Africa, west Africa, the HapMap3 project, and the Human Genome Diversity Project (HGDP) using PLINK version 1.07. We excluded symmetric SNPs, as well as SNPs and individuals with greater than 10% missing data. All known and inferred relatives were removed from the HapMap3 and HGDP data. We then estimated kinship coefficients across all remaining individuals in all included populations using the “robust” algorithm in the KING software, which is tolerant of population structure. For all sets estimated to be second-degree or closer relatives, we removed the individual(s) whose removal would maximize the number of included individuals.

After pre-processing, the main dataset included 2,194 individuals from 81 populations for 16,766 SNPs. We generated the linkage map for this dataset using the online map interpolator from the Rutgers second-generation combined linkage-physical map. This dataset includes some markers in strong linkage disequilibrium (LD), which is required for some of the analyses we conducted but can bias other methods. For the methods that can be confounded by high levels of LD, we randomly excluded one of every pair of SNPs having pairwise genotypic correlation greater than 0.5 within a sliding 50-SNP window. After this exclusion, the “reduced-LD” dataset had 16,420 SNPs.

Many methods are known to perform better with more SNPs, especially those based on patterns of LD. To ensure that the estimates from our main dataset using these methods are reliable, we created two additional verification datasets with reduced population representation, which allows for greater overlap of mutually typed SNPs across studies. The “90K” dataset includes data for 91,101 SNPs from HOA, HapMap3, HGDP, and North Africa populations. 
The “260K” dataset includes data for 259,257 SNPs from the HOA, HapMap3, HGDP, southern Africa, and selected West Asian populations (see for populations in the 90K and 260K datasets). All of the procedures described above for the main dataset were followed. [...]

Multidimensional scaling (MDS) was performed on a genome-wide matrix of identity by state (IBS) for all pairs of individuals in the reduced-LD dataset using PLINK. Stress was reduced substantially with each increase in the number of dimensions K from 2 to 5, but not for K greater than 5, so the IBS matrices were projected onto a 5-dimensional space. We inferred genetic structure and estimated admixture proportions in the reduced-LD dataset using ADMIXTURE. Ancestry proportions were estimated for K values ranging from 2 to 20, and cross-validation error was calculated for each value of K. The geographic distribution of estimated admixture proportions was plotted using methods modified from Olivier François with the MAPS, MAPTOOLS, and SPATIAL packages in R. [...]

After phasing the 260K dataset using the haplotype inference algorithm implemented in version 2 of the SHAPEIT software, we partitioned the phased data from admixed HOA and MENA populations into African and non-African chromosome segments using the chromosome painting method implemented in the CHROMOPAINTER software. This algorithm “paints” each target individual as a combination of segments from “donor” populations. As donors, we selected individuals from African and non-African populations without significant evidence for admixture: African populations used as donors were the Anuak, Ju/'hoansi, Mandenka, Mbuti, San, South Sudanese, and Yoruba; non-African ancestry populations used as donors were the Adygei, Basque, Bedouin, Brahui, Burusho, CEU, Druze, Gujarati (GIH), Hazara, Makrani, Orcadians, Pathan, Sardinians, and Saudi Arabians. 
For each admixed individual, each chromosome segment that was “painted” with 80% or greater confidence from African or non-African donor populations was assigned that origin. On average, 85% of each admixed individual's genome could be confidently partitioned. We then sampled from the painted segments to create 12 African ancestry and 12 non-African ancestry chromosomes for the admixed HOA population samples and the key neighboring admixed population samples of the Yemeni, Palestinians, Egyptians, and Mozabite (12 chromosomes was chosen as a compromise between maximizing sample size and maximizing the included populations). The Ari Blacksmith and Ari Cultivator samples were combined into a single Ari sample and the Ethiopian Somali and Somali samples were combined into a single Somali sample. The small original sample size of the Afar (n = 12) made it impossible to assemble enough African ancestry painted chromosome segments for this population and neither enough African nor non-African painted chromosome segments could be assembled for the Wolayta (original n = 8). To ensure that the African and non-African ancestry analyses would be directly comparable, we retained only those sites where 12 alleles could be selected from both the African and non-African painted segments across all populations; this reduced the starting 260K dataset to 4,340 SNPs (the “4K partitioned” dataset). Because we required a complete dataset with no missing data, the intersection across populations of available data considerably reduces the number of available sites (even though 85% of each individual genome could be confidently partitioned into African and non-African origin ancestries). Because of this, we had to use the 260K dataset, which unfortunately has reduced population representation, missing in particular most of the North African populations. [...] 
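The partitioning and site-filtering rules described above can be illustrated with a minimal Python sketch. It assumes a hypothetical input, an array of per-site probabilities that a haplotype copies from an African donor, and treats one minus that value as the non-African probability. The function names, integer codes, and array layout are illustrative assumptions, not CHROMOPAINTER's output format or the authors' code.

```python
import numpy as np

AFR, NONAFR, UNASSIGNED = 1, -1, 0  # hypothetical integer codes for this sketch

def assign_ancestry(p_african, threshold=0.80):
    """Label each site of one painted haplotype.

    p_african: hypothetical per-site probability of copying from an
    African donor; (1 - p_african) is assumed to be the non-African
    probability. Sites below the threshold for both stay unassigned.
    """
    p = np.asarray(p_african, dtype=float)
    labels = np.full(p.shape, UNASSIGNED, dtype=int)
    labels[p >= threshold] = AFR
    labels[(1.0 - p) >= threshold] = NONAFR
    return labels

def fraction_assigned(labels):
    """Share of sites confidently assigned to either ancestry."""
    return float(np.mean(np.asarray(labels) != UNASSIGNED))

def sites_with_enough_alleles(labels_by_pop, n_required=12):
    """Boolean mask of sites usable in every population.

    labels_by_pop: dict mapping population name to an int array of
    shape (n_haplotypes, n_sites) holding the codes above. A site is
    kept only if every population has at least n_required African
    and at least n_required non-African alleles there.
    """
    keep = None
    for labels in labels_by_pop.values():
        labels = np.asarray(labels)
        ok = ((labels == AFR).sum(axis=0) >= n_required) & \
             ((labels == NONAFR).sum(axis=0) >= n_required)
        keep = ok if keep is None else (keep & ok)
    return keep
```

Because the mask is an intersection over populations, a single population with sparse coverage at a site removes that site for everyone, which is consistent with the sharp drop from the 260K dataset to the 4K partitioned dataset.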
We formally tested for the presence of admixture in all study populations using the f3-statistic, the D-statistic, and a weighted LD statistic. Because a significant result for any one of these tests may be produced by histories other than admixture, we report support for an admixture hypothesis only when all three tests supported admixture. To test for admixture between a sub-Saharan African and a non-African population, the f3 test requires a reference population for each, which need not be the actual admixture source. For sub-Saharan African reference populations, we used populations that showed very little admixture of ancestral population components in the ADMIXTURE analysis: Mbuti Pygmies, Ju/'hoansi, HapMap3 Yoruba, South Sudanese, and Ari Blacksmith. For non-African reference populations, we used the HapMap3 CEU, Gujarati, and Tuscan populations in addition to Basque, Turkey, and Sardinian. The f3 test was run for all other study populations for all possible pairs of reference populations. A strict Bonferroni correction was applied to control for multiple testing; only Z-scores less than −4 for the most negative f3 statistic for each test population were considered significant. For those populations with significant f3 statistics, the bounds of the admixture proportion were then estimated with the addition of a chimpanzee outgroup. The f3 tests on the 90K and 260K datasets have more power but return almost exactly the same f3 statistic values.

The test for admixture based on the D-statistic requires three populations in addition to the test population. D-statistics significantly different from zero indicate either admixture or ancestral population structure. As in the f3 test, the reference population suspected to be the source of admixture need not be the true source. We chose our population sets such that only positive values would reflect the admixture of interest. 
For sub-Saharan African and HOA test populations, the unrooted tree tested was ((African reference, test population), (Papuan, Basque)), where the African reference populations are the same as for the f3 test. Since there is no indication in the literature of any African admixture in the Papuan population, any significantly positive D-statistic was taken as support for admixture between the test population and (a population related to) the Basque. For North African, Middle Eastern, and Eurasian test populations, the unrooted tree tested was ((Papuan, African reference), (Basque, test population)), where the African reference populations are the same as before. Again, since there is no indication in the literature of any admixture between Papuans and Basque, any significantly positive D-statistic indicates admixture between the test population and an African reference population. A strict Bonferroni correction was applied to control for multiple testing; only Z-scores greater than 4 for the most positive D-statistic for each test population were considered significant. The D tests on the 90K and 260K datasets have more power but recover indistinguishable D-statistic values.

Like the f3 test, the weighted LD test in the ALDER software requires two reference populations, which need not be the actual admixture sources, and we used the same sets of non-African and sub-Saharan African reference populations. The test procedure implemented in ALDER controls for multiple testing across all the pairs of populations for each test population, but we still controlled for multiple testing across the whole family of tests using a strict Bonferroni correction, with only Z-scores greater than 3.2 considered statistically significant. The ALDER tests for admixture on the 90K and 260K datasets have more power but return similar results.

We used three methods to calculate non-African admixture proportions in significantly admixed populations. 
First, we estimated the lower and upper bounds of non-African admixture using the bounding procedure associated with the f3 admixture test. This method requires an outgroup to the three populations in the f3 test, but does not require a large sample, or even polymorphism, for the chosen outgroup. Therefore, following the recommendation in the description of this method, we used chimpanzee as the outgroup. Second, we estimated admixture proportions using the f4 ratio estimation method. The required number of populations and the relationships among those populations for this method are as described for the D-statistic test for admixture above, with the addition of an outgroup. Again, we used chimpanzee as the outgroup. Finally, for our third measure of non-African admixture proportions, we summed the proportions attributed to non-African ancestries from our ADMIXTURE analysis at K = 12. [...]

We estimated the time of admixture for all populations identified as admixed using two LD-based methods: ROLLOFF and ALDER. Following Pickrell et al., we also compared the fit of single and double admixture models for admixed HOA populations. For comparison with other published admixture dates, we used the HapMap3 CEU and Yoruba populations as references. We also used the reference populations that gave the top f3 statistic in the f3 test for admixture and the reference populations giving the strongest signal in the ALDER test for admixture (sometimes these were the same). To verify that the admixture date estimates calculated from the main (∼17K SNP) dataset are reliable, we ran ROLLOFF and ALDER on both the 90K and 260K datasets using the HapMap3 Yoruba and CEU as the reference populations. Using the main dataset, we estimated ROLLOFF admixture dates of 2.6–3.7 ka and ALDER admixture dates of 1.1–4.1 ka for admixed HOA populations. 
The verification estimates are not meaningfully different from these, with ROLLOFF admixture dates from 2.6–3.7 ka and ALDER admixture dates from 1.2–3.3 ka for the 260K dataset.

We simulated individuals of admixed ancestry following published protocols. We extracted 20 CEU and 40 Yoruba (YRI) individuals from a 260K-SNP combined HapMap3 and HGDP dataset and phased them using fastPHASE. These phased chromosomes were combined in episodic admixture scenarios, with two instances of admixture. We started with 20 CEU individuals and 20 randomly selected Yoruba individuals, and simulated admixture at time λ0 with admixture proportion α0 deriving from the Yoruba and 1 − α0 from the CEU. For each haploid admixed genome, we randomly selected one chromosome from each source population. We then created a vector of ancestry transition events along each chromosome by sampling with probability 1 − e^(−λ0 g), where g is the genetic distance in Morgans. Using this vector of transition event locations, we selected ancestry from the Yoruba chromosome with probability α0 at each transition. This procedure was repeated until we had 40 haploid admixed genomes. We then used these admixed chromosomes as a source population for the second episode of admixture at time λ1, with admixture proportion α1 deriving from the remaining 20 YRI individuals not selected for the first admixture. We randomly combined the 40 haploid admixed genomes into 20 diploid individuals. We chose to simulate 20 admixed individuals because the modal number of individuals in our admixed populations was about 20.

In our first set of simulations, we simulated admixture with λ0 equal to 50, 100, 150, or 200 generations and λ1 equal to 10 or 30 generations. Admixture proportion α0 was either 0.10 or 0.25 and admixture proportion α1 was 0.10. Three independent replicates were performed for each combination of parameters (48 simulations in total). 
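The construction of a single mosaic haplotype for one admixture episode, as described above, might be sketched as follows in Python. This is a minimal illustration rather than the authors' simulation code: the function name, argument layout, and the choice to draw the ancestry of the first site with probability alpha are all assumptions of this sketch.

```python
import numpy as np

def simulate_admixed_haplotype(pos_morgans, hap_a, hap_b, lam, alpha, rng):
    """Build one mosaic haplotype from two phased source haplotypes.

    pos_morgans: genetic positions of the SNPs in Morgans (ascending).
    hap_a, hap_b: alleles of the source-A and source-B haplotypes.
    lam: time since admixture in generations.
    alpha: probability of choosing source A at each transition.
    rng: a numpy.random.Generator.
    Returns the mosaic alleles and a boolean array marking source-A sites.
    """
    pos = np.asarray(pos_morgans, dtype=float)
    hap_a = np.asarray(hap_a)
    hap_b = np.asarray(hap_b)

    # Transition between adjacent SNPs with probability 1 - exp(-lam * g),
    # where g is the genetic distance between them in Morgans.
    gaps = np.diff(pos)
    p_switch = 1.0 - np.exp(-lam * gaps)
    transitions = rng.random(gaps.size) < p_switch

    from_a = np.empty(pos.size, dtype=bool)
    # Assumption of this sketch: ancestry at the first site is also
    # drawn with probability alpha.
    from_a[0] = rng.random() < alpha
    for i, switched in enumerate(transitions, start=1):
        # At a transition, re-draw ancestry; otherwise carry it forward.
        from_a[i] = (rng.random() < alpha) if switched else from_a[i - 1]

    return np.where(from_a, hap_a, hap_b), from_a
```

A second episode could reuse the same function, passing a haplotype from the remaining Yoruba individuals as source A and a previously simulated mosaic as source B, with lam and alpha set to λ1 and α1.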
The second set of simulations used λ0 equal to 50, 100, 150, 300, 500, 650, 850, 1000, or 1150 generations and λ1 equal to 30 generations. Admixture proportion α0 was 0.50 and admixture proportion α1 was 0.10. Again, three independent replicates were performed for each combination of parameters (27 simulations in total). Admixture dates were estimated for the simulation data using ROLLOFF and ALDER with the remaining unadmixed CEU and Yoruba individuals as the reference populations. In addition, we reduced the simulated data to the 16,766 SNPs present in the main dataset used to estimate admixture dates for the study populations and estimated admixture dates using ROLLOFF and ALDER for the same set of reference population pairs. […]
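As a quick check on the simulation counts given above, the two parameter grids can be enumerated directly. The variable names in this short Python sketch are illustrative only.

```python
from itertools import product

REPLICATES = range(3)  # three independent replicates per parameter combination

# First set: lambda0 x lambda1 x alpha0 x alpha1 x replicate
first_set = list(product(
    [50, 100, 150, 200],   # lambda0, generations
    [10, 30],              # lambda1, generations
    [0.10, 0.25],          # alpha0
    [0.10],                # alpha1
    REPLICATES,
))

# Second set: nine values of lambda0, everything else fixed
second_set = list(product(
    [50, 100, 150, 300, 500, 650, 850, 1000, 1150],  # lambda0, generations
    [30],                                            # lambda1, generations
    [0.50],                                          # alpha0
    [0.10],                                          # alpha1
    REPLICATES,
))

print(len(first_set), len(second_set))
```

The first grid has 4 × 2 × 2 × 1 × 3 = 48 entries and the second has 9 × 1 × 1 × 1 × 3 = 27, matching the totals stated in the text.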

Pipeline specifications