Computational protocol: Investigating the Prehistory of Tungusic Peoples of Siberia and the Amur-Ussuri Region with Complete mtDNA Genome Sequences and Y-chromosomal Markers

[…] We generated a total of 525 complete mtDNA genome sequences from 130 Evenks belonging to four subgroups (from northwest to southeast: Taimyr, Stony Tunguska, Nyukzha, and Iengra), 122 Evens belonging to five subgroups (from west to east: Sakkyryyr, Sebjan, Tompo, Berezovka, and Kamchatka), 31 Udegey, 169 Yakuts belonging to three subgroups (Vilyuy, Central, and Northeast), 20 Yukaghirs, 15 Koryaks, and 38 Nivkh, using the method described in the Supplementary Materials of Barbieri et al. The Taimyr Evenk samples were collected in the village of Khantayskoe Ozero, the Sebjan and Berezovka samples in the villages of Sebjan-Küöl and Berezovka (Yakutia), respectively, the Kamchatkan Even and Koryak samples in the villages of Esso and Anavgai, the Nivkh samples in northern Sakhalin, and the Udegey samples in the village of Gavsjugi (Khabarovsk Region). Sample locations for the other (sub)populations are described in the references cited in the original article.
The sequences were generated on an Illumina Genome Analyzer IIx sequencer to an average coverage of 274x, and the complete sequences were deposited in GenBank under accession numbers KF148067-KF148359 and KF148361-KF148592. Three different alignments were generated for analysis. The first, used to calculate pairwise ΦST, standard mtDNA diversity indices, and Analysis of Molecular Variance (AMOVA) in Arlequin v3.5, excluded all indels, all positions with unclear or missing data (coded as Ns), and the poly-C region (positions 16,183–16,194), leaving an alignment of 16,507 bp. The second, used for the analysis of haplotype sharing with an in-house Python script and for Bayesian Skyline Plots (BSP) produced with BEAST, omitted only the poly-C region and positions containing Ns but retained all indels, resulting in an alignment of 16,521 bp. The third was used to construct Median-Joining networks; it was 16,478 bp long to accommodate published data downloaded from GenBank.
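
The column filtering that distinguishes the first two alignments, together with a simple haplotype-sharing count, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the in-house script used in the study: the function names, the `positions` list that maps each alignment column to an rCRS coordinate, and the `population_of` lookup are all hypothetical.

```python
from itertools import combinations

# Poly-C region excluded from the alignments (positions 16183-16194, inclusive).
POLY_C = range(16183, 16195)


def filter_columns(seqs, positions, drop_indels):
    """Remove poly-C columns, columns containing N, and optionally gapped columns.

    seqs        -- dict mapping sample name to an aligned sequence (equal lengths)
    positions   -- assumed list giving the rCRS coordinate of each alignment column
    drop_indels -- True for the strict (first) alignment, False for the second one
    """
    names = list(seqs)
    keep = []
    for i, pos in enumerate(positions):
        column = {seqs[name][i].upper() for name in names}
        if pos in POLY_C:
            continue  # poly-C stretch is always removed
        if "N" in column:
            continue  # unclear or missing data in at least one sample
        if drop_indels and "-" in column:
            continue  # indel column, removed only in the strict alignment
        keep.append(i)
    return {name: "".join(seqs[name][i] for i in keep) for name in names}


def shared_haplotypes(seqs, population_of):
    """Count distinct haplotypes shared by each pair of populations."""
    haplotypes = {}
    for name, seq in seqs.items():
        haplotypes.setdefault(population_of[name], set()).add(seq)
    return {
        (a, b): len(haplotypes[a] & haplotypes[b])
        for a, b in combinations(sorted(haplotypes), 2)
    }
```

Calling `filter_columns` with `drop_indels=True` mirrors the stricter filtering used for the Arlequin analyses, whereas `drop_indels=False` retains indel columns, so that sequences differing only by an indel count as distinct haplotypes when sharing is tallied.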
Haplogroups were assigned with the online tool Haplogrep according to PhyloTree Build 15, using the rCRS as the reference sequence. For each population and subgroup, the best-fitting substitution model was determined with jModelTest v2.1. Using the selected substitution model, each (sub)population was then tested for adherence to a molecular clock in MEGA v5; the null hypothesis of a molecular clock was not rejected for any (sub)population. Bayesian Skyline Plots (BSP) were then generated with the BEAST package v1.6 under a strict clock model, partitioning the data into the coding (577–16,023) and non-coding (16,024–576) regions and applying the mutation rates from Soares et al. (1.708E-8 and 9.883E-8 substitutions per site per year, respectively).
Median-Joining (MJ) networks were generated with the Network 4.6 and Network Publisher v1.3 programmes, with transversions given a threefold higher weight than transitions. Correspondence analysis (CA) and multi-dimensional scaling (MDS) plots were generated with STATISTICA v10. Mantel tests of the correlation between geographic great-circle distances (calculated with the R package “geosphere”) and ΦST genetic distances between (sub)populations were performed in R with the “ade4” package.
Published mtDNA genome sequences were downloaded from GenBank for inclusion in the network analyses. These comprised additional Yukaghir, Evenk, Even, Udegey, Koryak, and Yakut sequences, as well as sequences from Mongolic-speaking Buryats and Khamnigans; Turkic-speaking Tofalar, Tubalar, Altai-Kizhi, Shor, Teleut, Tuvan, and Kazakh from South Siberia; Kets, who speak an isolate language, and Finno-Ugric-speaking Mansi from western Siberia; Samoyedic-speaking Nganasan from the Taimyr Peninsula; Tungusic-speaking Negidal and Ulchi from the Amur-Ussuri region; Chukchi from Chukotka, who speak a Chukotko-Kamchatkan language; and Eskimos. […]
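
The logic of the Mantel test on great-circle and ΦST distances can be illustrated with a short Python sketch. The study itself used the R packages “geosphere” and “ade4”; the code below is a plain-Python stand-in, not a reproduction of those packages. The haversine formula, the mean Earth radius of 6371 km, the one-sided permutation p-value, and all function names are assumptions made for illustration.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def _upper(matrix, order):
    """Upper-triangle entries of a square matrix after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(geo, gen, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value.

    geo and gen are symmetric distance matrices (lists of lists) over the same
    populations in the same order. Rows and columns of geo are permuted jointly.
    """
    rng = random.Random(seed)
    identity = list(range(len(geo)))
    gen_vec = _upper(gen, identity)
    r_obs = _pearson(_upper(geo, identity), gen_vec)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if _pearson(_upper(geo, order), gen_vec) >= r_obs:
            hits += 1
    # The observed arrangement is counted as one of the permutations.
    return r_obs, (hits + 1) / (permutations + 1)
```

In use, `geo` would be filled by calling `haversine_km` on the coordinates of each pair of sampling locations and `gen` by the pairwise ΦST values exported from Arlequin; permuting rows and columns together preserves the dependence among distances that involve the same population.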

