Computational protocol: Genomic insights on the ethno-history of the Maya and the ‘Ladinos’ from Guatemala

Protocol publication

[…] A total of 110 samples (2 Achi, 2 Kaqchikel, 2 K’iche’, 11 ‘Ladino’, 18 Poqomchi’, and 75 Q’eqchi’) were analyzed for mtDNA (see Additional file for more details). All samples were amplified and double-strand sequenced for the entire mtDNA control region. Mutations are referenced with respect to the revised Cambridge Reference Sequence (rCRS). Haplogroup nomenclature follows Phylotree Build 16. The sequences were initially classified into haplogroups using HaploGrep and then checked manually according to published recommendations. Potential sequence artifacts were checked as reported previously. To increase the phylogenetic resolution of mtDNA HVS-I/II within the Native American phylogeny, we genotyped coding region mtDNA SNPs (mtSNPs) using a single multiplex SNaPshot reaction, as described previously. mtSNP patterns that were unexpected given the known phylogeny were confirmed by repeating the SNP genotyping with single-plex minisequencing and automatic sequencing.

Based on the information provided by the control region profiles (Additional file), 12 Native American lineages (carried by 10 Q’eqchi’ and 2 Poqomchi’) were selected for entire mtDNA genome sequencing following previously described protocols (see Additional file). Selection was based mainly on how distinctive the mutational changes carried by these profiles were when compared against the known variability in other Native American datasets and the known phylogeny. The complete genomes analyzed in the present study have been submitted to GenBank under the accession numbers KM051465-KM051476.

[...] We used HVS-I data to build phylogenetic networks with the aid of the program Network and by hand.
Hypervariable sites in the HVS-I segment such as A16182C, A16183C, and T16519C were not considered, as is common practice.

Maximum parsimony trees were built for the complete genomes obtained in the present study together with genomes collected from the literature that belong to the haplogroups represented among the Guatemalan mitogenomes, following the known worldwide phylogeny (Phylotree). The time to the most recent common ancestor (TMRCA) was estimated using two different procedures.

TMRCA was first calculated using a maximum likelihood (ML) procedure (Table). For this purpose, the software PAML 3.13 was used assuming the HKY85 mutation model (ignoring indels, as usual), gamma-distributed rates (approximated by a discrete distribution with 32 categories), and three partitions: HVS-I (positions 16051–16400), HVS-II (positions 68–263), and the remainder.

TMRCA was also computed from the averaged distance (ρ) of the haplotypes of a clade to the respective root haplotype, together with a heuristic estimate of the standard error (σ) calculated from an estimate of the genealogy (Additional file). These estimates were computed on the mitogenomes considering (i) the whole variation observed (excluding indels and hotspots) and (ii) only synonymous mutations. The ‘star-likeness’ of the trees was measured using the star index ρ/(n × σ²); this index can take values between 1/n (a single haplotype representing n mtDNAs) and 1 (a perfect star phylogeny).

Both methods show very similar divergence ages when applied to mitogenomes. However, the averaged distance to the root behaves anomalously for A2w1 and its sub-clades, yielding ages about twice as large (averaged over all sub-clades) as the ML estimates, compared with an average discrepancy of 1.2 for the rest of the sub-clades. Estimates based on synonymous mutations also show large discrepancies with the ML method.
In addition, A2w1 shows very low values of star-likeness (Additional file), which could indicate an overrepresentation of A2w1 mitogenomes sampled in South America (coupled with an underrepresentation of A2w1 members from other Mesoamerican locations where this clade is probably present) or could simply reflect the limited sample size in this phylogenetic branch. Overall, the non-star-like phylogenetic pattern in A2w1 makes the ML method the more reliable and consistent choice for estimating TMRCA. Thus, ML estimates were used for discussion throughout the text.

Mutational distances were converted into years using the corrected molecular clock proposed by Soares et al.

[...] Admixture proportions from autosomal data were inferred by comparing genetic profiles from the present study with those publicly available from the Human Genome Diversity Cell Line Panel, HGDP-CEPH (Centre d’Etude du Polymorphisme Humain). These reference parental samples (N = 327) came from populations of three different continents: Africa (Central African Republic, Democratic Republic of Congo, Kenya, Namibia, Nigeria, Senegal, South Africa; N = 105), Europe (France, Italy, Orkney Islands, Russia, Russia Caucasus; N = 158), and America (Brazil, Colombia, Mexico; N = 64). Present-day East Asians were not used as a reference population, on the assumption that they did not contribute substantially to the recent genetic heritage of the Guatemalan people, as is the case in other American locations.

Statistical analysis of the ancestry informative markers (AIMs) used several tools aimed at disentangling the population structure of the Guatemalan study samples. Multivariate analyses were carried out using Principal Component Analysis (PCA).
PCA condenses an initial data set of quantitative variables into a few principal components (usually two, PC1 and PC2), each of which is a linear combination of the original variables. PCA was performed using the statistical software R (v.3.0.1) together with the SNPassoc package (v.1.8-5).

To further estimate individual ancestry proportions we used ADMIXTURE, which computes ML estimates of individual ancestries from multilocus SNP data (AIMs).

Finally, phylogeographic searches of mtDNA profiles were carried out on an in-house database containing >27,000 mitogenomes and >170,000 partial (mainly HVS-I) mtDNA sequences. Additional exploratory haplotype searches were carried out on the EMPOP, Familytree, and Sorenson databases. Note that frequencies obtained from these additional database searches provide only approximate figures, given that their web interfaces were not conceived specifically for population genetic purposes (e.g. EMPOP is designed for forensic casework). […]
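
The ρ-based estimate and the star index described in the protocol can be illustrated with a minimal Python sketch. It assumes the estimator commonly used with ρ: for a clade of n sampled mtDNAs, each tree branch i carries l_i mutations and has n_i sampled descendants, so that ρ = Σ n_i·l_i / n and σ² = Σ n_i²·l_i / n². The protocol does not spell out its σ formula, so this choice is an assumption, and the function name `rho_statistics` and the toy tree are hypothetical. Under this assumption the star index ρ/(n × σ²) has the bounds stated in the text (1/n and 1). Conversion to years is deliberately left out because the clock parameters are not given in this excerpt.

```python
from math import sqrt
from typing import Iterable, NamedTuple, Tuple


class RhoEstimate(NamedTuple):
    rho: float         # average number of mutations from the root haplotype
    sigma: float       # heuristic standard error of rho
    star_index: float  # rho / (n * sigma^2); ranges from 1/n to 1


def rho_statistics(branches: Iterable[Tuple[int, int]], n: int) -> RhoEstimate:
    """Compute rho, sigma and the star index for one clade.

    branches: (n_i, l_i) pairs, one per tree branch, where n_i is the number
              of sampled mtDNAs descending from the branch and l_i is the
              number of mutations on it (assumed estimator, see lead-in).
    n:        total number of sampled mtDNAs in the clade, including any
              that carry the root haplotype itself.
    """
    branches = list(branches)
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    weighted = sum(n_i * l_i for n_i, l_i in branches)
    if weighted == 0:
        raise ValueError("no mutations in the clade: sigma is undefined")
    rho = weighted / n
    variance = sum(n_i ** 2 * l_i for n_i, l_i in branches) / n ** 2
    return RhoEstimate(rho, sqrt(variance), rho / (n * variance))


if __name__ == "__main__":
    # Hypothetical clade of 5 mtDNAs: one carries the root haplotype, one
    # internal branch (1 mutation) leads to 3 samples, two of which carry one
    # further private mutation each, and one sample sits 2 mutations from root.
    toy = rho_statistics([(3, 1), (1, 1), (1, 1), (1, 2)], n=5)
    print(toy)
```

Low star-index values arise when many samples share deep branches, which is the situation the protocol reports for A2w1 and the reason it gives for preferring the ML estimates there.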

Pipeline specifications

Software tools: HaploGrep, PAML, ADMIXTURE, SNPassoc
Databases: nextstrain
Applications: Phylogenetics, Population genetic analysis, GWAS
Organisms: Homo sapiens