Computational protocol: Human Neutral Genetic Variation and Forensic STR Data

Protocol publication

[…] The complete datasets provided by the strdna-db and PopAffiliator online resources were retrieved and named, respectively, the “Frequency” (allele frequencies) and “Genotype” (genotype profiles) datasets. Both datasets were first filtered to maximize the amount of comparable data, retaining only those samples with information on the 13 CODIS loci commonly tested with commercial forensic kits. The details and reference of each sample used in this study are given in and . The allele nomenclature used refers to the number of repeats. Imperfect alleles, which gain or lose an incomplete repeat motif, were also considered. There is some heterogeneity among loci with respect to the complexity of repeat variation. Loci D21S11 and FGA have several imperfect alleles that interrupt the 4 bp repetitive structure. In FGA, these imperfect alleles occur at rather low frequencies across samples, whereas they reach substantial frequencies in D21S11. Locus TH01 instead has a single, very frequent imperfect allele (allele 9.3, ∼17% and ∼20% in the Frequency and Genotype datasets, respectively). For the other 10 loci, the frequency of imperfect alleles is extremely low (<1%).

Further filters were then applied separately to each dataset. For the Frequency dataset, we checked that the allele frequencies summed to 1 for each sample, which left 190 globally distributed population samples (average over loci of 66,349±1,411 individuals) fitting this requirement ( and ). The usefulness of these samples for population genetics studies was further evaluated by identifying samples thought to consist of individuals of mixed origins or poorly defined provenance (e.g. metropolitan samples or “mestizo” samples from South America), or populations that have recently changed geographic location (e.g. Koreans living in Russia).
In this way, 141 samples (totaling 25,669 individuals) were classified as well-defined, being representative of a given location presumably for a relatively long time, and the other 49 were considered possibly admixed populations with limited information about geographic/ethnic origin. We therefore considered only the well-defined samples () for the statistical analyses presented in this paper. Note that the continents are much more evenly represented among the well-defined samples (Africa = 10%, Asia = 42%, Europe = 41%, America = 3%, Oceania = 4%) than in the full database.

We applied criteria similar to those used for the Frequency dataset to the Genotype dataset (described in and ), which resulted in 42 well-defined population samples, totaling 11,132 individuals and covering all inhabited continents except Australia (). For the Genotype dataset, we also tested Hardy-Weinberg equilibrium in all populations and for all loci using Arlequin 3.5.

As shown in , we allocated the population samples to 12 major world geographic regions that correspond to natural geographic subdivisions and the spatial extent of major human language families, following criteria adopted by the immunogenetics community: North Africa (NAF), sub-Saharan Africa (SAF), Europe (EUR), Southwest Asia (SWAS), South Asia (SAS), Central Asia (CAS), Southeast Asia (SEAS), East Asia (EAS), Northeast Asia (NEAS), Australia (AUS), North America (NAM), Central and South America (CSAM). The two datasets differ considerably in the number and distribution of population samples (27 sampling locations are common to both), with the Frequency dataset representing 11 of the 12 geographic groups (all but NEAS) and the Genotype dataset 8 groups (all but NAM, AUS, SAS and SEAS). [...]
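
The frequency-sum filter applied to the Frequency dataset can be illustrated with a minimal Python sketch. The data layout, the per-locus check, the function names and the tolerance are assumptions made for illustration (the text only states that frequencies had to sum to 1 for each sample), and the allele frequencies shown are invented values, not data from the study.

```python
from typing import Dict, List

# Hypothetical layout for one population sample:
# {locus name: {allele label: frequency}}
Sample = Dict[str, Dict[str, float]]


def loci_failing_sum_check(sample: Sample, tol: float = 1e-3) -> List[str]:
    """Return the loci whose allele frequencies do not sum to 1 within tol.

    The tolerance is an assumption; it absorbs rounding in published tables.
    """
    return sorted(
        locus
        for locus, freqs in sample.items()
        if abs(sum(freqs.values()) - 1.0) > tol
    )


def passes_sum_check(sample: Sample, tol: float = 1e-3) -> bool:
    """A sample is kept only if every locus passes the sum check."""
    return not loci_failing_sum_check(sample, tol)


# Invented frequencies, for illustration only.
example: Sample = {
    "TH01": {"6": 0.23, "7": 0.17, "8": 0.11, "9": 0.16, "9.3": 0.32, "10": 0.01},
    "FGA": {"20": 0.15, "21": 0.18, "22": 0.20, "23": 0.15},  # incomplete locus
}

if __name__ == "__main__":
    print(loci_failing_sum_check(example))
    print(passes_sum_check(example))
```

Reporting the failing loci, rather than only a pass/fail flag, makes it easier to tell a rounding problem from a locus with missing alleles.
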
The genetic relationships between populations were first estimated by computing matrices of pairwise RST indices (distances between alleles were computed as sums of squared differences in repeat numbers) with the software Arlequin 3.5. For the Genotype data, the RST values were calculated directly and their significance tested with the permutation procedure implemented in Arlequin (10,000 iterations). For the Frequency dataset, multi-locus RST between each pair of samples was computed using the Michalakis and Excoffier approach, as applied in : briefly, since the RST index is the ratio of the genetic variance due to differences between populations to the total genetic variance, locus-specific variance components were computed using Arlequin and then summed over all loci to obtain a multi-locus RST value. For each locus, RST significance was tested through the permutation procedure of Arlequin (10,000 iterations). Population pairwise RST values inferred from each dataset were then used to calculate coancestry coefficients (i.e. Reynolds genetic distances), and the resulting matrices of population pairwise genetic distances were submitted to multidimensional scaling (MDS) analysis using R.

In order to explore the relationship between genetic and geographic distances among populations, pairwise great-circle distances between population locations were calculated with GeoDist in both datasets. Here too, we imposed obligatory waypoints between major landmasses to compute geographic distances between populations from different continents.
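
The distance computations described above can be sketched in Python as follows. This is an illustrative sketch, not the Arlequin or GeoDist implementation: the function names, the clamping of negative RST values to zero, the Earth radius and the waypoint coordinates are all assumptions, since the text does not specify them. The Reynolds coancestry distance is taken in its usual form, D = -ln(1 - RST).

```python
import math
from typing import List, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def multilocus_rst(components: Sequence[Tuple[float, float]]) -> float:
    """Combine per-locus (among-population variance, total variance) pairs.

    Variance components are summed over loci before taking the ratio,
    rather than averaging per-locus RST values.
    """
    among = sum(a for a, _ in components)
    total = sum(t for _, t in components)
    return among / total


def reynolds_distance(rst: float) -> float:
    """Coancestry (Reynolds) distance D = -ln(1 - RST).

    Negative RST estimates are clamped to 0 here (an assumption).
    """
    return -math.log(1.0 - max(rst, 0.0))


def great_circle_km(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def path_km(points: List[Tuple[float, float]]) -> float:
    """Length of a route that passes through each point in order."""
    return sum(great_circle_km(a, b) for a, b in zip(points, points[1:]))


if __name__ == "__main__":
    # Invented variance components for two loci.
    rst = multilocus_rst([(0.12, 2.4), (0.05, 1.9)])
    print(rst, reynolds_distance(rst))

    # Hypothetical waypoint near Cairo; the waypoints used in the study
    # are not listed in the text.
    nairobi, cairo, paris = (-1.29, 36.82), (30.0, 31.2), (48.86, 2.35)
    print(great_circle_km(nairobi, paris), path_km([nairobi, cairo, paris]))
```

Routing through a waypoint can only lengthen the path relative to the direct great-circle distance, which is the intended effect when the direct route would cross open ocean.
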
We used the Mantel test implemented in the GenAlEx 6 software to test the significance of the resulting correlation coefficients between geographic and genetic distances, using a permutation procedure with 1,000 permutations.

The levels of genetic differentiation between all populations and between geographic groups of populations were assessed in both datasets through analyses of molecular variance (AMOVA). We used a hierarchical framework to estimate three fixation indices reflecting the levels of genetic differentiation, respectively, among populations within geographic groups (RSC), between geographic groups of populations (RCT), and globally among all populations (RST). The significance of these fixation indices was tested by 10,000 iterations of the permutation procedure implemented in Arlequin. For the Frequency dataset, all AMOVA computations were performed for each locus independently and, as was done for population pairwise RST (see above), the variance components were combined across loci to infer multi-locus fixation indices. The statistical significance of the global multi-locus fixation indices was obtained using Fisher’s combined probability test.

The Genotype dataset was also analysed with the STRUCTURE software, which infers population clusters that maximize Hardy-Weinberg and linkage equilibrium. For this analysis we used the admixture model and the correlated allele frequency model, assuming an ancestral relationship between the populations as was done in , and we did not assume a priori assignment of individuals to populations. We tested up to nine clusters (K), with 10 replicate runs of 100,000 iterations each after a burn-in step of 10,000 iterations. The Evanno approach was applied to determine the number of clusters K that best fits the data. […]
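
Three of the statistical procedures named above (the Mantel permutation test, Fisher's combined probability test and the Evanno ΔK criterion) can be sketched in plain Python. These are generic textbook versions written for illustration, not the GenAlEx, Arlequin or STRUCTURE implementations. The function names, the one-tailed Mantel p-value, the random seed and the use of mean log-likelihoods in ΔK are assumptions.

```python
import math
import random
import statistics
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def _pearson(x: List[float], y: List[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(gen: Matrix, geo: Matrix, permutations: int = 1000,
           seed: int = 0) -> Tuple[float, float]:
    """Correlation between two distance matrices and a one-tailed p-value.

    Population labels of `gen` are shuffled (rows and columns together).
    """
    n = len(gen)
    pairs = [(i, j) for i in range(n) for j in range(i)]  # lower triangle
    geo_vec = [geo[i][j] for i, j in pairs]
    observed = _pearson([gen[i][j] for i, j in pairs], geo_vec)
    rng = random.Random(seed)
    order = list(range(n))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)
        r = _pearson([gen[order[i]][order[j]] for i, j in pairs], geo_vec)
        hits += r >= observed
    return observed, hits / (permutations + 1)


def fisher_combined(pvalues: Sequence[float]) -> Tuple[float, float]:
    """Fisher's statistic X = -2 * sum(ln p) and its chi-square p-value.

    With 2k degrees of freedom the upper tail has a closed form, so no
    external library is needed. All p-values must be greater than 0.
    """
    half = -sum(math.log(p) for p in pvalues)  # X / 2
    term = total = 1.0
    for i in range(1, len(pvalues)):
        term *= half / i
        total += term
    return 2.0 * half, math.exp(-half) * total


def evanno_delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), using mean L per K.

    `lnp` maps K to the ln P(X|K) values of its replicate runs.
    """
    mean = {k: statistics.mean(v) for k, v in lnp.items()}
    out = {}
    for k in sorted(lnp):
        if k - 1 in lnp and k + 1 in lnp:
            sd = statistics.stdev(lnp[k])
            if sd > 0:
                out[k] = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1]) / sd
    return out
```

ΔK is undefined for the smallest and largest K tested, because it needs the likelihoods of both neighbouring values of K.
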

Pipeline specifications

Software tools: Arlequin, GenAlEx
Application: Population genetic analysis
Organisms: Homo sapiens