Computational protocol: A fast least-squares algorithm for population inference

Protocol publication

[…] The algorithms we discuss accept the number of populations, K, and an M × N genotype matrix, G, as input:

\[
G = \begin{bmatrix}
g_{11} & g_{12} & \cdots & g_{1N} \\
g_{21} & g_{22} & \cdots & g_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
g_{M1} & g_{M2} & \cdots & g_{MN}
\end{bmatrix} \tag{3}
\]

where $g_{li} \in \{0, 1, 2\}$ is the number of copies of the reference allele at the lth locus for the ith individual, M is the number of markers (loci), and N is the number of individuals. Given the genotype matrix, G, the algorithms attempt to infer the population allele frequencies and the individual admixture proportions. The matrix P contains the population allele frequencies:

\[
P = \begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1K} \\
p_{21} & p_{22} & \cdots & p_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
p_{M1} & p_{M2} & \cdots & p_{MK}
\end{bmatrix} \tag{4}
\]

where $0 \le p_{lk} \le 1$ is the fraction of reference alleles out of all alleles at the lth locus in the kth population. The matrix Q contains the individual admixture proportions:

\[
Q = \begin{bmatrix}
q_{11} & q_{12} & \cdots & q_{1N} \\
q_{21} & q_{22} & \cdots & q_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
q_{K1} & q_{K2} & \cdots & q_{KN}
\end{bmatrix} \tag{5}
\]

where $0 \le q_{ki} \le 1$ is the fraction of the ith individual's genome originating from the kth population and, for all i, $\sum_{k} q_{ki} = 1$. Table summarizes the matrix notation we use. [...]

We generate simulated genotype data for a variety of problems using M = 10000 markers and varying N among 100, 1000, and 10000; K among 2, 3, and 4; and α among 0.1, 0.5, 1, and 2, for a total of 36 parameter sets. For each combination of N, K, and α, we generate the ground truth P from a uniform distribution and Q from a Dirichlet distribution parameterized by α. Then, we draw a random genotype for each individual using the binomial distribution in Equation 11. We estimate P and Q using only the genotype information and the true number of populations, K. We repeat the experiment 50 times, drawing new P, Q, and G matrices each time.
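The simulation setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `simulate_admixture` is hypothetical, the Dirichlet distribution is assumed to be symmetric with every concentration parameter equal to α, and Equation 11 (not reproduced in this excerpt) is assumed to be the standard admixture sampling model in which each genotype is binomial with 2 trials and success probability equal to the corresponding entry of the product PQ. The example uses a reduced M so that it runs quickly.

```python
from itertools import product

import numpy as np


def simulate_admixture(M, N, K, alpha, rng):
    """Draw ground-truth P (M x K), Q (K x N), and genotypes G (M x N).

    Hypothetical helper; the sampling model is an assumption (see lead-in).
    """
    # Population allele frequencies: independent uniform draws on [0, 1).
    P = rng.uniform(0.0, 1.0, size=(M, K))
    # Admixture proportions: one symmetric Dirichlet draw per individual,
    # transposed so that each column of Q sums to 1.
    Q = rng.dirichlet(np.full(K, alpha), size=N).T
    # Assumed model: g_li ~ Binomial(2, (PQ)_li). Clipping only guards
    # against floating-point round-off outside [0, 1].
    F = np.clip(P @ Q, 0.0, 1.0)
    G = rng.binomial(2, F)
    return P, Q, G


# The 3 x 3 x 4 = 36 parameter sets (N, K, alpha) listed in the text.
grid = list(product([100, 1000, 10000], [2, 3, 4], [0.1, 0.5, 1, 2]))

rng = np.random.default_rng(0)
# Reduced M for a quick illustration; the text uses M = 10000.
P, Q, G = simulate_admixture(M=1000, N=100, K=3, alpha=0.5, rng=rng)
```

Looping `simulate_admixture` over `grid` with 50 fresh draws per entry would mirror the experimental design described in the text.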
Finally, we record the performance of Admixture using the published tight convergence threshold of ε = 10⁻⁴ and a loose convergence threshold of ε = MN × 10⁻⁴; the least-squares algorithm using an uninformative prior (α = 1) and ε = MN × 10⁻⁴; and the FRAPPE EM algorithm using the published threshold of ε = 1. For reference, we also include the least-squares algorithm with an informative prior (known α) and a convergence threshold of ε = MN × 10⁻⁴. In all experiments, Admixture's performance with the two convergence thresholds was nearly identical, so we report only the results for ε = MN × 10⁻⁴, which requires less computation time. We used a four-way analysis of variance (ANOVA) with a fixed-effects model to determine which factors (including the choice of algorithm) contribute more or less to the estimation error and computation time. […]
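To make the scale of the two thresholds concrete, the sketch below evaluates a binomial log-likelihood and a simple stopping test. It rests on two assumptions that this excerpt does not state: that each ε bounds the change in log-likelihood between successive iterations, and that the likelihood is the binomial model with 2 trials and success probability given by the entries of PQ. The helper names `binomial_log_likelihood` and `has_converged` are hypothetical.

```python
import numpy as np


def binomial_log_likelihood(G, P, Q, tiny=1e-12):
    """Log-likelihood of G under the assumed model g_li ~ Binomial(2, (PQ)_li).

    The constant binomial-coefficient term is dropped because it does not
    depend on P or Q.
    """
    # Keep probabilities strictly inside (0, 1) so the logarithms are finite.
    F = np.clip(P @ Q, tiny, 1.0 - tiny)
    return float(np.sum(G * np.log(F) + (2 - G) * np.log1p(-F)))


def has_converged(ll_new, ll_old, eps):
    """Assumed stopping rule: the log-likelihood changed by less than eps."""
    return abs(ll_new - ll_old) < eps


M, N = 10000, 100
tight_eps = 1e-4          # published tight threshold
loose_eps = M * N * 1e-4  # loose threshold; scales with the number of genotypes
```

With M = 10000 and N = 100 the loose threshold is six orders of magnitude larger than the tight one, so under this assumed rule an iteration would stop far earlier.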

Pipeline specifications

Software tools: ADMIXTURE, frappe
Application: Population genetic analysis