Computational protocol: Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis

Protocol publication

[…] We used the POPRES European data set from , and processed the data as in . The POPRES data set was obtained from dbGaP at bin/study.cgi?study_id=phs000145.v1.p1 through dbGaP accession number phs000145.v1.p1. These data included individuals, each of whom identified all four grandparents as being from a particular European country, genotyped at SNPs, and pruned down to SNPs after removing one of any pair of SNPs that had an .

Since our SFA method does not currently handle missing data, we imputed missing genotypes using impute2 . We imputed each chromosome by intervals of Mb, starting at position , with a buffer of size Mb on either side of the interval. We set the number of burn-in iterations to and the number of MCMC iterations to . We set the effective population size of the European sample to be , and we used the combined linkage maps from build , release (downloaded from the impute website). We used these imputed genotypes as input to all three methods to facilitate fair comparisons. [...]

For smaller data sets (all but the European and Indian data), we computed principal components by first standardizing the columns of the matrix (subtracting their mean and dividing by their standard deviation) and then finding the eigenvectors of the covariance matrix of the individuals in R using the function eigen. In our terminology, these eigenvectors, or principal components (PCs), are the loadings, i.e., the columns of . For larger data sets, we identified the PCs using the SmartPCA software from the EigenSoft v package , . For both the European genotype data and the Indian genotype data, we set the number of output vectors to , used the default normalization style, did not identify outliers, had no missing data, and removed all chromosome data. [...]
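
The small-data PCA step described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' R code: `numpy.linalg.eigh` stands in for R's `eigen`, the function name `pca_individuals` is hypothetical, and two details are assumptions the text does not specify: the individual-by-individual covariance is taken as the standardized matrix times its transpose divided by the number of SNPs, and SNP columns with zero variance are dropped before standardizing.

```python
import numpy as np


def pca_individuals(genotypes, n_components=2):
    """Return the top eigenvalues and eigenvectors of the individual covariance.

    genotypes: array of shape (n_individuals, n_snps), with no missing values.
    """
    g = np.asarray(genotypes, dtype=float)

    # Standardize each SNP column: subtract its mean, divide by its
    # sample standard deviation.
    mean = g.mean(axis=0)
    sd = g.std(axis=0, ddof=1)
    keep = sd > 0  # assumption: drop monomorphic SNPs to avoid division by zero
    z = (g[:, keep] - mean[keep]) / sd[keep]

    # n x n covariance between individuals (assumed form, see lead-in).
    cov = z @ z.T / z.shape[1]

    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so reorder to put the largest first.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]
```

Each returned column has one entry per individual, so plotting the first column against the second gives the usual two-dimensional PCA view of the sample.
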
We ran admixture v with multiple random starting points using the -s option. We mapped the four-dimensional admixture proportions into two dimensions for visualization as follows: the four-dimensional vector maps to the two-dimensional vector . […]

Pipeline specifications

Databases: dbGaP
Applications: Population genetic analysis, GWAS