Computational protocol: Reconstructing the Population Genetic History of the Caribbean

Protocol publication

[…] Generated data and assembled datasets for this study are summarized in . A total of 251 individuals representing six different Caribbean-descent populations were recruited in South Florida, USA. Participants were required to have at least three grandparents from their countries of origin, so limited ethnographic and anonymous pedigree information was collected. The majority of pedigrees (94.3%, n = 82) had four grandparents from the same country; only 5 pedigrees (5.7%) had one grandparent from a different country. Informed consent was obtained from all participants under approval by the University of Miami Institutional Review Board (study no. 20081175).

In total, 76 trios, 2 duos, and 19 parents were genotyped on Affymetrix 6.0 SNP arrays: 80 Cubans, 85 Colombians, 34 Dominicans, 27 Puerto Ricans, 19 Hondurans, and 6 Haitians. Genotype data will be made available through dbGaP under the Genomic Origins and Admixture of Latinos (GOAL) study. Out of 173 founders, 18 samples were excluded from structure analyses due to cryptic relatedness, as inferred by IBD > 10%. Four trios were not considered for trio phasing due to an excess of Mendelian errors (>100 K), two trios were removed due to 3rd or higher degree of relatedness between parents as inferred by IBD, and five trios were removed due to cryptic relatedness above 10% IBD between members of different trios. After filtering, 65 complete trios remained for haplotype-based analyses.

To study population structure and demographic patterns involving relevant ancestral populations, 79 previously collected samples from three native Venezuelan tribes were genotyped using the same array (25 Yukpa [aka Yucpa], 29 Bari, and 25 Warao). We combined our data with publicly available genomic resources and assembled a global database of genome-wide SNP array data for 3,042 individuals, from which two datasets with different SNP densities were constructed (see ).
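The sample counts quoted above are mutually consistent, which a short tally can confirm. The sketch below is only a bookkeeping check, not part of the original pipeline. It assumes a trio is two parents plus one child and a duo is one parent plus one child, which the text does not state explicitly. The count of founders retained for structure analyses is derived here rather than quoted from the source.

```python
# Counts quoted in the protocol text.
TRIOS, DUOS, UNPAIRED_PARENTS = 76, 2, 19
BY_POPULATION = {
    "Cuban": 80, "Colombian": 85, "Dominican": 34,
    "Puerto Rican": 27, "Honduran": 19, "Haitian": 6,
}

# Assumption: trio = 2 parents + 1 child, duo = 1 parent + 1 child.
individuals = 3 * TRIOS + 2 * DUOS + UNPAIRED_PARENTS   # 251
founders = 2 * TRIOS + 1 * DUOS + UNPAIRED_PARENTS      # 173

# Trio filters: Mendelian errors (4), related parents (2), cross-trio relatedness (5).
trios_after_qc = TRIOS - 4 - 2 - 5                      # 65

# Derived value (not quoted in the text): founders kept for structure analyses.
founders_for_structure = founders - 18                  # 155

# Pedigree percentages: 82 with four same-country grandparents, 5 with three.
pedigrees = 82 + 5
pct_same_country = round(100 * 82 / pedigrees, 1)       # 94.3

print(individuals, sum(BY_POPULATION.values()), founders, trios_after_qc)
```

Both routes to the total (family structure and per-population counts) give 251 individuals, and the three trio filters account exactly for the drop from 76 to 65 trios.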
The high-density dataset included populations with available SNP data from Affymetrix arrays, namely African, European, and Mexican HapMap samples, Europeans from POPRES, West Africans from Bryc et al., and Native Americans from Mao et al. After merging and quality control filtering, 389,225 SNPs remained, and representative population subsets were used in different analyses as detailed in the sections below. Our lower-density dataset (30,860 SNPs) resulted from the intersection of our high-density dataset with available SNP data generated on Illumina platform arrays, including 52 additional Native American populations, as well as additional Latino populations sampled in New York City and 1000 Genomes Latino samples. The resulting dataset combines genomic data for 1,262 individuals from 80 populations. Full details on the population samples are available in .

[...] An unsupervised clustering algorithm, ADMIXTURE, was run on our high-density dataset to explore global patterns of population structure among a representative subset of 641 samples, including seven Native American, eleven POPRES European, HapMap3 Nigerian Yoruba, HapMap3 Mexican, and our six new Caribbean Latino populations (see ). Fourteen values of K (2 through 15) were tested successively. Log likelihoods and cross-validation errors for each value of K are available in . FST based on allele frequencies was calculated in ADMIXTURE v1.22 for each cluster identified at K = 8, and values are available in . Our low-density dataset comprising 1,262 samples (detailed in ) was used to run K = 2 through 20. Log likelihoods, cross-validation errors, and FST values from ADMIXTURE are available in and . Principal component analysis (PCA) was applied to both datasets using EIGENSOFT 4.2, and plots were generated using R 2.15.1.
Sex bias in ancestry contributions was evaluated by selecting only females (to ensure that a diploid X chromosome is compared to diploid autosomes) and running ADMIXTURE at K = 3 on the X chromosome and the autosomes separately. The Wilcoxon signed rank test, a non-parametric analogue of the paired Student's t-test that does not require the normality assumption, was applied to assess the significance of the difference between X and autosomal ancestry proportions. This tests whether the average difference between the ancestry proportions assigned to a given source population for the X and for the autosomes of each sample is significantly different from zero. The test was applied first to the entire collection of Latino samples, revealing an over-arching trend, and then to each population in turn to identify any between-population differences. A rejection of the null hypothesis means that the ancestry proportions on the X and the autosomes differ significantly, but it does not indicate which proportion is larger. We provide box plots as a visual aid to show the direction of the difference (). Global ancestry estimates from ADMIXTURE at K = 3 were used to test the correlation between male and female ancestry proportions, considering all trio founders within each Caribbean population as well as within the full set of admixed trios. Linear models and permutations (up to 100,000) were performed using R 2.15.1.

[...] Family trio genotypes from our six Caribbean populations and continental reference samples were phased using BEAGLE 3.0. Local ancestry assignment was performed using PCAdmix (http://sites.google.com/site/pcadmix/) at K = 3 ancestral groups. This approach relies on phased data from both the reference panels and the admixed individuals. To maintain SNP density and maximize phasing accuracy, we restricted the reference panel to samples with available Affymetrix 6.0 trio data, namely 10 YRI and 10 CEU HapMap3 trios, and 10 Native American trios from Mexico.
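The paired X-versus-autosome comparison described in this section can be illustrated with a minimal signed-rank sketch. The source does not say which implementation was used or whether exact p-values were computed, so the normal approximation here is an assumption (no continuity or tie correction). The function name and the input values are hypothetical and for illustration only.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using a normal approximation.

    x, y: paired measurements (e.g. X-chromosome and autosomal ancestry
    proportions for the same individuals). Zero differences are dropped
    and tied absolute differences receive average ranks.
    Returns (W_plus, p_value), where W_plus is the rank sum of positive differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return w_plus, p_value

# Hypothetical ancestry proportions for six females (illustrative values only).
x_chrom = [0.31, 0.28, 0.40, 0.22, 0.35, 0.30]
autosomes = [0.25, 0.27, 0.33, 0.24, 0.29, 0.21]
print(wilcoxon_signed_rank(x_chrom, autosomes))
```

As the text notes, a small p-value says only that the two sets of proportions differ. The sign of the differences (or a box plot) is still needed to see which is larger.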
Each chromosome is analyzed independently, and local ancestry assignment is based on loadings from Principal Components Analysis of the three putative ancestral population panels. The scores from the first two PCs were calculated in windows of 70 SNPs for each panel individual. In previous work we estimated that about 10,000 windows is a suitable number into which to break the genome when inferring local ancestry with PCAdmix; here, 743,735 SNPs remained after merging Affymetrix 6.0 data from the admixed and reference panels, and 743,735/10,000 gives a window length of ∼70 SNPs. For each window, the distribution of individual scores within a population is modeled by fitting a multivariate normal distribution. Given an admixed chromosome, these distributions are used to compute likelihoods of belonging to each panel. These scores are then analyzed in a Hidden Markov Model with transition probabilities as in Bryc et al. The g (generations) parameter in the HMM transition model was determined iteratively so as to maximize the total likelihood of each analyzed population. Local ancestry assignments were determined for each window using the forward-backward algorithm with a 0.9 posterior probability threshold. In analyses that required estimating the length of continuous ancestry tracts, the Viterbi algorithm was used. An assessment of the accuracy of this approach is given in .

[...] To measure the observed deviation in ASPCA of European haplotypes derived from admixed Caribbean populations with respect to the cluster of Iberian samples, a bootstrap resampling-based test was performed. The null distribution was generated by comparing bootstraps of Portuguese and Spanish ASPCA values as models of the intrinsic Iberian population structure. We then compared the ASPCA values of the admixed individuals and tested whether the observed differences between Iberian ASPCA values and those of the admixed individuals are more extreme than the differences within Iberia.
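The final step of this resampling test reduces to two small computations: combining the per-component p-values (ASPC1 and ASPC2) into a Fisher chi-squared statistic, and locating the observed statistic in the bootstrap null distribution. The sketch below covers only that step. It assumes the t-test p-values and the null statistics have already been computed elsewhere, and it uses a plain proportion for the one-tailed p-value, which the source does not specify. Function names and numbers are hypothetical.

```python
import math

def fisher_chi2(p_values):
    """Fisher's method statistic: -2 * sum(ln p_i) over independent tests."""
    return -2.0 * sum(math.log(p) for p in p_values)

def one_tailed_p(observed, null_stats):
    """Fraction of null statistics at least as large as the observed one."""
    hits = sum(1 for s in null_stats if s >= observed)
    return hits / len(null_stats)

# Hypothetical p-values from the ASPC1 and ASPC2 t-tests (illustrative only).
observed_stat = fisher_chi2([0.004, 0.03])

# Hypothetical null: Fisher statistics from within-Iberia bootstrap comparisons.
# The source uses 10,000 bootstraps; five placeholder values are shown here.
null_stats = [fisher_chi2(pair) for pair in
              [(0.6, 0.4), (0.2, 0.9), (0.05, 0.5), (0.7, 0.7), (0.001, 0.02)]]

print(observed_stat, one_tailed_p(observed_stat, null_stats))
```

Larger values of the statistic indicate stronger combined evidence of a difference, so the one-tailed p-value counts null replicates at or above the observed value.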
The distance was determined using the chi-squared statistic of Fisher's method, combining the ASPC1 and ASPC2 t-tests for each bootstrap. We ran 10,000 bootstraps to determine one-tailed p-values. As Iberians we considered POPRES Spanish, POPRES Portuguese, Andalusians, and Galicians; as Caribbean Latinos we considered CUB, PUR, DOM, COL, and HON. Additional tests compared Portuguese versus the rest of the Iberians, and tested an independent dataset of Mexican individuals analyzed by Moreno-Estrada, Gignoux et al. (in preparation), projected onto ASPCA space using the same reference panel of European populations. A bivariate test was performed to measure the relative deviation from the Iberian cluster of the distribution given by the Caribbean versus the Mexican dataset. To determine whether insular versus mainland Caribbean populations disperse over significantly different ranges in ASPC2, a Wilcoxon rank test was performed between (COL+HON) and (CUB, PUR, DOM). Haitians were excluded due to low sample size (N = 2 haplotypes). A boxplot is available in .

Population differentiation estimates between clusters inferred with ADMIXTURE were visualized and compared across runs in which both the Latino-specific and southern European components were detected. Values are available in and .

To provide independent evidence on the sub-continental ancestry of European haplotypes, we considered segments that are identical by descent (IBD) between unrelated Latino individuals and a representative subset of European populations. We used our high-density dataset to extract a subset of 203 POPRES European individuals and the founders of the 65 complete admixed trios. We first performed a genome-wide pairwise IBS estimation using PLINK to ensure that the dataset contained no samples with more than 10% IBS with any other sample.
We then used fastIBD to phase the data and to estimate segments shared IBD, retaining only segments longer than 2 Mb to eliminate false-positive IBD matches, on the assumption that ancestry is shared between pairs of individuals with IBD segments this long. All segments of 2 Mb or greater shared IBD between pairs of individuals were summed, and histograms were created for pairwise matches within each group (i.e., POPRES Europeans, Iberians, and Caribbean Latinos). To quantify the proportion of shared DNA between pairs of populations, we calculated a summed pairwise IBD statistic, which is the sum of the lengths of all segments inferred to be shared IBD between a given European population and each Latino population, normalized by sample size. […]
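A minimal sketch of the summed pairwise IBD statistic follows. The text says only that the sum is "normalized by sample size", so dividing by the number of possible cross-population pairs (the product of the two sample sizes) is an assumption made here for illustration. The input layout, function name, and sample identifiers are hypothetical and do not reflect the fastIBD output format.

```python
from collections import defaultdict

MIN_SEGMENT_MB = 2.0  # length threshold stated in the text

def summed_pairwise_ibd(segments, pop_of, pop_sizes):
    """Sum IBD segment lengths between populations, normalized by pair count.

    segments:  iterable of (sample_a, sample_b, length_mb) tuples
    pop_of:    dict mapping sample id -> population label
    pop_sizes: dict mapping population label -> number of samples
    Returns {(pop_1, pop_2): normalized_total_mb} with labels sorted.
    """
    totals = defaultdict(float)
    for sample_a, sample_b, length_mb in segments:
        if length_mb < MIN_SEGMENT_MB:
            continue  # drop short segments, which are prone to false positives
        pop_a, pop_b = pop_of[sample_a], pop_of[sample_b]
        if pop_a == pop_b:
            continue  # only between-population sharing is summed here
        totals[tuple(sorted((pop_a, pop_b)))] += length_mb

    # Assumption: normalize by the number of possible cross-population pairs.
    return {
        pair: total / (pop_sizes[pair[0]] * pop_sizes[pair[1]])
        for pair, total in totals.items()
    }

# Hypothetical toy input (illustrative values only).
pop_of = {"e1": "ESP", "e2": "ESP", "c1": "CUB", "c2": "CUB"}
segments = [("e1", "c1", 3.0), ("e1", "c2", 1.5), ("e2", "c1", 5.0), ("e1", "e2", 4.0)]
print(summed_pairwise_ibd(segments, pop_of, {"ESP": 2, "CUB": 2}))
```

Normalizing by pair count keeps the statistic comparable across populations with very different sample sizes, such as the 6 Haitians versus the 85 Colombians.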

Pipeline specifications

Software tools: ADMIXTURE, EIGENSOFT, BEAGLE, PCAdmix, PLINK
Databases: dbGaP
Applications: Population genetic analysis, GWAS
Organisms: Homo sapiens