Computational protocol: Unravelling the Genetic History of Negritos and Indigenous Populations of Southeast Asia

Protocol publication

[…] Quality controls were applied to the data from each OA community separately to exclude problematic samples and single nucleotide polymorphisms (SNPs). SNPs that failed the Hardy–Weinberg exact (HWE) test (P < 10⁻⁶) and SNPs with missing rates >0.05 across all samples in each population were removed. Additionally, samples with a call rate <0.99 were excluded. Gender concordance was examined using PLINK v1.07, and samples whose genotype-inferred sex disagreed with the questionnaire-reported sex were excluded. To avoid analysing close relatives, relatedness was measured between all pairs of individuals within each population using the identity-by-descent estimate PI_HAT in PLINK v1.07. An upper cut-off threshold of 0.375 was set to exclude first-degree relatedness within each population. Finally, a principal component analysis (PCA) using EIGENSOFT v3.0 was performed to remove outliers from each population across the first ten eigenvectors. In the final stage, all OA populations were merged into one data set and pruned of SNPs that failed the HWE test (P < 10⁻⁶) and SNPs with missing rates above 0.05 across all samples.

The OA genotype data were merged with data from the Human Genome Diversity Project (HGDP), 89 Malay individuals from the Singapore Genome Variation Project (SGVP), and Onge and Jarawa Negritos from the Andaman Islands genotyped using Illumina Human 1.2M (SNP population data courtesy of P. Majumder and A. Basu). After merging the data sets (supplementary table S1, Supplementary Material online), a total of 291,096 overlapping autosomal SNPs remained for downstream analysis. […]

PCA was used to identify population structure across indigenous Malaysians. It was performed with EIGENSOFT v3.0 on the genotype data of the OA combined with Andamanese Negritos, Oceanians, South and East Asian populations in the HGDP, and Malays from the SGVP.
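As an illustration of the relatedness filter described in the quality-control step, the following Python sketch selects samples for exclusion from a list of pairwise PI_HAT values. The function name `drop_close_relatives`, the input format, and the greedy rule for choosing which member of a related pair to drop are assumptions made for illustration; the source states only the 0.375 cut-off. Treating a pair exactly at the cut-off as unrelated is likewise an assumption.

```python
from collections import defaultdict

PI_HAT_CUTOFF = 0.375  # upper cut-off stated in the protocol


def drop_close_relatives(pairs, cutoff=PI_HAT_CUTOFF):
    """Return the set of sample IDs to exclude.

    pairs: iterable of (sample_a, sample_b, pi_hat) tuples for one population.
    Assumption: samples are removed greedily, most-connected first, until no
    remaining pair exceeds the cut-off. The source does not describe this rule.
    """
    related = defaultdict(set)
    for a, b, pi_hat in pairs:
        if pi_hat > cutoff:
            related[a].add(b)
            related[b].add(a)

    removed = set()
    while True:
        # Count, for each kept sample, how many of its relatives are still kept.
        degree = {
            s: len(partners - removed)
            for s, partners in related.items()
            if s not in removed
        }
        degree = {s: d for s, d in degree.items() if d > 0}
        if not degree:
            break
        # Drop the sample with the most remaining relatives; ties go to the
        # smallest ID so the result is deterministic.
        worst = max(sorted(degree), key=lambda s: degree[s])
        removed.add(worst)
    return removed
```

Removing the most-connected sample first tends to keep more individuals than dropping one member of every flagged pair, which matters for small populations.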
To balance sample sizes across our populations, 30 Malay individuals were randomly sampled from the SGVP data set (which contains 89 individuals). SNPs with r² > 0.5 were pruned out to avoid the effects of excessive LD between SNPs. After this pruning, a total of 204,426 SNPs remained for analysis. Pairwise Fst distances between populations in the same data set were calculated using EIGENSOFT v3.0, and a Neighbor-net tree was constructed with SplitsTree v4. ADMIXTURE v1.22, a clustering algorithm, was run on the pruned SNPs to estimate ancestral population clustering.

PLINK v1.07 was used to estimate ROH in selected populations. PLINK slides a window of 5,000 kb (50 SNPs) across the genome and allows 1 heterozygous and 5 missing calls in each window. To minimize the effects of LD on ROH, the minimum ROH length was set to 500 kb, because it is unusual for LD to extend beyond 500 kb. LD decay for each population was calculated as r² using PLINK. Pairwise LD between all possible SNP pairs was calculated, and mean LD was measured in bins of 5 kb.

TreeMix v1.12 was used to explore population relationships and migration events. The same data set described above was used to estimate the maximum likelihood tree with Yoruba as the outgroup. We used blocks of 200 SNPs (-k 200) to account for LD, and migration edges were added sequentially until the model explained 99% of the variance. We estimated D statistics using ADMIXTOOLS to examine gene flow between OAs and surrounding populations.

Divergence time between OA and EA was estimated using 399,971 SNPs shared between our data and HapMap 3. Effective population size (Ne) and divergence time between OAs and the Yoruba in Ibadan (YRI), Han Chinese in Beijing (CHB), and Japanese in Tokyo (JPT) samples were estimated according to a previously suggested method. To estimate LD, pairwise LD was calculated as r² using PLINK v1.07.
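The binning step of the LD decay analysis (mean r² per 5 kb distance bin) can be sketched in Python as follows. The function name `mean_r2_by_distance_bin` and the input format are assumptions made for illustration. The pairwise r² values are taken as already computed, for example by PLINK, and the source does not describe how they were parsed or binned.

```python
from collections import defaultdict

BIN_SIZE_BP = 5_000  # 5 kb bins, as stated in the protocol


def mean_r2_by_distance_bin(pairs, bin_size=BIN_SIZE_BP):
    """Average r2 over SNP pairs grouped by physical distance.

    pairs: iterable of (pos_a_bp, pos_b_bp, r2) for SNP pairs on one chromosome.
    Returns a dict mapping the lower edge of each distance bin (in bp) to the
    mean r2 of the pairs that fall into that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos_a, pos_b, r2 in pairs:
        bin_index = abs(pos_b - pos_a) // bin_size
        sums[bin_index] += r2
        counts[bin_index] += 1
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}
```

Plotting the returned means against the bin edges for each population gives the LD decay curve. Bins that contain no SNP pairs are simply absent from the result.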
To minimize the effects of small sample size, all individuals were pooled together in their respective OA groups. Admixture time between OAs and EA was estimated with the rolloff package using the 399,971 SNPs shared between HapMap 3 and the OAs. […]
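For readers unfamiliar with the D statistics used above to test for gene flow, the following Python sketch computes one commonly used allele-frequency form of D for four populations. It is an illustration under stated assumptions, not the ADMIXTOOLS implementation: the function name `d_statistic`, the input format, and the population ordering are hypothetical, and the source does not say which populations occupied which position. Significance testing, which ADMIXTOOLS handles with a block jackknife, is not shown.

```python
def d_statistic(freqs):
    """Allele-frequency form of the D statistic for populations P1..P4.

    freqs: iterable of (p1, p2, p3, p4) tuples, one per SNP, giving the
    frequency of the same reference allele in each of the four populations.
    Assumption: D = sum((p1 - p2) * (p3 - p4)) divided by
    sum((p1 + p2 - 2*p1*p2) * (p3 + p4 - 2*p3*p4)).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3, p4 in freqs:
        numerator += (p1 - p2) * (p3 - p4)
        denominator += (p1 + p2 - 2 * p1 * p2) * (p3 + p4 - 2 * p3 * p4)
    if denominator == 0:
        raise ValueError("no informative SNPs: denominator is zero")
    return numerator / denominator
```

Under this form, a value near zero is what a tree without gene flow between the two population pairs would predict, while a consistent excess in either direction points to shared drift between one member of each pair. The sign depends entirely on the ordering chosen.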

Pipeline specifications

Software tools: PLINK, EIGENSOFT, SplitsTree, ADMIXTURE, TreeMix, ADMIXTOOLS
Databases: HGDP
Applications: Phylogenetics, Population genetic analysis, GWAS
Organisms: Homo sapiens
Chemicals: Norepinephrine, Nucleotides