Computational protocol: Genomic Runs of Homozygosity Record Population History and Consanguinity

Similar protocols

Protocol publication

[…] The HGDP dataset includes populations from all continents with diverse demographic and cultural histories, from small isolated Amazonian groups to the Han Chinese. A number of populations have well documented preferences for consanguineous marriage : Bedouin , Druze , Palestinians , Mozabite Berbers , Brahui , Baluchi , Makrani , Sindhi and Pathan . In some cases, population sample sizes are low, such that estimates inevitably have wide confidence intervals; however we have addressed this by analysing data at the continental as well as population level. The populations were grouped into seven distinct ‘continental’ groups (Europe, sub-Saharan Africa, America, Oceania, East Asia, Central/South Asia and West Asia). The HGDP dataset includes several individuals that are first or second degree relatives of others in the dataset. In this analysis all of the 1043 individuals were used as significant correlation in homozygosity is only expected for sibs, and even this is very variable .Genotypes were available for 660,918 single nucleotide polymorphism markers (SNPs) from the Illumina 650Y product , . Only SNPs on the 22 autosomes were included in this analysis (in total 644,258). SNPs located in the centromere regions were also removed, along with those with more than 10% missing genotypes. In addition, we excluded SNPs that failed an exact test of Hardy-Weinberg equilibrium in each population separately. The number of SNPs with p<0.0001 varied from 0 – 85, depending on the population; in total 735 SNPs were removed.Ascertainment bias is a consequence of the small number of individuals in the SNP discovery panel and the bias in their geographic origins, so that the probability that a SNP is discovered is dependent on the allele frequency . The Illumina arrays used in this study mostly contain tag SNPs derived from the HapMap populations: Yoruba from Nigeria, Han Chinese, Japanese and European Americans. SNPs represented in the Illumina array will therefore only represent common variants in those four populations and not SNPs specific to other regions of the world, or which are rarer in the ascertainment populations and more common in other places. The populations that are most distant genetically to the four HapMap populations will be most affected by this ascertainment bias, i.e. those from the Americas and Oceania, and indeed considerable numbers of SNPs were not polymorphic in these populations. This could inflate estimates of ROH length and frequency because a proportion of apparent ROH may in fact consist of consecutive non-polymorphic SNPs, which provide no information on identity by descent. To correct for this ascertainment bias, SNPs with frequencies less that 1% in any of the 7 continental regions were also removed, leaving a consensus panel of 415,130 SNPs to be used in the final analysis. Using different minor allele frequency thresholds up to 20% changed the absolute values of ROH statistics, but had little effect on the relative order of individuals or populations. As all ROH, long and short, ultimately arise through shared parental ancestry, recently or thousands of years ago, we did not account for linkage disequilibrium (LD) in the analyses. It is, however, unusual for LD to extend over 500 kb, the minimum ROH length considered here.Runs of homozygosity were identified in PLINK , which takes a window of 5000 kb (50 SNPs) and slides it across the genome, determining homozygosity at each window. It allows for 1 heterozygous call per window and 5 missing calls. For each SNP, the proportion of homozygous windows that overlap that position is calculated and based on the threshold value of 0.05 (average for the SNPs in a segment), segments are called homozygous or not. The minimum length of a run of homozygosity was set to be 500 kb. This length was chosen as it approaches the limit of resolution given the SNP density of the array and also avoids counting short ROH where LD complicates matters. To ensure that locally low SNP density does not increase the length of an ROH, a required minimum density was set at 50 kb/SNP and the maximum gap between two consecutive SNPs in a tract was set at 100 kb.Minimum SNP coverage in this dataset is 6.5 kb/SNP, so on average the minimum number of SNPs in the run will be 77. However, the density of SNPs is not consistent throughout the genome, and it is therefore important to ensure that there are enough SNPs to constitute a true ROH or homozygosity-by-descent. The minimum number of SNPs in a homozygous tract was set at 50 to make sure that the unobserved markers between the first and the last SNP in the run are also very likely to be homozygous.Calculations were performed using SPSS and R software. A mixed linear model was used to check whether the values of the median, the 75th percentile and the number of ROH differed between the continents. The median and the 75th percentile values of the ROH per individual were used to check for the differences in length of ROH. The 75th percentile was used to test for the differences between longer ROH. The number of ROH was logarithmically transformed as it was not normally distributed. Due to the fact that the data are unbalanced (each continent contains different number of populations, and each population contains different number of individuals), restricted maximum likelihood (REML) estimation of variance components was used (implemented in R). In the model, continents were treated as a fixed effect and because each continent was subdivided into several populations, populations and individuals within the populations were treated as random effects. Chi squared tests were performed between each continent to determine whether there are significant differences between the proportions of individuals with ROH of different lengths (2–4 Mb, 4–8 Mb, 8–16 Mb, >16 Mb). Shorter length categories were not tested because all of the individuals from all populations have ROH shorter than this. Bonferroni correction for multiple testing (89 tests) was applied, and a significance threshold was set at p = 0.00056. Individuals from West Asia (N = 176) and Central/South Asia (N = 207) were grouped together and tested against the combined group of Europe (N = 160) and East Asia (N = 235). The Mann-Whitney U test was used to test differences between the total mean lengths of ROH of the individuals between continental regions. Differences were tested for the mean length for each individual and four different cut-off lengths were used (>1.5 Mb, >3 Mb, >5 Mb, >10 Mb). Bonferroni correction for multiple testing (84 tests) was applied, and significance threshold was set at p = 0.0006. Correlations between ROH measures and pedigree inbreeding (Fped) were calculated for 17 different populations for which pedigree-based mean inbreeding coefficients were available . Where several Fped estimates were available for one population, the mean of those estimates was used. For each individual, the FROH measure of homozygosity was calculated . FROH was calculated by dividing the sum of ROH per individual by the total length covered by SNPs, excluding the centromeres (2,682.410 Mb).As the rate of decay of LD varies across the genome, a threshold based on physical distance might not be appropriate in regions where haplotype blocks are unusually long or unusually short in outbred populations. To allow for this, we used the genetic map distance estimates from HapMap (which are based on modelling haplotype structure as a mosaic of HapMap source haplotypes) to set different thresholds for defining ROH. Using thresholds from 0.04 to 5 cM, we recovered a perfectly correlated dataset. For example the proportion of ROH >4 Mb in length was correlated (r = 0.9993, p<0.0001) with the proportion of ROH identified using a 5 cM threshold, while the sum of ROH >0.5 Mb was also perfectly correlated (r = 0.9999, p<0.0001) with the sum of ROH called using a 0.04 cM threshold. Walking distances from Addis Ababa were taken from reference . […]

Pipeline specifications

Software tools PLINK, SPSS
Databases HGDP
Applications Miscellaneous, GWAS
Organisms Homo sapiens