Computational protocol: The Heterogeneous HLA Genetic Makeup of the Swiss Population

Protocol publication

[…] Following the treatment of ambiguities described in the previous section and File S11, allele and haplotype frequencies were estimated both for the whole Swiss registry and for each regional service individually, using an Expectation-Maximization (EM) algorithm that accommodates ambiguous data, as implemented in the Gene[RATE] program package (http://geneva.unige.ch/generate/). Box-and-whisker diagrams were drawn from the estimated frequencies with a GNU/Linux script and the R statistical software (http://www.r-project.org/) to reveal potential outliers among the regional services (i.e. regions with significantly lower or higher frequencies for a given allele or haplotype). Departure from Hardy-Weinberg equilibrium (HWE) expectations was tested using a nested likelihood model, in which the HWE model is a particular case of a more general model that includes one additional parameter, an inbreeding coefficient, accounting for HWE deviations. This approach requires no assumptions about the kind of data (blank alleles, ambiguities, etc.) and is therefore not restricted to HLA data. A version of the classical Ewens-Watterson (EW) test adapted for ambiguous data was used to assess selective neutrality at the HLA loci under study (see also Supplementary information File S5). Global linkage disequilibrium (LD) is a measure of the non-random association of several pairs of alleles between two loci. For multi-allelic loci, this is not the same as the non-random association between an individual pair of alleles at the two loci (often also called linkage disequilibrium, or gametic association between two alleles). Gametic associations between individual pairs of alleles can, however, create significant global LD between the two loci (for a formal discussion, see the references cited in the source publication).
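The nested-likelihood HWE test described above can be illustrated with a minimal Python sketch. It is not the Gene[RATE] implementation: it assumes fully resolved genotype counts (no ambiguities or blank alleles), at least two observed alleles, allele frequencies fixed at their gene-counting estimates under both models, and a simple grid search over the inbreeding coefficient F. The function names and the grid size are illustrative choices.

```python
import math
from collections import Counter


def allele_frequencies(genotype_counts):
    """Gene-counting allele frequencies from {(allele_a, allele_b): n}."""
    counts = Counter()
    for (a, b), n in genotype_counts.items():
        counts[a] += n
        counts[b] += n
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}


def log_likelihood(genotype_counts, freqs, f):
    """Multinomial log-likelihood of the genotype counts for a given F.

    Homozygote ii:   p_i^2 + F * p_i * (1 - p_i)
    Heterozygote ij: 2 * p_i * p_j * (1 - F)
    F = 0 gives back the HWE proportions.
    """
    ll = 0.0
    for (a, b), n in genotype_counts.items():
        if a == b:
            prob = freqs[a] ** 2 + f * freqs[a] * (1 - freqs[a])
        else:
            prob = 2 * freqs[a] * freqs[b] * (1 - f)
        ll += n * math.log(prob)
    return ll


def hwe_likelihood_ratio_test(genotype_counts, grid_steps=2000):
    """Return (F estimate, likelihood-ratio statistic, p-value)."""
    freqs = allele_frequencies(genotype_counts)
    # F must keep every homozygote probability positive.
    lower = max(-p / (1 - p) for p in freqs.values() if p < 1)
    candidates = [lower + (1 - lower) * k / grid_steps
                  for k in range(1, grid_steps)]
    candidates.append(0.0)  # make sure the HWE model itself is evaluated
    f_hat = max(candidates,
                key=lambda f: log_likelihood(genotype_counts, freqs, f))
    ll_alt = log_likelihood(genotype_counts, freqs, f_hat)
    ll_null = log_likelihood(genotype_counts, freqs, 0.0)
    lr = 2 * (ll_alt - ll_null)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2))
    return f_hat, lr, p_value


if __name__ == "__main__":
    excess_homozygotes = {("A", "A"): 40, ("A", "B"): 20, ("B", "B"): 40}
    print(hwe_likelihood_ratio_test(excess_homozygotes))
```

Because the HWE model is the special case F = 0 of the general model, twice the difference in log-likelihood can be compared to a chi-square distribution with one degree of freedom, which is what the last step of the sketch does.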
In this study, global LD was tested with a resampling procedure, as an alternative to testing all possible gametic associations between pairs of alleles at the two loci and correcting for multiple testing. The procedure generates, from the observed ambiguous data, 1,000 random samples in which no individual has an ambiguous genotype. The observed statistic (i.e. the sum of squared differences between the observed two-locus haplotype frequencies and the two-locus haplotype frequencies expected under the null hypothesis of no LD) was compared to the empirical distribution resulting from the resampling procedure and was considered significant if it fell above the 95th percentile. As for the haplotype frequency estimations, only the pairs of loci most commonly described in the literature for registry data were analysed (i.e. Class I pairs, Class I with HLA-DRB1, and HLA-DRB1-DQB1). Gametic association between alleles was assessed using standardized (Pearson) residuals, where a value beyond plus or minus 2 indicates a deviation too large to be compatible with random association, i.e. a significant association. Standardized residuals are computed as the difference between the observed and the expected frequency divided by the square root of the expected frequency (equivalent to the signed square root of a chi-square contribution) and are used to determine which haplotypes contribute most to the rejection (or not) of the null hypothesis of no gametic association.
Different recruitment centers were compared by computing Reynolds' genetic distances based on allelic frequencies for each locus taken independently. Pairwise FST values between regions were tested for significance using a non-parametric resampling procedure. To summarize the results, mean pairwise genetic distances were computed for the 5 loci taken together and plotted using a multidimensional scaling (MDS) analysis.
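As an illustration of the global LD statistic and the standardized residuals described above, the following Python sketch computes both from a table of two-locus haplotype counts. It assumes unambiguous haplotype counts are already available (for example, one of the resampled data sets), and it takes "frequency" in the residual formula to mean absolute counts, so that the threshold of 2 is on the usual Pearson scale. Both points are assumptions of this sketch, and the function name and allele labels are hypothetical.

```python
import math
from collections import Counter


def ld_summary(haplotype_counts):
    """Global LD statistic and Pearson residuals for two loci.

    haplotype_counts maps (allele_at_locus_1, allele_at_locus_2) to a count.
    Returns (statistic, residuals), where statistic is the sum of squared
    differences between observed and expected relative haplotype
    frequencies, and residuals maps each haplotype to
    (observed - expected) / sqrt(expected), computed on counts.
    """
    n = sum(haplotype_counts.values())
    locus1, locus2 = Counter(), Counter()
    for (a, b), count in haplotype_counts.items():
        locus1[a] += count
        locus2[b] += count

    statistic = 0.0
    residuals = {}
    for a in locus1:
        for b in locus2:
            observed = haplotype_counts.get((a, b), 0)
            # Expected count under no LD: product of the allele frequencies.
            expected = locus1[a] * locus2[b] / n
            statistic += (observed / n - expected / n) ** 2
            residuals[(a, b)] = (observed - expected) / math.sqrt(expected)
    return statistic, residuals


if __name__ == "__main__":
    toy = {("A1", "B1"): 40, ("A1", "B2"): 10,
           ("A2", "B1"): 10, ("A2", "B2"): 40}
    stat, res = ld_summary(toy)
    print(f"statistic = {stat:.4f}")
    for hap, r in sorted(res.items()):
        flag = "significant" if abs(r) > 2 else "not significant"
        print(hap, f"{r:+.2f}", flag)
```

In the procedure described in the text, the statistic computed this way would then be compared with its empirical distribution over the 1,000 resampled data sets; that resampling step is not reproduced here.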
Comparisons between geographic, linguistic and genetic distances were done by 2-way and partial 3-way Mantel tests. Geographic distances were computed as the logarithms of arc-distances. Linguistic distances were approximated by arbitrary values: 0 between recruitment regions speaking the same language (French, Italian or German), 1 between French and Italian (both belonging to the Italic branch of the Indo-European phylum) and 2 between either French or Italian and German (German being part of the Germanic branch of the Indo-European phylum). Analysis of Variance (ANOVA) was performed with the Arlequin software to test the significance of the variance components associated with three levels of genetic structure: among recruitment centers (FST), among recruitment centers within predefined geographic or linguistic groups (FSC), and among such predefined groups (FCT). To summarize the results found for the 5 loci taken together, weighted averages were computed for FSC and FCT, and combined probabilities were computed according to Fisher's meta-analysis method. SAMOVA analyses were performed to identify possible genetic boundaries among Swiss regions. […]
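The linguistic distance coding and a simple 2-way Mantel test can be sketched in Python as follows. This is an illustrative permutation test on the Pearson correlation between the off-diagonal entries of two distance matrices, not the software used in the study. The partial 3-way test is not covered, and the function names, the number of permutations and the random seed are assumptions of this sketch.

```python
import math
import random

# Branch of the Indo-European phylum for each language, as in the text.
LANGUAGE_BRANCH = {"French": "Italic", "Italian": "Italic",
                   "German": "Germanic"}


def linguistic_distance(lang_a, lang_b):
    """0 for the same language, 1 within a branch, 2 between branches."""
    if lang_a == lang_b:
        return 0
    if LANGUAGE_BRANCH[lang_a] == LANGUAGE_BRANCH[lang_b]:
        return 1
    return 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Off-diagonal entries after relabelling rows and columns by order."""
    n = len(matrix)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_x, dist_y, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(dist_x)
    identity = list(range(n))
    x = upper_triangle(dist_x, identity)
    r_obs = pearson(x, upper_triangle(dist_y, identity))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of dist_y together
        if pearson(x, upper_triangle(dist_y, order)) >= r_obs - 1e-12:
            hits += 1
    return r_obs, hits / (permutations + 1)


if __name__ == "__main__":
    # Hypothetical regions, used only to exercise the functions.
    languages = ["French", "French", "Italian",
                 "German", "German", "German"]
    ling = [[linguistic_distance(a, b) for b in languages]
            for a in languages]
    genetic = [[0.001 + 0.002 * d if i != j else 0.0
                for j, d in enumerate(row)] for i, row in enumerate(ling)]
    print(mantel_test(genetic, ling))
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is the reason a Mantel test is used here instead of an ordinary correlation test.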

Pipeline specifications

Software tools: ECOMICS, Arlequin, SAMOVA
Application: Population genetic analysis