Computational protocol: A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels

Protocol publication

[…] In the current study, we selected 67 bulls and 24 cows that originated from New Zealand and were all both genotyped and sequenced at high coverage (15x or more) from a larger WGS dataset that was previously used in []. It should be noted that our aim in this study was to assess the phasing accuracy for WGS genotypes called with relatively high confidence, not for low-fold WGS data. Detailed procedures to generate the WGS data, including DNA extraction, sequencing, alignment, quality score recalibration and variant calling, were previously described in [].

This WGS dataset is composed of 36 Holstein–Friesian (six cows and 30 bulls), 24 Jersey bulls and 31 Holstein–Friesian/Jersey crossbred (18 cows and 13 bulls) individuals. Among these 91 animals, 38 parent-offspring pairs were available: WGS data were available for the sire of 30 animals, for the dam of two animals and for both sire and dam of three animals. These parent-offspring relationships span several generations (up to four) and were used to phase offspring with high confidence based on the Mendelian segregation rules.

When phasing accuracy is evaluated on real WGS data, the estimated phasing errors result not only from genuine phasing errors but also from other sources (e.g. assembly or genotype calling errors), which can blur the genuine phasing errors. To reduce the noise from these other sources of error as much as possible, we applied very stringent data filtering to select the so-called trusted variants (high-confidence variants). For the sake of generality, we also performed a more traditional variant filtering for ease of comparison with other studies and to evaluate phasing in more realistic conditions. In this paper, the WGS dataset always refers to the trusted set of variants, unless explicitly specified.

The stringent filtering rules applied to the 22,228,949 SNPs from the original VCF file are described hereafter.
In addition to calibration score, we used VCFtools [] to select bi-allelic SNPs that: (i) are present in other available bovine WGS datasets (the 1000 bull genomes [] run 2, the Belgian Blue cattle and New Zealand populations used in [] and a Dutch Holstein pedigree of 415 individuals reported in []); (ii) are present in the datasets of all 91 individuals used here; (iii) have a MAF higher than 0.01 (i.e. any SNP for which the minor allele occurred only once was discarded); and (iv) do not deviate from Hardy–Weinberg equilibrium (p > 0.05).

In this selection, we retained SNPs that displayed correct Mendelian segregation in the WGS Dutch Holstein pedigree based on the following rules: no parent-offspring incompatibilities (e.g., opposite homozygotes), no deviation from Hardy–Weinberg proportions (p > 0.05) and no deviation from expected genotypic proportions in offspring of heterozygous parents (p > 0.05). In addition, we excluded markers associated with a low power to detect possible parent-offspring inconsistencies. Application of these filtering steps substantially reduced both the number of SNPs and the level of genotyping errors.

In addition to variant quality, we also removed some genomic regions that may be incorrectly mapped (errors in the genome assembly).
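The MAF and Hardy–Weinberg criteria listed above can be sketched for a single bi-allelic SNP from its genotype counts. This is a minimal illustration, not the VCFtools implementation: the function names are hypothetical, and the chi-square goodness-of-fit test (1 degree of freedom) is an assumption standing in for whichever test was actually applied, which the text does not specify.

```python
import math


def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency from genotype counts at a bi-allelic SNP."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    alt_freq = (n_het + 2 * n_hom_alt) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


def hwe_chi_square_p(n_hom_ref, n_het, n_hom_alt):
    """P-value of a chi-square test (1 df) for Hardy-Weinberg proportions.

    Assumption: a chi-square approximation is used here for illustration only.
    """
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_traditional_filter(n_hom_ref, n_het, n_hom_alt,
                              min_maf=0.01, min_hwe_p=0.05):
    """Keep a SNP only if MAF > min_maf and the HWE p-value > min_hwe_p."""
    maf = minor_allele_frequency(n_hom_ref, n_het, n_hom_alt)
    hwe_p = hwe_chi_square_p(n_hom_ref, n_het, n_hom_alt)
    return maf > min_maf and hwe_p > min_hwe_p
```

With 91 animals (182 alleles), a minor allele seen once has a frequency of 1/182, which is below 0.01, so `passes_traditional_filter(90, 1, 0)` rejects the SNP, in line with the remark that singletons were discarded.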
Additional errors were detected based on the following evidence: multiple long runs of homozygosity (ROH) that had been detected with the genotyping array data were heterozygous for some segments of the WGS data, an excess of double cross-overs in the WGS Dutch Holstein pedigree compared to the array-based haplotypes, and split reads or unexpected distances between mate-pairs in a WGS mate-pair library.

Finally, SNPs that were retained in the genotyping array dataset (see below) but discarded by the filtering step just described were re-introduced in the WGS dataset. Application of the complete series of filtering steps resulted in a final list of 5,185,663 SNPs (hereafter referred to as the trusted set of WGS SNPs) that are listed in Additional file : Table S1. Application of only the more traditional filtering steps, i.e. retaining SNPs that were bi-allelic, present in the datasets of all 91 animals, showed no deviation from Hardy–Weinberg equilibrium (p > 0.05) and had a MAF higher than 0.01, resulted in 13,175,535 SNPs. The latter set was used only for illustrative purposes (comparisons to other studies) and will be referred to as the traditionally filtered WGS data (see Additional file : Table S2).

[...]

A first phasing was done for all 58,369 genotyped animals from LIC using SHAPEIT2 [, ] with default parameters except for the window size (set to 5 Mb). The originality of this method lies in its ability to efficiently explore the space of haplotypes that are consistent with a given genotype. This phasing method is referred to as “GEN-P1” and its results were used as the pre-phase for imputation of the WGS data from the genotyping data using only LD information.

[...]

As mentioned above, LINKPHASE3 was used to detect and discard map errors. However, the original purpose of this method is to partially phase the genotypes using Mendelian segregation rules and linkage in half-sib families.
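To illustrate what phasing with Mendelian segregation rules means at a single marker, the sketch below resolves the parental origin of an offspring's alleles from one genotyped parent. It is a toy example under stated assumptions, not the LINKPHASE3 algorithm: the function name and the 0/1/2 genotype coding (count of alternative alleles) are hypothetical, and linkage information in half-sib families is not modelled.

```python
def phase_offspring_from_parent(parent_gt, offspring_gt):
    """Phase one bi-allelic marker of an offspring using one parent's genotype.

    Genotypes are coded as the number of alternative alleles (0, 1 or 2).
    Returns a tuple (allele inherited from this parent, allele from the other
    parent), or None when the marker cannot be phased from this pair alone.
    Raises ValueError for a parent-offspring incompatibility.
    """
    if parent_gt not in (0, 1, 2) or offspring_gt not in (0, 1, 2):
        raise ValueError("genotypes must be coded as 0, 1 or 2")
    # Opposite homozygotes cannot occur under Mendelian inheritance.
    if {parent_gt, offspring_gt} == {0, 2}:
        raise ValueError("Mendelian inconsistency: opposite homozygotes")
    # A homozygous offspring carries the same allele on both haplotypes.
    if offspring_gt == 0:
        return (0, 0)
    if offspring_gt == 2:
        return (1, 1)
    # Heterozygous offspring: informative only if the parent is homozygous.
    if parent_gt == 0:
        return (0, 1)
    if parent_gt == 2:
        return (1, 0)
    # Both heterozygous: ambiguous without linkage or the second parent.
    return None
```

Markers where both animals are heterozygous stay unresolved in this sketch; these are the positions for which linkage and LD information become necessary.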
After applying this method to the population of 58,369 animals, further haplotype reconstruction was performed by integrating LD information using DAGPHASE [] and Beagle []. The resulting haplotypes were therefore inferred with both familial and LD information for the 35,285 SNPs (missing genotypes being imputed by Beagle). This phasing method is referred to as “GEN-P2” and was used as scaffold for phasing the WGS data panel using both LD and familial information.

[...]

To assess whether a pre-phasing strategy based on both LD and familial information improves the accuracy of imputation, we compared two scenarios (see Fig. ): (i) WGS-I1, imputation using WGS-P1 pre-phased haplotypes, i.e. imputation is performed from GEN-P1 to WGS-P1; and (ii) WGS-I2, imputation using WGS-P2 pre-phased haplotypes, i.e. imputation is performed from GEN-P2 to WGS-P2.

To evaluate the impact of the pre-phasing strategy on imputation accuracy, a 13-fold cross-validation was performed. The imputation to seven target animals from 84 reference animals was repeated 13 times. Pools of seven animals were randomly chosen without repetition, so that all 91 animals were eventually imputed. Imputation was carried out for all 29 bovine autosomes using Impute2 [], with an effective population size set to 200, a number of reference haplotypes set to 168 (i.e. twice the number of reference animals), and the option “–allow-large-regions” to impute the entire chromosome at once. For each animal, the output is the imputed dosage for both phases.

The following statistics were then obtained for all WGS SNPs by comparing the imputed dosages and observed genotypes of the 91 animals: imputation accuracy r², as the squared correlation between imputed dosages and observed genotypes of any WGS SNP, and imputation error rate, as the sum of the residuals between imputed dosages and observed genotypes divided by the number of imputed SNP alleles (i.e. twice the number of SNPs). […]
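The cross-validation design and the two statistics described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the random seed are hypothetical, and the residuals are interpreted here as absolute differences between imputed dosage and observed genotype, which is an assumption.

```python
import random


def cross_validation_pools(animal_ids, n_pools=13, seed=0):
    """Randomly split animals into equally sized pools without repetition."""
    ids = list(animal_ids)
    if len(ids) % n_pools != 0:
        raise ValueError("number of animals must be divisible by n_pools")
    random.Random(seed).shuffle(ids)
    size = len(ids) // n_pools
    return [ids[i * size:(i + 1) * size] for i in range(n_pools)]


def squared_correlation(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and genotypes."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # undefined when either vector is constant
    return cov * cov / (var_d * var_g)


def imputation_error_rate(dosages, genotypes):
    """Sum of absolute residuals per imputed allele (two alleles per genotype).

    Assumption: residuals are taken as absolute differences.
    """
    total = sum(abs(d - g) for d, g in zip(dosages, genotypes))
    return total / (2 * len(genotypes))


# Each pool of 7 animals is the imputation target once; the remaining
# 84 animals (168 haplotypes) form the reference for that fold.
pools = cross_validation_pools(range(91))
reference_sizes = [91 - len(pool) for pool in pools]
```

For r², the two input vectors hold the values of one SNP across animals; for the error rate, they hold all imputed genotypes being summarised.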

Pipeline specifications

Software tools: VCFtools, SHAPEIT, BEAGLE, IMPUTE
Applications: WGS analysis, GWAS