Computational protocol: Identification of polymorphic inversions from genotypes

Similar protocols

Protocol publication

[…] In the analysis of genotype data, we apply the inversion model in two steps. In the first step, we use the inversion model to phase and pair haplotype blocks around potential breakpoints. This approach allows flexibility to phase two haplotype blocks at any distance, without phasing the whole genome or constantly re-phasing regions, when one of the breakpoints is changed. This use of block phasing is a novel application of the inversion model. The second step is the original application of the model [] to identify which individuals are likely to have a genetic inversion.In this work, we define a candidate breakpoint as a pair of consecutive SNPs. A segment tested for an inversion is thus defined by two candidate breakpoints (left and right). We flank the left and right candidate breakpoints by two blocks of N SNPs each, and perform local haplotype phasing with haplo.stats []. As a result, an individual i has two haplotypes, L1(xi) and L2(xi) of 2N SNPs each, containing the left breakpoint in the middle, and two 2N SNPs haplotypes R1(xi) and R2(xi), containing the right breakpoint in the middle. The first application of the inversion model is to sort out the chromosome pairing of the containing blocks, that is, we set B1 = L1, B2 = R1, B3 = R2, B4 = L2 and m = ns (number of subjects) in the inversion model. The responsibilities for the alternative model tell us if the pairing of (L1, R1) in one chromosome and (R2, L2) [e.g. (L2, R2) ordered by handedness] in the other chromosome is more (less) likely than the pairing (L1, R2) and (R1, L2) [(L2, R1)], for each subject in the sample. This first process phases the genotype data between any pair of blocks, surrounding candidate breakpoints.The second application of the inversion model is to determine the presence of genetic inversions on phased data, as originally proposed in Sindi and Raphael 2010 []. In total, we have nc = (2 * ns) haplotype blocks phased by the former pairing. We first aggregate the data into two variables one for each breakpoint, HL(Lj(xi)) corresponds to the haplotype of subject i and chromosome j containing the left breakpoint, and HR(Rj(xi)) to the haplotype of subject i and chromosome j containing the right breakpoint. We next split the left haplotype into two N - SNPs blocks (HLL, HLR) = HL each of which flank the potential left breakpoint, and (HRL, HRR) = HR each of which flank the right breakpoint. In the inversion model, we set B1 = HLL, B2 = HLR, B3 = HRL, B4 = HRR and m = nc, as the total number of chromosomes in the sample. The responsibilities of this model now determine for which chromosomes the pairings (HLL, HLR) and (HRL, HRR) are more (less) likely than the pairings (HLL, HRL) and (HLR, HRR). That is, the model determines which chromosomes are likely to be inverted between the left and right breakpoints considered.An important consideration in our analysis is whether the initial phasing step itself is confounded by the presence of an inversion. Thus, in the first step of our analysis, we perform two other (more complex) local haplotyping strategies. In the first strategy, we locally haplotype each N-SNP flanking block separately with haplo.stats and then use the inversion model to phase the internal flanking blocks, i.e. those assumed to be within the inversion segment. Finally we use the model again to phase the internal haplotype with each external block. A second alternative strategy is to perform the local haplotyping with haplo.stats once on the surrounding blocks following the forward population and twice following the inverted population, leaving two distinct haplotype data for each population. We compare the accuracy of each phasing strategy in the prediction of inversions on simulated data and contrast them with the accuracy of the prediction for already phased data. [...] In addition to providing a novel method to analyze genotype data for polymorphic inversions directly, our work aims at improving the efficiency of detecting inversions, and the accuracy of classifying individuals with respect to inversion status.Previous LD methods [,] considered a potential breakpoint between every pair of adjacent SNPs in the genome with sufficient diversity and physical distance between breakpoints. Thus, the total number of segments tested for inversions was very large, O(n!) where n corresponds to the number of candidate breakpoints or SNPs. To reduce the computational load, we decrease the number of regions tested by using a sliding window. We tested our approach on simulated inversions from invertFREGENE [], and found our model still correctly identifies an inversion if the candidate segment tested is contained within the real inversion sequence and "large enough" compared to the real inversion. Consequently, we scan the chromosome with trial segments of fixed length (window size), as probes for detecting inversions with comparable length. We then reconstruct the true inversion by considering all trial segments with sufficiently large Bayesian Information Criterion (BIC). This simplification massively reduces the number of computations to O(n).In addition, prior methods for haplotype data were unable to successfully identify individuals carrying the inversion, a necessity for genome association studies. We develop an accurate classification method based on the responsibility of each individual to the normal or inverted subpopulation in overlapping windows. For a fixed window w we determine the responsibility of chromosome i, to the forward model r0,iw. Sindi and Raphael [] attempted to classify individuals based on their dominant responsibility, normal if r0,iw≥0.5 and inverted otherwise; however, this did not yield a useful classifier because the classification from any single instance of the mixture model was poor. We find that combining classifications from adjacent, overlapping windows yields a substantial improvement in classification accuracy with the true inversion. More generally, we estimate the responsibility of individual i to the inversion I by a majority vote overall overlapping segments where BIC exceeds a given threshold BIC > tB: (7) r 0 , i I ( t B ) = ∑ w H ( r 0 , i w - 0 . 5 ) H( B I C w - t B ) ∑ w H( B I C w - t B ) , where H is the step Heaviside function, H(z) = 1 when z >0 and 0 otherwise. For haplotype data, we classify a chromosome i as inverted when r0,kI(tB)<0.5 for a particular value of tB. For genotype data, we classify each individual as homozygous for the reference orientation, heterozygous for the inversion or homozygous for the inversion, depending on the classification obtained from the responsibilities of both chromosomes. We define the classification accuracy of our method as the number of individuals with correctly identified inversion status (i.e., fraction of true positives and true negatives). In the simulation results below, we compare the classification accuracy of our method to the true known categorization of individuals as function of tB. In practice, our classification accuracy increases with tB. […]

Pipeline specifications

Software tools haplo.stats, invertFREGENE
Applications Population genetic analysis, GWAS
Chemicals Nucleotides