Computational protocol: Signals of Historical Interlocus Gene Conversion in Human Segmental Duplications

Similar protocols

Protocol publication

[…] We used the genomicSuperDups table from the UCSC Table Browser to identify 1419 families of ≥4 paralogous sequences (mean number of sequences per family = 8.4) in the human reference genome (GRCh37). We limited our focus to large SDs >10 kb (mean sequence length = 34.7 kb) to ensure an adequate number of informative sites in tests of historical recombination (see below). Owing to the hierarchical, nested organization of SDs across the human genome, there is considerable sequence redundancy between SD families. Collectively, sequences within these families span 52.4 Mb of the human genome (1.7%) with each family surveying an average 29.4 kb of duplicated sequence not present in any other SD family. This initial dataset includes sequences overlapping 712 genes, with 205 genes fully contained within analyzed sequences.We constructed a high quality multiple sequence alignment for each SD family using the L-INS-i algorithm implemented in mafft (v. 6.951) . Alignments were trimmed to remove terminal regions of non-alignment and all gaps. We subsequently removed any sequences with <88% pairwise sequence identity to any other sequence in the alignment. Below this level of sequence identity, the rate of IGC is likely negligible . Sequences that reduced overall alignment length by a factor >0.25 through the introduction of a large number of gap sites were excluded, and families with <10 kb aligned, ungapped sequence or fewer than four sequences were removed from the analysis (n = 83). The remaining 1336 SD families had an average of 7.7 paralogous sequences with 95.8% sequence identity across 18.8 kb of high quality aligned sequence.To infer the evolutionary relationships between paralogous sequences within each SD family, we constructed maximum likelihood unrooted trees using PhyML (v3.0) . Nucleotide evolution was modeled as a general time reversible process with gamma distributed rate variation across four rate classes and a proportion of invariant sites estimated by maximum likelihood . To ensure high confidence in the resulting tree topology, we pruned the subset of trees with nodes supported by less than 88% of bootstrap replicates. Specifically, we randomly dropped one sequence stemming from the node with the lowest bootstrap support and reconstructed the tree on the reduced sequence alignment. This process was reiterated until all nodes on the tree were supported by ≥88% of bootstrap replicates, or until the alignment was reduced to fewer than four sequences. Fifty-six SD families were eliminated by this criterion (). We excluded an additional 34 SD families for which neighbor-joining tree topologies based on an F84 DNA distance matrix were not equivalent to the maximum likelihood topologies . The final set of 1246 SD families includes 38.9 Mb of sequence covering 617 genes. [...] Although the evolutionary history of each sequence family is summarized by a single dominant tree topology (which presumably mirrors the duplication history of the sequences), individual sites within the alignment may be discordant with the tree. As illustrated in , non-allelic recombination between two sequences can give rise to informative sites that support an alternative tree topology. Non-allelic crossover events will introduce large, contiguous blocks of variable sites in the alignment that support an alternative tree topology. Gene conversion between non-allelic sequences can create short clusters of variable sites that are inconsistent with the dominant tree (). The inter-digitation of IGC tracks, ectopic crossovers, and more complex rearrangements (e.g., microhomology-mediated break-induced replication (MMBIR)) over time will generate complex, interwoven switches in the evolutionary history across a multiple sequence alignment . In keeping with the vocabulary of Jackson et al. , we refer to sites inconsistent with the tree as “Reticulate sites” (R sites). In addition to interlocus recombination, R sites can also arise from independent mutation to the same nucleotide at paralogous positions. Most double mutation events, however, are expected to result in three bases at a given site in the alignment. We refer to these sites of obligate double mutation as “Bimutational Sites” (B sites). Informative sites consistent with the dominant tree topology are classified as “Concordant sites” (C sites).An SD family with N paralogous sequences is comprised of four-sequence sub-alignments, or quartets. For each sequence quartet within a SD family, we classified all informative sites as concordant, reticulate, or bimutational. The total number of R sites in the full alignment is the number of sites (or columns in the alignment) that are reticulate in ≥1 sequence quartet. We bootstrapped the columns of the alignment 100 times and calculated the number of R sites across each bootstrapped alignment to derive approximate 95% confidence intervals for the observed number of R sites.We evaluate the hypothesis of historical IGC among sequences in a given quartet by comparing the observed number of R sites to the number expected in the absence of interlocus exchange. This null expectation was derived by simulating 100 sequence datasets along the ML tree using Seq-Gen (v1.3.3) . We used the proportion of invariant sites, four classes of evolutionary rates, empirical nucleotide frequencies, and transition probabilities estimated from the actual data to guide the simulations ().Approximate P-values for comparisons of the observed and expected numbers of R sites were computed using a strategy that accounts for uncertainty in both values. First, each draw from the bootstrap distribution of observed R sites was paired with a draw from the simulated distribution of R sites (n = 100 paired values). If the simulated distribution is a good approximation to the observed distribution, the difference between these values should be zero on average. On the other hand, if more R sites are observed than expected according to our simulation, the difference between the observed and simulated R site counts should be greater than zero. P-values were calculated as the fraction of 100 paired observations for which the simulated number of R sites was greater than the corresponding value from the bootstrap distribution. […]

Pipeline specifications

Software tools MAFFT, PhyML, Seq-Gen
Application Phylogenetics
Organisms Homo sapiens
Diseases Neuroblastoma