Computational protocol: Joint amalgamation of most parsimonious reconciled gene trees

Similar protocols

Protocol publication

[…] Here, we consider the problem of finding the most parsimonious reconciliation (MPR) when considering—as possible macro-events that result in the birth and death of gene copies—speciation, gene duplication, gene transfers and gene loss (). The general problem of finding an MPR is known to be NP-complete, even for reconciling two binary trees (). The complexity of the problem is due to the difficulty of ensuring the time consistency of gene transfers, i.e. satisfying the chronological constraints among nodes of the species tree that are induced by transfer events. However, the problem becomes polynomially solvable when accepting a time-ordered species tree as input (among others ; , see for a review). In this article, we build upon the combinatorial reconciliation model introduced by , which can be used to solve this special case of the problem.Some parsimony methods (e.g. ) do not need information on the order of speciations in time. This allows a more efficient recursion over reconciliations, but at the cost of considering reconciliations that contain transfer events that are not consistent with any ordering of the species tree ().The DTL model of can be used to reconcile a time-ordered binary species tree S with a binary gene tree G by constructing a mapping α that maps each node u∈V(G) into an ordered list of nodes in V(S′), namely the ancestral and/or extant species in which the sequence corresponding to u evolved. This model takes into account four kinds of biological events: speciation, gene duplication, gene transfer and gene loss. The atomic events of this model are: a speciation (S), a duplication (D), a transfer (T), a transfer followed immediately by the loss of the non-transferred child (TL), a speciation followed by the loss of one of the two resulting children (SL), and a contemporary event (C) that associates an extant gene to its corresponding species. Finally, a null event (∅), is used to model a gene lineage crossing a time boundary. Note that duplication-loss events and transfer followed by the loss of the transferred gene, unlike a transfer followed by the loss of the non-transferred gene and speciation-loss events, leave no trace and are therefore undetectable. This is why, in the DTL model, losses are never considered alone. We refer the reader to for the formal definition of a DTL reconciliation.Let θ, τ, λ be, respectively, the costs of a duplication, a transfer and a loss. Given a DTL reconciliation, we define the cost of α, denoted by c(α), as the sum θd+τt+λl, where d, t and l are respectively the number of D events, of T and TL events, and of SL and TL events in α. In the authors give an efficient algorithm to compute c(G, S) for a time-ordered species tree S and a gene tree G, where c(G, S) is defined as the minimum cost over all possible DTL reconciliations between G and S. [...] Even though our aim is to reconstruct reliable gene trees from a multiple sequence alignment and a species phylogeny, our approach does not directly take sequence alignments as an input, but requires a sample of gene trees, typically produced from the alignment by either a Markov Chain Monte Carlo (MCMC) methods such as PhyloBayes () and MrBayes (), or bootstrap resampling.To find the optimal gene tree, clades found in the input sample of gene trees are combined using the amalgamation approach in order to recover an optimal tree with respect to our scoring scheme. The optimal tree recovered will only contain clades found in the input sample of gene trees, but it will not in general be found in the sample itself. provides a schematic illustration of the amalgamation approach. Clades present in the sample of trees (the unrooted trees on the left) can be combined to obtain a tree such that each clade is found in the sample, but the tree itself is not. For example, one can produce a green-blue tree consisting of a green subtree with genes a, b and c, and a blue subtree with genes d, e and f. The sequence score of each tree is obtained using CCPs that depend on the number of times different trees are seen in the sample and is described in detail in the next section. The reconciliation score for each tree corresponds to the MPR of the gene tree with the species tree. The amalgamation algorithm itself is a joint dynamic programming recursion over (i) all trees that can be produced from clades present in the input sample and (ii) all possible ways to reconcile each of these trees with the species tree, to recover a gene tree with the smallest joint sequence-reconciliation score. As shown in Figure 1 in the online appendix, amalgamation permits us to explore a vastly larger set of trees than those contained in the sample.Conceptually, both ALE () and TERA are based on the amalgamation approach of AnGST (), and all three methods are—at the level of the dynamic programming recursion—closely related. TERA differs from AnGST in the underlying reconciliation model () and because it allows transfers going/coming from extinct or unsampled species (). Moreover, the AnGST scoring scheme is solely based on the reconciliation score. ALE differs from TERA in that it relies on a complex underlying probabilistic model; the results of which, in contrast to TERA, are contingent on time-like branch lengths of the species tree.TERA’s amalgamation algorithm can be regarded as a generalization of the gene tree reconciliation algorithm of , which iterates over reconciliations by mapping each node of a gene tree to branches of the species tree. In the joint recursion presented in this article, instead of nodes of a gene tree, the clades found in the input sample of gene trees are mapped into branches of the species tree.More formally, assume we are given a set of (unrooted) gene trees G on the same leaf set reconstructed from a unique sequence alignment. We denote respectively by A(G) and Π(G) the union of all the clades, and the union of all tripartitions in G. For each tripartition π, we denote by π[1] (π[2] and π[3], respectively) the first (second and third, respectively) element of π. If G contains unrooted trees we consider all possible rootings for each tree when computing A(G) and Π(G). Furthermore, for a given clade C of A(G), we denote by Π(C) the set of tripartitions π∈Π(G) for which π[1]=C. When focusing only on the reconciliation score, the optimization problem consists of computing c(G,S):=min⁡G∈Gamc(G,S), where Gam is the set of gene trees such that A(G)⊆A(G) for all G∈Gam. The pseudocode is given in Algorithm 1 in the appendix. Roughly speaking, our algorithm starts by computing the subdivision S′ of S, and the sets A(G) and Π(G). Then, it performs a joint traversal of all gene tree clades and species tree branches wherein clades C in A(G) are considered in order of increasing size, and nodes x′ of S in order of increasing height. For each pair (C, x′) the algorithm computes the cost of reconciling clade C with x′ by testing all possible tripartitions π in Π(C). Because each non-trivial tripartition π can be seen as an internal node of an amalgamated tree, with children π[2] and π[3], the cost of reconciling a tripartition π with x′ can be computed according to Algorithm 1 of . We refer the reader to Algorithm 1 of for a better understanding of the pseudocode. The correctness of our approach is proven in the appendix.Note that—for ease of writing—the pseudocode of the algorithm does not contain the transfers from the dead, i.e. the transfers going/coming from extinct or unsampled species (). However, Algorithm 1 can be easily modified to accommodate this kind of event by adding to the species tree S a sister group of the root clade such that, within this group, duplications and losses are free, speciations are not permitted, and transfers to this new group (formally corresponding to unrepresented speciations) cost zero—similar to what is done in the likelihood framework by . [...] TERA is implemented in C++ and is freely available from samples of gene trees, for both simulated and real alignments, were downloaded together with the ‘true’ gene trees used to simulate alignments from the dryad data repository (doi:10.5061/dryad.pv6df) provided by .For all the parsimony-based species tree aware methods, including TERA, we used the DTL costs δ = 2, τ = 3, λ = 1 obtained by using a criteria based on minimizing the change in ancestral genome sizes on a large biological dataset. We ran TreeFix-DTL with default parameters, JTT/GTR with a gamma distribution as models of evolution, and as a starting tree the PhyML tree. MowgliNNI was run with default parameters, a threshold of 50 for weak edges, and with the PhyML tree—with bootstrap values—as a starting gene tree. AnGST was run with default parameters using the dated species tree on samples of 1000 gene trees, whereas JPrIME-DLTRS was run with JTT with a gamma distribution as model of evolution, 100 000 iterations, a thinning factor of 10 and a time out of 10 h. Finally, we ran TERA with a starting cA of 0.1 and for G samples from 10 up to 10 000 gene trees for each simulated alignment. The gene trees reconstructed by ALE were downloaded from the above mentioned data repository.Note that, from a practical perspective, the DTL costs we use are the default parameters for all the parsimonious methods described in the article, and seem to work well for several parts of the Tree of Life. If the user suspects that these values are not suited for the analyses, these parameters should be estimated beforehand, e.g. using the ALE method. […]

Pipeline specifications

Software tools MPR, PhyloBayes, MrBayes, ALE, AnGST, TreefixDTL, PhyML
Application Phylogenetics