Computational protocol: Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches

Similar protocols

Protocol publication

[…] The basic steps for a mega-phylogeny include (1) designating the clade of interest, (2) identifying the gene region(s) of interest, (3) recording the extent of molecular diversity of the gene region in the clade of interest, (4) recording the threshold of coverage and identity to be used for orthology tests, (5) narrowing the possible sequences with a very broad term search [optional], (6) remove all potential sequences that are not members of the clade of interest, (7) testing orthology by BLASTing each potential sequence to each gene region identified for the breadth and removing those sequences that differ by more than the established threshold, (8) identifying sequences that should be reverse complemented, (9) removing sequences for duplicate taxon names, keeping the sequence with the best coverage and identity, and (10) test for saturation. If the sequences are saturated, subdivide them using the next available subclade and perform additional tests of saturation (step 10). Finally, once all of the sequences are in an alignment or exist as singletons (i.e. are not found to be contained in any subdivisions), profile each alignment to a master alignment. This can be repeated an arbitrary number of times for each gene region of interest. If multiple gene regions are used, these are then concatenated into a large matrix and the phylogeny inferred.We implemented this pipeline in Python (vers. 2.5) with the BioPython (vers. 1.48) module and using the BioSQL (vers. 1.0.1) database schema. Each mega-phylogeny matrix assembly analysis presented here was run on a Linux laptop with 1 GB RAM and a 2.4 Ghz dual-core processor. The phylogenetic analyses were conducted on an eight-way SMP Linux computer with 2.4 Ghz processors and 32 GB of RAM using RAxML (vers. 7.0.4; []). The steps that are novel for matrix assembly are described briefly below. [...] One major problem for large matrices using broadly sampled sequences or smaller matrices with quickly evolving sequences is that multiple sequence alignments become more challenging as sequences become more divergent [,]. Almost all multiple sequence alignment algorithms build a phylogeny during the estimation procedures []. The phylogenies built for multiple sequence alignment are often based on model-corrected or raw pair-wise distances. These methods are susceptible to problems with saturation (i.e. multiple mutations at the same site for the same organism) and are therefore much less accurate for large and broadly sampled alignments that are likely to contain very distantly related sequences. The quality of multiple sequence alignments can have a dramatic impact on the accuracy of the phylogenies produced [e.g., [-]]. As a result, other supermatrix methods sidestep multiple sequence alignment by employing clustering techniques to determine "alignable regions." Such clustering techniques have allowed for the assembly of very large matrices of sufficiently similar sequences [,,]. This approach can be dramatically affected by the parameters used during clustering, sometimes resulting in multiple informative clusters for slower gene regions (e.g. two large rbcL clusters for Ericales in Phylota).We combine the analysis of sequence saturation with recent advances in multiple profile-to-profile alignment methodology. A profile alignment is an algorithmic approach to identifying structural elements that are highly conserved between different alignments [-]. To accomplish this, separate alignments are aligned together while preserving the columns in the individual alignments. Newer profile alignment programs allow for more flexibility in profile alignment procedures (e.g. MAFFT; []). In our case, we separate sequences into subgroups of aligned sequences based on the degree of sequence saturation. For example, if the algorithm determines that the most inclusive group of sequences is saturated, then the group is broken up into less inclusive groups using the next level in the taxonomic hierarchy. In a Linnaean taxonomic system, if an "order" is found to be saturated, it would be broken into "families". Each smaller subset of sequences is then re-aligned and the saturation reassessed. This process continues iteratively to less inclusive groups until sequences no longer appear saturated and these alignments are then stored. We note, however, that the taxonomic groups used in this procedure need not correspond to ranks in the Linnaean hierarchy, but should simply be hierarchically nested (as in the NCBI taxonomy). Grouping using a rank-free classification (PhyloCode; []) could be used and will be possible once a database of phylogenetic names is implemented and usable. After every sequence has been either placed in an alignment or placed as a "singleton," the individual alignments are then profiled to a larger alignment. The order of the profiling can be random, optimized to find the best order, or aided by a hierarchical "guide" tree (e.g. first aligning more closely related matrices). Currently, we employ highly conservative guide trees based on published studies to carry out profile alignments. [...] We introduce a simple method based on dispersion statistics to rapidly detect saturation across a set of sequence data. Dispersion (an indicator of spread) is assessed on the one-dimensional Euclidean distance between the raw pair-wise sequence distances and those corrected according to a Jukes-Cantor model of molecular substitution. A one-dimensional Euclidean distance is the absolute difference between two points. Our measure of dispersion is based on the median and is commonly referred to as the median absolute deviation (MAD) and given by MAD = 1.4826 × Med (| xi - Med (x)|), where the median is estimated from the residual variation about the median of all pair-wise Euclidean distances. The constant 1.4826 is used to make MAD consistent for the standard deviation []. Thus, in our use, the larger the MAD the larger the overall spread in the Euclidean distances – that is, above a certain value the assumed nucleotide substitution model is no longer adequately accounting for the rate variation exhibited by pair-wise distances among species.We performed a simple simulation study to explore the behavior of MAD. First, we wanted to determine a threshold for subdividing sequences into smaller alignments. In addition, we wanted to compare MAD with alternative measures of dispersion based on the sample mean (i.e. mean square error, MSE; root mean square, RMSE). Sequence data were simulated across randomly constructed 20- and 100-tip phylogenies. Different rates of molecular evolution were simulated by incrementally scaling the total tree length by a factor of 0.10, starting from 0.10 and stopping at 2.0. All molecular simulations were carried out using Seq-gen (Ver 1.3.2; []).The results from these simulations clearly highlight the utility of MAD. First, unlike MSE or RMSE, MAD does not require an underlying Gaussian distribution, which is useful as the distribution of Euclidean distances becomes skewed as the degree of sequence divergence increases. A second advantage, and perhaps the most critical, is that MAD appears stable when sequence divergence is unrealistically high (e.g. tree length scaled by a factor of 2; Figure ). This situation is also analogous to the presence of outliers that have well-known influences on dispersion statistics based on the mean. Because MAD does not require an explicit distribution and is the 50th percentile of the residual variation, it has the inherent property of being robust to outliers (Figure ). Finally, our simulations indicate that a MAD exceeding ~0.01 provided a conservative indication of a saturation level necessitating a profile alignment scheme. […]

Pipeline specifications

Software tools Biopython, RAxML, PRALINE, MAFFT, Seq-Gen
Applications Phylogenetics, Amino acid sequence alignment
Organisms Homo sapiens