Computational protocol: An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines

Similar protocols

Protocol publication

[…] A total of 120 individuals were sequenced in eight ingroup species of North American white pines from Pinus subsection Strobus (Pinus albicaulis Engelmann, P. ayacahuite Ehrenberg ex Schlechtendal, P. chiapensis (Martínez) Andresen, P. flexilis James, P. lambertiana Douglas, P. monticola Douglas ex D. Don, P. strobiformis Engelmann, and P. strobus L.) and three outgroup species from Pinus subsection Gerardianae (P. bungeana Zuccarini ex Endlicher, P. gerardiana Wallich ex D. Don, P. squamata X. W. Li), the identified sister lineage to Pinus subsection Strobus[,]. Sequencing was conducted on haploid templates generated from DNA extractions of seed megagametophyte tissue; as a single haploid sequence was generated for each individual at each locus, no phasing was necessary. Gene sequences were obtained from 245 putative nuclear loci chosen from among ∼7,500 loci recently resequenced for loblolly pine (Pinus taeda L., using single pass, bidirectional Sanger sequencing of PCR products amplified from haploid megagametophyte tissue excised from seeds of each species. Further description of laboratory protocols appears in []. Sequence data were pre-processed and organized using PINESAP[], a bioinformatics pipeline that combines PHRED[], PHRAP[], and MUSCLE[,] to call bases and align sequencing reads. Reported nucleotide sequences consisted only of A, C, G, T, missing, and gap information, with no other ambiguity codes used. After pre-processing, the data were manually assembled and aligned using CODONCODE (CodonCode Corporation, Dedham, MA). Bases were called using a minimum PHRED score [,] of 25 for aligned bases. All polymorphisms were visually validated. All alignments were further aligned to resequencing data from P. taeda (unpublished data) using the profile-profile option in MUSCLE[,]. These alignments are publicly available as part of the Dendrome project ( GenBank accession numbers for sequences in the study appear in Additional file : Table S1.Of 245 loci sequenced initially, 37 were dropped from further consideration due to low overall quality of sequence reads. An additional 15 loci were discarded due to possible chloroplast or mitochondrial contamination, on the basis of BLAST analysis against pine organellar sequences in GENBANK[]. Two loci were dropped due to sequence similarity to retroelement-like proteins, leaving 191 high-quality nuclear gene alignments. We then eliminated 70 loci for which at least one of the 11 species contained no data. This filter reduced the dataset to 121 loci, covering ∼47 kb of aligned sequence data.Coding regions (i.e., site annotations) could be confidently identified for 112 of the 121 loci by further analysis using TBLASTX against protein-coding genes in Arabidopsis, Oryza, Picea, and Populus. For these 112 loci, the gene for the highest-scoring TBLASTX hit, in combination with the expressed sequence tag from loblolly pine, was used to identify coding regions. Site annotations for each alignment were validated with BLASTP analysis of the amino acid sequences derived from the inferred coding intervals against the gene that was used to derive the site annotations. For the data from the 112 annotated loci, ∼62% represents exonic regions, ∼18% represents intronic regions, ∼1% is from 5’ UTRs, and ∼19% is from 3’ UTRs. Because 112 of the loci could be confidently identified as belonging to coding regions, with a substantial fraction of exonic sequence, the data likely contain a mixture of non-neutral and neutral regions. [...] We view as a species tree inference method any method that outputs a species tree estimate. The six species tree inference methods in this study are Concatenation [,], SuperMatrix Rooted Triple (SMRT; []), STEAC [], STAR [], Rooted Triple Consensus (RTC; []), and Minimize Deep Coalescences (MDC; [,]). Concatenation and SMRT are concatenation-based, and STEAC, STAR, RTC, and MDC are consensus methods. Because we have adopted a unified two-stage framework for phylogenetic inference strategies in which gene trees are first inferred by one approach and species trees are then inferred from gene trees by a second approach, we did not investigate single-stage approaches such as BEST [,], and *BEAST [] that bypass gene tree inference or that perform gene tree inference simultaneously with species tree inference. Our analysis pipeline explores the performance of two-stage inference strategies when the roles of gene tree and species tree inference are separated, and it therefore requires that strategies estimate species trees from inferred gene trees and that they permit different gene tree inference methods to provide input to a given species tree method. The six species tree methods investigated in this article satisfy both of these conditions, whereas species tree methods such as BUCKy [], BEST [,], and *BEAST [] do not. Further, the methods we have selected are well-suited to a computationally intensive bootstrap approach included in our pipeline for generating distributions of species tree topologies, and the more computationally intensive of the single-stage methods would not be easily accommodated within this framework. Given the large number of two-stage methods available, it would not be possible to be comprehensive; we have thus chosen a limited number of methods that represent a range of underlying principles. Our choice of methods permits a diverse set of criteria for estimating species trees to be evaluated, and the conceptual differences in the underlying methods enable some differentiation in behavior across methods.Consider a set of L loci (multiple alignments) with m ingroup and one outgroup species. Concatenation methods concatenate the L alignments to create a single “super locus” consisting of an alignment of the m+1 species across L loci. From this alignment, a gene tree is inferred by either maximum likelihood, maximum parsimony, or neighbor-joining—note that the definition of Concatenation does not require that gene trees be estimated using any specific method—and is then taken as the species tree estimate. Similarly, SMRT creates a concatenated alignment of the m+1 species from a set of L alignments. However, SMRT then constructs from this concatenated alignment all m3 concatenated alignments of three ingroup species and an outgroup species. Rooted three-taxon gene trees are then inferred from each of the m3 concatenated alignments. A supertree algorithm is then applied to the set of rooted three-taxon gene trees to estimate an m-taxon species tree topology. This study uses the Modified Mincut supertree algorithm implemented in the program SUPERTREE[] to construct a species tree from rooted three-taxon gene trees.Consider a set of (m+1)-taxon gene trees (m ingroup and one outgroup species) inferred at each of L loci. STEAC estimates a species tree topology by using estimated mean coalescence times. For distinct species Si and Sj, the mean coalescence time is computed as the estimated coalescence time for Si and Sj, averaged over all L gene trees. This resulting mean is placed into a matrix of distances between species, to which neighbor-joining is applied to estimate the species tree topology (using the R PHYBASE package).STAR estimates a species tree topology by using average coalescence ranks. It assumes that the rank of the root of a gene tree is equal to the number of species in the tree (m + 1 in our case). An internal node of a gene tree is then assigned one less than the rank of its immediate ancestor. For distinct species Si and Sj, the average coalescence rank is computed as the rank of the node that connects Si and Sj, averaged over all L gene trees. Similarly to STEAC, these average coalescence ranks specify a matrix of distances between species pairs. Neighbor-joining is applied to the matrix to estimate the species tree topology using PHYBASE.RTC estimates a species tree from rooted three-taxon tree topologies. At each locus ℓ, ℓ=1,2,…,L, RTC finds the set of m3 rooted tree topologies of three ingroup and one outgroup species that are displayed by the inferred gene tree at locus ℓ. RTC then applies quartet puzzling [] to the m3L topologies to estimate the species tree topology (using the program TRIPLEC).A coalescence event between a pair of lineages is considered “deep” if the coalescence does not occur in the first population in which the pair of lineages is capable of coalescing. Given a gene tree, the number of deep coalescences on a species tree is defined as the total number of “extra lineages”, summed across branches of the species tree topology, that is needed to fit the gene tree within the species tree topology. Here, the number of extra lineages for a branch is one fewer than the number of lineages that survive to the ancestral node of the branch; if incomplete lineage sorting does not occur, then only one lineage persists from a branch to a more ancestral branch, and there are no extra lineages. For a set of L gene trees, the number of deep coalescences for a species tree is the total number of deep coalescences for the species tree given a gene tree, summed across the L gene trees. MDC estimates a species tree topology by minimizing the number of deep coalescences. That is, MDC finds a species tree topology for which the number of deep coalescences that will fit the set of L gene trees within the species tree topology is minimal. This study utilizes the MDC implementation in PHYLONET[]. [...] We aim to determine which of the 72 phylogenetic inference strategies perform similarly, and we use multivariate analyses to define clusters of strategies that provide similar species tree estimates. Consider a 72×145-dimensional data matrix S in which rows represent strategies and columns represent 145 observed clades, among the ∑k=28-18k=246 possible non-trivial clades (i.e., clades that contain more than one species and fewer than all analyzed species) of eight species. Entry Sij in column i and row j of S is the number of times that strategy i infers clade j in 1000 bootstrap replicates across loci.Principal components analysis (PCA) was applied to S to create a 72×2-dimensional matrix V, with rows representing strategies and the first and second columns representing the first and second principal components, respectively. Plotting strategies onto the space defined by these principal components yields a two-dimensional spatial “map” of phylogenetic inference strategies. We similarly applied multidimensional scaling (MDS) to a distance matrix for all 722 pairs of strategies, computing pairwise distance as the mean Robinson-Foulds distance [] across all 106 pairs of bootstrap trees, and extracting the first two components. We calculated the Robinson-Foulds distance using TREEDIST in PHYLIP.To compare spatial maps of phylogenetic inference strategies, we used Procrustes analysis [-]. In particular, we compared the spatial distribution of a subset of 72-r strategies when analyzed alone to the spatial distribution for all strategies. The comparison enabled us to quantify the influence that a set of r strategies with a particular feature (i.e., species tree construction method, gene tree inference method, or outgroup species) has on the full spatial distribution. Consider a proper subset Σ={σ1,σ2,…,σ72-r} of the full set of strategies. Consider a (72-r)×145-dimensional data matrix SΣ in which rows represent the strategies in set Σ and columns represent observed clades. SΣ is a submatrix of S, in which the rows corresponding to strategies in Σ are selected from S. Consider a (72-r)×2 target matrix X and a (72-r)×2 comparison matrix Y. X is matrix V restricted to the set of strategies Σ. Y represents the first two principal components in the PCA applied to matrix SΣ. Now consider a (72-r)×2 matrix Z=bYT+C, where b is a scaling factor, T is a 2×2 matrix that rotates and reflects Y, and C is a (72-r)×2 matrix that has constant columns and that is used to translate the matrix. Procrustes analysis seeks to find b, T, and C to minimize the sum of squared differences between X and some (72-r)×2 matrix Z⋆=bYT+C. That is, Z⋆ is formally defined as Z⋆=argminZ{∑i=172-r∑j=12(Xij-Zij)2}. The dissimilarity measure between X and Z⋆ is computed as (3) ∑ i = 1 72 - r ∑ j = 1 2 ( X ij - Z ij ⋆ ) 2 ∑ i = 1 72 - r ∑ j = 1 2 ( X ij - μ j ) 2 , where μj=172-r∑i=172-rXij is the jth entry of the centroid of X. This measure takes the sum of squared differences between points on the spatial maps defined by X and Z⋆ and normalizes it by the sum of squared differences between the points on the spatial map defined by X and their centroid.Define a cluster as a set of strategies and let the centroid of a cluster be the location in the 145-dimensional space of clades whose coordinates are the means of those of all strategies in the cluster. Hierarchical clustering was performed by first creating a matrix of Euclidean distances between all 722 pairs of 145-dimensional vectors represented by the matrix S. Define the within-cluster sum of squared Euclidean distance as the squared Euclidean distance between a point in a cluster and the cluster centroid, summed over all points in the cluster. From the 72×72-dimensional matrix of Euclidean distances between strategies, a dendrogram relating the strategies was constructed using the Ward algorithm [], which iteratively merges clusters until all points are contained within a single cluster. For a given iteration, two clusters are merged if their merged cluster has a smaller within-cluster sum of squared Euclidean distances than any other potential merged cluster. The nesting of clusters created by the algorithm defines the dendrogram.We performed K-means clustering on the 72 145-dimensional vectors, using K clusters, K=2,3,…,9. Given K, strategies were separated into K clusters on the basis of the squared Euclidean distance between all pairs of the strategies in a 145-dimensional space. We ran 104 replicates with random starting locations. Each replicate yielded a total within-cluster sum of squared distances for the set of K clusters, representing the within-cluster sum of squared distances between points in a cluster and the cluster centroid, summed over all K clusters. We then chose the set of cluster assignments that had the minimum total within-cluster sum of squared distances, where the minimum was taken over all 104 replicate starting locations.To compute the Pearson correlation coefficient between a pair of strategies, we only used points in the 145-dimensional vector that were nonzero in both strategies being compared. […]

Pipeline specifications