Computational protocol: The Impact of Outgroup Choice and Missing Data on Major Seed Plant Phylogenetics Using Genome-Wide EST Data

Protocol publication

[…] To generate a comprehensive molecular matrix for addressing the phylogenetic relationships of flowering versus non-flowering seed plants, we searched the TIGR Plant Transcript Assemblies database for well-sampled representatives of all major seed plant groups. Our search for available EST/unigenes (from a total of 226,210 EST assemblies and singletons) from well-sampled representative members of major seed and seed-free plant groups retrieved a total of 158,358 genes from complete genomes (Arabidopsis, rice, and poplar), and between 16,000 and 22,000 total unigenes (depending on the dataset) from ESTs for all other species included in various versions of the analysis.

In all, the following species were surveyed: Arabidopsis thaliana, Oryza sativa (common rice), Amborella trichopoda, Vitis vinifera (common grape vine), Populus trichocarpa (California poplar) (angiosperms); Cycas rumphii (Malayan fern palm), Zamia fischeri, Ginkgo biloba, Gnetum gnemon (melinjo, bago, peesae), Welwitschia mirabilis, Cryptomeria japonica (Japanese cedar), Pinus taeda (loblolly pine) (gymnosperms) as ingroup taxa; Selaginella moellendorffii (lycophyte), Adiantum capillus-veneris (Filicalean fern), Marchantia polymorpha (liverwort), Physcomitrella patens (moss) and Chlamydomonas reinhardtii (unicellular green alga) as outgroups. All available assembled EST databases were surveyed, regardless of their source (tissue, developmental stage, or type of experiment).

Using these unigenes, the OrthologID software pipeline was employed to predict orthologous groups, resulting in fully aligned matrices composed of 926–1,600 gene or ortholog partitions. The number of orthologs varied with the filtering schemes discussed below. These ortholog groups consisted mostly of translated EST sequence data. [...] OrthologID identifies all genes that are orthologous amongst the taxon set under examination.
Because the EST database is incomplete, the resulting orthologous groups often include only a few taxa. In addition, the available orthologs can be confined to specific, narrowly defined taxonomic groups. We reasoned that partitions with three or fewer orthologs would add little to the robustness of the present analysis, so we developed a filtering function in our informatics pipeline that removed any ortholog set with fewer than four taxa represented in the ortholog group. We further restricted the filter to retain only those ortholog groups with at least three ingroup taxa (specifically at least two gymnosperms and one angiosperm) and one outgroup taxon per partition. We arrived at a comprehensive dataset formed by 12 ingroup species and 4 outgroup species. Using all available outgroups retrieved the largest number of bona fide orthologous partitions (1,239) under the filtering scheme requiring at least three ingroup taxa (two gymnosperms and one angiosperm) and one outgroup per partition.

The resulting ortholog groups comprise genes that are randomly distributed throughout the genome, as demonstrated by mapping the loci on the chromosome map of Arabidopsis thaliana. This partly offsets the general bias of EST and transcriptome data, which most often show enrichment for genes implicated in metabolism, energy and general housekeeping, and an underrepresentation of functional categories such as gene regulation. Still, our dataset comprises an array of orthologous genes belonging to diverse functional categories, including transcriptional regulators and signaling genes. The fact that statistical tests (z-scores, Sungear; data not shown) show no overrepresentation of these categories further suggests that our ortholog sample is more balanced (i.e.
less biased) than any previously reported for similar studies of EST data.

[...] Once the ortholog groups were established as detailed above, we used the Perl script ASAP (Automated Simultaneous Analysis Phylogenies) to organize and construct a matrix. This program automatically constructs a matrix whose partitions are labeled by gene name, GO category, and other informatics categories. The concatenated partitioned matrix can be found in .

[...] The phylogenetic matrix was analyzed using maximum parsimony (MP) and maximum likelihood (ML) optimality criteria. Parsimony analysis was performed in PAUP* 4.0b10 using equal weights. Node support was evaluated using the nonparametric bootstrap and jackknife methods in PAUP*. Pairwise phylogenetic congruence across all partitions was tested using the incongruence length difference (ILD) test in PAUP*. Although this measure has been criticized, we chose to use this test conservatively in the context of this study. Branch support measures, such as the Bremer index, partitioned branch support, and hidden branch support, were calculated in ASAP in conjunction with PAUP*.

Maximum likelihood inference was carried out in RAxML 7.0.4 at the AMNH Computational Sciences facility on an 8-way server with 2.2 GHz AMD Opteron 846 processors and 128 GB RAM using the fine-grained parallel Pthreads (POSIX Threads Library) implementation, and on the CIPRES cluster using the MPI (Message Passing Interface) implementation. The best-fitting substitution model was selected in ProtTest by comparing the log-likelihood scores of the candidate models. The JTT model yielded the highest likelihood score and was therefore used in ML inference, with empirical amino acid frequencies calculated directly from the data at hand. Among-site rate heterogeneity was accounted for using the CAT approximation model with 25 site rate categories.
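The parsimony support measures named above (Bremer index, partitioned branch support, hidden branch support) reduce to differences in tree length. The sketch below is not ASAP's code. It only illustrates the arithmetic under the commonly used definitions, which the text does not spell out and which are assumed here: Bremer support is the extra length of the shortest tree lacking a node, partitioned branch support splits that difference by partition, and partitioned hidden branch support is a partition's share in the combined analysis minus the Bremer support it gives the same node when analyzed alone. All partition names and tree lengths are hypothetical.

```python
from typing import Dict


def bremer_support(length_optimal: int, length_without_node: int) -> int:
    """Extra steps required by the shortest tree that lacks the node."""
    return length_without_node - length_optimal


def partitioned_branch_support(
    optimal: Dict[str, int], without_node: Dict[str, int]
) -> Dict[str, int]:
    """Per-partition length difference between the two combined-analysis trees."""
    return {name: without_node[name] - optimal[name] for name in optimal}


def partitioned_hidden_branch_support(
    pbs: Dict[str, int], separate_bs: Dict[str, int]
) -> Dict[str, int]:
    """PBS minus the Bremer support each partition yields when analyzed alone."""
    return {name: pbs[name] - separate_bs[name] for name in pbs}


# Hypothetical lengths (in steps) of each partition on the shortest combined
# tree and on the shortest combined tree that lacks the node of interest.
optimal = {"geneA": 100, "geneB": 80, "geneC": 60}
without_node = {"geneA": 104, "geneB": 79, "geneC": 62}
# Hypothetical Bremer support for the same node in separate analyses
# (negative when a partition on its own contradicts the node).
separate_bs = {"geneA": 3, "geneB": -2, "geneC": 1}

bs = bremer_support(sum(optimal.values()), sum(without_node.values()))
pbs = partitioned_branch_support(optimal, without_node)
phbs = partitioned_hidden_branch_support(pbs, separate_bs)

print(bs)    # 5
print(pbs)   # {'geneA': 4, 'geneB': -1, 'geneC': 2}
print(phbs)  # {'geneA': 1, 'geneB': 1, 'geneC': 1}
```

Under these definitions the partitioned values sum to the Bremer support of the node, and the partitioned hidden values sum to the node's hidden branch support (here 3).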
Node support for the ML tree was quantified with 1625 rapid bootstrap pseudo-replicates as implemented in the parallel versions of RAxML. To explore the effect of outgroup choice on tree topology, we performed a series of searches with different combinations of ingroup and outgroup taxa. These manipulations are summarized in . We also explored the effect of missing taxa on the overall phylogenetic hypothesis by measuring the amount of branch support (BS) and partitioned hidden branch support (PHBS) for trees generated by serial nested additions of ingroup taxa (3–11). This analysis involved serially adding partitions with up to 3 taxa, then up to 4 taxa, and so on, so that the matrix kept expanding as partitions with more taxa were added. […]
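The partition filter and the nested series described in this protocol can be sketched as follows. This is a minimal illustration, not the authors' pipeline code. The function names and the example partitions are hypothetical, taxa are abbreviated to genus names, all five listed outgroup species are treated as outgroups, and two assumptions are made that the text does not state: the nested series is built from partitions that already passed the filter, and "up to k taxa" counts ingroup taxa only.

```python
from typing import Dict, Iterable, List, Set

ANGIOSPERMS = {"Arabidopsis", "Oryza", "Amborella", "Vitis", "Populus"}
GYMNOSPERMS = {"Cycas", "Zamia", "Ginkgo", "Gnetum", "Welwitschia",
               "Cryptomeria", "Pinus"}
OUTGROUPS = {"Selaginella", "Adiantum", "Marchantia", "Physcomitrella",
             "Chlamydomonas"}
INGROUP = ANGIOSPERMS | GYMNOSPERMS


def passes_filter(taxa: Set[str]) -> bool:
    """At least 4 taxa, >= 2 gymnosperms, >= 1 angiosperm, >= 1 outgroup."""
    return (
        len(taxa) >= 4
        and len(taxa & GYMNOSPERMS) >= 2
        and len(taxa & ANGIOSPERMS) >= 1
        and len(taxa & OUTGROUPS) >= 1
    )


def nested_series(
    partitions: Dict[str, Set[str]], sizes: Iterable[int] = range(3, 12)
) -> Dict[int, List[str]]:
    """For each k, list retained partitions with at most k ingroup taxa.

    Each step contains the previous one, so the matrix only grows.
    """
    kept = {name: taxa for name, taxa in partitions.items()
            if passes_filter(taxa)}
    return {
        k: sorted(name for name, taxa in kept.items()
                  if len(taxa & INGROUP) <= k)
        for k in sizes
    }


# Hypothetical ortholog groups, each given as the set of taxa it contains.
partitions = {
    "og1": {"Arabidopsis", "Pinus", "Ginkgo", "Physcomitrella"},
    "og2": {"Arabidopsis", "Oryza", "Vitis", "Selaginella"},  # no gymnosperm
    "og3": {"Arabidopsis", "Oryza", "Pinus", "Cycas", "Zamia", "Marchantia"},
    "og4": {"Pinus", "Ginkgo", "Cycas"},                      # too few taxa
}

series = nested_series(partitions)
print(series[3])  # ['og1']
print(series[5])  # ['og1', 'og3']
```

Because the filter already requires two gymnosperms and one angiosperm, no retained partition has fewer than three ingroup taxa, which matches the lower bound of the 3–11 series.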

Pipeline specifications