Computational protocol: The Tree of Life and a New Classification of Bony Fishes

Similar protocols

Protocol publication

[…] Molecular data and taxonomic sampling This study is the main product of the Euteleost Tree of Life Project (EToL). A total of 21 molecular markers with a genome-wide distribution were examined, the majority of which were developed by EToL using a genomic screen pipeline. This pipeline compared the Danio rerio and Takifugu rubripes genomes to identify single-copy genes with long exons (>800 bp) and divergence levels suggesting they evolve at rates appropriate for phylogenetic resolution among distantly related taxa. Exons markers were sequenced from 11 nuclear genes previously published by our group (kiaa1239, ficd, myh6, panx2, plagl2, ptchd4 (=ptr), ripk4, sidkey, snx33 (=sh3px3), tbr1b (=tbr1), and zic1) and three additional markers, including one intron (hoxc6a) and two exons (svep1, and vcpip), were newly developed for this study using the same approach. Sequence data from seven additional markers, including EToL markers (enc1, gtdc2 (=glyt), and gpr85 (=sreb2)) or markers developed by others (16S mtDNA, rag1, rag2, and rh), were generated for our previous studies (e.g., , , , , , ) or obtained from NCBI, Ensembl, or other genomic databases.A total of 1184 bony fish taxa were initially targeted for this study and samples were primarily obtained from the tissue repository of the Ichthyology Collection at University of Kansas (1129 samples) or other collections. Of the initial list, samples for 18 taxa either failed to amplify or belonged to duplicate species that were ultimately combined or discarded. Sixty taxa that produced sequence data for one or two genes only were also discarded. Twenty-five additional taxa were excluded from the final matrix because they had low genetic coverage and highly variable phylogenetic placement in preliminary analyses, as identified using bootstrap trees obtained with RAxML v7.3 and the RogueNaRok server . Our final sampling thus included 1081 taxa and sequence data from 335 additional taxa were obtained from previous EToL studies (e.g., , , , , , ) or public databases (Table S1). In order to minimize missing data, some sequences retrieved from public databases were combined as genus-level composite taxa (52 taxa). DNA extraction, amplification protocols via nested PCR, and primers followed previous studies (e.g., , , , , , ). Primer sequences and optimized PCR conditions used for the three new markers is presented in Table 1. The PCR amplicons obtained were submitted for purification and sequencing in both directions to High Throughput Sequencing Solutions ( or other core facilities.Fish diversity is represented in the phylogenetic data matrix by a sample of 1410 bony fish species (of ca. 31000) plus four tetrapod species and two chondrichthyan outgroups (total 1416 terminals). The taxonomic sampling of bony fishes consists of 1093 genera (of ca. 4300), 369 families (of 502; see below), and all traditionally recognized orders (e.g.,). Our taxonomic sampling emphasizes representation of percomorph groups, with 1037 (of >15000) species in 201 families. All scientific names were checked against the Catalog of Fishes . A complete list of material examined is given in Table S1. Sequence alignment and phylogenetic analyses Contigs were assembled from forward and reverse sequences using CodonCode Aligner v3.5.4 (CodonCode Corporation), Sequencher v4 (Gene Codes Corporation), or Geneious Pro v4.5 (Biomatters Ltd.). Exon markers were aligned individually based on their underlying reading frame in TranslatorX using the MAFFT aligner. The hoxc6a and 16S sequences were aligned with MAFFT v6.9 using 1000 iterations and the genafpair algorithm. Because nested PCR is highly prone to cross-contamination, we vetted the data by visually inspecting individual gene trees estimated with the Geneious Tree Builder algorithm in Geneious. To qualitatively assess gene-tree congruence, the final gene alignments were analyzed under maximum likelihood (ML) in RAxML using ten independent runs for each; exon alignments were partitioned by codon position. Alternative approaches to analyze combined data based on species-tree methods that account for gene-tree heterogeneity due to lineage sorting (e.g., , , , ) could not be applied to this dataset due to high proportion of missing data (see Results). Individual genes were concatenated using SequenceMatrix v1.7.8 or Geneious. Two datasets were assembled and analyzed separately, one including all 1416 taxa with sequence data from three genes or more (3+ dataset) and a subset including 1020 taxa with sequence data from seven genes or more (7+ dataset). Analyses of the 3+ dataset were performed under maximum likelihood (ML) using two partitioning schemes, a simple one determined arbitrarily with 5 data partitions (3 codon positions across all exons plus 16S and hoxc6a), and a more complex scheme with 24 partitions (a combination of codon positions and individual genes plus 16S and hoxc6a) indicated by PartitionFinder. To make the PartitionFinder analysis scalable, a representative subset of 201 taxa was run under the Bayesian Information Criterion. The 7+ dataset was analyzed with the 24-partition scheme only. Analyses for both datasets and partition schemes were conducted in RAxML using 30 independent replicates under the GTRGAMMA model. Nodal support was assessed using the rapid bootstrapping algorithm of RAxML with 1000 replicates estimated under the GTRCAT model, and the collection of sample trees was used to draw the bibartition frequencies on the optimal tree. All RAxML analyses were conducted in the CIPRES portal v3.1.For comparison purposes, the 3+ dataset was also analyzed under implied-weighted parsimony. The optimal tree search and bootstrap trees were set to run independently. Gaps were treated as missing characters and all parsimony uninformative characters were ignored. A relatively mild value of k (20) was chosen arbitrarily due to computational limitations to explore sensitivity of the nodes to other weighting functions. Tree searches were performed in TNT 1.1 using a driven-search strategy combining the following tree-search algorithms: ratchet, drift, sectorial searches and tree fusion. The exhaustiveness of the search parameters was self-adjusted every 2 hits of the current best score. To maximize tree-space exploration, the final searches implemented tree-bisection-reconnection (TBR). A strict consensus of nine equally optimal trees (length 407187 steps; fit 7309.19) was computed. Bootstrap search strategies were relaxed to ten random addition sequences and TBR, saving only one tree per replicate (1000 replicates); bootstrap bipartition frequencies were drawn on the consensus tree. Divergence time estimates Time-tree estimation in a Bayesian framework using the complete dataset was computationally infeasible. Thus, we selected a subset of 202 taxa for 18 genes that had representation of: (i) all major bony fish lineages, (ii) lineages encompassing the nodes in which the assignment of fossil calibrations is most informative, (iii) taxa with the highest genetic coverage to minimize missing data in the data matrix (the markers vcpip, svep1, hoxc6a, including a high proportion of missing data, were also excluded). Divergence times were estimated in BEAST v1.7 using the uncorrelated log-normal (UCLN) clock-model. Sixty calibration points were selected as priors for divergence time estimates, of which 58 are based on previous studies , , , and two (calibrations 45 and 60) are proposed here (Appendix 1). However, the actual BEAST analysis conducted for this study included 59 calibrations only (see details under calibration 60, Appendix 1). A starting chronogram that satisfied all priors (e.g., monophyly and initial divergence times) was generated under penalized likelihood in r8s v1.71 using the RAxML tree. To model branching rates on the tree, a birth-death process was used for the tree prior with initial birth rate = 1.0 and death rate = 0.5. The substitution model was GTR+G with 4 rate classes and the data were partitioned into 4 categories with independent parameter estimation: three codon positions across exons of protein-coding genes plus 16S. Clock and tree priors were linked across partitions. Five replicates of the Markov chain Monte Carlo (MCMC) analyses were each run for 200 million generations, with the topology constrained to that recovered in the phylogenetic analyses of the 3+ dataset (pruned for taxa not included in the subset). Post-run analysis of MCMC log files was assessed using Tracer v. 1.5 and mixing was considered complete if the effective sample size of each parameter was >200 , . Tree files from the five runs were combined in LogCombiner v1.7.4 with the first 10% of trees from each run discarded as burn-in. The maximum clade credibility tree, with means and 95% highest posterior density of divergence times, was estimated with TreeAnnotator v1.6.1. The complete tree with 1416 taxa was time-calibrated under penalized likelihood (PL) with treePL. The PL model, which assumes rate autocorrelation, has been shown to perform poorly in simulation studies resulting in high stochastic error of divergence time estimates. To ameliorate this problem, mean highest posterior density estimates of clade ages obtained with the subset in BEAST were imposed as fixed secondary calibrations for the PL analysis, rather than using primary calibrations with minimum and maximum age constrains. A total of 126 secondary calibrations were used for this analysis, including the ages obtained for all major groups in the tree as well as the nodes near which primary calibrations were defined. The rate smoothing parameter was set to 10 based on the cross-validation procedure and the χ2 test in treePL (four smoothing values between 1 and 1000 were compared). […]

Pipeline specifications