Computational protocol: Phylogenetic position of a whale-fall lancelet (Cephalochordata) inferred from whole mitochondrial genome sequences

Protocol publication

[…] Four Asymmetron, one Epigonichthys, and three Branchiostoma species were analyzed phylogenetically on the basis of surveyed mtDNA sequence data, including whole mitogenomes of known species (Asymmetron sp. B, AB110092; E. maldivensis, AB110093; B. floridae, AF098298; B. lanceolatum, AB194383; B. belcheri, AB083384 [Matsuzaki et al., unpublished data]). An enteropneust, Balanoglossus carnosus (AF051097), a cyclostome, Petromyzon marinus (U11880), and the small-spotted catshark, Scyliorhinus canicula (X16067), were chosen as outgroups. Urochordates were excluded from the present analysis because their mitogenome sequences differ remarkably from those of other chordates, supposedly because of a rapid evolutionary rate in the mitogenome.

The DNA sequences for the 11 species were edited and analyzed with EditView ver. 1.0.1, AutoAssembler ver. 2.1 (Applied Biosystems), and DNASIS ver. 3.2 (Hitachi Software Engineering Co. Ltd.). Protein-coding genes were aligned at the amino acid level, and tRNA genes were aligned using secondary structure models. Because a strictly secondary-structure-based alignment of the two rRNA genes was impractical for the large dataset, we employed machine alignment instead, with the aim of minimizing erroneous assessment of positional homology in the rRNA molecules. The two rRNA gene (rrnL and rrnS) sequences were initially aligned using CLUSTAL X ver. 1.81. Each primary alignment was then realigned using ProAlign ver. 0.5, and only those regions with posterior probabilities ≥70% were used in the phylogenetic analyses. This threshold seemed to remove all ambiguously aligned regions effectively.
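The posterior-probability filter applied to the rRNA alignments can be illustrated with a minimal Python sketch. It assumes the per-column posterior probabilities have already been extracted into a plain list of floats; the function name `filter_columns`, the input layout, and the toy sequences are hypothetical and do not reflect ProAlign's actual output format.

```python
from typing import Dict, List, Sequence


def filter_columns(
    alignment: Dict[str, str],
    posteriors: Sequence[float],
    threshold: float = 0.70,
) -> Dict[str, str]:
    """Keep only alignment columns whose posterior probability is >= threshold.

    `alignment` maps a taxon name to its aligned sequence (all the same length);
    `posteriors` holds one probability per column (hypothetical input layout).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(posteriors)}:
        raise ValueError("every sequence must have one posterior value per column")

    # Indices of the columns that pass the reliability threshold.
    keep: List[int] = [i for i, p in enumerate(posteriors) if p >= threshold]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy data, invented for illustration only.
toy = {"taxon_a": "ACGT-A", "taxon_b": "ACGTTA"}
toy_post = [0.99, 0.95, 0.40, 0.70, 0.10, 0.88]

filtered = filter_columns(toy, toy_post)
print(filtered)
```

Columns 3 and 5 of the toy alignment fall below 0.70 and are dropped; a column sitting exactly at the threshold is retained, matching the "≥70%" criterion in the text.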
Ambiguous alignment regions, such as the 5' and 3' ends of several protein-coding genes and the loop regions of several tRNA genes, were excluded, leaving a total of 12,497 nucleotide positions (10,059, 1,275, and 1,163 positions for protein-coding, tRNA, and rRNA genes, respectively) available for phylogenetic analyses. Two datasets were used in our analyses: dataset #1, concatenated nucleotide sequences from 13 protein-coding, 22 tRNA, and two rRNA genes (12,497 positions in total); and dataset #2, concatenated amino acid sequences from 13 protein-coding genes plus nucleotide sequences from 22 tRNA and two rRNA genes (5,791 positions in total). [...]

Maximum-likelihood (ML) analysis of dataset #1 was performed with PAUP* 4.0b10 under a transversion model with gamma correction and an invariable-site assumption (TVM + I + Γ), which was chosen as the best-fitting model for the present case on the basis of hierarchical likelihood ratio tests in Modeltest 3.6. The base frequencies were estimated to be A = 0.2940, C = 0.2233, G = 0.1598, and T = 0.3230. The substitution rates were A-C = 0.9657, A-G = 8.4537, A-T = 1.3911, C-G = 1.6808, C-T = 8.4537, and G-T = 1.0000. The assumed proportion of invariable sites was 0.1312, and the gamma distribution shape parameter was 0.4086. The heuristic search option of PAUP* was used to obtain the ML tree. The robustness of each internal branch of the estimated ML tree was evaluated with 100 bootstrap replications.

Partitioned Bayesian inference (BI) phylogenetic analysis was performed with MrBayes version 3.1.2. Five partitions were set for dataset #1 (1st, 2nd, and 3rd codon positions, tRNA genes, and rRNA genes) and three for dataset #2 (amino acid sequences of the 13 protein-coding genes, tRNA genes, and rRNA genes). The general time reversible (GTR) model with gamma correction and an invariable-site assumption was used in the analysis of dataset #1 and for the tRNA and rRNA genes of dataset #2.
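To make the relationship between the reported TVM parameters and the GTR family concrete, the following Python sketch builds a normalized instantaneous rate matrix from the base frequencies and substitution rates quoted above. It is an illustration under stated assumptions, not the PAUP* or MrBayes implementation: the helper names (`is_tvm`, `exchangeability`, `rate_matrix`) are hypothetical, the published frequencies are renormalized because they sum to 1.0001 after rounding, and the invariable-site and gamma components are left out.

```python
BASES = "ACGT"

# Values quoted in the text for dataset #1 (rounded as published).
FREQ = {"A": 0.2940, "C": 0.2233, "G": 0.1598, "T": 0.3230}
RATE = {
    ("A", "C"): 0.9657, ("A", "G"): 8.4537, ("A", "T"): 1.3911,
    ("C", "G"): 1.6808, ("C", "T"): 8.4537, ("G", "T"): 1.0000,
}


def is_tvm(rate):
    """TVM is GTR with the two transition rates (A-G and C-T) constrained equal."""
    return rate[("A", "G")] == rate[("C", "T")]


def exchangeability(rate, x, y):
    """Look up the symmetric exchangeability for an unordered base pair."""
    return rate[(x, y)] if (x, y) in rate else rate[(y, x)]


def rate_matrix(rate, freq):
    """Return (Q, pi): a GTR-form rate matrix scaled to one expected change per unit time."""
    total = sum(freq.values())
    pi = {b: f / total for b, f in freq.items()}  # renormalize rounded frequencies

    # Off-diagonal entries: q_xy = exchangeability(x, y) * pi_y.
    q = {x: {y: exchangeability(rate, x, y) * pi[y] for y in BASES if y != x}
         for x in BASES}
    # Diagonal entries make each row sum to zero.
    for x in BASES:
        q[x][x] = -sum(q[x].values())

    mean_rate = -sum(pi[x] * q[x][x] for x in BASES)
    scaled = {x: {y: q[x][y] / mean_rate for y in q[x]} for x in BASES}
    return scaled, pi


Q, PI = rate_matrix(RATE, FREQ)
print("TVM constraint satisfied:", is_tvm(RATE))
for x in BASES:
    print(x, [round(Q[x][y], 4) for y in BASES])
```

Because the A-G and C-T rates are identical, these parameter values lie inside the GTR parameter space, which is why a GTR + I + Γ analysis can accommodate a TVM-like fit when TVM itself is unavailable.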
As mentioned above, TVM + I + Γ was chosen as the best-fitting model for the present case. However, the TVM model is a special case of the GTR model and is not yet implemented in MrBayes. Therefore, the GTR model (GTR + I + Γ) was used in these analyses. The mtREV model with gamma correction and an invariable-site assumption (mtREV + I + Γ) was used for the protein-coding genes of dataset #2. This model was selected by MrBayes as the best-fit model of amino acid substitution. Model parameter values were treated as unknown and were estimated in each analysis. Random starting trees were used, and analyses were run for one million generations, sampling every 100 generations. Bayesian posterior probabilities were then calculated from the sample points taken after the Markov chain Monte Carlo (MCMC) algorithm began to converge. To ensure that our analyses were not trapped in local optima, four independent MCMC runs were performed. Topologies and posterior clade probabilities from the different runs were compared for congruence.

Maximum parsimony (MP) analysis of dataset #1 was performed using PAUP* 4.0b10. Heuristic MP analyses were conducted with TBR (tree bisection-reconnection) branch swapping and 100 random addition sequences. All phylogenetically uninformative sites were ignored. The robustness of each internal branch of the estimated MP tree was evaluated with 1,000 bootstrap replications. [...]

Divergence times were estimated with the penalized likelihood (PL) and nonparametric rate smoothing (NPRS) methods. Molecular clock approaches were not used because the two-cluster test (LINTREE) detected substantial rate heterogeneity among lancelet lineages. Because lancelets lack a useful fossil record, calibration points for dating were taken from previous molecular analyses. The PL approach, based on the BI tree (dataset #2), was performed with r8s 1.71.
All r8s analyses used the truncated Newton (TN) algorithm and the additive rate penalty function. All analyses were reoptimized 1,000 times (set_num_restarts = 1,000) to avoid entrapment in a local optimum. The optimal smoothing parameter (121) was estimated using cross-validation. The divergence times between Cephalochordata and Vertebrata (+ Urochordata) (891 Mya) and between Agnatha (Cyclostomata) and Gnathostomata (Chondrichthyes) (652 Mya) were used as the ages of the two calibration points. The NPRS approach, based on the ML tree, was performed with TreeEdit 1.0. As a reference point for dating, the divergence time between Asymmetron and the other genera (162 Mya) was used as the age of the root node. […]
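The idea behind penalized likelihood with an additive rate penalty can be sketched in Python as follows. This is a simplified illustration, not r8s code: it scores a fixed set of branch rates and durations as a Poisson log-likelihood of branch substitution counts minus a smoothing parameter times the squared rate differences between each branch and its parent branch. The published method handles branches adjacent to the root with a separate term, which is omitted here. All function names and the three-branch toy tree are hypothetical; only the smoothing value of 121 comes from the text.

```python
import math
from typing import Dict, Optional


def poisson_loglik(subs: Dict[str, int], rates: Dict[str, float],
                   durations: Dict[str, float]) -> float:
    """Log-likelihood of observed substitution counts, one Poisson term per branch."""
    total = 0.0
    for branch, x in subs.items():
        mean = rates[branch] * durations[branch]  # expected substitutions on the branch
        total += x * math.log(mean) - mean - math.lgamma(x + 1)
    return total


def additive_penalty(rates: Dict[str, float],
                     parent: Dict[str, Optional[str]]) -> float:
    """Sum of squared rate differences between each branch and its parent branch."""
    return sum((rates[b] - rates[p]) ** 2
               for b, p in parent.items() if p is not None)


def penalized_objective(subs, rates, durations, parent, smoothing: float) -> float:
    """Quantity to maximize: log-likelihood minus smoothing * roughness penalty."""
    return (poisson_loglik(subs, rates, durations)
            - smoothing * additive_penalty(rates, parent))


# Toy three-branch tree (invented values, not taken from the paper).
SUBS = {"stem": 12, "left": 5, "right": 9}
DURATIONS = {"stem": 100.0, "left": 50.0, "right": 50.0}
PARENT = {"stem": None, "left": "stem", "right": "stem"}

clock_rates = {"stem": 0.10, "left": 0.10, "right": 0.10}
relaxed_rates = {"stem": 0.12, "left": 0.10, "right": 0.18}

for label, rates in (("clock-like", clock_rates), ("relaxed", relaxed_rates)):
    score = penalized_objective(SUBS, rates, DURATIONS, PARENT, smoothing=121.0)
    print(label, round(additive_penalty(rates, PARENT), 6), round(score, 4))
```

A large smoothing value pushes the optimum toward clock-like rates, while a small one lets rates vary freely between branches; cross-validation, as described above, is the means of choosing between these extremes.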

Pipeline specifications

Software tools: Clustal X, ProAlign, PAUP*, Modeltest, MrBayes, r8s
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Asymmetron inferum, Physeter catodon