Computational protocol: Patterns of kinesin evolution reveal a complex ancestral eukaryote with a multifunctional cytoskeleton

Protocol publication

[…] Predicted protein datasets were obtained for 45 diverse eukaryotes for which complete or near-complete genome sequence data are publicly available. Additional file provides a comprehensive list of sources and versions for these datasets. From these datasets, we extracted complete kinesin repertoires using HMMERv2.3.2 [] to find all predicted proteins with a match to the Pfam 'kinesin motor domain' profile (PF00225; []). In total, 1624 sequences matched the kinesin motor model at or above the 'gathering threshold' (score = -135; expectation value < 2 × 10⁻⁴). However, for phylogenetic reconstructions, highly divergent sequences cause problems with both sequence alignment and tree inference [], and we found that inclusion of the most divergent kinesin sequences hindered tree reconstruction (data not shown). For this reason, the 166 most divergent sequences, those with scores < 100 (expectation value > 10⁻²⁵), were excluded from phylogenetic analyses (Additional file ). The remaining 1458 sequences were trimmed to 80 aa either side of the kinesin motor domain (as defined by the Pfam model) and the motor domains were aligned using MAFFT6.24 [], adopting the E-INS-i strategy []. This alignment was then trimmed to well-aligned blocks (330 characters), and we reduced redundancy in the dataset by removing 195 sequences from duplicated genes that encode proteins predicted to be identical or nearly identical (>95% identity at the amino acid level) to other sequences from the same organism. Both untrimmed and trimmed alignments are available in Additional file and , respectively.

Bayesian phylogenies were inferred from the protein alignment using the Metropolis-coupled Markov chain Monte Carlo (MCMCMC) method as implemented in the program MrBayes3.1.2 []. The WAG substitution matrix was used [] with gamma-distributed variation in substitution rate approximated by 4 discrete categories and a shape parameter estimated from the data (mean α = 0.927).
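As an illustration only, the score filter and flank trimming applied to the HMMER hits before alignment could be sketched as follows. The `MotorHit` record, its field names, and the `select_and_trim` function are hypothetical and are not taken from the original pipeline; only the score cutoff of 100 and the 80 aa flank come from the text. Domain coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class MotorHit:
    """One protein with a match to the kinesin motor profile (hypothetical record)."""
    seq_id: str
    sequence: str
    score: float    # bit score of the profile match
    dom_start: int  # first residue of the motor domain, 1-based inclusive (assumed)
    dom_end: int    # last residue of the motor domain, 1-based inclusive (assumed)


def select_and_trim(hits: Iterable[MotorHit],
                    min_score: float = 100.0,
                    flank: int = 80) -> Dict[str, str]:
    """Drop hits scoring below min_score; cut the rest to the domain plus flanks."""
    trimmed: Dict[str, str] = {}
    for hit in hits:
        if hit.score < min_score:
            continue  # too divergent for alignment and tree inference
        # Convert to 0-based slice bounds and clamp to the sequence ends.
        start = max(hit.dom_start - 1 - flank, 0)
        end = min(hit.dom_end + flank, len(hit.sequence))
        trimmed[hit.seq_id] = hit.sequence[start:end]
    return trimmed
```

Clamping to the sequence ends means that proteins with fewer than 80 residues outside the motor domain are simply kept whole on that side. The counts in the text are consistent with this two-stage reduction: 1624 hits minus 166 low-scoring sequences leaves 1458, and removing 195 near-identical duplicates leaves the 1263 sequences used for tree inference.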
Ten runs were performed, each consisting of 4 Markov chains heated to a 'temperature' of 0.2 and run for 12,000,000 generations. All runs were initiated from a starting tree inferred from BLASTp scores as described in [], a strategy that gave significantly better stationary-phase tree likelihoods than starting trees inferred by either maximum parsimony or neighbor-joining (data not shown). Chains were sampled every 8,000 generations. Two runs, which did not reach apparent stationary phase by halfway through the run, were discarded. For each of the remaining 8 runs, the first 6,400,000 generations were discarded as burn-in and the remaining generations were used to construct the majority-rule consensus tree shown in Additional file . [...] Since the scale of the phylogenetic analysis (1263 sequences) made bootstrap replication unfeasible, we tested the level of support for the inferred topology using the approximate Likelihood Ratio Test (aLRT) method of Anisimova and Gascuel []. Both non-parametric Shimodaira-Hasegawa-like (SH) and parametric χ²-based p-values were generated using the aLRT implementation in PhyML 3.0 [] with the LG substitution matrix []. It is likely that both aLRT methods provide a better estimate of branch support than do Bayesian posterior probabilities. aLRT methods directly test the inferred topology by comparing it to an alternative topology in which each node has been systematically collapsed. In contrast, Bayesian methods rely on adequate sampling of the posterior distribution of topologies to provide a good estimate of the posterior probabilities. Because our dataset is highly complex and the tree topology was calculated from a very large MCMCMC search, the trees sampled for the consensus tree will include numerous trees with slight variations in topology by virtue of stochastic error within the MCMCMC sampling procedure.
This has the effect of increasing the frequency of recovery of low posterior probabilities in large and complex datasets, as is evident when compared to the results of the aLRT topology assessment methods (Additional file ). Kinesin families (K1-20) were defined as encompassing all sequences within the most basal clans having p > 0.95 support in both aLRT tests. To test the effect of a change in amino acid substitution matrix, we repeated the aLRT test using the WAG [] and JTT matrices []. Of the 485 nodes recovered in the phylogenetic analysis that were supported with p > 0.95 in both the χ²- and SH-based approximate likelihood ratio tests using the LG matrix, 461 (94.5%) and 463 (94.9%) were recovered with p > 0.95 for both tests when using the WAG or JTT matrix, respectively, demonstrating that a change in matrix had a relatively minor effect on the clade support values used to classify kinesin paralogues.

Unsurprisingly, the proportion of sequences falling into one of the well-supported kinesin families decreases as the 'quality' (as assessed by Pfam score) of the kinesin motor domain decreases (Additional file ). This implies that a large proportion of the highly divergent kinesin motors excluded from tree inference do not belong to established kinesin paralog families, and it is unlikely that large numbers of bona fide family members were excluded from our analysis. [...] We used all 1624 sequences identified from the HMMER search as separate search seeds for PfamA [] and CDD [] searches in order to identify the presence and relative order of conserved protein domains. The results of the two protein architecture searches were compared, noting the relative position of the domains within the amino acid sequence. Using these comparisons, consensus putative domain architectures were identified for each protein sequence.
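The comparison of the two domain searches can be pictured with a small sketch. The text does not state the exact rule used to reach a consensus, so the rule below (keep only domains reported by both searches, and accept the result only if both searches place them in the same order) is an assumption made for illustration. The function names, the hit tuples, and the premise that Pfam and CDD domain names have already been mapped to a shared vocabulary are likewise hypothetical.

```python
from typing import List, Optional, Tuple

# (domain name, start, end) along the protein; a hypothetical hit format
DomainHit = Tuple[str, int, int]


def architecture(hits: List[DomainHit]) -> List[str]:
    """Domain names ordered from N- to C-terminus by start position."""
    return [name for name, _start, _end in sorted(hits, key=lambda h: h[1])]


def consensus_architecture(pfam_hits: List[DomainHit],
                           cdd_hits: List[DomainHit]) -> Optional[List[str]]:
    """Return the ordered domains both searches agree on, or None on conflict.

    Assumed rule: a domain counts only if both searches report it, and the
    two searches must give the shared domains in the same relative order.
    """
    pfam_order = architecture(pfam_hits)
    cdd_order = architecture(cdd_hits)
    shared = set(pfam_order) & set(cdd_order)
    pfam_shared = [d for d in pfam_order if d in shared]
    cdd_shared = [d for d in cdd_order if d in shared]
    return pfam_shared if pfam_shared == cdd_shared else None
```

Returning `None` on disagreement, rather than picking one search over the other, leaves such proteins for manual inspection.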
All architecture types were mapped onto our comprehensive phylogeny in order to identify the phylogenetic distribution of the protein architectures (Additional file ). Kinesin protein architectures specific to paralog families or specific phylogenetic clusters were judged to be the product of a single protein domain rearrangement or domain acquisition event (Additional file ; see Additional file for exclusions). We identified several kinesin domain architectures that include domains present in a low number of distantly related genomes or for which the kinesin motor domains belong to distantly related paralog families. In these cases, we conducted further analysis to investigate whether these sequences were composed of domains related by either convergence or vertical inheritance, or whether the domain classification was artifactual. For each candidate domain architecture marked 'd' on Figure , functional and annotation data were accessed from Pfam and CDD [,], and domain alignments were made using MUSCLE and manually edited using the SEAVIEW alignment platform [,]. Eleven cases of domain classification for which no good evidence of homology could be found were either excluded as likely artifacts or adjusted for taxon distribution as appropriate (Additional file ). SAM1 and SAM2 domains are homologous and were classified as one domain for the purposes of this study (Additional file ). [...] To investigate the minimum complement of kinesin forms present in the common ancestor of all 45 genomes sampled, we coded the presence and absence of kinesin families (marked 'c' Figure ) and reliable protein architectures (marked 'c' Figure ) as binary characters. In both cases we were careful to include only characters that were strongly suggested to be monophyletic by the phylogenetic analysis, allowing for some secondary loss of domain architectures within established kinesin families.
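The binary coding described above amounts to a presence/absence matrix with one row per genome and one column per character. A minimal sketch is given here; the function name, the genome labels, and the character labels are placeholders rather than names from the study.

```python
from typing import Dict, List, Set


def code_binary_characters(repertoires: Dict[str, Set[str]],
                           characters: List[str]) -> Dict[str, List[int]]:
    """Code each character as 1 (present) or 0 (absent) for every genome.

    repertoires maps a genome label to the set of characters (kinesin
    families or domain architectures) found in that genome.
    """
    return {
        genome: [1 if character in found else 0 for character in characters]
        for genome, found in repertoires.items()
    }


# Placeholder labels, for illustration only.
characters = ["family_1", "family_2", "architecture_x"]
repertoires = {
    "genome_A": {"family_1", "architecture_x"},
    "genome_B": {"family_2"},
}
matrix = code_binary_characters(repertoires, characters)
```

Fixing the column order in `characters` keeps the rows comparable across genomes, which is what a parsimony program expects from a character matrix.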
To further ameliorate patterns of secondary loss, we coded the presence and absence of kinesin forms across the 8 higher taxonomic units (marked on Figures and 2) to produce a matrix of 8 'taxa' and 39 characters. We used a Dollo parsimony analysis method [] implemented through Phylip 3.68 [] to assess the ancestral repertoire implied by several alternative eukaryotic topologies, including the best-scoring Dollo parsimony tree topology (see Figure ). To further investigate these alternative topologies we used a second coding of the data; in this case we used only the kinesin subfamilies in Additional file (or kinesin families where no subfamilies had been identified), producing a matrix of 8 taxa and 51 characters. Kinesin family members that did not fall into any of the subfamilies were coded as uncertainty in any absences for the other subfamilies. […]
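To make the Dollo logic concrete, the sketch below reconstructs a single binary character on a fixed rooted tree. It is a simplified illustration, not the Phylip implementation used in the study: the character is assumed to arise once, at the most recent common ancestor of all taxa that carry it, and every subtree below that origin that lacks it entirely counts as one loss. The nested-tuple tree encoding, the function names, and the taxon labels are assumptions. Uncertain states, which the second coding relies on, are not handled.

```python
from typing import Set, Tuple, Union

# A leaf is a taxon name; an internal node is a tuple of child nodes.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(node: Tree) -> Set[str]:
    """Collect the names of all leaves at or below a node."""
    if isinstance(node, str):
        return {node}
    names: Set[str] = set()
    for child in node:
        names |= leaf_set(child)
    return names


def dollo_reconstruct(tree: Tree, present: Set[str]) -> Tuple[bool, int]:
    """Return (present_at_root, number_of_losses) for one binary character."""
    if not present & leaf_set(tree):
        return False, 0  # character never observed: no gain, no loss

    # Walk down from the root while only one child lineage carries the
    # character; the node where this stops is the single point of origin.
    origin = tree
    while not isinstance(origin, str):
        carriers = [c for c in origin if leaf_set(c) & present]
        if len(carriers) != 1:
            break
        origin = carriers[0]

    def losses(node: Tree) -> int:
        """Count maximal carrier-free subtrees below a node that has the character."""
        if isinstance(node, str):
            return 0
        total = 0
        for child in node:
            if leaf_set(child) & present:
                total += losses(child)
            else:
                total += 1  # one loss on the branch leading to this child
        return total

    return origin is tree, losses(origin)
```

Under this scheme a character is inferred for the root, and so for the common ancestor of all sampled taxa, only when it occurs on more than one side of the root split. That is why the inferred ancestral repertoire depends on which of the alternative eukaryotic topologies is assumed.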

Pipeline specifications