Computational protocol: Inferring the evolutionary mechanism of the chloroplast genome size by comparing whole-chloroplast genome sequences in seed plants

Protocol publication

[…] A total of 272 complete or nearly complete chloroplast genomes were collected from NCBI (National Center for Biotechnology Information), including the genomes of five gymnosperm groups, four clades of eudicots (fabids, malvids, lamiids, and campanulids), one major clade of monocots (commelinids), and basal angiosperms (magnoliids). The details (species names, family names, and accession numbers) of the 272 chloroplast genomes are listed in a Supplementary Table.

In 1986, the first complete chloroplast genomes, those of tobacco (Nicotiana tabacum) and liverwort (Marchantia polymorpha), were obtained, and the chloroplast genes were annotated by gene expression. With the expansion of the NCBI database, homology searches with Blastx and Blastn against the GenBank database have been used to annotate chloroplast genes for several years. Consequently, gene names and annotation information are inconsistent among studies. In addition, some hypothetical chloroplast open reading frames (ycfs) or open reading frames (ORFs) whose functions and features have since been identified may not have been updated in previous studies. DOGMA (Dual Organellar GenoMe Annotator) is a web-based annotation package that addresses some of these problems, including typos, incorrect sequences, and incorrect gene names in GenBank. Therefore, the protein-coding, ribosomal RNA (rRNA), and transfer RNA (tRNA) genes of all the collected chloroplast genomes were re-annotated using DOGMA with the default settings. However, because BLAST cannot precisely locate the start and stop codons of protein-coding genes, and because genes with more than one intron were annotated as two genes, the start and stop codons had to be chosen manually. We therefore further corrected the annotation information using our own Perl scripts. [...]

Chloroplast genomes were analyzed at the order and species level.
The dataset covered 45 orders; their phylogenetic relationships were integrated from the previously published phylogenies of Jansen et al., Moore et al., and APG III. For the species tree, maximum likelihood (ML) analyses were performed on datasets of 40 genes to ensure sufficient information for calculating branch lengths. Each individual gene matrix was aligned using T-Coffee and then manually adjusted. We used group-to-group profile alignments, taking advantage of previously recognized phylogenetic relationships, which yielded data matrices with fewer missing data than other methods. We then identified and concatenated alignment clusters of homologous gene regions. ML analysis was conducted using RAxML version 7.0.4 with the PROTGAMMAJTT substitution model and default settings. Support for each node was tested with 1000 bootstrap replicates. The trees were viewed and edited with the TreeExplorer program in MEGA 5.0. [...]

To identify the relationship between chloroplast genome size and all the other characteristics of chloroplast genome sequences shown in a Supplementary Table, we conducted a conventional analysis of variance (ANOVA) to test the differences between genome size and all sequence characteristics, based on cross-species and phylogenetic signal analyses. In the cross-species analysis, the relationship between each characteristic and chloroplast genome size was described by its standardized major axis (SMA; model II regression), without taking phylogeny into account. We estimated the common slope among species using SMA analyses with a likelihood ratio procedure. The smatr package of R was used to perform the SMA analyses. The ANOVAs were carried out using the PDAP package to test whether a significant cross-species association between sequence characteristics and genome size could also be a small-probability event under a random model of Brownian motion evolution.
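The standardized major axis fit used in the cross-species analysis can be illustrated with a minimal sketch. It is not the smatr implementation and omits the likelihood ratio test for a common slope. The function name `sma_fit` and its inputs are assumptions made for illustration. The slope is the ratio of the standard deviations of y and x, signed by their covariance.

```python
import math

def sma_fit(x, y):
    """Return (slope, intercept, r_squared) of a standardized major axis fit.

    Illustrative sketch only; x and y are equal-length sequences of numbers,
    e.g. a sequence characteristic and genome size per species (hypothetical).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    # SMA slope: sd(y) / sd(x), carrying the sign of the covariance.
    # If the covariance is exactly zero the sign is arbitrary (positive here).
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Unlike ordinary least squares, the SMA slope does not shrink toward zero as the correlation weakens, which is why r² is reported alongside it.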
We first used Pdsimul to generate 1000 Monte Carlo simulated datasets, taking the tree topology and branch length information into account (see the phylogenetic analysis section). The F-statistic of the ANOVA for each simulated dataset was computed with pdanova, and the resulting values were compared against the observed F-statistic from the cross-species analysis. If the observed F-statistic was greater than 95% of the simulated values, the relationship between chloroplast genome size and the other characteristic was considered non-random and affected by phylogenetic signal. This analysis was implemented separately for each characteristic. K was the descriptive statistic used to describe the degree of difference between the simulated and observed F-statistic distributions. In brief, the K statistic was the ratio of the observed mean square error derived from a phylogenetically corrected mean to the expected mean square error obtained by considering tree topology and branch length information under a Brownian motion evolution model. K = 1 would denote that closely related species had characteristic values as similar as expected under a Brownian motion evolution model, whereas K < 1 would indicate that the characteristic values showed less phylogenetic signal than expected under that model. Slope estimates and r² from SMA analyses were obtained from our standardized contrasts using pdtree and the R package smatr. In addition, we performed the same likelihood ratio procedure described earlier in this section to test for a common slope in within-group SMA analyses. [...]

The ratio of nonsynonymous to synonymous substitutions (Ka/Ks) for each individual dataset was estimated for each branch of the phylogenetic tree using PAML. A free-ratio model was implemented in PAML, and an independent Ka/Ks value was assumed for each branch.
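The comparison of the observed F-statistic with the simulated distribution, described earlier in this section, can be sketched as follows. This does not call PDAP: the simulated F values stand in for Pdsimul/pdanova output. The function names are hypothetical. Reading "greater than 95% of the simulated data" as "at least 95% of simulated values fall below the observed value" is also an assumption.

```python
def fraction_below(observed_f, simulated_f):
    """Fraction of simulated F-statistics strictly below the observed one."""
    if not simulated_f:
        raise ValueError("simulated_f must not be empty")
    below = sum(1 for f in simulated_f if f < observed_f)
    return below / len(simulated_f)

def exceeds_simulated(observed_f, simulated_f, threshold=0.95):
    """True if the observed F exceeds at least `threshold` of simulated values.

    The 0.95 default mirrors the 95% criterion in the text; how ties and the
    exact boundary are handled here is an assumption, not the PDAP behavior.
    """
    return fraction_below(observed_f, simulated_f) >= threshold
```

With 1000 simulated datasets per characteristic, the observed F would need to exceed at least 950 of the simulated values under this reading.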
Because independent estimation of the Ka/Ks ratio for each branch of the tree was extremely time-consuming, the phylogenetic tree of angiosperms was divided into six monophyletic sub-trees and the phylogenetic tree of gymnosperms into three sub-trees, and each sub-tree was evaluated independently. Only the Ka/Ks values between modern species and their most recent reconstructed ancestors were used in subsequent analyses. Thus, we focused only on the rate of mutation accumulation between homologous gene pairs of modern species and their most recent common ancestors. […]
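Keeping only the Ka/Ks values on branches that lead to modern species can be sketched as a filter over per-branch estimates. The input format is an assumption: a list of `(parent, child, kaks)` tuples extracted beforehand from the free-ratio results. This sketch does not parse PAML output, and the function name is hypothetical.

```python
def terminal_branch_kaks(branches):
    """Map each modern species (leaf) to the Ka/Ks of its terminal branch.

    `branches` is a list of (parent, child, kaks) tuples (hypothetical format).
    A child that never appears as a parent is treated as a leaf, i.e. a
    modern species joined to its most recent reconstructed ancestor.
    """
    parents = {parent for parent, _child, _kaks in branches}
    return {
        child: kaks
        for _parent, child, kaks in branches
        if child not in parents
    }
```

Because each sub-tree was evaluated independently, a filter like this would be applied to each sub-tree's branch list separately and the resulting dictionaries merged.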

Pipeline specifications