Computational protocol: Plastome Rearrangements in the “Adenocalymma-Neojobertia” Clade (Bignonieae, Bignoniaceae) and Its Phylogenetic Implications

Protocol publication

[…] Plastomes were assembled using the Fast-Plast pipeline (McKain and Wilson, unpublished). For each species, adaptors were removed and low-quality sequences were trimmed using Trimmomatic 0.35 (Bolger et al., ) with the SLIDINGWINDOW:10:20 and MINLEN:40 parameters. Trimmed reads were mapped against a database that included C. cujete L., Erythranthe lutea (L.) G.L. Nesom (NC_030212.1; Vallejo-Marín et al., ), Olea europaea L. (NC_013707.2; Messina, unpublished), Sesamum indicum L. (NC_016433.2; Yi and Kim, ), Salvia miltiorrhiza Bunge (NC_020431.1; Qian et al., ), and T. tetragonolobum (Jacq.) L.G. Lohmann using Bowtie2 2.1.0 (Langmead and Salzberg, ) with the default parameters. Mapped reads were assembled into contigs using SPAdes 3.1.0 (Bankevich et al., ) with k-mer sizes of 55 and 87, using the “only-assembler” option. The resulting contigs were assembled with the software afin ( nit/afin) and the default parameters -l 50, -f 0.1, -d 100, -x 100, and -i 2. For species for which comprehensive contigs were harder to obtain, we tested different values for the maximum percentage of mismatches (-g) and the minimum contig overlap (-p). For some species, the de novo assembly returned a large contig that contained the complete plastome. These contigs were checked and finalized with Geneious 9.0.2 (Kearse et al., ). The plastome assembly was verified through a coverage analysis conducted in Jellyfish 2.1.3 (Marçais and Kingsford, ). The estimate of 25-mer abundance was used to map a 25-bp sliding window of coverage across the plastome of each species.
Plastome annotation was initially conducted in DOGMA (Wyman et al., ). These annotations were checked in Geneious 9.0.2 using O. europaea and Solanum lycopersicum L. (NC_007898.3; Daniell et al., ) as references. Promising open reading frames in non-coding regions were verified with BLAST (Altschul et al., ), available at NCBI. Maps of the annotated plastomes were created using OGDRAW (Lohse et al., ).
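The 25-mer coverage check described above can be illustrated with a minimal Python sketch. It is not the Jellyfish implementation: it counts k-mers on the forward strand only, uses a toy k of 4 instead of 25, and the function names `count_kmers` and `kmer_coverage` are hypothetical. The idea is simply to count every k-mer in the reads and then look up the abundance of the k-mer that starts at each position of the assembly.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Count every k-mer in the reads (forward strand only; a simplification)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_coverage(assembly, counts, k=25):
    """Return the read abundance of the k-mer starting at each assembly position."""
    return [counts.get(assembly[i:i + k], 0)
            for i in range(len(assembly) - k + 1)]


# Toy data (hypothetical): k = 4 keeps the example readable.
assembly = "ACGTACGTTAGC"
reads = ["ACGTACGT", "GTACGTTA", "CGTTAGC"]
counts = count_kmers(reads, k=4)
coverage = kmer_coverage(assembly, counts, k=4)

# Positions with zero abundance would flag windows unsupported by any read.
unsupported = [i for i, c in enumerate(coverage) if c == 0]
```

In a profile like this, windows with zero or sharply reduced abundance point to possible misassemblies, while roughly doubled abundance is what one would expect across a collapsed inverted repeat.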
We characterized the overall plastome structure, gene content, and general gene information of the 10 species sampled and compared our results with the information available for two other Bignoniaceae (i.e., C. cujete and T. tetragonolobum) and one Oleaceae (i.e., O. europaea). Points of potential rearrangements and junctions between the IRs, the LSC, and the SSC were tested iteratively using afin ( nit/afin) and checked with PCR amplifications and electrophoresis. Coverage values for these regions were also assessed. [...] We used the LSC, the SSC, and one IR to infer the phylogenetic tree of the “Adenocalymma-Neojobertia” clade. We excluded one IR to avoid duplication of data. We used three chloroplast genomes of members of the Lamiales (C. cujete, O. europaea, and T. tetragonolobum) as outgroups. Pseudogenes and their orthologs were treated as non-coding regions. Genes with overlapping portions were treated as neighbors to avoid character duplication.
For the phylogenetic analyses, annotated plastomes were fragmented into coding and non-coding regions, excluding regions smaller than 50 bp. The retained regions were grouped by sequence similarity (with a threshold of 65% global similarity and default alignment costs) using the annotated plastome of Adenocalymma biternatum (A. Samp.) L.G. Lohmann as reference and considering the pool of regions for all species. Plastome partitioning and sequence grouping were conducted using the R package Biostrings (R Development Core Team, ; Pagès et al., unpublished). Coding regions were aligned with MAFFT 7 (Katoh and Standley, ) using the G-INS-i strategy with 1,000 iterations, while non-coding regions were aligned using the E-INS-i strategy with 1,000 iterations. We removed poorly aligned regions of the coding and non-coding alignments using GBlocks (Castresana, ) with default settings in order to circumvent homology assessment problems due to random similarity of sequences or indels.
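The fragmentation step described above (splitting an annotated plastome into coding and non-coding regions and discarding regions shorter than 50 bp) can be sketched as follows. This is an illustration in Python, not the authors' R/Biostrings code. The function name `fragment_plastome`, the 0-based end-exclusive coordinates, and the spacer naming scheme are assumptions. Overlapping genes are clipped so that they become neighbors, which is one possible reading of the rule stated in the text. Introns and the circularity of the genome are ignored.

```python
def fragment_plastome(seq, genes, min_len=50):
    """Split seq into (kind, name, sequence) regions.

    genes: iterable of (name, start, end), 0-based and end-exclusive (assumed).
    Overlapping genes are clipped so that each base is used only once.
    Regions shorter than min_len are discarded.
    """
    regions = []
    cursor = 0  # first base not yet assigned to a region
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        start = max(start, cursor)  # clip overlap with the previous gene
        if start >= end:
            continue  # gene lies entirely inside the previous one
        if start > cursor:
            spacer = f"spacer_{cursor}_{start}"
            regions.append(("noncoding", spacer, seq[cursor:start]))
        regions.append(("coding", name, seq[start:end]))
        cursor = end
    if cursor < len(seq):
        spacer = f"spacer_{cursor}_{len(seq)}"
        regions.append(("noncoding", spacer, seq[cursor:]))
    return [r for r in regions if len(r[2]) >= min_len]


# Toy data (hypothetical): two overlapping genes on a 300-bp sequence.
seq = "A" * 300
genes = [("geneA", 60, 180), ("geneB", 170, 260)]
regions = fragment_plastome(seq, genes)
names = [name for _, name, _ in regions]
lengths = [len(s) for _, _, s in regions]
```

Here the trailing 40-bp spacer is dropped by the 50-bp cutoff, and `geneB` is clipped to start where `geneA` ends, so no base is counted twice.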
Alignments of non-coding regions with rearrangements were either edited manually or had misaligned sequences removed using the outlier search option implemented in T-Coffee (Notredame et al., ). This was necessary because GBlocks is not able to recognize rare outlier sequences (Castresana, ). Three different partition schemes were built as follows: (1) 91 coding regions (“coding”); (2) 76 introns and spacers with alignments edited by hand (“non-coding edited”); and (3) 76 non-coding regions with poorly aligned sequences removed with T-Coffee (“non-coding filtered”). Combined datasets were also analyzed as follows: (4) 91 coding regions plus the 76 edited non-coding regions (“coding + non-coding edited”); and (5) 91 coding regions plus the 76 filtered non-coding regions (“coding + non-coding filtered”). The five datasets were compared based on tree topology and node support.
All phylogenetic analyses were performed with Maximum Likelihood (ML) using RAxML 8.2.9 (Stamatakis, ), and Bayesian Criteria (BC) using MrBayes 3.2 (Ronquist et al., ). ML node support was estimated through a rapid bootstrap analysis with 1,000 replicates. BC analyses were run using uniform priors and two independent runs of 10 million generations with four chains per run, sampling trees every 1,000 generations. BC support was estimated using posterior probabilities. For BC, chain convergence and stationarity were assessed using the R package Coda (R Development Core Team, ; Plummer et al., unpublished) by visually examining plots of parameter values and log-likelihood against the number of generations. For the Bayesian analysis, we employed the reversible jump strategy (Ronquist et al., ), which does not require the establishment of evolutionary models or partition schemes a priori. For ML, the GTRCAT evolutionary model was used (Stamatakis, ), avoiding pre-defined partitions. [...]
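The combined datasets (4) and (5) above join the per-region alignments into a single matrix. The text does not say how this was done, so the following Python sketch is an assumption: it concatenates alignments taxon by taxon and pads any taxon missing from a region (for example, a sequence removed as an outlier) with gap characters. The function name `concatenate` and the toy data are hypothetical.

```python
def concatenate(alignments):
    """Concatenate per-region alignments into one supermatrix.

    alignments: list of dicts mapping taxon -> aligned sequence; all sequences
    within one dict must have the same length. Taxa absent from a region are
    padded with gaps. Returns (supermatrix, boundaries), where boundaries
    holds the 1-based inclusive column range of each region.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    boundaries = []
    offset = 0
    for aln in alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each alignment needs sequences of equal length")
        length = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * length))
        boundaries.append((offset + 1, offset + length))
        offset += length
    return {taxon: "".join(p) for taxon, p in parts.items()}, boundaries


# Toy data (hypothetical): sp2 was filtered out of the non-coding region.
coding = {"sp1": "ATGC", "sp2": "ATGA"}
noncoding = {"sp1": "TT-A"}
matrix, boundaries = concatenate([coding, noncoding])
```

The returned column ranges are kept only for bookkeeping. The analyses described above avoided pre-defined partitions, so the ranges would not be passed to the tree-inference step.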
Among the 76 introns and spacers recovered, we retained the 31 regions that were recombination-free and of suitable length for PCR amplification (amplicons between 500 and 1,100 bp). These partitions were analyzed to identify highly informative regions that may serve as useful markers for future species-level phylogenetic analyses. ML trees were inferred for each of the 31 partitions using RAxML 8.2.9 and the GTRCAT evolutionary model. For each partition, alignment length, variable sites, topological distance, and branch length distance (Kendall and Colijn, ) were estimated. Metrics were computed using the R packages (R Development Core Team, ) Ape (Paradis et al., ) and Treescape (Jombart et al., unpublished). Partitions were ranked using standardized values of the number of informative characters, as well as the topological and branch length distances between the tree derived from the analysis of each partition and the best tree estimated in this study (i.e., the tree derived from the analysis of the “coding + non-coding edited” dataset; see results). All metrics were computed for the “Adenocalymma-Neojobertia” clade exclusively. Four non-coding regions were selected with Geneious 9.0.2 for primer design. […]
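The ranking of partitions by standardized metrics can be sketched in Python as below. The text does not state how the three standardized values were combined, so the scoring rule here is an assumption: the z-score of informative characters minus the z-scores of the two tree distances, so that informative regions whose trees lie close to the reference tree rank first. The function names, region names, and values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def rank_partitions(metrics):
    """Rank partitions from most to least promising.

    metrics: dict mapping name -> (informative_sites, topological_distance,
    branch_length_distance). More informative sites and smaller distances to
    the reference tree both raise the score (assumed combination rule).
    """
    names = list(metrics)
    info = zscores([metrics[n][0] for n in names])
    topo = zscores([metrics[n][1] for n in names])
    brlen = zscores([metrics[n][2] for n in names])
    score = {n: info[i] - topo[i] - brlen[i] for i, n in enumerate(names)}
    return sorted(names, key=score.get, reverse=True)


# Toy data (hypothetical values, not results from the study).
metrics = {
    "region_A": (40, 2.0, 0.10),
    "region_B": (10, 6.0, 0.50),
    "region_C": (25, 4.0, 0.30),
}
ranking = rank_partitions(metrics)
```

Standardizing first keeps a metric with a large numeric range, such as the count of informative characters, from dominating the combined score.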

Pipeline specifications