Computational protocol: Evolutionary Dynamics of Cryptophyte Plastid Genomes

Protocol publication

[…] Plastid genome assemblies and annotations followed procedures used by . The data were trimmed (i.e., base = 80 bp, error threshold = 0.05, n ambiguities = 2) prior to de novo assembly with the default option (automatic bubble size, minimum contig length = 1,000 bp). The raw reads were assembled using the MIRA4 (http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html) and SPAdes 3.7 (http://bioinf.spbau.ru/spades) assemblers. Raw reads were then mapped to the assembly contigs (similarity = 95%, length fraction = 75%), and regions with no evidence of short-read data were removed (up to 1,000 bp). Contigs were assigned to the plastid genome according to two criteria: 1) BLAST searches of commonly known plastid genes against the entire assembly hit these contigs, and 2) the genome size was consistent with other photosynthetic cryptophyte plastid genomes, which range from 121 Kbp (Guillardia theta NC000926) to 136 Kbp (Rhodomonas salina NC009573). Plastid genome-derived contigs were then manually aligned in the Genetic Data Environment (MacGDE2.5) program () to produce a consensus sequence.
A database of protein-coding, rRNA, and tRNA genes was created using all previously sequenced cryptophyte plastid genomes. Preliminary annotation of protein-coding genes was performed using GeneMarkS (http://opal.biology.gatech.edu/genemarks.cgi). The final annotation file was checked in Geneious Pro 9.1.3 (http://www.geneious.com/) using the ORF Finder with genetic code 11 (Bacterial, Archaeal and Plant Plastid Code). The predicted ORFs were checked manually, and the corresponding ORFs (and predicted functional domains) in the genome sequence were annotated.
To identify tRNA sequences, the plastid genome was submitted to the tRNAscan-SE version 1.21 server (http://lowelab.ucsc.edu/tRNAscan-SE/). The genome was searched with the default settings using the “Mito/Chloroplast” model.
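The ORF check with genetic code 11 described above can be illustrated with a minimal sketch. This is not the Geneious ORF Finder, only a plain six-frame scan written for illustration. The function names, the `min_codons` threshold, and the choice of ATG, GTG, and TTG as start codons (a common subset of the starts allowed by genetic code 11) are assumptions, not settings reported in the protocol.

```python
# Stop codons of genetic code 11.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: a commonly used subset of the code 11 start codons.
START_CODONS = {"ATG", "GTG", "TTG"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100, starts=START_CODONS):
    """Scan all six frames of a linear sequence for start-to-stop ORFs.

    Returns (strand, start, end) tuples with 0-based, end-exclusive
    coordinates on the forward strand; the stop codon is included.
    min_codons counts the codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in starts:
                    orf_start = i  # first start codon opens the ORF
                elif orf_start is not None and codon in STOP_CODONS:
                    if (i - orf_start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, orf_start, i + 3))
                        else:
                            # Map reverse-strand indices back to the forward strand.
                            orfs.append((strand, n - (i + 3), n - orf_start))
                    orf_start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"  # toy sequence, not real data
    print(find_orfs(toy, min_codons=5))
```

The sketch treats the sequence as linear, so an ORF spanning the origin of a circular plastid genome would be missed; candidates found this way would still need the manual check described in the text.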
To identify rRNA sequences, a set of known plastid rRNA sequences was extracted from the published plastid genomes of cryptophytes and used as queries to search the new genome data with BLASTn. We used RNAweasel (http://megasun.bch.umontreal.ca/cgi-bin/RNAweasel/RNAweaselInterface.pl) to determine the types of introns that were present. Physical maps were drawn with the OrganellarGenomeDRAW program (http://ogdraw.mpimp-golm.mpg.de/).
[...] Four published cryptophyte plastid genome sequences (Cryptomonas paramecium CCAP 977/2a, ; Guillardia theta CCMP 2712, ; Rhodomonas salina CCMP 1319, ; Teleaulax amphioxeia HACCP CR01, ) were downloaded from GenBank. An additional plastid genome is available from GenBank under the name of Guillardia theta CCMP 2712 (KT428890, ); however, its gene sequences are very different from those of the previously reported G. theta CCMP 2712 (; ). Therefore, we did not include this genome in our study. For structural and synteny comparisons, the genomes were aligned using Mauve Genome Alignment version 2.2.0 () with default settings. To aid in visualization, we arbitrarily designated the start of the trnY gene, read in the trnY-to-rpl19 direction, as position 1 in each genome.
[...] Phylogenetic analyses were carried out on data sets created by combining 88 proteins encoded by 56 plastid genomes, including those of 8 cryptophytes, 5 haptophytes, 20 stramenopiles, 2 alveolates, and 14 red algae (, Supplementary Material online). The sequences of six Viridiplantae and one glaucophyte species were used as outgroup taxa to root the tree. The data were concatenated (16,878 amino acid sequences) and manually aligned using MacGDE2.5 ().
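The concatenation step can be sketched as follows. This is a minimal illustration, not the authors' procedure (which used manual alignment in MacGDE2.5): it assumes each protein has already been aligned, fills taxa that lack a protein with gap characters, and records partition boundaries. The function name, the input layout, and the toy gene and taxon names are hypothetical.

```python
def concatenate_alignments(alignments, taxa, missing="-"):
    """Join per-protein alignments into one supermatrix.

    alignments: list of (name, {taxon: aligned_sequence}) pairs.
    taxa: ordered list of all taxon labels to include.
    Returns ({taxon: concatenated_sequence}, [(name, first, last), ...])
    with 1-based, inclusive partition coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    offset = 0
    for name, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon without this protein gets an all-gap block.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((name, offset + 1, offset + length))
        offset += length
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


if __name__ == "__main__":
    # Toy input for illustration only.
    toy = [
        ("geneA", {"taxon1": "MTA", "taxon2": "MSA"}),
        ("geneB", {"taxon1": "MK"}),
    ]
    matrix, parts = concatenate_alignments(toy, ["taxon1", "taxon2"])
    print(matrix)
    print(parts)
```

Keeping the partition table alongside the supermatrix makes it possible to check the total alignment length and to trace any column back to its source protein.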
For the RNA operon (16S-trnI-trnA-23S rDNA) phylogeny, the data were concatenated into 4,046 nucleotides from plastid genome sequences in 38 taxa including 8 cryptophytes, 5 haptophytes, 1 rappemonad, 13 stramenopiles, 12 rhodophytes, and 16 outgroup taxa including 2 glaucophytes, 9 chlorophytes, and 5 cyanobacteria.
Maximum likelihood (ML) phylogenetic analyses were conducted using RAxML version 8.0.0 () with the Le and Gascuel gamma (LG + GAMMA) model () for amino acid data chosen by ProtTest 3 () and the general time-reversible plus gamma (GTR + GAMMA) model for nucleotide data. We used 1,000 independent tree inferences using the -# option to identify the best tree. The model parameters with gamma correction values and the proportion of invariable sites in the combined data set were obtained automatically by the program. ML bootstrap support values (MLB) were calculated using 1,000 replicates with the same substitution model. To reduce calculation time, ML phylogenetic trees for lineage-specific genes were inferred using IQ-TREE version 1.5.2 () with 1,000 bootstrap replications (e.g., –S12, Supplementary Material online). Evolutionary models for each tree were automatically selected by the -m LG+I+G option incorporated in IQ-TREE.
Bayesian analyses were run using MrBayes 3.2.6 () with a random starting tree, two simultaneous runs (nruns = 2) and four Metropolis-coupled Markov chain Monte Carlo (MC3) chains for 2 × 10^6 generations, with one tree retained every 1,000 generations (e.g., , Supplementary Material online). The burn-in point was identified graphically by tracking the likelihoods (Tracer v.1.6; http://tree.bio.ed.ac.uk/software/tracer/). The first 500 trees were discarded, and the remaining 1,501 trees were used to calculate the posterior probabilities (PP) for each clade. Additionally, the “sump” command in MrBayes was used to confirm convergence. This analysis was repeated twice independently; identical topologies were obtained.
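The tree counts in the Bayesian analysis can be checked with a short worked example. The sketch below assumes that the starting state (generation 0) is sampled in addition to every 1,000th generation, which is consistent with the 1,501 post-burn-in trees reported. The clade-support function is a generic illustration of how a posterior probability is estimated as a clade frequency; it does not reproduce MrBayes internals, and all names and the toy trees are hypothetical.

```python
from collections import Counter


def n_sampled_trees(generations, sample_every):
    """Trees written per run, assuming generation 0 is also sampled."""
    return generations // sample_every + 1


def clade_support(trees, burnin):
    """Estimate clade posterior probabilities as post-burn-in frequencies.

    trees: list of trees, each given as a set of clades, where a clade
    is a frozenset of taxon labels.
    burnin: number of leading trees to discard.
    """
    kept = trees[burnin:]
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: count / len(kept) for clade, count in counts.items()}


if __name__ == "__main__":
    total = n_sampled_trees(2_000_000, 1_000)
    print(total, total - 500)  # trees sampled, trees kept after burn-in

    # Toy trees for illustration only.
    ab, ac = frozenset("AB"), frozenset("AC")
    toy_trees = [{ab}, {ab}, {ac}, {ab}]
    print(clade_support(toy_trees, burnin=1))
```

With 2 × 10^6 generations and one sample per 1,000 generations, the count is 2,001 trees, so discarding the first 500 leaves the 1,501 trees stated in the text.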
Trees were visualized using FigTree v.1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). […]

Pipeline specifications