Computational protocol: Pitfalls of Establishing DNA Barcoding Systems in Protists: The Cryptophyceae as a Test Case

Protocol publication

[…] BLAST searches were performed with unpruned sequences. The newly obtained COI-5P sequence of Cryptomonas curvata strain CCAC 0080 and several 5′-partial nuclear LSU rDNA sequences, representing new lineages at different classification levels in different cryptophyte clades, were used as query sequences to test the performance of the different nucleotide-versus-nucleotide search algorithms offered at NCBI. The partial LSU rDNA sequences represented either a species of the Rhodomonas clade not yet available in GenBank (1 sequence), new lineages in the Chroomonas clade (2 sequences), or a new sub-lineage within the species C. curvata (1 sequence). According to Kim et al., the codon usage in the mitochondrial genome of the cryptophyte Hemiselmis andersenii strain CCMP644 corresponds to the standard code table; this setting was therefore used for a blastx search with the C. curvata CCAC 0080 COI-5P sequence as the query.

For phylogenetic analyses and frequency distributions, two alignments were assembled from 5′-partial LSU rDNA sequences. The Cryptomonas data set comprised 64 OTUs, including 8 new sequences; the Chroomonas data set comprised 45 sequences, including 11 new sequences. Both alignments contained no outgroup taxa and were automatically pre-aligned with MUSCLE. Alignment errors were corrected by eye using the multiple sequence alignment editor SeaView 4.3.3. Non-alignable regions were excluded from phylogenetic analyses, distance computations and saturation tests. The final Cryptomonas data set consisted of 975 positions and the Chroomonas data set of 920 positions. Phylogenetic analyses were performed using the threaded version of RAxML 7.2.6 (maximum likelihood) and the MPI version of MrBayes 3.1.2 (Bayesian analyses). For maximum likelihood analyses, GTR+I+Γ was used as the evolutionary model, with 1000 bootstrap replicates for each data set. MrBayes was set to 2 runs with four chains each, 4 million generations and GTR+I+Γ.
The burn-in phase for each data set was determined and removed using the “sump” command.

To compare genetic distances and saturation among COI-5P sequences, the C. curvata CCAC 0080 COI-5P sequence, all cryptophyte sequences and their neighboring sequences found one position up and down in the discontiguous megablast ranking were aligned (12 sequences, see results) and subjected to comparative distance analysis with the K2P and GTR+I+Γ models. All three alignments (the small COI-5P alignment and the two 5′-partial LSU rDNA data sets) were tested for substitution saturation using the test according to Xia et al. in DAMBE 5.2.57 under exclusion of gaps.

Genetic distances were computed with PAUP 4.0b10. To compare the effects of different distance measures on frequency distributions, four distance measures were inferred from the Chroomonas data set. The two most common distance measures in DNA barcoding, uncorrected p-distances and Kimura 2-parameter (K2P) distances, were computed with the algorithms implemented in PAUP under the distance criterion. A run of jModeltest 0.1.1 yielded TIM2+I+Γ as the best trade-off between model complexity and adequate approximation of molecular evolutionary processes. For distances under this model and under the most complex evolutionary model, GTR+I+Γ, the respective maximum likelihood parameters were estimated on a Jukes-Cantor neighbor-joining tree and then used for distance computation. For the Cryptomonas data set, intra- and interspecific GTR+I+Γ distances were coded separately to yield a frequency distribution for each. The distances inferred from both data sets were exported from PAUP as column-formatted text files and imported into Calc 3.2.1 for further processing.
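The two barcoding distance measures named above can be illustrated for a single pair of aligned sequences. The sketch below is not the PAUP implementation; it is a minimal Python illustration of the standard formulas, in which the function names, the example sequences and the pairwise skipping of gapped or ambiguous sites are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(seq_a, seq_b):
    """Return (compared sites, transitions, transversions) for an aligned pair.

    Assumption: sites with a gap or ambiguity code in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def p_distance(seq_a, seq_b):
    """Uncorrected p-distance: proportion of compared sites that differ."""
    sites, ts, tv = count_substitutions(seq_a, seq_b)
    if sites == 0:
        raise ValueError("no comparable sites")
    return (ts + tv) / sites


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance: -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q).

    P and Q are the proportions of transitions and transversions.
    math.log raises ValueError when the pair is too divergent (saturated).
    """
    sites, ts, tv = count_substitutions(seq_a, seq_b)
    if sites == 0:
        raise ValueError("no comparable sites")
    p = ts / sites
    q = tv / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Hypothetical 10-site pair with one transition (A/G) and one transversion (C/A).
seq_a = "ACGTACGTAC"
seq_b = "GCGTACGTAA"
print(p_distance(seq_a, seq_b), k2p_distance(seq_a, seq_b))
```

The K2P value is always at least as large as the p-distance for the same pair, because it corrects for multiple substitutions at the same site; this is one reason the choice of distance measure shifts a frequency distribution.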
The genetic distances were sorted into distance classes to generate frequency distributions for the Cryptomonas and Chroomonas data sets. To avoid biased frequency distributions, this procedure did not include computation of mean values or standard deviations.

For GMYC, Bayesian analyses were first performed with BEAST 1.7.2 to obtain an uncalibrated tree under the assumption of a molecular clock. To account for all possible clock models, the random local clock setting was chosen and clock rates were estimated. The tree prior was set to a Yule process. Two Markov chain Monte Carlo runs with 40 million generations and sampling of every 1000th generation were performed to raise the effective sample sizes (ESS) of each run above 200. Burn-in was determined with Tracer 1.5.0 (400,000 and 17,000,000 generations, respectively). After removal of the trees sampled during the burn-in phase, the samples of the two runs were merged. The representative tree selected by the TreeAnnotator module of the BEAST software suite was imported into the statistics software R 2.15.1. GMYC analysis required the R package SPLITS, obtainable from the R-Forge website. The results of the analyses (ultrametric tree plot and semi-logarithmic lineage-through-time plot with identified thresholds) were saved in portable document format (PDF) and subsequently processed with Inkscape 0.47. […]
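The burn-in removal and merging step for the two BEAST runs can be sketched in a few lines of Python. This is an illustration only and does not call BEAST or any of its tools: the `simulate_run` helper and its placeholder tree labels are hypothetical stand-ins for a tree log, the sketch assumes that each run covered 40 million generations, and it assumes that samples at or below the burn-in generation are discarded.

```python
SAMPLE_INTERVAL = 1_000          # every 1000th generation was sampled
TOTAL_GENERATIONS = 40_000_000   # assumption: 40 million generations per run


def simulate_run(total_generations, sample_interval):
    """Hypothetical stand-in for a tree log: one (generation, label) per sample."""
    return [
        (generation, f"tree_{generation}")
        for generation in range(0, total_generations + 1, sample_interval)
    ]


def discard_burnin(samples, burnin_generations):
    """Keep only samples drawn after the burn-in phase.

    Assumption: a sample taken exactly at the burn-in generation is discarded.
    """
    return [sample for sample in samples if sample[0] > burnin_generations]


run_1 = simulate_run(TOTAL_GENERATIONS, SAMPLE_INTERVAL)
run_2 = simulate_run(TOTAL_GENERATIONS, SAMPLE_INTERVAL)

# Burn-in values reported for the two runs (400,000 and 17,000,000 generations).
kept_1 = discard_burnin(run_1, 400_000)
kept_2 = discard_burnin(run_2, 17_000_000)

# Merge the post-burn-in samples of both runs into one pool.
merged = kept_1 + kept_2
print(len(kept_1), len(kept_2), len(merged))
```

Because the two runs needed very different burn-in lengths, they contribute unequal numbers of samples to the merged pool; under the assumptions above, the second run supplies noticeably fewer trees than the first.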

Pipeline specifications