Computational protocol: Phylogenetic position and age of Lake Baikal candonids (Crustacea, Ostracoda) inferred from multigene sequence analyses and molecular dating

Protocol publication

[…] All sequences were visualized using Finch TV version 1.4.0. BLAST (Altschul, Gish, Miller, Myers, & Lipman) analysis of the GenBank database was used to check that the obtained sequences were ostracod in origin and not contaminants. Each sequence was checked for signal quality and for sites with possible low resolution, and corrected by comparing forward and reverse strands.

Sequences were aligned in MEGA 7 (Kumar, Stecher, & Tamura) with ClustalW (Thompson, Higgins, & Gibson). For the 28S dataset, the extension penalty was changed from the default setting (6) to 1 in order to allow alignment of homologous regions that were separated by expansion segments present in some taxa but not others. All alignments were manually checked and corrected where necessary. The 28S alignments were also checked with Gblocks (Castresana), and ambiguous blocks were removed.

We analyzed the alignment of each gene, and of each of the three 28S regions amplified with different primers (dd/ff, ee/mm, vv/xx), separately. In addition, we performed two analyses of the concatenated dataset: one including all three genes, and the other with only the three 28S fragments; the latter was used only in the divergence time estimations. In the concatenated datasets, some species datasets were composed of sequences acquired from different specimens in order to avoid missing data, and for our outgroup we combined 16S from a different, but closely related, species. Missing data in concatenated datasets were coded “?”. Recent simulations and empirical analyses suggest that missing data in Bayesian phylogenetics are not themselves problematic, and that incomplete taxa can be accurately placed as long as the overall number of characters is large (Wiens; Wiens & Moen).

Sequence differences within and between groups in each individual alignment, as well as in the concatenated datasets, were calculated in MEGA 7 using the simple p‐distance method. Sequences were divided into groups defined by the genus to which they belong.
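The concatenation and p‐distance steps described above can be illustrated with a minimal Python sketch. It is not the MEGA 7 implementation: the taxon names and sequences are hypothetical, and skipping sites where either sequence has a gap, `N`, or `?` (pairwise deletion) is an assumption, since the protocol does not state how MEGA was configured to treat such sites.

```python
# Minimal sketch: concatenate per-gene alignments (missing genes coded "?")
# and compute an uncorrected p-distance. Illustrative only, not MEGA's code.

MISSING = {"?", "-", "N"}  # symbols treated as uninformative (assumption)


def concatenate(genes, taxa):
    """Join per-gene alignments in order; pad a taxon's absent gene with '?'."""
    lengths = [len(next(iter(aln.values()))) for aln in genes]
    return {
        taxon: "".join(aln.get(taxon, "?" * n) for aln, n in zip(genes, lengths))
        for taxon in taxa
    }


def p_distance(a, b):
    """Proportion of differing sites among sites informative in both sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in MISSING or y in MISSING:
            continue  # pairwise deletion of gaps and missing data
        compared += 1
        diffs += x != y
    return diffs / compared if compared else None


# Hypothetical toy data: taxon_C has no sequence for the second gene.
gene_1 = {"taxon_A": "ACGTAC", "taxon_B": "ACGTTC", "taxon_C": "ACGAAC"}
gene_2 = {"taxon_A": "GGCA", "taxon_B": "GGTA"}
concat = concatenate([gene_1, gene_2], ["taxon_A", "taxon_B", "taxon_C"])
for pair in [("taxon_A", "taxon_B"), ("taxon_A", "taxon_C")]:
    print(pair, p_distance(concat[pair[0]], concat[pair[1]]))
```

Within‐group and between‐group values would then be averages of these pairwise distances over the sequence pairs assigned to the same genus or to different genera.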
The best-fit evolutionary model was selected with jModelTest 2.1.6 (Darriba, Taboada, Doallo, & Posada; Guindon & Gascuel) using the Akaike information criterion (Hurvich & Tsai). Bayesian inference reconstruction in MrBayes v3.2.6 (Huelsenbeck & Ronquist; Ronquist & Huelsenbeck; Ronquist et al.) was performed with the best-fit model and with priors for the base and state frequencies calculated by jModelTest. For the concatenated set, data were partitioned into five blocks corresponding to gene regions, each with its own fixed priors. All analyses were run with four chains simultaneously for two million generations in two independent runs, sampling trees every 200 generations. Of the four chains, three were heated and one was cold; the temperature value (“Temp” command in MrBayes) was 0.1 (the default option). The results were summarized, and trees from each MrBayes run were combined with the default 25% burn‐in. A >50% posterior probability consensus tree was constructed from the remaining trees.

For the choice of the outgroup we relied on the phylogeny published in Hiruta et al. As the relationships within Cypridoidea were not clearly resolved and Candonidae appears as a sister taxon to all other Cypridoidea, we decided on a representative of Cyclocyprididae, which was formerly placed in the same family as Candoninae. For details of the number of original sequences and their sampling localities, as well as of the sequences downloaded from GenBank, see Supplement 1.

A saturation test and a likelihood ratio test for deviation from the molecular clock were performed for each separate dataset with DAMBE5 (Xia), while for the concatenated datasets the marginal model likelihood, estimated with the stepping‐stone algorithm in MrBayes, was used to test the molecular clock. After examining the consensus trees resulting from the separate and concatenated analyses, we chose four nodes to calibrate the molecular clock in the divergence time analysis performed in BEAST v1.8.3 (Drummond, Suchard, Xie, & Rambaut).
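The MrBayes run settings given above (four chains, temperature 0.1, two million generations sampled every 200, two runs, 25% burn‐in) imply some simple arithmetic, sketched below in Python. The heating formula, heat = 1 / (1 + temp × i) for chain index i starting at 0 for the cold chain, is taken from the MrBayes documentation as we understand it. Counting the starting state as a sample and rounding the burn‐in down are assumptions; MrBayes may differ by one tree.

```python
# Sketch of the chain heating and sample bookkeeping implied by the settings.
# Function names are illustrative; this does not call MrBayes.


def chain_heats(n_chains=4, temp=0.1):
    """Heat of each Metropolis-coupled chain; index 0 is the cold chain."""
    return [1.0 / (1.0 + temp * i) for i in range(n_chains)]


def sample_counts(ngen=2_000_000, samplefreq=200, burnin_frac=0.25, n_runs=2):
    """Trees sampled, discarded as burn-in, and kept (per run and pooled)."""
    per_run = ngen // samplefreq + 1         # +1 for the starting state (assumption)
    discarded = int(per_run * burnin_frac)   # rounded down (assumption)
    kept = per_run - discarded
    return {
        "per_run": per_run,
        "discarded_per_run": discarded,
        "kept_per_run": kept,
        "kept_pooled": kept * n_runs,
    }


print([round(h, 3) for h in chain_heats()])
print(sample_counts())
```

Under these assumptions each run yields about ten thousand sampled trees, of which roughly three quarters from each run contribute to the consensus tree.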
Three analyses were run: the concatenated dataset with all three genes, the 18S dataset, and the combined 28S dataset. The last differed from the first two in using a strict clock model, whereas for the first two we used an uncorrelated relaxed (lognormal) clock. Otherwise, in all three analyses the GTR + G + I model (Rodríguez, Oliver, Marín, & Medina) was used for the site model and the Calibrated Yule model (Heled & Drummond) for the tree prior.

Priors for the node ages were all set with a normal distribution. The root was calibrated based on the oldest Candonidae fossil, with a mean of 180 Mya and a standard deviation of 6 Mya, covering the period of the Early Jurassic. The three internal nodes were calibrated as follows: Candona origin with a mean of 80 Mya and a standard deviation of 3.2 Mya, corresponding to the time of the first Candona ssl. fossils from the Upper Cretaceous (see Danielopol et al.); Pseudocandona origin with a mean of 24 Mya and a standard deviation of 2.6 Mya, corresponding to the time of the first Pseudocandona ssl. fossil from the Late Oligocene/Early Miocene (Triebel); and Trapezicandona Schornikov, 1969 with a mean of 6 Mya and a standard deviation of 1 Mya, corresponding to the time of the first Trapezicandona fossils from the Late Miocene/Early Pliocene (see Danielopol). All other priors were set to the default program options.

We conducted two independent runs for each analysis, each for 10,000,000 generations, sampling every 1,000 generations. Tracer (Rambaut, Suchard, Xie, & Drummond) was used for visualizing the results of the BEAST analyses, and FigTree v1.4.3 for tree visualization. We did not analyze 16S separately for the divergence time estimate because that dataset was very limited. […]
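To make the normal node‐age priors above easier to read against the geological time scale, the following Python sketch computes the central 95% interval implied by each stated mean and standard deviation. It uses only the standard library (`statistics.NormalDist`). The dictionary keys are illustrative labels, the choice of a 95% interval is ours, and nothing here reproduces how BEAST itself handles the priors.

```python
from statistics import NormalDist

# Calibration priors as stated in the protocol: (mean, standard deviation) in Mya.
CALIBRATIONS = {
    "root (oldest Candonidae fossil)": (180.0, 6.0),
    "Candona": (80.0, 3.2),
    "Pseudocandona": (24.0, 2.6),
    "Trapezicandona": (6.0, 1.0),
}


def central_interval(mean, sd, mass=0.95):
    """Lower and upper bounds enclosing `mass` of a normal prior, in Mya."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


for node, (mean, sd) in CALIBRATIONS.items():
    low, high = central_interval(mean, sd)
    print(f"{node}: {low:.1f} to {high:.1f} Mya")
```

For the root prior, for example, the interval runs from roughly 168 to 192 Mya, that is, the mean plus or minus about 1.96 standard deviations.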
