Computational protocol: Phylogenomic Analysis Resolves the Formerly Intractable Adaptive Diversification of the Endemic Clade of East Asian Cyprinidae (Cypriniformes)

Similar protocols

Protocol publication

[…] The decisive role of orthologues in avoidance of erroneous speculations of species tree has been highlighted in many cases , , . The genes we used in this phylogenomic analysis were thus carefully selected to avoid fundamental errors in homology. We implemented a bioinformatics pipeline for mining of single-copy genes. Whole genomic sequences of D. rerio were retrieved from the Ensembl database for gene screening . We extracted the protein sequence and conducted extensive searches against the genomic sequences in all six possible reading frames using t-blastn at the e-values of 10−1. To obtain single-copy genes for future analyses, only protein-coding sequences with both similarity (S) and coverage (C) of less than 30% within-genome sequence comparisons were retained. That is to say, only sequences that have no duplicates over 30% similar to themselves in D. rerio genome were selected for further analyses. We then performed t-blastx searches using sequences of these candidate genes against Genbank to obtain orthologues from fugu (Fugu rubripes) and medaka (Oryzias latipes). We selected genes with the reported criteria : not only were these selected genes conservative enough for primer design, but they were also parsimony informative for the resolution of a variable range of intractable phylogenetic problems. [...] For this study, multigene sequences were determined for 13 species of Cyprinidae, including representative East Asian cyprinids (one individual per species) and D. rerio (). Total DNA was extracted from muscle or fin tissues using phenol/chloroform extraction procedure . Primers for PCR of all 100 genes are listed (). PCR amplification was carried out for 35 cycles, under the following conditions: an initial denaturation at 94°C for 5 min, denaturation at 94°C for 30 s, annealing at 48–56°C for 30 s–50 s, extension at 72°C for 30–120 s, and a final extension of 8 min at 72°C. To provide a check for orthology, amplified products are with a single prominent band. Amplified DNA fragments were fractionated by electrophoresis through 1.0% low-melting agarose gels. Products of expected size were sequenced either directly or after cloning into PMD18-T vectors (Takara). Because of amplification difficulties, some data were missing and partial sequences were present in some species. Missing data or incomplete sequences did not, however, affect the inferred phylogeny because the dataset in our study provided sufficient information, consistent with previous empirical studies , , .Using experimentally amplified sequences as queries, we performed t-blastx searches against the database of GenBank in NCBI ( to confirm that there was only one significant hit for each genetic marker, thus avoiding potentially paralogous comparisons . Multiple alignments were carried out using default parameters in Bioedit (Biological sequence alignment editor V5.0.9, Frame shifts or indels detected in exon and intron were manually excluded. Exons were easily aligned; however, non-coding sequences required greater effort in alignment because these regions had higher variability and repeated stretches of monomers. Alignments of individual genes are available from the authors upon request. We chose to exclude regions of each gene that showed evidence of high levels of saturation by multiple substitutions, and poor sequencing quality from phylogenetic analyses. All sequences amplified in this study were deposited in GenBank (accession numbers GU217798 to GU218392, and GU218394 to GU218691; ). We evaluated alignment statistics of individual gene, including length, percentage of exon, ratio of variable and parsimony informative characters, average within-group p-distance, and average base composition using MEGA4 and Seqstate . [...] The aligned sequences were concatenated using a custom Perl script, and was used for all phylogenetic inference. Heterogeneity of the nucleotide base composition was tested using Chi-square test in PAUP* version 4.0b10 . Parameters such as base frequencies, numbers of substitution types, proportion of invariable sites, and Gamma distribution shape were optimized using Modeltest3.7 with the Akaike Information Criterion (AIC). We performed the heuristic searches option with tree bisection-reconnection (TBR) branch-swapping under Maximum Parsimony (MP). All characters were treated as equally weighted. Node support values in MP analyses were assessed using non-parametric bootstrapping for 1000 pseudo-replicates (10 random taxon addition sequence replicates per pseudo-replicate). We used PhyML to determine Maximum likelihood (ML) tree with the optimal model. Robustness of lineages was tested by bootstrap analyses based on 1000 rounds of bootstrap resampling. Bayesian inference (BI) was conducted using MrBayes v3.1.2 , in which four independent runs of Metropolis-coupled chains (MCMC) with 2000000 generations to estimate the posterior probability distribution (sampling one tree per 1000 replicates for each run). After discarding the first 1000 trees as burn-in with non-stationary log likelihood values, 50% majority-rule consensus trees were estimated for the remaining trees. Stability of nodes was estimated using posterior probabilities (PP). Ancestral state reconstruction was also performed using MrBayes. To compare alternative topologies obtained from previous studies , with the combined datasets, site-wise log-likelihoods for candidate trees were calculated using PAUP*4.0b and used as inputs into the CONSEL program package . The p-value was calculated using Approximately Unbiased (AU) test, Bootstrap Probability (BP) test, Kishino-Hasegawa (KH) test, Shimodaira-Hasegawa (SH) test and Weighted Shimodaira-Hasegawa (WSH) test.We employed variable length bootstrap analysis to investigate the minimum length required to obtain robust phylogenetic inference for this group. In this analysis, bootstrap support of resampled characters was estimated at variable sequence lengths , . All bootstrap searches were performed using MP analyses with PAUP* version 4.0b10 and the number of resampled bases extended from 2000 to 500000 characters to generate bootstrap pseudomatrix.For comparison, we evaluated the relative contribution or effect of each gene using a decay index or Bremer support index in TreeRot3 . Partitioned Bremer Support (PBS) was calculated following the method of Baker & DeSalle . Individual PBS scores can be positive, negative and zero. Positive PBS values indicate that a given dataset increases support for particular node whereas negative values show that data partition provides net negative support for the given node. A PBS value of zero suggests that the given data partition at that node has an indifferent relationship. Thus, the larger the PBS values for a node of interest, the greater the relative effectiveness of that genetic marker in resolving and supporting that node. The sum of PBS values of the different data partitions for any given node will always be equal to the decay index for the node of the inferred tree.Phylogenetic studies with relatively few taxa have a major advantage in terms of exploring a variety of analytical methods , and all possible phylogenetic reconstructions. Several discussions have raised legitimate arguments against the naturalness of data partitioning and choice of model selection –. To examine the potential systematic errors caused by model misspecification and improper data partitioning strategies, we applied a series of data partitioning strategies, and homogenous versus mixed models (parameters were unlinked across partitions). We then evaluated the relative merits of competing data partitioning strategies and alternative models by Bayes factors. The analysis does not require the assumption of any asymptotic property and hierarchically nested hypotheses but it provides a rigorous basis for model testing or data partitioning in terms of probability . We approximated the Bayes factor as the marginal likelihood (the ratio of the harmonic mean of likelihoods) of Markov Chain Monte Carlo samples . We calculated twice the natural logarithm of the Bayes factors for alternative partitioning strategies, and determined the result using the criteria provided by Kass and Ratery : the null hypothesis is preferred if 2lnBF <0, which provides evidence in favor of model 0; on the other hand, when 2lnBF > 10 the null hypothesis is rejected. The partitioning strategies were as follows: (1) equal length partitioning (dividing the concatenated data to 7 partitions of equal sequence length), (2) partitioning by exon and intron (exon + intron), (3) partitioning by codon positions and intron (1st codon position + 2nd codon position + 3rd codon position + intron), (4) all data combined in one single partition (one partition), and (5) partitioning by genes (100 gene partitions). Using these varied partitions, results of analyses were compared, to test which was the most suitable one for improving phylogenetic inference. To eliminate model misspecification, we also used Bayes factors with the above-mentioned criteria to evaluate the relative merits of homogeneous models and mixed models.To estimate divergence times, likelihood ratio test was performed using PAUP*4.0b to obtain the likelihood scores and investigate whether a global clock fit the combined dataset. Divergence times were estimated using Multidivtime –. We chose the default F84 model, the most complicated model with four discrete categories for the Γ distribution of rate in Multidivtime. Divergence time algorithms require calibration for at least one internal node. Minimum age constraints were determined using fossil records of extant cyprinids in China from the Pliocene (5.33–1.81 MYA). We chose the recent fossil-based time constraints assignable to Mylopharyngodon piceus and Ctenopharyngodon idella with a minimum age of 1.81 MYA . […]

Pipeline specifications