Computational protocol: Evaluating the Effects of Non-Neutral Molecular Markers on Phylogeny Inference

Protocol publication

[…] Tests of selection are based on assessing the ratio of synonymous to non-synonymous nucleotide substitutions; therefore, only protein-coding loci were chosen for this study. To test the notion that rhodopsin may not be an appropriate marker for phylogenetic analysis of aquatic organisms, the acanthomorph dataset of Li et al. was selected. Those data include ring finger protein 213 (rnf213), mixed-lineage leukemia (mll), inverted repeat binding protein (irbp) and rhodopsin (rho) sequences. The first of these, rnf213, was sequenced for Li et al.'s (2009) study, the rho sequences were generated by Chen et al., and the mll and irbp sequences (with the exception of a few sequenced by Li et al.) were generated by Dettai and Lecointre. Only taxa with accessioned sequence data for all four markers were included. Of those, Epinephelus aeneus was excluded because the published irbp sequence (AY362227) matched mll when queried with BLAST against the NCBI nucleotide database (default parameters). All sequences used in this study were retrieved from GenBank.

All four markers were aligned using MUSCLE implemented in Geneious (Biomatters Ltd., Auckland, New Zealand) with full penalty for terminal gaps, a gap open score of −1, and a maximum of eight refinement iterations. Model selection, using the AICc (corrected Akaike Information Criterion), and phylogeny inference, using the ML criterion, were carried out in Treefinder. Five alignments were analyzed: each of the four individual markers and a concatenated dataset.

To test for selection, the Nei-Gojobori method (Proportion), implemented in MEGA 4.0, was used to estimate the number of non-synonymous substitutions per non-synonymous site (dN) and the number of synonymous substitutions per synonymous site (dS) for each of the four markers.
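
A Z-test of neutrality on estimates of this kind can be sketched as follows. This is a minimal illustration and not MEGA's implementation: the function name `neutrality_z_test` and its inputs are hypothetical, and the sketch assumes that the variances of dN and dS are taken from bootstrap replicate estimates and that Z = (dN - dS) / sqrt(Var(dN) + Var(dS)) is compared against a standard normal distribution.

```python
import math
from statistics import NormalDist, variance


def neutrality_z_test(dn, ds, dn_replicates, ds_replicates):
    """Two-sided Z-test of H0: dN = dS (illustrative sketch only).

    dn, ds          -- overall average estimates for one marker
    dn_replicates   -- bootstrap replicate estimates of dN (assumed input)
    ds_replicates   -- bootstrap replicate estimates of dS (assumed input)
    """
    # Sample variances of the bootstrap replicates stand in for Var(dN), Var(dS).
    var_dn = variance(dn_replicates)
    var_ds = variance(ds_replicates)
    total_var = var_dn + var_ds
    if total_var <= 0.0:
        raise ValueError("bootstrap variances must sum to a positive value")

    z = (dn - ds) / math.sqrt(total_var)
    # Two-sided p-value under a standard normal null distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value
```

A negative Z with a small p-value would point towards dN < dS (purifying selection) and a positive Z towards dN > dS; the replicate lists would come from whatever bootstrap procedure produced the variance estimates.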
Variances were estimated by bootstrapping the data (5000 replicates), and the null hypothesis of neutral evolution (H0: dN = dS) was then tested with a Z-test on the overall per-marker averages of dN and dS.

To detect the type of selection (positive or negative) per site, the fixed effects likelihood (FEL) and random effects likelihood (REL) methods were run in HyPhy. Both methods are based on ML estimates for the parameters of a nucleotide substitution model and a codon model, and test whether dN/dS > 1 at each site. Unlike REL, FEL estimates are conditional on a specific phylogeny; for these FEL analyses the TE phylogeny was used. The REL method assumes distributions for the synonymous and non-synonymous rates and identifies positively selected sites using empirical Bayes factors. To determine significance, a cutoff p-value of 0.05 was used for FEL and an acceptance ratio of 0.05 for REL.

Tree distance metrics were generated and statistical tests were carried out on all five topologies (four individual markers plus the concatenated dataset) to assess their pairwise similarity or difference. Two tree distance metrics were used to evaluate topology congruence: the symmetric difference of Robinson and Foulds, computed in Phylip, and the SPR (subtree pruning and regrafting) heuristic distance, implemented in TNT. The symmetric difference counts the partitions that are present in one tree but not the other, whereas the SPR distance is the minimum number of SPR moves required to transform one tree into another. For both metrics, trees were treated as unrooted; for SPR distances, 2000 replications were carried out per comparison.

Because a tree metric is not a statistical hypothesis test, three paired-sites tests (the KH, AU and SH tests) were also carried out.
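
The symmetric difference described above can be illustrated with a short sketch. It is not the Phylip routine used in the study: the nested-tuple tree encoding and the function names `leaf_set`, `bipartitions` and `symmetric_difference` are assumptions made for the example. Trees are treated as unrooted by recording, for every internal edge, the side of the split that does not contain a fixed reference taxon.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, ignoring where it is rooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # arbitrary but consistent reference taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Canonical form: keep the side of the split without the reference taxon.
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def symmetric_difference(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

For instance, `symmetric_difference(((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E")))` compares two five-taxon trees that differ in a single internal edge. The SPR distance has no comparably simple closed form, which is why a heuristic search is needed for it.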
All tests compared likelihood differences among tree topologies (the five generated for this study) to the empirical variation in log-likelihoods for a given dataset, with the AU and SH tests approximately correcting for the comparison of multiple trees. The null hypothesis (H0: all topologies under comparison are equally good explanations of the data) was rejected when P < 0.05. The hypothesis tests were implemented in Treefinder using the RELL (resampling of estimated log-likelihoods) nonparametric bootstrap method for the KH and SH tests, with 50,000 replicates to generate the null distributions. A multiscale bootstrap technique was used for the AU test, which is considered to exhibit the least amount of bias and is less conservative than the SH test. The models used to calculate the likelihoods for each topology were the same as those chosen using the AICc in Treefinder.

To determine whether incongruence was caused only by the presence of positively selected codons, any marker found to be incongruent had its positively selected sites (all sites detected using FEL and REL) removed, and another round of KH, AU and SH tests was carried out in Treefinder. If sites under positive selection are the only cause of incongruence, their removal should result in failure to reject the null hypothesis of the paired-sites tests. […]
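
The RELL idea behind the KH test can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Treefinder's implementation: the function name `rell_kh_test` is hypothetical, the per-site log-likelihoods for the two topologies are assumed to be available already, the bootstrap distribution is centred on its own mean to approximate the null, and a two-sided p-value is reported.

```python
import random


def rell_kh_test(site_lnl_a, site_lnl_b, replicates=1000, seed=0):
    """Two-sided KH-style test using RELL resampling (illustrative sketch).

    site_lnl_a, site_lnl_b -- per-site log-likelihoods of the same alignment
                              under topology A and topology B (assumed inputs)
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both trees need one log-likelihood per site")

    n_sites = len(site_lnl_a)
    # Per-site differences; resampling these resamples sites for both trees jointly.
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    observed = sum(diffs)

    rng = random.Random(seed)
    boot = []
    for _ in range(replicates):
        # RELL: resample sites with replacement, without re-optimising parameters.
        boot.append(sum(rng.choices(diffs, k=n_sites)))

    # Centre the bootstrap differences so they approximate the null distribution.
    mean_boot = sum(boot) / replicates
    centred = [d - mean_boot for d in boot]

    extreme = sum(1 for d in centred if abs(d) >= abs(observed))
    return observed, extreme / replicates
```

The SH and AU tests build on the same resampled log-likelihoods but add a correction for comparing several trees at once (and, for AU, multiscale resampling), so they are not reproduced here.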

Pipeline specifications