Computational protocol: Differential strengths of selection on S-RNases from Physalis and Solanum (Solanaceae)

Similar protocols

Protocol publication

[…] Amino acid and nucleotide S-RNase sequences were obtained from GenBank for 12 Physalis cinerascens, 36 P. longifolia, 17 Solanum carolinense, 32 S. chilense and one Antirrhinum hispanicum (Ahis5) allele used as an outgroup sequence. Automated alignment of the complete dataset containing all S-alleles was performed using ClustalX [] and manually adjusted using Se-Al v2.0 []. A nucleotide alignment was matched with corresponding amino acids to produce a codon alignment using PAL2NAL [] that resulted in 131 codons. A phylogeny of all S-alleles (n = 98) was created using Mr. Bayes v3.1 [] to generate a 50% majority consensus topology. The analysis was run under a GTR+ Г + I substitution model for 1,000,000 generations, sampling every 100th tree for a total of 10,000 trees. The initial 2501 trees were discarded as the burn-in phase. The remaining trees represent generations on which posterior probabilities were calculated.Separate datasets were compiled for each genus: one that contained 48 Physalis and the other with 49 Solanum S-alleles. Corresponding topologies for each dataset were pruned from the Bayesian consensus tree using TreeEdit v1.0a10 [] to maintain genealogical relationships found when all taxa's alleles were included. The use of 2 species from each genus simply enlarges each dataset as the genealogical patterns exhibited for congeners are shared because of trans-specific polymorphism. The same tree topology for each dataset was used in all subsequent selection analyses that utilize phylogenies unless otherwise stated. A general time reversible (GTR) model of nucleotide substitution is used for all subsequent phylogenetic selection analyses so that direct comparisons can be made across models and datasets. Pairwise nucleotide divergence π was estimated for synonymous and non-synonymous substitutions for all taxa using DNASP 4.0 []. Sequence alignments, Newick string tree topologies and HYPHY likelihood functions for Physalis and Solanum datasets can be found as Nexus files in online Supplementary data. [...] To estimate the ratio (ω) of non-synonymous (dN) to synonymous (dS) substitutions at individual amino acid sites we first used the program codeml in PAML 3.15 []. Values of ω < 1 for individual codons indicates purifying selection while sites with ω = 1 are considered neutral. Positive selection at the amino acid level is predicted when ω > 1. A series of nested neutral and selection models first developed by Nielsen and Yang [] use likelihood ratio tests (LRT) to determine the model that best fits the data. The null model M1 (neutral) constrains all sites to be either of class ω = 0 or ω = 1 while the alternative model M2a (selection) adds a third class in which ω > 1 at individual sites. Model M3 (selection) assumes three discrete site classes (ω0 , ω1, and ω2) with three corresponding proportions (p0, p1, p2) estimated from the data. Models are then compared and rejected by likelihood ratio tests as described in the section above. Sites estimated to be under positive selection are determined by an empirical Bayes approach [] where posterior probabilities are estimated from rates within each site class. Because we are primarily concerned with comparing posterior probabilities from the robust general discrete (M3) model with a subsequent coalescent analysis, we forgo full analyses including models with more complex rate distributions (i.e. M7 and M8).The Bayesian coalescent method was conducted using OmegaMap v0.5 [] which implements a population genetics likelihood approximation to the coalescent to infer recombination and estimate ω. The model of base substitution including transition/transversion rates among codons was adopted from Nielsen and Yang []. Rather than using a maximum likelihood approach to estimate the selection parameter, OmegaMap employs a Bayesian method with a Markov Chain Monte Carlo (MCMC) process to estimate posterior distributions of parameters. This allows the use of posterior densities of ω to investigate whether dN/dS is greater at any particular codon in one dataset versus the other without the need for nested models. This can only be done if datasets are the same length, encode for homologous genes, and have reliable alignments of codon positions. By sampling from the distribution of ω values we are able to determine the ratio of ω estimated from Physalis relative to Solanum. Rejection of the null hypothesis that sites have equivalent ω values is observed when the 95% posterior density of ratios exclude 1 (H0: w1HPD w2HPD = 1).Rather than estimating ω for each dataset using a variable model along pre-defined blocks of adjacent codons, we assumed an independent model for each site with an improper inverse distribution of rates. The MCMC chain was iterated over 500,000 generations sampling every 100th generation. We ran each dataset twice to check for convergence and removed a burn in of 50,000 generations using R http://www.r-project.org/. The chain generates upper and lower posterior densities (highest posterior density HPD) to determine mean point estimates of ω at each codon position for each dataset. Because the independent model is computationally intensive, we ran the OmegaMap analyses using the Cornell BioHPC server http://cbsuapps.tc.cornell.edu/omegamap.aspx. The upper and lower HPD of ω values from each dataset were then combined and re-sampled after a burn in of 25,000 generations to get HPD's and the geometric mean for the ratio of ω's using R. [...] We also used a fixed-effects likelihood (FEL) method to infer differential selection at individual sites among datasets []. FEL differs from the REL type models of PAML and the coalescent method of OmegaMap in that dN and dS are estimated at individual sites directly rather than using pre-defined distributions of rates []. Alignments of each dataset were first used to estimate global parameters such as nucleotide frequencies, topology, and branch lengths. We use separate trees for each dataset (rather than a single phylogeny including both genera). These parameters were then fixed throughout the selection estimate procedure. The null model H0: dN1/dS1 = dN2/dS2 and alternative model HA: where dS1, dN1, dS2, dN2 are free to vary are fitted to every codon and, because they are nested, likelihood ratio tests can be used to determine significantly different selection pressures on individual sites. We estimated selection using the CompareSelectivePressure batch file in HYPHY v0.99. Actual dN/dS values for each dataset were then checked for any potential false positive estimates of differential positive selection. Here it is possible for the model to reject the null hypothesis that dN/dS ratios are equivalent across datasets but codons may not actually have ω estimates > 1.We conducted simulations for Physalis and Solanum datasets independently to determine the power of the FEL test for given p-values. We simulated 100 replicates of each dataset and corresponding phylogeny using the site-by-site rate estimates from the FEL method with 25% of sites evolving neutrally. This produced 13100 sites with non-zero rates (131 codons × 100 replicates) to estimate false positive rates over bins of p-values of width 0.01. The power analysis was conducted using a batch command program in the HYPHY v0.99 package. […]

Pipeline specifications