Computational protocol: Nuclear and plastid haplotypes suggest rapid diploid and polyploid speciation in the N Hemisphere Achillea millefolium complex (Asteraceae)

Protocol publication

[…] Sequences were assembled with the ContigExpress program (Informax Inc. 2000, North Bethesda, MD), aligned with ClustalX 1.81, and then manually refined with BioEdit version 7.0.1. To guard against possible sequencing errors, single mutations in the nuclear gene data sets that were likely generated by the cloning-sequencing method were excluded from the analyses. Furthermore, unique sequences in the nuclear gene data matrix that did not fall into any majority-rule consensus sequence group, or that showed unstable branch positions in trees based on different subsets of the data (i.e., with partial characters or randomly selected sequences) during the initial analyses, were eliminated to avoid the influence of PCR-mediated recombination. The final numbers of individuals/clones analyzed at each locus for each population are listed in Table . All the sequences analyzed were submitted to NCBI GenBank under accession numbers HQ601971-HQ602593 (the nuclear ncpGS and SBP genes) and HQ450864-HQ451071 (the plastid loci).

The allelic data sets of the two nuclear genes, ncpGS and SBP, were analyzed separately, whereas sequences of the three cpDNA fragments were combined as one locus.

Gaps in the nuclear data sets were treated as missing data, whereas each indel in the plastid data set (no matter how many nucleotide sites it contained) was coded as a binary character (0/1 = A/C) using the program GapCoder.

As A. millefolium agg. consists of species with a short evolutionary history, Neighbor Joining (NJ), Maximum Parsimony (MP) and Median-Joining network analyses were applied to the present data. For the nuclear sequences, NJ and MP analyses were performed with MEGA 5.05 and PAUP* 4.0b10a, respectively. All nucleotide substitutions were equally weighted. Gaps were treated as missing data.
We first analyzed the data from the diploid species to show diversification of the gene lineages, and then the data from all taxa to investigate relationships among the polyploids and diploids within A. millefolium agg. The NJ analysis was conducted with Kimura's 2-parameter distances and bootstrapped with 1000 replicates. For the MP method, heuristic searches were performed using 1000 random taxon addition replicates with ACCTRAN optimization and TBR branch swapping. Up to 10 trees with scores larger than 10 were saved per replicate. The stability of internal nodes of the MP tree was assessed by bootstrapping with 1000 replicates (MulTrees option in effect, TBR branch swapping and simple sequence addition).

Median-Joining network analysis, implemented in Network ver. 4.5.1.6 (available at http://www.fluxus-engineering.com/sharenet.htm), was applied to the cpDNA data set. All variable sites were equally weighted and the homoplasy level parameter (ε) was set to zero, given that variation rates among the closely related species are low, especially in their plastid DNA.

To understand the population demography at the time of speciation of the diploid species of A. millefolium agg., we applied a probabilistic model, the Isolation with Migration model for multiple populations implemented in IMa2, to three widespread and closely related species: A. asplenifolia-2x, A. setacea-2x and A. asiatica-2x. These species are here regarded as three diverged populations that share nuclear sequence variation. Shared alleles could reflect ancestral polymorphism or gene flow after separation of the populations or species. Assuming neutrality, retention of ancestral polymorphisms is likely if speciation is fast relative to drift, the intensity of which is inversely related to the effective population size.
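For reference, the Kimura 2-parameter distance used in the NJ analysis above can be written in terms of the proportions of transitional (P) and transversional (Q) differences between two aligned sequences as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The following is a minimal Python sketch of that formula, not the MEGA implementation; the function name `k2p_distance` and the pairwise skipping of gapped or ambiguous positions are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Positions where either sequence is not A, C, G or T (gaps, ambiguity
    codes) are skipped. This pairwise skipping is an illustrative assumption.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared      # proportion of transitions (P)
    q = transversions / compared    # proportion of transversions (Q)
    term = (1 - 2 * p - q) * math.sqrt(1 - 2 * q) if 1 - 2 * q > 0 else 0.0
    if term <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term)
```

For closely related sequences, as expected in a young species complex, the corrected distance stays close to the raw proportion of differing sites.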
As a rule of thumb, species are well separated, with little ancestral polymorphism and thus almost complete lineage sorting, if the time since separation, in generations, is at least four times the effective population size. Secondary genetic exchange between the diverging species can also lead to shared alleles. The multipopulation model in IMa2 allows for both ancestral polymorphism and gene flow subsequent to divergence. It assumes a known history of the sampled populations, which can be represented by a rooted bifurcating tree. In an earlier analysis using AFLP data, we inferred the rooted species tree as ((A. asiatica, A. asplenifolia), A. setacea). We note that IMa2 provides posterior distributions of parameters, so that confidence in the inference of each parameter can be judged from the spread of its posterior distribution. The IM model also assumes neutral genetic variation, freely recombining unlinked loci and no intragenic recombination or gene conversion. Sequences of the two nuclear loci, the ncpGS and SBP genes, and of three plastid fragments were used for this analysis. The polymorphic sites of the sequenced nuclear and plastid loci lie mostly in introns or intergenic spacers and thus should fit the neutral variation model. Using the four-gamete criterion, we found no intragenic recombination in the nuclear sequences among these three species. The data from the three plastid fragments were combined because chloroplast genes are generally linked and no evidence of recombination between the three regions was found.

To run IMa2, one random haplotype per plant individual was chosen for the nuclear gene data sets, and the plastid data set was composed of sequences from the same plant individuals. This avoids bias but decreases the amount of information and thus leads to broader posterior distributions.
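The four-gamete criterion mentioned above can be illustrated with a short Python sketch. Under an infinite-sites assumption, two biallelic sites that show all four allele combinations across the sampled haplotypes imply at least one recombination event between them. This is an illustration only and not the software used in the study; the function name `four_gamete_violations` and the choice to skip any site containing gaps or more than two states are assumptions.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return pairs of site indices at which all four two-site gametes occur.

    `haplotypes` is a list of aligned, upper-case sequences of equal length.
    Only strictly biallelic sites made up of A, C, G, T are tested; sites
    with gaps or more than two states are skipped (a simplifying assumption).
    """
    if not haplotypes:
        return []
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")

    # Collect the biallelic sites that can be tested.
    biallelic = []
    for i in range(length):
        states = {h[i] for h in haplotypes}
        if len(states) == 2 and states <= set("ACGT"):
            biallelic.append(i)

    # A site pair violates the criterion if all 2 x 2 combinations are seen.
    violations = []
    for i, j in combinations(biallelic, 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations
```

An empty result for a locus is consistent with the absence of detectable intragenic recombination, which is the condition the IM model requires.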
The IS (Infinite Sites) model of sequence evolution was chosen for the plastid locus, whereas the HKY model, which allows for multiple substitutions, was selected for the two nuclear loci because double mutations were found at a few polymorphic sites at both loci. The inheritance scalar was set to 1.0 for the nuclear loci and to 0.25 for the plastid locus.

To set upper bounds on the prior distributions of the parameters, we estimated for each of the three species the geometric mean of the population mutation rate 4Nu across all three loci using Watterson's estimator θ (per sequence, not per site). The largest mean value was found for A. asiatica-2x (an estimate of 4Nu = 9.8205), and this was used to set the upper bound on the uniform prior for each of the three demographic parameters: population size (θ = 4Nu), splitting time (t = Tu, where T is the time in generations since common ancestry, which is of the same order as 4N) and migration rate (2NM = 4Nu × m/2). The priors were finally set as follows: upper bounds of q = 100 for population sizes, t = 5 for splitting times and m = 2.0 for migration rates. We ran the Markov chain Monte Carlo (MCMC) simulations with 1,000,000 burn-in steps and 20,000 genealogies sampled per locus. The analysis was done with 10 independent runs in M mode, each using identical priors and 20 Metropolis-coupled chains with different random number seeds. The genealogies sampled from the M mode runs were combined in an L mode run to build an estimate of the joint posterior probability of the parameters. […]
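The per-sequence Watterson estimate and its geometric mean across loci, as used above to guide the prior bounds, can be sketched in Python as follows. Watterson's estimator is θ_W = S / a_n, where S is the number of segregating sites and a_n is the sum of 1/i for i = 1 to n-1 over n sampled sequences. The function names and the example counts below are hypothetical and are not taken from the study's data.

```python
import math


def watterson_theta(num_segregating_sites, num_sequences):
    """Watterson's estimator of 4Nu per sequence (not per site)."""
    if num_sequences < 2:
        raise ValueError("at least two sequences are required")
    # a_n = 1 + 1/2 + ... + 1/(n - 1)
    a_n = sum(1.0 / i for i in range(1, num_sequences))
    return num_segregating_sites / a_n


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical (S, n) pairs for three loci in one species.
    loci = [(12, 10), (9, 10), (5, 10)]
    thetas = [watterson_theta(s, n) for s, n in loci]
    print("per-locus theta:", thetas)
    print("geometric mean across loci:", geometric_mean(thetas))
```

The geometric mean is less sensitive than the arithmetic mean to a single locus with an unusually high estimate, which is one reason it is a common choice when averaging θ across loci with different mutation rates.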

Pipeline specifications