## Similar protocols

## Protocol publication

[…] Sequences of mtDNA were multiply aligned using Clustal X 2.0 after manual editing and checking using DNASTAR software. A combined sequence was constructed by concatenation of Cyt b + CR to combine the phylogenetic signal to maximise the possibility of finding an optimal topology.Population genetic analysis was carried out on Cyt b, CR and Cyt b + CR sequences, respectively. Haplotype diversity (Hd), nucleotide diversity (π), number of segregating sites (S), number of haplotypes (h) and number of nucleotide differences (K) were estimated for the whole data set and for each population using **Arlequin** 3.11 and **DNASP** 5.0. **Popart** was used to reconstruct phylogenetic relationships among haplotypes with a median-joining network method. Because Cyt b + CR provided most information we focus primarily on the concatenated sequence.Hierarchical analysis of molecular variance (AMOVA) was performed to test for genetic differentiation between regions (west/east) or among populations using Arlequin 3.11. Statistical significance of fixation indices (Φ) was tested with 10,000 permutations. Genetic distances within/between populations based on the K2P model were calculated using **MEGA** 5.0. **GenAlEx** 6.5 was used to examine isolation by distance (IBD) by testing for a linear relationship between pairwise population ΦST/(1 − ΦST) statistics and log (geographical distances (km)) for all seven populations, and for the six western populations ().To determine relationships among populations and if there was any evidence of historical dispersal, phylogenetic analysis was carried out on Cyt b, CR and the concatenated data set (Cyt b + CR) independently using neighbour-joining (NJ) and maximum likelihood (ML) in MEGA 5.0. The best-fitting nucleotide substitution model with the lowest Bayesian information criterion (BIC) score was determined using MEGA 5.0. ML analyses were performed under the TN93 + G (Cyt b), T92 + G + I (CR) and HKY + G + I (Cyt b + CR) model. In the NJ analysis, the K2P model was used. The reliability of nodes was assessed using 1000 bootstrap replicates. The congeneric species S. waltoni (GenBank accession no. JX202592), S. macropogon (GenBank accession no. KC020113) and Schizopygopsis younghusbandi (GenBank accession no. KC351895) were designated as outgroups.A continuous phylogenetic analysis was conducted using **BEAST** 2.3.2 to incorporate spatial and temporal components into the phylogeny. Analyses were performed with Cyt b and the concatenated data set (Cyt b + CR) independently. The clock model and tree prior model were compared by calculating Bayes factor (BF). BF was estimated according to the marginal likelihood estimated by using the smoothed harmonic mean estimator, as implemented in the Tracer 1.5 (http://beast.bio.ed.ac.uk/Tracer). In this study, we tested two clock models: a strict clock and an uncorrelated lognormal relaxed clock. Two tree prior models, Yule process speciation and coalescent Bayesian Skyline, were also compared using BF calculation. As a result, all the data sets were analysed using a HKY model of nucleotide substitution under an uncorrelated exponential relaxed clock with a coalescent Bayesian Skyline tree prior model. The data of each model are shown in and . Unfortunately, the timing of the diversification processes cannot be inferred with accuracy due to a lack of fossils in S. o’connori and hindered by the exhibition of different mutation rates by different lineages, therefore the molecular dating based on a fixed rate of mutation is not particularly suitable in this fish but at least gives us an idea about the potential age of diversification events. Nucleotide substitution rates for Cyt b and CR were assumed to be 1.0% and 3.6% per site per million years according to the previous study. We assumed the previously established average nucleotide substitution rate of 1.69% per site per million years that has been calibrated for concatenated sequence (Cyt b + CR) in schizothoracine fishes. The Markov chain Monte Carlo (MCMC) analyses were run for 100 million generations, with parameters logged every 20,000 generations. Convergence was assessed from the effective sample size (ESS) after a 15% burn-in using Tracer 1.5. ESS values of parameters of interest above 200 were accepted. Uncertainty in the estimates was indicated by the 95% highest posterior density (95% HPD) intervals. The first 10% of the generations were discarded as burn-in, and the rest were retained as valid samples for further analyses. The maximum clade credibility tree was generated by TreeAnnotator 2.3.2 and was viewed in **FigTree** 1.4 (http://tree.bio.ed.ac.uk/software/figtree/) with S. waltoni and S. macropogon used as outgroups.Historical population dynamics were estimated using coalescent-based Bayesian skyline plots (BSPs) and mismatch distributions. BSPs were performed using BEAST 2.3.2 to describe demographic history by assessing the time variation of effective population size. All the sequences data and similar settings as above were used in the BSPs analysis. The results of BSPs were visualized in Tracer 1.5, which summarized the posterior distribution of population size over time. Mismatch distributions were implemented in Arlequin 3.11. In addition, Fu’s Fs tests were used to determine if the two mtDNA sequences conformed to the expectations of neutrality. The sum of squared deviations (SSD) and the raggedness index (r) were estimated in Arlequin 3.11 to determine if sequences deviated significantly from a model of population expansion. If evidence of population expansion was found, the possible number of generations since population expansion (t) was estimated from the equation t = τ/2u, where u is neutral mutation rate for the entire sequence per generation and is calculated as u = 2 μkTgen, where μ is the mutation rate per nucleotide per year or million years; k is the average sequence length of the mtDNA region under study; and Tgen is the generation time in years. The approximate time of population expansion in years (T) was calculated by multiplying the number of generations (t) by the generation time (Tgen). Generation time for S. o’connori was estimated to be 10 years based on approximate time at which females become mature. We only consider the female age because the mitochondria of males are not transmitted to offspring. [...] Because it is unknown if S. o’connori is an autotetraploid or an allotetraploid species the microsatellite data were treated as phenotypic data (describing the identity and presence of observed bands, but not band copy numbers) and were analysed in two ways. First, all bands at each locus were combined to form multilocus allelic phenotypes for each individual. These phenotypes record the presence and identity of each allele, but are not genotypes because they contain no information about allele copy numbers unless individuals are homozygous (four identical alleles giving one band) or have four different alleles. Second, each microsatellite locus was treated as dominant and then the multilocus allelic phenotypes were transformed into binary - presence (1) or absence (0) - for each individual. Although there is some information loss using binary data, such data are free from unrealistic assumptions about allelic identities and may be used in analysis of molecular variance (AMOVA). These approaches are routinely employed to evaluate genetic diversity and population structure of polyploidy species.Binary allelic diversity was measured as total number of inferred bands (Total bands) and number of bands unique to a single population (Private bands) and they were calculated using GenAlEx 6.5. Nei’s gene diversity index (H), Shannon’s information index (I), and the percentage of polymorphic loci (PPL) were calculated using **POPGENE** and GenAlEx 6.5. Estimation of genetic differentiation (FST) and analysis of molecular variance (AMOVA) were performed using Arlequin 3.11.Estimation of genetic distance between populations of polyploid organisms is problematical because there is no accepted methodology for the calculation of genetic distance with binary data. It is however, reasonable to use genetic distance indices as descriptive measures, because banding patterns in polyploids specify phenotypes. Thus, Nei’s unbiased genetic distance was calculated based on binary matrix data using GenAlEx 6.5. The relationships between individuals and populations (PCoA) based on pairwise Euclidean distances were also performed using GenAlEx 6.5. Isolation by distance (IBD) was assessed as previously described. Using the original genotypic data, a Bayesian clustering method implemented in STRUCTURE 2.3.4 was used to determine the most likely number of genetic clusters, regardless of site of collection. Given the nature of the polyploidy microsatellite data set it cannot be known if assumptions for STRUCTURE such as Hardy-Weinberg equilibrium and no linkage disequilibrium are met, so we employ this approach as an exploratory analysis. The admixture model was chosen, allele frequencies were assumed to be correlated and analyses were conducted with a burn-in of 10,000 and followed by 100,000 MCMC repetitions. Ten independent runs were carried out for each cluster set (K) from 1 to 7. Further, we used the ΔK metric to determine the statistically most supported number of clusters as implemented in the web interface **STRUCTURE** HARVESTER (http://taylor0.biology.ucla.edu/structureHarvester/). […]

## Pipeline specifications

Software tools | Clustal W, Arlequin, DnaSP, PopART, MEGA, GenAlEx, BEAST, FigTree, POPGENE, Structure Harvester |
---|---|

Applications | Phylogenetics, Population genetic analysis |

Organisms | Danio rerio |