Computational protocol: Inferring Kangaroo Phylogeny from Incongruent Nuclear and Mitochondrial Genes

Similar protocols

Protocol publication

[…] The primary mitochondrial dataset (Mt16) combines the NADH1, NADH2 and Cytb protein-coding genes with the 12S and 16S rRNA genes. The sequences were initially aligned in ClustalW2 with penalties of 5 for gap opening and 0.2 for gap extension. Manual adjustments were then made in Se-Al 2.0a , where sites with ambiguous homology were excluded, leaving a 5,593 bp mtDNA alignment. An expanded matrix (Mt17) includes M. dorsalis; although poor tissue preservation resulted in low DNA yields and the sequence is 72% complete, missing 147 bp of NADH2, all of Cytb and 269 bp of the rRNA genes. The only mt sequence available on GenBank for M. antilopinus was included in a 1,146 bp Cytb18 matrix (with the taxa from Mt16 and an additional W. bicolor sequence) and was sufficient to confidently place M. antilopinus as sister to M. robustus, in agreement with nuclear genes ().Previous research has demonstrated a requirement to ameliorate nucleotide compositional biases among marsupial mt genomes for phylogenetic inference of ordinal level relationships , . By contrast, the present focus on closely related genera is relatively shallow. Our composition homogeneity χ2 testing on Mt16 in PAUP* 4.0b10 with uninformative and gapped sites excluded offers little evidence for base compositional non-stationarity among the Macropus and Wallabia ingroup (protein 1st codons: P = 0.4222, 2nd codons: P = 0.9708, 3rd codons: P = 0.5810, RNA stems: P = 0.9941, RNA loops: P = 0.3400).We analyse the mt sequences alongside the nuclear dataset of Meredith et al. , which includes protein-coding segments from BRCA1 (exon 11, breast and ovarian cancer susceptibility gene), ApoB (exon 26, Apolipoprotein B), IRBP (exon 1, interphotoreceptor retinoid binding protein gene), RAG1 (intronless recombination activating gene-1) and vWF (exon 28; vonWillebrand factor gene). Aligning the nuclear sequences followed the procedure described above for the mtDNA. Two 5,988bp matrices were constructed, Nuc16, with taxon sampling matching Mt16 and also Nuc17, which further includes M. antilopinus. Combined analyses (MtNuc16) concatenated the Mt16 and Nuc16 matrices. [...] Kangaroo phylogeny was inferred under maximum likelihood (ML) and Bayesian inference from the mitochondrial and nuclear sequences separately and concatenated, as well as for the individual nuclear genes. Substitution model categories for each data partition employed the more general of the jModelTest 0.1.1 hLRT or AIC recommendations () or the next most general available for each phylogenetic inference program. Substitution was modelled separately among the mt protein-coding codon positions and RNA stem and loop sites.Our initial efforts to reconstruct kangaroo phylogeny employed Bayesian inference in MrBayes 3.1.2 and ML in RAxML vGUI093 . MrBayes analyses ran two independent sets of two MCMC chains for 6,000,000 (Nuc17, MtNuc16) or 4,000,000 (Mt16, Mt17, Cytb18 and individual nuclear genes) generations, with trees sampled every 2,500 generations. Burn-ins varied from 500,000 to 1,200,000 generations, and were chosen to ensure that –lnL had plateaued, clade frequencies had converged between runs and estimated sample sizes for substitution parameters were >200 (using Tracer v1.5 ). ML analyses in RAxML carried out 500 full bootstrap replicates. Branch-length multipliers and substitution models were partitioned among protein codons and RNA stems and loops for each of the ML and Bayesian analyses, with MtNuc16 further partitioned between mt and nuclear sites.Support among alternative topologies was further examined with the approximately unbiased (AU) test , using the RELL method (100,000 replications) within CONSEL . Site likelihoods employed in CONSEL were inferred in PAUP*, with all substitution parameters and branch-lengths ML optimized separately for each of the protein codon and RNA structural partitions, for each tree hypothesis. Maximum likelihood trees conforming to the alternative Mt16 and Nuc17 placements of W. bicolor and M. irma () were identified for each gene in PAUP* with 20 random addition heuristic searches. Support among these individual genes for the alternative placements was compared with SH tests , which as pairwise comparisons reduce to equivalency with AU and KH tests. ML bootstrapping (500 replicates) for each gene was also performed in PAUP* with the substitution model parameters estimated in the earlier heuristic searches.We estimated a mitochondrial timescale for kangaroo evolution using BEAST v.1.6.1 with Mt16 partitioned as per the phylogenetic analyses. An uncorrelated relaxed clock model was used with rates among branches distributed according to a lognormal distribution. Note that likelihood ratio tests in PAUP* rejected strict clocks for both the mt and nuclear sequences (P<0.01). Four independent runs totalling 40,000,000 MCMC generations ensured estimated sample size values >100 (as estimated in Tracer v1.5) for all node height, prior, posterior, −lnL, tree, and substitution parameters. Chains were sampled every 5,000 th generation after burn-ins of 1,000,000 generations.Four fossil-based priors were used to calibrate the BEAST analysis. (i) Potoroidae/Macropodidae (15.97–28.4 Ma), with the minimum based on the Early Miocene macropodid, Ganguroo and the maximum covering putative Late Oligocene macropodids and potoroids . (ii) Macropodidae (11.6–23 Ma), with the Middle Miocene macropodid, Wanburoo providing the minimum bound and the maximum acknowledging that earliest Miocene macropods fall outside of the macropodid crown. Ganguroo is also a candidate for calibrating Macropodidae, however, its placement within this crown clade is not well resolved . (iii) Dendrolagini (4.46–16.0 Ma), with the minimum based on the Hamilton fauna Dendrolagus and the maximum bound recognising that all middle Miocene macropods fall outside the Dendrolagini crown. (v) Macropus/Lagorchestes (4.46–16.0 Ma), with the minimum based on Hamilton fauna Macropus and the maximum bound recognising that all middle Miocene macropods fall outside of this crown clade.Palaeontological data do not clearly favour any particular timing within the given bounds for calibrations (i), (iii) and (iv) and hence, flat priors were employed. A normal prior was employed for calibration (ii), in line with the recommendation of Ho and Phillips for when the balance of evidence , , suggests crown divergences fall well within the bounds. The normal prior was applied conservatively (90% of prior probability between the bounds). [...] To identify incongruence between partitions we performed the incongruence length difference test in PAUP*. This test however, can be biased in cases where parsimony is statistically inconsistent . To overcome this concern we also perform likelihood-based parametric bootstrap tests. For these we infer one ML score (MLF) with branch lengths and the models (as shown in ) partitioned across genes, but assuming a single topology (T*) and another ML score (MLV) for which the topology is allowed to vary across genes. The difference between MLF and MLV provides a critical value for testing the null hypothesis that all genes evolved on the same phylogeny. Next we used Seq-Gen 1.3.2 to simulate 200 datasets partitioned into the original mtDNA and five nuclear gene sequence lengths and evolved on topology T* with the original branch lengths and model parameters for each gene. The distribution of MLF - MLV differences from the 200 simulated datasets was then compared with the critical value from the original dataset. [...] *BEAST analysis within BEAST v1.6.1 employed the multi-species coalescent to infer the species tree underlying the mt and five nuclear gene trees coestimated from MtNuc16. The mtDNA was further partitioned into protein codon positions and RNA stems and loops for substitution modelling. Separate mt and nuclear uncorrelated lognormal relaxed clock models were used, with a Yule species tree prior and differential ploidy (autosomal nuclear and haploid mitochondrial). Eight independent runs totalling 80,000,000 MCMC generations ensured estimated sample size values >100 (estimated in Tracer v1.5) for all node height, prior, posterior, −lnL, tree, and substitution parameters. Chains were sampled every 5,000 th generation after burn-ins of between 1,000,000 and 4,000,000 generations. Given our focus on phylogeny rather than dating, we did not calibrate the *BEAST analysis, and so avoid potentially misleading influences of calibration priors on clade posterior probabilities . Instead we provided a nominal mean substitution rate of 1.00 with unspecified time units for the nuclear data and allowed the mt rate to vary relative to this.Minimizing deep coalescences, MDC trees were inferred under the dynamic programming mode in PhyloNet 2.4 from the mtDNA and five nuclear gene trees (, ), which were each estimated under ML bootstrapping in PAUP*. We collapsed branches that received <50% bootstrap support in these source trees to reduce the influence of stochastic artefacts among the individual genes on MDC tree building.Bayesian concordance analysis within BUCKy was run on 500 bootstrap replicate trees inferred in RAxML, for each of the six loci. Bootstrap distributions typically reflect stochastic variation in the gene tree estimates more closely than do Bayesian posterior distributions , . Otherwise, BUCKy analyses employed default parameters, except where stated. [...] In order to better understand whether ILS could plausibly account for incongruence among gene trees we simulated the evolution of the five nuclear genes and the mtDNA under a coalescent process in MCcoal within BPP 2.1 . Two alternative guide trees were used for MCcoal, the combined data and nuclear-only *BEAST species trees. In the former case M. irma was excluded from the *BEAST analysis because of the concern that its mtDNA derives from introgression (see ), which violates the assumptions of *BEAST. Instead, M. irma was grafted onto the tree – its temporal placement along the stem lineage from the other Notamacropus members was scaled in proportion to the nuclear-only *BEAST tree. For comparability the combined data and nuclear-only guide trees were both scaled to a root height of 20 Ma, closely matching both the mt BEAST estimate (21.3 Ma) and Meredith et al. 's estimates from the concatenated nuclear genes (ave. BEAST estimate, 20.0 Ma).The model that MCcoal simulates under (JC+Γ) is less complex than the models selected in jModelTest for each locus. Therefore we used a two-step simulation process (illustrated in ). First, MCcoal was run on the species guide tree to provide simulated coalescent trees. Gene sequences were then simulated on these coalescent trees under their respective substitution models () in Seq-Gen. All model parameters were estimated from the original data and the simulations maintained the aligned sequence length for the mtDNA and each nuclear gene.MCcoal requires a population dynamics parameter θ = 4Neμ (2Neμ for mtDNA), where Ne is the effective population size and μ is the mutation rate per site per generation. Mutation rates per site per year were obtained by scaling PAUP* ML treelengths for each locus to the *BEAST timetree length. Generation time across macropods is not well studied, but we used an average of 7 years in consideration of life history data from most macropodid species . The influence of effective population size was evaluated with Ne varied from 1,000 to 1,000,000. […]

Pipeline specifications

Software tools Clustal W, Se-Al, PAUP*, jModelTest, MrBayes, RAxML, CONSEL, BEAST, Seq-Gen, PhyloNet, BPP
Applications Phylogenetics, Nucleotide sequence alignment
Organisms Osphranter robustus