Computational protocol: Tectonic collision and uplift of Wallacea triggered the global songbird radiation

Similar protocols

Protocol publication

[…] We de-multiplexed raw reads using CASAVA ver. 1.8.2 and trimmed low-quality bases and adaptor sequences from reads using illumiprocessor ver. 1 (https://github.com/faircloth-lab/illumiprocessor) which batch processed reads using Scythe (https://github.com/vsbuffalo/scythe) and Sickle (https://github.com/najoshi/sickle). We used the Python package PHYLUCE for subsequent data processing. We assembled cleaned reads into contigs using Trinity ver. r2013.08.14 (ref. ) and extracted contigs for each taxon that matched UCE loci. We assembled an incomplete data set containing UCE loci that were present in at least 75% of all 106 taxa for maximum likelihood (ML), Bayesian (BI), and SVDquartets analyses. For gene tree-based species tree analyses, we assembled complete data sets consisting of loci common to all taxa being analysed. For each data set, we aligned each locus using MAFFT, and we trimmed resulting alignments to allow missing nucleotides at the flanks of each alignment only if at least 65% of taxa contained data, which is the default option in PHYLUCE. We further trimmed uncertain alignment regions using Gblocks with default parameters except for the minimum number of sequences for a flank position in Gblocks, which we set at 65% of taxa.Protein-coding genes can provide convergent signal and base-composition bias in phylogenomic data sets. We undertook two approaches to address this potential concern. First, using BEDTools we aligned the UCE probes to known chicken protein-coding genes and identified UCE loci whose probe(s)+300 bp of flanking region overlap with these genes. Of the 4,155 UCE loci recovered in the incomplete matrix, only 316 (7.6%) overlapped with chicken protein-coding genes. We assembled a data set without these 316 loci. Second, we recoded the incomplete matrix as purines and pyrimidines (RY-coding). We analysed both data sets with a maximum likelihood approach (; see below). [...] We performed maximum likelihood (ML) inference on the concatenated loci of the incomplete and complete data sets using RAxML ver. 8.1.3 (ref. ) assuming a general time reversible model of rate substitution and gamma-distributed rates among sites. We evaluated node support using 500 rapid bootstrap replicates. We tested for convergence of bootstrap replicates a posteriori using the ‘autoMRE' option in RAxML. For BI, we used Exabayes ver. 1.4.2 (ref. ) with loci partitioned according to evolutionary model (determined with Cloudforest http://github.com/ngcrawford/cloudforest). We unlinked state frequencies, rate heterogeneity among sites, and the substitution matrix among partitions; we linked branch lengths across partitions. Exabayes analyses tended to converge on single topologies with no subsequent topology swaps accepted when using default settings. To more thoroughly search parameter space, we adjusted the temperature of the heated chains (heatFactor), topology proposals (for example, increase parsimonySPR and likeSPR) and parameters for various moves (parsSPRRadius, parsimonyWarp, likesprmaxradius, and likesprwarp). Changes in these settings often represented a trade-off between more thorough tree searching at the expense of greatly increased computation time, especially increasing the likelihood SPR settings. We eventually converged on settings with more liberal parsimony criteria (for example, radius 10,000, increased parsimony proposals (∼20%), etc.) and moderate likelihood criteria (for example, radius 10, modest likelihood proposal frequency (∼5%), etc.) that prevented getting stuck in local optima while remaining computationally tractable. All Exabayes analyses included multiple independent runs (2 runs, each with one cold and two heated chains); we assessed convergence between runs with the average standard deviation of split frequencies <0.01, plots of likelihood and parameter estimates, the potential scale reduction factor close to one (<1.1), and effective sampling size (ESS) >200.To obtain a species tree estimate that did not rely on prior inference of individual gene trees, we used SVDquartets, a method that analyzes quartets of species in a coalescent framework using singular value decomposition of the matrix of site pattern frequencies and then assembles a species tree from the quartets using a supertree method. This method was developed for analysis of unlinked single nucleotide polymorphism data, but extended to analysis of DNA sequences, especially large, multi-locus data sets. We generated 100 bootstrap replicates of the data using RAxML ver. 8.1.3 (ref. ). For each bootstrap replicate we generated all quartets using the implementation of SVDquartets in a test version of Paup, ver. 4.0a146 (ref. ) and then assembled the species tree using the quartet max-cut method. For gene tree-based coalescent methods for species tree analysis, we performed gene-tree inference and bootstrapping with RAxML ver. 8.1.3 (ref. ) using the Python package PHYLUCE. We modified the PHYLUCE scripts to implement multi-locus bootstrapping (that is, sampling with replacement of loci and sites), and we generated 500 multi-locus bootstrap replicate sets of gene trees for each data set. On each replicate set of gene trees, we ran four methods using default parameters: STAR and STEAC as implemented in the R package phybase ver. 1.3 (ref. ), MP-EST ver. 1.4 (ref. ), and ASTRAL ver. 4.7.7 (ref. ). ASTRAL is not strictly a coalescent method, but it is statistically consistent with the multispecies coalescent model, and as such, other authors have included it in the family of coalescent-based gene tree methods. Command line, Python, and R scripts used to process the data and run species tree analyses are available from https://github.com/carloliveros/uce-scripts. [...] Most previous analyses of passerine diversification used a putative biogeographic event as a calibration for dating (that is, the separation of NZ and Australia at ∼80 Myr ago to date the node separating Acanthisitta from all other passerines), but incorporated no geological information in their biogeographic reconstructions. For example, Wallacea and New Guinea are unconstrained and allowed to occur as ancestral areas (distributions) anywhere in the phylogeny, even though these regions formed recently compared with continental areas and the inferred time frame of bird evolution. We took the opposite approach for biogeographic analysis, using biogeography independent calibrations derived from fossils for dating and applying geologically informed palaeogeographic models for biogeographic reconstruction.The fossil record for passerine birds is sparse and many described fossil taxa lack information necessary for use as informative calibration points. A recent genomic study of birds used fossil calibrations from a variety of bird orders to calibrate the avian phylogeny. The resulting Jarvis et al. tree sparsely sampled passerines (five species) but included nodes shared with our sampling, such as the split between Corvides and Passerides, which we included in our dating analysis as a normal distribution with a mean of 21 Myr ago and 95% confidence interval of 17–26 Myr ago based on their results. Another recent genomic study of birds produced a timescale similar to, but slightly older than that of Jarvis et al. We also used an alternative calibration (normal distribution with mean 28 Myr ago and 95% confidence interval of 18–38 Myr ago) from Prum et al. for the same node (split between Corvides and Passerides) in separate analyses to test the influence of these different dates. The dates from Jarvis et al. have been criticized as enforcing a lower bound on the phylogeny that was too recent, which might force unrealistically recent dates on the entire phylogeny. However, further analyses with older lower bounds produced almost identical dates for the passerine nodes. Because secondary calibrations are a step removed from fossil calibration data, we also included a calibration based on fossils of the oscine Orthonyx kaldowinyeri for two reasons. First, multiple fossils of this distinctive taxon have been described, alleviating concerns about potential misidentification. Second, these fossils are the earliest identified crown-oscines (Oligo–Miocene), and are from the well-characterized Riversleigh World Heritage Area of Australia, which provides broad comparative context for dating the fossils. For this calibration we used a truncated Cauchy distribution with minimum age bound of 24 Myr ago and a relatively flat distribution (offset P=1.0, scale c=5.0). To constrain the base of the tree, we applied a maximum divergence of 45 Myr ago when using the Jarvis et al. secondary calibration and 55 Myr ago when using the Prum et al. secondary calibration. These lower bounds are older than the lower end of the 95% confidence interval of dates for the split between oscines and suboscines in the respective papers. We inferred dated phylogenies with the complete UCE data set, a constrained ML topology, and two different sets of calibration points (Orthonyx fossil with either the Jarvis et al. or Prum et al. estimates for the Corvides-Passerides split and bounds on the base of the tree) using MCMCTree in PAML ver. 4.8 (ref. ). Divergence times were estimated using the independent rates model in MCMCTree and a birth–death process with species sampling. We applied the HKY85+GAMMA model of nucleotide substitution with four rate categories, the most complex model implemented in MCMCTree, but one that accommodates rate variation among nucleotide sites. Inclusion of variable site rates is likely important for age estimation with UCE data because UCEs display a unique pattern of rate variation, with few substitutions in the core UCE region and more substitutions in flanking regions. Because of the size of the data matrix, we used approximate likelihood calculations with branch lengths estimated using the baseml programme of PAML. […]

Pipeline specifications