Computational protocol: C4 Photosynthesis Promoted Species Diversification during the Miocene Grassland Expansion

Protocol publication

[…] The majority of recent phylogenetic work in Poaceae has focused on specific subfamilies or genera and has employed a variety of fast-evolving chloroplast and nuclear markers. These studies have produced a wealth of sequence data for Poaceae, but many markers are both poorly sampled and difficult to align across the entire clade. To circumvent the phylogenetic problems that arise from such data, specifically poor alignments, large amounts of missing sites, and large matrices ill-suited to computationally intensive analyses, we subdivided the tree-building approach. Fourteen sub-trees were constructed separately and subsequently inserted into a fossil-calibrated backbone phylogeny. This approach relies heavily on recent work in the grasses that has resolved deep relationships among the subfamilies and clarified discrepancies among molecular dating efforts.
Sequence data were collected from GenBank with the PHLAWD tool, using the plant GenBank database generated on March 12, 2012. To avoid synonymy problems, all genus names were converted to those accepted by the Kew taxonomic database, using the GrassBase synonymy database. Because the taxonomic classification in GenBank is not consistent with the latest developments in grass taxonomy, clades based on GenBank names are not always monophyletic. Species were therefore sorted into groups based on previous studies and inspected on preliminary phylogenetic trees as necessary. In general, monophyletic groups were defined to correspond to traditionally recognized clades. The Bambusoideae, Ehrhartoideae, Chloridoideae, Danthonioideae, Andropogoneae, Paspaleae, and Paniceae were all used. The species-poor sister clades Arundinoideae and Micrairoideae were combined, as were the outlying Panicoideae sensu GPWGII 2012.
The Pooideae was too large to analyze in one piece, so after marker selection, three monophyletic clades were separated from the Pooideae backbone and each was analyzed individually. Two representatives of each separated clade were retained with the remaining backbone Pooideae so that their monophyly and divergence date could be constrained and the separated lineage could be reinserted later. PHLAWD was then used to create alignments for the most frequently sampled gene regions in each of the 14 clades, using a coverage threshold of 0.4 and an identity threshold of 0.1. The three plastid markers matK, ndhF, and rbcL were included in each group, and an additional 2 to 10 gene regions were added depending on the group sampled. In total, 35 gene regions were incorporated in the analysis.
Once the alignments were complete, trimAl was used to remove sites with more than 70% missing data for each gene region, and MEGA was used to manually edit the alignments where necessary. In each group, the alignments were concatenated with Phyutility and species names were checked against the GrassBase synonymy database. A small number of names were referenced in Tropicos but not in GrassBase, and were consequently considered to be recently described species. Synonyms, misspellings, subspecies, and varieties were manually removed whenever possible to leave a single representative sequence per accepted species. At this point, RAxML was used to build a tree with 20 maximum likelihood searches, retaining the tree with the highest likelihood score among them. The phylogeny inferred for each group was manually inspected to identify taxa with very long branches, which represent potential errors. The sequences of these taxa were checked by BLAST searches against GenBank, and putatively erroneous sequences, corresponding to either sequencing or identification errors, were removed. [...]
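The site-filtering step described above (dropping alignment columns with more than 70% missing data) can be illustrated with a minimal Python sketch. This is not trimAl's implementation: the function name `drop_sparse_columns`, the choice of `-` and `?` as the missing-data characters, and the toy alignment are all assumptions made for illustration.

```python
# Illustrative sketch only; trimAl itself was used in the protocol.
# Assumption: "-" and "?" are the characters that count as missing data.
MISSING = {"-", "?"}


def drop_sparse_columns(alignment, max_missing=0.7):
    """Return a copy of the alignment without columns whose fraction of
    missing characters is strictly greater than max_missing.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must have the same aligned length")

    keep = []
    for i in range(length):
        n_missing = sum(1 for n in names if alignment[n][i] in MISSING)
        if n_missing / len(names) <= max_missing:
            keep.append(i)

    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Hypothetical toy alignment: the fourth column is missing in every taxon.
toy = {
    "sp1": "ATG-C",
    "sp2": "AT--C",
    "sp3": "A-G-C",
    "sp4": "ACG-T",
}
filtered = drop_sparse_columns(toy)
print(filtered)
```

Columns with some missing data but no more than the threshold are retained, so a sparsely sampled marker still contributes the sites that most taxa share.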
To estimate the age of the main grass lineages, dating analyses were first performed with a dataset of three previously sampled chloroplast genes and 543 taxa covering the entire grass family. BEAST 1.7.2 was run under a GTR+G+I substitution model, a Yule process for the prior distribution of node ages, and a log-normal distribution for the prior on evolutionary rates among branches. Time-calibrated trees were obtained under two contrasting hypotheses for the placement of fossils. Under calibration #1, which is based only on macrofossil calibrations and does not take into account fossil phytoliths, whose placement is somewhat controversial, the crown age of the BEP-PACMAD clade followed a normal calibration density with a mean of 51.2 Ma and a standard deviation of 6.0 Ma. Under calibration #2, which incorporates fossil phytoliths, the age of this same node followed a normal calibration density with a mean of 82.4 Ma and a standard deviation of 7.5 Ma. In this second analysis, we also constrained the stem of Oryzeae to obtain dates compatible with phytolith fossil evidence, using an exponential distribution with a mean of 10 Ma offset by 67 Ma. For these two analyses, the topology was not fixed, except for the monophyly of the ingroup (all taxa except Pharus). Trees were sampled every 5,000 generations for 15,000,000 generations after a burn-in period of 5,000,000 generations. Convergence, effective sample size, and the adequacy of the burn-in period were assessed using Tracer 1.5.
A phylogeny was then inferred separately for each previously defined group of grasses using BEAST as described above. Crown node ages were fixed (uniform prior with a range of 0.01 around the fixed value) to the dates obtained from the Bayesian consensus phylogeny estimated from the 543-taxon dataset (above), under calibration #1. All trees were then scaled to match the dates under calibration #2.
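To make the calibration densities above more concrete, the following Python sketch computes central 95% intervals for the two normal priors and quantiles for the offset exponential prior on the Oryzeae stem. The function names are hypothetical, and the sample count assumes one tree is retained per 5,000 generations over the 15,000,000 post-burn-in generations. It only summarizes the stated priors; it does not reproduce any BEAST output.

```python
import math
from statistics import NormalDist


def normal_interval(mean, sd, level=0.95):
    """Central interval of a normal calibration density (ages in Ma)."""
    dist = NormalDist(mu=mean, sigma=sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def offset_exponential_quantile(p, mean, offset):
    """Quantile of (offset + X), where X is exponential with the given mean."""
    return offset - mean * math.log(1.0 - p)


# Calibration #1 (macrofossils only) and #2 (with phytoliths),
# both applied to the crown age of the BEP-PACMAD clade.
cal1 = normal_interval(51.2, 6.0)
cal2 = normal_interval(82.4, 7.5)

# Oryzeae stem under calibration #2: exponential, mean 10 Ma, offset 67 Ma.
oryzeae_median = offset_exponential_quantile(0.5, mean=10.0, offset=67.0)
oryzeae_upper = offset_exponential_quantile(0.95, mean=10.0, offset=67.0)

# Approximate number of post-burn-in samples (assumption: one per 5,000 generations).
n_samples = 15_000_000 // 5_000

print(cal1, cal2, oryzeae_median, oryzeae_upper, n_samples)
```

The offset exponential places no prior mass below 67 Ma, which is how a fossil acts as a hard minimum age while still allowing older dates with decreasing probability.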
All subsequent analyses were performed on both sets of time-calibrated phylogenetic trees. The monophyly of the ingroup was enforced to ensure proper rooting. For each dataset, two independent Markov chain Monte Carlo analyses were run for 10–50 million generations, sampling every 1,000–5,000 generations, depending on the size of the dataset. Convergence, effective sample size, and the adequacy of the burn-in period were assessed using Tracer. A burn-in period of 2,500,000–6,000,000 generations was chosen, again depending on the size of the dataset. For clades of over 150 taxa, convergence from random starting trees was extremely slow, so the best of our previous 20 maximum likelihood RAxML trees was dated using non-parametric rate smoothing in r8s and used as the starting point for each run.
For each group, the maximum clade credibility tree was selected with TreeAnnotator, and the node heights of this tree were scaled in R to match each of the dating hypotheses by multiplying all branch lengths by the ratio (hypothesis root age / current root age). The calibrated phylogenetic trees were then manually inserted into the associated backbone phylogeny of 543 grasses, preserving the deep relationships among the groups and forming a set of all-inclusive, ultrametric phylogenies with 3,595 species each. With 544 genera represented, this tree contains more than 29% of the species and 71.2% of the recognized genera in Poaceae. Of the missing genera, only 6 have more than 10 species.
To take into account both phylogenetic uncertainty and variation in dating hypotheses, we repeated the diversification analyses on 100 topologies drawn randomly from the population of trees sampled post burn-in by BEAST for each of our 14 groups. A unique, calibrated phylogeny for each group was scaled and added to each of our two backbone phylogenies to produce 100 alternative phylogenies of the grasses under each set of dating conditions. […]
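The rescaling of node heights described above (multiplying every branch length by hypothesis root age / current root age) was done in R in the original protocol. The Python sketch below shows the same arithmetic on a toy tree, assuming a simple hypothetical `Node` class rather than any particular phylogenetics library; the branch lengths and target age are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the branch leading to it."""
    name: str = ""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def root_age(node):
    """Height of a node: longest path from it down to a tip.

    For an ultrametric tree every root-to-tip path has this same length.
    """
    if not node.children:
        return 0.0
    return max(child.length + root_age(child) for child in node.children)


def rescale(node, factor):
    """Multiply every branch length in the subtree by `factor`, in place."""
    node.length *= factor
    for child in node.children:
        rescale(child, factor)


def scale_to_root_age(tree, target_age):
    """Rescale the tree so its root age equals target_age; return the factor."""
    current = root_age(tree)
    if current <= 0:
        raise ValueError("tree has no positive branch lengths to rescale")
    factor = target_age / current
    rescale(tree, factor)
    return factor


# Toy ultrametric tree ((A:2,B:2):3,C:5) with a root age of 5 (arbitrary units).
tree = Node(children=[
    Node(name="AB", length=3.0, children=[
        Node(name="A", length=2.0),
        Node(name="B", length=2.0),
    ]),
    Node(name="C", length=5.0),
])

factor = scale_to_root_age(tree, target_age=10.0)
print(factor, root_age(tree))
```

Because every branch is multiplied by the same factor, relative node heights and ultrametricity are preserved; only the absolute time scale changes.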

Pipeline specifications

Software tools: trimAl, MEGA, Phyutility, RAxML, BEAST, r8s
Application: Phylogenetics
Chemicals: Carbon Dioxide