Computational protocol: Genome-Guided Phylo-Transcriptomic Methods and the Nuclear Phylogenetic Tree of the Paniceae Grasses

Protocol publication

[…] RNA-seq data were quality filtered following standard procedures. Transcriptomes were assembled de novo using Trinity and processed as described in Yang and Smith (2013).
The sequenced genomes of S. bicolor and S. italica were used for syntenic ortholog determination because both are high quality and publicly available, they represent an ingroup and an outgroup taxon of the tribe Paniceae, and neither genome contains a recent whole genome duplication event. Syntenic orthologs between S. bicolor and S. italica were inferred using the SynMap tool in CoGe (https://genomevolution.org/CoGe/) with QuotaAlign set to filter out syntenic paralogous regions using a quota setting of 1:1. This ensures that only genes that are present in a single copy and at the same syntenic location in both species are included. The S. bicolor protein sequences from the 1:1 S. italica/S. bicolor set were used as the reference sequences for the remainder of the analyses. The assembled tribe Paniceae transcripts (excluding outgroup transcriptomes) were then mapped to the S. bicolor reference orthologs using BLAST with cutoffs of E-value 0.00001 and 85% amino acid identity. When a given S. bicolor gene corresponded to more than one transcript in a species, all transcripts mapping to that S. bicolor gene were discarded to avoid the potential inclusion of paralogs. The remaining sequences were then grouped into orthologous sets for each gene, and a multiple alignment was created using MAFFT. In this way, the use of all-by-all BLAST and the MCL algorithm is completely avoided. After further filtering with phyutility and several scripts from Yang and Smith, concatenated trees, coalescence-based quartet summary species trees, and binned coalescence-based quartet summary trees were created using RAxML, ASTRAL, and binning followed by ASTRAL, respectively (Fig. ).
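The BLAST filtering and paralog-exclusion step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes the hits for one species are available in the 12-column tab-separated BLAST layout (the source does not state the output format), that the cutoffs are inclusive, and that the multi-transcript check is applied after the cutoffs. The function name, helper, and gene/transcript IDs are hypothetical. The case of one transcript hitting several reference genes is not addressed in the text and is not handled here.

```python
from collections import defaultdict


def select_single_copy_hits(blast_lines, max_evalue=1e-5, min_identity=85.0):
    """Map each reference gene to its transcript, keeping only unambiguous genes.

    blast_lines: iterable of tab-separated BLAST rows for ONE species, assumed
    to follow the 12-column tabular layout (query, subject, % identity, ...,
    E-value in column 11, bit score in column 12).
    """
    transcripts_per_gene = defaultdict(set)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        transcript, ref_gene = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            transcripts_per_gene[ref_gene].add(transcript)
    # Discard every reference gene matched by more than one transcript.
    return {
        gene: next(iter(found))
        for gene, found in transcripts_per_gene.items()
        if len(found) == 1
    }


def demo_row(transcript, ref_gene, identity, evalue):
    """Build a hypothetical 12-column BLAST row for the example below."""
    return (
        f"{transcript}\t{ref_gene}\t{identity}\t100\t0\t0"
        f"\t1\t100\t1\t100\t{evalue}\t200"
    )


rows = [
    demo_row("t1", "SbGene1", 92.0, 1e-50),  # passes both cutoffs, unique hit
    demo_row("t2", "SbGene2", 90.0, 1e-40),  # SbGene2 is hit twice ...
    demo_row("t3", "SbGene2", 88.0, 1e-30),  # ... so both transcripts are dropped
    demo_row("t4", "SbGene3", 70.0, 1e-20),  # fails the identity cutoff
    demo_row("t5", "SbGene4", 95.0, 1e-3),   # fails the E-value cutoff
]
print(select_single_copy_hits(rows))
```

In this toy input only SbGene1 keeps a transcript. Running the function once per species and collecting the surviving transcripts under each reference gene would give the per-gene orthologous sets that are then aligned.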
To investigate syntenic block phylogenies, data from the genome-guided gene trees were grouped based on conserved syntenic blocks across the S. bicolor and S. italica genomes (again obtained from CoGe). Each transcript was mapped to its syntenic block, and trees were created using RAxML based on the concatenated transcripts from each syntenic block. The same method was applied to the grape data set, except that the V. vinifera and Arabidopsis thaliana genomes were used and the E-value and protein identity cutoffs were relaxed to 0.0001 and 75%, respectively, to account for the greater phylogenetic distances represented in the grape data set. Whole genome duplication events occurring in the ancestor of A. thaliana but not V. vinifera are excluded from the analysis because of the 1:1 setting in QuotaAlign. Scripts and instructions for the genome-guided method are available at: bitbucket.org/washjake/transcriptome_phylogeny_tools.
Figure 1
Two gene tree topology-based approaches to orthology inference were also used for comparison: the Agalma pipeline (version 0.5.0) by Dunn et al. and the Ya Yang pipeline. As above, RAxML, ASTRAL, and binning combined with ASTRAL were used to infer phylogenies. Phylogenetic trees and other figures were generated using FigTree, Inkscape, and Vennerable in R.
For the grass data used for benchmarking, several additional analyses were run. Single-copy syntenic orthologs were found in a pairwise fashion between O. sativa and each of the other genomes using CoGe as described above. The likelihood of two non-orthologous genes evolving to have not only high sequence similarity but also the same physical location and single-copy status across multiple species is very small, giving us high confidence in orthologs inferred using this method. These orthologs were used to create a set of fully synteny-based, one-to-one orthologs across the grasses.
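One way to combine the pairwise O. sativa results into a multi-species one-to-one set is sketched below. It is an illustration under stated assumptions rather than the authors' procedure: the text does not say how the pairwise sets were merged, so this sketch assumes an O. sativa gene is kept only when it has a 1:1 syntenic partner in every other genome. The function name, species labels, and gene IDs are hypothetical.

```python
def build_benchmark_groups(pairwise, anchor="O_sativa"):
    """Merge pairwise 1:1 ortholog maps that share a common anchor genome.

    pairwise: {species: {anchor_gene: ortholog_gene_in_that_species}}
    Returns {anchor_gene: {species: gene}} for anchor genes that have a
    partner in every species (assumed merging rule, see lead-in).
    """
    if not pairwise:
        return {}
    # Anchor genes present in every pairwise comparison.
    shared = set.intersection(*(set(mapping) for mapping in pairwise.values()))
    groups = {}
    for anchor_gene in sorted(shared):
        group = {anchor: anchor_gene}
        for species, mapping in pairwise.items():
            group[species] = mapping[anchor_gene]
        groups[anchor_gene] = group
    return groups


# Hypothetical pairwise results: Os2 has no partner in the second genome,
# so it is left out of the merged set.
pairwise = {
    "S_bicolor": {"Os1": "Sb1", "Os2": "Sb2", "Os3": "Sb3"},
    "S_italica": {"Os1": "Si1", "Os3": "Si3"},
}
for anchor_gene, group in build_benchmark_groups(pairwise).items():
    print(anchor_gene, group)
```

Requiring a partner in every genome is the strictest reading of "one-to-one orthologs across the grasses"; a looser rule would need a different merge.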
While this set does not include all possible single-copy orthologs, it does include all those for which we can have high confidence based on the available data and current methods, and is hence the best achievable ortholog set for the grasses at the current time. We refer to this as the benchmarking data set.
Each of the ortholog inference methods described above was then run using the transcriptomes generated by the genome sequencing projects referenced above. In this way, the transcripts could be followed by name through the pipelines (except for the Agalma method, for which this could not be easily accomplished because of the way the pipeline is packaged). Ortholog sets derived from the genome-guided method and the Yang and Smith method were then compared to the benchmarking set to determine how many orthologs each method found in common with the benchmark orthologs. […]
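The comparison against the benchmark can be sketched as a simple set operation on transcript names. The matching criterion is not given in the text, so this sketch assumes an ortholog group counts as shared only when its members are exactly the same names as a benchmark group. The function name, method labels, and IDs are hypothetical toy values, not results.

```python
def recovered_groups(benchmark, inferred):
    """Return the benchmark ortholog groups that also appear in `inferred`.

    Each group is an iterable of gene/transcript names; two groups match
    only if they contain exactly the same names (assumed criterion).
    """
    benchmark_sets = {frozenset(group) for group in benchmark}
    inferred_sets = {frozenset(group) for group in inferred}
    return benchmark_sets & inferred_sets


# Toy data with made-up identifiers.
benchmark = [
    {"Os1", "Sb1", "Si1"},
    {"Os2", "Sb2", "Si2"},
    {"Os3", "Sb3", "Si3"},
]
methods = {
    "method_a": [{"Os1", "Sb1", "Si1"}, {"Os2", "Sb2", "Si2"}],
    "method_b": [{"Os1", "Sb1", "Si1"}, {"Os3", "Sb3", "Si9"}],
}
for name, groups in methods.items():
    shared = recovered_groups(benchmark, groups)
    print(f"{name}: {len(shared)} of {len(benchmark)} benchmark groups recovered")
```

Exact matching is deliberately strict; counting shared gene pairs instead would credit partially recovered groups.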

Pipeline specifications