Computational protocol: Phylotranscriptomics: Saturated Third Codon Positions Radically Influence the Estimation of Trees Based on Next Gen Data

Similar protocols

Protocol publication

[…] The general workflow from data collection to analysis is summarized in . Our taxon sampling included a total of 19 species, 14 ingroup and 5 outgroup. Ingroup taxa included seven species from Saturniidae, five from Sphingidae, and two from Bombycidae. Our outgroup sampling consisted of two species of Noctuidae, two species of Pyraloidea, and one butterfly. We collected available expressed sequence tag (EST) data from the B. mori () and M. sexta genomes (Agricultural Pest Genomics Resource Database,, last accessed November 8, 2013) and from 12 taxa in the GenBank EST database (). We generated six new transcriptomes, from two species of Saturniidae and four species of Sphingidae (supplementary table S1, Supplementary Material online). These transcriptomes were generated from purified RNA extracted with the SV Total RNA Isolation System (catalogue no. Z3100; Promega) from flash-frozen or RNAlater-preserved tissue. Purified RNA (5 µg) was sent to the University of Missouri DNA Core, where RNA quality check, RNA-Seq library construction, and Illumina HiSeq 2000 runs were performed as 100 bp SE reads with four samples per lane (one lane included two additional samples that were not included in this study). We used SOAPdenovo-Trans v1.01 for de novo assembly of the transcriptomes of each taxon using five k-mer values (13, 23, 33, 43, 63) following the additive multiple-k assembly method of . We combined redundant contigs from the multiple k-mer assembles with CD-HIT-EST () and removed all sequences below 100 bp with FASTX-Toolkit (, last accessed November 4, 2013). To identify orthologous genes between transcriptomes and ESTs, we used HaMStR v8b (). HaMStR was implemented using the Insecta HMMER v3-2 core ortholog data (1,579 core orthologs) built upon the InParanoid transitive closure approach using proteomes from six primer taxa: Apis mellifera, B. mori, Capitella sp., Daphnia pulex, Ixodes scapularis, and Tribolium castaneum. The “-representative” option was implemented in HaMStR, and we used B. mori as the reference species for reciprocal Blast of candidate EST contigs. Fig. 2.—To reduce the amount of missing data in the final matrix, we included genes that were represented by at least half (10) of the taxa in this study. We translated the nucleotide data into amino acids in TranslatorX () and aligned the data set using MAFFT v7.029b (). The aligned amino acid sequence was then back-translated to nucleotide data in TranslatorX. Each gene alignment was visually inspected for misaligned regions, and if such regions were found, either the taxon was removed or the poorly aligned region was removed. Selected genes were concatenated using Geneious v5.5.8 (). We tested the concatenated alignment for nucleotide saturation at each codon position and created a saturation plot by nucleotide position in DAMBE v5.3.16 (). The Akaike information criterion (AIC) score in PartitionFinder v1.0.1 () was used to select the best model for a concatenated amino acid alignment. Furthermore, we used the AIC score to choose the best model for the nucleotide alignment under three different partitioning strategies: 1) no partitions, 2) partitioned separately by codon position, and 3) two partitions, with the first and second codon position (nt1 and nt2) as one partition and the third codon position (nt3) as the other.We estimated our phylogeny with two different approaches: a standard concatenation approach and a species tree analysis using gene tree parsimony. For phylogeny estimation using concatenation, we ran maximum likelihood (ML) analyses in RAxML () and maximum parsimony (MP) analyses in TNT (). For ML analysis, we created six separate data matrices consisting of 19 taxa for all genes: 1) the full nucleotide data matrix consisting of all nucleotide positions (nt123), 2) a matrix consisting of only the first and second codon positions (nt12), 3) a matrix consisting of only the third codon position (nt3), 4) a data set that removed all synonymous signal, thus leaving only non-synonymous changes at all coding positions (degen1), 5) a data set consisting of only nt3 from the degen1 matrix (degen1-nt3), and 6) an amino acid matrix (AA). In RAxML, we estimated ML trees for each of these six concatenated matrices. We conducted 200 tree searches starting from a random topology and also implemented the “–F a” option for a combined ML tree search with 100 bootstrap replicates. The concatenated matrices were partitioned following the optimal partitioning strategy as identified in PartitionFinder. The degen1 data sets were created with the degen v1.4 perl script from , which degenerates nucleotides to IUPAC ambiguity codes at all sites that have synonymous substitutions. A benefit of using degen1 over nt12 is that degen1 removes all synonymous signal while keeping non-synonymous substitutions from all nucleotide positions, whereas nt12 retains synonymous signal at nt1 and completely removes all signal from nt3 (). The CAT model of implemented in PhyloBayes () has also been shown to effectively deal with saturated sites and synonoymous signal at nt3. However, because PhyloBayes cannot analyze large data sets without jackknifing the data into smaller data sets (), we chose not to use this program.Parsimony analyses were conducted in TNT (). We used the 19 taxon, nt123 data set and conducted a new technology search including three rounds of tree fusing, ten rounds of drifting, and tree ratcheting for 100 random addition replicates (xmult = rss, fuse 2, drift 3, ratchet 10, replications 100). One hundred parsimony bootstrap replicates were estimated in TNT with a new technology search including three rounds of tree fusing, ten rounds of drift, and tree ratcheting for ten random addition replicates (xmult = rss, fuse 3, drift 10, ratchet 10, replications 10). We also conducted 100 ML and MP bootstrap analyses using methods described above for each codon position to determine how each codon position supported different regions of the tree.To examine the effect of concatenation on our resulting topology, we used the program Random Addition Concatenation Analysis (RADICAL) (). RADICAL produces multiple random concatenation paths by sequentially concatenating genes arbitrarily, without replacement, starting with a single gene and ending with all genes in the data set. RADICAL analysis examines how many genes are necessary to resolve a node. RADICAL produces two statistics for each node: 1) a fixation/degradation point, which is estimated as the number of genes needed to fix a node or ensure a node no longer appears in any concatenation path, and a area under the curve (AUC) value which represents the percent of the total number of trees from the RADICAL analysis that have that particular node. The RADICAL curve provides a visual representation of the emergent phylogenetic support of a node by plotting the normalized average consensus fork index (CFI) () of all concatenation paths at each concatenation point by the number of genes at these points. RADICAL analyses were conducted on six data matrices (nt123, nt12, nt3, degen1, degen1-nt3, and AA) with ten random concatenation paths using the “–step 25” option, which adds 25 genes to each step in the concatenation path using RAxML for tree construction. We deleted five taxa (Antheraea mylitta, A. pernyi, B. mandarina, Lonomia obliqua, and Ostrinia nubilalis) to limit the amount of missing data and reduce the possible effects of missing data on smaller concatenation points in the analysis. We chose a step of 25 to reduce the number of total trees estimated, thereby significantly reducing computational time needed to complete the analysis. Applying a RADICAL approach to different matrices allows us to examine the phylogenetic contribution of each codon position, the effect of saturated sites on phylogenetic analysis, and the amount of agreement/disagreement for nodes among data matrices.To account for individual gene history, we estimated species trees using gene tree topologies. Individual gene trees for all 938 genes were estimated in RAxML using the –F a option with 100 bootstraps on nucleotide alignments including all codon positions. Species trees were estimated with gene tree parsimony implemented in iGTP1.1 () using 50 replicate searches that minimize duplications (MD), minimize duplications and losses (MDL), or minimize deep coalescence (MDC). To estimate confidence in our estimated species tree, we created 100 bootstrap pseudoreplicate gene tree sets by sampling our gene trees with replacement () and analyzed them in iGTP under the “minimize duplications and losses” model. […]

Pipeline specifications