Computational protocol: AST: An Automated Sequence-Sampling Method for Improving the Taxonomic Diversity of Gene Phylogenetic Trees

Protocol publication

[…] We used the EvolveAGene program to generate three types of simulated gene trees: symmetric trees, random trees, and asymmetric trees, as each of these three types may exist in reality. EvolveAGene simulates the evolution of DNA sequences by mimicking mutation and natural selection, and generates offspring sequences whose final structure is a symmetric or a random tree, depending on the specified parameters. Here we treat each simulated DNA sequence in each generation as the whole genome of an organism, and each node of the simulated trees as a taxon. The relationships among the taxa (nodes) are available from the output of EvolveAGene.

To generate the simulated trees for this study, we randomly chose the xisC gene of the bacterium Nostoc sp. PCC 7120 (GenBank accession: U08014) as the initial root sequence and generated 1,024 simulated sequences using the program. We used 1,024 because it is a power of 2, as required for generating a symmetric tree, and because this size is comparable in order of magnitude to the number of currently sequenced genomes.

EvolveAGene provides an option for generating random trees without any specified tree topology. When using this program, we set the average branch length to 0.3 and the number of leaf taxa to 1,024, with all other parameters left at their default values.

To generate asymmetric trees, we first generated a symmetric tree with 2,048 leaves and then selected 1,024 of these leaves to construct an asymmetric tree using the following procedure. We randomly chose a fraction x of the 1,024 leaves from the left branch of the symmetric tree and the remaining fraction (1.0−x) from the right branch. Here we used x values of 0.1, 0.2, 0.3, and 0.4 to generate the trees.

[...] To construct the phylogeny for rpS5 of E. coli and other related organisms, we performed multiple sequence alignments using MAFFT (version 6.603), employing the L-INS-i strategy, which uses local pairwise alignments computed by the Smith-Waterman algorithm and is considered to be one of the most accurate multiple sequence alignment methods currently available. A phylogenetic tree was then constructed using the FastTree program (version 2.1.3), which implements a very fast but fairly accurate approximate maximum likelihood method.

To study the phylogeny and horizontal gene transfer in class I of glycosyltransferase gene family 8 (GT8), we adopted a rigorous PhyML analysis as used in previous analyses of GT8. For the PhyML analyses, trees were built with the JTT substitution model along with the following parameters: estimated proportion of invariable sites, four rate categories, estimated gamma distribution parameter, and optimized starting BIONJ tree. Bootstrapping was performed using 100 replications. MrBayes analyses were run with a mixed amino acid model estimated in the run, an estimated proportion of invariable sites, an estimated gamma distribution parameter, and one million generations. […]
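
The leaf-selection step used to build the asymmetric trees can be illustrated with a short sketch. This is not the authors' code: the function name, the left-to-right leaf indexing (first half of the indices under the left branch of the root, second half under the right branch), and the rounding of x × 1,024 to a whole number of leaves are all assumptions made for illustration.

```python
import random


def sample_asymmetric_leaves(x, n_total=2048, n_keep=1024, seed=None):
    """Pick n_keep leaves from a symmetric tree that has n_total leaves.

    Assumption: leaves are indexed 0..n_total-1 from left to right, so
    indices below n_total // 2 lie under the left branch of the root and
    the rest lie under the right branch.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a fraction between 0 and 1")
    rng = random.Random(seed)
    half = n_total // 2
    # Assumption: fractional leaf counts are rounded to the nearest integer.
    n_left = round(x * n_keep)
    n_right = n_keep - n_left
    if n_left > half or n_right > half:
        raise ValueError("not enough leaves on one side of the tree")
    # Sample without replacement from each side of the root.
    left = sorted(rng.sample(range(half), n_left))
    right = sorted(rng.sample(range(half, n_total), n_right))
    return left, right


if __name__ == "__main__":
    for x in (0.1, 0.2, 0.3, 0.4):
        left, right = sample_asymmetric_leaves(x, seed=1)
        print(x, len(left), len(right))
```

Pruning the symmetric tree down to the returned leaves would then give the asymmetric tree; the smaller x is, the more strongly the tree is skewed toward the right branch.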

Pipeline specifications

Software tools: EvolveAGene, MAFFT, FastTree, PhyML, BIONJ, MrBayes
Applications: Phylogenetics, Population genetic analysis
Organisms: Escherichia coli
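
As a rough guide to how the alignment and tree-building tools listed above might be invoked with the settings described in the protocol, the sketch below assembles command lines for MAFFT (L-INS-i), FastTree, and PhyML. It is only an illustration under stated assumptions: the executable names (`mafft`, `FastTree`, `phyml`) depend on the local installation, the file names are hypothetical, the input is assumed to be protein sequence, and the PhyML call relies on the program's default starting tree (assumed here to be BioNJ) rather than setting it explicitly. The exact commands used by the authors are not given in the source.

```python
import subprocess


def mafft_linsi_cmd(fasta_in):
    # L-INS-i corresponds to local pairwise alignment plus iterative refinement.
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def fasttree_cmd(alignment):
    # Assumption: protein alignment, so no nucleotide flag is passed.
    return ["FastTree", alignment]


def phyml_cmd(phylip_alignment, bootstraps=100):
    return [
        "phyml",
        "-i", phylip_alignment,  # alignment in PHYLIP format
        "-d", "aa",              # amino acid data
        "-m", "JTT",             # JTT substitution model
        "-v", "e",               # estimate proportion of invariable sites
        "-c", "4",               # four rate categories
        "-a", "e",               # estimate gamma shape parameter
        "-b", str(bootstraps),   # number of bootstrap replicates
    ]


def run_to_file(cmd, out_path):
    """Run a command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, commands are only printed.
    print(" ".join(mafft_linsi_cmd("rps5.fasta")))
    print(" ".join(fasttree_cmd("rps5.aln.fasta")))
    print(" ".join(phyml_cmd("gt8.phy")))
```

MAFFT and FastTree write their results to standard output, so `run_to_file` would be the natural way to capture the alignment and the tree; PhyML writes its own output files next to the input alignment.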