ZPS: visualization of recent adaptive evolution of proteins
Background
Detection of adaptive amino acid changes in proteins under recent short-term selection is of great interest for researchers studying microevolutionary processes in microbial pathogens or any other biological species. However, independent occurrence of such point mutations within genetically diverse haplotypes makes it difficult to detect the selection footprint by using traditional molecular evolutionary analyses. The recently developed Zonal Phylogeny (ZP) has been shown to be a useful analytic tool for identifying the footprints of short-term positive selection. ZP separates protein-encoding genes into evolutionarily long-term (with silent diversity) and short-term (without silent diversity) categories, or zones, followed by statistical analysis to detect signs of positive selection in the short-term zone. However, successful broad application of ZP for analysis of large haplotype datasets requires automation of the relatively labor-intensive computational process.

Results
Here we present Zonal Phylogeny Software (ZPS), an application that describes the distribution of single nucleotide polymorphisms (SNPs) of synonymous (silent) and non-synonymous (replacement) nature along branches of the DNA tree for any given protein-coding gene locus. Based on this information, ZPS separates the protein variant haplotypes with silent variability (Primary zone) from those that have recently evolved from the Primary zone variants by amino acid changes (External zone). Further comparative analysis of mutational hot-spot frequencies and haplotype diversity between the two zones allows determination of whether the External zone haplotypes emerged under positive selection.

Conclusions
As a visualization tool, ZPS depicts the protein tree in a DNA tree, indicating the most parsimonious numbers of synonymous and non-synonymous changes along the branches of a maximum-likelihood based DNA tree, along with information on homoplasy, reversion and structural mutation hot-spots.
Through zonal differentiation, ZPS allows detection of recent adaptive evolution via selection of advantageous structural mutations, even when the advantage conferred by such mutations is relatively short-term (as in the case of "source-sink" evolutionary dynamics, which may represent a major mode of virulence evolution in microbes).
[…] Two input files are used: (i) a DNA alignment in FASTA format (e.g., .fasta) [see Additional files and ], produced with DNA alignment software such as ClustalX; and (ii) a maximum-likelihood DNA tree topology (e.g., .ml.tre) [see Additional files and ] generated by PAUP*. Haplotype names should contain only alphanumeric characters (i.e., letters and decimal digits). To allow for haplotype size/frequency-based analysis, duplicate haplotypes must be removed from the input files, and the user should mark each haplotype that has multiple representatives in the dataset by appending n<no. of representatives> to its name. For example, if the seqA, seqB and seqC haplotypes are identical, the user should use seqAn3 (or seqBn3 or seqCn3) as input. If a haplotype has a single representative, its name can be used as it is and the program will treat it as 'n1'. [...] There is one tree output, "zp_tree.dnd", in which each node name (for example, 'E4-seqA-n3-2S/1N-A77D' or 'P3-seqE-n8-5S/0N') gives (i) the zone to which the haplotype is assigned, either External ('E') or Primary ('P'), with intermediate hypothetical (unresolved) nodes marked as 'H'; (ii) an arbitrary number assigned to the protein variant encoded by the haplotype (e.g. 'E4' or 'P3'); (iii) the original name of the representative haplotype and the user-defined number of haplotypes identical to it in the dataset (e.g. 'seqA-n3' or 'seqE-n8'), with ZPS automatically adding '-n1' to haplotypes with a single representative; (iv) the number of synonymous (S)/non-synonymous (N) SNPs along the connecting branch (e.g. '2S/1N' or '5S/0N'); and (v) the amino acid changes caused by the non-synonymous SNPs (e.g. 'A77D'). The ZPS output tree can be viewed with tree-viewing software such as TreeView or HyperTree. The latter also supports color coding to visually distinguish different types of haplotypes and branches.
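The naming conventions described above can be illustrated with a short Python sketch. It is not part of ZPS: the function names collapse_duplicates and parse_node_label are hypothetical, and the regular expression is inferred only from the two example labels given in the text. The format of 'H' node labels, and the way several amino acid changes on one branch are delimited, are not specified here, so the sketch leaves 'H' nodes unparsed and returns the change field as a raw string.

```python
import re

# Input names may contain only letters and decimal digits.
NAME_RE = re.compile(r"[A-Za-z0-9]+")

# Pattern inferred from 'E4-seqA-n3-2S/1N-A77D' and 'P3-seqE-n8-5S/0N'.
# Assumption: fields are hyphen-separated and the change field is optional.
LABEL_RE = re.compile(
    r"(?P<zone>[EP])(?P<variant>\d+)"
    r"-(?P<name>[A-Za-z0-9]+)"
    r"-n(?P<count>\d+)"
    r"-(?P<syn>\d+)S/(?P<nonsyn>\d+)N"
    r"(?:-(?P<changes>.+))?"
)


def collapse_duplicates(records):
    """Collapse identical sequences into one representative per haplotype.

    records: iterable of (name, sequence) pairs from an alignment.
    Returns (name, sequence) pairs in first-seen order; a haplotype with
    k > 1 representatives is renamed '<first name>n<k>'. Sequences are
    compared case-insensitively and returned in upper case.
    """
    groups = {}  # sequence -> [first name seen, number of representatives]
    for name, seq in records:
        if NAME_RE.fullmatch(name) is None:
            raise ValueError(f"non-alphanumeric haplotype name: {name!r}")
        key = seq.upper()
        if key in groups:
            groups[key][1] += 1
        else:
            groups[key] = [name, 1]
    return [
        (f"{name}n{count}" if count > 1 else name, seq)
        for seq, (name, count) in groups.items()
    ]


def parse_node_label(label):
    """Split an 'E'/'P' node label into its fields; None if it does not match."""
    m = LABEL_RE.fullmatch(label)
    if m is None:
        return None
    return {
        "zone": m["zone"],
        "variant": int(m["variant"]),
        "name": m["name"],
        "count": int(m["count"]),
        "synonymous": int(m["syn"]),
        "nonsynonymous": int(m["nonsyn"]),
        "changes": m["changes"],  # raw string, or None when absent
    }
```

For instance, passing three identical sequences named seqA, seqB and seqC to collapse_duplicates yields a single record named seqAn3, matching the example in the text.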
With HyperTree in mind, ZPS also generates a color-code file for the output tree that distinguishes the Primary and External zone representatives. Two colors are used: blue for all Primary zone haplotypes that exhibit same-protein silent variability, and red for all External zone representatives. To view "zp_tree.dnd" in color in HyperTree, the user needs to 'import colors' and select the "color-zp_tree.txt" file.

There are two analytical outputs: "pairwise-variation.txt" and "analysis-results.txt". The former details the positions and specific changes along each branch in the tree, while the latter presents (i) the Primary and External zone representatives; (ii) the haplotype ratio (the number of External zone haplotypes divided by the total number of haplotypes in the dataset); (iii) position-wise structural mutation information, including both the overall and the zone-based structural hot-spot frequency (the number of hot-spot structural mutations divided by the total number of structural mutations); and (iv) calculations of α and Simpson's diversity statistics. […]
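The ratios reported in "analysis-results.txt" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ZPS implementation: the function names are hypothetical; the haplotype ratio is computed over distinct haplotypes (one entry per tree tip); a hot-spot is assumed to be a position hit by at least two independent structural mutations; and Simpson's diversity is given in the common unbiased form 1 - Σ n_i(n_i - 1) / (N(N - 1)), which may differ from the variant ZPS uses. The α statistic is not defined in this section and is therefore omitted.

```python
from collections import Counter


def haplotype_ratio(zones):
    """External zone haplotypes / all haplotypes.

    zones: one 'E' or 'P' entry per distinct haplotype (assumption).
    """
    if not zones:
        raise ValueError("no haplotypes given")
    return sum(1 for z in zones if z == "E") / len(zones)


def hotspot_frequency(positions, min_hits=2):
    """Hot-spot structural mutations / all structural mutations.

    positions: amino acid position of each structural mutation event.
    min_hits: hits needed for a position to count as a hot-spot
              (assumed threshold, not taken from the text).
    """
    if not positions:
        raise ValueError("no structural mutations given")
    hits = Counter(positions)
    hot = sum(c for c in hits.values() if c >= min_hits)
    return hot / len(positions)


def simpson_diversity(counts):
    """Unbiased Simpson's diversity from per-haplotype representative counts."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))
```

As a usage example, the 'n' counts carried in the node names (such as 3 for 'seqA-n3' and 8 for 'seqE-n8') would be the natural input to simpson_diversity, computed separately for the Primary and External zones before comparing them.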