Computational protocol: Evolutionary Dynamics of Overlapped Genes in Salmonella

Protocol publication

[…] We used the outgroup genome Yersinia pestis CO92 (YPes) to root the tree of 15 Salmonella strains, E. coli K-12 (EColK12), E. coli O157:H7 (EColO157), and Shigella flexneri (SFle). The tree was derived from a set of reliable orthologous clusters across genomes. To construct these clusters, we first obtained a set of orthologs from 8 genomes from the ATGC database, which identifies orthologs between closely related microbial genomes. We then conducted reciprocal BLAST searches for the genomes missing from the ATGC database. To take a conservative approach, we included only genes that share a high degree of amino acid similarity (E < 10⁻¹⁰ and > 80% similarity). All “hypothetical”, “unknown”, and “putative” genes were excluded in order to generate clusters of genes with known functions only.
Because phylogenetic analysis can be misled by genes with extremely different GC contents within and among species, we used Grubbs' test to exclude all genes whose GC content was an outlier relative to the other orthologs. We also constructed gene-by-gene multiple sequence alignments using MUSCLE in MEGA5, and then computed synonymous divergence using yn00 in PAML4. All genes containing gene pairs with synonymous divergence > 1.5 substitutions per site were excluded, because high sequence divergence among strains or species can mislead phylogenetic inference. Short proteins (<150 amino acids) were also removed.
Even after all these exclusions, the final dataset contained 474 genes and 214,491 codons. These genes were concatenated head-to-tail, and the fourfold-degenerate sites across all genomes were extracted in MEGA5 (66,202 sites) for phylogenetic analysis. We focus on fourfold-degenerate sites because Salmonella genomes are extremely similar to each other at the protein sequence level.
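The extraction of fourfold-degenerate sites from a concatenated codon alignment can be sketched in Python as follows. This is only an illustration, not the MEGA5 procedure: the function name `fourfold_sites`, the input format (a dict of equal-length, in-frame sequences), and the rule that a column is kept only when every sequence carries a fourfold-degenerate codon at that position are all assumptions, and MEGA5's exact selection rule may differ.

```python
# First two bases of the eight codon families whose third position is
# fourfold degenerate in the standard genetic code
# (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = frozenset({"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"})


def fourfold_sites(alignment):
    """Return, per sequence, the third-position bases of codon columns that
    are fourfold degenerate in every sequence.

    `alignment` maps a genome name to an in-frame, aligned nucleotide string.
    ASSUMPTION: a column is kept only if all sequences have a codon from a
    fourfold family there and an unambiguous third base (no gaps or Ns).
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs) or length % 3 != 0:
        raise ValueError("sequences must be aligned, equal-length, and in frame")

    kept = {name: [] for name in names}
    for start in range(0, length, 3):
        codons = [s[start:start + 3] for s in seqs]
        if all(c[:2] in FOURFOLD_PREFIXES and c[2] in "ACGT" for c in codons):
            for name, codon in zip(names, codons):
                kept[name].append(codon[2])
    return {name: "".join(bases) for name, bases in kept.items()}


# Toy example with two hypothetical genomes and three codons each.
toy = {"genomeA": "CTAGGGATG", "genomeB": "CTGGGAATG"}
sites = fourfold_sites(toy)
```

In the toy example the first two codon columns (Leu and Gly families) are kept and the ATG column is dropped, so each genome contributes two sites.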
We used Neighbor-Joining, Maximum Likelihood, and Bayesian methods for phylogenetic analysis, with a GTR+G+I model of nucleotide substitution chosen on the basis of the modelTest report. [...] We determined pan-Overlaps, which are pairs of genes that overlap in at least one of the 18 genomes (15 Salmonella, 2 E. coli and 1 Shigella). Using the phylogeny, we constructed a system for querying ancestral states of pan-Overlaps across genomes. Sequence alignments were constructed by using the four possible configurations (C, S, W, and D) of overlapping genes in pan-Overlaps as the state symbols. We then inferred ancestral states using parsimony with a user-defined matrix (mymatrix) in PAUP 4.0. The mymatrix was defined as […] Transitions between C and S were given a higher likelihood than transitions between C and W or D, because widowed and dead overlapping genes rarely become overlapped again during evolution, whereas separated genes do so more readily. […]
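Parsimony under a user-defined step matrix can be illustrated with a small Sankoff-style sketch in Python. The costs below are hypothetical: the actual mymatrix is not reproduced in this text, so the sketch only assumes that a C/S change is cheaper (cost 1) than any other change (cost 2). The tree shape, genome names, and function names are likewise invented for illustration, and PAUP's own implementation is not shown here.

```python
import math

STATES = "CSWD"  # the four configurations used as state symbols


def step_cost(a, b):
    """HYPOTHETICAL step costs (not the source's mymatrix):
    0 for no change, 1 for C<->S, 2 for every other change."""
    if a == b:
        return 0
    if {a, b} == {"C", "S"}:
        return 1
    return 2


def sankoff(tree, leaf_states):
    """Return {state: minimum cost of the subtree if its root has that state}.

    `tree` is a leaf name (str) or a tuple of subtrees;
    `leaf_states` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):  # leaf: only the observed state is allowed
        return {s: (0 if s == leaf_states[tree] else math.inf) for s in STATES}
    child_costs = [sankoff(child, leaf_states) for child in tree]
    return {
        s: sum(min(step_cost(s, t) + cc[t] for t in STATES) for cc in child_costs)
        for s in STATES
    }


# Toy rooted tree of four hypothetical genomes with observed configurations.
toy_tree = (("g1", "g2"), ("g3", "g4"))
toy_states = {"g1": "C", "g2": "C", "g3": "S", "g4": "W"}
root_costs = sankoff(toy_tree, toy_states)
best_score = min(root_costs.values())
best_root_states = [s for s in STATES if root_costs[s] == best_score]
```

The dictionary `root_costs` gives the parsimony score for each candidate root state; the states with the lowest score are the most parsimonious reconstructions at the root under the assumed costs.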

Pipeline specifications

Software tools: MEGA, PAML, ModelTest-NG, PAUP*
Application: Phylogenetics
Organisms: Escherichia coli
Diseases: Salmonella Infections
Chemicals: Nucleotides