Computational protocol: Bacterial intra-species gene loss occurs in a largely clocklike manner mostly within a pool of less conserved and constrained genes

Protocol publication

[…] Phylogenetic trees were constructed using the neighbor-joining algorithm implemented by the PHYLIP package’s Neighbor program. To create rooted trees, an outgroup strain from outside the species was selected for each of the 15 examined species. The metric used to estimate the distance between each genome pair within a species (including the outgroup) was average nucleotide dissimilarity (AND), defined as 100 – ANI, where ANI stands for average nucleotide identity. To minimize possible effects of recombination, ANI and AND were calculated using only genes belonging to the ‘core’ pangenome that showed no trace of recombination. The ‘core’ non-recombining gene sequences of strains belonging to the same bacterial species were compared pairwise using FASTA. Orthologous gene pairs were identified as reciprocal best hits. Following the thresholds set by POGO-DB, only putative orthologous gene pairs sharing at least 30% identity over at least 70% of the gene length were used for ANI calculation. Next, the % nucleotide identity of each pair of orthologs was calculated, and from these % identities for all orthologs within each genome pair, ANI and AND were calculated for all pairs of genomes within the given bacterial species. After calculating AND values, a custom Perl script was used to generate strain dissimilarity matrices, and the trees were generated from these matrices.
[...] To calculate dN/dS for each pangene, individual gene sequences within each pangene were aligned using MACSE with default settings and genetic code 11 (The Bacterial, Archaeal and Plant Plastid Code). Next, the PAML (Phylogenetic Analysis by Maximum Likelihood) CodeML program was used to calculate dN/dS for each pangene, based on the multiple sequence alignment generated using MACSE. A single dN/dS value was calculated for each species (using the model = 0 setting). For dN/dS calculation, CodeML requires a phylogenetic tree of the input sequences.
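The distance step described above (AND = 100 – ANI, assembled into a per-species dissimilarity matrix for Neighbor) could be sketched in Python roughly as follows. This is an illustrative sketch only, not the authors' custom Perl script: the function names and the input layout (a dict of per-ortholog % identities keyed by strain pair) are assumptions made for the example. The output follows PHYLIP's square distance-matrix layout, with strain names padded to 10 characters.

```python
from itertools import combinations
from statistics import mean


def average_nucleotide_dissimilarity(ortholog_identities):
    """Return AND = 100 - ANI, where ANI is the mean % identity of the ortholog pairs."""
    ani = mean(ortholog_identities)
    return 100.0 - ani


def build_and_matrix(strains, identities_by_pair):
    """Build a symmetric AND matrix (list of lists) in the order given by `strains`.

    `identities_by_pair` (hypothetical layout) maps a (strain_a, strain_b) tuple,
    in either order, to the list of % identities of that pair's orthologs.
    """
    index = {name: i for i, name in enumerate(strains)}
    n = len(strains)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays 0: a genome vs itself
    for a, b in combinations(strains, 2):
        key = (a, b) if (a, b) in identities_by_pair else (b, a)
        d = average_nucleotide_dissimilarity(identities_by_pair[key])
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d
    return matrix


def to_phylip(strains, matrix):
    """Format the matrix as a PHYLIP square distance matrix.

    Names are truncated/padded to 10 characters, so longer strain names
    must remain unique after truncation.
    """
    lines = [f"{len(strains):5d}"]
    for name, row in zip(strains, matrix):
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Toy input: two strains of one species plus an outgroup (made-up identities).
example_strains = ["strainA", "strainB", "outgroup"]
example_identities = {
    ("strainA", "strainB"): [99.0, 98.0, 97.0],
    ("strainA", "outgroup"): [90.0, 92.0],
    ("strainB", "outgroup"): [89.0, 91.0],
}
example_matrix = build_and_matrix(example_strains, example_identities)
example_phylip = to_phylip(example_strains, example_matrix)
```

The resulting text can be saved to a file and supplied to Neighbor as its distance-matrix input. The 30% identity and 70% length filters are assumed to have been applied before the identities reach these functions.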
The phylogenetic trees we generated for each species represent all strains of a species and therefore could not be used to calculate dN/dS for near-core pangenes that are absent from some strains. To generate trees that contain only those strains in which each near-core gene is present, we removed the strains lacking the pangene from the dissimilarity matrix used to construct the original trees, and reconstructed the trees from these trimmed-down matrices. To avoid possible computational biases due to inclusion of gene sequences with a low number of variable sites, only pangenes with 0.0001
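The matrix-trimming step for near-core pangenes described in this paragraph can be illustrated with a short Python sketch. It assumes the dissimilarity matrix is held as a list of lists in the same order as a list of strain names. The function name `trim_matrix` and this data layout are hypothetical and are not taken from the original pipeline.

```python
def trim_matrix(strains, matrix, absent):
    """Drop the rows and columns of strains that lack a given pangene.

    strains: list of strain names, in matrix order
    matrix:  square dissimilarity matrix (list of lists)
    absent:  set of strain names from which the pangene is absent
    Returns (kept_strains, trimmed_matrix); the inputs are left unmodified.
    """
    keep = [i for i, name in enumerate(strains) if name not in absent]
    kept_strains = [strains[i] for i in keep]
    trimmed = [[matrix[i][j] for j in keep] for i in keep]
    return kept_strains, trimmed


# Toy example: the pangene is missing from strainB (made-up distances).
all_strains = ["strainA", "strainB", "strainC", "outgroup"]
full_matrix = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 2.5, 9.5],
    [2.0, 2.5, 0.0, 8.0],
    [9.0, 9.5, 8.0, 0.0],
]
kept, trimmed = trim_matrix(all_strains, full_matrix, {"strainB"})
```

The trimmed matrix would then be passed to the same tree-building step as the full matrix, giving one tree per near-core pangene that contains only the strains carrying that gene.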

Pipeline specifications

Software tools: PHYLIP, MACSE, PAML
Databases: POGO-DB
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Bacteria