Computational protocol: Evidence of Selection upon Genomic GC-Content in Bacteria

[…] The PopSet database of GenBank was searched for the keyword “bacteria”. From the results we extracted datasets containing at least 8 sequences from the same species, where a species was defined as a group of bacterial strains sharing the same genus and species name. These sequences were translated, aligned using MUSCLE and back-translated to DNA. We inferred the direction of mutation using two methods. In the first, we used allele frequencies, taking the minor allele to be the new mutation; sites with more than two alleles, or with two alleles at equal frequency, were discarded. In the second, we reconstructed the phylogenetic tree of the strains using minimum evolution as implemented in FastME, rooted the tree assuming a molecular clock, and then used parsimony to infer the ancestral state. We only analysed species for which we had at least 10 synonymous GC↔AT single nucleotide polymorphisms (SNPs) segregating at 4-fold degenerate sites. To estimate confidence intervals for the GC4 value at which the regression line intercepted the Z = 0.5 or Z − Zpred = 0 line, we bootstrapped the data by species. We inferred the GC-content to which a sequence would evolve under mutation bias from the current GC4 and the numbers of GC→AT SNPs, U, and AT→GC SNPs, V, as

$$\mathrm{GC4}_{\mathrm{pred}} = \frac{V/(1-\mathrm{GC4})}{U/\mathrm{GC4} + V/(1-\mathrm{GC4})} \qquad (1)$$

A similar equation allows one to infer the predicted GC-content at 2-fold sites. To detect possible cases of horizontal gene transfer we ran the MaxChi test with a slight, previously suggested adjustment to improve sensitivity. […]
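The frequency-based polarisation and the mutation-bias prediction described above can be illustrated with a minimal Python sketch. It is not the original analysis code. The function names (`polarize_by_frequency`, `classify_snp`, `predicted_gc`) are hypothetical, and the sketch assumes that each alignment column has already been restricted to 4-fold degenerate sites with gaps and ambiguous bases removed. The prediction assumes the standard two-state equilibrium, in which the per-site GC→AT rate is proportional to U/GC4 and the per-site AT→GC rate to V/(1 − GC4).

```python
from collections import Counter

GC_BASES = {"G", "C"}
AT_BASES = {"A", "T"}


def polarize_by_frequency(alleles):
    """Return (ancestral, derived) for one alignment column, or None.

    The minor allele is taken to be the new mutation. Columns that are
    monomorphic, have more than two alleles, or have two alleles at equal
    frequency are discarded (None).
    """
    counts = Counter(alleles)
    if len(counts) != 2:
        return None
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    if n_major == n_minor:
        return None
    return major, minor


def classify_snp(ancestral, derived):
    """Label a polarised SNP as 'GC->AT', 'AT->GC', or None otherwise."""
    if ancestral in GC_BASES and derived in AT_BASES:
        return "GC->AT"
    if ancestral in AT_BASES and derived in GC_BASES:
        return "AT->GC"
    return None  # GC<->GC or AT<->AT changes do not alter GC-content


def predicted_gc(gc4, u, v):
    """Equilibrium GC-content expected under mutation bias alone.

    gc4: current GC-content at 4-fold sites (strictly between 0 and 1)
    u:   number of GC->AT SNPs
    v:   number of AT->GC SNPs
    """
    if not 0.0 < gc4 < 1.0:
        raise ValueError("gc4 must lie strictly between 0 and 1")
    rate_gc_to_at = u / gc4          # per GC site
    rate_at_to_gc = v / (1.0 - gc4)  # per AT site
    total = rate_gc_to_at + rate_at_to_gc
    if total == 0:
        raise ValueError("at least one GC<->AT SNP is required")
    return rate_at_to_gc / total
```

For example, `polarize_by_frequency("GGGGGGGA")` returns `("G", "A")`, which `classify_snp` labels as `"GC->AT"`. With GC4 = 0.5, U = 30 and V = 10, `predicted_gc(0.5, 30, 10)` gives 0.25, that is, mutation alone would drive the sequence towards a lower GC-content than currently observed.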

Pipeline specifications

Software tools: MUSCLE, FastME
Databases: PopSet
Applications: Phylogenetics, Nucleotide sequence alignment
