Computational protocol: Comparison of 61 Sequenced Escherichia coli Genomes

Protocol publication

[…] The sequences encoding 16S ribosomal RNA were extracted from the analyzed genomes using RNAmmer []; sequences with an RNAmmer score above 1,400 were considered reliable and were kept for analysis. From every genome, the gene with the highest similarity to rrsH of E. coli K12 MG1655 was selected, and these sequences were aligned using ClustalX []. A phylogenetic tree was generated by ClustalX using the bootstrap neighbor-joining method and visualized with NJplot [], with bootstrap values shown at the branch points. [...]

The predicted proteomes, comprising all protein-coding genes, were extracted from the GenBank files for the published genomes. For unpublished genomes, they were predicted using EasyGene []. All predicted proteomes were compared by reciprocal pairwise BLASTP comparison. Two genes were assigned to the same gene family and considered 'conserved' when they shared at least 50% amino acid identity over at least 50% of the length of the longest gene.

A hierarchical clustering was performed for the complete pan-genome as described by Snipen et al. []. Briefly, a pan-genome matrix of 1s and 0s was constructed in which each row corresponds to a gene family, as defined above, and each column to a genome. Cell (i,j) in the matrix is 1 if gene family i is present in genome j, or 0 if it is absent. Manhattan distances were calculated between genomes and used for hierarchical clustering to generate the tree. The plotted distance between two genomes shows the proportion of gene families whose presence/absence status differs between them. Thus, pan-genome hierarchical clustering analyzes genes that are not conserved, but vary in their presence or absence between genomes. Shorter distances represent genomes with more gene families in common. Genes occurring in only a single genome (singletons) were not included in the analysis. Bootstrap values (per mille) were computed for each inner node by re-sampling the rows of the matrix.

A pan- and core genome plot was constructed according to []. The order of genomes was chosen based on the pan-genome tree, starting with the largest E. coli O157 genome. For the pan-genome curve, all cumulative BLAST hits found in the genomes were plotted as a running total, which increases as more genomes are added. For the core genome, the number of gene families with at least one representative in every genome was plotted; this number slowly decreases as more genomes are added, because the added genomes may lack genes from gene families that had been conserved in the previously plotted genomes.

A BLAST atlas was constructed as described by Hallin et al. []. […]
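The gene-family criterion, the pan-genome matrix, and the pan- and core genome counts described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the toy gene families, and the genome labels are hypothetical, and reading "over at least 50% of the length of the longest gene" as alignment length divided by the longer of the two sequences is an assumption. The hierarchical clustering and the bootstrap re-sampling of rows are not shown.

```python
from itertools import combinations

# Hypothetical toy input: gene family -> genomes in which it has a member.
FAMILIES = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2"},
    "famC": {"g2", "g3"},
    "famD": {"g3"},  # singleton, occurs in one genome only
    "famE": {"g1", "g2"},
}
GENOMES = ["g1", "g2", "g3"]


def same_family(identity_pct, aln_len, len_a, len_b):
    """50/50 rule: >= 50% identity over >= 50% of the longer sequence.

    Assumption: coverage is measured as alignment length relative to
    the longer of the two protein sequences.
    """
    return identity_pct >= 50.0 and aln_len >= 0.5 * max(len_a, len_b)


def pan_genome_matrix(families, genomes, drop_singletons=True):
    """Rows are gene families, columns are genomes; 1 = present, 0 = absent."""
    rows = {}
    for fam, members in families.items():
        if drop_singletons and len(members) < 2:
            continue  # singletons are excluded from the clustering
        rows[fam] = [1 if g in members else 0 for g in genomes]
    return rows


def manhattan_proportions(rows, genomes):
    """Manhattan distance between genome columns, divided by the row count.

    The result is the proportion of gene families whose presence/absence
    status differs between each pair of genomes.
    """
    n_rows = len(rows)
    dist = {}
    for (i, a), (j, b) in combinations(enumerate(genomes), 2):
        differing = sum(abs(row[i] - row[j]) for row in rows.values())
        dist[(a, b)] = differing / n_rows
    return dist


def pan_core_curve(families, ordered_genomes):
    """Running pan-genome (union) and core genome (intersection) sizes."""
    pan, core, curve = set(), None, []
    for g in ordered_genomes:
        present = {fam for fam, members in families.items() if g in members}
        pan |= present
        core = present if core is None else core & present
        curve.append((g, len(pan), len(core)))
    return curve
```

With the toy data, `manhattan_proportions(pan_genome_matrix(FAMILIES, GENOMES), GENOMES)` gives the pairwise distances that a hierarchical clustering routine would take as input, and `pan_core_curve(FAMILIES, GENOMES)` returns one `(genome, pan size, core size)` tuple per added genome, with the pan size never decreasing and the core size never increasing.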

Pipeline specifications

Software tools: RNAmmer, Clustal W, NJplot, EasyGene, BLASTP
Applications: Genome annotation, Phylogenetics
Organisms: Escherichia coli