Computational protocol: Population Diversity of ORFan Genes in Escherichia coli

Protocol publication

[…] Clusters of homologous sequences used in this study ultimately derive from the Protein Clusters database, a collection of automatically clustered Reference Sequence (RefSeq) proteins from complete genomes of prokaryotes, plasmids, viruses, and organelles, and from complete and incomplete genomes of protozoa and plants. Some clusters include curated information on protein function; those without known function are annotated as “hypothetical protein.”

Data on prokaryotic protein clusters were obtained from NCBI and represent the January 2010 version of the clusters database. Specifically, the PRK_summary, PRK_AllProteins.bcp, Clusters.bcp, and NonCuratedClusters.bcp files were downloaded from NCBI. These clusters include proteins from 35 E. coli genomes and 88 E. coli plasmid genomes (including Shigella and Shigella plasmid genomes). The E. coli genomes used in the study are listed in the original publication. The plasmid genome IDs used in the study are listed in the supplementary data, Supplementary Material online.

A local database was created using MySQL (accessed January 2011) and populated with data from the files mentioned above, so that it contains the cluster ID, protein GIs, taxon ID, scientific name of each taxon, and genome ID. Three groups of E. coli clusters (t1, t2, and ORFan) were designated, as described in the next section. The corresponding DNA sequences were retrieved from NCBI by protein GI number and added to the MySQL database. For each protein cluster, a DNA sequence FASTA file was exported from the MySQL database and aligned with MUSCLE.

Some of the clusters taken from the NCBI clusters database contain paralogous subfamilies (880 ORFan clusters, 148 t1 clusters, and 32 t2 clusters). To remove paralogy, we redefined each of these clusters as its largest subcluster that includes only one gene copy from each E. coli accession. Subclusters were defined by the unweighted pair group method with arithmetic mean (UPGMA), as performed in MUSCLE [...]
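The paralogy-removal step can be illustrated with a minimal sketch. The text does not say how the largest single-copy subcluster was extracted from the UPGMA clustering, so everything here is an assumption made for illustration: the tree is represented as nested Python tuples, each leaf label has the hypothetical form `accession|gene`, and the function names are invented. The sketch walks the tree from the root and keeps the largest clade in which no accession appears more than once.

```python
from collections import Counter


def leaves(node):
    """Return all leaf labels under a node (a str leaf or a tuple of children)."""
    if isinstance(node, str):
        return [node]
    labels = []
    for child in node:
        labels.extend(leaves(child))
    return labels


def accession(label):
    """Extract the genome accession from an 'accession|gene' label (assumed format)."""
    return label.split("|", 1)[0]


def is_single_copy(labels):
    """True if every accession contributes exactly one gene to this set of leaves."""
    counts = Counter(accession(label) for label in labels)
    return all(n == 1 for n in counts.values())


def largest_single_copy_subcluster(tree):
    """Return the sorted leaves of the largest clade with one gene per accession.

    Ties are broken by traversal order, which is an arbitrary choice of this sketch.
    """
    best = []
    stack = [tree]
    while stack:
        node = stack.pop()
        labels = leaves(node)
        if is_single_copy(labels):
            # Any clade nested inside this one is smaller, so stop descending here.
            if len(labels) > len(best):
                best = labels
            continue
        # A single leaf is always single-copy, so node must be a tuple here.
        stack.extend(node)
    return sorted(best)


# Hypothetical cluster: accessions K12 and O157 each carry two paralogous copies.
example_tree = ((("K12|g1", "O157|g2"), "CFT073|g3"), ("K12|g4", "O157|g5"))
result = largest_single_copy_subcluster(example_tree)
```

In this toy tree the root contains two copies from K12 and from O157, so the search descends and compares the three-gene clade with the two-gene clade, keeping the larger one.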
Non-ORFan control groups representing different phylogenetic depths were identified as shown in the figure of the original publication. If a protein cluster has members distributed beyond E. coli, but only within the t1 group of species, it is assigned to the t1 group. Clusters with more widely distributed members are assigned to the t2 group. These two groups represent robust clades in a more detailed phylogenetic tree computed from the data of Wu et al. Because the original tree of Wu et al. did not include bootstrap support values, a new tree was computed from a pruned alignment using the RAxML and consense programs available via the CIPRES server (accessed January 2011). The consense program was used to compute a consensus tree from 1000 RAxML bootstrap replicates performed using the WAG model. The phylogeny is provided as a PDF image (Fig. S1) and as a NEXUS file in the supplementary data, Supplementary Material online. From the resulting tree, we identified two monophyletic clades with strong support (>96% bootstrap support), corresponding to the t1 (younger) and t2 (older) control groups.

Fig. 1. […]
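The assignment rule described above can be sketched as a small set-based classifier. This is an illustration under stated assumptions, not the authors' implementation: ORFan clusters are taken to be those restricted to E. coli (including Shigella), any cluster extending beyond the t1 species set is treated as t2 (the study may additionally have bounded t2 by its clade), and the function name and taxon labels are hypothetical placeholders.

```python
def classify_cluster(cluster_taxa, ecoli_taxa, t1_taxa):
    """Assign a cluster to 'ORFan', 't1', or 't2' from the taxa of its members.

    cluster_taxa: taxa that have at least one protein in the cluster
    ecoli_taxa:   taxa counted as E. coli (assumed to include Shigella)
    t1_taxa:      taxa in the younger t1 clade (E. coli is added automatically)
    Returns None if the cluster has no E. coli member at all.
    """
    cluster_taxa = set(cluster_taxa)
    ecoli_taxa = set(ecoli_taxa)
    t1_taxa = set(t1_taxa) | ecoli_taxa

    if not cluster_taxa & ecoli_taxa:
        return None       # not an E. coli cluster
    if cluster_taxa <= ecoli_taxa:
        return "ORFan"    # assumed definition: restricted to E. coli
    if cluster_taxa <= t1_taxa:
        return "t1"       # beyond E. coli, but within the t1 species
    return "t2"           # more widely distributed (assumption: no upper bound)


# Hypothetical placeholder taxa, not the species sets used in the study.
ECOLI = {"Ecoli_A", "Ecoli_B"}
T1 = {"t1_species_X"}

labels = [
    classify_cluster({"Ecoli_A", "Ecoli_B"}, ECOLI, T1),
    classify_cluster({"Ecoli_A", "t1_species_X"}, ECOLI, T1),
    classify_cluster({"Ecoli_B", "distant_species_Y"}, ECOLI, T1),
]
```

In practice the taxon sets would come from the leaves of the two well-supported clades in the bootstrap consensus tree, and `cluster_taxa` from the taxon column of the local database.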

Pipeline specifications

Software tools: MUSCLE, RAxML
Databases: ProtClustDB
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Escherichia coli
Diseases: Dysentery, Bacillary