Computational protocol: Analysis of and function predictions for previously conserved hypothetical or putative proteins in Blochmannia floridanus

Protocol publication

[…] The original genome sequence and annotation [] were reanalyzed for all conserved hypothetical or putative protein predictions. Genome context [,] and metabolic context [,] were considered, and sequences and predicted pathways were extensively compared to available completely sequenced genomes to better identify and assign the encoded proteins. Furthermore, iterative sequence analysis compared sequences to other organisms and public databases (reviewed in []). The threshold for reporting hits was generally set conservatively at an expected value (E-value) of 10⁻⁶.

Specific sequence searches were done with HMMER []. Intrinsic sequence feature predictions were derived from the ExPASy suite of tools []. To check sequence analysis results independently, we applied not only other programs with similar function, such as HMM or FASTA searches, but also complementary tools and methods such as domain analysis, phylogenetic analysis, analysis of genome context and clusters of orthologous genes.

In addition, we applied the different tools for metabolic reconstruction and pathway alignment using extensive sequence analysis protocols as described previously []. Among other tests, this included verification of detected similarities by reciprocal searches from the identified sequences and determination of the exact region of sequence similarity. To delineate enzymatic capabilities, the multi-domain architecture of many proteins was taken into account: individual parts of a protein sequence encode different domains with different functions. These regions were analyzed separately to identify the different domains in the protein and their specific functions. Function assignments were tested and confirmed, including by sequence searches starting from sequences with experimentally determined function []. In this way, significant links to experimentally determined function were established.
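The reciprocal verification step described above can be illustrated with a minimal sketch. It assumes that search results are already available as (query, subject, E-value) tuples and it uses the reciprocal best hit criterion, which is one common way to implement such a check; the protocol itself does not state the exact criterion. The function names and sequence identifiers are hypothetical, and the strict "below 10⁻⁶" comparison is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

# (query_id, subject_id, e_value); the tuple layout is an assumption for this sketch
Hit = Tuple[str, str, float]

E_VALUE_CUTOFF = 1e-6  # conservative threshold stated in the protocol


def best_hits(hits: Iterable[Hit], cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the subject with the lowest E-value below the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, e_value in hits:
        if e_value >= cutoff:
            continue  # discard hits that do not pass the threshold
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return best


def reciprocal_best_hits(forward: Iterable[Hit], backward: Iterable[Hit],
                         cutoff: float = E_VALUE_CUTOFF) -> List[Tuple[str, str]]:
    """Return (query, subject) pairs whose best hits point at each other."""
    fwd = best_hits(forward, cutoff)
    bwd = best_hits(backward, cutoff)
    pairs = []
    for query, (subject, _) in fwd.items():
        if subject in bwd and bwd[subject][0] == query:
            pairs.append((query, subject))
    return sorted(pairs)


# Hypothetical identifiers and E-values, for illustration only
forward_hits = [("orf_1", "ref_A", 1e-20), ("orf_1", "ref_B", 1e-8), ("orf_2", "ref_C", 1e-3)]
backward_hits = [("ref_A", "orf_1", 1e-18), ("ref_B", "orf_9", 1e-30)]
confirmed = reciprocal_best_hits(forward_hits, backward_hits)
```

In this toy input only the pair `orf_1`/`ref_A` survives: `orf_2` has no hit below the cutoff, and the best hit of `ref_B` points to a different sequence.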
Proteins classified with high confidence (BLAST E-value below 10⁻⁶) and an informative assignment were categorized as "good" (right column; see ). However, if minor uncertainties about the function remained, the assignment was categorized as "fair". A protein function was classified as "putative" (15 cases; see ) if its sequence was similar to well characterized protein sequences or protein domains with an E-value less than or equal to 10⁻³ and there was only a first indication of the protein function. All other cases were classified as "unknown".

Phylogenetic analysis was applied to investigate the distribution of identified proteins at different taxonomic levels (specific for Blochmannia, in Enterobacteriaceae, in Proteobacteria, spread among all bacteria). It also helped to analyze gene duplication events and to clarify the substrate specificity of the encoded enzymes.

Further information on sequence and protein family classification was obtained from comparative genomics, gene context methods and comparisons of domains and sequences [], including iterative searches and multiple alignments, exploiting the following databases: Clusters of Orthologous Groups of proteins (COGs) [], the conserved domain server [], and the protein family databases Pfam, SMART and InterPro [].

Duplicated genes were examined further to determine which of them was the true ortholog in gene sequence comparisons []. Replacement by unrelated sequences (non-orthologous displacement; []) hampers function identification by sensitive sequence alignment procedures. In such cases, gene neighbourhood and operon context helped to determine the function of reading frames. In addition, more elaborate genome context methods were used.

Genome context methods and searches for functional associations exploited the STRING database [,].
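The confidence categories for function assignments ("good", "fair", "putative", "unknown") can be sketched as a small decision function. This is only one reading of the criteria given above: the flags `informative` and `minor_uncertainty` stand in for curator judgments that the protocol does not formalize, and the function name and signature are hypothetical.

```python
GOOD_CUTOFF = 1e-6      # "high confidence" E-value threshold from the protocol
PUTATIVE_CUTOFF = 1e-3  # upper E-value bound for a "putative" assignment


def classify_assignment(e_value: float, informative: bool,
                        minor_uncertainty: bool = False) -> str:
    """Map an E-value plus curator judgments to a confidence category.

    informative: curator considers the assignment informative (assumed flag).
    minor_uncertainty: minor doubts about the function remain (assumed flag).
    """
    if e_value < GOOD_CUTOFF and informative:
        return "fair" if minor_uncertainty else "good"
    if e_value <= PUTATIVE_CUTOFF:
        return "putative"  # only a first indication of the function
    return "unknown"


# Hypothetical values, for illustration only
examples = [
    classify_assignment(1e-10, informative=True),
    classify_assignment(1e-10, informative=True, minor_uncertainty=True),
    classify_assignment(1e-4, informative=True),
    classify_assignment(0.5, informative=True),
]
```

Treating a strong but uninformative hit as "putative" is a design choice of this sketch, not something the protocol states explicitly.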
In STRING, functional association as well as direct interaction at the protein level is predicted from the conservation of genome context across many different species. A first observation [] was that reading frames which are conserved as neighbours in many genomes are a useful predictor of direct interaction between the encoded proteins. This was validated by considering proteins known to interact and the positions of their reading frames. The approach also allowed new interactions to be predicted []. Subsequent studies refined genome context methods, which now also include gene fusion of the reading frames in one or several genomes as an even stronger predictor of interaction, as well as the common presence or common absence of reading frames that are functionally associated or belong to common pathways []. Furthermore, data mining (co-occurrence of genes in articles) and direct interaction data (yeast two-hybrid, large-scale TAP-tag screens) were added as functional association indicators in the updated version of the database used for our predictions []. To compare and collate these different types of predictions, a prediction score is calculated using Bayesian probabilities, ranging from 0.0 (no functional association) to 1.0 (certain) [,]. Four categories are distinguished []: highest (0.9), high (0.7), medium (0.4) and low confidence (0.15). Only the high and medium categories were used for predictions here.

Pathway alignment [] compared the reading frames found to be present for a pathway of interest in a specific organism to the version present in other organisms. Sequence searches established the presence of reading frames with orthologous function in experimentally better characterized prokaryotic species such as E. coli.
These predictions were retested using biochemical data (to test for enzymes with diverged sequences escaping detection) and by calculating metabolic fluxes by elementary mode analysis (in particular to test whether missing enzyme activities are compensated by detours or alternative paths). Thresholds in the pathway alignment for sequence searches against databases were set at an E-value below 10⁻⁶, and hits were accepted if they passed the other two tests. [...]

For several of the Blochmannia sequences with previously unknown function we identified homologous sequences in other species with a solved three-dimensional structure. For some of these solved structures the function was not yet known, as the structures had been solved as part of a large-scale structural genomics project in that species (e.g. E. coli, Aquifex aeolicus). Where such homologous sequences with known three-dimensional structure were identified, homology models were obtained using the SWISS-MODEL server []. The server selects a template, creates an alignment and builds a homology model, including energy minimization and WHAT_CHECK [] reports. Specifically, the ProModII program was used for modelling; energy minimization used GROMOS96 (parameter set ifp43B1), applying steepest descent with 200 cycles and conjugate gradient with 300 cycles. The template for Bfl316 (the full Blochmannia sequence had 153 residues) was PDB entry 1oz9 (protein 1354 from Aquifex aeolicus, resolution 1.89 Å). The template allowed modelling of Blochmannia residues 33 to 130 in the homology model shown. The template for Bfl499 (303 residues) was PDB entry 1nv8 (transferase HemK from Methanococcus jannaschii, resolution 1.80 Å). The template allowed prediction of residues 77 to 242 in the homology model. The template for Bfl341 was PDB entry 1ri6 (putative isomerase from E. coli, resolution 2.00 Å).
The template and the resulting homology model covered the whole Bfl341 sequence (338 residues) except for the three most N-terminal residues. The root-mean-square deviation (RMSD) of each homology model from its template was calculated. All homology model coordinates are available on request from the authors. […]
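As an illustration of the RMSD comparison mentioned above, the following minimal sketch computes the RMSD between two equally long lists of paired atom coordinates. It assumes that model and template are already superimposed and that atoms are paired one to one (for example, C-alpha atoms of aligned residues); it does not perform an optimal superposition, and the protocol does not state which atoms or which software were used. The function name and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]  # x, y, z in Angstrom


def rmsd(model: Sequence[Coord], template: Sequence[Coord]) -> float:
    """Root-mean-square deviation over paired, pre-superimposed atoms."""
    if len(model) != len(template):
        raise ValueError("model and template must contain the same number of atoms")
    if not model:
        raise ValueError("at least one atom pair is required")
    # Sum of squared distances between corresponding atoms
    squared = sum(
        (a - b) ** 2
        for atom_m, atom_t in zip(model, template)
        for a, b in zip(atom_m, atom_t)
    )
    return math.sqrt(squared / len(model))


# Toy coordinates, for illustration only: every atom is shifted by 1 Angstrom in z
model_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(model_atoms, template_atoms)
```

For real structures, the coordinates would first be extracted for the modelled residue range only (for example residues 33 to 130 for Bfl316), since residues outside the template coverage have no counterpart.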

Pipeline specifications

Software tools: HMMER, WHAT_CHECK
Databases: ExPASy
Applications: Phylogenetics, Protein structure analysis
Chemicals: Inositol, Methylcellulose, Nitrogen, Ubiquinone