Computational protocol: Connecting the Sequence-Space of Bacterial Signaling Proteins to Phenotypes Using Coevolutionary Landscapes

Similar protocols

Protocol publication

[…] An amino acid sequence s=(A1,A2,…,AL) for a protein or interacting proteins can be viewed as being selected from a Boltzmann equilibrium distribution, i.e., P(s)=Z−1 exp ⁡(−H(s)). The Boltzmann form of P was previously derived for an evolving population in the limit where the product of the population size and mutation rate is very small (). Specifically, it was shown () that H(s)=−vx(s), where v is the population size and x(s) is the additive fitness landscape (i.e., log of the fitness). A high population size suggests many viable sequences that a protein can mutate to, which makes the population more robust to deleterious mutations. Related work () modeled site-specific selection of sequences, which has been extended by numerous works (; ; ).Under certain limiting conditions, H(s) appears to share a correspondence with the energetics of protein folding [See review: ()]. Assuming that the sequence diversity is completely due to stability considerations, H(s)=βE(s) where E(s) is the energy of the folded protein with respect to the unfolded state and β=(kBTsel)−1 is the inverse of the evolutionary selection temperature from protein folding theory (, ; ). Several studies have reported strong linear correlation between mutational changes in H(s) with mutational changes in protein stability (; ; ). However, H(s)=βE(s) may not be an appropriate approximation for proteins that have evolved with interacting partners, for which sequence selection is plausibly influenced by additional factors such as binding affinities as well as binding/unbinding rates.Often it is of interest to solve the inverse problem of inferring an appropriate H(s) when provided with an abundant number of protein sequences. Typical approaches to this problem have applied the principle of maximum entropy to infer a least biased model that is consistent with the input sequence data (; ), e.g., the empirical single-site and pairwise amino acid probabilities, Pi(Ai) and Pij(Ai,Aj), respectively. The solution of which is the Potts model: (3)H(s)=−∑i=1L−1∑j=i+1LJij(Ai,Aj)−∑i=1Lhi(Ai) where Ai is the amino acid at position i for a sequence in the MSA, Jij(Ai,Aj) is the pairwise statistical couplings between positions i and j in the MSA with amino acids Ai and Aj, respectively, and hi(Ai) is the local field for position i . We estimate the parameters of the Potts model, {J,h}, using the pseudo-likelihood maximization Direct Coupling Analysis (plmDCA) [See Ref: () for full computational details]. It is important to note that the mutational context-dependence (epistasis) between residues i and j is naturally captured in the model through the statistical couplings, Jij(Ai,Aj).Previous studies have applied DCA to a number of problems in structural biology. DCA has been used to identify highly coevolving pairs of residues to predict the native state conformation of a protein (; ; ), including repeat proteins (), as well as identify additional functionally relevant conformational states (; ; ) and multi-meric states (; ; ; ; ). Structural and coevolutionary information share complementary roles in the molecular simulations of proteins [See review: ()]. The Potts model () obtained from DCA has been related to the theory of evolutionary sequence selection () as well as mutational changes in protein stability (; ; ). Additional work has applied DCA to protein folding to predict the effect of point mutations on the folding rate () as well as construct a statistical potential for native contacts in a structure-based model of a protein () to better capture the transition state ensemble.DCA and inference methods have also been applied to study problems in systems biology, such as the identification of relevant protein-protein interactions in biological interaction networks (; ). Recently, a number of studies have focused on inferring quantitative landscapes that capture the effects of mutations on biological phenotypes (; ; ) by constructing models from sequence data (). Two separate studies, which focused on antibiotic drug resistance in E. coli () and viral fitness of HIV-1 proteins (), respectively, inferred a Potts model (i.e., additive fitness landscape), H, and calculated mutational changes as ΔH. This approach is analogous to the approach adopted in this study. Likewise, a study examining the mutational effects on TCS phosphotransfer () constructed a mutational landscape from an information-based quantity. All of these approaches capture epistatic effects and rely on the accuracy of the inferred Potts model () from sequence data. [...] ZEMu consists of a multiscale minimization by dynamics, restricted to a flexibility zone of five residues about each substitution site (, which is followed by a mutational change in stability using FoldX (). ZEMu has been used to explain the mechanism of Parkinson’s disease associated mutations in Parkin (; ). The minimization is done in MacroMoleculeBuilder (MMB), a general-purpose internal coordinate mechanics code also known for RNA folding (), homology modeling (), morphing (), and fitting to density maps ().We use the ZEMu ( method to predict the mutational change in binding energy between PhoQ and PhoP. ZEMu first treats mutations as small perturbations on the structure by using molecular dynamics simulations [See Ref. for full computational details] to equilibrate the local region around mutational sites. ZEMu can then estimate the binding affinity between the mutant-PhoQ/PhoP, ΔGTCSZEMu(mutant), and the wild type-PhoQ/PhoP, ΔGTCSZEMu(WT), using the knowledge-based FoldX () potential. This allows for the calculation of the mutational change in binding affinity as: ΔΔGTCSZEMu=ΔGTCSZEMu(mutant)−ΔGTCSZEMu(WT).ZEMu calculation was performed according to Ref: (, with the following two differences. First, due to the large number of mutations we capped the computer time permitted to 3 core-hours per mutant, whereas in ( the limit was 48 h. This meant that of 122,802 mutants attempted, 42,923 completed within the time limit, whereas in (, almost all mutants converged. The major reason for non-convergence in the current work involved mutation to bulky or constrained residues. Steric clashes produced by such residues force the error-controlled integrator () to take small time steps and hence use more computer time. Exemplifying this, the amino acids F, W, and Y are the most common residues for non-converging mutations at positions 285 and 288 in PhoQ. The second difference was that we permitted flexibility in the neighborhood of all four possible mutation sites, even when not all of them were mutated, whereas in ( only the mutated positions were treated as flexible. This allowed us to compare all of the mutational energies to a single wild-type simulation, also performed with flexibility at all four sites.Database of TCS partners, Potts model, and code for calculating can be obtained from: http://utdallas.edu/~faruckm/PublicationDatasets.html; last accessed August 10, 2016. […]

Pipeline specifications

Software tools DCA, plmDCA, FoldX
Application Protein structure analysis
Organisms Bacteria, Escherichia coli