Computational protocol: Biophysical and structural considerations for protein sequence evolution

Similar protocols

Protocol publication

[…] To evaluate the fit of a sequence s to a backbone conformation c, where the number of residues N is the same for both s and c, s is threaded into c (see above) and then a weighted scoring function V(s, c) is computed: (1) V ( s , c ) = w b e n d V b e n d + w L J V L J + w h e l i x V h e l i x + w b e t a V b e t a + w i o n V i o n + w s o l v V s o l v + w S - S V S - S where:Vbend is a harmonic bending potential around the equilibrium bond angles for Cα-Cβ bonds: (2) V b e n d = ( 1 ∕ 2 ) K Θ ∑ i = 2 N ( Θ i - Θ 0 t ( i ) ) 2 + ( 1 ∕ 2 ) K Θ ∑ i = 1 N - 1 ( Θ i + 1 - Θ 0 t ( i ) ) 2 KΘ is the force constant for the bending potential (10.0 kJ mol-1 rad-2), Θi and Θi+1 are defined as described above, and ΘOt is the equilibrium bond angle for a Cβ bead of type t.VLJ is the pair-wise vdW interaction potential, approximated as a sum of Lennard-Jones potentials: (3) V L J = 4 ∑ i , j ε i , j [ σ i j r i j 12 - σ i j r i j 6 ] εij is the interaction parameter between beads i and j (a pair of Cα beads, a pair of one Cα bead and one Cβ bead or a pair of two Cβ beads), based on their respective hydropathy indexes, σij is the collision diameter of beads i and j (i.e. the sum of their radii), and rij is the defined in Figure . The sum ij runs over all pairs of beads noted above.Vhelix is the helical potential, an approximation of the entropic and steric effects of the backbone forming an α-helix: (4) V h e l i x = ∑ i = 3 N - 3 1 2 K i 1 - 3 r i , i + 2 - r h 2 + 1 2 K i 1 - 4 r i , i + 3 - r h 2 Ki1-3 is average helical propensity of residues i, i+1 and i+2. ri, i+2 is the distance between Cα beads i and i+2. ri, i+3 is the distance between Cα beads i and i+3. Ki1 - 4 is the average helical propensity of residues i, i+1, i+2, and i+3. Finally, rh is the equilibrium helix inter-bead distance (5.5 Å).Vbeta is the beta-sheet potential, constructed analogously to Vhelix but using an equilibrium beta torsion angle instead of a bead-bead distance: (5) V b e t a = K i 1 - 4 ∑ i = 2 N - 2 C b ( φ i - φ b ) 2 Ki1 - 4 is the average beta propensity of residues i-1, i, i+1, and i+2. The beta propensity for a given residue type is constructed using the same linear scaling of helical propensities used by Mukherjee, but is based on Kim and Berg's beta propensity scale [] rather than Pace and Scholtz's helix propensity scale []. Cb is a scaling constant (0.01) selected to scale the range of torsion angles to a similar magnitude as the range of bead-bead distances in Vhelix. Torsional angle φi is defined above (Figure ), and φb is the equilibrium beta sheet value (210°) for a two-bead representation as described by Bahar and coworkers [].Vion is an electrostatic potential based on Coulomb's Law: (6) V i o n = C c ∑ i , j q i q j ε r i j Cc is a scaling constant (1,000) selected to scale the term to similar magnitude as other terms, qi and qj are the charges of residues i and j (+1 or -1 depending on residue type), rij is the distance between beads i and j, and ε is the dielectric constant of the protein interior (3.0)[], with the sum running over all pairs of charged residues with a SASA (see below) of less than 0.25. This roughly approximates the strong screening of charged interactions due to the highly polarizable surrounding water, and the nearly negligible effects of the largely non-polarizable interior of the protein []. The whole term is then scaled by a weighting term wion when used in V(s, c) (eq. 1), so Cc has no effect on the parameterization, but is necessary computationally so as not to introduce errors due to lack of floating point precision.Vsolv is the potential due to solvation of the protein, approximated by a novel implicit solvent model based on solvent-accessible surface area (SASA): (7) V s o l v = ∑ i = 1 N h i S A S A ( i ) + p i 1 - S A S A ( i ) hi is the hydrophobicity-based interaction parameter of residue i (equivalent to εii above) and pi is the "polarity index", constructed as pi = (hmax-hmin) - hi. SASA(i) is the fraction of solvent-accessible surface area of residue i calculated by the NeighborVector method of Durham [] as adapted for the two-bead representation.VS-S represents the stability contribution due to formation of predicted disulfide bonds: (8) V s - s = ∑ i , j f r i , j where (9) f r i , j = { - 1 , r i , j < r s s 0 , o t h e r w i s e and rij is the distance between two cysteine Cβ beads and rSS is the maximal Cβ-Cβ distance in a typical disulfide-bonded cysteine pair (4.5 Å[]). The sum ij runs over all pairs of cysteine residues. This form of Vcys predicts disulfide bonds with a specificity of 0.98, sensitivity of 0.91 and Matthews Correlation Coefficient of 0.79 as measured on the structural data set used to construct SCWRL 3.0 [].Finally, wx are weights for each individual term in eq. 1, determined in a procedure described below. [...] Once a set of weights for V(s, c) (eq. 1) are obtained (the parameters Uab of E(s, c) in eq. 12 are fixed), the folding score gap landscape and its minima in sequence space were characterized for both scoring functions by varying s. Sequences were sampled based on Sgap (eq. 15) both near to and far from the native state using a typical [] MCMC approach, with acceptance probability: (19) p ( s k ) = 1 i f S g a p s k > S g a p s k - 1 e s g a p ( s k ) - s g a p ( s k - 1 ) T o t h e r w i s e where each step k was a single substitution in the protein sequence.For near-native sampling two temperatures were employed: a very high temperature for nearly unbiased sampling (T = 10 for the informational function E(s, c) of eq. 12, T = 50 for the physics-based function V(s, c) of eq. 1) and a lower temperature to find local minima (T = 0.1 for E(s, c), T = 1.5 for V(s, c)). The low temperature was chosen to make the chain capable of passing the barrier in the energy landscape caused by an average deleterious substitution under the informational model (~0.1 units of E(s, c)) with a reasonably high probability (~0.1). This was also calibrated with respect to the expected fraction of sequences with better folding stability for real proteins and indirectly on dN/dS. The high temperature was chosen to make the sampling nearly independent of Sgap (> 0.95 expected acceptance probability). Temperatures for the physics-based function were then chosen such that the actual acceptance frequencies obtained from sampling under the informational model were reproduced. Differences in the observed dN/dS ratios between the physics-based and informational models were due to differences in the ruggedness of the landscape and its sequence context dependence (local ruggedness).To increase sample density, divergence of the chains from the starting state was restricted to 5%-30%, in increments of 5%, and two chains were run at each temperature and divergence limit. The number of attempted steps was adjusted to obtain ~8,500 unique samples for each function. Samples were projected onto a two-dimensional representation of sequence space using Sammon mapping [] and score surfaces were calculated using Gnuplot v 4.2.To discover regions of the landscape where most mutations are destabilizing (i.e Sgap is near a maximum), an important feature of biological protein sequences leading to observed dN/dS ratios, ten replicate chains attempting 100,000 steps were run for each function at the lower sampling temperature without any divergence restrictions. The chains were thinned by taking every fourth unique sample and the final 100 such samples were retained, resulting in 1,000 unique samples for each scoring function. Samples were Sammon-mapped and sequence logos [] were calculated to assess convergence. Structural properties of the sampled equilibria were further characterized by projecting the consensus sequence at each position with > 2 bits of information onto the protein conformation. [...] As a test of the utility of the methods in an evolutionary context, simulations were performed under negative selection for protein folding and binding. In a manner similar to that of Rastogi et al. [], a population of 1,000 virtual organisms containing a single copy of the SAP protein [] (PDB ID: 1D4T) was evolved for 200,000 generations at a rate of 10-5 mutations per bp per generation at the DNA level, with transitions being twice as probable as tranversions. To obtain a thermodynamically stable starting point, non-native optima of Sgap (eq. 15) in protein sequence space were sampled as described above and a starting DNA sequence was randomly selected from the reverse translation of those samples.In each generation, the mutated DNA sequence in each organism was translated to protein, threaded through the native SAP conformation cprotein, and the fitness calculated as (20) w ( s ) = 0 i f S g a p ( s ) < S g a p s t a r t o r V b i n d ( s ) > V b i n d s t a r t 0 . 9 i f S g a p ( s ) ≥ S g a p s t a r t a n d V b i n d ( s ) ≤ V b i n d s t a r t a n d V d e c o y ( s ) ≤ V b i n d s t a r t 1 o t h e r w i s e where Sgap(s) is the fold gap score (eq. 15) of the current sequence s, Sgapstart is the fold gap score of the starting sequence, Vbind(s) is the binding score V(s, cprotein, sligand, cligand) (eq. 10), sligand and cligand correspond to the SLAM peptide (the native ligand of SAP), Vbindstart is the binding score of the starting protein sequence to the same ligand, and Vdecoy(s) is the binding score V(s, cprotein, sdecoy, cligand) for a decoy ligand. Simulations under the informational model used E(s, c) (eq. 12) to compute Sgap and E(s1, c1, s2, c2) (eq. 13) to compute binding scores. In other words, mutations that make the protein or the protein-ligand complex less stable than the starting point are infinitely deleterious, mutations that maintain those stabilities but enable binding of the decoy ligand are somewhat deleterious, and all other changes are neutral. Organisms were then propagated to the next generation by random sampling weighted by fitness w (eq. 20), with replacement, while maintaining a constant population size.The decoy ligand were constructed by threading a sequence sdecoy = IWMTIYMIIIT through the SLAM peptide conformation cligand for the informational function, and sdecoy = RLPTIYICITG for the physics-based function. Both sequences are variations on the experimentally determined XXXTIYXX(VI)XX SAP-binding motif [], and do not bind the protein at the starting point of the simulation.Finally, each simulation was replicated 10×, and the resulting sequences analyzed. Structural patterns of evolutionary rates (dN/dS) were measured by comparing sequences to the original sequence and calculating according to the PBL method [,], and the distribution of position-specific residue preference was compared with that in the Pfam database [] for the same fold by constructing sequence logos [] from the evolved sequences. The sequences were also tested for emergence of rate heterogeneity by assessing the superior fit of the data to the rates-across-sites model over an equal-rates model using ProtTest []. To discern if the novel solvation model preserves the general pattern of hydrophobic core and hydrophilic surface the proportion of hydrophobic and hydrophilic residues in parts of the protein were measured before and after simulation. […]

Pipeline specifications