14540
J. Phys. Chem. 1996, 100, 14540-14548
Optimizing Potential Functions for Protein Folding Ming-Hong Hao and Harold A. Scheraga* Baker Laboratory of Chemistry, Cornell UniVersity, Ithaca, New York 14853-1301 ReceiVed: March 20, 1996X
A new procedure for optimizing the potential function for proteins is presented, and the folding of latticechain protein models was studied with optimized potentials. The pairwise contact energies between residues in the lattice protein models were optimized by an iterative algorithm that maximizes the free-energy difference between the (given) native structure and the non-native states of the protein. Optimization of the energy parameters for a variety of sequences of the model protein was carried out. The statistical-mechanical properties of the model proteins were analyzed with different sets of energy parameters to investigate the effects of optimization of the energy parameters on protein folding. For many sequences, optimization of the potential leads to strongly foldable protein models that can be folded to the target native structures in relatively short Monte Carlo simulations. It is found that consistency among the different components of the interactions in the native structure of a protein is a necessary condition for the existence of an exact and efficient folding potential. The results of this work reveal some crucial correlations between the sequence and the native structure of a protein, which determine the unique folding of the protein.
I. Introduction Unlike most polymers, which collapse to compact disordered states, proteins fold to precise and unique structures called the native state. What distinguish proteins from other polymers are the unique interactions encoded in the specific amino acid sequences of the proteins. To be able to fold proteins theoretically, one must first unveil the interactions that dictate the folding of proteins. However, despite many investigations, a general potential that can fold different proteins has not yet been obtained. There are two convenient theoretical approaches with which foldable models of protein can be obtained and the nature of the correct potentials for folding proteins can be studied. One approach is to optimize the amino acid sequence1-5 of a polypeptide with respect to a fixed (and not necessarily correct) potential so that the resulting sequence becomes the most favorable one for the given target native structure with the assumed potential. The other approach is to optimize the potential6-11 for a given protein with respect to the known native structure so that the latter becomes the thermodynamically most stable structure among all non-native conformations. These two approaches share a common basis in folding proteins, that is, both approaches optimize the interactions among the amino acid residues in the natiVe structure of a protein, which is apparently a key condition for foldable models of proteins. It is clear that only with the establishment of a good potential can one fold proteins of given sequences. Therefore, in this work, we investigate the problem of optimizing the potential functions for protein folding. Because of the complexity of protein systems, simplified empirical energy functions based on pairwise contact interactions among the constituent residues of a protein have been used widely in theoretical studies of protein folding. Most of these simplified potentials for proteins are derived on the basis of the statistical distributions of conformations in protein crystal structures (the PDB).12-18 Despite the usefulness of such potentials, there are a number of problems with this traditional approach:19 the number of known protein crystal structures is quite limited; the crystal structures represent only a tiny fraction * To whom correspondence should be addressed. X Abstract published in AdVance ACS Abstracts, August 1, 1996.
S0022-3654(96)00856-8 CCC: $12.00
of the possible conformations of any single protein; deriving an interaction potential from the protein data requires major approximations, such as the independence of the interactions between different pairs of residues (the quasichemical approximation); it is questionable that sufficiently good potentials; which can fold proteins uniquely and deterministically, can be derived from the protein data base. Therefore, it is highly desirable to search for new ways that can either improve the traditional potentials or can derive new potentials for proteins. In this work, we describe an approach that derives optimized energy parameters based on the idea that thermodynamic foldability of proteins is a necessary and sufficient condition for obtaining a potential for folding proteins. This approach does not rely on the protein data base or on some assumptions about the non-native states of the proteins in deriving the energy parameters. Therefore, the approach is suitable for optimizing the potential for general protein models and provides a way to study the nature of the correct interactions in protein systems. Traditionally, the folding behavior of a model protein is studied and described from the results of many independent simulations of sufficient length. A better way to address the same problem is to determine the density of states of the protein model.20,21 Once the density of states of a protein model is determined, the thermodynamic problem of protein folding is solved, independently of the kinetics of the simulation algorithm. Of course, in order to fold a protein practically by simulations, the folding event of the protein model has to be short within the computational capacity.22 However, if one can establish a correlation between the statistical-mechanical characteristics, e.g., the density of states, and the folding kinetics for the protein model, then, once the statistical-mechanical character of a protein model is determined, one will be able to determine whether the protein model is foldable thermodynamically and also to deduce whether the folding kinetics of the model is fast enough for simulation studies. In the present study concerning the evaluation of the quality of the optimized energy parameters, we not only determined the density of states of the protein models but also examined the relationship between the statistical-mechanical characteristics and the folding kinetics of the model proteins. © 1996 American Chemical Society
Protein Folding
J. Phys. Chem., Vol. 100, No. 34, 1996 14541
The objectives of this study are to introduce a new procedure for optimizing the energy parameters and to demonstrate that the approach is efficient in deriving a folding potential for different protein models. The central theme of this paper is that thermodynamic foldability is a necessary and powerful condition upon which good potentials can be derived for folding proteins. However, this paper is not concerned with a universal potential for all proteins; the universial potential would require more conditions in addition to the thermodynamic foldability requirement. In section II, the optimization algorithm is described and the lattice protein model used in this work is defined. In section III, results are presented for the folding of model proteins with unoptimized and optimized energy parameters; the effects of the sequence and native structures on the optimized energy parameters are analyzed, and the sequences of model proteins are classified in terms of the characteristics of their optimized folding potentials. The conclusions of this work are provided in the last section. II. Theory and Methods Optimization Algorithm. To formulate the problem of optimizing the potential function for proteins, we consider the most basic interactions in a protein, i.e., the pairwise interactions between different pairs of its residues. The total energy of the protein in a given conformation is the sum of the pairwise interactions: N
E ) ∑eijh(ri - rj)
(1)
i,j
where N is the total number of residues, eij is the interaction parameter between residues i and j, and h(ri - rj) is the “contact function” whose value is unity when the two residues are in contact, but not connected on the chain, and is zero otherwise. Among all the possible conformations of a protein, the native structure is of special interest. The Boltzmann probability p0 of the native structure is defined by
p0 )
e-E0/(kT) e ∑ m
(2)
-Em/(kT)
where E0 is the energy of the native structure, k is the Boltzmann constant, T is the absolute temperature, and the sum is taken over all possible conformations of the protein. In order for the protein to be able to fold to the native conformation at a finite temperature, the Boltzmann probability of the native structure must be the largest among those of all other conformations.5 Therefore, the potential function for the protein can be optimized by finding a set of energy parameters, {eij}, that maximizes the Boltzmann probability of the native conformation. For convenience of operation, it is easier to maximize the logarithm of p0 instead of p0 itself. Taking the logarithm of both sides of eq 2 and noting that Z ) ∑m exp(-Em/(kT) is the partition function of the protein, the maximization of the probability p0 can be expressed as
max(kT ln p0) ) max[-E0 + F]
(3)
where F ) -kT ln(Z) is the Helmholtz free energy of the protein at a given temperature T. The free energy of the protein can be written as
F ) 〈E〉 - TS
(4)
where 〈E〉 is the average energy and S is the entropy. At a
finite temperature T, the free energy of the protein is dominated by only a fraction of the conformations of the protein. The unfolded state of a protein can be treated approximately by a “random-energy” model23-25 in which the energies of different non-native conformations are described as statistically independent random values. For such systems, at not too low a temperature, the thermally populated density of states is a unimodal function that is the product of an increasing function (the total density of states) and a decreasing function (the Boltzmann factor). As a result, the logarithm of the thermally populated density of state can be expressed as a series of the moments of the energy function. The first approximation to such a distribution is the Gaussian function26 in which the entropy of the protein can be calculated as
S ) S0 -
∆2 2kT2
(5)
where S0 is a constant for a given protein model and ∆ is the root-mean-square deviations (rmsd) of the energies of the nonnative conformations. Although higher orders of approximation for the free energy of the protein can also be obtained,26 they involve higher moments of the energy function, which are computationally very expensive (requiring calculations of higher dimensional matrices as compared to eq 7 below). Therefore, in this work, we restricted ourselves to the Gaussian approximation for the free energy of the non-native states of a protein. Now we describe how the maximization of the free-energy difference in eq 3 is carried out computationally. When the contact functions between all residues of type µ and all residues of type ν in a given structure are summed together, we obtain the total contact function between two residue types: N
hµν ) ∑hijδi,µδj,ν
(6)
i,j
where δi,µ ) 1 if the ith residue belongs to the µ type and otherwise δi,µ ) 0. To distinguish between the index of a single residue and the type of residue in the expression, we will use the Greek letter subscripts to indicate the type of residue and use the Roman letter subscripts to indicate the index of a single residue. With the vector notations h ≡ {...hµν...} and e ≡ {...eµν...}, the average energy can be written as 〈E〉 ) e〈h〉 and the fluctuations of the energy can be expressed as ∆2 ) e[B]e where the elements of the matrix [B] are calculated as B(µν),(µ′ν′) ) 〈hµνhµ′ν′〉 - 〈hµν〉〈hµ′ν′〉. With the constant term in the entropy expression of eq 5 omitted, the optimization problem of eq 3 can be expressed as
max(〈δh〉e - (a/2)e[B]e)
(7)
where δh ) h - ho with ho being the total contact function between residue types in the native conformation and a ) -1/ (kT). The quantities 〈δh〉 and [B] in eq 7 are thermodynamic averages of the protein system at a finite temperature. Therefore, their values will depend on the potential that governs the system. To overcome the interdependency between the energy parameters to be optimized and the distributions of protein conformations that are required for the optimization task, we use the following iterative algorithm to obtain a set of selfconsistent optimized energy parameters: (i) start with an arbitrary set of energy parameters; (ii) determine the quantities 〈δh〉 and [B] with conventional MC simulations; (iii) modify the energy parameters according to the gradient-descent formula enew ) eold + sG, where s is a step size and G is the modified gradient of eq 7:
14542 J. Phys. Chem., Vol. 100, No. 34, 1996
G ) 〈δh〉 - a[B]e - γho
Hao and Scheraga
(8)
where γ is the Lagrange multiplier chosen to maintain the energy of the native conformation, E0 ) eho, at a specified value during the optimization process. The latter term is necessary for fixing the reference point and setting the scale of the energy for the protein. Optimization of the energy parameters produces changes not only in these parameters but also in the overall energy of the protein; the latter has no effect on the optimization of the energy parameters. Using the Lagrange multiplier, the overall energy of the protein is fixed; only the relative changes among the different energy parameters are retained. In optimization of the energy parameters, either step iii or both steps ii and iii are iterated until the protein becomes strongly foldable, which is the indicator that the energy parameters have been sufficiently optimized. The computationally demanding part of the above algorithm is step ii; therefore, it is desirable that step ii be carried out as few times as possible in the optimization process. However, simulations with a poor set of energy parameters usually cannot provide a good representation of the low-energy non-native states of the protein system. As a result, step ii must be carried out at least twice during the iteration process. It is of interest to note that the formula of eq 8 for optimizing the energy parameters is similar, with some difference in the coefficient a, to an earlier formula for the same purpose.11 The earlier formula was derived by maximizing an R ratio,6,7 which is related to the thermodynamic foldability of the protein but also has an approximate connection with the optimal kinetic condition. The similarity of the optimization criteria based on the kinetic and thermodynamic considerations indicates that there is an intrinsic connection between the thermodynamics and kinetics of protein folding, i.e., when the thermodynamic condition is optimal, the kinetic condition also becomes favorable. We note that recently Wolynes and co-workers also independently considered an iterative procedure27 for optimizing the energy parameters that is similar, in spirit, to our approach.11 Although eq 8 is used for treating only one protein, the approach can easily be generalized for optimizing a common set of energy parameters for many different proteins. The generalized gradient expression for optimizing the energy parameters for many proteins simultaneously is
G ) ∑(〈δhr〉 - a[Br]e) - γho
(9)
r
where the sum over r is for the number of different proteins that are necessarily included in the optimization. It should be noted here that, in order to set the scale for the energies of an ensemble of proteins, one needs to fix the energy of only one protein in the ensemble. Lattice Protein Model. In this work, we used a simple protein model, a heteropolymer chain on the cubic lattice with nearest-neighbor site contact interactions to study the folding of proteins. The main advantage of this model is that the simulation algorithm is simpler and faster than that for the more complicated lattice models that we28,29 and others30-34 have studied previously. We feel that there is much to be learned from simplified models about the nature of the interactions that lead to foldable proteins before one can successfully construct a correct potential for real proteins. The simpler model permits us to determine the full statistical mechanics of the system with high numerical accuracy in a reasonable amount of time. The present model does not include the hydrogen-bond interaction, which is thought to be important in real protein native structures. The main reason is that, for the cubic-lattice protein models studied in this work, the simple contact potential is sufficient
Figure 1. Two target native structures of the model protein on the cubic lattice.
to fold the model protein uniquely to the target native conformation. We believe that the most crucial factor in determining protein folding is the interplay between the shortand long-range interactions in a protein (here the short- and long-ranges are defined as the distance between residues along the polymer chain rather than the range of interaction forces). When the short- and long-range contact interactions are optimized, the present lattice model can fold precisely to the target native structure, as do real proteins. Therefore, the contact potential captures the essential interactions in the cubic-lattice protein model. The insights about the basic physics of protein folding gained from the simple cubic lattice chain should be compatible with those obtained from more sophisticated models. However, it is highly likely that, in other more sophisticated models,29,32-34 hydrogen-bond interactions have to be included in order to fold the proteins. Two different structures on the cubic lattice, shown as I and II, respectively, in Figure 1, were studied in this work as the target native structures of the model protein. Structure I contains 45 residues that are packed on a 3 × 3 × 5 fragment of cubic lattice in an irregular path. Structure II contains 46 residues and has a sheetlike structure with a Greek-key topology often found in real proteins.34 The native structures I and II have 52 and 45 contacts, respectively. Therefore, structure I is more densely packed than structure II. We assumed that the model proteins contain 12 different residue types, labeled a, b, ..., l. The sequences studied in this work are listed in Table 1, where the sequences with structure I as the native structure are labeled as S#/I while the sequences with structure II as the native structure are labeled as S#/II. For such labeled sequences, the energy parameters were optimized individually for each sequence. For the two sequences labeled S4/(I,II) and S′4/(II,I) in Table 1, a common set of energy parameters was optimized for both sequences with structures I and II as the native structure of the first and the second set of sequences, respectively. We study the folding of the protein models with both statistical-mechanical analyses and MC folding experiments. In MC simulations of the cubic-lattice chains, the elementary moves in the conformation of the chain include the conventional end site flips, one-site moves around corners, and two-site crankshaft moves.22,35,36 The statistical mechanics of protein folding is defined by the density of states of the system. We used two procedures for determining the density of states of the model proteins: the Monte Carlo histogram (MCH) method37,38 and the entropy sampling Monte Carlo (ESMC) method.20,39-41 The MCH procedure is relatively simpler and faster than the ESMC method. We have confirmed that, for the present protein models, both procedures produce similar
Protein Folding
J. Phys. Chem., Vol. 100, No. 34, 1996 14543
TABLE 1: Selected Sequences of the Model Proteinsa label
sequence
S0/I S1/I S2/I S3/I S4/(I,II) S1/II S2/II S3/II S′4/(II,I)
eabcgcbjcdedbjcfjkfkhacfhdfklgbligeaiehalghdi gfhbidbecadjaekfflgchbaeihcjdkcdlhgabefkljgci hgbdbdibehbecelgfkfajajeihkalcdhgciafdfklcjcg iecbiecehelakgigifhbjbhgdldacfklhfkajgdbcfdaj kekgdgdldigjcejeakabibdecilblfcahfgbhhcjffcah dgbhbdibehbegclefkfajcjeihkalcdhgciafdfklkjcga fihchcaeeajbjalffikidcdcegeglhlakkkfdjdbcgbgbh agejiakededdbdlbgjhjhaiakfbcbcgefkhkhglfifcclc jckcjggfalaibibikcfceceghdadbhblffekegkjadldhh
a These sequences contain a total of 12 types of residues represented as a, b, c, ..., l, respectively. Sequences S0/I to S4/(I,II) pertain to structure I shown in Figure 1 as the native structure, and sequences S1/II to S′4/(II,I) pertain to structure II as the native structure.
results when the model is foldable to the native state in conventional MC simulations. But for unfoldable protein models, there are severe difficulties in sampling the low-energy states of the protein with conventional MC methods. Consequently, the density of states of such protein models cannot be determined reliably by the MCH method. The problem in sampling the low-energy states of unfoldable proteins is overcome by the ESMC procedure combined with the jumpwalking technique.21,42 The essence of the jump-walking procedure is that a fraction of the conformations of the protein sampled in the MC run is saved in a conformational pool and, in the next MC run, the simulation stochastically selects new conformations from the pool to start a new trail of MC moves. This technique improves the sampling stability and reproducibility of the MC simulations.21 In this work, we normally use the simpler MCH method to determine the density of states, but for those unfoldable protein models or when a sampling problem appears, as indicated by nonreproducible histograms, we used the ESMC procedure with the jump-walking technique to obtain the density of states for these models. III. Results and Discussion 1. Optimization of the Energy Parameters. The effects of optimization of the energy parameters on protein folding for short 27-mer cubic-lattice chain models have been studied previously.10,11 In the present work, we studied the longer lattice-chain protein models described above. In this section, we use sequence S0/I of Table 1 as an example to describe how the energy parameters for the different sequences were optimized and how the folding of the model protein is affected as a result of optimizing the energy parameters. There is a total of 12 × 13/2 ) 78 pairs of residue types in the sequences. Therefore, there is a maximum of 78 contact energy parameters to be optimized for each sequence. However, because the oddnumbered sites of a cubic-lattice chain can make contact with only even-numbered sites, in certain sequences the number of effective energy parameters is less than the maximum number of pairs of residue types in the sequence. The starting energy parameters for each sequence were generated randomly from a Gaussian distribution with a mean of -1.45 and a standard deviation of 0.35. The statistical properties of the unfolded state of the protein, i.e., the average contact vector 〈δh〉 and the fluctuation matrix [B], were evaluated for a given set of energy parameters in a Metropolis MC simulation with 200 million trial moves. We confirmed that such an MC run was sufficiently long to obtain the converged average values of the interesting properties for the present protein models. The energy parameters were optimized according to eq 8 (or eq 9 if the energy parameters were optimized for several sequences). In general, for good sequences (as discussed below), four to five iterations with a properly chosen step size (i.e., s in eq 8) were sufficient
Figure 2. Variation of the energy parameters in different stages of optimization for sequence S0/I. “Start Set” is the starting energy parameters, and the “1st Iter.” to “4th Iter.” represent the sets of energy parameters after the first to fourth iteration of optimization. Values of the energy parameters in different sets have been shifted vertically arbitrarily in the figure to distinguish one from the other, and each curve shows the normalized values of the energy parameters such that the sum of the squares of the energy parameters is unity.
to obtain the optimized energy parameters with which the foldability of the protein model is maximized. Figure 2 shows how the energy parameters vary at different iterations of optimization: curve “Start Set” is the relative values of different energy parameters in the parameter set before the optimization; curves “1st Iter.” to “4th Iter.” show the relative values of the different energy parameters after the first, third, and fourth iterations of optimization. The values of the energy parameters shown in Figure 2 have been normalized such that the sum of the squares of all the energy parameters in each set is unity. The vertical positions of the different curves in Figure 2 were shifted in order to distinguish the curves from each other. The starting set reflects a random variation of energy parameters. After the first iteration, the major variations among the different energy parameters in the set were basically determined. However, in the later optimization iterations, the energy parameters were further fine-tuned. An essential physical property of optimization of the energy parameters in the above process is an increase in the consistency of interactions among different pairs of residues in the natiVe structure. Such a feature can be measured by the correlation coefficient11 between the energy parameters and the native contact array, i.e., ho defined in eq 6 when the conformation is the native structure. The correlation coefficients between the native contact array and the energy parameters from the start set to the fourth set in the above example are 0.022, 0.280, 0.335, 0.341, and 0.350, respectively. The quality of the different sets of energy parameters can be measured by the statistical-mechanical properties of the model protein with the corresponding potentials. Figure 3 shows the densities of states of the model protein with the four sets of energy parameters shown in Figure 2. It should be noted that the zero points for the different S(E) curves in Figure 3 have been set differently in order to distinguish the curves from each other. With the starting and the first sets of energy parameters, the protein model could not be folded to unique structures in simulations. The densities of states of the protein under these
14544 J. Phys. Chem., Vol. 100, No. 34, 1996
Figure 3. Variation of the density of states of the model protein as a result of optimization of the energy parameters. Sequence of the model is S0/I listed in Table 1, and the native structure is structure I shown in Figure 1. Sets of energy parameters of start, first, third, and fourth correspond to those shown in Figure 2. The small peaks at the lowest energy (on the three lower curves) correspond to the energy of the native structure.
two potentials were determined by the ESMC procedure in over 20 updatings. The histograms in each updating of the entropy curve consisted of 100 million trial moves in an ESMC simulation. With the starting set of energy parameters, the number of lowest-energy conformations was unknown; the whole entropy curve for this case was determined up to an arbitrary constant. With the first set of energy parameters, no conformation with energy lower than the target structure (which could be calculated directly) was ever found in all MC runs that were carried out. With the third and fourth sets of energy parameters, the folded lowest-energy structure found in MC simulations was always the target native structure, which could be reached repeatedly and consistently in simulations. The densities of states of the latter two foldable protein models were determined by the MCH method with 10 histograms generated by conventional MC simulations at different temperatures. In Figure 3, the origin of the entropy curve with the starting energy parameters was set arbitrarily; the origins of the other three entropy curves were set such that the entropy of the native state was unity as indicated by the small peaks in the lowest-energy region in each curve. The foldability of each protein model can be determined from the density of states of the model. As shown in Figure 3, with the starting set of energy parameters, there is an abrupt drop in entropy of the protein near the ground state. This corresponds to a glassy ground state for which unique folding is impossible. For the third and fourth sets of energy parameters, by contrast, the native state is well separated from other non-native states in the energy spectrum, and if the densities of states in the lowerenergy region are smoothed, they form strongly concave curves. These features indicate a strong first-order folding transition with a unique and well-determined folded structure. The slopes of the thin lines on these two entropy curves in Figure 3 are equal to the reciprocal folding temperatures of the corresponding systems. An intriguing case arises with the energy parameters from the first optimization iteration (the first set). As shown in Figure 3, although the native structure appears to be the lowest-energy state, there is an abrupt drop in the entropy of the model at an energy level above the lowest-energy. For this
Hao and Scheraga
Figure 4. Energy trajectories of the model protein S0/I as a function of MC steps in computer simulations. Enative is the energy level of the native structure. Top curve was obtained from an MC simulation with the first set of energy parameters shown in Figure 2 at a temperature of 1.44 (in reduced units), which is slightly below the collapse transition temperature for this set of energy parameters. Bottom curve was obtained from an MC simulation with the third set of energy parameters of Figure 2 at a temperature of 1.64.
case, when the protein folds from a higher-energy unfolded state, the protein is always trapped in the glassy state, and the native state could not be approached from the higher energy levels. To illustrate the correspondence between the kinetics of the folding process and the statistical mechanics of the protein models described above, we show in Figure 4 the energy trajectories of two of the protein models in MC simulations at a temperature near their folding (or collapse) transition temperatures. The top curve of Figure 4 shows the energy trajectory of the protein model S0/I with the first set of energy parameters in a Metropolis MC simulation at a temperature of 1.44 (in reduced units). With this set of energy parameters, the protein model is unfoldable; the simulation temperature of 1.44 is slightly below the collapse temperature of this model. Additional MC runs at temperatures 0.2 unit above or below the above temperature resulted in similar energy trajectories. It can be seen from this trajectory that the energy of the protein model fluctuates, but the energy never reached the ground state over the entire MC run. In comparison, the bottom curve of Figure 4 shows the energy trajectory of the same model protein with the third set of energy parameters in an MC simulation at a temperature of 1.66 (which is close to the folding temperature), starting from an arbitrary conformation. With this set of energy parameters, the protein model becomes foldable. As can be observed from Figure 4, for the foldable protein model at the transition temperature, the protein can reach the exact native state in an MC simulation repeatly and reversibly. The length of time that the protein stays in the native state depends on the temperature. It is important to note that although the foldability of the protein models was defined above in a statistical manner, such defined foldability contains a deterministic implication: for unfoldable models, the protein is impossible to fold to the native state in practical terms; for foldable models, on the other hand, the probability that the protein folds to the native state in a simulation is unity within a reasonably short time. We have investigated all sequences listed in Table 1 with either structure I or II as the native structure. In particular, for
Protein Folding
Figure 5. Logarithm of the density of states (the thick curves) of four different sequences of the model protein I determined with their respective optimized energy parameters. Sequences are shown in Table 1. Origin of each curve is the segment of horizontal line at the lowestenergy part of curve. Thin line on each curve is the tangent to the entropy function of the unfolded state of the protein model.
sequences S4/(I,II) and S′4/(II,I), a common set of optimized energy parameters has been obtained. The densities of states of these two sequences under the optimized energy parameters will be discussed in the next section. We were also able to obtain a common set of optimized energy parameters for 5 different sequences of 27-mer cubic-lattice protein models such that each of the models can fold deterministically to its respective native structure in relatively short MC simulations (unpublished results). The general features of the optimization process of the energy parameters for the different sequences are similar to the above example. The main information gained from these optimizations is that with a random set of energy parameters, the energy parameters can be optimized for many different sequences such that the sequences can be folded deterministically to the native structure with the optimized energy parameters. 2. Sequence Effects on Optimized Energy Parameters. In earlier studies21,36,,43 it has been found that the foldability of different sequences to the native structure varies significantly. Some sequences can easily be folded to the native structure in a simulation, while others cannot be folded to a unique structure in simulations. These earlier studies, however, were based on subjectively (or somehow arbitrarily) chosen energy parameters. In this work, with the optimization algorithm, we studied the sequence effects on protein folding from a different point of view, that is, whether a set of good energy parameters does exist for any given sequence with which the sequence is definitely foldable. In this section, we use the different sequences listed in Table 1 as examples to illustrate the sequence effects on protein folding with optimized energy parameters. Figure 5 shows the density of states for the four sequences of S1/I to S3/I and S4/(I,II), with structure I as the native structure, with the optimized energy parameters. The slopes of the thin lines tangent to the entropy curves in Figure 5 correspond to the reciprocal folding temperatures of the corresponding sequences, and the small peaks in the lowest-energy region of the densities of states indicate the native states. As indicated above, for sequences S1/I to S3/I, the energy parameters were optimized individually for each sequence, but for S4/(I,II), a common set of energy parameters for both S4/
J. Phys. Chem., Vol. 100, No. 34, 1996 14545 (I,II) and S′4/(II,I) were optimized. By examining the densities of states of the four sequences and comparing their folding behavior in MC folding simulations at their respective folding temperatures, we found that the foldability of sequence S1/I is very weaksas a matter of fact, in MC folding experiments at or below the folding temperature, there was a less than 10% chance that the native structure could be reached in an MC run of 200 million trial moves. Sequence S2/I is foldable, based on both thermodynamic and kinetic criteria, but its foldability is weaker than that of the sequences S3/I and S4/(I,II). As can be seen in Figure 5, the density of states of sequence S2/I in the low-energy region is less concave than those of sequences S3/I and S4/(I,II). The latter two sequences are strongly foldable protein models. In general, a stronger foldability of a protein model is associated with the following characteristics: (1) the folding temperature is higher, (2) the energy difference between the native state and the dominant unfolded state is larger, (3) the entropy curve is more strongly concave in the region from the dominant unfolded state to the native state,44 (4) the protein can fold to the native structure in a wider range of temperature, and (5) the folding of the protein to the native structure is faster. The first three properties can be observed directly from the density of states. The latter two properties are kinetic measures that can be obtained in MC folding simulations. In our simulation studies, we found that the kinetic characteristics of the folding of a sequence are closely correlated with the thermodynamic characteristics of the sequence. Recently, Abkevich et al. have carried out a thorough study of the relationship between the thermodynamic foldability and the folding kinetics of lattice-chain models of proteins.45,46 Our results are consistent with their finding that thermodynamically favored sequences in folding also correspond to those sequences with fast folding kinetics. A particular interesting observation of Abkevich et al. is that the shortest mean first passage time (MFPT) of both weakly and strongly foldable sequences of identical length are quite similar. The difference is that for weakly foldable sequences, the temperature at which the MFPT is the shortest is higher than the temperature at which the native state becomes dominant, but for strongly foldable sequences, the temperature at which the MFPT is shortest is very close to that at which the native state of the protein model is sufficiently stable.46 By examining the folding rate of the different sequences studied in this paper at different temperatures, we also observed a similar phenomenon. Figure 6 shows the density of states for the four sequences of S1/II to S3/II and S′4/(II,I), respectively, with structure II as the native structure with the energy parameters optimized separately for each sequence [the special label for S′4/(II,I) is to emphasize that a common set of energy parameters was optimized for both sequences of S4/(I,II) and S′4/(II,I) where the first sequence is for structure I and the second sequence is for structure II]. The thin lines to each entropy curve in Figure 6 have a physical significance similar to those in Figure 5. On the basis of the density of states shown in Figure 6, it can be concluded that sequence S1/II is unfoldable and sequence S2/ II is foldable but its foldability is weaker than that of sequences S3/II and S′4/(II,I). All these conclusions were verified by MC folding simulations. Comparing the densities of states of the sequences shown in Figures 5 and 6, we also observe that with structure I as the native structure, the concave extent of the entropy curves of the foldable sequences in the low-energy region is generally stronger than that of the foldable sequences with structure II as the native structure. It follows that there is a higher free-energy barrier between the native and non-native states of structure I
14546 J. Phys. Chem., Vol. 100, No. 34, 1996
Figure 6. Same as Figure 5 but for another four sequences shown in Table 1 with structure II as the native structure.
Hao and Scheraga optimized energy parameters can be obtained for a given target native conformation. In general, there is a relationship between the sequence(s) and the native structure(s) that determines whether a set of optimized energy parameters can be obtained to fold protein models definitely to their native structures. The following section is concerned with such a general relationship. 3. Classification of the Foldability of Different Sequences. Here, we rationalize the foldability of different sequences based on consistency of the short- and long-range interactions in the native structures of the protein models. We emphasize again that the terms short- or long-range in the present context pertain to the nearest-neighbor interactions in the lattice between residues that are close or far away along the polymer chain. We make use of the total contact function, eq 6, between two residue types in the native structure introduced in section II to classify the sequences. As described earlier, the contact function between a pair of residues i and j in the native conformation is defined as hijo ) 1 if i and j are nearest neighbors in the native structure that are not connected by the bonds of the chain, and 0 otherwise. The total native contact function between two residue types µ and ν is obtained by summing up the contact functions of all residues of type µ with all residues of type ν in the native structure, N
o hµν ) ∑hijo δi,µδj,ν
(10)
i,j
where N is the total number of residues in the sequence. We assume that the N residues are classified into n residue types and there are M contacts in the native structure of the protein. The N residues of the protein have a total of P ) N(N - 1)/2 possible pairings. Among them, there are M native contacts and P - M - N + 1 possible non-native contacts. The total number of pairings of residue types (which equals the number of energy parameters) in the sequence is m ) n(n + 1)/2. Given the sequence and the native structure, the total native contact functions between different pairs of residue types can be arranged in an array: o o , ..., hµν , ..., honn} ho ≡ {hο11, ..., ho1ν, ..., hµ1
(11)
and based on eq 10, we have n
Figure 7. Free-energy of sequences S4/(I,II) and S′4/(II,I) at their respective transition temperatures. Only the relative value within each curve is physically significant.
than between the native and non-native states of structure II for a general sequence. To illustrate this point more clearly, in Figure 7 we show the free-energy functions of sequences S4/ (I,II) and S′4/(II,I) at the transition temperatures of these two sequences, respectively. The free-energy curves were calculated from the entropy curves of sequences S4/(I,II) and S′4/(II,I) shown in Figures 5 and 6, respectively. The letters N and U in Figure 7 indicate the native and the dominant non-native states, respectively. It can be seen from Figure 7 that the free-energy barrier between the native and non-native states of sequence S4/(I,II) is much higher than that of sequence S′4/(II,I). The possible explanations of this result are that structure I is more compact and unusually packed than structure II and that there are fewer folding paths to the native state of structure I than to the native state of structure II. The above results show that the sequence of a model protein plays an important role in determining the foldability of the sequence with the optimized energy parameters. In other words, the sequence of a model protein determines how well the
o hµν )M ∑ µ,ν
(12)
Permuting the order of the residues in a sequence in all possible ways while fixing the composition of the residues in the sequence, one can obtain an ensemble containing a very large number of different sequences. This is the “sequence soup” as discussed by Chan and Dill.43 Each sequence will correspond to a specific total native contact array as defined by eq 11, but the total native contact array of some different sequences may be identical. A classification of the sequences in the ensemble can be made based on their total native contact array, which reflects how the native contacts distribute among different pairs of residue types. To illustrate the relevance of such a classification to the optimization of the energy parameters for different sequences, let us consider two limiting classes of sequences. In the first class, the M native contacts in the sequence are evenly distributed among the m pairs of residue types. Then each term of the array of eq 11 will have an o ) M/m. In the second class, the native identical value, i.e., hµν contacts are concentrated into a minimum number of pairs of residue types. Then, in the array of eq 11, most terms are zero, o have quite large values. For the first class but a few terms hµν
Protein Folding
J. Phys. Chem., Vol. 100, No. 34, 1996 14547
of sequences, it will be very difficult, if not impossible, to find a set of optimized energy parameters with which the sequence is foldable because it is impossible to assign a set of energy parameters that stabilizes the native contacts while not stablizing the non-native contacts. For the second class of sequences, on the other hand, it is very easy to select a set of energy parameters that stabilize only the native contacts, and as a result, the sequence becomes strongly foldable to the native structure. In general, realistic sequences of proteins lie between these two limiting classes of sequences. The quality of the optimized energy parameters for a given sequence can be correlated to the total native contact array of the sequence. The quantity relating the optimization of the energy parameters to a sequence can be defined by the variances of the total native contact array of the sequence: n
n
µ,ν
µ,ν
o 2 o 2 ∆2h ) (1/M)∑(hµν ) - [(1/M)∑hµν ]
(13)
The variance ∆h is simply a measure of the unevenness of the distribution of the native contacts among m pairs of residue types in a protein: the larger the ∆h function, the more nonuniformly the native contacts are distributed among the pairs of residue types, i.e., the native contacts are concentrated in fewer pairs of residue types. For the model protein studied in this work and for many real proteins where M < m, the minimum of ∆h 0 defined by eq 13 is zero (this occurs when hµν ) 1 for all µ and ν). Because of eq 12, the second term on the right-handed side of eq 13 is always unity. For multiple sequence-structure systems, the corresponding ∆h function can be defined as nr
2 - 1] ∆2h ) ∑[(1/Mr)∑hµν,r r
(13a)
µ,ν
where r indicates the rth protein. For a given composition of residues in a chain of given length, the number of possible sequences is on the order of N!/ ∏ni ki!, where ki is the number of the ith type residue. There is a distribution of sequences, i.e., the variation of the number of sequences, as a function of the ∆h value in the sequence ensemble. The probability of occurrence of the sequences as a function of ∆h depends strongly on the ∆h value. In Figure 8, we show the relative probabilities of sequences as a function of ∆h in the ensembles of sequences with the compositions shown in Table 1 and with structure I and structure II as the native structures, respectively. Because the number of sequences is too large to be enumerated exactly, the probability of occurrence of sequences for a given ∆h value can be determined only up to an arbitrary constant. The top curve in Figure 8 is the joint relative probability of two sequences, one with structure I and the other with structure II as their native structures, as a function of ∆h. These distributions of probability were calculated by the MC histogram method in the sequence space with ∆h as the cost function. As expected, the curve of joint probability of two sequences is approximately equal to the product of the probability curves of the two single sequences (see Figure 8). It is found that the probability of sequences with the most probable value of ∆h is several orders of magnitude higher than the probability of sequences with the largest ∆h values. We examined the foldability of sequences in different regions of the ∆h function with a number of selected sequences. The selected sequences are indicated in Figure 8 by the closed or opened circles. The probability curves on which the circles are located signify the ensemble of sequences to which the individual sequences belong, and the abscissa of the circles
Figure 8. Logarithm of the relative probability of sequences as a function of ∆h defined by eq 13 in the text. Curves labeled I and II pertain to the ensemble of sequences with structures I and II as the native structures, respectively. Curve labeled I + II is the joint probability of the ensembles of I and II of sequences as a function of their total ∆h. Open circles pertain to sequences, with the indicated ∆h values, that are unfoldable with optimized energy parameters. Closed circles pertain to sequences, with the indicated ∆h values, that are foldable after their energy parameters were optimized.
indicates the ∆h value of the individual sequences. The sequences indicated by open circles in Figure 8 are unfoldable or very weakly foldable with optimized potentials, while the sequences indicated by closed circles in Figure 8 are foldable with optimized energy parameters. The foldability of these sequences was determined from both the density of states of these sequences and by their folding behavior in MC simulations. From Figure 8, it can be seen that foldable sequences all have relatively large ∆h values, while unfoldable sequences all have relatively small ∆h values in their respective sequence ensembles. Roughly speaking, the larger the ∆h of a sequence, the stronger is the foldability of the sequence. Sequences with the most probable ∆h values (i.e., the ∆h value with the maximum probability) are generally unfoldable even when their energy parameters have been optimized. The above results show that a large ∆h value for a sequence is a condition that a set of good energy parameters can be found for the sequence. This result can be rationalized by the following simple considerations. First, the large ∆h value of a sequence reflects the fact that the native contacts in the sequence are concentrated in a small number of pairs of residue types, i.e., the native contacts in the protein are dominated by only a few types of residues. Second, a large ∆h value occurs most likely when the native contacts formed by residues at short distances on the chain and by residues at long distances on the chain occur in the same residue-type pair. In other words, the short- and long-range contacts are consistent in terms of residuetype pairs. The number of native contacts in a structure is limited. In order for the number of native contacts between two types of residues to be close to their maximum possible pairing number, which is the basis of large ∆h values, there must be as many as possible short- and long-range native contacts formed between the two given types of residues. For sequences with the above two characteristics, it is relatively easy to assign attractive parameters to the group of pairs of residue types that form contacts in the native structure while assigning repulsive energies to the other group of pairs of residue types
14548 J. Phys. Chem., Vol. 100, No. 34, 1996 that do not make contact in the native structure. With such a set of energy parameters, the protein model is generally foldable. Therefore, foldable sequences are characterized by consistent interactions in the native structure between pairs of different types of residues and between short- and long-range interactions.23,47,48 It should be noted that, according to our results, a protein is foldable when the short- and long-range interactions in the native structure of the protein are mutually consistent to a sufficiently large extent, but not necessarily 100%. The percentage of consistency of the short- and long-range interactions in a protein for foldability depends both on the sequence and on the native structure of the protein. It would be interesting to examine whether real protein sequences exhibit the same relationship between their ∆h values and their foldability as in the above lattice-chain model of proteins. However, in real proteins, contacts actually occur between individual atoms instead of residues. In the calculation of the contact patterns of real proteins based on the current approach of representing each residue by a single point (i.e., the united-residue model), some contacts in real proteins are omitted while other extra contacts are introduced in an artificial and random manner. As a result, the calculated ∆h value would deviate from the underlying pairwise contacts among the residues that stabilize the native structure of real proteins. The corrections to those random contacts resulting from the unitedresidue models of real proteins are a complicated problem itself and therefore was not treated in the present work. IV. Conclusions In this work, we developed a new procedure for optimizing the energy parameters for protein folding. The lattice-chain protein models were studied by both statistical-mechanical analyses and conventional MC simulations. By optimizing the energy parameters, we investigated the nature of the correct potentials that fold proteins. We showed that the foldability of a protein model can be characterized quantitatively by the density of states of the model and that the thermodynamic characteristics of foldable model proteins are closely correlated with their kinetic folding behavior. We demonstrated that for many sequences, optimization of the potential leads to strongly foldable protein models that can be folded to the target native structure uniquely and deterministically, but the sequence of a model protein is an important factor in determining the goodness of the optimized energy parameters. We further obtained a classification for the sequences about their foldability with optimized energy parameters. Even though the present study was based on the simple cubiclattice chain protein model, the consistency of the various components of the interactions in the protein exhibited in the present system should be a general condition for a good folding potential for either simpler or more sophisticated models of real proteins. A novel result of this work is the demonstration that MC simulations with arbitrary energy parameters can provide useful information, on the basis of which the energy parameters can be optimized. On the basis of this approach, one can relax the stringent requirement for a very high quality of energy parameters in order to fold proteins in theoretical simulations because one can optimize the energy parameters from a poor starting set of parameters. Acknowledgment. This work was supported by grants from the National Institutes of Health (GM-14312) and the National Science Foundation (MCB95-13167). Support was also received from the Association for International Cancer Research and the Cornell Biotechnology Center. Part of the simulations in this work were carried out on the IBM SP2 parallel computer
Hao and Scheraga at the Cornell National Supercomputer Facility, which is a resource of the Cornell Theory Center, which is funded in part by the National Science Foundation, New York State, the IBM Cooperation, and the members of its Corporate Research Institute, with additional funding from the National Institutes of Health. References and Notes (1) Shakhnovich, E. I.; Gutin, A. M. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 7195. (2) Shakhnovich, E. I.; Gutin, A. M. Protein Eng. 1993, 6, 793. (3) Shakhnovich, E. I. Phys. ReV. Lett. 1994, 72, 3907. (4) Yue, K; Dill, K. A. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 146. (5) Deutsch, J. M.; Kurosky, T. Phys. ReV. Lett. 1996, 76, 323. (6) Goldstein, R. A.; Luthey-Schulten, Z. A.; Wolynes, P. G. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 4918. (7) Goldstein, R. A.; Luthey-Schulten, Z. A.; Wolynes, P. G. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 9029. (8) Maiorov, V. N.; Crippen, G. M. J. Mol. Biol. 1992, 227, 876. (9) Wang, Y.; Zhang, H.; Li, W.; Scott, R. A. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 709. (10) Shrivastava, I.; Vishveshwara, S.; Cieplak, M.; Martin, A.; Banavar, J. R. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 9206. (11) Hao, M.-H.; Scheraga, H. A. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 4984. (12) Tanaka, S.; Scheraga, H. A. Macromolecules 1976, 9, 945. (13) Miyazawa, S.; Jernigan, R. L. Macromolecules 1985, 18, 534. (14) Sippl, M. J. J. Mol. Biol. 1990, 213, 859. (15) Bowie, J. U.; Lu¨thy, R.; Eisenberg, D. Science 1991, 253, 164. (16) Ouzounis, C.; Sander, C.; Scharf, M.; Schneider, R. J. Mol. Biol. 1993, 232, 805. (17) Kolinski, A.; Godzik, A.; Skolnick, J. J. Chem. Phys. 1993, 98, 7420. (18) Sasai, M. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 8438. (19) Kocher, J.-P. A.; Rooman, M. J.; Wodak, S. J. J. Mol. Biol. 1994, 235, 1598. (20) Hao, M.-H.; Scheraga, H. A. J. Phys. Chem. 1994, 98, 4940. (21) Hao, M.-H.; Scheraga, H. A. J. Phys. Chem. 1994, 98, 9882. (22) Socci, N. D.; Onuchic, J. N. J. Chem. Phys. 1994, 101, 1519. (23) Bryngelson, J. D.; Wolynes, P. G. Proc. Natl. Acad. Sci. U.S.A. 1987, 84, 7524. (24) Shakhnovich, E. I.; Gutin, A. M. Biophys. Chem. 1989, 34, 187. (25) Frauenfelder, H.; Wolynes, P. G. Phys. Today 1994, 47, 58. (26) Amadei, A.; Apol, M. E. F.; Nola, A. D.; Berendsen, H. J. C. J. Chem. Phys. 1996, 104, 1560. (27) Wolynes, P. G. In Protein Folds; Bohr, H. Brunak, S. Eds.; CRC Press, Inc.: New York, 1996; p 3, and references therein. (28) Hao, M.-H.; Scheraga, H. A. J. Chem. Phys. 1995, 102, 1334. (29) Hao, M.-H.; Scheraga, H. A. Supercomputing’95, Proceedings of the 1995 ACM/IEE Supercomputing Conference, December 3-8, 1995, San Diego, CA; ACM: New York, 1995; available on CD ROM. (30) Guo, Z.; Thirumalai, D.; Honeycutt, J. D. J. Chem. Phys. 1992, 97, 525. (31) Camacho, C. J.; Thirumalai, D. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 6369. (32) Skolnick, J.; Kolinski, A. J. Mol. Biol. 1992, 221, 499. (33) Kolinski, A.; Skolnick, J. Proteins 1994, 18, 338. (34) Kolinski, A.; Galazka, W.; Skolnick, J. J. Chem. Phys. 1995, 103, 10286. (35) Skolnick, J.; Kolinski, A. Annu. ReV. Phys. Chem. 1989, 40, 207. (36) Sali, A.; Shakhnovich, E.; Karplus, M. J. Mol. Biol. 1994, 235, 1614. (37) Ferrenberg, A. M. Swendsen, R. H. Phys. ReV. Lett. 1988, 61, 2635. (38) Ferrenberg, A. M. Swendsen, R. H. Phys. ReV. Lett. 1989, 63, 1195. (39) Berg, B. A.; Celik, T. Phys. ReV. Lett. 1992, 69, 2292. (40) Lee, J. Phys. ReV. Lett. 1993, 71, 211.; 1993, 71, 2353 (erratum). (41) Hansmann, U. H. E.; Okamoto, Y. J. Comput. Chem., 1993, 14, 1333. (42) Tasi, C. J.; Jordan, K. D. J. Chem. Phys. 1993, 99, 6957. (43) Chan, H. S.; Dill, K. A. J. Chem. Phys. 1989, 90, 492. (44) The concave extent of the entropy curve is the primary statisticalmechanical criterion for the cooperativity of the folding transition. The sigmoidal shape of the thermal folding/unfolding curve, on the other hand, is only a secondary measure that does not provide a clear indication as to whether there is a free-energy barrier in the collapse transition of heteropolymers. (45) Abkevich, V. I.; Gutin, A. M.; Shakhnovich, E. I. J. Chem. Phys. 1994, 101, 6052. (46) Abkevich, V. I.; Gutin, A. M.; Shakhnovich, E. I. J. Mol. Biol. 1995, 252, 460. (47) Wako, H.; Scheraga, H. A. Macromolecules 1981, 14, 961. (48) Goj., N. Annu. ReV. Biophys. Bioeng. 1983, 12, 183.
JP960856J