J. Phys. Chem. 1992, 96, 10102-10104
10102
Teaching Polymers To Fold Richard S . Judsont Centerfor Computational Engineering, Sandia National Laboratories. Livermore, California 94551 -0969 (Received: September 28, 1992)
A new method is preaentcd for predicting folding pathways of polymers. The folding pathway is described as a generic program or sequence of logical steps of such a form that a computer can carry them out to produce a folded structure. A genetic algorithm (GA)is used to learn specific sequences or folding pathways that carry a denatured conformation into a target final conformation. The method is demonstrated on a model 2-dimensional polymer for which the global energy minimum is known. The GA learns a program that will fold a denatured polymer into its global energy minimun conformation.
Genetic algorithms are optimization methods based on Darwinian evolution. Populations of individuals interact with one another through selection and mating operations to produce individuals that have increasingly higher “fitness”. In our case the f i t ” of a program will be the negative of the conformational energy at the end of the folding sequence. GAS have been used for a wide variety of global optimization problems ranging from jet engine designI7 to pulse optimization’*Jgto horse race handicapping.m Several groups have recently used them to find lowenergy structures of small-to-medium sized molecules using variants on standard conformation searching This letter demonstrates two important new ideas. The first is that a folding pathway can be specified by a generic program or set of des.Particular programs can be tested against known experimental data, most importantly whether they lead to the known final folded structure. The programs used here are not the most general possible, but they demonstrate mwt of the properties that a folding program should have: they are logically simple; they lead to a well-defined final state; they allow both the making and breaking of residueresidue van der Waals type interactions. The second important idea is that a GA can be used to let a computer sort through a large number of possible programsz6to find ones that meet a specified end state. Although I do not strese this here, ultimately it may be advisable to invoke a maximum parsimony principle to select one of several prograas that fold to the same known final state. A parsimony principle would state that the simplest pathway that leads to the end result is the most likely to occur in nature.
Iatroductim
One of the most striking properties of proteins is their ability to fold reproducibly into complex 3-dimensional shapes. This is especially surprising in the face of the well-known facts that (1) folding is driven by a free energy minimization process for many proteins and (2) the dimensionalityof conformation space for a medium size peptide is so large that random search for the global minimum energy conformation would take an astronomically long time. Although some proteins only fold with the aid of cellular machinery’ so that minimizingthe internal energy of the isolated protein is not the sole driving force in the folding process, there are proteins that fold spontaneously in an energy minimizing p r ~ c e s s .Clearly ~ ~ ~ evolution has played a large role in selecting peptide sequences that reproducibly fold into well-defined structures. These proteins must have the property that at any stage in the folding process there are only a limited number of directions in conformation space that are energetically downhill-the potential surface must have a small number of troughs that lead inexorably down to the native conformation which may or may not be the global minimum. The native conformation only has to possess the property that, like Rome, all roads lead to it. Much experimental and computational evidence shows that subdomains of the protein spontaneously fold into compact structures such as helices, sheets and turns, presumably through purely local energy minimization. These s u b domains then fold together to yield the native conformation. At each stage, welldefined structures emerge through minimization in a conformationalspace of much lower dimension than that of the total protein. comput8tionalMtthah Clearly, a logical way to predict structures of proteins or other Our model system is a 2-D linear (Le. unbranched) polymer structurally homogeneous polymers is to first predict which consisting of a number of atoms connected together by harmonic substructures will form spontaneously and then to understand how bonds, as is shown in Figure 1.” The energy is given by they fold into the final overall structure. A number of methods in this category have been developad and give useful and intereating predictions. Among these are molecular dynamics methods, typically using simplifted models of proteins that employ potentials where the first sum is over all bonds and the second is over all of mean force$J and homology-based methods that predict which pairs of atoms that are not bonded together. There is no bond subsequences will form helices, turns, etc., and then attempt to angle term. The equilibrium bond distance is chosen to be ro = see how those structuns will fdd into a f d compact ~ t r u c t u r e . ~ 1.O and the force constant is Kb= 100.0. Note that in ref 21 the This letter presents the first pieces of a new approach to disbonds were treated as rigid. The Lennard-Jones term has a covering folding pathways for polymers that should eventually minimum of depth 1.0 at a distance rr, = 1.0 so that the global be applicable to the specific caw of proteins. The principal idea minimum of the molecule is a hexagonal close-packed structure used is that a folding pathway can be thought of as a program with each atom having a maximum number of other atoms unit or wries of logical steps of the form: reaidue R1associates with distance away from it. Such a structure is shown in Figure 1 for residue R2(e.g., through hydrogen bonding), residue R3 associates the case of 19 atoms. with residue R,, etc. The computational task reduces to disThe folding program is given by a series of pairs of atoms that covering or learning a program that takes an unfolded polymer will be sequentially brought together. Each fold is carried out into its folded state without having to climb any unphysically high by adding a harmonic bond with spring constant of Kb/2 and energy barriers. I will describe a generic program that folds a equilibrium distance of 1.O between the two atom to be brought 2-dimensional model polymer and show how to use a genetic together and then performing a gradient minimization. A conto learn programs that will fold the polymer into jugate gradient method is used?7 Once the energy is minimized, its known global energy minimum conformation. the temporary bond is deleted and the atoms are moved slightly away from their equilibrium positions, 0.25 distance units, in a Email:
[email protected]. random direction. Once all of the folds have been carried out, 0022-3654/92/2096- 10102$03.00/0
@ 1992 American Chemical Society
Letters
The Journal of Physical Chemistry, Vol. 96, No. 25, 1992 10103
1-
Pop. 1
40-
20-
Px
w
Figure 1. Example of the global energy minimum conformation of the
-2v
19-atom polymer. Note that all the atoms lie on a hexagonal array.
a final gradient minimization is performed with no extra constraints. The initial 'denatured" conformation in all cases is an almost linear zig-zag with angles of t 5 O . The actual folding program is of fved length, typically 20 folds, but also contains a regulator gene that specifies how many of the total folds to actually carry out, up to the maximum. The folds are specified in such a way that an atom will never try to fold onto itself or onto one of its neighbors. A GA1*I6 is used to find programs that will take the initial denatured polymer into one of its global energy minimum conformations. A modified version of Mtihlenbein's breeding GA is used here.l4.Is As in all GAs we use populations comprised of a number of individuals. Each individual is specified by a chromosome which is just the string of integers that gives the folding ptogram; from the chromosome, we can calculate a fitness which is the negative of the final folded conformational energy. (Minimizing en,ergy maxi"fitness.) The initial chromosomes for the populabon are chosen at random and the fitness for each individual is calculated. The individuals in this first generation then praluce offspring who will be parents for the next generation. Parents produce children under the action of selection, recombination, and mutation operators. The selection operator selects the best fraction of the population for breeding; here the top 50% are chosen. In recombination, two parents' chromosomesare cut at a random locus, and the right and left halves of the two chromosomes am interchanged and given to the two children. Both children are placed into the population. Pairs of parents in the selected group are chosen, and children are formed until the new population is full. (A fmed-size population is used.) The elitist strategy is also used which means that one copy of the current best individual is always passed directly from one generation to the next. After recombination each of the new chromosomes is passed to the mutation operator which, with a probability given by the mutation rate, randomly changes value of the loci in the chromosome by *l. The rate used here is 2/N% where Nlociis the number of loci on the chromosome. Therefore, on average two sites on each chromosome are mutated each generation. MIihlenbeiin advocates using half this rate, but our regulator gene allows only on the order of half the loci to be expressed at any one time. The higher rate was chosen to allow on the order of one mutation to occur in the expressed portion of each chromosome each generation. Finally the fitness value of each individual is calculated and the cycle begins again. Several populations are evolved independently, exchanging their best individuals once every 10 generations. Numerid Rcsplts The test problem consisted of folding a 19-atom polymer. We use two populations of 20 individuals each. The maxi"number of folds is 20, so the length of the chromosome is 41; one integer to specify how many folds and then 20 pairs of atom indices. Figures 2 and 3 show the evolution of the average energy of the populations and the energy of the best individual found so far. From Figure 2 one 8cf8 that the average energy varies widely; this arises because recombination and mutation can occasionally produce folding pathways that lead to high energy knotted structures. See ref 21 for a discussion of the potential energy surface of this model polymer. One or two such structures of high
V
-v.
0
10
20
30
40
Generation
F'igure 2. Evolution of the population average and the best individual as a function of number of generations. The top, noisy curves are the population averages for the two populations. The lower, smooth curves show the energies of the best individuals found.
P
2
w
0
10
20
30
40
Generations
F'igure 3. Same as Figure 2 except only the evolution of the best energy found is shown, on an expanded scale. The plateau effect is typical of
GA optimization performance. energy are sufficient to make the population average very high. Figure 3 shows the how the best energy changes in the two populations. The sudden drops separated by plateaus lasting several generations is typical of GA optimizations. The final plateau, reached at generation 31 is at energy = -27.34, the global energy minimum. Figure 4 shows one folding pathway that leads to a global minimum energy conformation. Note that the final conformation here ('6") is not the same as shown in Figure 1. However, these only differ by the arrangement of the bonds and are energetically degenerate. At the top is the initial 'denatured" state (conformation 0) which folds through succagsive conformations 1-6. In each conformation, atom 0 is at the left or upper-left end of the chain. The pathway used five folds with atom pairs (6,14), (6,14), (5,16), (3,9), and (1,12). The X's show the pair of atoms connected by a temporary bond to fold one conformation to the next. Recall from the previous section that after a fold is carried out (Le., the conformation is minimized in the presence of the temporary bond) all of the atoms are shifted away from their minimized positions slightly (0.25 distance units in random dirsctions). This enables two successive folds with the same pair of atoms connected together, as we have here with (6,14) as both fold 1 and 2, to not lead to the same conformation. This is because the starting conformation for the second minimization may not be in the basin of attraction of the first local minima. The energits shown are for the conformation found on minimizing with the temporary bond, but without the energy contribution from that
10104 The Journal of Physical Chemistry, Vol. 96, No. 25, 1992
1 -20.08
2
-21.44
3 -3.57
-27.34
Figure 4. Folding pathway used to reach a global minimum energy conformation. b i d e each conformation is a sequence number (0-6) and the corresponding conformational energy (global minimum = -27.34). The atoms that are connected by a temporary bond to go from one conformation to the next are marked by X’s. In each of the conformations, atom number 0 is at the left or upper left end of the chain. To go from conformation 5 to conformation 6 a final gradient minimization was performed with no temporary bonds.
bond. conformation 6 is the mult of a final gradient “ h a t i o n of conformation 5 with no temporary bonds added. There are several important points to make from the folding pathway. First, in no case do the pair of atoms tied together by the temporary bond end up coming right next to one another. The temporary bonds tend to have a more global effect on the structure which is dependent on the starting conformation. Second, the GA does not necessarily learn when to quit. After fold 4, the global minimum had been found, but an additional fold produced a strained conformation slightly higher in energy. This only dropped back into the global minimum after the final gradient “ h a t i o n was carried out. F d y , the GA will allow construction of folding pathways that pass through high energy intermediates such as conformation 3. The only criteria used here for judging the relative merit of one pathway over another is the energy at the end of the pathway.
Discussion The example given in the previous section has shown that a simple set of folding rules can be used by a genetic algorithm to find the global energy minimum of a polymer. Before this approach can be applied to more realistic molecules, several extensions must be added. First, a more generic folding program must be devised. In particular, several of the simple pairwise folds used here must be allowed to occur simultaneously. For instance in a protein, the formation of a helix may occur all at once from the simple desire of the hydrophobic side chains to get out of the water environment, rather than occurring sequentially in a zipperlike fashion. Conditional folding rules need also to be devised. A conditional rule is one that is only camed out if it is appropriate given the current conformation of the molecule. For instance, in the example used here atom i would only be brought up to atom j if the temporary bond did not have to cross too many permanent
Letters bonds. Carrying out such a fold could only lead to crushing the molecule. Second, other criteria must be developed to discriminate between good and bad folding pathways. In particular, a pathway that requires that a high-energy barrier be climbed should be discarded even if the final folded conformation is low in energy. Finally, I will outline one variant of this approach that could shed some light on the protein folding problem. Test folding pathways would be generated and evaluated by seeing how well they folded a particular protein from a denatured state towards its known native conformation. For efficiency’s sake, it would probably be necessary initially to use a reduced representation of the protein as in many of the lattice models. The folding pathways would be evolved until one was found that could generate the native conformation without having to climb unphysicaliy high energy barriers. If this scheme was successfully carried out for a number of proteins then we should be able to see if any common themes show up in folding pathways from one protein to another. If such themes do arise, they could substantially aid in the pres diction of conformations of novel proteins. Acknowledgment. The author acknowledges partial support by the Department of Energy under Contract DE-ACO476DP00789 and by the Biomedical Division of Lawrence Livermore National Laboratory and helpful discussions with Dr. T. Ngo, Dr. H. MBhlenbein, Dr. M. Colvin, and Dr. L. Banzhaf.
References and Notes (1) Gierasch, L. In Protein Folding, Gierasch, L., King, J., Eds.; AAAS: Washington, DC, 1990; p 211. (2) Gh6lis, C.; Yon, J. Protein Folding, Academic Press: New York, 1982. (3) Goldberg, M. In ref 1, p 143. (4) Skolnick, J.; Kolinski, A. Science 1990, 250, 1121. ( 5 ) Hinds, D. A.; Levitt, M. Proc. Natl. Acad. Sci. U S A . 1992,89,2536. (6) Chou, P. Y.; Fasman, G. D. Biochemistry 1974, 13,222. (7) Jones, D. T.; Taylor, W. R.; Thornton, J. M. Nature 1992, 358, 86. (8) Overington, J.; Johnson, M. S.; Sali, A,; Blundell, T. L. Proc. R . Soc. London 1990,8241, 132. (9) Bowie, J. U.; Luthy, R.; Eisenberg, D. Science 1991, 253, 164. (10) Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley: Reading, MA, 1989. (1 1) Davis, L. Handbook ofGenetic Algorithms; Van Nostrand Reinhold New York, 1991. (12) Proc. Third Inr. Conf. Genetic Algorithms; Schaffer, J. D., Ed.; Morgan Kaufman: San Mateo, CA, 1989. (13) Proc. Fourth Int. Con$ Genetic Algorithms; Belew, R. K., Booker, L. B., Eds.; Morgan Kaufman: San Mateo, CA, 1991. (14) Mtlhlenbci, H. Foundations of Genetic Algorithms; Rawlins, G. J. E., Ed.; Morgan Kaufmann: San Mateo, CA, 1991. (15) Mtlhlenbein, H.; Schomisch, M.; Born, J, Parall. Compur. 1991,17, 619. (16) Banzhaf, W. Parallelism, Learning, Evolution; Baker, J. D.. Eisle, I., Mundemann, F. W., Eds.;Springer Lecture Notes in Artificial Intelligence; 1991, 565, 442. (17) Bramlette, M. F.; Cusic, R., in ref 12, p 213. (18) Judson, R. S.; Rabitz, H. Phys. Reu. Lett. 1992, 68, 1500. (1 9) Ngo, J. T.; Morris,P. G. Magn. Reson. Med. 1987,5,2 17. Morris, P. G.; McIntyre, D. J. 0.; Rourke, D. E.; Ngo, J. T. Magn. Reson. Med. 1991, 17, 33. Ngo, J. T.; Moms, P. G. J. Magn. Reson. 1987, 74, 122. (20) de la Maza, M. In ref 12, p 208. (21) Judson, R. S.; Colvin, M. E.; Meza, J. C.; Huffer, A.; Gutierrez, D. Int. J. Quantum. Chem. 1992, 277, 44. (22) Judson, R. S.; Jaeger, E. P.; Treasurywala, A. M.; Peterson, M. L. J. Am. Chem. Soc., submitted. (23) Legrand, S.; Merz, K. J. Global Opt., in press. (24) Lucasius, C. B.; Blommers, M. J. J.; Buydens, L. M. C.; Kateman, G. In ref 11, p 251. (25) Blommers, M. J. J.; Lucasius, C. B.; Kateman, G.; Kaptein, R. E i e polymers 1992, 32, 45. (26) Mjolsness, E.; Sharp,D. H.; Reinitz, J. J. Theor. Biol. 1991,152,429. (27) Press, W. H.; Flannery, B. P.; Teukolsky, S.A.; Vetterling, W. T. Numerical Recipes in C; Cambridge U.P.: New York, 1989.