A Limiting-Case Study of Protein Structure Prediction: Energy-Based

The accurate prediction of the 3-dimensional structure of a protein from its sequence is a major unsolved problem. The possibility of such a predictio...
0 downloads 0 Views 215KB Size
11370

J. Phys. Chem. B 2000, 104, 11370-11378

A Limiting-Case Study of Protein Structure Prediction: Energy-Based Searches of Reduced Conformational Space Robert J. Petrella† and Martin Karplus*,‡ Department of Chemistry and Chemical Biology, HarVard UniVersity, 12 Oxford Street, Cambridge, Massachusetts 02138, and Laboratoire de Chimie Biophysique, Institut le Bel, UniVersite´ Louis Pasteur, 67000 Strasbourg, France ReceiVed: May 18, 2000; In Final Form: August 10, 2000

The accurate prediction of the 3-dimensional structure of a protein from its sequence is a major unsolved problem. The possibility of such a prediction using an atom-based energy function and a systematic search procedure has been demonstrated here in a model problem. Three proteins (parvalbumin, a fibronectin domain, and CheY) in different classes (all R, all β, and mixed R/β, respectively) were studied. All the secondary structural elements and the side chains were fixed in the X-ray conformation, and only one backbone dihedral angle in each of the loops between the secondary structural elements was allowed to vary. Energy-based searches of this reduced conformational space were carried out with a solvent-modified empirical energy function. At each stage of the searches, many structures were generated by varying different combinations of angles, and the minimum-energy structure was selected as the starting structure for the next stage. The final minimum-energy structures for all three proteins were within 0.24 Å of the X-ray structures. Less extensive search protocols, in which only one angle was varied at a time, became trapped. The energy surfaces were found to become steeper as the absolute energy minimum was approached. The results support the inference from other studies that the energy function used here has its minimum at the native structure. In addition, they demonstrate the importance of using a search algorithm in which many structures, including variations in multiple degrees of freedom, are evaluated at each stage. Although the model calculations treat a highly simplified version of the protein folding problem, the methodology used here and its success provide insights that should aid in the study of more realistic models.

I. Introduction The prediction of the native structure of a protein from its amino acid sequence by ab initio methods is a major goal of present-day biophysical research.1,2 The availability of many more sequences from genomic analysis3 than of known protein structures from crystallographic or NMR studies has made the solution of this problem of practical, as well as fundamental, importance.4 Difficulties arise from the needs for an energy function that can discriminate native from non-native conformations (i.e., that has the native structure as its global minimum)5,6 and for a robust search algorithm that can find the native state on the complex energy landscape of a polypeptide chain.7 Because of the difficulty of the problem, solutions of simplified model problems are of interest. One simplification has assumed that the secondary structure is known. This approach was first tried for carp myogen by Warshel and Levitt8 and for apomyoglobin by Cohen et al.9 It continues to be explored.10-13 The most successful studies are those of Standley et al.,13 who used a simulated annealing Monte Carlo method with residue-based potentials for two small R/β proteins and obtained low-energy clusters of structures (second or third in energy rank) that were within a few angstroms of the native structures. Also, Mumenthaler and Braun14 used rigid modeling of the secondary structure and statistically derived distance-geometry constraints * Corresponding author. E-mail: [email protected]. Tel.: (33) 388 41 61 32. Fax: (33) 388 60 63 83. † Harvard University. ‡ Universite ´ Louis Pasteur.

to predict the native structures of six out of eight helical proteins to within 2.3-3.1 Å. These results raise the question of whether more accurate predictions are possible using a more physical approach. The present work addresses this question by using an atom-based energy function and an extensive, systematic search of a reduced conformational space to predict the tertiary structure of the three proteins: parvalbumin (an all-R protein), the tenth type-III cell adhesion module of fibronectin (FNIII10, an all-β protein), and CheY (a mixed R/β protein). The exact main-chain structures and side-chain conformations of each of the secondary structural elements are used, and only one of the angles in each of the loops between the elements is allowed to vary; all other angles are fixed in the X-ray conformation. For the protein CheY (Figure 1a), which consists of 128 residues that form five helices and five β strands, this reduces the search problem to one with 9 degrees of freedom. For parvalbumin, which has six helices, and FNIII10, which has seven β strands, the conformational spaces consist of 5 degrees and 6 degrees of freedom, respectively. A series of grid-based searches is performed with decreasing grid sizes for each protein. At each stage, the lowestenergy structure obtained from a search through all possible combinations of the N unknown angles (5, 6, or 9 degrees) taken n at a time was used as the starting structure for the next stage. The atomic-model effective energy function EEF1,15 which incorporates the CHARMM16 energy for the protein and an implicit solvation term based on a Gaussian energy density distribution with a volume exclusion component, is used to

10.1021/jp001847k CCC: $19.00 © 2000 American Chemical Society Published on Web 10/26/2000

Energy-Based Search for Native Protein Structure

J. Phys. Chem. B, Vol. 104, No. 47, 2000 11371

Figure 1. Ribbon diagrams of the CheY molecule (2chf) in (a) its native structure, (b) stage 1, (c) stage 4, (d) stage 7, (e) stage 8, (f) stage 9, (g) the final trapped structure for search A3, and (h) the final trapped structure for search A4. Some R helices (R), β strands (β), and loops or tested angles (φ) are indicated. Correctly predicted loops are indicated as solid lines.

evaluate the structures obtained in the search. The function has been shown to discriminate the native states of several proteins6 from ranges of well-constructed decoys,17 suggesting that the native states are at or near the global minima of the surfaces. In the reduced conformational spaces explored here, the minimum-energy structures obtained on final 1° grids are found to be within 0.19, 0.13, and 0.24 Å of the X-ray structures for the three proteins. The implications of these results are discussed. II. Computational Methods and Details A. Calculations. Unless otherwise indicated, energies were calculated with the CHARMM program16 using EEF1.15 EEF1 makes use of the polar hydrogen (PARAM19) empirical energy

function18 in CHARMM and adds an implicit solvation term based on a Gaussian solvation energy density function. The solvation energy of each atom, i, ∆G slv i is defined by ref ∆G slv i ) ∆G i -

∑ fi(rij)Vj i*j

where ∆G ref i is the reference solvation energy for atom i (the solvation energy of that atom in a suitably chosen small molecule in water), fi(rij) is the Gaussian solvation energy density function parametrized for atom i, rij is the distance between atom i and an atom j in the surroundings, and Vj is the volume of atom j. Hence, the solvation energy for each atom is the reference energy minus the reduction in solvation due to

11372 J. Phys. Chem. B, Vol. 104, No. 47, 2000 the presence of neighboring atoms. The total solvation energy of the molecule is ∑i∆G slv i . Ionic groups are neutralized, and a distance-dependent dielectric function is used to simulate the shielding effects of water on electrostatic interactions. Calculations with EFF1 are generally at least an order of magnitude faster than vacuum calculations with explicit water molecules for molecular dynamics.19 Because all degrees of freedom except for the chosen dihedral angles were held fixed in the present study, only the dihedral, nonbonded, and solvation terms were calculated; the angle, bond, and improper dihedral energy terms were not. A group-based nonbonded list was used with a 10 Å cutoff criterion, the value in the EEF1 model. Since displacements can be large with backbone dihedral angle rotations, the list was updated at every step of the calculations. The BYCC nonbonded option39 was used, which speeds up nonbonded list generation by a factor of ∼3 in the current calculations and reduces the total computational time by a factor of ∼1.3. Calculations were performed on DCG Viper PC164 Alpha workstations and averaged ∼0.083 s per structure. “Correct” or “near-native” angle positions were defined to be within (10° of the native value. An average (1dimensional) slope of the energy surface over the set of n-dimensional grids was defined for each stage as

∆E ∆p 1/n

m j) s

() c

where ∆E is an energy range, ∆p is the number of structures whose energies are within the range, s is the grid size, and c is the number of grids (in this work, c ) N!/(N - n)!n!). The denominator is a 1-dimensional measure of the hypervolume occupied by the average number (over c grids) of gridpoints with energies in the specified range. Hence, for the given energy range, m j gives the approximate magnitude of the average change in energy for a unit change in 1 degree of freedom. B. Grid Searches. Systematic rigid-rotation energy mapping was used to search the conformational space. For each protein, the internal coordinates of the fragments were held constant while the set of N loops/dihedral angles was rotated on the grids. An initial structure (stage 1) was generated by searching all N unknown dihedral angles simultaneously (i.e., on a complete N-dimensional grid) at 60° intervals for CheY, 30° intervals for the FNIII10 structure, and 15° intervals for parvalbumin. The grid size varied from protein to protein in this initial stage because the exact enumeration of the conformational spaces for the smaller proteins (which have fewer degrees of freedom) is possible with smaller grids. Subsequent stages of the protocols used finer grids and, to save computer time, consisted of partial though extensive searches. A set of structures, constituting a “stage”, was generated from a single starting structure by allowing n of the N dihedral angles to vary simultaneously throughout their n-dimensional space and by repeating this for all possible N!/(N - n)!n! combinations of the N angles taken n at a time (e.g., see Table 1 for CheY). The minimum-energy member of this set of generated structures was then used as the starting structure for the next stage, and the procedure was repeated. The grid size was decreased from one stage to the next when either (1) there was no change in the structure or (2) the number of stages completed with a particular grid size equaled or exceeded N/n. When n < N, the repetition at a given grid size allows for more than n angles to be sampled at that grid size, and the (N/n) limit prevents unnecessary lengthening of the calculations. The grid size for a given stage always exactly divided the grid size for the prior stage to ensure that the

Petrella and Karplus TABLE 1: Description of the Primary Search Protocol for CheYa stage

grid size

no. of angles

no. of comb’s/stg

no. of grid points/stg

CPU time/stg

1 2-3 4-6 7-10 11-19 totals

60° 30° 15° 3° 1°

9 4 3 2 1

1 126 84 36 9

10 077 696 2 612 736 1 161 216 518 400 3240 20 889 576

9.7 d 2.5 d 1.1 d 0.5 d 4.5 m 20.1 d

a For each stage or range of stages, the columns give the value of each parameter per stage. No. of angles refers to the number of angles being varied simulaneously throughout a stage. No. of comb’s/stg refers to the number of searches at each stage, corresponding to the number of combinations of the total number of unknown angles taken no. of angles at a time. No. of grid points/stg gives the total number of structures evaluated in a given stage. No. of grid points/stg ) (360°/ grid size)no. of angles × no. of comb’s/stg. For the abridged search protocols, the number of stages at each grid size was as follows (grid size, number of stages): A1, (15°, 4; 3°, 6; 1°, 7); A2, (3°, 10; 1°,7); A3, (1°, 12); A4, (30°, 4; 15°, 7; 3°, 7; 1°, 5). Note that, as in the primary search protocol, all N!/[(N - n)!n!] combinations of angles were explored at each stage so that, for example, in the “single angle”, searches all nine angles were explored in succession and 9 × 360 structures were evaluated at each stage. For the search protocols that became trapped in non-native orientations (A3 and A4), the searches at the finest grid size (1°) were repeated until there was no change in the structure.

structure resulting from the prior stage was included and that, therefore, the energy could never increase from one stage to the next. The number of angles (n) varied at a given stage was decreased at smaller grid sizes (e.g., see Table 2) to avoid the exponential increase in computer time. C. Protein Structures. The X-ray crystal coordinates for the proteins CheY (2chf),20 carp parvalbumin (4cpv),21 and the tenth type-III cell adhesion module of fibronectin, known as FNIII10 (1fna)22 were obtained from the Protein Data Bank23 and used for all calculations; no minimization of the coordinates was performed. Water atoms were removed, and the 2 Ca2+ ions were removed from the parvalbumin structure; the hydrogen atoms were built with H-BUILD.24 D. Selection of Tested Angles. One main-chain φ angle was chosen for variation in each loop. Gly residues were chosen whenever possible; when a Gly was not present, the smallest residue in the loop was chosen. For CheY, the residues are Ser 14, Gly 28, Gly 38, Gly 49, Gly 64, Ala 76, Ala 89, Gly 101, and Thr 111, referred to as angles φ1-φ9, respectively. For parvalbumin, the residues are Ala 21, Gly 34, Gly 56, Ala 72, and Gly 93. For FNIII10, they are Ser 12, Ala 19, Gly 35, Gly 47, Gly 56, and Gly 74. III. Results The results for the CheY protein are presented in detail, followed by a summary of the results for parvalbumin and fibronectin. A. CheY. Primary Search Protocol. Table 2 gives a summary of the results at certain stages of the calculations for CheY. The lowest-energy structure from the initial 60° grid search (stage 1) has an rms deviation (rmsd) from the native of 23.4 Å and an energy 278.0 kcal/mol greater than that of the native structure. The coarse grid search has yielded an expanded structure (Figure 1b) that has two approximately correct dihedral angles and no highly unfavorable van der Waals contacts. Subsequent searches at finer grid sizes result in a progressive decrease in the total energy, radius of gyration (Rg), number of dihedral angles incorrectly predicted, and rmsd from the native.

Energy-Based Search for Native Protein Structure

J. Phys. Chem. B, Vol. 104, No. 47, 2000 11373

TABLE 2: Analysis of Structures from Various Stages of the Primary Search Protocol (ST1-ST19) and from the Final Stages of Search Protocols A3 and A4 for the Protein CheYa grid (deg) no. of angles φ1 (deg) φ2 (deg) φ3 (deg) φ4 (deg) φ5 (deg) φ6 (deg) φ7 (deg) φ8 (deg) φ9 (deg) no. correct Etot Esolv Eelec Evdw Edihe Rg (Å) rmsd (Å)

native

ST1

ST2

ST4

ST5

ST7

ST8

ST9

ST10

ST19

A3

A4

-55.0 104.4 -59.2 78.6 -62.4 -88.7 -68.9 107.0 -102.6 0.0 0.0 0.0 0.0 0.0 13.5 -

60 9 180 60 -60 -120 -60 60 -120 -120 -120 2 278.0 -234.6 199.5 312.6 0.4 24.6 23.4

30 4 -60 -90 -60 150 -60 90 -120 -120 -120 3 258.2 -228.0 192.9 292.5 0.7 26.9 23.7

15 3 -60 -90 -60 150 -60 90 -60 105 -105 6 229.0 -201.3 176.1 254.2 -0.1 25.7 20.7

15 3 -60 -90 -60 150 -60 75 -60 105 -105 6 222.8 -200.8 173.2 250.6 -0.3 26.0 20.4

3 2 -60 -90 -60 150 -60 -87 -66 105 -105 7 153.2 -145.7 118.2 180.8 -0.1 19.4 11.2

3 2 -57 105 -60 150 -60 -87 -66 105 -105 8 134.1 -72.2 64.5 141.5 0.1 17.7 8.1

3 2 -57 105 -57 78 -60 -87 -66 105 -105 9 101.4 -10.1 20.8 90.7 0.0 13.7 0.62

3 2 -54 105 -57 78 -60 -87 -66 105 -102 9 64.5 -8.3 14.5 58.2 -0.1 13.6 0.37

1 1 -55 104 -58 78 -59 -88 -68 107 -102 9 -8.1 -5.6 3.3 -5.8 0.0 13.6 0.19

1 1 -55 104 -59 78 -58 -88 -114 -107 -102 7 55.8 -53.4 36.4 72.3 0.5 17.4 12.9

var 1 -60 -82 -60 78 -58 -88 -114 -101 -102 6 143.2 -158.0 117.1 183.8 0.1 21.0 16.4

a Grid refers to the grid size for the stage. φ -φ are the predicted angle values for the nine unknown dihedral angles (see Computational 1 9 Methods and Details). No. of angles refers to the number of angles varied simultaneously in that stage. No. correct is the number of correct dihedral angles (out of 9) within 10° of the native value. Etot is the total energy of the structure. Esolv is the solvation energy. Eelec is the electrostatic energy. Evdw is the van der Waals energy. Edihe is the dihedral angle energy. Rg is the radius of gyration. Rmsd is the root-mean-square deviation between the coordinates of the calculated and native structures. The correct dihedral angle values are indicated in bold. All energies are relative to those for the native structure and are in units of kcal/mol. Missing stage numbers (3 and 6) represent results equal to those of the prior stage. Stages 11-18 are omitted because the angle changes are very small.

Figure 2. Plot of the radius of gyration, Rg, and the various energetic components as functions of the stage number for the minimum-energy structures in each stage of the main search procedure for the protein CheY. The energy values are relative to that of the native (Enative ) 0).

This occurs until stage 9, where the resulting structure is close to that of the native (rmsd ) 0.62 Å). After this stage, the resulting structures change little; e.g., the structures for stages 9 and 19 are nearly superimposable. The approach to the native structure is illustrated in Figure 1b-f. The final rmsd from the native structures is 0.19 Å. From these results, it appears very likely that the global energy minimum on the nine dihedral angle effective energy surface has been found. Figure 2 shows the Rg and the energetic components of the resulting structures in successive stages of the search protocol. The variables are highly correlated, as expected. The solvation energy increases throughout the study, while the other energy components and the total energy decrease. In stages 6-9, the changes are more rapid and correlated with Rg. The solvation free energy becomes less favorable with increasing compactness because fewer polar and charged side chains are exposed to the solvent in compact structures, and this loss outweighs the gain from the burial of nonpolar groups (i.e., protein-water interac-

tions are more favorable in “open” structures despite a small energetic penalty for solvating nonpolar groups). However, the compact structures are stabilized by intraprotein electrostatic and van der Waals interactions, which are represented explicitly in EEF1 (not included in the solvation energy term) and which more than offset the destabilizing protein-solvent effects. Although the changes in the structures are small from stages 10-19, the total energy decreases from +101.4 to -8.1 kcal/ mol (relative to the native), mainly from a decrease in van der Waals energy. This indicates that the native structure is in a well-defined, rather deep minimum. Figure 3 shows plots of energy versus rmsd for the low-energy structures from certain stages of the calculations. In stage 1, there is a roughly rectangular distribution (i.e., the breadth in rmsd is independent of the energy). The rmsd distributions from subsequent stages are more funnel-shaped, and they are shifted to lower rmsd values at each successive stage. The wide ranges of energies and rmsd’s being sampled are indicative of the importance of the present procedure, in which many structures are examined (i.e., at each stage) before the structure for the next stage is selected. Stage 7 and the later stages display complex distributions, generally with a division into two lowenergy rmsd regions. In stage 7, for example, there are 15 227 structures within 100 kcal/mol of the minimum; 614 structures have rmsd’s less than 15.5 Å, 14 575 structures have rmsd’s greater than 17.5 Å, and only 38 structures have rmsd’s in between. Analysis of the two populations shows that they arise from sampling around the initial (rmsd ) 20.4 Å) and final (rmsd ) 11.2 Å) structures, with the main difference between the two involving angle φ6; the values are ∼75° and -85°, respectively, for the high- and low-energy rmsd regions. The sets of points of approximately constant energy (horizontal streaks) represent changes in a single dihedral angle. An example consists of the three similar streaks connecting the high- and low-energy rmsd regions; they correspond to changes in φ3 for slightly different φ6 positions, all close to the native position (φ6 ) -81°, -84°, and -87° in the order of decreasing energy).

11374 J. Phys. Chem. B, Vol. 104, No. 47, 2000

Petrella and Karplus

Figure 3. Plots of energy as a function of the rms deviation (rmsd) from the native structure for various stages in the main search procedure for CheY. The curvilinear streaks in the later distributions represent series of structures that vary in only one dihedral angle; within a given series, the structures vary significantly in rmsd from the native but are often very similar in energy.

The energy surfaces around the minima tend to become steeper in the later stages. This is demonstrated by the average (1-dimensional) slopes of the sampled energy surfaces for the stages (see Computational Methods and Details). For the energy range between 5 and 25 kcal/mol above the minimum for a given stage, the results were as follows (stage number, average slope in kcal/deg): 1, 0.22; 2, 0.51; 3, 0.41; 4, 0.98; 5, 0.75; 6, 0.62; 7, 4.9; 8, 1.5; 9, 5.8; 10, 16.3; 19, 6.0. The large increase at stage 7 corresponds to the significant reduction in the radius of gyration (see Table 2). Exploratory Calculations. A number of additional calculations were performed with CheY to explore the significance of the successful search reported above. Beginning with the structure from stage 1, the primary protocol was repeated with only van der Waals and dihedral angle energy terms. This resulted in a structure that was within 0.22 Å of the native and 9.81 kcal/ mol lower in energy. Hence, for this reduced conformational

space, the native structure appears to be the global minimum of both the effective energy surface and the “packing energy” surface. To investigate the effect of the search protocol on the results, we examined several abridged protocols, starting with the structure from stage 1 of the primary protocol, in the presence of all the energy terms. Calculations were performed following the primary protocol (see Table 1), except that various grid sizes were skipped; in A1, the 30° searches were skipped, in A2, the 30° and 15° searches were skipped, and in A3, only the 1°, single-angle searches were included. In a fourth case, A4, the searches were all conducted by varying only single angles (as in search A3), but the grid sizes were decreased gradually, as in the primary search protocol. A1 and A2, which included multiple-angle stages, resulted in identical final structures that were within 0.19 Å of the native and within 0.14 Å of the final structure from the primary protocol. However, A3 and A4

Energy-Based Search for Native Protein Structure became trapped (Figure 1g,h). Two-angle searches on a 3° grid, beginning with the trapped structures from A3 and A4, resulted in structures close to the native structure (rmsd < 0.5 Å). In the trapped structure from search A4 where β1 is out of place, β2 and β3 are nearer to each other than in the native structure, and a near-native positioning of β1 between them results in steric clashes. Thus, a search of φ2 alone does not lead to a lowerenergy structure; only when φ2 and φ4 are varied simultaneously does the structure open up (to a non-native φ4 position), allowing β1 and β2 to associate, and then collapse to the nativelike fold. For all of the successful search protocols, the last two incorrectto-correct transitions involve angles φ2 and φ4, in that order (e.g., see Table 1). Angle φ2 changes from approximately -90° to 105°, and angle φ4 changes from either -84° or 150° to 78°. In search A2, φ4 achieves a near-native position (∆φ ) 0.6°; β2 and β3 in near-native relative positions) before φ2 does, but in an intervening step, it changes back to an incorrect position (∆φ ) 168.6°) , allowing φ2 to assume a near-native position (β1 and β2 in their near-native relative orientations), before finally changing once more to a near-native position (∆φ ) 0.6°). This illustrates that the pathways in which contacting β strands assume their native relative positions before noncontacting strands are favored. The effect is readily observed in the packing of β1, β2, and β3 of CheY because the spatial arrangement of these strands in the native structure is not sequential (β1 is between β2 and β3), and it is not observed in the packing of β3, β4, and β5 (see Figure 4). To investigate whether the intraloop interactions are important in determining the ground state, we repeated the primary protocol (from the stage 1 structure) using only the nine dihedral angle energy terms and the nonbonded interactions between the secondary structural elements. The conformational search space, which is restricted by the presence of the connecting loops, was left unchanged. The search became trapped in a structure with an rmsd 17.6 Å from the native structure and 61.9 kcal/mol higher in energy than the native state (on a final 1° grid). In contrast to the successful searches, the radius of gyration changes little; it is 23.4 Å in the initial structure, 20.7 Å after the second stage, and 21.1 Å after the 12th and last stage. For globular proteins, this pattern can indicate that the search is trapped at a metastable state. As in the trapped structure from search A4, φ2 is incorrect (∆φ2 ) 92.6°), while φ4 is near-native (∆φ4 ) 3.6°). However, unlike the case with all the energy terms present, the open structures with φ4 incorrect and φ2 correct (see Figure 1e), which can be located with two-angle searches, are not lower in energy than the trapped structure in the absence of the loop interactions. Lacking this low-energy “intermediate” state, the search must go from the “trapped” configuration for the β1β2-β3 segment to the near-native configuration in a single stage, and the two-angle 3° searches (5.2 × 105 structures/stage) are not sufficient for this. Substitution of three-angle 5° stages (3.1 × 107 structures/stage) for the two-angle 3° stages resulted in the correct structure (rmsd 0.60 Å from the native; energy 61.8 kcal/mol higher than that of the native) after two stages but in a non-native structure after the 3rd stage (rmsd 3.08 Å from the native; energy 58.23 kcal/mol higher than that of the native; ∆φ9 ) 32.4°). The energy for the near-native conformer is essentially the same as that for the trapped structure (61.8 and 61.9 kcal/mol higher than the native energy, respectively) because the energy of the former was obtained with a coarser grid search (5° vs 1°), i.e., energies from searches using different grid sizes cannot be directly compared. The use of two-angle 1° stages (4.7 × 106 structures/stage) instead of the two-angle 3° stages resulted in near-native structures with a final rmsd of

J. Phys. Chem. B, Vol. 104, No. 47, 2000 11375

Figure 4. Effect of secondary structure topology on the order of folding observed in the CheY study. Ovals 1, 2, and 3 represent three secondary structural elements in a protein. In the upper pathways, elements 2 and 3 assume their native relative orientations first. In the lower pathways, elements 1 and 2 fold first. In A, the spatial arrangement of the elements in the folded structure is the same as their arrangement in the amino acid sequence, i.e., 1 contacts 2 and 2 contacts 3. The changes from either intermediate (1 and 2 or 2 and 3) to the final folded structure are under equivalent steric constraint, and thus, there is no preference for one pathway over the other. In B, the spatial arrangement of the elements in the folded structure (2,1,3) is not the same as their arrangement in the amino acid sequence (1,2,3). In this case, the change from the upper intermediate to the final structure is under higher steric constraint than that from the lower intermediate; the upper pathway is therefore disfavored.

0.14 Å, and an energy 6.5 kcal/mol less than that of the native. These results demonstrate the complexity of finding the global minimum for even such a restricted problem. The energy surface in the absence of the loop interactions is generally much flatter than that obtained with the complete energy function. The stage 1 structure, for example, is only 121.0 kcal/mol higher in energy than the native structure (compare to ∆E ) 278.0 kcal/mol for all interactions). Moreover, the first stage in which two angles were varied at 1° intervals generated 2699 distinct structures within 10 kcal/mol and over 6 × 105 structures within 25 kcal/mol of the minimum for the stage. To determine the importance of the side chains relative to the backbone in the current model, we repeated the calculations in stages 7, 9, and 19 of the primary search protocol (all energy components) with all side chain atoms removed, except for the Cβ atoms in non-glycine residues, which are essentially fixed

11376 J. Phys. Chem. B, Vol. 104, No. 47, 2000

Petrella and Karplus

TABLE 3: Analysisa of Structures from Various Stages of the Primary Search Protocols for (a) Fibronectinb and (b) Parvalbumin (a) Fibronectin grid (deg) no. of angles φ1 (deg) φ2 (deg) φ3 (deg) φ4 (deg) φ5 (deg) φ6 (deg) no. correct Etot Esolv Eelec Evdw Edihe Rg (Å) rmsd (Å)

native

ST1

ST2

ST3

ST4

ST5

ST7

ST8

ST12

1-dim

-158.7 -76.0 78.1 -61.9 71.6 142.8 0.0 0.0 0.0 0.0 0.0 13.4 -

30 6 -90 -60 -120 90 -120 -90 0 312.5 -213.0 255.2 269.0 1.2 16.3 28.6

15 3 -90 -60 75 -120 -120 135 2 272.0 -177.5 218.6 229.7 1.2 8.1 19.5

15 3 -90 -75 75 -60 -120 135 4 234.9 -145.0 178.3 200.9 0.7 4.7 15.7

3 2 -165 -75 75 -60 -120 135 5 201.3 -125.7 142.5 184.0 0.5 4.8 16.5

3 2 -165 -75 75 -60 75 -105 5 163.8 -75.1 91.3 147.6 0.0 5.0 17.6

1 1 -160 -75 75 -60 75 -105 5 136.7 -74.6 86.5 124.7 0.1 5.0 17.6

1 1 -160 -75 75 -60 75 142 6 56.3 -14.8 34.1 37.0 0.0 0.0 0.38

1 1 -160 -75 76 -61 73 142 6 25.9 -9.6 19.6 15.9 0.0 0.0 0.24

1 1 -88 -66 -120 94 67 142 3 238.4 -165.9 195.4 208.0 0.8 6.2 16.6

(b) Parvalbumin grid (deg) no. of angles φ1 (deg) φ2 (deg) φ3 (deg) φ4 (deg) φ5 (deg) no. correct Etot Esolv Eelec Evdw Edihe Rg (Å) rmsd (Å)

native

ST1

ST2

ST3

ST4

ST5

ST7

-53.0 89.9 95.7 -47.6 65.3 0.0 0.0 0.0 0.0 0.0 12.7 -

15 5 -75 90 -75 -75 60 2 143.5 -128.9 83.6 188.9 -0.2 8.6 17.2

5 3 -55 90 -75 -50 65 4 90.8 -82.6 45.6 128.2 -0.3 6.3 15.9

5 3 -155 85 95 -50 65 4 68.6 -48.8 46.9 70.5 0.1 3.1 9.3

1 2 -54 85 95 -48 65 5 37.3 -11.1 7.8 40.7 -0.1 0.0 0.37

1 2 -54 88 96 -48 65 5 0.7 -4.0 2.2 2.6 -0.0 -0.0 0.17

1 2 -53 88 96 -48 65 5 -2.8 -2.2 0.3 -0.9 -0.0 -0.0 0.13

a Stages resulting in no changes in the structures are omitted, as are stages 9-11 for fibronectin, in which the changes are small. Correctly predicted angles are in bold. Labels correspond to those in Table 2. b Analysis also for the final stage of the 1-dimensional search in fibronectin.

by the main chain. For stage 7, the resulting backbone structure was identical to that in the original set of calculations (i.e., φ6 and φ7 achieved near-native positions). For stage 9, it was within 0.64 Å of the native, versus an rmsd of 0.62 Å for the primary result. For stage 19, the resulting structure had an rmsd from native of 0.75 Å, versus 0.19 Å for the primary search protocol, and a ∆E of -0.5 kcal/mol relative to the native backbone structure. The results indicate that, given the constraints of the current model, the backbones have a much larger influence on the predictions than the side chains, which are important primarily for achieving the precise structure in the region very close to the global minimum. A calculation in which stage 19 was repeated with Cβ atoms present on all residues, including glycines (polyala model), resulted in an opening up of the structure at φ4 (-107°) from a bad contact created by the Cβ added at Gly49. FiVe-Fragment Model. Because the CheY protein has welldefined supersecondary structural units, a model with four variable angles between the five fixed R-helix/β-strand elements was also investigated. The results were essentially the same as for the primary protocol and are not given here. B. Parvalbumin and Fibronectin. Multidimensional primary search protocols analogous to those described for CheY were carried out for parvalbumin and the fibronectin cell adhesion module FNIII10. All energy terms were included. For FNIII10, the final prediction (on a 1° grid) has an rmsd of 0.24 Å from the native structure and an energy 25.9 kcal/mol higher than that of the native structure. The largest contribution to the energy

difference is electrostatic (19.9 kcal/mol) and is due predominantly to small misalignments in hydrogen bonding (15.7 kcal/ mol) between the β strands in the predicted structure. For parvalbumin, the final structure has an rmsd of 0.13 Å from the native structure and an energy 2.8 kcal/mol lower than that of the native. The results are summarized in Table 3. The resulting structures at various stages of the searches are shown in Figures 5 and 6. Protocols comprised entirely of single-angle 1° searches (analogous to the A3 protocol for CheY, with the stage 1 structures from the respective primary protocols as the starting conformations) were also carried out for the two proteins. For FNIII10, the 1-dimensional search became trapped at a structure with three of the six angles more than 71° from the native values and with an overall rmsd of 16.6 Å from the native; the energy was 238.4 kcal/mol higher than that of the native structure. For the simpler and smaller parvalbumin molecule, the single-angle search did converge after eight stages to the same structure and energy, as obtained in the primary protocol. IV. Concluding Discussion The successful limiting-case conformational searches for the native state of proteins in three structural classes illustrate some essential features of energy-based protein structure predictions. In general, there are two requirements for such predictions: an energy function with its global minimum at or near the native state and a search procedure that is able to find the minimum.

Energy-Based Search for Native Protein Structure

J. Phys. Chem. B, Vol. 104, No. 47, 2000 11377

Figure 6. Ribbon diagrams of parvalbumin (4cpv) in (a) stage 1, (b) stage 3, and (c) stage 7. R helices A-F and correctly predicted loops (solid) are indicated.

Figure 5. Ribbon diagrams of the fibronectin module FNIII10 (1fna) in (a) stage 1, (b) stage 2, (c) stage 3, (d) stage 5, and (e) stage 12. β strands A-G and correctly predicted loops (solid) are depicted.

The current results show that the native structures of the proteins CheY and parvalbumin and of the FNIII10 module of fibronectin correspond to the global minima of the effective energy function EEF1 in the 5 to 9-dimensional conformational spaces of the loop backbone dihedral angles, with all other degrees of freedom fixed at the native values. Although fixing the ψ backbone angles does significantly affect the distributions of possible φ values25,26 (all predicted φ angles at all stages in the study correspond to allowed [φ,ψ] regions), the results for CheY show that, given the geometric constraints provided by the loops, the native structure occurs at the energy minimum even in the absence of all intraloop interactions and of all interactions of the loops with the secondary structural elements; i.e., in the current model the interactions between the secondary structural elements determine the minimum-energy structure. Previous studies of decoys have shown that EEF1 could distinguish them

from the native state;6 the present search provides a somewhat different test of the function. If EEF1 (or an improved version) has the correct global minimum, the question of protein folding becomes reduced to the search problem. The search method used in the current study, which works well in these limiting cases, is intermediate between serial singlestructure evaluation (e.g., Monte Carlo) and exhaustive enumeration; it performs partial, though extensive, searches in series. The results demonstrate the importance of generating many structures prior to the selection of a single structure for the next stage of the search and of including concerted variations in multiple degrees of freedom. The flatter the energy landscape, the more structures must be generated with finer or higherdimensional grids for adequate sampling at each stage. The failure of some 1-dimensional search protocols (those that vary only one dihedral angle at a time at all stages) in finding the global minima appears to be related at least in part to the size and complexity of the conformational spaces; i.e., in this study, to the number of secondary structural units and the nature of their 3-dimensional packing arrangements. Although properties such as compactness and more sophisticated measures27,28 can be used as indicators of successful protein folding, in nonexhaustive search methods such as the current one, multiple trials with different search protocols may be required to verify that the lowest-energy structure corresponds to the global energy minimum rather than to a metastable conformer. The extensive sampling of the reduced conformational space for the current problem makes possible a detailed characterization of the energy surfaces involved. The surfaces tend to become steeper and narrower as the native structure is approached and the radius of gyration decreases. Near the global minimum, very small changes in position can lead to large changes in energy. This suggests that the surface is “rough” in this region and means that a finer grid is required for the search. The solvation energy components of the surfaces increase (become less favorable) as the radius of gyration decreases, and the more compact structures are stabilized by intraprotein electrostatic and van der Waals interactions. There are many possible extensions of the current approach to more realistic forms of the protein prediction problem, including the integration of neural-network predictions of secondary structure29 and either statistical30,31 or energy-based32

11378 J. Phys. Chem. B, Vol. 104, No. 47, 2000 backbone-dependent predictions of side-chain conformations. With current computational speeds, however, the thoroughness of the current search is not possible in the much higher dimensional conformational spaces of a protein, as is evident from the Levinthal paradox33 and complexity theory.34,35 Hence, while a systematic search method is still likely to be a useful approach, modifications are required, such as the use of rotameric or otherwise preselected dihedral angle states for loop conformations and energy analysis of selected fragments for which an adequate search is possible to obtain candidate structures,36,37 and the more efficient generation of low-energy starting structures. It is hoped that the present limiting-case studies will stimulate related work in the highly challenging and rapidly advancing field of protein structure prediction. Acknowledgment. The authors thank Ioan Andricioaei, Mattius Buck, Qiang Cui, Aaron Dinner, Erik Evensen, Themis Lazaridis, Jianpeng Ma, and Robert Yelle for helpful discussion and technical help. The authors thank Bruce Tidor and Collin Stultz for insightful comments on the manuscript. R.J.P. is an American Cancer Society fellow. This work was supported in part by a grant from the National Institutes of Health. The MOLSCRIPT38 program was used for Figures 1, 5, and 6. References and Notes (1) Eisenberg, D. Nat. Struct. Biol. 1997, 4, 95-97. (2) Koehl, P.; Levitt, M. Nat. Struct. Biol. 1999, 6 (2), 108-111. (3) Rowen, L.; Mahairas, G.; Hood, L. Science 1997, 278, 605-607. (4) Sali, A.; Kuriyan, J. Trends Biochem. Sci. 1999, 24 (12), M20M24. (5) Finkelstein, A. V.; Reva, B. A. Protein Eng. 1996, 9 (5), 387397. (6) Lazaridis, T.; Karplus, M. J. Mol. Biol. 1999, 288, 477-487. (7) Frauenfelder, H.; Leeson, D. T. Natl. Struct. Biol. 1998, 5 (9), 757759. (8) Warshel, A.; Levitt, M. J. Mol. Biol. 1976, 106, 421-437. (9) Cohen, F. E.; Richmond, T. J.; Richards, F. M. J. Mol. Biol. 1979, 132, 275-288. (10) Gunn, J. R.; Monge, A.; Friesner, R. A.; Marshall, C. H. J. Phys. Chem. 1994, 98, 702-711.

Petrella and Karplus (11) Monge, A.; Friesner, R. A.; Honig, B. Proc. Nat. Acad. Sci. USA 1994, 91 (11), 5027-5029. (12) Monge, A.; Lathrop, E. J. P.; Gunn, J. R.; Shenkin, P. S.; Friesner, R. A. J. Mol. Biol. 1995, 247, 995-1012. (13) Standley, D. M.; Gunn, J. R.; Friesner, R. A.; McDermott, A. E. Proteins 1998, 33 (2), 240-252. (14) Mumenthaler, C.; Braun, W. Protein Sci. 1995, 4, 863-871. (15) Lazaridis, T.; Karplus, M. Proteins 1999, 35, 133-152. (16) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. J. Comput. Chem. 1983, 4, 187-217. (17) Park, B. H.; Levitt, M. J. Mol. Biol. 1996, 258, 367-392. (18) Neria, E.; Fischer, S.; Karplus, M. J. Chem. Phys. 1996, 105, 19021921. (19) Lazaridis, T.; Karplus, M. Science 1997, 278, 1928-1931. (20) Stock, A. M.; Martinez-Hackert, E.; Rasmussen, B. F.; West, A. H.; Stock, D. R.; Petsko, G. A. Biochemistry 1993, 32 (49), 13375-13380. (21) Kumar, V. D.; Lee, L.; Edwards, B. F. P. Biochemistry 1990, 29 (6), 1404-1412. (22) Dickinson, C. D.; Veerapandian, B.; Dai, X. P.; Hamlin, R. C.; Xuong, N. H.; Ruoslahti, E.; Ely, K. R. J. Mol. Biol. 1994, 236 (4), 10791092. (23) Berman, H. M., et al. Nucleic Acids Res. 2000, 28, 235-242. (24) Brunger, A. T.; Karplus, M. Proteins 1988, 4, 148-156. (25) Rooman, M. J.; Kocher, J. A.; Wodak, S. J. J. Mol. Biol. 1991, 221, 961-979. (26) Buturovic, L. J.; Smith, T. F.; Vadja, S. J. Comput. Chem. 1994, 15 (3), 300-312. (27) Lu¨thy, R.; Bowie, J. U.; Eisenberg, D. Nature 1992, 356, 83-85. (28) Koehl, P.; Delarue, M. Proteins 1994, 20, 264-278. (29) Chandonia, J. M.; Karplus, M. Proteins 1999, 35, 293-306. (30) Dunbrack, R. L.; Karplus, M. Nat. Struct. Biol. 1994, 1 (5), 334340. (31) Bower, M. J.; Cohen, F. E.; Dunbrack, R. L. J. Mol. Biol. 1997, 267, 1268-1282. (32) Petrella, R. J.; Lazaridis, T.; Karplus, M. Folding Des. 1998, 3, 353-377. (33) Karplus, M. Folding Des. 1997, 2 (4), S69-S75. (34) Ngo, J. T.; Marks, J.; Karplus, M. In The Protein Folding Problem and Tertiary Structure Prediction; Merz, K., Jr., Le Grand, S., Eds.; Birkha¨user: Boston, 1994; pp 435-508. (35) Monasson, R.; Zecchina, R.; Kirkpatrick, S.; Selman, B.; Troyansky, L. Nature 1999, 400, 133-137. (36) Gibson, K. D.; Scheraga, H. A. J. Comput. Chem. 1987, 8 (6), 826-834. (37) Lee, J.; Scheraga, H. A.; Rackovsky, S. J. Comput. Chem. 1997, 18, 1222-1232. (38) Kraulis, P. J. J. Appl. Crystallogr. 1991, 26, 283-291. (39) Petrella, R. J., et al. Results unpublished.