Sequential simulated annealing: an efficient ... - ACS Publications

Mar 1, 1993 - Allostery and Induced Fit: NMR and Molecular Modeling Study of the trp Repressor-mtr DNA Complex. Luciano Brocchieri , Guo-Ping Zhou ...
1 downloads 0 Views 460KB Size
J . Phys. Chem. 1993,97, 3007-3012

3007

Sequential Simulated Annealing: An Efficient Procedure for Structural Refinement Based on NMR Constraintst Daqing Zhao and OIeg Jardetzky' Stanford Magnetic Resonance Laboratory. Stanford University, Stanford, California 94305-5055 Received: October 12, 1992

We suggest an efficient procedure for calculating a family of structures satisfying NMR constraints and empirical bond-energy terms. The procedure is suitable for the calculation of large protein structures for which existing methods fail and is called sequential simulated annealing. We tested the procedure against conventional distancegeometry-simulated annealing on a system of moderate size and showed that the two are completely equivalent in reproducing the target structure. The sequential simulated annealing protocol saves at least 45% of the computation time to generate a family of structures.

Introduction

Details of the Procedure

Molecular dynamics (MD) has become an indispensable tool for the calculation of the three-dimensionalstructure of peptides and proteins from nuclear magnetic resonance (NMR) experimental constraint^.'-^ For smaller proteins (0.1 A RMS bond-length violation, A RMS bond-angle violation, deg angle violation, > 5 O

19 NOEa/residue

DG-SA 831.7(9.2) 9.9(3.4) 12.4(0.7) 26.9(3.3) 731.2(4.4) 51.3( 1.2) 0.56(0.09) 0.55 0.56 0.79(0.09) 0.80(0.10) 0.019(0.003) 40)

SEQ-SA

DG-SA

834.6(5.8) 12.6( 1.8) 13.0(0.6) 27.8(2.2) 729.5(2.9) 51.6(0.8) 0.44(0.05) 0.46 0.46 0.63(0.06) 0.63(0.05) 0.017(0.001) 4(2)

833.5(4.8) 12.4(1.5) 12.8(0.5) 27.4( 1.8) 729.6(3.0) 51.4(0.8) 0.49(0.06) 0.53 0.47 0.64(0.07) 0.64(0.05) 0.017(0.001) 4(1)

0.004

0.004

0.004

0.004

2.014(0.005) 240)

2.0 14(0.006) 25(1)

2.012(0.004) 2 4 ~ )

2.01Z(0.004) 24(1)

Number in parentheses indicates 1 standard deviation; statistics based on 60 structures unless noted otherwise. The RMSD between individual structures and their average structure. Statistics based on 30 structures.

0.0

0.2

0.4

0.8

0.6

RMSD

1 .o

1.2

(A)

Figwe 1. Frequencies vs RMSD accuracy for the 60 structures generated by sequential simulated annealing (SEQSA, indicated by squares) and distance-gcomctry-simulated annealing (DG-SA, indicated by diamonds) for both small (connected by dotted lines) and large (connected by solid lines) data sets.

This same data set was used to calculate a family of structures using ( 1) sequential simulated annealing (SEQ-SA) and, for comparison, (2) distance geometry followed by simulated annealing (DG-SA).The statistics of the two families of structures were then compared. A similar comparison was made on another data set with 890 constraints (19 constraints per residue). In the following discussion, the small constraint set indicates the set with 539 constraints and the large constraint set the set with 890 constraints. Constraints derived from the crystal structure were classified into four categories according to the procedures described by Forman-Kay et al.Is Those with distances smaller than 2.7 A were assigned a distance range of 1.8-2.7 A, those between 2.7 and 3.3 A were assigned 1.8-3.3 A, those between 3.3 and 5.0 A, 1.8-5.0 A, and those between 5.0 and 6.0 A, 3.0-6.0 A. The selection of the constraints out of all possible proton-proton distances in thecrystal structure iscompletelyrandom. Multiple

selection of equivalent protons, such as they- and 6-methyl protons in isoleucine, is avoided. Sixty structures were calculated for each set of conditions. The calculations were performed using XPLOR 3.0 on a Silicon Graphics Indigo R3000 workstation,as well as on the Cray YMP at the Pittsburgh Supercomputer Center. On the SGI Indigo workstation, for the DG-SA calculations, 9000 s of CPU time was used for each structure, while for SEQ-SA, only the first structure, which was calculated from distance-geometry-simulated annealing, used 9000 s, and the rest of the structures used only 5000 s per structure. Therefore, 150 h is needed to generate 60 structures using DG-SA, while only 85 h using SEQ-SA. In this example, SEQ-SAsaves 45% in computation time. A considerably larger time savings can be expected as the number of atoms in the structure is increased, especially when the acceptance rate of distance geometry coarse structures is low, as is the case in many practical applications.

-

The Journal of Physical Chemistry, Vol. 97, No. 12, 1993 3009

Sequential Simulated Annealing 2.5

2.0

-

".......+"..."

SEQSA backbone SEQSA sidechain DO-SA backbone DO-SA sidechain

1.5-

1.0-

0.5

0

10

20

30

Rraidur

Numbrr

2.5

-

----p--

2.0

-

50

40

.........+".....

SEQSA backbone SEQSA sidechain DO-SA backbone DO-SA sidechain

1.5-

0

10

20

30

50

40

Rraldur Numbrr

Figure 2. Accuracy of the structure: the average RMSDs for the backbone (solid lines) and side-chain (dotted lines) structures generated by SEQ-SA (squares) and by DG-SA (diamonds) vs the crystal structure for the (top) small data set and (bottom) large data set.

Table I lists the refinement statistics, energies, and violations, based on the statistics on 60 structures for each method. For both test sets, the averages, means, and standard deviations are all identical within statistical fluctuation, indicating that the two methods have the same accuracy and precision. (Accuracy is measured by the RMSD from the known starting structure, which reflects the faithfulness with which the known standard is reproduced. Precision is measured by the RMSD of the individual members of a family from their mean and reflects solely the reproducibility of the calculation.) In order to demonstrate how the statistics converge as the number of structures increases, we also listed the statistics based on 30 structures. One can scc that

increasing the number of constraints per residue from 12 to 19 improves the precision slightly from 0.75 to 0.65 A, but the accuracy of the average structures remains the same (-0.5 A). The values from both methods of refinement are again identical. The plots of population among 60 structures vs overall RMSD are shown in Figure 1. It is interesting to note that for both methods, the distributions of RMS s are more or less Gaussian with non-zero most probable RM D values. In Figures 2 and 3, we plotted the RMSDs vs the residue accuracy and precision of each method. For the small constraint set, nearly identical accuracy and precision are obtained, and the RMSDs havesimilar profiles as a function of residue. For the large constraint set,

?

-

2.5

SEQ-SA backbone

---*-SEQ-SA sidechain

-

2.0

DO-SAbactrbone

....""-.f".".DO-SA ridechain e

1.5

-

1.0

-

0.5

-

3 n m

z

0

10

30

20

-

SEQ-SAbackbone

----)-SEGSA

2.0

-

50

40

".-+."...

sidechain

DESAbackbone DESA sidechain

1

1.5-

1.0-

0.5

0.0

!

I

I

0

10

20

I

30

Rrridur

40

50

Numbrr

Figwe 3. Precision of the structure: the averageR M S h for the backbone (solid lines) and sideshain (dotted lines) structures generated by SEQ-SA (squares) and by DG-SA (diamonds) vs the average structure for the (top) small data set and (bottom) large data set.

SEQ-SA gives a slightly better accuracy and precision and the RMSDs have similar profiles as a function of residue as well. The residues with larger RMSDs are mainly Arg 17, Leu 18, Pro 19, and Gly 20, near the end of the first a-helix, with DG-SAshowing a much larger deviation. It also appears in the prcrrent example that the fessprecisefydetermined residues also tend to have less uccurute average positions, but this is not necessurily true in

general.* In Figure 4, we plot the number of total NOES vs residue number in the two constraint sets. The correlation between the number of N O b and the RMSD in the same region is weaker than expected. If we normalize the number of NOES by the number of protons in the residue, the features are essentially the same, indicating that the features are due mainly to the folding of the protein. There are 41 NOES in the small constraint set

The Journal of Physical Chemistry, Vol. 97, No. 12, 1993 3011

Sequential Simulated Annealing

..".""*-

RMSD small set RMSD large set

.."....*"."" NOEs small set

'*O/ la0

NOEs large set

n

'

1.0

A

0.5

A

0.0 A

0

10

20

30

40

50

Rrsldur Numbrr Figure 4. Graph (bottom two lines) of the number of NOES vs the residue number for the large data set3 (solid lines) and small data sets (dotted lines). The top two lines graph the backbone RMSDs for SEQ-SA calculations vs the residue number for both data sets.

and 65 in the large set for residue Arg 10, 29 and 49 NOEs, respectively, for residue Phe 13, 3 1 and 57 NOEs, respectively, for residue Arg 17, and as expected, these residues have smaller RMSDs. For other residues, the correlation is less predictable. For example, Asp 43 is poorly determined with four NOEs in the same constraint set and six NOEs in the large constraint set. The Glu 23 side chain consistently has larger RMSDs than the side chains of neighboring residues, irrespective of the refinement method. Glu 23, however, has 10 NOEs in the small constraint set and 15 NOES in the large constraint set. Ala 9 and Ile 33 are among the best determined residues. Ile 33 has 9 and 13 NOEs in the small and large data sets, respectively, but Ala 9 has only 3 and 4 NOEs.

Discussion We have tested sequential simulated annealing as an efficient protocol of generating a family of structures from experimental and empirical constraints. In a separate study, we applied this procedure successfully to the refinement of the E. coli trp repressor, which has 3500atomsand 2000 NMRexperimental constraints." It should be stressed that high-temperature dynamics is essential to the successof this procedure. If calculated at room temperature, each run of molecular dynamics simulation would require at least nanoseconds of molecular dynamics, and the procedure may offer no computational advantage. SEQ-SA does not search the entire possible conformation space during each run, but this is not a problem. Similiar to the situation in Metropolis sampling, those parts of the conformation space which cannot be reached during a sufficiently long trajectory are not statistically important.16J7 The probabilities for the system to take these conformations are so small that their contribution to the possible range of structures is negligible. In fact, this method saves computation time by avoiding a random search each time. Since the length of a long trajectory is determined subjectively, we may not exclude some

structures, especially those with two distinct globular foldings separated by a large energy barrier, which would require very long simulation times (possibly nanoseconds or longer). Under conditions of high temperature, reduced van der Waals strength, and bond-angle force strength and with the Monte Carlo velocity selections, it is less probable that a significant structure will be overlooked. A useful check may be made by comparing the statistics when the number of structures and/or the length of the trajectory is increased. We checked these for both the crambin and rrp repressor calculations and found that the results remain unchanged when more structures are included. For the large constraint set in the crambin test calculation, the distance-geometry-simulated annealing appears to give slightly less accurate and precise structures, especially near Pro 19. This indicates that a single 15-ps IO00 K simulated annealing may not be sufficient to refine a coarse structure from distance geometry. In the limit of longer time of refinement by simulated annealing, however, the results from the DG-SA converge to those from the SEQ-SA method. This indicates that the time required to regularize a coarse structure is longer than the time needed to generate an additional structure using SEQ-SA. In the example cited, we used equal refinement time for both methods and SEQSA takes 45% less computation time than DG-SA does. If more time is used to refine a coarse distance geometry structure, SEQSA will save even more computation time. Finally, we like to point out that in this model calculation, there is only one known and correct structure, the crystal structure, and the data set is free of experimental error in distance estimates. Any spread in calculated atomic positions is due to the limited number of constraints compared to the 3N- 6 degrees of freedom. This represents the limit of refinement given a limited data set. Greater spread will result with experimental data sets that are sparse by comparison and may contain errors. It is worth noting that the limit is the same for both of the two methods.

3012 The Journal of Physical Chemistry, Vol. 97, No. 12, 1993

A c ~ w ~ This t research . was funded by NIH Grants RR02300, RR07558, and GM33385. Computer time was made available by the Pittsburgh Supercomputing Center (supported by NIH Grant 1 P41 RR06009 and NSF Grant ACS-850o0650) and the Cornel1 National Supercomputer Facility (funded by the National Science Foundation, the IBM Corporation, New York State, and members of the Corporate Research Institute).

References and Notes (1) Van Gunsteren, W. F.; Kaptein, R.; Zuiderweg, E. R. P. In Nucleic Acid Conformation and Dynamics; Olson, W.K.,Ed.; NATO/CECAM: Orsav. 1983. -..~ (Z)-Clore, G. M.; Briinger, A. T.; Karplus, M.;Gronenborn, A. M.J . Mol. Biol. 1906, 191, 523. ( 3 ) Briinger, A. T.; Karplus, M.Acc. Chem. Res. 1991, 24, 54. (4) Briinner. A. T. X-PLOR (Ver. 3.0) Manual; Yale University: New Haven, CT, 1592. (5) Kuntz, I. D.; Crippen, G. M.;Kollman, P. A. Biopolymers 1!J79,18, 939. - __. (6) Altman, R. B.; Jardetzky, 0. Nuclear Magnetic Resonance, Part B: Structure and Mechanisms. In Methods in Enzymology; Oppenheimer, N.

Zhao and Jardetzky J., James, T. L., Eds.; Academic Press: New York, 1989; Vol. 177, pp 218246.

( 7 ) Koehl. P.; Lehre, J.-F.; Jardetzky, 0. J. Molec. Biol. 1992, 223, 299. (8) Liu, Y.; Zhao, D.; Altman, R.; Jardetzky, 0.J . Biomol. NMR 1992,

2,373.

o, Biochemistry (9) Arrowsmith, 1990, C.29, H.;6332. Pachter, R.; Altman, R.B.; Iyer, S.;Jardetzky, (10) Arrowsmith. C.; Pachter, R.; Altman, R.; Jardetzky, 0. Eur. J . Biochem. 1991, 202 (2), 53. (11) Zhao, D.; Arrowsmith, C. H.; Jia, X . ; Jardetzky,O. J. Molec. Biol., in press. (12) Chandler, D. Introduction to Modern Statistical Mechanics; Oxford: New York, 1987. (13) Briinger, A. T. X-PLOR (Ver. 2.1) Manual; Yale University: New Haven, CT. 1990. (14) Hendrickson, W. A.; Teeter, M.M. Nature 1981, 290, 107. (15) Forman-Kay, J. D.; Clore, G. M.;Wingfield, P. T.; Gronenborn, A. M. Biochemistry 1991, 30, 2685. (16) Metropolis, N.;Rosenbluth, A. W.; Rosenbluth, M. H.; Teller, A. H.; Teller, E.J . Chem. Phys. 1953, 21, 1087. (17) Vallcau, J. P.; Whittington. S . G. In Statistical Mechanics Part A; Beme, B. J., Ed.; Plenum: New York, 1977.