Toward Absolute Density of States Calculations for Proteins - The

This estimate can place any DOS function, such as the Gaussian DOS distribution in the random energy model, on an absolute scale. Comparing the absolu...
0 downloads 0 Views 52KB Size
J. Phys. Chem. B 2006, 110, 12125-12128

12125

Toward Absolute Density of States Calculations for Proteins David C. Sullivan*,†,‡ and Carmay Lim*,†,§ Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, and Department of Chemistry, National Tsing Hua UniVersity, Hsinchu 300, Taiwan ReceiVed: April 26, 2006

The density of states (DOS), which gives the number of conformations with a particular energy E, is a prerequisite in computing most thermodynamic quantities and in elucidating important biological processes such as the mechanism of protein folding. However, current methods for computing DOS of large systems such as proteins generally yield only the ratios of microstate counts for different energies, which could yield absolute conformation counts if the total number of conformations in phase space is known, thus motivating this work. Here, the total number of energy minima of 50-mer polyalanine, whose size corresponds to naturally occurring small proteins, was estimated under an all-atom potential energy function based on the cumulative distribution function (CDF) of conformational differences to be ∼1038. This estimate can place any DOS function, such as the Gaussian DOS distribution in the random energy model, on an absolute scale. Comparing the absolute conformational counts from a Gaussian DOS function with those from the CDF derived from quenched molecular dynamics ensembles shows that the former are far greater than the latter, indicating far fewer low-energy minima actually exist. In addition to showing how CDF and relative DOS calculations can yield absolute DOS for a discrete system, we also show how they can yield absolute DOS for continuous variable systems to a specified atomic variance. In the context of protein folding, knowing this phase-space “volume” of conformations in a DOS function, as well as characteristic transition times, constrains the set of possible folding mechanisms.

Introduction The energy of a macromolecule as a function of its internal degrees of freedom (DOF), i.e., the energy landscape, determines the thermodynamic and kinetic properties of the molecule, thus motivating its characterization; however, several hurdles exist. First, defining a physically relevant energy function, generally via a force field, is inherently limited, as discussed in previous works.1-3 Second, the rough and high dimensional nature of macromolecular energy landscapes precludes brute-force sampling among the DOF because adequately sampling probabilistically important low-energy conformations for energy evaluation would require a computationally infeasible fine grid of the DOF space. On the other hand, energy-biased sampling suffers from a lack of information on the high-energy states, and sampling may still be quite sparse due to high dimensionality,4 particularly for unfolded states.5 Third, extracting a description of the conformational distribution from the sampled conformations is not straightforward. For many purposes, a conceptually reduced energy landscape model suffices in the form of a density of states (DOS) function, which gives Ω(E), the number of conformations with a particular energy E. Knowing an absolute DOS with sufficient accuracy, the partition function and thus most thermodynamic quantities such as moments of energy, entropy, and free energy can be calculated. Knowing absolute counts of conformations and their associated phase space “volume” can also facilitate kinetic and * To whom correspondence should be addressed. E-mail: [email protected] or [email protected]. † Institute of Biomedical Sciences. ‡ Current address: Novartis Institutes for Biomedical Research, M/S 4.2, 4560 Horton St., Emeryville, CA 94608. § National Tsing Hua University.

mechanistic inference from the DOS by providing a link between thermodynamics and molecular geometry. Consider protein folding, in which a protein navigates conformational space from a diffuse unfolded state through a more constricted transition state to a highly localized native state. Describing the mechanism involves specifying the breadth of accessible conformational space at each step,6-8 for example, knowing the number of conformations that compose high-energy extended states vs collapsed states vs the native basin. Furthermore, to interpret counts of conformations in a kinetic or mechanistic sense, knowing the conformational extent of a single conformation, or the distance between neighboring conformations, is required. However, except for enumerable systems such as lattice models, current methods9-12 for computing DOS generally yield only the ratios of microstate counts for different energies, i.e., Ω(E)/Ω(E′), while the scale is arbitrary. Arbitrary scaling of a DOS function amounts to ignorance of the conformational extent of a single conformation and by extension eliminates the possibility of characterizing dynamic transitions among the conformations implied by the scaling. In principle, the absolute DOS or counts of microstates at populated energies can be found if the degeneracy of any of the macrostates or the total number of conformations in phase space is known. Counting conformations for continuously variable systems further requires a set of assumptions as to what defines boundaries between conformations, upon which ref 13 discusses in depth. However, for a given conformation definition, an accurate estimate of the total number of conformations in phase space for large molecules, the size of proteins, has not yet been reported (to the best of our knowledge). Thus, one aim of this work is to estimate the total number of minima of 50-mer polyalanine, whose size corresponds to

10.1021/jp0625457 CCC: $33.50 © 2006 American Chemical Society Published on Web 05/28/2006

12126 J. Phys. Chem. B, Vol. 110, No. 24, 2006

Sullivan and Lim

naturally occurring small proteins. This number was estimated under an all-atom potential energy function1 based on the cumulative distribution function (CDF) of conformational differences13,14 (see Methods). With this estimate, any DOS function, such as the Gaussian DOS in the random energy model (REM),15 can be normalized to an absolute scale. Another aim is to show how CDF-based conformational counts can be used to assess the accuracy of current DOS functions. Here, we assess the accuracy of the common Gaussian DOS by evaluating the deviation of the REM absolute counts from the respective CDFbased ones (see Results). Although the calculations herein are limited to minima in the energy landscape, which are easily definable landscape features, we also formalize how the CDF could scale DOS functions for continuously variable protein representations to count conformations of a specified structural variance (see Discussion). Methods Relationship between Minima Count and CDF. We first show how the Ala50 minima count can be estimated from the cumulative distribution function of conformational differences, V(r).13,14 Conformational difference is measured by the rootmean-square distance (RMSD) between corresponding nonhydrogen atoms after optimal superposition of two conformers. For an ensemble sampled by w conformations, the CDF about conformer i is computed as

Vi(r) )

δij(r)/(w - 1) ∑ j*i

where r defines conformational difference (e.g., RMSD) and δij(r) ) 1 if RMSD(i,j) e r and 0 if RMSD(i,j) > r. Averaging the distribution over w conformers

V(r) ≡

∑i Vi(r)/w

(1)

For conformational minima l associated with finite probabilities, pl, the CDF of minimum k is given by

Vk(r) )

∑l plδkl(r)/∑l pl ) ∑l plδkl(r)

where k and l exhaustively enumerate all W minima. The ensemble average distribution

V(r) )

∑k pk ∑l plδkl(r)

If r ) 0, then δkl(r) ) 1 only for k ) l, so that

V(0) )

pkplδkl(r) ) ∑ pk2 ∑ k)l k

(2)

If all W minima are of equal probability (1/W), then eq 2 reduces to

∑k pk2 ) W(pk2) ) W(1/W)2 ) 1/W Thus, the total number of unique minima, W, can be estimated from 1/V(0) for discrete ensembles with equally probable conformations. Constructing the CDF. Except for the smallest chain lengths, direct sampling is generally insufficient to define V(r) to small RMSD. However, V(r) could be extrapolated to small RMSD by examining RMSD pair-distribution trends across a range of

chain lengths (8-50 amino acids).13,14 Thus, V(RMSD) for Ala50 was constructed using distribution properties of small chain length ensembles, in which conformation space can be densely sampled, to inform the extrapolation of longer chain length conformational distributions where only sparse sampling is computationally feasible. Thus, it was constructed from w ) 10 000 conformations each for chains of length 8, 10, 12, 15, 20, 25, 30, and 50, as described in ref 13. Generating Conformers. RPP-min Ensemble. To approximate equiprobable sampling of the continuous polyalanine conformational space, a random φ/ψ ensemble, RPP, was generated by uniform random sampling of the φ and ψ torsion angles, with atom interpenetration disallowed between residues separated by at least two residues and no compactness enforced.16 The random φ/ψ ensemble, RPP, was then energyminimized using the AMBER program17 and parm94 force field1 with a distance-dependent dielectric function given by i,j ) di,j, where di,j is the distance between atoms i and j. For the 50-mer ensemble, 98% reached an energy gradient of 10-4 kcal/ (mol Å) after 790 000 minimization steps. The resulting ensemble, referred to as RPP-min, was used in computing V(RMSD) for Ala50, from which the total number of Ala50 energy minima was estimated. MD-min Ensembles. Low-energy ensembles were initiated from random-walk polyalanine models with uniform sampling of φ and ψ, atom interpenetration disallowed but with compactness enforced to bias sampling to lower energy. These highenergy models were relaxed by molecular dynamics (MD) using a temperature schedule of 100 K (5 ps), 300 K (10 ps), 1000 K (10 ps), 400 K (10 ps), 350 K (10 ps), 300 K (10 ps), 200 K (10 ps), and 100 K (10 ps). Ensemble copies were minimized following 1000 K (at 25 ps), 300 K (at 55 ps), and 100 K (at 75 ps), yielding MD-min1000K, MD-min300K, and MD-min100K ensembles. For all 50-mer ensembles, 96-98% reached an energy gradient of 10-4 kcal/(mol Å) after more than 100 000 steps for the MD-min set, while for shorter chains, higher percentages reached the energy gradient threshold. As the kinetic history impacts the conformational and energy distributions, additional simulations initiated from RPP-min (generated from extended random walks, as opposed to compact) were performed with a temperature schedule of 300 K (5 ps), 1000 K (30 ps), and 700 K (30 ps). Ensemble copies were minimized following 1000 K (at 35 ps) and 700 K (at 65 ps) yielding MD-min21000K and MD-min2700K ensembles. The five MD-min ensembles (MD-min1000K, MD-min300K, MD-min100K, MD-min21000K, and MD-min2700K) were used in computing V(RMSD) for Ala50. Minima counts, 1/V(0), were derived from each MD-minimized ensemble using an extrapolation scheme, as outlined in ref 13. Results The total number of energy minima for Ala50 can be estimated from the average cumulative number of minima surrounding a conformational minimum, V(r)/V(0), which was derived from the RPP-min ensemble (Figure 1). On average, ∼10 minima lie within 0.5 Å, rising to ∼104 minima within 1 Å RMSD. A small fraction (0.03%) of all minima lie within 6 Å RMSD. The total number of Ala50 minima is ∼1038 from Figure 1. This estimate is likely to be a lower bound because the RPP-min ensemble (see Methods) assumes (1) each minimum’s conformational space catchment region is identical and (2) an implicit solvent model (a distance-dependent dielectric function). Accounting for variance in basin size should increase the minima count estimate because eq 2 is minimized when all W mi-

Density of States Calculations for Proteins

J. Phys. Chem. B, Vol. 110, No. 24, 2006 12127

Figure 1. Cumulative radial minima count, V(r)/V(0), for Ala50 RPPmin.

Figure 2. Effective temperatures for mean ensemble energies.

crostates are equiprobable.18 Adding explicit solvent should also introduce energetic roughness and increase the number of minima.13 Given the total number of energy minima, the next question is how are the energies distributed, i.e., what is the DOS, Ω(E)? As our emphasis is to inform the scale rather than the shape of the DOS, we assume a Gaussian or normally distributed DOS, which is inherent in the REM.15 In the REM, a conformation’s energy is approximated as a sum of energetic terms drawn from independent distributions. In the limit of many energetic terms, the REM formulation suggests a normal DOS distribution,

Ωnorm(E) ) W/(σx2π) exp(-(E - µ)2/2σ2)

(3)

For the RPP-min model of Ala50, the mean energy (µ) and energy deviation (σ) are -6.1 and 31.4 kcal/mol, respectively, while W, the total number of minima, which defines the absolute scaling, is 1038. Given an absolute DOS function (eq 3), the corresponding probabilistically weighted mimima count, 1/VREM(0), can be computed by rewriting eq 2 as follows. First, from the DOS (eq 3), the “effective” temperature, T, of an ensemble, which could be much greater than the simulation temperature, can be defined by calculating expected energies, 〈E〉, as a function of temperature (Figure 2). 〈E〉 is calculated from the temperaturedependent energy distribution

p(E) ) Ωnorm(E) exp(-E/kT)/Z where Z is the partition function

Z)

∫ Ωnorm(E) exp(-E/kT) dE

∫ Ωnorm(E)(exp(-E/kT)Z-1)2 dE

derived from the minimized ensembles, the deviation in DOS from REM behavior can be determined. Figure 3 shows that the DOS deviates from REM behavior, except at an infinite temperature, corresponding to the initial uniform sampling of φ and ψ (Figure 3). The fall in 1/V(0) with decreasing mean energy for the MD-minimized ensembles (Figure 3, open circles) is much steeper than that predicted by REM (Figure 3, solid curve), showing far fewer low-energy minima actually exist than predicted by REM. This is probably because REM-ignored correlations among energetic interactions serve to raise the energy of many minima that would otherwise be low in energy. In other words, many combinations of energetic terms that sum to low energy in the mean field are in fact collectively associated with high-energy conformations. Chain overlap may be a significant source of higher energy. Steric excluded volume should exclude a relatively higher fraction of conformations in the vicinity of compact, low-energy conformations.19 It should be less constraining in the expanded RPP-min ensemble (CR-Rγ ) 11.1 Å) than in the more compact, lower energy, ensembles (e.g., CR-Rγ ) 8.9 Å for MD-min100K). The actual deviations from REM behavior are probably more severe than those depicted in Figure 3. True Boltzmann sampling would probably lead to more thorough sampling of low-energy states for a particular mean energy. Low-energy sampling would be focused among the fewer low energy minima, which should result in smaller statistical minima counts (i.e., larger V(0) values). The protocol used here, in which the ensembles initially sample φ/ψ space uniformly, is essentially the opposite of heating from a single low-energy microstate, in which case insufficient equilibration leads to understating the breadth of conformational space for a particular energy level. Furthermore, accounting for basin size variance and introducing explicit solvent would increase the current estimate of 1038 minima, which in turn would increase the number of REM minima counts. Thus, refinements to the present calculations would amplify the above conclusion that the REM overestimates the number of low-energy minima. Discussion

Assuming a distribution of energies consistent with a Boltzmann distribution at the effective temperature, the probability of occupying a particular minimum k of energy E is pk ) exp(-E/kT)/Z. Equation 2 can be rewritten as a function of the Ωnorm(E) to give VREM(0) for a given effective temperature

VREM(0) )

Figure 3. REM predicted weighted minima counts, 1/VREM(0) (solid line), as a function of mean ensemble energy, compared to weighted minima counts derived from minimized ensembles of Ala50 (open circles).

(4)

By comparing the REM predicted 1/VREM(0) curve (eq 4) with the respective CDF-based 1/V(0) values (eq 1, extrapolated13)

We have shown here how the CDF, which captures exclusively geometric properties of an ensemble, can leverage energy distribution calculations to provide a more complete picture of the energy landscape. The inverse CDF, 1/V(0), gives a statistical minima count (see eq 2), which for 50-mer polyalanine is estimated to be 1038. As discussed in ref 13, resolving CDF extrapolation technical hurdles should enable routine application to macromolecules. Hence, the combination of CDF and relative DOS calculations can yield absolute DOS even for systems the size of proteins, while CDFs from quenched MD ensembles can test the accuracy of any given DOS function.

12128 J. Phys. Chem. B, Vol. 110, No. 24, 2006 Consequently, a promising extension of this work is to combine CDF information with other relative DOS calculations. For example, relative DOS for polyalanine12 and a protein fragment11,20 without reference to a structural resolution have been computed using modifications to the Wang-Landau21 algorithm, which is based on histogram-reweighting methods.22 V(r) could be used to place relative DOS on a referenced scale using a fixed atomic coordinate “variance-sphere”, defined by rref, for continuous conformational space. Analogous constructs have been used for calculating coordinate variance based relative entropy differences between polyalanine ensembles reflecting different constraint sets.23 The referencing makes no assumptions regarding the shape of the energy landscape, whereas counting conformational basins (i.e., collections of minima with mutually low interconversion barriers) assumes a hierarchical landscape model.24 However, the reference element can serve as a basic volume element for quantifying the extent of landscape features that display characteristic atomic variances. For continuously variable systems, DOS referencing can use the following scheme. Given V(r) calculated from an equilibrated ensemble, the weighted conformation count, V(rref), can scale the DOS to yield an agreeing calculated V(rref) value. Given Ω(E) known to within some multiplicative constant, m, then Ω(E) ) mΩr(E), and Zr is similarly defined by Z ) mZr. Equation 4 can be rewritten as

V(rref) ) m-1

∫ Ωr(E)(exp(-E/kT)/Zr)2 dE

Solving for m

m ) V(rref)-1

∫ Ωr(E)(exp(-E/kT)/Zr)2 dE

rref must be small enough that the absolute density within rref does not vary over the temperature range of interest. Operationally, this occurs where the slope of V(r) has converged on the number of degrees of freedom of the system.14 From the DOS of different systems placed on the same reference scale, relative thermodynamic quantities such as entropy differences can be calculated. This is not possible if the DOS functions are arbitrarily scaled. In conclusion, we have shown that the CDF can quantify conformational distributions and energy landscapes by extension,

Sullivan and Lim and can potentially restate the DOS of continuously variable protein models in terms of defined structural fluctuation. Acknowledgment. This work is supported by Academia Sinica and the National Science Council, Taiwan. References and Notes (1) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M. J.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179-5197. (2) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; JosephMcCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586-3616. (3) Jorgensen, W. L. OPLS Force Fields. In The Encyclopedia of Computational Chemistry; Schleyer, P., Ed.; John Wiley & Sons Ltd.: Chichester, U.K., 1998; Vol. 3; pp 1986-1989. (4) Clarage, J. B.; Romo, T.; Andrews, B. K.; Pettitt, B. M.; Phillips, G. N. J. Proc. Natl. Acad. Sci., U.S.A. 1995, 92, 3288-3292. (5) Day, R.; Daggett, V. Proc. Natl. Acad. Sci., U.S.A. 2005, 102, 13445-13450. (6) Leopold, P. E.; Montal, M.; Onuchic, J. N. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 8721-8725. (7) Dill, K. A.; Chan, H. S. Nat. Struct. Biol. 1997, 4, 10-19. (8) Li, A.; Daggett, V. J. Mol. Biol. 1996, 257, 412-429. (9) Wang, F. G.; Landau, D. P. Phys. ReV. E 2001, 64, 056101. (10) Wang, J.-S.; Swendsen, R. H. J. Stat. Phys. 2002, 106, 245-285. (11) Rathore, N.; Knotts, T. A.; de Pablo, J. J. Biophys. J. 2003, 85, 3963-3968. (12) Ghulghazaryan, R. G.; Hayryan, S.; Hu, C.-K. Submitted for publication in J. Comput. Chem. (13) Sullivan, D. C.; Lim, C. Submitted for publication in J. Phys. Chem. B. (14) Sullivan, D. C.; Kuntz, I. D. Biophys. J. 2004, 87, 113-120. (15) Derrida, B. Phys. ReV. Lett. 1980, 45, 79-82. (16) Gregoret, L. M.; Cohen, F. E. J. Mol. Biol. 1991, 219, 109-122. (17) Pearlman, D. A.; Case, D. A.; Caldwell, J. W.; Ross, W. S.; Cheatham, T. E., III; DeBolt, S.; Ferguson, D. M.; Seibel, G. L.; Kollman, P. A. Comput. Phys. Commun. 1995, 91, 1-41. (18) Simpson, E. H. Nature 1949, 163, 688. (19) Dill, K. A. Biochemistry 1985, 24, 1501-1509. (20) Rathore, N.; Yan, Q. L.; de Pablo, J. J. J. Chem. Phys. 2004, 120, 5781-5788. (21) Wang, F. G.; Landau, D. P. Phys. ReV. Lett. 2001, 86, 2050-2053. (22) Ferrenberg, A. M.; Swendsen, R. H. Phys. ReV. Lett. 1988, 61, 2635-2638. (23) Sullivan, D. C.; Lim, C. J. Chin. Chem. Soc. (Taipei, Taiwan) 2004, 51, 1209-1219. (24) Ansari, A.; Berendzen, J.; Bowne, S. F.; Frauenfelder, H.; Iben, I. E. T.; Sauke, T. B.; Shyamsunder, E.; Young, R. D. Proc. Natl. Acad. Sci. U.S.A. 1985, 82, 5000-5004.