Quantifying Polypeptide Conformational Space - ACS Publications

Jul 26, 2006 - The CDF represents the probability of two randomly selected points in conformation space being separated by less than a conformational ...
0 downloads 0 Views 217KB Size
J. Phys. Chem. B 2006, 110, 16707-16717

16707

Quantifying Polypeptide Conformational Space: Sensitivity to Conformation and Ensemble Definition David C. Sullivan*,†,§ and Carmay Lim*,†,‡ Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, and the Department of Chemistry, National Tsing Hua UniVersity, Hsinchu 300, Taiwan ReceiVed: NoVember 29, 2005; In Final Form: June 29, 2006

Quantifying the density of conformations over phase space (the conformational distribution) is needed to model important macromolecular processes such as protein folding. In this work, we quantify the conformational distribution for a simple polypeptide (N-mer polyalanine) using the cumulative distribution function (CDF), which gives the probability that two randomly selected conformations are separated by less than a “conformational” distance and whose inverse gives conformation counts as a function of conformational radius. An important finding is that the conformation counts obtained by the CDF inverse depend critically on the assignment of a conformation’s distance span and the ensemble (e.g., unfolded state model): varying ensemble and conformation definition (1 f 2 Å) varies the CDF-based conformation counts for Ala50 from 1011 to 1069. In particular, relatively short molecular dynamics (MD) relaxation of Ala50’s random-walk ensemble reduces the number of conformers from 1055 to 1014 (using a 1 Å root-mean-square-deviation radius conformation definition) pointing to potential disconnections in comparing the results from simplified models of unfolded proteins with those from all-atom MD simulations. Explicit waters are found to roughen the landscape considerably. Under some common conformation definitions, the results herein provide (i) an upper limit to the number of accessible conformations that compose unfolded states of proteins, (ii) the optimal clustering radius/conformation radius for counting conformations for a given energy and solvent model, (iii) a means of comparing various studies, and (iv) an assessment of the applicability of random search in protein folding.

Introduction Modeling the conformational distribution; that is, the density of conformations over phase space is a recurring problem in characterizing macromolecular processes such as protein folding. Knowing equilibrium populations of initially unfolded conformations (e.g., before diluting denaturant),1,2 which in turn requires accurate knowledge of the energy landscape under conditions of interest3 and efficient landscape sampling methods, is required to model in vitro folding experiments. Knowing conformational distributions across conformation space in the approach to the folded state as a function of time4,5 is required to differentiate between various folding mechanisms.6-8 Developing these descriptions for molecular processes faces several hurdles, one of which is the lack of accurate methods for modeling conformational distributions. This problem forms the basis of this work, which attempts to model an accurate conformational distribution for a simple polypeptide such as an N-mer polyalanine. The key obstacle to accurately modeling conformational distributions is the high dimensionality of macromolecular conformational spaces; therefore conformational distributions can be accurately modeled only for model systems or real molecules with relatively few degrees of freedom (DOF). For * To whom correspondence should be addressed. E-mail: [email protected]; [email protected]. † Institute of Biomedical Sciences. ‡ National Tsing Hua University. § Current address: Novartis, M/S 4.2, 4560 Horton St., Emeryville, CA 94608.

the latter, graph theory techniques can subdivide the energy landscape into basins of thermally accessible minima.9-11 For proteins, lattice models have been used to reduce the conformational space so that probabilities used in both thermodynamic12 and kinetic13-16 calculations can be exhaustively evaluated. However, for macromolecules represented by allatom models, partitioning conformational space into all energetic basins or sampling all energy minima becomes computationally prohibitive. Because of the high dimensionality of macromolecular conformational spaces, sampling in current all-atom simulations is necessarily sparse, even for relatively localized systems such as native states of proteins;17 hence, direct sampling over the entire phase space of interest for a given thermodynamic state can currently yield only a very lowresolution model of the conformational distribution. How fine should conformational distributions be modeled for practical purposes? A useful resolution for protein folding should be at least as fine as the native-state “target”, defined by thermal fluctuations of the order of 1 Å. Ideally, the distribution should be defined to capture functional protein motions, which might involve hops between energy minima18 associated with mean atomic displacements of ∼0.1 Å.19 Although coarser grained models of conformational space have advanced theories of protein folding considerably,20 higher resolution models are needed to account for the 5-15 kcal/mol free energy difference between native and unfolded states. For example, uncertainty in the conformational breadth of the native state can introduce significant energetic uncertainties. Comparing different crystal structures of the same protein suggests mean native-state atomic

10.1021/jp0569133 CCC: $33.50 © 2006 American Chemical Society Published on Web 07/26/2006

16708 J. Phys. Chem. B, Vol. 110, No. 33, 2006 displacements of ∼0.4 Å,21 while NMR generated ensembles reveal ∼1 Å displacements.22 The entropic cost difference between localizing a protein ensemble to these two scales has been estimated at 80 kcal/mol for a 150 amino acid protein using a simple random-walk protein representation;23 this is far in excess of global stability, underscoring the need for accurate and precise conformational distribution models. The cumulative distribution function (CDF) of conformational differences24 could resolve conformational distributions to fine detail as it can account for the complex correlation among atom positions in macromolecules. The CDF represents the probability of two randomly selected points in conformation space being separated by less than a conformational distance (see Theory and Methods section). A key advantage in using the CDF is that it reduces the conformational distribution from a density function of dimensionality equal to ∼3 times the number of atoms to a function of only one dimension, namely, the conformational distance. On the other hand, the difficulty with applying the CDF is that often the distribution must be defined to very small r, beyond what numerical experiments can directly access. Extrapolation of the CDF to very small r (small probabilities), which may exceed 100 orders of magnitude in probability, strongly depends on assumptions concerning the shape of the distribution. For polypeptide conformations generated by random walks, CDFs with high relative accuracy can be achieved by using the CDF slope to provide a first-principle’s constraint on the CDF and help to extrapolate the CDF to small probabilities.23,25 The main purpose of this work is to quantify the conformational distribution for a simple polypeptide such as an N-mer polyalanine using the CDF. To this end, we advance the CDF modeling techniques developed in previous works, which were limited to random-walk23-25 and two-dimensional (2D) square lattice models,26 to all-atom models. We show how the conformational distribution responds to changes in temperature and individual forces such as torsional and repulsive van der Waals (vdW) forces by comparing CDFs derived from different polyalanine ensembles generated by (a) random-walk-based algorithms, (b) molecular dynamics (MD) using a simulated annealing-like temperature schedule, and (c) energy minimization (see the Theory and Methods section). We also show that, in the limit of very small probabilities, CDFs derived from energy-minimized conformations yield an estimate of the number of significantly populated energy minima for proteinsized molecules, whose number of energy minima cannot be directly accessed with present computational power. The collection of CDFs presented here provides a useful framework for clarifying disagreements in the literature on the density and extent of conformational space in the context of protein folding. This discussion illuminates the importance of appropriate and interpretable conformation definitions. A companion paper27 formalizes the probabilistic interpretation of the CDF used herein and shows how the CDF can inform the scale of density of states calculations to yield absolute density of states, from which the partition function and most thermodynamic quantities can be calculated. Theory and Methods 1. Generating Molecular Ensembles. YARN-generated Ensembles. Pseudo-random polyalanine conformations of length N ) 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, and 50 were generated using the program YARN28 with canonical bond lengths, bond angles, and ω-torsions. “Compact” conformations are constrained during generation to fit inside an ellipsoid boundary

Sullivan and Lim

Figure 1. Molecular dynamics temperature schedule. The temperature schedule used for all chain lengths is plotted, with asterisks indicating trajectory positions where a copy of the ensemble is energy minimized. Three conformations, chosen arbitrarily from a single trajectory out of the 10 000 trajectories for Ala50, pictorially represent the level of optimization in structure achieved at different time points. The helical contents of these particular conformations are 0% (initial YARN, left), 20% (1000 K MD, middle), and 48% (minimized 100 K, right); the ensemble average values are 0.7%, 13%, and 42%, respectively.

with a volume of 123.9N Å3. “Extended” conformations have no such constraint. Ensembles, each consisting of 10 000 members, were generated using four excluded volume combinations: “local/distal”, “local/no distal”, “no local/distal”, and “no local/no distal”. Conformations with “local” excluded volume sample φ/ψ torsion angles based on experimentally observed frequencies, while those with “no local” excluded volume draw φ/ψ torsion angles uniformly random over 0-360°. Conformations with “distal” excluded volume disallow atom interpenetrations between residues separated by at least two residues, where atomic boundaries are defined by vdW radii reduced by 10%, whereas those with “no distal” excluded volume do not have such constraints. MD-generated Ensembles. The YARN ensemble generated using “compact/no local/distal” excluded volume was used as the starting point for MD calculations performed by the AMBER program29 and parm94 force field30 (see below). For each YARN-generated conformation, hydrogen atoms were added and the hydrogen-built conformation was energy minimized for 2000 steps. The electrostatic effect of water molecules was modeled by a distance-dependent dielectric function given by i,j ) di,j, where di,j is the distance between atoms i and j. A residue-based 10 Å cutoff distance was used in computing nonbonded interactions. During each simulation, all bonds involving hydrogen atoms were constrained using the SHAKE31 algorithm. For each N-mer polyalanine, 10 000 MD simulations were carried out using a time step of 1 fs according to the following temperature schedule (Figure 1): 100 K (5 ps), 300 K (10 ps), 1,000 K (10 ps), 400 K (10 ps), 350 K (10 ps), 300 K (10 ps), 200 K (10 ps), and 100 K (10 ps). Energy-minimized Ensembles. At five points during the simulation, namely, at 5, 15, 25, 55, and 75 ps after the first 100, 300, and 1000 K and second 300 and 100 K simulations, respectively, the 10 000 conformations in each ensemble were energy minimized for 10 000 steps or to a root-mean-square energy gradient of 10-4 kcal/(mol Å) if reached first (Figure 1). The first 20 minimization steps used steepest descent, while the remaining steps employed conjugate gradient. This minimization schedule is adequate for characterizing conformational minima distributions, as the gradient used localizes pairs of conformations within the same energy minimum to a root-meansquare-deviation (RMSD) < 0.005 Å, which is smaller than the smallest landscape features. Whereas all short-chain (N e 6) conformations from the 75 ps ensembles could be energy minimized to a root-mean-square energy gradient of 10-4 kcal/(mol Å), longer (N > 6) chain

Quantifying Polypeptide Conformational Space

J. Phys. Chem. B, Vol. 110, No. 33, 2006 16709

conformations increasingly do not all achieve convergence: For N ) 20, 97% reached the threshold energy gradient while for N ) 50 only 72% converged. When an additional 90 000 steps of minimization were allowed, 97% converged for N ) 50; however, the reduction in mean ensemble energy was only 0.017 kcal/mol, while the change in conformational distribution (as defined below) was negligible. Some of the remaining nonconvergent conformations quickly minimized to the threshold gradient when the nonbonded cutoff was removed, indicating that discontinuities in the cutoff-based energy landscape result in misclassification of some minima under a gradient definition. 2. Computing CDFs. CDF Definition. For a given conformation i in an ensemble with w members (here w ) 10 000), the cumulative distribution about i, Vi(r), is computed as the fraction of conformations within distance r of conformation i

Vi(r) )

δij(r)/(w - 1) ∑ j*i

(1)

where

δij(r) )

{

1 ri,j e r 0 otherwise

(2)

Averaging the distribution over each of the w conformations in the ensemble gives

V(r) ≡

∑i Vi(r)/w

(3)

Note that in this work, Vi(r) is averaged on a linear scale. If, instead, the component Vi(r) distributions are averaged on a logarithmic scale, then direct connections to entropy24 are possible. The probabilistic interpretation of linearly averaged V(r) is expanded upon below. Note that the CDF reflects the distribution in the ensemble sample; hence, if the sampling does not consider energy, as in the YARN ensembles, then the CDF will likewise be ignorant of unequal weights among conformations. Conformation Displacement Measure. The CDF avoids assumptions on the shape of the energy landscape granting some freedom in difference measure choice. Although earlier CDF studies on 2D square lattice ensembles used a distance-matrix difference measure because it yielded surprisingly simple results in some cases,26 this work uses the RMSD between heavy atoms after optimal superposition of conformations as the conformational displacement measure, r, for the following reasons. (1) RMSD values are intuitively interpretable: thermal fluctuations within the native state well are on the order of 1 Å, while structures separated by less than ∼6 Å generally display the same fold. (2) RMSD of Cartesian coordinates is commonly used, thus enabling direct comparison with RMSD-based results from the literature (see Discussion section). (3) Using atomcoordinate RMSD with the CDF has been shown to yield accurate conformational entropy measurements in multipleminima landscapes.24 Defining Low Probability CDF Tails. The low probability (small RMSD) tails of V(RMSD) cannot be adequately sampled, therefore V(RMSD) cannot be computed directly by eqs 1-3 but was extrapolated by the same strategy used in previous work23 based on its logarithmic slope: n(r) ) d[log V(r)]/d(log r). By first transforming V(RMSD) and RMSD to a logarithmic scale, n(RMSD) can be calculated using the slope from a leastsquares line fit over a neighborhood of V(RMSD) points about r. The resulting slope function, n(RMSD), differs depending on the chain length, N (Figure 2a) with n(r f 0) converging to

Figure 2. Scaling r and normalizing n, 300 K. Plot of n(RMSD) (a), n(SRMSD) (b), nnorm(RMSD) (c), and nnorm(SRMSD) derived from the MD conformations of the 300 K (55 ps) ensemble for N ) 15, 20, 25, 30, and 50.

the number of degrees of freedom (DOF) for continuously variable systems.23 The conformational distance measure RMSD also depends on the chain length, N.32 Thus, to superimpose the slope functions across the different chain lengths, we reduced the protein-size dependence inherent in n and RMSD by scaling them to yield nnorm and SRMSD, respectively. Figure 2 illustrates the slope function overlay and the Results section details CDF extrapolation for an MD ensemble and an energy-minimized ensemble. For reference, the n and RMSD scaling operations are provided below. These methods differ slightly from earlier work24 in that more parameters are required here to capture the complex atom position correlations reflected in conformational distributions associated with ensembles subjected to all-atom force fields. Scaling RMSD. To account for the protein-size dependence inherent in the conformational distance measure, RMSD was scaled to a “universal” SRMSD, defined by the parameters a and bi. The first step is to define parameter intervals for bi. Because the RMSD scaling depends on the chain length, N, defining parameter intervals first involves choosing a reference chain length and segmenting its (unscaled) RMSD range into equal lengths. We use the longest chain length (N ) 50) for reference and segmented the RMSD into 0.5 Å intervals. Thus, given the parameters a and bi, the interval boundaries (indexed by I) in terms of SRMSD can be calculated. SRMSDI)1 is always at zero. For succeeding interval boundaries I

SRMSDI )

0.5(50 - a)b ∑ i)2

i

(4)

For all other (smaller) chain lengths, the parameter intervals are mapped back to RMSD by unscaling SRMSDI to RMSDI(N) for a given chain length N less than 50 I

RMSDI(N) )

(SRMSDi - SRMSDi-1)/(N - a)b ∑ i)2

i-1

(5)

This mapping ensures that identical parameters are applied to corresponding sections of the n(RMSD) curves across a set of chain lengths. Any RMSD for any chain length can be scaled to SRMSD by first identifying the largest (N-dependent) parameter interval boundary, RMSDI(N) (eq 5), that is less than

16710 J. Phys. Chem. B, Vol. 110, No. 33, 2006

Sullivan and Lim

RMSD, then identifying the SRMSD mapping for I (eq 4), and finally adding an interpolation factor

minimum (i ) j), so that eq 11 reduces to Simpson’s33 diversity index

〈Vi(r)〉 )

SRMSDI(N) ) SRMSDI + (RMSD - RMSDI)(N - a)bI (6) In contrast to the n(RMSD) curves in Figure 2a, the n(SRMSD) curves for different chain lengths exhibit similar shapes with different maxima heights at a similar SRMSD of ∼2 (Figure 2b). Normalizing n. Additionally, the function n must be normalized to account for the differing number of DOF,23,25 which scales linearly with chain length. In contrast to the n(SRMSD) curves in Figure 2b, the nnorm(RMSD) curves for different chain lengths exhibit similar maxima heights, which occur at differing RMSD values (Figure 2c). The normalized slope as a function of SRMSD, nnorm(SRMSD), is given by

nnorm(SRMSD) ) n(SRMSD)/(N + c) c ) cI +

(

)

SRMSD - SRMSDI (c - cI) SRMSDI+1 - SRMSDI I+1

(7) (8)

where SRMSDI < SRMSD < SRMSDI+1. The new slope function, nnorm(SRMSD), is nearly independent of N (Figure 2d). The parameter space for a, bi, and ci is searched stochastically by minimizing an objective function based on the spread in nnorm(SRMSD) across ensembles. 3. Estimating the Coverage of φ/ψ Space. The distribution extrapolation methods outlined above implicitly assume that ensemble properties bear a simple chain-length scaling dependence. We investigate this assumption by comparing the φ/ψ torsion variance on a per-residue basis across ensembles of different chain lengths using the residue entropy (SR), which is defined as

SR ) -

p(φ,ψ) ln p(φ,ψ) ∑φ ∑ ψ

(9)

where p(φ/ψ) is the probability of the residue occupying a φ/ψ element in torsion-angle space calculated across all w conformations, using a 6° × 6° bin resolution. These “entropy” terms are likely not additive and are only used for descriptive purposes here. For a given chain length, the mean entropy over all nonterminal residues, 〈SR〉, can be calculated. The effective fractional coverage of φ/ψ space was calculated by dividing the exponent of 〈SR〉 by the total number of 6° × 6° bins, that is, exp(〈SR〉)/3600. This quantity can range from 1/3600 for all conformations falling in a single bin, to unity for uniform coverage. 4. Counting Conformational Minima. If the conformations are discrete, as in the case of conformational minima, then a finite probability can be associated with each conformation. In this case, the cumulative distribution about conformation i, where i and j exhaustively enumerate all W conformations potentially accessible to the system, can be interpreted as

Vi(r) )

∑j pjδij(r)/∑j pj ) ∑j pjδij(r)

(10)

The ensemble average distribution is then

〈Vi(r)〉 )

∑i pi ∑j pjδij(r)

(11)

If r ) 0, then δij(r) ) 1 only if i and j identify the same

pipjδij(r) ) ∑ pi2 ∑ i)j i

(12)

which is minimized by the special case of all W minima being of equal probability (1/W)

∑i pi2 ) Wpi2 ) W(1/W)2 ) (1/W)

(13)

and V(0) equals 1/W. For discrete ensembles with equally probable conformations, 1/V(r) provides a count of the number of minima if r is sufficiently small. For ensembles with a distribution of minima probabilities, 1/V(0) will be less than the total number of minima. Results 1. Ensemble Characterization. In this section, we examine the conformational differences among different ensembles to help compare the results with similar studies in the literature. We also examine conformational differences across chain lengths to help understand limits to the assumption that structure dependence on chain length can be “normalized” out in the CDF calculation. YARN Vs MD Ensembles. The MD ensembles differ substantially from the starting YARN ensembles examined in earlier studies23,24 in that the former include (a) more DOF, (b) intrachain electrostatic interactions, and (c) solvent effects. Each YARN conformation has only 2N - 2 internal heavy-atom DOF (two torsion angles, φ and ψ, per residue minus the two terminal torsions for which rotations do not displace any heavy atoms). On the other hand, each MD conformation has 15N - 3 internal heavy-atom DOF (five heavy atoms per residue, each with three DOF, minus six overall rotation and translation DOF, plus three DOF for an added C-terminal carboxylate oxygen). The additional DOF, however, are associated with covalent bonding and ω-torsion, which is relatively rigid compared with the more flexible φ/ψ backbone torsions. Displacements along these relatively rigid DOF are expected to contribute significantly to the conformational distribution only at small conformational distances. The MD conformations also differ from YARN ones in including attractive r-6 vdW interactions and long-range r-1 electrostatic interactions among the residues and between residues and implicit water molecules. Thus, the conformational distribution generated from the initial random walk (YARN ensemble) would be expected to differ substantially from that derived from the MD ensemble after 70 ps. Energy-minimized Vs MD Ensembles. As expected, different conformational distributions are generated by MD and energy minimization: minimizing MD ensemble copies yields conformers that are generally (i) lower in energy (Figure 3), (ii) more compact (Figure 4), (iii) more helical (Figure 5), and (iv) more confined in φ/ψ torsional space (Figure 6). At 1000 K, the mean internal energy for Ala50 (Figure 3) is higher than that at all earlier points, indicating that the MD conformations are high on the walls of the potential energy surface, whereas the minimized ensemble copy has a lower mean energy than earlier times, indicating that the ensemble has moved into a lower energy region of conformational space. Energy minimization yields compaction, as measured by the radius of gyration (Rγ) calculated over CR atoms (Figure 4, compare filled vs open circles). The amount of compaction with energy minimization increases with the temperature of the originating ensemble.

Quantifying Polypeptide Conformational Space

Figure 3. Mean ensemble potential energies along Ala50 trajectories. The mean energies of the MD ensembles (a) are plotted separately from those of minimized ensembles (b), which vary over a much smaller range. The error bars plot the deviation (σ) in energy. The lines connecting points with error bars only serve to guide the eye. Energy values at time equal to zero in plots (a) and (b) are obtained from the minimized compact YARN conformations, which were generated using “distal” but “no local” excluded volume.

Figure 4. Radii of gyration. Plot of the ensemble average radii of gyration for MD (O) and energy-minimized (b) ensembles at different time points for polyalanine of lengths N ) 3, 6, 10, 15, 20, 30, and 50 with chain lengths labeled. Radii monotonically increase with N.

Figure 5. Helical contents. Plot of the fraction of helical residues for MD (O) and energy-minimized (b) ensembles at different time points for polyalanine of lengths N ) 3, 6, 10, 15, 20, 30, and 50 with chain lengths labeled. The fractions at t ) 75 ps monotonically increase with N.

Energy minimization, especially of the 1000 K MD ensemble, increases helical content (Figure 5). “Helices” are collectively defined by R, 3/10, and π helices.34 The fraction of residues adopting β-strand conformation is less than 0.01 in all cases. Energy minimization also results in less per-residue conformational variance (Figure 6), as measured by the effective fraction of torsional space sampled, exp(〈SR〉)/3600, where 〈SR〉 is the

J. Phys. Chem. B, Vol. 110, No. 33, 2006 16711

Figure 6. Torsion space sampling. Plot of the average fraction of φ/ψ space sampled on a per-residue basis, exp(〈SR〉)/3600, for MD (O) and energy-minimized (b) ensembles at different time points for polyalanine of lengths N ) 3, 6, 10, 15, 20, 30, and 50 with selected chain lengths labeled. For improved clarity, the data for increasing chain lengths are increasingly right shifted.

residue entropy given by eq 9, averaged over N - 2 nonterminal residues. Short Vs Longer Polypeptides. Different dominant forces are expected to operate in short and long chains: covalent bonding and torsion-associated forces likely dominate the energy landscape in short peptides, whereas nonbonded interactions between residues far apart in sequence yet close in space become more important and compete with local interactions to facilitate wider exploration of torsion-angle space in longer chains. Thus, not surprisingly, the conformational distributions of short polypeptides (N < ∼15) differ from those of the longer chains (N g ∼15) that can attain the globular properties of proteins. The longer chains result in higher helical content, whereas the smallest chain lengths (N ) 3, 6) exhibit nearly no secondary structure for the final minimized ensembles at 75 ps (Figure 5). A sharp increase in secondary structure occurs between N ) 10 and 15, with additional chain length modestly increasing helical content. The φ/ψ sampling fraction (Figure 6) betrays more complicated chain-length dependence. At short times (t e 5 ps), per-residue φ/ψ sampling increases with chain length. At t g 25 ps, qualitative differences emerge between short (N < 10) and longer (N > 10) chains. φ/ψ coverage increases with chain length for short chains but decreases with chain length for longer chains. The drop in φ/ψ coverage at longer chain lengths is concomitant with increasing secondary structure formation. This trend holds upon energy minimization. Because of the observed differences between short and longer length polypeptides, short (N < 15) polypeptide data were excluded in modeling the “universal” conformational distribution by an nnorm(SRMSD) (eqs 4-8). This exclusion, in practice, limits the range of the V(RMSD) distribution to medium and large resolution, for example, RMSD > 0.5 Å for the T ) 300 K ensembles. 2. CDF Construction. Below, we describe how the CDF, V(RMSD), is constructed, using distribution properties of small chain-length ensembles, in which conformation space can be densely sampled, to inform the extrapolation of longer chainlength conformational distributions where only sparse sampling is computationally feasible. CDFs were derived from (i) ensembles with continuous conformational density generated by YARN and MD as well as (ii) ensembles with discrete conformers generated by subsequent energy minimization. CDFs from MD Ensembles. Constructing V(RMSD) from MD ensembles and its extrapolation to low RMSD is exemplified here using the 300 K (55 ps) ensembles (see Figure 7). The first step in constructing V(RMSD) over the full RMSD range is to calculate V(RMSD) according to eq 3. This operation

16712 J. Phys. Chem. B, Vol. 110, No. 33, 2006

Figure 7. Distribution functions from 300 K molecular dynamics ensemble. Plot of the distribution functions V(RMSD) (a) and n(RMSD) (b) derived from the MD conformations of the 300 K (55 ps) ensemble for N ) 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, and 50 (from left to right). The corresponding normalized nnorm(SRMSD) (c) and extrapolated V(RMSD) (d) for N ) 15, 20, 25, 30, and 50 are plotted below n(RMSD) and V(RMSD), respectively.

defines the V(RMSD) distribution to high resolution only for the smallest chain-length ensembles: the 3-mer distribution is defined to 0.09 Å, whereas the 50-mer distribution is only defined to RMSD ) 3.7 Å (Figure 7a). For distances smaller than these terminal RMSD values, the CDF can be estimated by exploiting the similar shape in CDF functions across chain length (Figure 7a). That the V(RMSD) distributions for chains of various lengths share a similar shape becomes more apparent in the derivative distribution function, n(RMSD) (Figure 7b), which expresses the relationship between the DOF and conformational displacement. However, distilling a “universal curve” for n(RMSD) requires a normalization (or rescaling) in both the RMSD (eqs 4-6) and the dependent variable, n (eqs 7 and 8). For example, the peak for N ) 30 at 4 Å corresponds for N ) 12 to a peak both shorter and shifted back to 2.5 Å. The reduced height reflects fewer DOF in the shorter 12-mer Ala; the back-shift indicates smaller magnitudes of displacement as reported by coordinate RMSD. The utility in generating the universal n curve (Figure 7c) lies in creating a mapping function between conformational distributions of shorter and longer chains. We illustrate how V(RMSD) was extrapolated using its loglog slope, n(RMSD), which is also defined to different RMSD values for different chain lengths (Figure 7b), by showing how short chains inform the CDF extrapolation of the longest chain, Ala50. Superimposing the slope functions across chain lengths N g 15, as described in the Theory and Methods section, initially defines the new slope function, nnorm(SRMSD), to the scaled SRMSD ) 0.56, 1.1, 1.3, 1.4, and 2.2 for N ) 15, 20, 25, 30, and 50, respectively (Figure 7c, circles). To extrapolate the 50mer’s V(RMSD), its nnorm(SRMSD) was first extrapolated from SRMSD ) 2.2 to 1.4 by appending the tail of the 30-mer’s nnorm(SRMSD), and further extrapolated to SRMSD ) 1.3 by appending the tail of the 25-mer’s nnorm(SRMSD) curve, then to SRMSD ) 1.1 using the 20-mer’s nnorm(SRMSD) tail, and finally to SRMSD ) 0.56 using the 15-mer’s nnorm(SRMSD). In general, any chain length’s extrapolated nnorm(SRMSD) curve is defined at any SRMSD value by nnorm(SRMSD) associated with the largest chain length (not greater than N) for which nnorm-

Sullivan and Lim (SRMSD) has been directly calculated. For extrapolating V(RMSD), the universal curve, nnorm(SRMSD), is first transformed back to the chain-length specific function, n(RMSD), by the inverse functions of eqs 4-8. The extrapolated n(RMSD) curves were used to extend the V(RMSD) distribution by Euler extrapolation, in which a slope function, for example, n(RMSD), is applied iteratively to extend the original function, for example, V(RMSD), by small increments. For extrapolating n(RMSD) and thus V(RMSD) to even smaller RMSD, we use the n(RMSD) curves for N < 15 in Figure 7b, which are defined to RMSD as low as 0.09 Å, as well as the fact that n(RMSD) trends to the number of DOF, 15N - 3, as RMSD approaches zero. Additional distribution analysis calculations on alanine (12 heavy-atom DOF) under MD at 700 K with w ) 30 000 confirms the trend of n(RMSD) to 15N - 3, with alanine’s n(RMSD) value reaching 9.96 at 0.033 Å RMSD (data not shown), and a linear fit over n(RMSD) from 0.033 to 0.056 Å RMSD (correlation coefficient r2 ) 0.99) yielding n(0) ) 12. Hence, n(RMSD) is normalized simply by dividing by N - 0.2, a quantity proportional to 15N - 3, with no RMSD scaling. This yields a tight overlap of n(RMSD) for N ) 3-12 for r < 1 Å (Figure 7c, inset), providing empirical justification for the normalization. By appending this curve, rescaled through multiplication by N - 0.2, n(RMSD) for N g 15 is defined to RMSD ) 0.09 Å; it is further defined to RMSD ) 0 by a linear extrapolation to 15N - 3. The extrapolated n(RMSD) curves, in turn, were used to extend the V(RMSD) distribution to very small RMSD, as shown in Figure 7d. CDFs from Energy-minimized Ensembles. Extrapolating V(RMSD) for minimized ensembles with discrete conformers is similar to that described above for MD ensembles with continuous conformational density, except for RMSD < 1 Å. This is because although at RMSD > 1 Å, the V(RMSD) distributions of minimized and MD ensembles are similar, in the limit of low RMSD, the V(RMSD) distributions of minimized ensembles flatten for N e 20, where w is large enough for multiple sampling of individual minima (compare Figure 8a with Figure 7a). Correspondingly, the low RMSD portions of n(RMSD) differ dramatically and approach zero instead of 15N - 3 (compare Figure 8b with Figure 7b). Hence, the n(RMSD) curves for N < 15 in Figure 8b (instead of Figure 7c inset) were used to extrapolate V(RMSD) below 1 Å for the minimized ensembles. 3. Dependence of “Unfolded” States of Ala50 on Thermodynamic Variables and Energy Components. In this section, the response of conformational distributions to changes in temperature (energy minimization) and individual forces such as repulsive vdW and torsional forces on the energy landscape is obtained by comparing CDFs derived from different Ala50 ensembles (MD, energy minimized, and YARN with various constraints). Dependence of V(RMSD) on Temperature. Examining conformational distributions drawn from different temperatures gives insight into the shape of the energy landscape, which is found to consist of distinct energy basins, separated by ∼1 Å. An apparent feature in the low-temperature MD V(RMSD) (Figure 9a, open circles, 100 and 300 K) is an overall sigmoid shape: the CDF is steep at small RMSD, then flattens ∼1 Å, and steepens above ∼2 Å before flattening again in the limit of high RMSD. This behavior would be consistent with an energy landscape composed of distinct energy basins, with characteristic spacing of ∼1 Å. Interbasin conformation pairs contribute to V(RMSD) at RMSD > 1 Å, whereas intrabasin conformation pairs contribute to V(RMSD) at RMSD < 1 Å. The extent of

Quantifying Polypeptide Conformational Space

Figure 8. Distribution functions from energy-minimized ensemble. Plot of the distribution functions V(RMSD) (a) and n(RMSD) (b) derived from the energy-minimized conformations of the 100 K (75 ps) MD ensemble for N ) 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, and 50. The corresponding normalized nnorm(SRMSD) (c) and extrapolated V(RMSD) (d) for N ) 15, 20, 25, 30, and 50 are plotted below n(RMSD) and V(RMSD), respectively. In the inset to (b), the low-RMSD peak maximum position is plotted vs the chain length, N. This feature results from degeneracy in the deprotonated C-terminal carboxylate oxygen atoms. The peak position is well fit by RMSsaveraging the switching of the positions of the two carboxylate oxygen atoms, by 2.19 Å, across the 5N + 1 heavy atoms; r ) [(2 × (2.19)2)/(5N + 1)]0.5 defines the dotted line in the inset.

intrabasin exploration, whose end-point depends on temperature, is indicated by the difference between V(RMSD) derived from energy-minimized ensembles (Figure 9a, filled circles) with those from the corresponding MD ensembles. Comparing MD 100 K with 300 K, the V(RMSD) difference is only in the intrabasin regime, demonstrating that cooling to 100 K does not reduce the number of basins explored on these time scales. Higher temperature (1000 K) mutes the sigmoid shape in V(RMSD) as these basins flood. The difference between 1000 K and low-temperature MD V(RMSD) indicates that far more basins are accessible at 1000 K. This is also borne out in the number of energy minima, estimated from the 1/V(0) value for energy-minimized ensembles (eq 12), which ranges from 1030 at 1000 K to 1015 at 100 K (see Figure 9a). Dependence of V(RMSD) on Energy Components. Just as comparisons between MD and energy-minimized based V(RMSD) provide information on the shape of the energy landscape, the response of V(RMSD) to particular forces in the molecularmechanics force field is also useful for characterizing the energy landscape. For this purpose, V(RMSD) distributions were derived from YARN ensembles employing different constraints mimicking aspects of all-atom based potential functions: compactness, “distal” excluded volume, and “local” excluded volume. Compactness reflects the relative self-attractiveness of the chain, as opposed to attraction to solvent. Hard-sphere “distal” excluded volume reflects repulsive vdW interactions, whereas “local” excluded volume reflects φ/ψ torsional interactions. Although, in lieu of such YARN constraints, one could similarly adjust parameters of the MD energy function (for example, by varying atomic radii), using an enumeration scheme like YARN removes time dependence from consideration. The extent to which compactness reduces conformational space depends on the compactness constraint strength and RMSD. Compactness increases V(RMSD) between ensembles

J. Phys. Chem. B, Vol. 110, No. 33, 2006 16713

Figure 9. Model dependence on Ala50 V(RMSD). Extrapolated V(RMSD) is plotted for 50-mer ensembles generated using different constraint sets or modeling operations. In (a), V(RMSD) is constructed from the MD (O) ensembles at 25 ps (1000 K), 55 ps (300 K), and 75 ps (100 K), along with their minimized (b) counterparts. In (b), the V(RMSD) curves are constructed from YARN models24 with no compactness constraint while in (c) compactness (see the Theory and Methods section) is imposed. For the YARN ensembles, combinations of two constraints are explored. Curves marked “Local” have “local” excluded volume imposed, such that the φ/ψ distribution reflects that of real proteins.28 “No Local” conformations are built with a uniform φ/ψ distribution. “Distal” constrained ensembles impose excluded volume constraint between residues not covalently bound. “No Distal” curves do not have this excluded volume constraint.

with identical constraints ∼8-15 orders of magnitude at RMSD ∼1 Å, increasing to 10-20 orders of magnitude in the limit of small RMSD (compare solid lines in parts b and c of Figure 9). The extent to which repulsive vdW interactions (“distal” excluded volume) reduce conformational space depends on whether the chains are extended or compact. For extended conformations (Figure 9b), “distal” excluded volume causes only a small constriction in conformation space: the “local/distal” and “local/no distal” V(RMSD) curves essentially overlap, as do the “no local/distal” and “no local/no distal” V(RMSD) curves. For compact conformations (Figure 9c), “distal” excluded volume causes a greater constriction in conformation space than for extended conformations. Torsional interactions (“local” excluded volume) reduce conformational space much more than repulsive vdW interactions. Relative to uniform (“no local/no distal”) φ/ψ sampling, “local” excluded volume introduces ∼70 orders of magnitude separation in V(RMSD) in the limit of small RMSD. On a perresidue basis, “local” excluded volume ensembles occupy only ∼3.7% of φ/ψ space (i.e., 10-70 ≈ 0.03749). “Local” excluded volume also partly accounts for the sigmoid shape of V(RMSD) derived from 100 and 300 K MD ensembles, as V(RMSD) from YARN ensembles become more sigmoid-like when it is turned on. This result suggests that the basins observed in MD are predominantly associated with sequence-local interactions as opposed to chain packing. Further correlation between successive torsion angles in MD associated with secondary structure might explain the more sigmoidal shape in MD V(RMSD), as compared with YARN V(RMSD), given that the torsion angle

16714 J. Phys. Chem. B, Vol. 110, No. 33, 2006 values in YARN conformations probably have negligible correlation. In summary, despite the common “random” origins of these ensembles, V(RMSD) reveals profound pair distribution differences reflecting imposed constraints in the case of YARN ensembles or in response to relatively short MD simulations or energy minimization. 4. “First-shell” Structure of Ala50. In this section, we analyze the distribution of minima in the locality of a single energy minimum. The “first-shell” structure of minima surrounding a single minimum for Ala50 is then compared with results from a related study on proteins and for Ala50 with explicit water molecules. Ala50’s Minima Neighbor Distance. To investigate further the “first-shell” energy landscape structure, V(RMSD) was recalculated for the 100 K (75 ps) minimized ensemble using RMSD values that exclude the two C-terminal carboxylate oxygen atoms (data not shown, see Figure 8b). From these degeneracyfree distributions, a “statistical” nearest-neighbor distance (SNND) was defined as the RMSD where V(RMSD) ) 2 × V(0.005 Å). Note that the pair distribution is absent of density from ∼0.004 to 0.1 Å. Cumulative density at 0.005 Å captures exclusively intraminima conformational density. Over N ) 3-25, for which V(RMSD) is defined to small RMSD at 75 ps, SNND ranges from 1.5 Å (N ) 4) to 0.64 Å (N ) 25). Although the SNND decreases with increasing N, the N-dependence is too noisy to infer a functional form, therefore, as an initial guess, the SNNDs from N ) 10 to N ) 25 were fitted to a line, yielding a SNND for Ala50 of 0.42 Å. Comparison Between the SNNDs of 50-mer Ala and 58residue BPTI. The minima neighbor distance predicted for 50mer Ala (0.42 Å) is four times larger than that (CR-RMSD ∼ 0.1 Å) reported for the 58-residue bovine pancreatic trypsin inhibitor (BPTI) protein.19 Does the observed minima neighbor distance difference stem from (i) differences in the way the conformations were analyzed, (ii) differences in the treatment of solvent, or (iii) differences between polyalanine and proteins? To address the first possibility, the method used to analyze the BPTI conformations19 was applied to the Ala50 conformations. In the BPTI analysis, conformations at every 10 fs from a 1 ps segment of a 298 K MD trajectory were energy minimized, and the CR RMSD between any two energy minima was computed. In analogy to the BPTI analysis, conformations at every 10 fs from five 1-ps 500 K MD trajectories (starting from five independent Ala50 conformations taken from the 55 ps MD ensemble) were energy minimized, and the RMSD (rather than the CR RMSD) was computed. The higher temperature (500 K vs 298 K) was necessary to induce minima crossings. After energy minimization, of the 495 pairs of successive structures, 431 are RMSD e 0.004 Å displaced, indicating that the MD trajectory has not crossed a barrier, while the remaining 64 structure pairs are RMSD ) 0.068-1.32 Å displaced, indicating interminima transitions. Although the average MD interminima RMSD (0.62 Å) is slightly larger than Ala50’s predicted SNND (∼0.42 Å), as the MD trajectory does not necessarily traverse into the nearest neighbor, it indicates that the observed minima neighbor distance difference between Ala50 and BPTI does not stem from differences in the conformational analysis. To assess if the observed minima neighbor distance difference stems from differences in the treatment of solvent, the above analysis of kinetically linked minima was repeated for Ala50 using a solvent model similar to that used in BPTI. The BPTI simulation19 was carried out in the presence of a 5 Å shell of 280 TIP3P water molecules using a distance-dependent dielectric

Sullivan and Lim function,  ) dij. Although when a distant-dependent dielectric and explicit waters are used, the screening is double-counted, we had to adopt the BPTI protocol to isolate protein sequence difference as the sole variable in comparing the BPTI and Ala50 results. Therefore, in analogy to the BPTI simulation, each of the Ala50 simulations (starting from the five independent Ala50 conformations taken from the 55 ps MD ensemble, as above) was carried out in the presence of a 4, 6, and 8 Å shell, containing 21-33, 212-231, and 475-494 TIP3P water molecules, respectively, using a distance-dependent dielectric function. The solvated systems were first energy minimized, then equilibrated at 300 K for 30 ps, followed by 1 ps of “production” dynamics. To preserve solvent/solute contact for the 4-Å-shell system, a light harmonic constraint of 0.1 kcal/ (mol Å2) was applied during equilibration, but this constraint was turned off during production. The Ala50 simulations show a strong dependence of minima spacing on the number of water molecules. Adding the 4 Å shell reduces the mean minima spacing from 0.62 Å (no water molecules) to 0.24 Å (0.22 Å CR RMSD). The spacing is further reduced to 0.09 Å RMSD (0.08 Å CR RMSD) with a 6 Å water shell and to 0.062 Å RMSD (0.057 Å CR-RMSD) with an 8 Å water shell. The minima spacing for Ala50 solvated with a 6 Å water shell (0.08 Å CR-RMSD) is comparable to that for BPTI solvated with a 5 Å shell (0.10 Å CR-RMSD). Thus, when equivalent solvent models are employed, the minima spacing appears essentially sequence-independent. In summary, the above results indicate that the nearestneighbor minima spacing is more sensitive to the solvent model than to the protein sequence. Discussion 1. Quantifying Conformational Space using Conformational Counts. Two idealized calculations motivate counting conformations. In an idealized thermodynamic calculation, each of the W conformations is equally probable in the unfolded state with one conformation exclusively describing the native state, such that the conformation count directly provides the entropic cost of folding by k ln W. In the corresponding ideal kinetic calculation, occupancy time in any single conformation is equal during folding such that the Levinthal random exploration folding time6 follows directly from the unfolded state conformation count multiplied by the conformation transition time. Deviations between these limiting cases and actual entropies or folding times, if available, estimate the energetic or kinetic bias relative to random sampling of the unfolded state model. Here, we quantify conformational space by counting conformations based on the CDF. Assuming a conformational radius, V(r) gives the fraction of conformational density occupied by each conformation, hence the inverse, 1/V(r), provides a statistical conformation count (see the Theory and Methods section). This alternative presentation of the CDF information gives conformation counts as a function of conformational radius. The CDF-based conformational count scheme presented herein has several advantages over current conformation count schemes. The CDF does not assume independence among the DOF and can therefore capture correlations of arbitrary complexity, whereas projection-based methods assume independent DOF. For example, assuming noninteracting backbone torsions, conformation counts can be calculated as the product of each torsion’s number of conformations (e.g., number of wells by torsion rotation). However, this “isolated pair hypothesis”,35 if applicable at all,36 is probably limited to random and extended states depending on the exact problem definition37 but would

Quantifying Polypeptide Conformational Space not be applicable for compact states where nonbonded repulsion excludes many conformations.38 Insensitivity to higher-order correlations among DOF by quasiharmonic analysis39 illustrates the general limitations of quantifying conformational space by assuming a set of independent DOF:24,40 the product of conformational density along component DOF may map probabilistically “favored” regions to high-energy molecular geometries. Another advantage of the inverse-CDF conformation count is that it yields counts that converge with increasing sample size, whereas cluster counts from computer simulations depend on simulation length. The latter count obserVed conformations by clustering an ensemble sample from an MD trajectory and counting the number of clusters, with each cluster representing a “conformation”.3,41,42 Such raw MD cluster counts at any nonzero temperature will increase with simulation time until all possible clusters containing conformations of finite energy have been sampled, which is not a well-behaved quantity (see below). A third advantage of CDF-based conformation counts is that the CDF can provide the optimal clustering radius/ RMSD for counting conformations for a given energy and solvent model (see below). 2. Quantifying Conformational Space of Folding Peptides/ Proteins. Unfolded Ensembles. Ignoring for the moment the most appropriate correspondence between the CDF radius and clustering radius, which will depend on algorithmic details of the clustering, the results herein provide a limiting count of the number of significant clusters for any radius and by extension the fraction of conformational space covered by any single simulation, as exemplified by the following. Inspired in part by claims that surprisingly few conformations define unfolded states,3 Caflisch and co-workers41 explored the possibility of fully covering the unfolded conformational space of a 20-mer designed peptide by clustering 12.6 µs of MD simulation at the melting temperature, 330 K, with a 2 Å RMSD clustering radius. In 12.6 µs, ∼15 000 clusters are visited, but additional simulation time continually explores new regions, indicating more clusters. How many more? An upper bound on the number of high-energy conformations could be estimated from the 1/V(2 Å) conformation count for Ala20 derived from (i) the YARN “extended/local/distal” ensemble, 2 × 107, and (ii) the 1000 K MD (25 ps) ensemble, 6 × 106. Thus, 106-107 probably bounds 20-mer unfolded state conformation counts at low temperature (T < ∼500 K). The inverse of the sum of squared fractional occupancies from MD clusters would give a weighted conformation count more directly comparable to these CDF-based counts. The results herein also provide the optimal clustering radius/ RMSD for counting conformations. The optimal value is ∼1 Å RMSD, where n(r) for the 55 ps Ala20 ensemble (Figure 7) is at a minimum and therefore the CDF depends least critically on conformational radius. This distance also captures inherent boundaries of energy basins that segregate intraminima vibrations from larger displacements, as evidenced from comparing n(r) of the 55 ps MD ensembles (Figure 7b), in which both vibrations and larger displacements are active, with n(r) of the minimized ensembles (Figure 8b), in which vibrations have been quenched: No differences are apparent above 1 Å, whereas below 1 Å, n(r) is ∼0 for minimized ensembles but increases precipitously for the corresponding MD. However, whether this inherent conformation cutoff could be transferred to different energy and solvent models, as well as different sequences, will require further investigation. The results herein show that the conformational count depends critically on the clustering radius used in a given clustering

J. Phys. Chem. B, Vol. 110, No. 33, 2006 16715 algorithm or the conformational distance used in inverse-CDF conformation counts. Doubling RMSD from 1 to 2 Å reduces the conformation count from 1023 to 2 × 1010 for Ala20 and from 1069 to 1036 for Ala50 generated using YARN “extended/ no local/distal”. It reduces the conformation count much less from 1014 to 3 × 1011 for Ala50 generated using the 300 K (55 ps) MD ensemble. The excluded conformational space surrounding conformational basins likely explains this reduction in radius sensitivity in low energy ensembles near this characteristic ∼1 Å resolution. Besides the conformational/clustering distance, the CDFbased conformation count depends on the unfolded state model, as evidenced by the large CDF differences between free-flight and collapsed Ala50 ensembles exhibiting significant secondary structure (Figure 9). Assuming 1 Å RMSD to represent the thermal-fluctuation span of a protein’s native state, 1/V(1 Å) gives a probabilistically weighted estimate of the number of native-state-sized conformational-space elements that compose an ensemble. In 55 ps of MD simulation, 1/V(1 Å) is reduced more than 40 orders of magnitude from 1055 to 1014 (Figure 9, compare V(1 Å) from the initiating “compact/no local/distal” YARN ensemble with the 300 K MD ensemble). If this analysis is restricted to energy minima counts using 1/V(r ) 0), there is still a large range in ensemble sizes of 1030 for the 25 ps ensemble falling to 1015 after only 75 ps. These results highlight the importance of accurately representing a thermodynamic state of interest, in particular the conformational distribution of an unfolded protein. In summary, by varying ensemble and conformation definition (1 f 2 Å), CDF-based conformation counts for Ala50 range from 1011 to 1069. The low end of this range results from applying molecular-mechanics force fields, which embody attractive nonbonded forces absent in earlier CDF studies.23,24 Thus, the CDFs obtained herein expose the myriad of dependencies in conformation counts that can easily obscure a discussion on the sizes of conformational space. Transition-state Ensembles. The collection of CDFs derived from different polyalanine ensembles and the finding that conformation counts are sensitive to ensemble and conformation definition (see above) can help place various studies on a comparable scale, as exemplified by the following. Debe and Goddard43 proposed a test for determining whether two structures occupy the same topomer by whether harmonic forces alone could superpose the pair. “Topomers” in their parlance are broad conformational basins, of which the native transition state is one. By this definition, 50-mer proteins are composed of ∼106 topomers, each spanning ∼5-7 Å RMSD.43 These numbers are within the extended YARN CDF-based conformation count range of 105 to 108 “topomers” at 5 Å RMSD for the 50-mer model (see Figure 9b). Wallin and Chan44 recently reexamined this class of models45 while applying a more stringent test for transition state cooccupancy, based on estimating the probability of forming the set of native contacts in a random search. This strategy estimates a 1036 to 1047 probabilistic degeneracy for the transition state of a 64-mer protein, chymotrypsin inhibitor 2 (CI2), with the transition state spanning only ∼2-3 Å RMSD. Considering the longer chain length of the 64-mer protein vs 50-mer Ala, Wallin and Chan’s counts are in qualitative agreement with all extended YARN CDF-based topomer counts, which range from 1024 to 1038 at 2 Å RMSD (see Figure 9b). The agreement between these two literature topomer counts and the YARN CDF estimates demonstrates that the essential disagreement between Debe and Goddard’s43 ∼106 topomers and Wallin and Chan’s44

16716 J. Phys. Chem. B, Vol. 110, No. 33, 2006 1036-1047 topomers can be restated as a disagreement on the extent of the transition state as reflected by mean atomic displacements (5-7 vs 2-3 Å ). 3. Constructing CDFs for Proteins. Although protein CDFs have not been computed in this work, they are expected to track polyalanine CDFs to some small RMSD because side chain structure is strongly correlated with main chain structure, both covalently and through torsions.46,47 Just as CDFs derived from energy-minimized and MD ensembles track each other before vibrations cause a sharp dive in the CDF, we expect similar behavior for adding longer side chains. When CDF calculations for proteins become available, the CDFs from simpler molecules such as polyalanine will provide a baseline for comparison. We expect the large “substate” basins observed for polyalanine, which have a breadth of ∼0.5-1 Å, to divide into a hierarchy18 of energy wells for proteins for equivalent solvent models. If particular tiers cover characteristic volumes of conformational space, then different tiers will lend a step pattern to V(RMSD) with bumps in n(RMSD) at characteristic distances. Comparing V(RMSD) at these characteristic distances will reveal the structure of the hierarchy, such as the number of minima per “minima cluster”, number of clusters per substate, and so forth. If instead the tiers cover a continuum in a scale-free manner,48 V(RMSD) should slope smoothly to small RMSD. Further, different conformational displacement measures, in lieu of RMSD, can be tested for their ability to capture energy landscape structure via the CDF. To apply CDFs to proteins, technical issues of extrapolating CDFs for actual protein sequences have to be addressed. Below are multiple plausible solutions. (i) A first possible solution would be to apply directly the methods developed here to a set of proteins drawn from different chain lengths. With enough sequences, an envelope of distributions could be deduced for proteins generally or restricted to particular folds. (ii) An alternative strategy to collecting distribution information over a range of chain lengths would be to collect distribution information over a range of resolutions for a single sequence of given chain length. This strategy makes use of the fact that defining n(r), as opposed to the pair distribution, avoids the problem of establishing normalization factors between different resolutions. In practice, controlling the resolution of an ensemble could be achieved by constraining the DOF across an ensemble, for example, by using positional constraints in MD. The distribution at r sufficiently less than allowable conformational displacements should be indicative of the unconstrained distribution. This approach would permit the definition of n(r) to smaller r values than possible using unconstrained sampling. The low-r n(r) segments over several resolutions could be concatenated to estimate the full n(r). (iii) A third strategy for extrapolating V(r) would leverage matrix algebra treatments such as PCA.49 As shown in previous work,24 reducing a conformational distribution to its component marginal distributions may result in significant information loss on the system’s correlation structure. However, given the straightforward ease of PCA, a better understanding of that information loss would clearly be beneficial. This can be explored by comparing n(r) computed directly from a conformational ensemble of interest with n(r) computed from an ensemble regenerated using only the marginal distributions of the original ensemble. Knowing the change in n(r), that is, the information loss, the next step is to investigate how n(r) relates to quantities obtained by PCA, either by numerical or by

Sullivan and Lim analytical investigation. Specifically, it would be of interest to know n(r) relative to the number of DOF with eigenvalues (root variances, with units of angstroms, and not variances directly) greater than r. This ratio will approach unity in the limit of small-r values. The relation would probably be simpler for systems without significant nonlinear correlations, for example, excluded volume. With a thorough understanding of this relation, V(r) could be extrapolated using PCA. (iv) In a variation of strategy (iii), one could test various copula functions, which regenerate a joint distribution from a set of marginal distributions.50,51 From copula-regenerated joint distributions, a point sample could be drawn for calculating V(r). This V(r) could be compared to V(r) calculated over the original joint distribution to evaluate the fit of the copula function. It might also be possible to evaluate analytically the reduction in conformational space associated with a given copula function. These studies would probably have the most impact in characterizing the nonlinear correlation structure of conformational distributions. (v) A random-walk process can well describe protein dynamics in an RMSD space.52 At fixed time intervals, the variance of travel distance depends on the dimensionality of space available to the walker. In the limit of an infinite number of dimensions, all steps are orthogonal. Displacement exactly equals the square root of the number of steps, multiplied by the step length, and displacement variance increases at smaller dimensions. “Successive N-cube analysis” (S-NCA)25 applies a related mathematical treatment to arrive at a time-scale specific dimensionality from the RMSD distribution of pairs of conformations equally spaced in time. Like n(r), the S-NCA dimensionality appears to approach the number of DOF at small displacements and drops at larger displacements. Establishing the exact relationship between S-NCA and n(r) would enable CDF calculations from a single long and well-equilibrated MD trajectory and would obviate multiple chain-length extrapolation schemes of the form used here. The second and third alternate V(r) extrapolation schemes could be developed and tested using (a) small chain-length ensembles where conformational space can be densely sampled, (b) arbitrarily long chain-length ensembles with simple constraint sets that have known chain-length scaling properties,24 and (c) the polyalanine MD ensembles calculated here, which have structural features reminiscent of proteins. Different problems may pose different constraints on the CDF calculation that may help choose from among these various possible schemes. Directly calculating the CDF from distances between pairs of conformations sampled from an ensemble of interest requires that all conformations be structurally independent to interpret the CDF as statistics of random pairs. Calculating the CDF from pair distances among multiple conformations from a single MD trajectory would bias the RMSD distribution to low values if snapshots are sampled at intervals shorter than the structural relaxation time, which can be >1 µs for unfolded proteins.53 In our study, structural independence among conformations for any particular ensemble sample (i.e., any time point) is ensured by using only one conformation from each MD trajectory initiated by randomly generated conformations (the compact YARN structures). Projections to marginal conformation distributions, as in PCAbased strategies (iii) and (iv), will be less sensitive to autocorrelation among conformations precisely because of the information loss inherent in these methods; each dimension can be heavily resampled such that distributions converge. Strategy V is attractive because it requires kinetically linked conformations

Quantifying Polypeptide Conformational Space and is compatible with long and well-equilibrated MD trajectories but is completely incompatible with stochastically generated ensembles such as the YARN models. Acknowledgment. We thank Vincent A. Voelz for helpful discussions. This work is supported by Academia Sinica and the National Science Council, Taiwan. References and Notes (1) Shortle, D. FASEB J. 1996, 10, 27. (2) Smith, L. J.; Fiebig, K. M.; Schwalbe, H.; Dobson, C. M. Folding Des. 1996, 1, R95. (3) van Gunsteren, W. F.; Burgi, R.; Peter, C.; Daura, X. Angew. Chem., Int. Ed. 2001, 40, 352. (4) Guo, Z. Y.; Brooks, C. L.; Boczko, E. M. Proc. Natl. Acad. Sci. U.S.A. 1997, 94, 10161. (5) Plotkin, S. S.; Onuchic, J. N. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 6509. (6) Levinthal, C. J. Chim. Phys. 1968, 65, 44. (7) Dill, K. A.; Chan, H. S. Nat. Struct. Biol. 1997, 4, 10. (8) Nolting, B.; Golbik, R.; Neira, J. L.; Soler-Gonzalez, A. S.; Schreiber, G.; Fersht, A. R. Proc. Natl. Acad. Sci. U.S.A. 1997, 94, 826. (9) Czerminski, R.; Elber, R. J. Chem. Phys. 1990, 92, 5580. (10) Becker, O. M.; Karplus, M. J. Chem. Phys. 1997, 106, 1495. (11) Wales, D. J.; Miller, M. A.; Walsh, T. R. Nature 1998, 394, 758. (12) Lau, K. F.; Dill, K. A. Proc. Natl. Acad. Sci. U.S.A. 1990, 87, 638. (13) Chan, H. S.; Dill, K. A. J. Chem. Phys. 1993, 99, 2116. (14) Chan, H. S.; Dill, K. A. J. Chem. Phys. 1994, 100, 9238. (15) Ozkan, S. B.; Dill, K. A.; Bahar, I. Biopolymers 2003, 68, 35. (16) Schonbrun, J.; Dill, K. A. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 12678. (17) Clarage, J. B.; Romo, T.; Andrews, B. K.; Pettitt, B. M.; Phillips, G. N. J. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 3288. (18) Ansari, A.; Berendzen, J.; Bowne, S. F.; Frauenfelder, H.; Iben, I. E. T.; Sauke, T. B.; Shyamsunder, E.; Young, R. D. Proc. Natl. Acad. Sci. U.S.A. 1985, 82, 5000. (19) Troyer, J. M.; Cohen, F. E. Proteins: Struct., Funct., Genet. 1995, 23, 97. (20) Bryngelson, J. D.; Wolynes, P. G. Proc. Natl. Acad. Sci. U.S.A. 1987, 84, 7524. (21) Chothia, C.; Lesk, A. M. EMBO J. 1986, 5, 823. (22) Spronk, C. A.; Nabuurs, S. B.; Bonvin, A. M.; Krieger, E.; Vuister, G. W.; Vriend, G. J. Biomol. NMR 2003, 25, 225. (23) Sullivan, D. C.; Kuntz, I. D. Biophys. J. 2004, 87, 113. (24) Sullivan, D. C.; Lim, C. J. Chin. Chem. Soc. 2004, 51, 1209. (25) Sullivan, D. C.; Kuntz, I. D. Proteins: Struct., Funct., Genet. 2001, 42, 495.

J. Phys. Chem. B, Vol. 110, No. 33, 2006 16717 (26) Sullivan, D. C.; Aynechi, T.; Voelz, V. A.; Kuntz, I. D. Biophys. J. 2003, 85, 174. (27) Sullivan, D. C.; Lim, C. J. Phys. Chem. B 2006, 110, 12128. (28) Gregoret, L. M.; Cohen, F. E. J. Mol. Biol. 1991, 219, 109. (29) Pearlman, D. A.; Case, D. A.; Caldwell, J. W.; Ross, W. S.; Cheatham, T. E., III; DeBolt, S.; Ferguson, D. M.; Seibel, G. L.; Kollman, P. A. Comput. Phys. Commun. 1995, 91, 1. (30) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M. J.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179. (31) Ryckaert, J.; Ciccotti, G.; Berendsen, H. J. C. J. Comput. Phys. 1977, 23, 327. (32) Maiorov, V. N.; Crippen, G. M. J. Mol. Biol. 1994, 235, 625. (33) Simpson, E. H. Nature 1949, 163, 688. (34) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577. (35) Flory, P. J. Statistical mechanics of chain molecules; Wiley: New York, 1969. (36) Pappu, R. V.; Srinivasan, R.; Rose, G. D. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 12565. (37) Ohkubo, Y. Z.; Brooks, C. L. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 13916. (38) Dill, K. A. Biochemistry 1985, 24, 1501. (39) Levy, R. M.; Karplus, M.; Kushick, J.; Perahia, D. Macromolecules 1984, 17, 1370. (40) Chang, C. E.; Chen, W.; Gilson, M. K. J. Chem. Theor. Comput. 2005, 1, 1017. (41) Cavalli, A.; Haberthur, U.; Paci, E.; Caflisch, A. Protein Sci. 2003, 12, 1801. (42) Day, R.; Daggett, V. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 13445. (43) Debe, D. A.; Carlson, M. J.; Goddard, W. A. I. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 2596. (44) Wallin, S.; Chan, H. S. Protein Sci. 2005, 14 1643. (45) Makarov, D. E.; Plaxco, K. W. Protein Sci. 2003, 12, 17. (46) Ponder, J. W.; Richards, F. M. J. Mol. Biol. 1987, 193, 775. (47) McGregor, M. J.; Islam, S. A.; Sternberg, M. J. J. Mol. Biol. 1987, 198, 295. (48) Doye, J. P. K. Phys. ReV. Lett. 2002, 88, 238701. (49) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes in C; The art of scientific computing, 2nd ed.; Cambridge University Press: New York, 1993. (50) Sklar, A. Publications de l′Institut de Statistique de l′UniVersite de Paris 1959, 8, 229. (51) Nelsen, R. B. An introduction to copulas; Springer-Verlag: New York, 1999. (52) Sullivan, D. C.; Kuntz, I. D. J. Phys. Chem. B 2002, 106, 3255. (53) Chattopadhyay, K.; Elson, E. L.; Frieden, C. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 2385.