
J. Phys. Chem. B 2009, 113, 12410–12423

Bayesian Single-Exponential Kinetics in Single-Molecule Experiments and Simulations

Daniel L. Ensign, Department of Chemistry, Stanford University, Stanford, California 94305

Vijay S. Pande,* Departments of Chemistry, Computer Science, and Structural Biology, Stanford University, Stanford, California 94305

Received: April 3, 2009; Revised Manuscript Received: June 29, 2009

In this work, we develop a fully Bayesian method for the calculation of probability distributions of single-exponential rates for any single-molecule process. These distributions can even be derived when no transitions from one state to another have been observed, since in that case the data can be used to estimate an upper bound on the rate. Using a Bayesian hypothesis test, one can easily test whether a transition occurs at the same rate or at different rates in two data sets. We illustrate these methods with molecular dynamics simulations of the folding of a β-sheet protein. However, the theory presented here can be used on any data from simulation or experiment for which a two-state description is appropriate.

* Corresponding author. Address: James H. Clark Center S295, 318 Campus Dr. West, Stanford University, Stanford, CA 94305-5447. E-mail: [email protected].

1. Introduction

For several decades, molecular dynamics (MD) simulations have been used to investigate the folding of proteins.1-5 However, computational restrictions are a huge challenge to using MD to understand protein folding because proteins take much longer to fold than the length of typical simulations. Even the fastest folding proteins fold on time scales (about 1 µs6) that are difficult to reach in simulation because they require ∼10^9 iterations for just one trajectory. Several prominent methods have been proposed to overcome this time scale gap, ranging from statistically based methods in which a large number of folding trials are attempted with short trajectories,7-16 to the development of less accurate but faster computational methods,17-21 to the development of unique and very expensive computer hardware which can calculate single trajectories at blazingly fast speeds.22

Multitrajectory approaches are the best methods for illuminating the process of protein folding using computer simulation. The immense number of possible configurations and long time scales render all individual trajectories highly improbable; therefore, single trajectories are not useful for drawing conclusions about folding in real systems. Furthermore, most protein folding experiments to date6,23-25 have focused on ensemble folding of a large number of individual protein molecules; to verify the consistency of computer simulations with experimental results, one must consider an ensemble of computed trajectories and its ability to represent ensemble experimental quantities.12

On the other hand, some recent experiments have used single-molecule methods to investigate protein folding.26-28 While these approaches should be more directly comparable to molecular dynamics simulations, which are in some sense single-molecule "experiments" based on simplified renderings of natural interatomic forces, truly single-molecule approaches still suffer from the low-sampling limitation: any individual trajectory is highly improbable. Single-molecule spectroscopists have long recognized that single trajectories are insufficient for drawing inferences for many problems of interest,29 excepting, perhaps, when the dynamics of interest are fast enough to be observed many times in one trajectory. Instead, multiple trajectories are typically used to recapitulate ensemble quantities, thereby revealing fluctuations that are invisible to truly ensemble-averaged experiments. However, single-molecule methods could benefit from more sound and consistent statistical analyses, for exactly the same reasons as protein folding MD simulations. These analyses should yield the entire distribution of a parameter of interest, not just a small number of its moments.

Bayesian statistical thinking is the mathematical consequence of the desiderata30 of consistent reasoning, that is, coming to the same conclusions when we have the same information, and of not arbitrarily ignoring relevant data. Probability distributions derived using Bayesian statistics automatically incorporate all of the relevant information in a data set without requiring the data to be distributed in certain ways (i.e., Gaussian) and without inclusion of extraneous parameters. These qualities led Jaynes30 to describe Bayesian probability theory as the unique analytical representation of consistent inductive reasoning, upon which the process of science is ultimately based. The primary result of the Bayesian calculation is a probability distribution for the parameters of interest; as we will illustrate below, these distributions are useful for many specific computational tasks. The primary computational tool of Bayesian inference is Bayes' rule

P(model|data, I) = \frac{P(data|model, I)\, P(model|I)}{P(data|I)}    (1)

for some background information I. The probabilities considered in Bayesian statistics are always conditional probabilities, denoted P(A|B) for the probability of proposition A given that proposition B is true.


In eq 1, we have made the specific assumption that one wishes to draw inferences about a "model" from some "data," as is usual in scientific inquiry. The inference is accomplished through the posterior probability density P(model|data,I). (Bayesian theory is of course not limited to inferences about models from scientific data; instead, we could formally be making inferences about any proposition A from knowledge of any other proposition B.) In prose, Bayes' rule means that the probability of a model given the observed data is derived from the probability of that model producing the observed data.

In classical probability theory one is forced to make assumptions as to which variables can be regarded as "random" and which cannot. Instead, the Bayesian framework allows us to derive probability distributions of any parameter, "random" or not, given the weight of the evidence at hand. Probability distributions of nonrandom parameters such as a folding rate are interpreted as degrees of belief regarding different values of those parameters. In essence, these methods allow one to make the most reasonable predictions of the values of parameters possible given the observed data.

In this paper, we describe Bayesian methods for calculating the probability distributions of single-exponential rates from single-molecule data, either from experiments or from simulations. Rates are of immense interest in protein folding, as a satisfactory understanding of protein folding kinetics requires a quantitative comparison of rates in experiments and simulations. The analysis presented here provides a robust means of estimating folding rates, and calculating their uncertainties, from computational data. In this light, the methods we present are illustrated using data from protein folding simulations of a β-sheet protein, Fip35.14,24,31 The central result is the probability distribution of a single-exponential rate given a set of single-molecule trajectories. From the probability distribution of the rate, we can calculate many other useful distributions, including the probability distribution of the difference between two rates and the probability distribution of the ratio of two rates. Rate distributions can be computed from single starting conditions or averaged over an ensemble of starting configurations, such as a representation of the equilibrium unfolded ensemble.

Most importantly, we present Bayesian decision-making theory which can determine whether two data sets are generated from two distinct processes with different rates. Using this tool, one can predict the minimum number of barrier-crossing events that need to be observed in order to reliably tell that two rates are different or that two rates are the same. We show that this lower limit is much greater than the number of barrier-crossing events that could be observed in popular simulation designs. In fact, even for loose interpretations of "reliable," the use of fewer than 20 simulations leads to unreliable results. Thus, we are able to show that a small number of molecular dynamics simulations cannot lead one to reliable conclusions about the value of any rate or about its value relative to reliably known rates. Therefore, simulations should be designed in such a way that one should observe at least an order of magnitude more barrier-crossing events than can be provided by a single simulation. For all of these calculations, the mathematics are simple, and for the most part expressible in simple analytical forms.
None of this theory assumes or requires equilibrium sampling, which is immensely useful for computation because often relaxation times are much longer than the time scales one is ordinarily capable of computing. All of our results are appropriate for any single-molecule data (experimental or computational) one wishes to interpret in terms of two states, so long as the two states can be clearly distinguished by some method. Because Bayesian theory is inherently consistent and logical, and because the probability distributions presented here are easy to calculate and manipulate, these methods should be adopted immediately by protein folding and single-molecule researchers when interpreting the dynamics of two-state systems, or whenever one wishes to interpret data assuming a two-state model. For example, many protein systems are thought to fold with two-state kinetics.32 We briefly discuss extensions of the Bayesian theory of single-exponential rates to more complicated kinetics.

2. Theory

2.1. Bayesian Inference. We address the model M which states that a set of single-molecule trajectories D is described by a single-exponential barrier-crossing process governed by a rate k. The immediate goal is to use the data D to compute the posterior probability of the rate, p(k|D,M,I), which is conditioned on the data D and the model M, as well as relevant background information I for the experiments or simulations that generated D: for example, the force field used, the temperature of the simulations, the code used for calculations, etc. We will assume that M is obvious in all conditional probabilities and will therefore abbreviate the posterior probability as p(k|D,I). Once we have the posterior probability, we can make inferences about the value of k and its moments. The posterior probability density of the rate k is given by Bayes' rule:

p(k|D, I) = \frac{p(D|k, I)\, p(k|I)}{p(D|I)} \propto l(k|D, I)\, p(k|I)    (2)

for observed data D. The posterior probability density is the product of two terms: the prior probability density p(k|I) and the likelihood function l(k|D). The likelihood function is any function proportional to the sampling distribution p(D|k) and is defined up to a multiplicative factor which is a function of D alone, so one can often compute the normalization of the posterior density after the data has been taken into account.

The prior density p(k|I) represents the state of knowledge about k before the data is taken. For example, if there were a previous estimate of k with a Gaussian distribution, then one could use that Gaussian as the prior. However, it is desirable to make inferences about k that are independent of any previous estimates (for instance, from experimental data), because (1) it is useful to compare inferences made about model parameters using data from different experiments or simulations and (2) in the case of using experimental estimates of rates as priors for simulation data, it is not always clear whether experiments are referring to the same dynamical process investigated in simulations. That is, we can directly identify folded and unfolded states by examination of structural parameters such as the Cα root mean square deviation from a model of the folded structure, whereas experiments generally probe surrogates of folding such as configuration-dependent fluorescence signals.6,33

In order to make inferences about model parameters which are dominated by the data, irrespective of other knowledge, we utilize the so-called "noninformative" prior. A rule of thumb for this purpose, proposed by Jeffreys,34 is that "scale parameters" such as the standard deviation of a Gaussian sampling distribution should be given the prior

p(σ) ∝ σ^{-1}    (3)


whereas “location parameters” such as the mean of the sampling distribution should be given the prior

p(µ) ∝ 1    (4)

Both of these prior distributions are derived from the likelihood for one observation d1,35 by

p(k) ∝ \sqrt{ -E_x\left[ \frac{\partial^2}{\partial k^2} \log(l(k|d_1)) \right] }    (5)

where the expectation Ex[f] is taken over the sampling distribution. The Jeffreys prior for a single-exponential rate, which is not clearly (to these authors) either a “scale” or a “location,” is computed explicitly in this way, below. The Jeffreys prior has the useful property of yielding results which are invariant to reparametrization.35,36 As such, the use of Jeffreys priors is consistent with a conservative strategy for posterior calculation and parameter estimation and so, all other things being equal, we prefer their use whenever a noninformative prior distribution is desired. However, as will be shown below, Jeffreys priors cannot always be used for the calculation of rates, because they can generate nonnormalizable (and therefore useless) posterior distributions even when a significant amount of data has been collected. With a large amount of data, it is unreasonable to be unwilling to make inferences about rates because one’s favorite prior distribution prohibits such inferences; for this reason, we will also calculate posterior distributions of k utilizing a uniform prior

p(k) ∝ 1    (6)

which, as long as data is collected at all, will never result in nonnormalizable posterior distributions for k. Choosing a different prior will produce different posterior probability distributions; the differences are most noticeable in the limit of small amounts of data. The fact that inferences made using the Jeffreys and uniform priors differ is an indication that not enough data has been collected to be certain about point estimates calculated from the posterior distribution.

2.2. Bayesian Inference for Single-Exponential Processes.

2.2.1. Construction of the Likelihood. We now consider a single-exponential process initiated from a given system configuration X and considering pertinent background information I. For example, we may be interested in protein folding in molecular dynamics simulations from starting coordinates X, while I represents details such as the solvent model, simulation parameters, etc. On the other hand, we may be interested in transitions of a fluorescent single molecule under starting conditions X (at macroscopic equilibrium or some defined nonequilibrium state); here, I would include information such as temperature, laboratory pressure, salt concentration, and so forth. The following analysis applies to any single-exponential process, insofar as the single-exponential model is reasonable or desirable. For the single exponential, the probability of a trajectory originating from X making a transition at time t is

p_{transition}(t|k, X, I)\, dt = k \exp[-kt]\, dt    (7)

Of course, in a finite data set not all trajectories will undergo a transition; these trajectories also contain information relative to inferences about the rate. The probability of a trajectory not transitioning by time t is derived from the probability of transitioning:

p_{not}(t|k, X, I) = 1 - \int_0^{t} dt'\, k \exp[-kt'] = \exp[-kt]    (8)

For instance, suppose that N protein folding simulations are initiated from configuration X, and, judged by a suitable definition of "folding," n simulations reach the folded state, each at time t_{f_i}. The remaining N - n simulations do not fold, but their lengths are t_i. As in ref 39, the probability of observing that set of simulation data D = (n, N, {t_{f_i}}, {t_i}) is a product with n factors of eq 7 and N - n factors of eq 8, or

p(D|k, X, I) = \prod_{i=1}^{n} k \exp[-k t_{f_i}] \prod_{j=1}^{N-n} \exp[-k t_j] = k^n \exp[-kΘ]    (9)

where we define the total relevant trajectory time Θ as

Θ = \left[ \sum_{i=1}^{n} t_{f_i} \right] + \left[ \sum_{j=1}^{N-n} t_j \right]    (10)

so-called because any extra time spent after t_{f_i} contains no additional information useful for inferences about the forward rate. One might be concerned that including the information from the trajectories that do not cross the barrier, eq 8, might introduce an unreasonable bias. For instance, imagine a set of 10^10 simulations of length 1 fs. Naive use of the likelihood (9) would suggest that this absurd data set would be equivalent to 10^3 simulations of 10 ns. However, proteins cannot fold in 1 fs. Hence, there is an additional assumption in the choice of the single-exponential likelihood function: we are only including trajectories sufficiently long to be well approximated by a single exponential. We discuss testing the suitability of the single-exponential approximation in Appendix C; in short, one way to justify the use of any trajectory is to make sure that it is longer than the minimum observed barrier-crossing time. This condition is met for all of the trajectories we use for illustration in section 3. When the data D was first introduced, we described it as the set of single-molecule trajectories one would wish to use for calculating single-exponential rates. However, the only information in the data relevant to inferences about the rate are the two values n and Θ, the number of transitions and the time it took to observe them. As will be shown, these are sufficient statistics35 for distributions of the rate. For brevity, we will henceforth consider D to contain only the information (n,Θ). We take eq 9 as the likelihood function

l(k|D, X, I) = k^n \exp[-kΘ]    (11)

The likelihood has maximum value

\hat{k} = \frac{n}{Θ}    (12)
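In practice, the sufficient statistics n and Θ of eqs 9 and 10 are simple bookkeeping over the trajectory set: each trajectory contributes its first-passage time if it crossed the barrier and its full length if it did not. The following minimal Python sketch (not part of the original paper; the trajectory records and time units are purely hypothetical) illustrates that bookkeeping and the maximum likelihood rate of eq 12:

def sufficient_statistics(trajectories):
    """Return (n, Theta) of eqs 9-10 for records of the form (folded, time),
    where time is the first-passage time if folded, else the trajectory length."""
    n = sum(1 for folded, _ in trajectories if folded)
    theta = sum(t for _, t in trajectories)
    return n, theta

# Hypothetical trajectories, times in microseconds: two fold, three do not.
trajs = [(True, 3.2), (True, 11.7), (False, 20.0), (False, 20.0), (False, 18.5)]
n, theta = sufficient_statistics(trajs)
k_mle = n / theta  # eq 12: events per microsecond
print(n, theta, k_mle)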


Interestingly, the maximum likelihood estimate is related to the estimate of the rate from the ensemble method.37 In the ensemble method, one estimates the cumulative probability of one trajectory folding for small t as the first nonzero term in its Taylor series

p(t < τ|k, X, I) = \int_0^{τ} p_{transition}(t|k, X, I)\, dt = 1 - \exp[-kτ] \approx kτ    (13)

Another estimate of the same probability is the fraction of simulations that folded, n/N. Therefore, the rate can be estimated to be

k^* = \frac{n}{Nτ} = \frac{n}{Θ'}    (14)

where Θ′ is the total simulation time, to be distinguished from the total necessary simulation time Θ, which is less. Even so, this estimate and the maximum likelihood estimate of the rate both have the same form: a number of barrier-crossing events observed in some time period. If trajectories are halted as soon as they undergo a transition, then both times are the same, Θ = Θ′, and so k̂ = k*. We will see a deeper connection with these estimates of k when we examine the maximum of the posterior distribution using a uniform prior, and the expectation value of k over the posterior distribution using a Jeffreys prior.

2.2.2. Posterior Probability of the Single-Exponential Rate Using Uniform and Jeffreys Priors. Uniform Prior. Let us first take a uniform prior distribution. The normalized posterior density is

p_U(k|D, X, I) = \frac{Θ^{n+1}}{n!} k^n \exp[-kΘ]    (15)

which is valid for n ≥ 0 and Θ > 0. Equation 15 is a two-parameter gamma distribution in k, with expectation value and variance

\langle k \rangle_U = \frac{n+1}{Θ}, \qquad \mathrm{var}(k)_U = \frac{n+1}{Θ^2}    (16)

Because the uniform prior was chosen, the value of k which maximizes p_U(k|D,X,I) is n/Θ, the same as the maximum likelihood estimate of k (eq 12).

Jeffreys Prior. While the uniform prior seems to correspond intuitively to the desirable qualities of a "noninformative" prior distribution, posterior probabilities derived from it are not invariant to reparametrization. The Jeffreys prior, however, does not suffer from this drawback. To calculate the Jeffreys prior for the rate from eq 5, the likelihood for one observation (that is to say, one trajectory) is

l(k|d_1 = \text{folding at time } t) = k e^{-kt}    (17)

In order to calculate the Jeffreys prior, we take the second derivative with respect to k of the logarithm of eq 17, and average the result over t (again using eq 17)

E_t\left[ \frac{\partial^2}{\partial k^2} \log(l) \right] = E_t\left[ -\frac{1}{k^2} \right] = -\frac{1}{k^2}    (18)

The Jeffreys prior (eq 5) is then

p_J(k) ∝ \sqrt{ -E_t\left[ \frac{\partial^2}{\partial k^2} \log(l) \right] } = \frac{1}{k}    (19)

Using the likelihood (11) and the Jeffreys prior p(k|I) ∝ 1/k instead of the uniform prior, we obtain the posterior probability density

p_J(k|D, X, I) = \frac{Θ^{n}}{(n-1)!} k^{n-1} \exp[-kΘ]    (20)

Unlike the posterior obtained from the uniform prior, this posterior is undefined for n = 0 and so cannot be used for rate inferences when no barrier-crossing events are observed; using this posterior probability requires that n ≥ 1 and Θ > 0. If these restrictions are satisfied, then the mean and variance are

\langle k \rangle_J = \frac{n}{Θ}, \qquad \mathrm{var}(k)_J = \frac{n}{Θ^2}    (21)
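Both posteriors are gamma distributions in k: eq 15 has shape n + 1 and eq 20 has shape n, each with rate Θ. Their moments and credible intervals are therefore available from standard libraries. The Python sketch below is not part of the original paper; it assumes SciPy is available and uses illustrative values of n and Θ:

from scipy.stats import gamma

def rate_posterior(n, theta, prior="uniform"):
    """Posterior of the rate k given n events in total relevant time theta:
    eq 15 (uniform prior) is Gamma(n + 1, rate theta); eq 20 (Jeffreys prior)
    is Gamma(n, rate theta) and requires n >= 1."""
    shape = n + 1 if prior == "uniform" else n
    if shape <= 0:
        raise ValueError("the Jeffreys posterior is undefined for n = 0")
    return gamma(a=shape, scale=1.0 / theta)

post_u = rate_posterior(2, 400.0, "uniform")    # e.g., 2 events in 400 us
post_j = rate_posterior(2, 400.0, "jeffreys")
print(post_u.mean(), post_u.std())   # (n + 1)/theta and sqrt(n + 1)/theta, eq 16
print(post_j.mean(), post_j.std())   # n/theta and sqrt(n)/theta, eq 21
print(post_u.interval(0.95))         # central 95% credible interval for k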

The mean of the posterior derived using the Jeffreys prior is the same as the maximum likelihood estimate.

Which Noninformative Prior Distribution Should Be Used? The advantage of choosing the uniform distribution in our a priori state of ignorance about the value of the rate k, rather than the Jeffreys prior, is that the uniform prior generates a posterior distribution that is always proper so long as any data has been observed at all. This leads to the interesting result that we may make inferences about the value of k even with a data set that has no barrier-crossing events, n = 0. This should be contrasted with the choice of the Jeffreys prior, which prevents any estimate of the rate when no barrier-crossing events are observed, because the resulting posterior density is not defined when n = 0. The most useful feature of the Jeffreys prior, that it yields posterior distributions invariant to reparametrization, is dwarfed by the ability of the uniform prior to allow for inferences when no transitions have been observed. It is more reasonable to use uniform priors than Jeffreys priors for folding rates because not observing any transitions in some amount of time τ should still allow one to crudely estimate the rate. If one observed no protein folding events in 1 s of simulation (an immense amount of simulation), it would be most reasonable to conclude that the rate is probably slower than 1 per second, and not that the rate is impossible to estimate. In fact, for the posterior probability derived from the uniform prior and for data which has n = 0, the probability of k lying between the mode (k = 0) and the mean 〈k〉 = 1 s^-1 is about 63%; k is more likely to be smaller than 〈k〉. On the other hand, the probability of the rate when n = 0 has no meaning when using the Jeffreys prior. The fact that different prior probabilities yield posterior distributions with different point estimates indicates when not enough data has been collected to be confident in the value of any point estimate. In assessing point estimates of the rate, one should compare the estimate 〈k〉 between the uniform and Jeffreys cases, since the results converge to the same answer in the limit of large n.


One should mistrust any point estimate of k such as 〈k〉 unless both priors yield posterior distributions with similar point estimates: if the estimates agree, then the results are dominated by the data and not by the choice of prior. Here, however, we are more interested in probability distributions than point estimates. Therefore, we will use uniform priors, since n = 0 is a reasonable and realistic result for many conceivable simulations. The caveat in such cases is that 〈k〉 is an estimate of k's upper bound.

2.3. What about Multiple Exponential Kinetics? Protein folding is rarely, if ever, actually a single-exponential process, even if we sometimes find spectroscopic surrogates for folding which appear to be two state. Furthermore, single-molecule experiments often show multiple states. However, we can use the preceding analysis to treat such cases, if we know all of the relevant states. Let us consider a situation with three states, A, B, and C. One then estimates the single-exponential rates of transfer between each pair of states. For instance, the posterior distribution of the rate of transition between states A and B, kAB, is (using a Jeffreys prior)

p(k_{AB}|D, I) = \frac{Θ_A^{n_{AB}}}{(n_{AB} - 1)!} k_{AB}^{n_{AB}-1} \exp[-k_{AB} Θ_A]    (22)

analogous to eq 15. Here, nAB is the number of observed transitions between states A and B, whereas ΘA is the amount of time spent in state A. According to statistical mechanics, the diagonal elements of a rate matrix are given by

k_{ii} = -\sum_{j \neq i} k_{ij}    (23)

It can trivially be shown that sums of gamma-distributed variables also have gamma distributions. Therefore, the probability distribution of, for instance, the rate kAA is

p(k_{AA}|D, I) = \frac{Θ_A^{n_{AB}+n_{AC}}}{(n_{AB} + n_{AC} - 1)!} (-k_{AA})^{n_{AB}+n_{AC}-1} \exp[k_{AA} Θ_A]    (24)

where kAA ∈ (-∞, 0). Therefore, one has available a probability distribution of the rate matrix describing transitions between the states. Such a matrix can easily be sampled so as to ensure microscopic reversibility. Complex kinetics can be modeled using this matrix if one knows the relevant states of the system, so that the number of transitions between each pair of states and the amount of time spent in each state can be determined. Generating a useful state space decomposition for such models is an active area of research.7-9,11,13,38

2.4. Testing Whether Two Data Sets Were Generated by Distinct Barrier-Crossing Processes. We now turn to the question of whether two data sets differing in some experimental or simulation parameters (starting configuration, temperature, solvent viscosity, etc.) are better described by different single-exponential rates than by the same single-exponential rate. To do this, we use Bayesian hypothesis testing. We define two hypotheses:

HA ≡ there is one process, governed by a rate k, which describes both data sets; both sets of parameters undergo transitions at the same rate.

HB ≡ each data set comes from a distinct process; the first process undergoes transitions at rate k1 and the second at rate k2.

The data for the two experiments are D1 and D2. In D1, n1 transitions were observed in time Θ1; likewise for D2, n2 transitions were observed in time Θ2. The total data set is used in evaluating hypothesis A: D = D1 + D2 = {n = n1 + n2, Θ = Θ1 + Θ2}. We wish to compute the odds ratio favoring hypothesis B

\frac{P(H_B|D, I)}{P(H_A|D, I)} = \frac{P(D|H_B, I)}{P(D|H_A, I)} \frac{P(H_B|I)}{P(H_A|I)}    (25)

(We use capital P for the probabilities of propositions such as "the data is D" or "hypothesis H" to distinguish them from probability distributions p of parameters.) Neither process is favored a priori, so we choose uniform prior probabilities for the hypotheses

P(H_B|I) = P(H_A|I)    (26)

so that

\frac{P(H_B|D, I)}{P(H_A|D, I)} = \frac{P(D|H_B, I)}{P(D|H_A, I)}    (27)

Therefore, we need only compute the likelihood ratio for the two hypotheses, also known as a Bayes factor. To compute the likelihood of the data given HA, we need to evaluate

P(D|H_A, I) = \int_0^{\infty} dk\, p(D|k, H_A, I)\, p(k|H_A, I)    (28)

and for HB

P(D|H_B, I) = \prod_{i=1}^{2} \int_0^{\infty} dk_i\, p(D_i|k_i, H_B, I)\, p(k_i|H_B, I)    (29)

where p(D|k,HA,I) and p(Di|ki,HB,I) are equivalent to the likelihood, eq 11. Caution is in order with respect to the choice of prior probability distributions p(k|HA,I) and p(k1,k2|HB,I) = p(k1|HB,I) p(k2|HB,I); here, we cannot choose improper distributions such as the uniform prior p(ki) = c or the Jeffreys prior p(ki) = c/ki. The normalization constant c is formally zero for such a distribution if ki ∈ (0,∞). We could include c as a parameter of the calculation, then take the limit c → 0 at the end of the calculation. In that case we will be left with two factors of c in the numerator of eq 27, because the numerator contains two rates, and one in the denominator because the denominator contains one rate. Therefore, the odds will be proportional to c; in the limit c → 0, we would always conclude, mistakenly, that the probability of HB is 0. Discussions of this phenomenon may be found elsewhere.39,40 Instead, inspired by Bretthorst, we will take a normalizable prior distribution for k, p(k|HA), that is broad enough to be flat in the region of the data. Thus, we can treat the likelihood part


of either of eqs 28 or 29 as a delta function compared to this prior, so that, for example

P(D|H_A, I) \approx p(k|H_A, I)\big|_{k=\hat{k}} \int_0^{\infty} dk\, k^n \exp[-kΘ]    (30)

where k̂ is the maximum likelihood estimate of k, k̂ = n/Θ (eq 12). The integral evaluates as

P(D|H_A, I) \approx p(k|H_A, I)\big|_{k=\hat{k}} \frac{n!}{Θ^{n+1}}    (31)

Since the prior for k is only required to be normalizable and very broad, a form for this distribution can be freely chosen for convenient calculation. Let us take a Gaussian on the positive numbers for the prior, or

p(k|H_A, I) = 2(2πσ^2)^{-1/2} \exp\left[ -\frac{k^2}{2σ^2} \right]    (32)

Equation 31 now becomes

P(D|H_A, I) \approx 2(2πσ^2)^{-1/2} \exp\left[ -\frac{\hat{k}^2}{2σ^2} \right] \frac{n!}{Θ^{n+1}}    (33)

We have introduced a hyperparameter, σ, the standard deviation of the prior for k. Because P(D|HA,I) is now implicitly dependent on an unknown hyperparameter σ, we must multiply by a prior for σ and integrate it out. In this case, we can choose an improper prior for σ without trouble, since we will choose the same prior for σ in the two-process case and (as will be evident) the improper normalization constants will cancel out. We choose the Jeffreys prior p(σ) = cσ^{-1}. The integral is

\int_0^{\infty} dσ\, \frac{c}{σ} (2πσ^2)^{-1/2} \exp\left[ -\frac{\hat{k}^2}{2σ^2} \right] = \frac{c}{2\hat{k}}    (34)

so

P(D|H_A, I) = \frac{c}{\hat{k}} \frac{n!}{Θ^{n+1}}    (35)

Likewise for the two-process model, we treat the normalizable prior probabilities for k1 and k2 as being essentially flat in the region of the data, so that

P(D|H_B, I) \approx \prod_{i=1}^{2} p(k_i|H_B)\big|_{k_i=\hat{k}_i} \int_0^{\infty} dk_i\, k_i^{n_i} \exp[-k_i Θ_i]    (36)

As before, we choose Gaussian priors for each rate; both priors are governed by the same parameter σ. Then, we eliminate σ by choosing the same hyperprior as before, p(σ) = cσ^{-1}, and integrating it out. The result is

P(D|H_B, I) = \frac{2c}{π} \frac{1}{\hat{k}_1^2 + \hat{k}_2^2} \frac{n_1!}{Θ_1^{n_1+1}} \frac{n_2!}{Θ_2^{n_2+1}}    (37)

Substituting for the maximum likelihood values of the rates, the odds ratio (eq 27) is

\frac{P(D|H_B, I)}{P(D|H_A, I)} = \frac{2}{π} \frac{n/Θ}{n_1^2/Θ_1^2 + n_2^2/Θ_2^2} \frac{n_1!\, n_2!}{n!} \frac{Θ^{n+1}}{Θ_1^{n_1+1}\, Θ_2^{n_2+1}}    (38)

Equation 38 is the most important result of this work. With this formula one can calculate the odds favoring two distinct processes as a function of the data alone, through the values of n1, Θ1, n2, and Θ2. It is based upon two reasonable assumptions: first, that the data come from single-exponential processes, and second, that our prior information is Gaussian but extremely vague in the region of the data. With eq 38, we can compute the odds without having to worry about extraneous parameters.

It is useful to have a scale for interpretation of the odds ratio calculation. Such an interpretation was given intuitive labels by Jeffreys.36,41 For an odds ratio of 3, Jeffreys suggests that the evidence favoring the two-process model is "barely worth mentioning." This is in accord with intuition: if, at the end of January 2009, one had believed that the odds of the Pittsburgh Steelers beating the Arizona Cardinals in Super Bowl XLIII were in the range 1:1 to 3:1, then one would not have made a serious mistake to favor the Cardinals, even with the odds slightly in favor of the Steelers. Given odds in the range 1:1 to 3:1, one would be most justified to be uncertain about the outcome. For odds ratios in the range 3:1 to 10:1, Jeffreys describes the evidence as "substantial," and for odds greater than 10:1, the evidence is "strong" favoring one outcome. If the odds of the Steelers winning Super Bowl XLIII had been 10:1, then one would have been foolish to favor the Cardinals; 10:1 odds indicate only a 9% chance of winning. Likewise with single-molecule rates, if we calculate eq 38 and get an odds ratio between 0.33 and 3, then we should be uncertain as to which hypothesis to favor. This result is a signal that more data should be collected, or that other means should be used to investigate differences in rates, in order to decide whether there is a difference.

It is instructive to examine the behavior of eq 38 at a couple of limits. For simplicity, we consider the case in which the simulation time of each data set is the same, Θ1 = Θ2 = Θ/2. Then we measure time in units of Θ1 = 1. With these simplifications, eq 38 is

R = \frac{2^{n_1+n_2+1}}{π} \frac{n_1 + n_2}{n_1^2 + n_2^2} \frac{n_1!\, n_2!}{(n_1 + n_2)!}    (39)

where R is the odds ratio under the constant-time restriction. For the first limit, we imagine two data sets for which n1 ≫ n2. We expect that R will be large in this limit, indicating different processes, and to grow with n1. Then the odds ratio becomes

R_{n_1 \gg n_2} = \frac{2^{n_1} n_2!}{π n_1}    (40)


which grows exponentially with n1, yielding greater and greater certainty as n1 increases (when n1 ≫ n2) that the two processes are different. For the second limit, we imagine a data set for which n1 = n2. We expect the odds ratio to give a small number in this limit, favoring the one-process model, and to decrease as n1 and n2 increase. In this limit, eq 39 becomes

R_{n_1 = n_2} = \frac{2^{2n_1+1}}{π n_1} \frac{(n_1!)^2}{(2n_1)!}    (41)

This function decays as n1^{-1/2} for large n1, yielding greater and greater certainty that two experiments that have n1 = n2 in the same time period do indeed correspond to the same process. R_{n1=n2} is less than 1 for all n1 > 1. For n1 = 1, R_{n1=n2} ≈ 1.27, which slightly favors two different processes. Even so, this is not evidence favoring either process; in the case n1 = n2 = 1, R_{n1=n2} ≈ 1.27 is indistinguishable from R_{n1=n2} = 1, which represents total ambiguity as to whether the two processes are the same or different. However, R_{n1=n2} does not predict results which would be considered "substantial" (R_{n1=n2} < 1/3) by Jeffreys' scale until n1 = n2 = 12, for which R_{n1=n2} ≈ 0.329. We emphasize, in order to be reasonably sure (i.e., get odds greater than 3:1 in favor) that two processes occur at the same rate, one needs to observe at least 12 transitions in each of the two experiments. Note that 12 is the minimum number of transitions that need to be observed in each set of trajectories required to conclude that the rates are the same, under Jeffreys' interpretation of the odds ratio. A good rule is to collect many more than 12 trajectories in order to ensure that one sees 12 transitions. Equation 38 is undefined if n1 = n2 = 0, so it can only be used if at least one barrier-crossing event has occurred; the method cannot tell the difference between two data sets of any total simulation length if no transitions have been observed.

3. Examples Using Molecular Dynamics Simulations of Protein Folding

In order to illustrate the use of the methods presented above, we apply them to a set of molecular dynamics simulations of a mutant of the human Hpin1 WW domain, Fip35.24 The three-stranded β-sheet Hpin1 proteins are an important model system for experimental24,31 and computational14,31 studies of β-sheet folding. For the purposes of this work, we are not interested in the details of Fip35 folding. Instead, we introduce these simulations to illustrate the calculations of section 2. Detailed interpretation of these protein folding simulations and their consequences for understanding β-sheet folding will be the topic of another paper.

The molecular dynamics simulations are described in ref 14. The simulations were initiated from two starting configurations of the human Hpin1 WW domain mutant, Fip35: a fully extended structure and a relaxed structure resulting from a short 20 ns equilibration of the fully extended structure, as described in a previous work. These structures and the presumptive native state were provided by Prof. K. Schulten (University of Illinois at Urbana-Champaign). The simulations used the AMBER-96 force field and the Onufriev-Bashford-Case (Type II) formulation20 of the generalized Born solvent model. Four solvent conditions were used, consisting of each combination of temperatures 300 and 330 K (close to the experimental temperature of fastest folding) and solvent friction coefficients (related to the inverse of viscosities) of 1 and 91 ps^-1, which will

TABLE 1: Statistics for Each Set of Simulations of Fip35 WW Domain^a

T/K   γ/ps^-1   structure   n    Θ/µs    〈k〉/ms^-1   std(k)/ms^-1
300   91        relaxed     1    199.8   10.01       7.08
300   91        extended    1    191.8   10.42       7.37
330   91        relaxed     4    258.7   19.33       8.64
330   91        extended    3    261.4   15.30       7.65
300   1         relaxed     0    209.3   4.78        4.78
300   1         extended    2    204.8   14.65       8.46
330   1         relaxed     7    254.3   31.46       11.12
330   1         extended    15   234.2   68.31       17.08

^a Four solvent conditions (each combination of temperatures 300 and 330 K with viscosities γ = 91 ps^-1 and γ = 1 ps^-1) were used, along with two starting configurations, extended and relaxed (see text). The number of folding events n and the total necessary simulation time Θ are sufficient to determine the posterior probability distribution of the rate k. The statistics 〈k〉 and the standard deviation std(k) are based upon the posterior derived using a uniform prior for the rate; the estimates thus derived differ most from those using a Jeffreys prior for the rate when few barrier-crossing events are observed (small n).

be referred to as "low viscosity" and "waterlike viscosity" hereafter. Simulations at one temperature and viscosity will be referred to as T300-γ91 for 300 K and γ = 91 ps^-1 and so forth. These simulations were run on a version of Folding@home designed and optimized for graphics processing units,42 which can calculate hundreds of nanoseconds per day for Fip35.

Table 1 summarizes estimates of k, using a uniform prior, for the Fip35 WW domain simulations described previously.14 The definition of "folding" is taken from that article; in brief, a snapshot was considered to be folded if the β-sheet root mean square deviation of Cα atom positions compared to the presumptive native state was less than 3 Å and if the conformations of the 10 β-sheet residues were "E" according to DSSP.43 The folding time for each trajectory was taken to be the time of the first snapshot for which these structural criteria were fulfilled. The sufficient statistics n and Θ are listed for each starting configuration and each solvent condition; these are the values that determine the posterior probability of k (either eq 15 for a uniform prior or eq 20 for a Jeffreys prior) and allow calculation of the first two moments (either eq 16 or 21). Because not every data set results in folding, we will use the posterior probability of k, eq 15, which was derived using a uniform prior for k, for all rate calculations. Nevertheless, the posterior distributions of the rate derived from both the Jeffreys and uniform priors are shown in Figure 1.

3.1. Comparing Rates for Fip35 Folding Simulation Data. We can interrogate the data in Table 1 in several ways. However, we limit the discussion to the following:

1. Determining, for each combination of temperature and viscosity, whether and to what degree the two different starting structures give different rates.

2. Determining, for each combination of temperature and starting structure, whether and to what degree the rate is increased when the viscosity decreases.

3. Determining, for each combination of viscosity and starting structure, whether and to what degree the rate is increased when the temperature is increased.

In order to determine whether two rates are the same or different, we calculate the Bayes factor, eq 38, using the values in Table 1. In order to evaluate how much the rate changes from one set of conditions to another, we calculate the probability distribution of the difference of rates, ∆ = k1 - k2, and the probability distribution of the ratio of rates, Q = k1/k2. These distributions are derived in Appendix A.
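Equation 38 requires only the four sufficient statistics of the two data sets. The Python sketch below is not part of the original paper (the log-gamma function is used to avoid overflow in the factorials and powers), but applied to the T330-γ1 entries of Table 1 it reproduces the 1.68 odds ratio reported for that condition in Table 2:

from math import exp, lgamma, log, pi

def odds_two_processes(n1, theta1, n2, theta2):
    """Odds favoring two distinct single-exponential processes, eq 38."""
    n, theta = n1 + n2, theta1 + theta2
    k1, k2 = n1 / theta1, n2 / theta2            # maximum likelihood rates
    log_odds = (log(2.0 / pi) + log(n / theta) - log(k1**2 + k2**2)
                + lgamma(n1 + 1) + lgamma(n2 + 1) - lgamma(n + 1)
                + (n + 1) * log(theta)
                - (n1 + 1) * log(theta1) - (n2 + 1) * log(theta2))
    return exp(log_odds)

# Relaxed vs extended starting structures at 330 K, gamma = 1 1/ps (Table 1):
print(odds_two_processes(7, 254.3, 15, 234.2))   # ~1.7, cf. Table 2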


Figure 1. Posterior probability distributions of rates for each set of Fip35 data, using both the uniform prior (solid) and the Jeffreys prior (dashed). Abbreviations "T300-γ91" and so forth refer to, for example, simulations run at 300 K and viscosity (collision frequency parameter) γ = 91 ps^-1. The extended structure is referred to as "ss1" and the relaxed structure as "ss2." In the T330-γ91 ss1 data, no folding events were observed, so the Jeffreys prior could not be used (see text). In all other cases, the uniform prior predicts slightly faster rates than the Jeffreys prior, which converge as more and more data is collected. This is most evident in the T330-γ1 ss2 data, for which there were 15 folding events. Here, the two distributions overlap very well.

TABLE 2: Hypothesis Test for Whether Different Starting Structures of Fip35 WW Domain Fold at Different Rates Is Ambiguous (Close to 1) for All Solvent Conditions^a

Are the Folding Rates of Different Starting Structures Different?

T/K   γ/ps^-1   odds favoring two processes   Jeffreys' conclusion      〈∆〉/µs^-1   std(∆)/µs^-1   〈Q〉    std(Q)
300   91        1.27                          barely worth mentioning   0.00042     0.010          1.92    ∞
330   91        0.65                          barely worth mentioning   -0.0040     0.012          1.68    1.13
300   1         1.27                          barely worth mentioning   0.0099      0.0097         0.49    ∞
330   1         1.68                          barely worth mentioning   0.037       0.020          0.49    0.48

^a In addition, the mean difference in rates 〈∆〉 (rate of equilibrated minus rate of extended) is small for all solvent conditions, within at most 2 standard deviations of zero. Finally, the mean ratio of rates 〈Q〉 (rate of equilibrated divided by rate of extended) is also small and within at most 2 standard deviations of unity. Thus, with the simulation data used for this calculation, one cannot tell with certainty whether the two different starting structures fold at different rates. Since 0 or 1 folding events were observed for the extended structure in the T300-γ91 data and the T300-γ1 data, the second and higher moments of the distributions of Q are infinite (eq 54).
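The 〈∆〉 and 〈Q〉 columns above follow directly from the two gamma posteriors (eq 15). Even without the closed-form distributions of Appendix A, they can be estimated by Monte Carlo sampling of the posteriors; the Python sketch below is not part of the original paper (it assumes NumPy) and reproduces the T330-γ1 row of Table 2 to sampling accuracy, up to the sign convention for ∆:

import numpy as np

rng = np.random.default_rng(0)
draws = 1_000_000

def posterior_samples(n, theta):
    # eq 15 (uniform prior): k ~ Gamma(shape = n + 1, rate = theta)
    return rng.gamma(shape=n + 1, scale=1.0 / theta, size=draws)

k_relaxed = posterior_samples(7, 254.3)     # T330-gamma1, relaxed; rates in 1/us
k_extended = posterior_samples(15, 234.2)   # T330-gamma1, extended

delta = k_relaxed - k_extended              # rate difference
ratio = k_relaxed / k_extended              # rate ratio Q
print(delta.mean(), delta.std())            # about -0.037 and 0.020 per microsecond
print(ratio.mean())                         # about 0.49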

In the course of these calculations, we will find that there is no evidence that the rates of folding for the two starting structures (item 1) are different: the evidence favoring "the same" or "different" rates according to eq 38 is ambiguous. Nevertheless, we will calculate the differences in rates, eq 49, which will aid in understanding why the evidence is ambiguous. In Appendix B, we also derive the posterior distribution, eq 55, appropriate for analyzing data for different starting structures. However, since we cannot unambiguously determine whether the starting structures fold at different rates, we will pool the data from the two starting structures rather than using that distribution to get a posterior distribution of the rates independent of starting structure. While not the formally correct approach, this approach allows us to illustrate the other two calculations (items 2 and 3) with a larger and simpler set of data. Rigorously including the effects of multiple starting structures would make the formulas in section 2.4 unnecessarily complicated for the purposes of illustration.

3.1.1. Differences in Rate for the Two Fip35 Starting Structures. First, we address the difference in folding rate for the two starting structures. The relaxed structure was generated from the fully extended structure by a 20 ns simulation at 337 K;31 one might expect the relaxed structure to be more typical of structures from the unfolded ensemble. As a consequence, the folding rate of the relaxed structure should approximate the folding rate of the unfolded ensemble better than the fully extended structure. On the other hand, the phase space distance between the two conformations may be so small that a 20 ns equilibration may be kinetically insignificant in simulations like these, which average more than 200 ns in length.14 To address this question, we calculate eq 38 for data at fixed solvent conditions and with data originating from each starting configuration. The odds ratios are listed in Table 2. All odds ratios

thus calculated are between ∼0.33 and 3, so none of these data are conclusive enough to determine whether there is a substantial kinetic difference between the starting structures or not. In the best case, low viscosity and 330 K, seven folding trajectories were observed for the relaxed structure and 15 were observed for the extended structure, but even here the odds are only 1.68 to 1 in favor of two separate processes. We would only be convinced that the folding rate is different upon seeing odds greater than 3:1 in favor of either hypothesis.

Even though we cannot tell by hypothesis testing whether the two starting structures fold at different rates, we can gain further insight into the kinetics of the two starting structures by examining the probability distribution of the difference in rates for each solvent condition. The rate difference calculation can allow for interpretation of the hypothesis test. To this end, the differences in rates between the two starting configurations (rate of equilibrated structure folding minus extended structure folding) are shown in Figure 2. For the mildest solvent condition, T300-γ91, the rates appear virtually identical because the distribution of the difference is almost symmetric about ∆ = 0. The mean rate difference is -4.2 × 10^-4 µs^-1 for a very symmetric-looking distribution. The standard deviation is much larger, 0.01 µs^-1, revealing to a reasonable degree of accuracy no difference in rate. This corroborates the hypothesis test, which could not detect a difference in the rates. For the T330-γ91 data, the mean difference in rate is 0.0040 µs^-1, but with a standard deviation of about 0.012 µs^-1 the difference is still indistinguishable from zero. We see the mean and mode shift away from zero for the low-viscosity simulations. At 300 K and low viscosity, the mean rate difference is -0.0099 µs^-1, comparable to the standard deviation of 0.0097 µs^-1. Thus a difference in rate is tentatively plausible for these simulations.


Figure 2. Distributions of rate differences between the two starting structures for each solvent condition (rate of equilibrated structure folding minus extended structure folding). Only the data at 330 K shows a faster folding rate for one of the structures, but the hypothesis test indicates that the difference in rates is ambiguous (see text).

Figure 3. Distributions of rate ratios between the starting structures for each solvent condition. The top row is Q = kequilibrated/kextended and the bottom row is Q′ = 1/Q = kextended/kequilibrated. Only the data at 330 K shows a faster folding rate for the equilibrated structure, even though the hypothesis test is ambiguous. The dotted line emphasizes Q = 1 or Q′ = 1. The solid line is the mean of each distribution; the distribution of Q′ for T330-γ91 does not have a finite mean.

TABLE 3: Rate Distributions for Fip35 Folding, Combining Data from the Two Different Starting Structures^a

T/K   γ/ps^-1   n    Θ/µs        〈k〉/ms^-1   std(k)/ms^-1
300   91        2    391.69825   7.66        4.42
330   91        2    414.06100   7.25        4.18
300   1         7    519.83835   15.39       5.44
330   1         22   483.72055   47.55       9.91

^a At each solvent condition (temperature T and inverse viscosity γ), n folding events were observed in time Θ. Values for Θ are exact, not approximate.

For the data with the most folding events, T330-γ1, the mean difference is -0.037 µs^-1 with standard deviation 0.020 µs^-1. There may actually be a convincing rate difference in this case because of the greater number of barrier-crossing events, but the difference is very small, again corroborating the hypothesis test.

Rate ratios for the starting structures are shown in Figure 3 for kequilibrated/kextended (top row) and kextended/kequilibrated (bottom row). Each distribution has significant density near Q = 1 (or Q′ = 1). Together with the very small rate differences, this indicates that the two starting structures have essentially identical kinetics at the level of resolution allowed by these data sets; many more barrier-crossing events (larger n's) would be needed to distinguish different kinetics. For this reason, we will pool data from each starting structure and compute overall folding rate distributions for each solvent condition rather than apply eq 55. Pooling the data provides a more precise estimate of the folding rate, because more data is used to estimate one rate; the results of the subsequent calculations will thus be more precise.

3.1.2. Fip35 Folding at Different Temperatures. We now turn to the question of whether rates are different at different temperatures, using the combined (by starting structure) data in Table 3. The statistics (odds ratio, mean and standard deviation of the rate difference, and mean and standard deviation of the rate ratio) are shown in Table 4. For the waterlike-viscosity simulations, the odds calculation slightly favors the single-rate hypothesis (odds of 0.85 favoring two processes), but it is still ambiguous (odds < 3). This is an incredible result, since folding is known

to be faster at higher temperature in experiments (presumably nearer to γ = 91 ps^-1 than to γ = 1 ps^-1), but since only two trajectories folded at each solvent condition, we cannot tell whether there is a rate difference. However, the low-viscosity simulations convincingly fold faster at higher temperature, with the odds favoring two processes by almost 20:1.

3.1.3. Fip35 Folding at Different Viscosities. Theoretical considerations44,45 have indicated that Langevin dynamics simulations at low viscosity should yield faster kinetics. We utilize the data in Table 3 to compute the ratio of the rate at low viscosity (γ = 1 ps^-1) to the rate at high viscosity (γ = 91 ps^-1) at each temperature. The results are summarized in Table 5. The results clearly show that decreasing the viscosity increases the rate, but the fractional increase upon a 91-fold decrease in viscosity is not likely to be 91×. If collapse were the dominant process for folding in this system, then the rate at low viscosity would be 91 times that for waterlike viscosity. This suggests that intrachain interactions ("internal friction") play a large role in the folding of Fip35 WW domain.

4. Discussion

Perhaps surprisingly, the above analyses can be used even when few barrier-crossing events have been observed, although in this case the result is broad probability distributions and odds ratios close to 1, i.e., vagueness. However, if one is willing to choose uniform priors, then rates may be estimated even when no barrier-crossing events have occurred. This is an intuitively appealing result, as observing no barrier-crossing events in a time period would suggest strongly that the rate is less than one barrier-crossing event per time period. The Jeffreys prior is more conservative, requiring at least one barrier-crossing event to yield a useful probability distribution.

The ability to estimate rates without observing barrier-crossing events is particularly interesting with regard to estimating equilibrium constants, the most important of rate ratios. If one runs two sets of simulations in each of the forward and backward directions of some reaction, then as long as at least one barrier-crossing event is observed in either direction, one would be


TABLE 4: Hypothesis Test for Whether Fip35 WW Domain Folds Faster at Higher Temperature, at Two Viscosities^a

Is the Folding Rate Faster at 330 K than at 300 K?

γ/ps^-1   odds favoring two processes   Jeffreys' conclusion      〈∆〉/µs^-1   std(∆)/µs^-1   〈Q〉    std(Q)
91        0.85                          barely worth mentioning   -0.00041    0.0061         1.42    1.83
1         19.6                          strong evidence           0.032       0.011          3.50    1.63

^a At waterlike viscosity, there is no significant evidence that the rate is faster at 330 K than at 300 K, indicated by the odds ratio close to 1, the mean difference in rate 〈∆〉 being about 1 standard deviation from zero, and the mean rate ratio 〈Q〉 being about 1 standard deviation from 1. However, for the low-viscosity simulations, there is significant evidence for a difference in rates, indicated by the large odds ratio, the mean rate difference 〈∆〉 being about 3 standard deviations from zero, and the mean rate ratio 〈Q〉 being about 2 standard deviations from unity.

TABLE 5: Hypothesis Test for Whether Folding Is Faster at Different Viscosities Yields Conclusive Evidence of a Difference in Rates^a

Is the Folding Rate Faster at Low Viscosity than at Waterlike Viscosity?

T/K   odds favoring two processes   Jeffreys' conclusion      〈Q〉     std(Q)   P(Q ≥ 91)
300   1.040                         barely worth mentioning   3.012    3.37     0.000064
330   465                           decisive evidence         9.75     10.16    0.0016

^a However, the increase in folding rate is modest (no more than 10× at 330 K) when the viscosity is reduced by a factor of 91. If collapse of the polypeptide chain were to completely dominate the folding mechanism, we would expect a value of 〈Q〉 much closer to 91, but the probability that the rate increases that much (P(Q ≥ 91)), according to these data, is very small.

able to estimate the free energy difference from the distribution of the equilibrium constant. This is immensely powerful when one of the processes is difficult to observe because it is slow, but the reverse process is fast, for example, when estimating binding free energies. One would not expect to see very many binding events of a drug molecule in an MD simulation, but unbinding may be easy to see. Thus, one can estimate the free energy of binding, but only using a uniform prior for the binding rate.

However, with a small number of observed barrier-crossing events, the results (especially early moments of the distributions) depend greatly on the chosen prior. Indeed, for one observed barrier-crossing event, using a uniform or a Jeffreys prior yields mean rates which disagree by a factor of 2. While such a disagreement may be insignificant in light of experimental rates (a calculated value within an order of magnitude of an experimental rate is often considered to be "accurate"), it would be more satisfactory to reduce the role that prior ignorance plays with regard to point estimates of the rate. At the end of the day, one should trust distributions derived from the Jeffreys prior more than the uniform prior, although great trust in either result would be misplaced whenever there is significant dominance by the prior. When so few barrier-crossing events are observed that the priors give different estimates, one should be cautious in drawing definite conclusions from point estimates. This is evident when comparing the folding rates of Fip35 at waterlike viscosity at 300 and 330 K. Fip35's temperature of maximum folding rate is known from experiment to be about 337 K. We cannot tell with the data set at hand, two folding events at each temperature, whether the rate at the two temperatures is different or not. That we observed four folding trajectories is insufficient data to reliably detect a known (experimental) difference in folding rate. This is a significant concern for simulation strategies which aim to use one folding trajectory (say, one at each temperature) to understand folding. One should instead run substantially more than two trajectories (which both fold) at each temperature.

We can quantify the effect of sparse data on the possible conclusions about rates. Using a uniform prior, the ratio of the standard deviation of k to its expectation is

w = \frac{\sqrt{\mathrm{var}(k)}}{\langle k \rangle} = \frac{\sqrt{n+1}}{Θ} \frac{Θ}{n+1} = (n+1)^{-1/2}    (42)

which does not depend on the total simulation time. Therefore, if one desires to compute an accurate rate from protein folding simulations, the simulations should be designed in such a way as to increase the number of folding events that are likely to be observed. One trajectory is a poor way to estimate an accurate single-exponential forward rate, since only one folding event may ever be observed (unless the simulation is long enough to unfold and refold, probably a long wait!). We expect this fact to hold not only for single-exponential processes but for more complicated multiple-exponential and stretched-exponential processes as well. To inform simulation design, we can calculate the number of trajectories one should run for a certain amount of time in order to obtain a desired accuracy in the rate. If one desires to calculate the folding rate such that the standard deviation lies within a factor of 1/r of the expectation, then, using eq 42, we must design the simulations so that n* = r^2 - 1 folding events are expected. With some a priori estimate of the folding rate, kest, the probability of observing n* folding events in a set of N simulations of length τ is the binomial distribution

p(n = n^* \mid N, \tau, k_{\mathrm{est}}) = \binom{N}{n^*}\, p_f^{\,n^*} (1 - p_f)^{N - n^*} \qquad (43)

where the probability of folding in a single simulation of length τ, given the rate estimate kest, is

p_f = 1 - \exp[-k_{\mathrm{est}}\tau] \qquad (44)

The expected number of folding events is Npf, so setting this equal to n* shows that one should run

N = \frac{r^2 - 1}{1 - \exp[-k_{\mathrm{est}}\tau]} \qquad (45)

simulations of length τ to have a reasonable chance of observing enough folding events to obtain the required accuracy.
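As a minimal illustration (ours, not part of the original analysis), eqs 42-45 can be combined into a short design calculation; the function name trajectories_needed and the unit conventions are our own, and the numerical inputs anticipate the Fip35 example in the paragraph below.

```python
import math

def trajectories_needed(k_est, tau, r):
    """Number of simulations of length tau needed so that the expected number
    of folding events equals n* = r**2 - 1 (eqs 42-45).

    k_est : a priori estimate of the folding rate (1/time)
    tau   : length of each simulation (time, same units as 1/k_est)
    r     : target ratio <k>/std(k); r = 5 puts the standard deviation at 20% of the mean
    """
    n_star = r ** 2 - 1                      # required number of folding events (eq 42)
    p_fold = 1.0 - math.exp(-k_est * tau)    # probability that one trajectory folds (eq 44)
    return n_star / p_fold                   # eq 45

# Fip35 numbers from the text: k_est = 77 ms^-1 = 1/(13 us), tau = 13 us, r = 5
print(trajectories_needed(k_est=1.0 / 13.0, tau=13.0, r=5))  # ~38 trajectories
```

With these inputs the script reproduces the estimate of roughly 38 trajectories (about 24 expected folding events) quoted below.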

Figure 4. Posterior probability distributions of the rate, treating each starting structure as equivalent. Because the number of folding events n is larger for each solvent condition (combination of temperature and viscosity) than for any single starting structure, combining the data in this way allows for better estimates of the rates. As a result, probability distributions of the rates may be computed using both uniform (solid) and Jeffreys (dashed) priors. Lower-viscosity simulations seem to fold faster than waterlike-viscosity simulations at the same temperature, as expected.

For example, let us calculate the number of 13 µs trajectories of Fip35 required to get the standard deviation of the rate within 20% of the mean. Fip35 folds with an experimental rate of kest = 77 ms⁻¹ (one event per 13 µs). Using eq 45 with r = 1/(20%) = 5, τ = 13 µs, and kest = 77 ms⁻¹, one should run 38 trajectories (which should generate about 24 folding trajectories) to get the standard deviation of the rate within 20% of the mean. Therefore, in order to obtain even modest accuracy in the rate, dozens of barrier-crossing events are required in a set of long trajectories. As noted above, dozens of barrier-crossing events (12 for each set of simulations) are also required by the hypothesis test to determine whether two sets of conditions generate trajectories that cross a barrier at the same rate. Showing that two rates are different can require less simulation; for a set of simulations with one barrier-crossing event, n2 = 1, solving eq 39 numerically shows that one needs at least n1 = 7 barrier-crossing events in the same amount of simulation time in order to tentatively conclude that the two sets of simulations are different. For higher values of n1, the value of n2 required to conclude a difference in the rates increases linearly (with a slope less than unity). Note that these guidelines assume the same amount of simulation time (Θ1 = Θ2); one may get away with fewer barrier-crossing events depending on the time it took to cross the barrier.

We emphasize that these guidelines (24 barrier-crossing events) support only modest conclusions about the accuracy of a rate or about whether two sets of simulations show the same or different rates. Odds of 3:1 in favor of the same or different rates are hardly conclusive about whether such a difference actually exists. More than 12 barrier-crossing events would be required to come to convincing conclusions with regard to the value of the rate or the sameness or difference between two rates. Finally, because the time scales in question may be very long, and due to computational restrictions, one should probably run many times more than 12 trajectories per set of simulation parameters; this increases the wall-clock time expected to obtain the requisite number of barrier-crossing events needed to obtain accurate rates and to have confidence in hypothesis tests.

5. Conclusion

In this paper, we have developed a simple and straightforward Bayesian framework for describing the rates of single-exponential processes. With these methods, one can compute probability distributions of rates from individual starting structures or from equilibrium unfolded ensembles. Equilibrium rates can be inferred from nonequilibrium unfolded states. With rate distributions in hand, one can easily compare pairs of single-exponential rates by calculating the probability that the two rates correspond to different processes. The difference between the rates, and their ratio, can also be calculated. These calculations are useful not just for protein folding simulations but for any other single-molecule simulations and experiments where a single-exponential rate is of interest. Any such endeavor only requires that transitions be detectable.

We have assumed for the purposes of this paper, as is common when interpreting protein folding simulations, that the folded state can be defined by a set of structural characteristics of the snapshots in the simulation. The calculation makes no reference to whether the folded state was reached stably, only to the point at which it was reached. Accurate computation of the rate would benefit from more robust definitions of the target state. However, this brings up the question of whether any given process truly exhibits two-state or single-exponential kinetics: if the folded state is not simple to define, or if a reasonable definition of the folded state does not yield stably folding trajectories, then the obvious implication is that the system is sampling more than two states. The theory as presented here deals only with single-exponential folding. Multiple-exponential processes cannot be investigated in this framework simply by replacing the likelihood in eq 7 with a multiple exponential and letting the calculation proceed apace. Instead, for a given starting configuration, one must separately compute rates to the folded state and to each presumptive alternate state. This would usually require a method such as structural clustering to identify putative states; the theory presented here can then be used to compute single-exponential rates of transfer between any pair of states. If such a calculation were feasible, the end result would be the probability distribution of a rate matrix describing transitions between any pair of states. In future work, we will present a Bayesian approach for clustering molecular simulation and single-molecule data on observed degrees of freedom; existing clustering techniques may be used as well.11 The choice of uniform priors will ensure that the resultant rate matrix is irreducible even when not all transitions are actually observed; any matrix sampled from this distribution is automatically irreducible, as the rates are positive. Finally, sampling procedures exist38 which can be used to ensure that the matrix is also reversible. Thus, a rate matrix generated in this way may describe a Markov process on the states which captures all of the kinetics and thermodynamics of the system.

Finally, we must emphasize the conclusions of the statistical approach to simulation design. There are physical parameters that may be calculated from single MD trajectories,46 but we have shown here that accurate rate calculations and rate comparisons require a multitrajectory approach. In the limit of the simplest single-exponential protein folding process, where there is approximately one transition state structure, one might expect that a very few simulations should suffice to characterize folding, but our analysis here shows that a good single-exponential rate cannot be calculated with fewer than dozens of trajectories that all result in barrier crossing. That multiple trajectories are required to estimate an accurate single-exponential rate strongly indicates that more data and more detailed analyses will be required for more complicated kinetics.
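As a sketch of this rate-matrix construction (our illustration, not the paper's method): if the rate from state i to state j is given an independent single-exponential posterior of the form used above (uniform prior, with counts of observed i→j transitions and a total waiting time in state i), one sampled rate matrix might be drawn as follows. The state labels, counts, and waiting times below are invented, and treating the competing exit rates as independent gamma posteriors is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: counts[i, j] = observed i -> j transitions, theta[i] = total time spent in state i
counts = np.array([[0, 2, 1],
                   [1, 0, 0],
                   [0, 3, 0]])
theta = np.array([5.0, 8.0, 4.0])
n_states = len(theta)

def sample_rate_matrix():
    """Draw one rate matrix with off-diagonal entries k_ij ~ Gamma(counts_ij + 1, rate=theta_i).
    All off-diagonal rates are strictly positive, so the sampled matrix is irreducible;
    reversibility is not enforced here."""
    K = np.zeros((n_states, n_states))
    for i in range(n_states):
        for j in range(n_states):
            if i != j:
                K[i, j] = rng.gamma(shape=counts[i, j] + 1, scale=1.0 / theta[i])
        K[i, i] = -K[i].sum()  # rows of a rate matrix sum to zero
    return K

print(sample_rate_matrix())
```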

Acknowledgment. The authors thank Mr. S. Bacallado and Dr. V. Voelz for informative discussions on many Bayesian problems. This work was supported by the NIH (Grant R01-GM062868).

Appendix A: Derivations of the Difference of Rates and the Ratio of Rates Distributions

A.1. Difference between Two Rates. Beyond deciding whether two data sets arise from the same or from different processes, one can compute the probability distribution of the difference of two rates, ∆ = k1 − k2, from the posterior distributions of the rates k1 and k2. Suppose that we have two simulations or two experiments under different conditions, resulting in data sets D1 = {n1, Θ1} and D2 = {n2, Θ2}. In the first set of simulations, we observe n1 barrier-crossing events in time Θ1, and in the second set we observe n2 crossings in time Θ2. With these data, the posterior probability distributions of the two rates are (choosing uniform priors)

f(k_1 \mid D_1, I) = \frac{\Theta_1^{n_1+1}}{n_1!}\, k_1^{n_1} \exp[-k_1 \Theta_1] \qquad (46)

and

g(k_2 \mid D_2, I) = \frac{\Theta_2^{n_2+1}}{n_2!}\, k_2^{n_2} \exp[-k_2 \Theta_2] \qquad (47)

Here we are careful to distinguish between these two distributions, f(k1) and g(k2), because they are different functions with different arguments; using the same symbol p for both would be misleading. The probability distribution of ∆ is computed from distributions (46) and (47) subject to the constraint ∆ = k1 − k2. Mathematically, this is

p(\Delta \mid D, I) = \int dk_1 \int dk_2\, \delta(\Delta - (k_1 - k_2))\, f(k_1 \mid D, I)\, g(k_2 \mid D, I) \qquad (48)

which is integrated for two regimes, ∆ < 0 and ∆ ≥ 0. The result is

p(\Delta \mid D, I) =
\begin{cases}
\int_0^\infty dk_1\, f(k_1 \mid D, I)\, g(k_1 - \Delta \mid D, I), & \Delta < 0 \\
\int_0^\infty dk_2\, f(k_2 + \Delta \mid D, I)\, g(k_2 \mid D, I), & \Delta \ge 0
\end{cases} \qquad (49)

A.2. Ratio of Two Rates. The ratio of rates obtained at two different viscosities, for example, may have much to say about protein folding mechanisms: changes in viscosity will not affect the rates of mechanisms dominated by rearrangements in a collapsed state. Let us again assume our data come from two experiments or sets of simulations, D1 = {n1, Θ1} and D2 = {n2, Θ2}, and use uniform priors for the rates, as in eqs 46 and 47. The ratio of the rates for the two data sets is

Q = \frac{k_1}{k_2} \qquad (50)

The probability density of Q may be constructed from the posterior distributions for the rates by integrating over the individual rates subject to the constraint (50):

p(Q \mid D, I) = \int dk_1 \int dk_2\, \delta\!\left(Q - \frac{k_1}{k_2}\right) g(k_2 \mid D, I)\, f(k_1 \mid D, I) \qquad (51)

Using the identity δ(ax) = (1/a)δ(x) and integrating over k1,

p(Q \mid D, I) = \int dk_2\, k_2\, g(k_2 \mid D, I)\, f(Qk_2 \mid D, I)

or, with eqs 46 and 47,

p(Q \mid D, I) = \frac{(n_1 + n_2 + 1)!}{n_1!\, n_2!}\, \frac{Q^{n_1}\, \Theta_1^{n_1+1}\, \Theta_2^{n_2+1}}{(Q\Theta_1 + \Theta_2)^{n_1+n_2+2}} \qquad (52)

An analogous and equally simple result can be derived if one instead prefers Jeffreys priors for the rates. There is an interesting effect in the probability distribution of Q (eq 52) if there are no barrier-crossing events for either process (n1 = n2 = 0). In this case, one obtains an ill-behaved distribution of the kind

p(x) =
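As a numerical cross-check of these appendix results (our illustration, assuming NumPy is available; the values of n1, Θ1, n2, Θ2 are invented), the distributions of ∆ and Q can be sampled directly from the gamma posteriors of eqs 46 and 47 and compared with the closed form of eq 52:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)

# Hypothetical data sets: n1 events in total time Theta1, n2 events in Theta2 (arbitrary time units)
n1, Theta1 = 4, 10.0
n2, Theta2 = 2, 10.0

# Uniform priors: each rate posterior is Gamma(n + 1, rate=Theta)  (eqs 46 and 47)
k1 = rng.gamma(shape=n1 + 1, scale=1.0 / Theta1, size=200_000)
k2 = rng.gamma(shape=n2 + 1, scale=1.0 / Theta2, size=200_000)

delta = k1 - k2   # samples from p(Delta | D, I), eq 49
Q = k1 / k2       # samples from p(Q | D, I), eq 52

print("P(Delta > 0) =", np.mean(delta > 0))  # probability that the first rate is larger
print("P(Q >= 2)    =", np.mean(Q >= 2.0))   # tail probability of the rate ratio

# Closed-form density of Q (eq 52) evaluated at Q = 1, for comparison with a histogram of the samples
q = 1.0
pdf_q = (factorial(n1 + n2 + 1) / (factorial(n1) * factorial(n2))
         * q**n1 * Theta1**(n1 + 1) * Theta2**(n2 + 1)
         / (q * Theta1 + Theta2)**(n1 + n2 + 2))
print("p(Q = 1 | D, I) =", pdf_q)
```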