Kinetics from Replica Exchange Molecular Dynamics Simulations

Jun 28, 2017 - Transitions between metastable states govern many fundamental processes in physics, chemistry and biology, from nucleation events in ph...
11 downloads 17 Views 1MB Size
Article pubs.acs.org/JCTC

Kinetics from Replica Exchange Molecular Dynamics Simulations Lukas S. Stelzl† and Gerhard Hummer*,†,‡ †

Department of Theoretical Biophysics, Max Planck Institute of Biophysics, Max-von-Laue Straße 3, 60438 Frankfurt am Main, Germany ‡ Institute for Biophysics, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany S Supporting Information *

ABSTRACT: Transitions between metastable states govern many fundamental processes in physics, chemistry and biology, from nucleation events in phase transitions to the folding of proteins. The free energy surfaces underlying these processes can be obtained from simulations using enhanced sampling methods. However, their altered dynamics makes kinetic and mechanistic information difficult or impossible to extract. Here, we show that, with replica exchange molecular dynamics (REMD), one can not only sample equilibrium properties but also extract kinetic information. For systems that strictly obey first-order kinetics, the procedure to extract rates is rigorous. For actual molecular systems whose long-time dynamics are captured by kinetic rate models, accurate rate coefficients can be determined from the statistics of the transitions between the metastable states at each replica temperature. We demonstrate the practical applicability of the procedure by constructing master equation (Markov state) models of peptide and RNA folding from REMD simulations.



INTRODUCTION Many fundamental processes in physics, chemistry, and biology, from nucleation in phase transitions to protein folding, are governed by rare transitions between metastable states. For the underlying mechanisms to be explored, transition events have to be resolved and studied, which remains a formidable task for both experiment and theory. Single-molecule experiments are starting to shed light on molecular-scale transitions,1,2 albeit with limited structural and temporal resolution. Molecular dynamics (MD) simulations can provide an atomistic view of complex molecular processes, but transitions are often too rare to occur spontaneously on the MD time scale. For this challenge to be surmounted, purpose-built hardware3 and a wide range of methods to enhance sampling in MD have been developed, including replica exchange molecular dynamics simulations (REMD).4,5 Here, we show how kinetic information can be extracted from REMD and related enhanced sampling methods that produce only discontinuous trajectories. REMD can significantly accelerate the exploration of free energy landscapes.6,7 By exchanging configurations between replicas of the system at different temperatures (Figure 1 and Figure S1), fast barrier crossing dynamics at elevated temperatures also improves the sampling at lower temperatures. The replica-exchange Monte Carlo moves preserve canonical equilibrium distributions at each of the temperatures. The success of REMD and the related simulated tempering method8,9 has motivated the development of approaches with refined sampling strategies, such as Hamiltonian replica exchange10,11 and replica exchange with solute tempering.12 Enhanced sampling methods such as REMD modify the dynamics, thereby preventing the direct extraction of kinetic © 2017 American Chemical Society

information. However, rate coefficients are among the few readily measurable parameters that report on transitions, and thus their calculation from simulations is particularly important. Kinetic rates can be extracted indirectly from biased simulations such as umbrella sampling13,14 and metadynamics15,16 if the dynamically relevant degrees of freedom are known for the process of interest. Extracting kinetics directly from REMD may seem impossible due to the nonphysical exchange of the configurations between the replicas.5 Replicas are exchanged frequently, typically on a picosecond scale, to enhance the movement of the configurations through the temperature space and thus the efficiency of sampling.7,17 Equilibrium timecorrelation functions can, in general, only be extracted on time scales comparable to the exchange-attempt time δREMD, which is much shorter than the characteristic times of large-scale conformational dynamics. Consequently, the transition rates between the different conformational states cannot be determined straightforwardly. For the special case of a twostate system, rate coefficients for peptide folding were determined from REMD by assuming an Arrhenius temperature dependence.18 Diffusion13,19 and master equation descriptions offer possible routes for non-Arrhenius20 and multistate kinetics. Master equations track the time-evolution of the conformational states and have provided a qualitative understanding of reaction mechanisms from REMD simulations21 and of the REMD sampling efficiency.7 By exploiting the fact that the short trajectories between replica exchanges Received: April 7, 2017 Published: June 28, 2017 3927

DOI: 10.1021/acs.jctc.7b00372 J. Chem. Theory Comput. 2017, 13, 3927−3935

Article

Journal of Chemical Theory and Computation

Figure 1. Schematic representation of REMD simulations for a two-state/two-temperature system. (A) Replica trajectories at two temperatures T1 and T2 are mapped onto discrete states U (blue) and F (orange). Replica exchanges are indicated by thin crossing arrows, and transitions U → F and F → U are indicated by block arrows. Trajectory segments a in state j (here, j = U, F) and replica α (here, α = 1, 2) uninterrupted by replica exchanges or state transitions have durations t(α) j,a . (B) In real systems, transitions between states are not instantaneous. We assign, by definition, half of the time of a transition path (mauve) each to the initial and final state. (C) If replica exchanges occur during a transition path, the transition event is assigned to each visited temperature in a probabilistic manner, resulting in fractional increments to the transition counts.

We then use a maximum likelihood procedure to estimate kinetic rate coefficients kαij for j → i transitions at Tα. The dynamics of the populations p(α) i (t) of states i at temperature Tα in time t is assumed to obey first-order kinetics, dp(α) i (t)/dt = (α) (α) ∑j(≠i)[kαij p(α) j (t) − kji pi (t)]. The likelihood L of a particular REMD trajectory (and thus its path action within the kinetic model) can be expressed as the probability of the observed times t(α) j,a and counts of the number of transitions from j to i, N(α) ij . Because the following analysis applies independently to each temperature Tα, we drop the superscript (α) for notational simplicity. For first-order kinetics, the probability to transition from state j to state i is p(j → i) = kij/∑l(≠j)klj. The lifetimes t in state j are exponentially distributed with a normalized probability density p(t|j) = ∑l(≠j)klj exp(−∑i(≠j)kijt). The exponential decay rate is the sum of all outgoing rates from state j. The probability to survive in j for a time t without a transition, S(t|j), is one minus the cumulative distribution function of the lifetimes, S(t|j) = 1 − ∫ 0t p(t′|j)dt′ = exp(−∑l(≠j)kljt). The likelihood of the set of trajectory segments at a particular temperature can thus be written as a product over the M states

evolve according to the true dynamics, they also provide a framework to extract rates.22 Estimating the rate coefficients from REMD by the method of Buchete and Hummer has provided insight,23 e.g., into amyloid peptide dimerization kinetics,24 but remains challenging. In the method of ref 22, rate coefficients are extracted by fits to the short-time Green’s function for the dynamics in the space of discrete states, requiring unambiguous assignments of transitions between states to a particular temperature and thus transition paths that are short on the time scale of replica exchanges. Pooling of information from all simulated temperatures to estimate the rate coefficients at a specific temperature improves the statistics14,25,26 and can be extended to extract other dynamical properties such as correlation functions.27,28 Nonetheless, the kinetics encoded in REMD trajectories is not yet extracted routinely, in effect wasting useful dynamical information. Here, we show how kinetic information can be extracted directly from a widely used enhanced sampling method, REMD, taking advantage of good sampling and requiring no additional simulations. We assume that we can identify states whose population dynamics follows exponential rate kinetics. Transition events are then uncorrelated from the preceding history. This Markovianity allows us to extract rates from the statistics of transition events in discontinuous REMD trajectories as shown in Theory and in simulations of a minimal model (Supporting Information). We demonstrate the practical applicability of our formalism in simulations of a peptide fragment (alanine dipeptide, Ala2) and a coarse-grained polymer model of the neomycin riboswitch RNA (n-rb). The same formalism, with slight adaptations, allows us to extract kinetics from biased simulations that accelerate dynamics,29−32 as shown in the accompanying paper.33

L =

∏ ∏

S(t j , a|j)

states j a ∈ Rex

=

∏ ∏

p(j → i)p(t j , b|j)

states i b ∈ Trans

N



kij ij exp( −kijt j)

i , j(i ≠ j)

⎛ kijpj N kij ij⎜⎜ ⎝ pi j=2 i=1 M

=

j−1

∏∏

⎞ Nji ⎟⎟ exp[−kij(t j + tip /p )] j i ⎠ (1)



where we factored out time segments a ending in replica exchanges (Rex) and time segments b ending in transitions (Trans). To get to the second line, we substituted the explicit expressions for p(j → i), p(t|j), and S(t|j), rewrote the product of exponentials as an exponential of the sum of times, and grouped identical prefactors kij. Here, tj = ∑a∈Rex tj,a + ∑b∈Trans tj,b is the total time spent in state j. To get to the last line, we imposed detailed balance, kijpj = kjipi (where pi is the equilibrium probability of being in state i), rewriting the likelihood in terms of rate matrix entries above the diagonal, kij (i < j), and the set of pi.37 In essence, the exponential term in our continuous-time likelihood, eq 1, corresponds to the terms for j → j transitions in the likelihood function of discrete-time N Markov state models26,38 LMSM = ∏i∏j pij ij, and the terms involving powers of kij to j → i (≠ j) transitions. The analogy is worked out in detail in the companion paper33 for general biased dynamics. To maximize our likelihood, we set the derivative of ln L with respect to all kmn to zero for m < n, ∂ ln

THEORY Rate Coefficients from REMD. To extract kinetics from REMD, we map the “time-continuous” trajectories (which follow a system through replica exchanges irrespective of temperature) onto a set of discrete states i, as illustrated schematically in Figure 1. This assignment can be performed, for instance, by using the transition-based assignment method,22 which uses core sets in configuration space centered on the well-populated regions to filter out fast recrossings. States can be defined using order parameters22 and/or by methods developed for Markov state models,34−36 and their definition has to be validated (Supporting Information). At each of the temperatures Tα of a REMD simulation, we obtain in this way a sequence of trajectory segments a of durations t(α) j,a in states j uninterrupted by replica exchanges or by transitions between states. 3928

DOI: 10.1021/acs.jctc.7b00372 J. Chem. Theory Comput. 2017, 13, 3927−3935

Article

Journal of Chemical Theory and Computation

the product state i rather than the reactant state j. Transition states have, by definition, committor values of φ = 1/2, such that 2φ(1 − φ) = 1/2. For high barriers, almost all other states have committor values of φ ≈ 0 or 1 and thus 2φ(1 − φ) ≈ 0. Therefore, 4φ(1 − φ) provides us with an approximate indicator function of transition states by adopting a value of one for transition states and values close to zero for nontransition states. We can thus express the probability of being on a transition state given that a trajectory at temperature Tα has a reaction coordinate value of Q as p(TS|Tα, Q) ≈ ⟨4φ(1 − φ)⟩Q,Tα = 2p(TP|Tα, Q). Now, we estimate the probability p(TS|Tα,{Qτ|Tτ = Tα}) of having visited at least one transition state at temperature Tα during a given TP, which is by definition one minus the probability that none of the configurations xτ at Tα along the TP is a transition state

L/∂ kmn = 0, resulting in explicit expressions for the rates, kmn = (Nnm + Nmn)/(tn + tmpn/pm). Because REMD should allow us to estimate the pi quite accurately as the fraction of time spent in i, pi = ti/t, we do not maximize L with respect to pi. Instead, we set pi = ti/t, where t = ∑i ti is the total simulation time at the respective temperature. We arrive at a rate estimate of kmn = (Nnm + Nmn)/2tn

(2)

The topology (connectivity) of the rate model constructed with eq 2 is determined by the observed transition counts. That is, whenever Nmn > 0 or Nnm > 0, states m and n are connected. From the second derivative of the log-likelihood at its maximum, − ∂2 ln L/∂(ln kmn)2|max = Nnm + Nmn, we obtain the standard error of ln kmn as the reciprocal of the square root of the sum of transition counts forward and backward. We can thus estimate both the rates and their errors from the transition counts and waiting times ln kmn = ln

Nnm + Nmn ± (Nnm + Nmn)−1/2 2tn

(1 − p(TS|Tα , Q τ))

τ(Tτ = Tα)

(3)





Aside from the fact that estimating errors from data with poor statistics, small numbers of transitions here, is always challenging, we emphasize that the assumption of Gaussian errors, and thus their representation in terms of “error bars”, becomes problematic for small Nmn + Nnm, where the nonGaussian character of the likelihood function becomes important. We note that, in long equilibrium runs, the number of forward and reverse transitions is balanced, Nmn ≈ Nnm, because of microscopic time reversibility. Therefore, by imposing detailed balance in the analysis of long equilibrium runs, we reduce the expected statistical error in our logarithmic rate estimates by a factor √2 at no additional computational cost. Note that at each REMD temperature transitions occur at their natural frequency and hence the rate estimates also apply to regular MD simulations. With slight modifications, our formalism can be applied to simulations biased to boost the frequency of transitions as demonstrated in the companion paper.33 Transition Counts in REMD. So far we have assumed that the system follows rate kinetics exactly. For such systems, the above procedure would be rigorous. However, in real systems, we face the problems (1) that transitions between states are short but not instantaneous (Figure 1B and C) and (2) that we do not know exact transition states. We therefore assign a particular transition event j → i to temperature Tα by the estimated probability of having crossed the state boundary there using the Bayesian formulation of transition path theory.39 We first derive an approximation for the conditional probability p(TS|Tα, Q) of being in a transition state given a particular value Q of a chosen reaction coordinate at temperature Tα. As a first step, we express the conditional probability of being on a transition path in terms of the committor φ39 p(TP|Tα , Q ) = ⟨2φ(1 − φ)⟩Q , Tα



p(TS|Tα , {Q τ|Tτ = Tα}) = 1 −

p(TS|Tα , Q τ)

τ(Tτ = Tα)

≈2



p(TP|Tα , Q τ)

τ(Tτ = Tα)

(5)

In the first expression, we assumed that the configurations are independent to expand the probability as a product over all time points τ with Tτ = Tα. In the second step, we used the fact that p(TS|Tα, Qτ) ≈ 0 for almost all Qτ, such that their products can be neglected. In the third step, we substituted the approximation for p(TS|Tα, Qτ) given at the end of the preceding paragraph. Finally, we assume that the probabilities p(TS|Tα, {Qτ|Tτ = Tα}) of having visited a transition state at least once at temperature Tα are proportional to the probability p(Tα|TP) that the transition from state j to i has occurred at this temperature. We then arrive at an expression for the probability that the transition occurred at Tα in terms of quantities that can be calculated from the simulations p(Tα|TP) ∝



p(TP|Tα , Q τ )

τ(Tτ = Tα)

(6)

where the sum extends over all time points τ along the transition path at which Tτ = Tα. As a further simplification, in the analysis of the systems studied here we assume that p(TP|Tα, Q) ≈ p(TP|Q) is independent of temperature over the typically narrow temperature window in which most transitions occur. In this way, we can calculate p(TP|Q) directly from the REMD trajectories p(TP|Q ) ∝

p(Q |TP) p(Q )

(7)

where the probability densities of Q over the transition path and equilibrium ensembles in the numerator and denominator, respectively, are evaluated as histograms of Q from all REMD TP and equilibrium trajectories irrespective of temperature. In practical applications, the assumption that p(TP|Q) is independent of temperature should be tested by comparing the ratios p(Q|Tα, TP)/p(Q|Tα) collected at the different temperatures Tα where transitions have been observed. Our approximation for the probability that the transition occurred at

(4)

where the average is over an equilibrium distribution at fixed values of Q. Here, we ignored inertial effects and flux from the transition region between states i and j to other states (i.e., we make a local two-state assumption). The committor φ(x) is the probability that trajectories starting from configuration-space point x with Maxwell−Boltzmann velocities proceed directly to 3929

DOI: 10.1021/acs.jctc.7b00372 J. Chem. Theory Comput. 2017, 13, 3927−3935

Article

Journal of Chemical Theory and Computation Tα (and thus the probabilistic weight factor used to assign events to Tα) then becomes w(α) =

where in the last expression we took advantage of the symmetry of xij. We now substitute xij into zk to eliminate the first set of M(M − 1)/2 equations with M the number of states. We then obtain a set of closed expressions for zk alone

∑τ(T = T ) p(TP|Q τ ) τ

α

∑τ p(TP|Q τ )

(8)

zk =

In the absence of a suitable reaction coordinate Q, one can set p(TP|Q) constant. The weight w(α) for temperature Tα is then proportional to the time the transition path spent at Tα. Simulations of a minimal model with finite transition paths confirmed that rates can be extracted from REMD even when the dynamics deviates from ideal rate kinetics (Figures S2 and S3). Distribution of Lifetimes from REMD. Within this framework, we can also obtain distributions of lifetimes in the metastable states from REMD. If the system followed rate kinetics exactly, with instantaneous and resolved transitions, we would expect the aggregate times spent in state j at a particular temperature Tα between transitions out of state j to be exponentially distributed with a characteristic rate ∑i(≠j)k(α) ij . Because the lifetime distribution for first-order kinetics is unequivocally determined by the rate, we will obtain the correct distribution even if the wait in j at temperature Tα is interrupted by replica exchanges. For this, we have to aggregate all times in j between transitions out of j. In an algorithmic realization, we could thus accumulate waiting times until our transition counter N(α) ij has increased by one. Equivalently, for exponential rate kinetics, we can rescale the aggregate lifetime in j at Tα preceding a transition out of j by 1/w(α), the reciprocal of the estimated probability that the transition occurred at Tα. The rescaled lifetime is assigned a relative weight w(α), which guarantees the consistency of the extracted REMD lifetime statistics and the rate calculation with mean lifetime ⟨τj⟩ = 1/ ∑i(≠j)k(α) ij . We also estimate distributions of transition path times. A transition path time τTP enters the distribution at Tα with a relative weight w(α) given by the probability that the transition occurred at Tα. Maximum-Likelihood Estimates of Rates and Populations. So far we have assumed that, in REMD simulations, the equilibrium populations are sampled accurately,40 and therefore we have not optimized the state populations pk under the assumption of a rate model. In the following, we extend the maximum-likelihood analysis by estimating both rates and populations under the constraints of detailed balance given total times tj spent in states j and transition counts Nij from states j to i. We define zk = pk/tk and, inspired by ref 34, xij = kijpj. The matrix xij is symmetric because of detailed balance, xij = xji. In terms of these variables, we can write the log-likelihood 3 = ln L with L given in eq 1 as 3=

F=

∑ [(Nij + Nji)ln(e−g

i

+ e−gj) + Nijgj + Njigi]

(13)

and kij =

Nij + Nji

(

tj 1 +

zj zi

)

(15)

The latter generalizes eq 2 to the case where not all pk = tk/t (i.e., not all zk = 1). For somewhat different inputs (e.g., discrete time series of states) and objectives (e.g., estimating transition probabilities of a discrete-time Markov state model), a variety of related maximum-likelihood expressions have been derived (see, e.g., refs 26 and 34), and it is not unlikely that, in the vast literature on the estimation of rate models, our expressions have appeared. In the companion paper,33 we determined optimal state populations and rate coefficients from biased simulations, such as umbrella sampling simulations, by extending the above rate formalism.



SIMULATION METHODS MD and REMD simulations were run in GROMACS 4.6.42 For Ala2, we ran simulations with a 2 fs time step at 12 equally spaced temperatures from 295 to 395 K for 450 ns (MD) and 150 ns (REMD) at each temperature. We used the AMBER99sb*-ildn-q43−46 force field and the TIP3P water model.47 Van der Waals interactions were cutoff at 12 Å, and electrostatics were described by the Particle Mesh Ewald method. Sampling from the canonical equilibrium distributions was enabled by the Bussi−Donadio−Parrinello thermostat.48 Replica exchanges were attempted every 5 ps. The atomically

To maximize 3 , we set the derivatives of 3 with respect to xij and zk to zero, obtaining Nij + Nji (10)

and ∑i(≠ k) xik ∑i(≠ k) Nik

(12)

with respect to gk = ln zk with gM = 0 fixed. F was obtained by substituting zk = egk and eq 10 into eq 9, changing the sign, and removing terms that do not alter the location of the extremum. Because of the special form of 3 , the maximum of F coincides with the minimum of eq 13. From the converged and normalized zk, the maximumlikelihood estimates of the equilibrium populations and rate coefficients can then be determined as pk = zktk (14)

(9)

zk =

∑i(≠ k) Nik

i