Metadynamics Enhanced Markov Modeling of Protein Dynamics - The

Jan 16, 2018 - A particularly promising strategy is to combine massive parallel computing of short molecular dynamics (MD) trajectories (to sample the...
0 downloads 4 Views 734KB Size
Subscriber access provided by READING UNIV

Article

Metadynamics Enhanced Markov Modeling of Protein Dynamics Mithun Biswas, Benjamin Lickert, and Gerhard Stock J. Phys. Chem. B, Just Accepted Manuscript • DOI: 10.1021/acs.jpcb.7b11800 • Publication Date (Web): 16 Jan 2018 Downloaded from http://pubs.acs.org on January 19, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

The Journal of Physical Chemistry B is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

Metadynamics Enhanced Markov Modeling of Protein Dynamics Mithun Biswas, Benjamin Lickert, and Gerhard Stock∗ Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany E-mail: [email protected]

Abstract Enhanced sampling techniques represent a versatile approach to account for rare conformational transitions in biomolecules. A particularly promising strategy is to combine massive parallel computing of short molecular dynamics (MD) trajectories (to sample the free energy landscape of the system) with Markov state modeling (to rebuild the kinetics from the sampled data). To obtain well-distributed initial structures for the short trajectories, it is proposed to employ Metadynamics MD, which quickly sweeps through the entire free energy landscape of interest. Being only used to generate initial conformations, the implementation of Metadynamics can be simple and fast. The conformational dynamics of helical peptide Aib9 is adopted to discuss various technical issues of the approach, including Metadynamics settings, minimal number and length of short MD trajectories, as well as the validation of the resulting Markov models. Using Metadynamics to launch some thousands of nanosecond trajectories, several Markov state models are constructed which reveal that previous unbiased MD simulations of in total 16 µs length cannot provide correct equilibrium populations or qualitative features of the pathway distribution of the short peptide.

Introduction Capturing rare conformational transitions is a longstanding challenge in biomolecular simulation. A molecular dynamics (MD) trajectory typically oversamples the free energy basins, but rarely visits the barriers which control the rates of interesting processes. Consequently, to access biomolecular processes occurring over large barriers, considerable computational cost is incurred, which can be provided by specialized hardware (like Anton 1 architecture) or massively distributed computing (like Folding@Home 2 ). Alternatively, one may employ an enhanced sampling algorithm, 3 which, e.g., modifies the underlying free energy surface to facilitate barrier-crossing events. Popular methods in∗ To

whom correspondence should be addressed

clude umbrella sampling, 4 steered MD, 5 targeted MD, 6 replica exchange MD, 7 temperature accelerated MD, 8 statistical-temperature MD, 9 and Metadynamics, 10 to name but a few. While these approaches yield an improved thermodynamic model, the kinetic description of the unbiased system usually gets lost. To recover the dynamics of the system, we therefore need to invoke some post-simulation model that is capable of rebuilding the kinetics from the sampled data. To construct such a coarse-grained model from an allatom MD simulation, we may either employ a discrete state space (representing, e.g., the main conformational states of the system) or define some continuous collective variables (that account for the overall structural dynamics of the system). In the latter case, one may consider a data-driven Langevin equation, 11–14 in the former Markov state models (MSMs) are a popular example for a discrete state formulation. 15–19 By partitioning a continuous MD trajectory in discrete metastable states, MSMs describe protein dynamics in terms of memoryless jumps between those states. Since the transition probability matrix relies on the local density of conformations, an essential feature of the MSM is that it does not need a long continuous trajectory to predict slow dynamics of a biomolecular system. Instead several relatively short trajectories can be run independently, 15–17 which enables us to use massive parallel computing. A similar approach has recently been suggested for Langevin modeling. 20 To sample the overall dynamics of the system, we need short trajectory pieces that cover the entire conformational space of interest, which requires us to choose the initial conditions of the short trajectories accordingly. However, since we typically don’t know the relevant conformational space, the choice of well-distributed initial structures turns out to be a nontrivial issue. To generate these “seed” structures for the short trajectories, replica exchange MD, 15 high temperature simulations, 16,21 and adaptive seeding approaches 22 have been employed. The basic idea pursued in this work is to adopt an enhanced sampling algorithm that has no target bias and enables us to rapidly explore the free energy landscape. In this respect, Metadynamics is a biased sampling method that has been used

ACS Paragon Plus Environment 1

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

extensively to map out the free energy landscape of proteins 23–25 and to construct kinetic models. 26–28 To surmount the barriers of the free energy landscape, it employs an external history-dependent bias potential which is constructed as a sum of Gaussians. Here, we employ an “aggressive” Metadynamics setting that quickly sweeps through the free energy landscape by spending minimal computational time for the sampling step. From the Metadynamics run, representative structures of the conformational landscape are chosen as input for subsequent short MD runs. As Metadynamics is only used to generate initial conformations, it is sufficient to provide a rough approximation of the conformation density. Finally, the short trajectories are used to construct a MSM of the dynamics. To demonstrate the virtues and shortcomings of this Metadynamics enhanced Markov modeling, we adopt a well-studied model problem, namely, conformational transitions of Aib9 peptide helix. 29,30 Despite of its small size, extensive unbiased MD simulations have shown that the achiral peptide exhibits complex structural dynamics, which originates from various hierarchically coupled dynamical processes on several timescales. As a consequence, several aspects of the system such as the nature and sampling of intermediate states are still not fully understood. By discussing the choice of collective coordinates and Gaussian parameters, we study the performance of the Metadynamics simulations, which are shown to provide an thorough sampling of the energetically accessible free energy landscape. Based on some thousands of short (10 ns) trajectories, we use dihedral angle principal component analysis (dPCA), 31 density-based clustering, 32 and a reversible maximum likelihood estimator of the transition matrix 33,48 to construct a MSM of the dynamics. Comparing to results of µs long unbiased MD runs, 29 we assess the quality of the resulting MSMs and discuss the potential of the Metadynamics enhanced MSM approach.

Methods MD and Metadynamics Simulations Recently Buchenberg et al. 29 performed extensive unbiased MD simulations of Aib9 (H3 C-CO-(NHCα (CH3 )2 -CO)9 -CH3 ), using the GROMACS program suite 34 with the GROMOS96 43a1 force field 35 and explicit chloroform solvent. 36 As a reference to compare to, here we adopt eight MD trajectories at 320 K of each 2 µs length, using a time step ∆t = 10 ps (1.6×106 frames). All Metadynamics and short trajectory simulations reported below used the same simulation set-up as described in Ref. 29 Metadynamics simulations were run using GROMACS program suite 37 patched with PLUMED. 38 We tested various collective coordinates to bias Metadynamics (Figure S1), and eventually choose the cumuP9 lative backbone dihedral angle (φsum = n=1 φn ) and

the number of backbone-stabilizing hydrogen bonds (Nhb ) as biasing coordinates. Employing the strategy of Ref., 39 the Supporting Information discusses the choice of the Gaussian parameters (height, widths, and frequency of deposition) of the Metadynamics run. Using height w=1 kJ/mol, widths (along the two coordinates) σφsum = 2 radians and σNhb = 1, and deposition frequency τG = 1 ps, we performed Metadynamics MD for 100 ns. Employing a representative set of Metadynamics conformations as starting structures, we next ran short unbiased MD trajectories of Aib9 . To select the starting structures, the Metadynamics free energy landscape along the first two principal components was uniformly divided into bins of equal size. Choosing one seed structure per bin, 10 ns long MD runs were performed using the standard protocol 29 to sample the local conformations. To ensure that the trajectories are equilibrated at 320K, the first 0.5 ns of each trajectory were removed. Rather than using complete trajectories, this strategy was found to give more consistent models. In total we ran 7732 unbiased MD trajectories of each 10 ns length, which results (for a 10 ps time step) in 7.7×106 frames.

Construction of MSM As discussed previously, 29 the conformational dynamics of Aib9 is well described by the (φ, ψ) backbone dihedral angles of the five inner residues of the peptide. To further reduce the dimensionality of coordinate space, principal component analysis (dPCA) was performed on these dihedral angles. 31 To be consistent with earlier work, 29 we used the original dPCA 31 which is based on sine/cosine transformations of the dihedral angles, rather the the improved method dPCA+, 30 which employs directly the dihedral angles. The first five principal components (accounting for 85% of the total variance), which show multipeaked distributions and a slow decay of the autocorrelation function, represent suitable collective coordinates that were used in the further analysis. To identify the metastable states of the system, we used the density-based geometric clustering algorithm by Sittel et al. 32 which detects high-density regions in the given coordinate space, based on the local density of structures measured as a population in a d-dimensional hypersphere (here d=5). The optimal hypersphere radius R is an input parameter that can be found for equilibrium systems using the Boltzmann distribution as a heuristic. Following Ref., 30 we used R=0.2 for Aib9 . The approach has been shown to yield welldefined microstates separated by local free energy barriers and therefore provides an essential improvement over commonly employed k-means-type methods. The PCA and clustering methodology is freely available at https://github.com/lettis. To estimate the conditional transition probabilities Tij = Cij /ci between states i and j, the transition count matrix Cij and the number of visits ci are calculated

ACS Paragon Plus Environment 2

Page 2 of 9

using a combined ensemble/time average over the set of K trajectories with each one having N frames. Given trajectories xk (tn ) with k = 1, . . . , K, n = 1, . . . , N , and a lag time τ = L∆t, we obtain

a

150 50

r*

-50

r

b

l

r* l*

l

r

ci =

K N −L X X 1 χi (xk (tn )) , K(N −L) n=1

6 4

l*

2

-150

0 -150 -50

50

150 -150 -50

φ

50

150

φ

k=1

K N −L X X 1 Cij = χi (xk (tn )) χj (xk (tn+L )) , K(N −L) n=1 k=1

where χi (x) denotes the indicator function for state i, which takes a value 1 if x belongs to this state and 0 otherwise. The transition time is obtained as τij = ∆t/Tij . Employing these transition probabilities as a first estimate, MSMs were constructed using a reversible maximum likelihood estimator of the transition matrix as implemented in the PyEMMA package. 33,48 To characterize the conformational dynamics of the resulting MSMs, we conducted a 10 ms long Markov chain Monte Carlo run of the resulting MSM. Subsequently, we performed an analysis of the involved transition pathways as well as of the waiting time distribution for selected transitions.

x2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

ψ

Page 3 of 9

2 1 0 -1 -2

rllll lllll

rrlll rrrll rrrrl

llllr

-2 -1

c

d

rrrrr

0

4 2

lrrrr llrrr

lllrr

1 x1

2

6

0 3

-2 -1

0

1

2

3

x1

Figure 1: Comparison of the conformational sampling of Aib9 peptide, obtained by (left) 8×2µs unbiased MD trajectories and (right) a 100 ns Metadynamics run. Shown are (a,b) the Ramachandran (φ, ψ) plot of the inner residues of Aib9 and (c,d) the free energy landscape projected on the first two principal components of a dPCA. The black dots indicate the average positions of the ten most populated metastable conformational states of the system.

Results & Discussion Conformational dynamics of Aib peptide Aib9 is a small helical peptide that exhibits surprisingly complex structural dynamics, which reflects hierarchically coupled processes on several timescales. 29 They correspond to chiral left- to right-handed transitions of the entire peptide helix that happen on a 0.1 µs timescale, conformational transitions of individual residues taking about 1 ns, and the opening and closing of structure-stabilizing hydrogen bonds which occur within tens of ps and are triggered by sub-ps structural fluctuations. To demonstrate that the achiral peptide occurs in both left- and right-handed structures, Figure 1a shows the Ramachandran plot of the inner residues of Aib9 , as obtained from long (8 × 2µs) MD trajectories. The point symmetry of the (φ, ψ) plot with respect to (0,0) clearly shows that Aib9 indeed samples both left-handed (φ ≥ 0) and right-handed (φ ≤ 0) conformations with similar probability. The main conformational states at ≈ (∓50◦ , ∓45◦ ) represent a rightand left-handed helix, respectively. Moreover, for each chirality we find (at least) one excited conformational state at ≈ (∓68◦ , ±45◦ ). To obtain a global view of the conformational dynamics, Figure 1c shows the free energy landscape ∆G along the first two principal components x1 and x2 of the dPCA (see Methods). Owing to the achirality of Aib9 , we find an overall symmetry with respect to the first principal component x1 , where the two main minima correspond to the all left-handed structure L and the all right-handed structure R. Moreover, a number

of metastable intermediate states exist that constitute pathways from L to R. Containing a mixture of righthanded (r) and left-handed (l) residues, the intermediate states can be described by a product state of these chiralities. When we restrict ourselves to the five inner residues of Aib9 (to avoid end effects), we obtain e.g., L = (lllll) and R = (rrrrr), as well as (rllll) if all but residue 3 show left-handed conformations. Judging from the prominent appearance of states (lllll), (rlllll), (rrlll),..., (rrrrr), (lrrrr), etc. in the free energy landscape, one might expect that a sequential pathway along these states provides the main way to get from L to R and back. Table 1 lists the population probability of the most important conformational states of Aib9 . Surprisingly, we find that left- and the right-handed ground states L and R occur with different populations (pL ≈ 2pR ), although the achiral peptide Aib9 lacks a chiral center. While this deviation corresponds to only a small free energy difference (∆G ≈ 0.4 kcal/mol), it apparently cannot be explained by incomplete sampling. A closer analysis reveals that this lack of symmetry is caused by an inaccuracy of the force field parametrization of Aib9 first employed in Ref., 40 which arises from a peculiar interplay of the angular bond terms and the improper dihedral angle terms pertaining to the two CH3 groups of the Cα of Aib9 (see Supporting Information). We note that this force field problem does not affect the study of conformational sampling and construction of MSMs pursued in this work.

ACS Paragon Plus Environment 3

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Metadynamics sampling As explained in the Introduction, we wish to employ Metadynamics to sample seed structures for subsequent unbiased short MD simulations. To demonstrate the effect of Metadynamics sampling of Aib9 , Figure 1b shows the Ramachandran plot of the inner residues obtained from the 100 ns Metadynamics run. Compared to the results from unbiased MD (Figure 1a), we note an improved sampling of the excited states at ≈ (∓68◦ , ±45◦ ). Similarly, when we consider the overall free energy landscape ∆G(x1 , x2 ) (Figure 1d), we find that Metadynamics samples the entire energetically accessible energy surface homogeneously. Hence we expect that the conformational distribution produced by Metadynamics provides a good basis to select well-distributed starting structures for the short MD runs. When Metadynamics is used to directly compute the free energy landscape, the choice of appropriate collective coordinates to bias Metadynamics is crucial and therefore a much-discussed topic. 24 Hence it is interesting to study the effect of different collective coordinates in the present approach. To this end, Figure S1 compares the free energy landscape ∆G(x1 , x2 ) obtained by Metadynamics runs using various combinations of backbone dihedral angles and hydrogen bonds as biasing coordinates. While the resulting energy landscapes clearly differ in details, they nonetheless result in similar starting structures for the subsequent MD runs. This is because these structures are obtained by binning of ∆G(x1 , x2 ) and choosing one seed structure for each bin (see Methods). Being only used to generate initial conTable 1: Population (in %) of the main metastable conformational states of Aib9 , as obtained directly from MD simulations (MD) as well as from various Markov models. MSM1 and MSM2 are based on the original MD data and short trajectory data, respectively, MSM3 and MSM4 employ a subset of the short trajectory data. The five-letter code accounts for righthanded (r) and left-handed (l) structures of the inner residues. The asterisk indicates an excited state of the residue (see Figure 1a). Standard errors (which are typically about 10 %) are given in Table S1. state lllll r∗ llll rr∗ lll rrr∗ ll rrrr∗ l rrrrr l∗ rrrr ll∗ rrr lll∗ rr llll∗ r

MD 50.6 6.0 1.3 1.2 1.4 21.7 2.4 1.1 1.5 1.9

MSM1 51.0 6.0 1.3 1.2 1.4 21.4 2.3 1.0 1.5 1.9

MSM2 37.0 11.2 1.8 0.6 1.2 16.5 5.5 1.6 0.7 2.8

MSM3 34.8 10.5 1.4 0.5 1.5 18.8 6.0 2.0 1.0 1.3

MSM4 35.0 10.8 2.0 0.6 1.2 17.2 5.7 1.8 0.7 2.7

formations, a rough approximation of the conformation density by Metadynamics is sufficient.

Characterization of MSMs To identify the metastable conformational states of Aib9 , we employed density-based geometric clustering 32 using the first five principal components of the dPCA (see Methods). In this way, we obtained 42 microstates for the original MD data (8×2µs) and 102 microstates for the short trajectory data (7732×10 ns). Table 1 comprises the structural characterization of the ten most important conformational states, which are well described by the above introduced five-letter code. Interestingly, we find that border residues between left- and righthanded residues ( such as rr∗ lll or l∗ rrrr) are typically found in excited states r∗ and l∗ (see Figure 1a). Most states occur as a mixture of 310 - and α-helical structures, in a few cases these structures split up in two separate states (e.g., the lllll state of MSM2, see Figure 2). While the main states occur in MSM1 and MSM2 as well as in the original MD, the short trajectory data reveal many more conformational states, which are typically quite lowly populated. See Table S1 and Figure S2 for a more detailed characterization of these states. To obtain valid MSMs for both data sets, we next determine a suitable lag time τ to calculate the transition matrix {Tij } of the model. To this end, we calculate the eigenvalues λi of the transition matrix and plot the associated implied timescales ti = −τ / ln λi versus the lag time. Interestingly, Figure 3 reveals that the resulting implied timescales are quite similar for both data sets, which indicates that already the sampling of the original MD data (8 × 2µs) is sufficient to observe the slowest timescales of Aib9 . As the timescales remain roughly constant for lag times τ & 1 ns, we fix the lag time to this value. In the remainder of this paper, we will refer to the resulting Markov model for the original MD data set as “MSM1”, while “MSM2” denotes the model pertaining to the short trajectory data. Table 1 compares the resulting state populations obtained for the original MD data and the two Markov models. The populations of the MD data and the corresponding Markov model MSM1 are quite similar, which indicates that MSM1 represents a consistent model of the original MD data. In the case of MSM2, on the other hand, the populations of the main states L=(lllll) and R=(rrrrr) are significantly reduced, while the intermediate states are higher populated. As detailed in Table S1, the missing population of L and R is redistributed to numerous weakly populated states located in the middle of the free energy landscape in Figure 1. This raises the question if the redistribution of population observed for MSM2 is correct, or rather caused by the fact that we choose the majority of the starting structures in highenergy regions of the energy landscape and used relatively short (10 ns) trajectories. It is well known that short nonequilibrium trajectories may considerably bias a MSM towards their initial conformations. 41 As this

ACS Paragon Plus Environment 4

Page 4 of 9

Page 5 of 9

rrr*ll

rr*lll



● ●





r*llll ●

r*llll ● ●

● ● ●

● ●



●●













lllll





rrrrr







● ●

l*rrrr

● ●



lll*rr

lllll lllll











● ● ●









● ●











● ●

















llll*r



● ●

















●●









ll*rrr

● ● ● ●









●●

















● ●











●●



● ●







● ●



●●











llll*r





rrrr*l

● ●



















rrr*ll

rr*lll

rrrr*l





rrrrr



l*rrrr

● ●



ll*rrr

lll*rr

Implied time (ns)

Figure 2: Network representation of the Markov models MSM1(left) and MSM2 (right). Node size reflects the population of the state, edge thickness the transition rate. Orange transitions go the the right, blue to the left. For better visibility, only the 10 % most frequent transitions are shown.

a MSM 33 (yielding Tijenf ), it is a measure for the quality of a MSM to what extent detailed balance is fulfilled without enforcement (yielding Tijnotenf ). As a simple test, Figure S3 shows the difference between these transition matrices, Tijenf − Tijnotenf , for MSM1 and MSM2. We find that the better sampled MSM2 clearly obeys detailed balance better than MSM1.

100 10 1 0.1 0.6

Population

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

0.4

Analysis of transition pathways

0.2 0 0

0.5 1 1.5 2 Lag time τ (ns)

0

0.5 1 1.5 2 Lag time τ (ns)

Figure 3: (Top) Five slowest implied timescales of (left) MSM1 and (right) MSM2 as a function of lag time τ . (Bottom) Equilibrium populations of conformational states lllll (red), rrrrr (green), r∗ llll (blue) and l∗ rrrr (cyan) for the two models. Error bars are calculated using bootstrapping as detailed in the SI.

effect becomes smaller with increasing lag time τ , Figure 3 shows the populations of some selected states as a function of τ . Being based on long trajectories, the state populations of MSM1 hardly depend on the lag time. The state populations of MSM2, on the other hand, change considerably for small lag times before they level off to a constant value for τ & 1 ns. Since we use a lag time of 1 ns, we conclude that the 10 ns trajectories are long enough to build a (largely) unbiased MSM. This conclusion is supported by the finding that trajectories starting in the middle of the free energy landscape typically take less than 1 ns to reach one of the main conformational states. Another test of the quality of a Markov model is its reversibility, that is, for any two states i and j with equilibrium probability pi and pj the detailed balance condition pi Tij = pj Tji should hold. While there are ways to enforce detailed balance in the construction of

To analyze the dynamics described by the Markov models MSM1 and MSM2, we performed a 10 ms long Markov chain Monte Carlo run and monitored the transition pathways of each model. Figure 2 shows the resulting transition networks, nicely illustrating the conformational dynamics of Aib9 . Located in the x1 -x2 plane of the first two principal components, the nodes of the networks represent all metastable states of the MSM, with the node size indicating the state population. The edges of the network reflect possible transitions between the states, with the thickness of the line reflecting the transition rates. Comparing the networks, we readily see that MSM2 exhibits more states than MSM1 (102 vs. 42), which also give rise to the many more transitions of MSM2. To study the nature of these transitions, we consider the variety of L→R pathways found in the Markov chain Monte Carlo runs, yielding ∼ 29 000 and 45 000 direct pathways for MSM1 and MSM2, respectively. (Results for the R→L backward transitions are quite similar and therefore not discussed.) To describe the L→R transition pathways, we divide the states up in “main states” and “middle states”. The main states comprise the two ground states L = (lllll) and R = (rrrrr) as well as the eight well-populated states along the outer ring of the free energy landscape, see Figure 2. The middle states include all remaining states from the inner region of the energy landscape. To characterize various types of L→R transition pathways, Figure 4 shows the probability Pn of pathways,

ACS Paragon Plus Environment 5

The Journal of Physical Chemistry which use at least one main intermediate state and exactly n middle states. As expected from visual inspection of the MSM1 network, a significant part (13 %) of the L→R transitions proceed solely via the main states. The majority of pathways (60 %) include one or two middle states besides the main states, and about 10 % proceed only via middle states. (For a detailed description, see Table S3.) On the other hand, due to the improved sampling of the middle states in the Metadynamics enhanced model, we obtain a qualitatively different picture for MSM2 that clearly favors pathways including middle states. That is, only 1 % of all transitions use exclusively the main states, while more than 66 % use more than three middle states. Moreover, MSM2 exhibits quite “long” pathways visiting more than ten middle states, which are absent in MSM1. 20 Pn

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

10

0 0

5

10

15

20

25

n

Figure 4: Characterization of different types of L→R transition pathways obtained for various MSMs, where Pn denotes the probability (in %) of pathways which use at least one main intermediate state and exactly n middle states. Compared are MSM1 (red), MSM2 (green), MSM3 (blue) and MSM4 (grey). We note in passing that the above transition pathway analysis obtained from the long Markov chain Monte Carlo run differs from results from transition path theory 17 which calculates the forward and backward fluxes of the system. In the latter approach, for example, a pathway along the main states that visits some middle state and returns to the same main state would be counted as ”main state only”, because there is no net flux to the middle state. As a consequence, the percentage of main-state-only pathways increases significantly to 30.9 and 5.0 % for MSM1 and MSM2, respectively. It is interesting to study to what extent the different pathways exhibited by MSM1 and MSM2 result in different timescales of the two models. Because the trantrans sition times τij = τlag /Tij of a MSM only apply to states j that can be reached from state i during the lag time, in the following we adopt the concept of waiting wait 42 times τij , i.e., the time spend in state i and possible subsequent intermediate states before going to state j. Figure 5 shows the waiting time distribution of various transitions. As expected for converged realizations of a MSM, the distributions decay exponentially (which is in variance to the nonconverged distributions obtained directly from the original MD simulations, see Figure S4). Considering the L↔R transition, MSM1 and MSM2

Page 6 of 9

10 1 0.1 0.01

lllll↔rrrrr

rrlll↔llrrr

10 1 0.1 0.01

rllll↔lrrrr

rrlll↔rrrll

0.2 0.6

1

1.4 0.2 τwait(µs)

0.6

1

1.4

Figure 5: Distribution of waiting times of representative transitions obtained for MSM1 (red) and MSM2 (green). Solid and dotted lines indicate forward and backward transitions, respectively.

yield quite similar results with mean waiting times of 128 and 133 ns for the forward direction and 56 and 67 ns for the backward transition. (See Table S2 for a comprehensive list of mean waiting times.) The shorter backward waiting time is related to the different populations of L and R. Considering that supposedly different pathways are involved, the two models also give rather similar waiting times for the transitions rllll↔lrrrr and rrlll↔llrrr, which both reach across the free energy landscape. Somewhat surprisingly, though, we find significant deviations for transition rrlll→rrrll, which connects two neighboring states in the energy surface. The longer mean waiting times obtained for MSM2 (255 ns) compared to MSM1 (136 ns) indicate that the former model shows longer and more complicated pathways. Nonetheless, we find that in general waiting times appear to be quite robust observables that hardly depend on details of the network.

Convergence of MSM Employing 7732 × 10 ns ≈ 80µs unbiased MD simulations for a system with a slowest implied timescale of ∼ 40 ns, MSM2 is expected to represent a wellconverged model of the conformational dynamics of Aib9 . Using this model as a reference, we now study how the main features of the Metadynamics enhanced MSM change when we restrict ourselves to only 2000 trajectories (referred to as model MSM3). Moreover, we wish to study the effect of reducing the length of the short trajectories, by considering all 7732 trajectories but using only the first 4 ns (referred to as model MSM4). As a first test of MSM3, Figure S5 shows implied timescales and equilibrium populations of selected states as a function of the lag time. Both quantities are quite similar to the results for MSM2 shown in Figure 3, which again suggests that a lag time of τ = 1 ns should be appropriate. Moreover, we find that the populations of the main states (Table 1), the characterization of transition pathways (Figure 4), as well as

ACS Paragon Plus Environment 6

Page 7 of 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry the mean waiting times (Table S2) are within, ∼10 % equivalent to the results of MSM2. That is, we obtain qualitatively correct results from Metadynamics enhanced MSM3, while the results for MSM1 (which requires about the same aggregated MD sampling time of ∼ 20 µs) can deviate significantly, in particular for state populations and transition pathways. As a further advantage, MSM3 is constructed from short 10 ns trajectories that can be run in parallel, while MSM1 is based on 2 µs long trajectories. Let us finally study the performance of the MSM with respect to the length of the short trajectories. To facilitate massive parallel computing, one would like to keep the propagation time as short as possible. If the length is too short, on the other hand, the short trajectories never reach equilibrium and may therefore considerably bias a MSM towards their initial conformations. 41 To study this behavior, Figure S6 shows the equilibrium populations of selected states as a function of the length of the short trajectories. The results suggest that we need a trajectory length of at least 4 ns to obtain qualitatively converged results. Hence we construct a Markov model (MSM4) with a propagation time of 4 ns, which once more allows for a 1 ns lag time (Figure S6). Comparison of state populations (Table 1), transition pathways (Figure 4) and mean waiting times (Table S2) again gives a typical accuracy of 10 % with respect to the reference model MSM2. In summary, we conclude that short trajectories of & 4 ns length with an aggregated sampling time similar to the long trajectories are sufficient to build a correct MSM for Aib9 .

Conclusions The availability of ever longer MD simulations has revealed that a statistically converged description of biomolecular dynamics requires exhaustive sampling even for small systems. 43 In the case of the peptide helix Aib9 with slowest implied timescale of 40 ns, we have found that 16 µs (400×40 ns) unbiased MD simulations cannot provide correct equilibrium populations (up to 40 % error, see Table 1) or qualitive features of the pathway distribution (Figure 4). This failure has been shown to be a consequence of the numerous lowly populated intermediate states of the system, which are notoriously undersampled in unbiased MD. Interestingly, we have also found that some global properties (such as waiting times) are surprising robust regarding undersampling. Aiming at proteins with functional motions on microsecond timescales, we may nonetheless expect that statisically converged unbiased simulations require millisecond lengths and beyond. 44 As these timescales are still out of reach for standard simulations, enhanced sampling techniques are required. Among the numerous exisiting schemes, the strategy of combining massive parallel computing of short MD trajectories with Markov state modeling seems particularly promising. In order to ensure that the ensemble of short

trajectories covers the entire conformational space of interest, it is important to choose well-distributed initial structures of the MD runs. To this end, we have proposed to employ Metadynamics MD which quickly sweeps through the free energy landscape. Already a crude approximation of the conformation density is sufficient, as long as the simulation roughly covers the energetically accessible energy lansdcape of the system. Hence most technical issues of Metadynamics, including e.g., the choice of biasing coordinates, are relaxed. Moreover, the numerical effort of the Metadynamics run is clearly negligible compared to the time needed to run the ensemble of short trajectories. In this work we have focused on methodological aspects of the approach, including Metadynamics settings, minimal number and length of short MD trajectories, tests of the resulting MSM, and characterization of transition pathways. Future work will address the application to biomolecular systems and processes of current interest, which involve numerous intermediate states. In particular, we expect Metadynamics enhanced Markov modeling to be well suited to describe the conformational distribution and the kinetics of downhill folding proteins, 45 intrinsically disordered systems, 46 and proteins mediating allosteric communication. 47

Supplementary Material Metadynamics details, test of various collective coordinates (Figure S1), description of microstates (Table S1, Figure S2), test of detailed balance (Figure S3), waiting time distribution (Figure S4, Table S2), implied time scales and state populations (Figures S5, S6) and characterization of pathways (Table S3).

Acknowledgment We thank Steffen Wolf, Sebastian Buchenberg and Florian Sittel for many instructive and helpful discussions; Steffen in particular concerning the force field representation of chirality, Sebastian also for sharing the MD trajectories. This work has been supported by the the Deutsche Forschungsgemeinschaft via Grant STO 247/11.

References (1) Shaw, D. E.; Deneroff, M. M.; Dror, R. O.; Kuskin, J. S.; Larson, R. H.; Salmon, J. K.; Young, C.; Batson, B.; Bowers, K. J.; Chao, J. C. et al. Anton, a Special-purpose Machine for Molecular Dynamics Simulation. SIGARCH Comput. Archit. News 2007, 35, 1–12. (2) Shirts, M.; Pande, V. S. Screen Savers of the World Unite! Science 2000, 290, 1903–1904. (3) Chipot, C.; Pohorille, A. Free Energy Calculations; Springer: Berlin, 2007. (4) K¨ astner, J. Umbrella Sampling. Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 932–942. (5) Isralewitz, B.; Gao, M.; Schulten, K. Steered Molecular Dynamics and Mechanical Functions of Proteins. Current Opinion in Structural Biology 2001, 11, 224–230. (6) Schlitter, J.; Engels, M.; Kr¨ uger, P. Targeted Molecular Dynamics - A New Approach for Searching Pathways of Conformational Transitions. J. Mol Graph. 1994, 12, 84–89.

ACS Paragon Plus Environment 7

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(7) Sugita, Y.; Okamoto, Y. Replica-Exchange Molecular Dynamics Method for Protein Folding. Chem. Phys. Lett. 1999, 314, 141. (8) Maragliano, L.; Vanden-Eijnden, E. A Temperature accelerated Method for sampling Free Energy and determining Reaction Pathways in rare Events Simulations. Chem. Phys. Lett. 2006, 426, 168–175. (9) Kim, J.; Straub, J. E.; Keyes, T. Statistical-Temperature Monte Carlo and Molecular Dynamics Algorithms. Phys. Rev. Lett. 2006, 97, 050601. (10) Laio, A.; Parrinello, M. Escaping Free-Energy Minima. Proc. Natl. Acad. Sci. USA 2002, 99, 12562 – 12566. (11) Lange, O. F.; Grubm¨ uller, H. Collective Langevin Dynamics of conformational Motions in Proteins. J. Chem. Phys. 2006, 124, 214903. (12) Best, R. B.; Hummer, G. Diffusive Model of Protein Folding Dynamics with Kramers Turnover in Rate. Phys. Rev. Lett. 2006, 96, 228104. (13) Micheletti, C.; Bussi, G.; Laio, A. Optimal Langevin Modeling of Out-Of-Equilibrium Molecular Dynamics Simulations. J. Chem. Phys. 2008, 129, 074105. (14) Hegger, R.; Stock, G. Multidimensional Langevin Modeling of biomolecular Dynamics. J. Chem. Phys. 2009, 130, 034106. (15) Chodera, J. D.; Swope, W. C.; Pitera, J. W.; Dill, K. A. Obtaining Long-Time Protein Folding Dynamics from short-Time Molecular Dynamics Simulations. Multiscale Modeling & Simulation 2006, 5, 1214–1226. (16) Singhal, N.; Snow, C. D.; Pande, V. S. Using Path Sampling to build better Markovian State Models: Predicting the folding Rate and Mechanism of a tryptophan Zipper Beta Hairpin. J. Chem. Phys. 2004, 121, 415–425. (17) Noe, F.; Sch¨ utte, C.; Vanden-Eijnden, E.; Reich, L.; Weikl, T. Constructing the Full Ensemble of Folding Pathways from Short Off-Equilibrium Simulations. Proc. Natl. Acad. Sci. USA 2009, 106, 19011–19016. (18) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Sch¨ utte, C.; Noe, F. Markov Models of Molecular Kinetics: Generation and Validation. J. Chem. Phys. 2011, 134, 174105. (19) Bowman, G. R.; Pande, V. S.; Noe, F. An Introduction to Markov State Models; Springer: Heidelberg, 2013. (20) Rzepiela, A. J.; Schaudinnus, N.; Buchenberg, S.; Hegger, R.; Stock, G. Communication: Microsecond Peptide Dynamics from Nanosecond Trajectories: A Langevin Approach. J. Chem. Phys. 2014, 141, 241102. (21) Bowman, G. R.; Ensign, D. L.; Pande, V. S. Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models. J. Chem. Theory Comput. 2010, 6, 787–794. (22) Huang, X.; Bowman, G. R.; Bacallado, S.; Pande, V. S. Rapid Equilibrium Sampling initiated from Nonequilibrium Data. Proc. Natl. Acad. Sci. USA 2009, 106, 19765–19769. (23) Bonomi, M.; Branduardi, D.; Gervasio, F. L.; Parrinello, M. The unfolded Ensemble and Folding Mechanism of the Cterminal GB1 Beta-Hairpin. J. Am. Chem. Soc. 2008, 130, 13938–13944. (24) Barducci, A.; Bonomi, M.; Parrinello, M. Metadynamics. Comput. Mol. Sci. 2011, 1, 826–843. (25) Granata, D.; Camilloni, C.; Vendruscolo, M.; Laio, A. Characterization of the Free-Energy Landscapes of Proteins by NMRguided Metadynamics. Proc. Natl. Acad. Sci. USA 2013, 110, 6817–6822. (26) Tiwary, P.; Parrinello, M. From Metadynamics to Dynamics. Phys. Rev. Lett. 2013, 111, 230602. (27) Marinelli, F.; Pietrucci, F.; Laio, A.; Piana, S. A kinetic Model of Trp-Cage Folding from multiple biased Molecular Dynamics Simulations. PLoS Comput. Biol. 2009, 5, e1000452. (28) Juraszek, J.; Saladino, G.; van Erp, T. S.; Gervasio, F. L. Efficient numerical Reconstruction of Protein Folding Kinetics with partial Path Sampling and Path-like Variables. Phys. Rev. Lett. 2013, 110, 108106. (29) Buchenberg, S.; Schaudinnus, N.; Stock, G. Hierarchical biomolecular Dynamics: Picosecond Hydrogen Bonding regulates Microsecond conformational Transitions. J. Chem. Theo. Comp. 2015, 11, 1330–1336. (30) Sittel, F.; Filk, T.; Stock, G. Principal Component Analysis on a Torus: Theory and Application to Protein Dynamics. J. Chem. Phys. 2017, 147, 244101 (31) Altis, A.; Otten, M.; Nguyen, P. H.; Hegger, R.; Stock, G. Construction of the Free Energy Landscape of Biomolecules via Dihedral Angle Principal Component Analysis. J. Chem. Phys. 2008, 128, 245102. (32) Sittel, F.; Stock, G. Robust Density-Based Clustering to Identify Metastable Conformational States of Proteins. J. Chem. Theo. Comp. 2016, 12, 2426–2435.

(33) Scherer, M. K.; Trendelkamp-Schroer, B.; Paul, F.; PrezHernndez, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.-H.; Noe, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. 2015, 11, 5525–5542 (34) van der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. C. Gromacs; Fast, Flexible and Free. J. Comput. Chem. 2005, 26, 1701–1718. (35) van Gunsteren, W. F.; Billeter, S. R.; Eising, A. A.; H¨ unenberger, P. H.; Kr¨ uger, P.; Mark, A. E.; Scott, W. R. P.; Tironi, I. G. Biomolecular Simulation: The GROMOS96 Manual and User Guide; Vdf Hochschulverlag AG an der ETH Z¨ urich: Z¨ urich, 1996. (36) Tironi, I. G.; van Gunsteren, W. F. A molecular Dynamics Simulation Study of Chloroform. Mol. Phys. 1994, 83, 381. (37) Abraham, M. J.; Murtola, T.; Schulz, R.; Pll, S.; Smith, J. C.; Hess, B.; Lindahl, E. GROMACS: High Performance molecular Simulations through Multi-Level Parallelism from Laptops to Supercomputers. SoftwareX 2015, 12, 19 – 25. (38) Tribello, G. A.; Bonomi, M.; Branduardi, D.; Camilloni, C.; Bussi, G. PLUMED 2: New Feathers for an old Bird. Comp. Phys. Comm. 2014, 185, 604 – 613. (39) Laio, A.; Rodriguez-Fortea, A.; Gervasio, F. L.; Ceccarelli, M.; Parrinello, M. Assessing the Accuracy of Metadynamics. J. Phys. Chem. B 2005, 109, 6714–6721. (40) Botan, V.; Backus, E.; Pfister, R.; Moretto, A.; Crisma, M.; Toniolo, C.; Nguyen, P. H.; Stock, G.; Hamm, P. Energy Transport in Peptide Helices. Proc. Natl. Acad. Sci. USA 2007, 104, 12749–12754. (41) N¨ uske, F.; Wu, H.; Prinz, J.-H.; Wehmeyer, C.; Clementi, C.; No´ e, F. Markov State Models from short Non-Equilibrium Simulations—Analysis and Correction of Estimation Bias. J. Chem. Phys. 2017, 146, 094104. (42) Gernert, R.; Emary, C.; Klapp, S. Waiting Time Distribution for continous stochastic Systems. Phys. Rev. E 2014, 90, 062115. (43) Bowman, G. R. Accurately modeling Nanosecond Protein Dynamics requires at least Microseconds of Simulation. J. Comput. Chem. 2016, 37, 558–566. (44) Shaw, D. E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y. et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330, 341 – 346. (45) Sabelko, J.; Ervin, J.; Gruebele, M. Observation of strange Kinetics in Protein Folding. Proc. Natl. Acad. Sci. USA 1999, 96, 6031–6036. (46) Levine, Z. A.; Shea, J.-E. Simulations of disordered Proteins and Systems with conformational Heterogeneity. Curr. Opin. Struct. Biol. 2017, 43, 95 – 103, Theory and simulation Macromolecular assemblies. (47) Buchenberg, S.; Sittel, F.; Stock, G. Time-resolved Observation of Protein allosteric Communication. Proc. Natl. Acad. Sci. USA 2017, 114, E6804–E6811. (48) Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and Challenges in the automated Construction of Markov State Models for full Protein Systems. J. Chem. Phys. 2009, 131, 124101.

ACS Paragon Plus Environment 8

Page 8 of 9

Page 9 of 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

TOC graphic PC2

L

MD-MSM

Meta-MSM

R PC1

ACS Paragon Plus Environment 9