Robust Density-Based Clustering To Identify ... - ACS Publications

Apr 8, 2016 - microstates separated by local free energy barriers. The algorithm is embedded into a complete workflow to robustly generate. Markov sta...
0 downloads 10 Views 4MB Size
Subscriber access provided by UOW Library

Article

Robust Density-Based Clustering to Identify Metastable Conformational States of Proteins Florian Sittel, and Gerhard Stock J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.5b01233 • Publication Date (Web): 08 Apr 2016 Downloaded from http://pubs.acs.org on April 11, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 14

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

which are not cut at, but rather include energy barriers. Using these states to calculate transition matrices, subsequent dynamic clustering cannot produce appropriate metastable states and often yields results that are very sensitive to details of the procedure, in particular in the case of low statistics. Moreover, as the k-means algorithm is not deterministic, results may be difficult to reproduce. As an alternative, geometric clustering techniques relying on densities of local neighborhoods have been developed. 18–21 This approach does not separate the underlying space into a Voronoi partitioning, but rather assigns densities to local points and separate clusters with respect to high- and low-density regions. However, existing methods are not necessarily well suited to analyze MD trajectories, since they may slightly distort the local neighborhood by removing neighbors, 19 require extensive amounts of memory 20 or do not provide a clear means to choose optimal input parameters for the clustering process. 21 In this work, we introduce a novel density-based clustering method that is closely related to the aforementioned techniques. The approach is deterministic, computationally efficient, and self-consistent in its parameter choice. The main idea is to assign a local free energy estimate to every frame of the trajectory from geometric coordinate space density at the given point. Using these free energy estimates, the frames are lumped into local free energy minima, ultimately forming microstates separated at local free energy barriers. The algorithm is embedded into a complete workflow to robustly generate Markov state models from MD trajectories, consisting of (i) a dimensionality reduction step, (ii) the proposed density-based clustering to generate microstates, and (iii) dynamical clustering to construct metastable states. Due to its selfconsistent nature, the density-based clustering approach works well with different methods for both, dimensionality reduction and dynamic clustering. Thus the workflow is highly modular and various established methods may be combined to produce Markov state models, depending on the demands of the underlying system. Moreover, the density-based geometric clustering provides also well-separated states and a good overview over the system’s dynamics in case of nonequilibrium dynamics, i.e., when dynamical clustering methods are not applicable. To test the proposed clustering workflow, we consider three well-established model problems: Conformational transitions of hepta-alanine (Ala7 ), folding of villin headpiece (HP-35) and functional dynamics of bovine pancreatic trypsin inhibitor (BPTI). For our test systems, we will employ dihedral angle-based PCA 26 for dimensionality reduction and the most probable path algorithm 25 for dynamic clustering.

2 2.1

Page 2 of 14

Theory and methods MD data

For the three considered systems, previously published all-atom MD trajectories are available. In the case of Ala7 , we adopted the 800 ns trajectory at 300 K provided by Altis et al., 27 which used the GROMOS 45A3 force field 28 and a sampling rate of 1 frame/ps. Long trajectories of HP-35 and BPTI were kindly provided by D.E. Shaw Research Group, using the Amber ff99SB*ILDN force field 29–31 on the special purpose Anton supercomputer. The trajectory of the 35 aa protein HP-35 was run at 360 K with a sampling rate of 1 frame/(200 ps) and showed 61 folding events within the simulation time of 300 µs. 32 The trajectory of the 58 aa protein BPTI was run at 300 K with a sampling rate of 1 frame/(250 ps). 33 We used only the first 800 µs for analysis, thus skipping an undersampled transition event at the end of the simulation. To reduce the number of dimensions taken into account for clustering, we performed a PCA on the MD trajectories. By diagonalizing the covariance matrix σij = h(ri − hri i) (rj − hrj i)i and projecting the original system coordinates r onto the resulting eigenvectors vi , we define the principal components, Vi = vi ·r, which are naturally ordered in descending variance of the system. Including only the first few PCs (depending on the system, about three to ten) in the model, we define a reduced coordinate space of drastically lower dimensionality, still describing most of the proteins dynamics. 7–9 When working with Cartesian coordinates, it is often not possible to properly separate overall and internal motion of the protein, which may lead to spurious results of the statistic and dynamics in PC space. 34 We therefore used sin/cos-transformed backbone dihedralangles as input coordinates of the PCA, 9 which –as internal coordinates– do not suffer from this problem. For all systems, only principal components showing structured distributions (i.e., non-random behavior) were included in the analysis, yielding five dimensions for Ala7 and ten dimensions for HP-35 and BPTI. Details of the dihedral-angle PCA of the three systems are found in Refs. 27,34,35 .

2.2

Density Measure

The local conformational space density is estimated by a simple distance criterion in N -dimensional space. That is, we define an N -dimensional hypersphere of radius R centered at a given point r′ in conformational space, and count how many frames of the trajectory lie inside this hypersphere. The distance d of a point r to the reference r′ is measured by the N -dimensional Euclidian norm d2 (r, r′ ) =

N X

n=1

ACS Paragon Plus Environment 2

2

(rn − rn′ ) .

(1)

Page 3 of 14

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

is RHP-35 = 0.5 and RBPTI = 0.6, respectively.

2.3

Lumping microstates by their free energy estimates

Having defined a free energy estimate for each MD frame, we do not yet know their connections in conformational space, which define the clusters of the trajectory. However, based on the free energy estimates, clusters may be identified by a simple procedure. As a first step, we will define a cutoff value at a relatively low free energy (e.g., F < 0.1 kB T ) and select all structures from the trajectory that suffice this criterion. All other structures will –for this step– be ignored. Second, all microstates with a squared geometric distance below a certain lumping threshold d2lump (described below) will be assigned to the same cluster. This assignment is transitive, i.e., if structure B is assigned to the same cluster as structure A, structure C will be assigned to A’s cluster, even if it is only close to B and not to A. Thus, a cluster S is defined recursively as the set S = {ri | d2 (ri , rj ) ≤ d2lump ∧ rj ∈ S ∪ S0 }

(4)

of structures ri with at least one close neighbor in the same set and an initial assignment S0 = {r0 }, where r0 is the frame of highest density (lowest free energy) which has not yet been assigned to any cluster. A set of structures will therefore form a single cluster, if for any structure in the set at least one other structure exists that is closer than the (geometric) lumping threshold. Two sets of structures are disjoint and form two distinct clusters, if none of the structures of the first set is closer than the lumping threshold to any of the structures of the second set. In a second step the free energy cutoff is increased in a step-wise manner, and at every step all structures with lower energy than the current energy cutoff are lumped into clusters according to the procedure described above. Absorbing more structures, clusters will grow (in population and geometric expansion) and at some point be closer to each other than the geometric lumping threshold, resulting in a new state consisting of the previously disjoint clusters. As an illustration of this lumping scheme, Fig. 2 shows the free energy landscape of a simple three-state model together with clusters resulting from the step-wise increase of the energy cutoff. The procedure is seen to generate a hierarchical network of cluster nodes, that are lumped at various levels of the free energy. In this way, the network provides a concise illustration of the geometric differences and relations of the microstates comprising the trajectory. Filling the free energy landscape from its basins to the barriers to identify transition regions, the procedure bares similarities to advanced sampling methods like conformational flooding 37 and metadynamics, 38 or to dynamic coarse graining methods like hierarchical Nystr¨om clustering. 39 To obtain a criterion for a suitable lumping thresh-

Page 4 of 14

old dlump , we assume that the assignment of individual structures to clusters needs to fulfill two plausible assumptions: First, for high-density structures a structure’s nearest neighbor should always fall in the same cluster. Second, clusters with closest distance much bigger than the ’typical’ nearest neighbor distances should always be separated. Under the assumption that the average nearest neighbor distance hdNN i is much smaller than the average pairwise distance of all frames hdi = h{d (ri , rj ) | i 6= j}i, we want to find a neighbor of a given structure with high probability (e.g., ≈ 0.95), which corresponds to a 2σ deviation of a Gaussian distribution for nearest neighbor distances. For computational efficiency, we express the lumping criterion in squared distances, leading to a criterion of finding the structures ri and rj in the same cluster as 2

d2 (ri , rj ) < (hdNN i + 2σdNN ) .

(5)

Given the decomposition 2

(hdNN i + 2σdNN )  

2 2 = hdNN i + 4 d2NN − hdNN i + 4 hdNN i σdNN

= 4 d2NN − hdNN i (3 hdNN i − 4σdNN ) ,

(6)



we identify d2lump := 4 d2NN as an upper bound for expression (5), since the right-hand side term (3 hdNN i − 4σdNN ) is in general positive. This is because of highly populated cluster centers with relatively evenly distributed frames and therefore small variances σdNN in nearest-neighbor distances. This definition for the lumping criterion, of course, is defined in an ad-hoc manner and based on the assumption of local Gaussian distribution of neighbor structures. Nonetheless, it conveniently adapts to the given conformational space, rendering it unnecessary to introduce another user-defined parameter. Also, tests using various data sets have shown this definition to be, above all, stable in regards of variations (e.g.,

quite d2lump := 3 d2NN ), leading to comparable results. Furthermore, all empirical tests have shown that the lumping criterion readily fulfills the condition (hdNN i + 2σdNN )

2

< d2lump

≪ {d2 (ri , rj ) | i 6= j} ,

(7)

comparing the average nearest neighbor distance to the average of pairwise distances between all microstates. The tree structure (Fig. 2) resulting from this densitybased geometric clustering procedure is a concise representation of the underlying free energy landscape in reduced dimensions. Again, we emphasize that all parameters used to define the tree, are estimated from the data itself and are thus properties of the system. In this way, we have defined a black-box-like procedure which encodes cluster populations at different levels of the free energy, geometric distances and neighborhood relations

ACS Paragon Plus Environment 4

Page 5 of 14

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 6 of 14

Page 7 of 14

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

3 3.1

Page 8 of 14

Results The free energy landscape of Ala7 is well described by density-based clustering

As a first test of the density-based clustering approach, we adopt an existing 800 ns trajectory of Ala7 which has been discussed as a standard example of conformational dynamics of a small peptide. 3,5,25,27 As described in Sec. 2.1, the MD data are first preprocessed by performing a PCA on the backbone dihedral angles of the five inner states. In the case of Ala7 the conformation of the inner residues is well described by a right-handed helix (α) and an extended state (β). Hence a “primitive clustering” of the overall conformation is given by a product state of the states of the five inner residues, e.g., ααααα for the all-α conformation. 27 Using the optimal radius R = 0.2 determined in Sect. 2.2, the free energy landscape was screened at discrete steps of 0.1 kT (Fig. 6). Interestingly, the tree structure reveals 32 microstates which exactly match the expected 25 = 32 states from the α/β-permutations of the five inner residues. Analyzing the microstates by their dihedral angle content (Fig. S1) shows that this assumption is indeed justified.

Figure 7: (Top) Markov state model of HP-35. Blue states are in the folded basin, yellow are in the intermediate basin and purple states are in the unfolded basin. The native and prime intermediate states are annotated by their lifetime. The node area indicates the population ratio of the states. The colored bars show the cumulative transition times between folded, intermediate and unfolded basins. (Bottom) Structure overlays representing for the highest populated states in the folded (state 2), intermediate(state 1) and unfolded (state 3) basins, respectively.

work of states (Fig. S2), the cored state trajectories were subjected to a Chapman-Kolmogorov test [Eq. (8)], which revealed Markovian behavior for lag times τ & 1 ps (Fig. S3). We note that –even for this simple case– previously performed k-means clustering required subsequent MPP dynamic clustering, in order to obtain the correct 32 clusters of the system. 27

3.2

Figure 6: Density network for Ala7 with free energy levels from 0.0 kT to 7.0 kT in steps of 0.1 kT . Microstates are annotated by their respective α/β product states. Since the resulting 32 states of Ala7 are well defined and can be shown to exhibit sufficient metastability, further lumping of states by dynamical clustering is not necessary. Hence boundary corrections using variable dynamic coring can be performed directly on the microstates resulting from density-based geometric clustering. To test the Markovianity of the resulting net-

The free energy landscape of the folding of HP-35 requires dynamical clustering

As a second example, we chose a 300 µs trajectory of the fast-folding HP-35 protein described in Sec. 2.1, for which a detailed metastable state analysis based on a combined k-means/MPP procedure was recently reported. 35 Applying the above proposed workflow, we performed dihedral angle PCA and density-based clustering, yielding a tree representation of the free energy with a broad network with 543 local minima (Fig. 8, cut at 8 kT for display purposes). Close to the folding temperature of 360 K, the vast majority of microstates is seen to be comprised in the entropic unfolded basin of HP-35. In contrast to the small peptide Ala7 , the considerable conformational diversity of HP-35 makes it necessary to refine the geometrically clustered model by dynamical clustering. Hence MPP dynamical clustering (with a lag time τ = 1 ns and minimal metastability Qmin = 0.85) and subsequent coring (using a state-dependent mini-

ACS Paragon Plus Environment 8

Page 9 of 14

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

mal time 2 ns ≤ tmin ≤ 20 ns) was performed. This results in twelve conformational states with significant metastability, which satisfy the Chapman-Kolmogorov criterion of Eq. (8) for lag times & 160 ns (Fig. S4). The metastable states are nicely characterized via their dihedral angle color plots in Fig. 5. Figure 7 shows an illustration of the Markov state model of HP-35, comprising the metastable states including their overall characterization as ’native’, ’intermediate’ and ’unfolding’ as well as their transition pathways. The state numbering is ordered by decreasing population. Interestingly, the highest populated state is not the folded state with three stable α-helices (state 2), but an ’intermediate’ state with stable α-helices but a slightly tilted N -terminus (state 1). The overall geometrical similarity and the difference in the N -terminus of these two state, which comprise roughly 42% of the total population, are well observed in the Ramacolor-plot shown in Fig. 5. By visual inspection of the filtered structures of both states we note a significant destabilization of the turns connecting the α-helices, leading to a more dynamic behavior of state 1, while showing high stability of state 2. Thus, the tilt in the N -terminal region may be interpreted as a sign not only of local, but also global destabilization, leading to the classification of state 1 as an intermediate state. State 8 (3.2% of total population), on the other hand, was classified as a second folded state due to its very stable turns, even though it shows a partially destabilized helix 3 (residues 23-32). A third folded structure is shown by state 12 (0.38%), with all three α-helices stabilized and very stable turns. In contrast to state 2, however, helix 1 (residues 4–10) is slightly tilted to a more parallel position towards helix 2 (residues 15–19). This tilt is also depicted in the Ramacolor diagram for residues 10 and 11 of state 12. State 4 (7.43%) also exhibits three stable α-helices, but shows strong dynamic behavior due to unstable turns. In contrast to this, states 10 (1.20%) and 11 (0.42%) show rather stable turns and helices 2 and 3, yet have a completely destabilized helix 1. The difference of state 11 to state 10 as a tilt in the N -terminus, clearly observable in the Ramacolor diagram at residues 2 and 3. Also still considered as ’intermediate’, state 9 (1.38%) is highly similar to state 1, yet shows more flexible turns and partial destabilization of the α-helices. The large entropic basin of HP-35 can be divided in four different states, that show large dynamical fluctuations, yet have remarkable structural attributes. The largest in population is state 3 (12.36%), which has partially stable helices 1 and 3 providing an elongated structure by lying parallel to each other while being connected by an unfolded helix 2. In contrast to state 3, while having a totally unfolded helix 1, state 5 (6.03%) shows a completely stable helix 3. States 6 (5.13%) and 7 (4.58%) may be interpreted as even more unstable with helix 3 being largely unfolded. Interestingly, while the relative orientation of secondary structure elements

is not directly observable from the Ramacolor plot in Fig. 5, distinctive structural elements like (partially) folded α-helices or tilts in turn-regions can nonetheless directly be observed. This already gives much information about the system even before deeper inspection of the single molecular structures belonging to a metastable state. Following the most frequently taken pathways along the Markov state model, it is obvious that states 1 and 4 act as hubs between the folded and unfolded regions. The most frequent folding pathway is the stabilization of helix 3, followed by the stabilization of helix 1 and 2, while retaining a tilted N -terminus. A further change in residue 3 from a α- to a αL -state goes hand in hand with the stabilization of the turns, resulting in the stable folded state 2. Overall, the result is a significant improvement to previous clustering of the HP-35 trajectory using k-means and the most probable path algorithm, which could identify only six (instead of twelve) metastable states showing Markovian behavior, 35 which however reduce to roughly the same three basins.

3.3

Dynamical clustering improves description of the functional dynamics of BPTI

As a final example, we have chosen BPTI as a wellestablished model of small-amplitude functional dynamics. As revealed by an analysis of the Ramachandran plots of all residues, BPTI shows significant structural changes in only 24 out of 56 residues. Due to this rigid nature of BPTI and the good statistic provided by the 1 ms trajectory of Shaw, 33 “primitive clustering” of the 24 dynamical residues can be performed. To this end, the Ramachandran plots of the selected residues are clustered by discrete binning for microstate generation followed by the application of the most probable path algorithm. 25 In this way, a small set of clusters per residue is identified, which are combined into product states to label the global state. We note that this clustering method based on per-residue product states is not feasible for more flexible or larger systems due to the exponentially growing number of substate permutations. Taking the primitive clustering results as reference, we performed density-based clustering to construct a tree representation of the free energy landscape (Fig. S5), which yields 78 microstates. Subsequent MPP dynamical clustering (with a lag time τ = 50 ns and minimal metastability Qmin = 0.79) and dynamical coring (using a state-dependent minimal time 50 ns ≤ tmin ≤ 90 ns) finally yields 8 metastable states, which constitute a Markov state model shown in Fig. S6. As discussed previously, 34 the metastable states of BPTI differ mostly in the conformations of the first and third flexible loop, while the two β-sheets and the α-helix remain relatively stable. While it is well-known 24 that dynamical methods may

ACS Paragon Plus Environment 9

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 14

Figure 8: Density network for HP-35 with free energy levels from 0.0 kT to 8.0 kT in steps of 0.1 kT . The full free energy landscape tree reaches 12.2 kT , and is cut to provide a concise picture. On the left, the metastable states of the folded and intermediate basins are clearly distinguishable from the large number of comparatively unstable microstates of the unfolded basin to the right.

vastly improve the clustering results of MD trajectories in means of producing Markov state models, one may naively assume that a geometric clustering algorithm which properly separates free energy barriers may – at a higher level – also provide good macrostates. To show that this is not necessarily the case and to emphasize the importance of the dynamical clustering as well as the proper choice of dimensionality reduction method, we also constructed comparable models with 8 metastable states in two other ways. To compare to density-based and primitive clustering, we also applied the MPP method on the primitive clustering results. To consider the effect of dynamical clustering in comparison to a geometric-clustering-only procedure, we furthermore partitioned the density-based free energy landscape into its eight main branches, that represent the states that are separated by the highest energy barriers. Subsequently all three sets of metastable states were refined by dynamic coring to correct for transition ambiguities at barrier regions. Figure S7 shows the Ramacolor plots obtained for the three models. While results from geometric-clustering-only differ significantly, results from density-based and primitive clustering are found to yield very similar results. The two methods differ only in state 4 which is a derivative of the native state with the difference of showing a left-handed α-conformation in residue 3 instead of a pronounced (right-handed) α-helix in residues 3 - 6. However, further analysis shows that the local but metastable change in residue 3 (which is more or less uncorrelated to the rest of the protein) is not resolved by the dihedral-angle PCA in any of the ten principle components used as reduced space description. Missing this state has thus to be attributed not to the density-based geometric clustering, but rather to the preparing dimensionality reduction step. To demonstrate the effects of the various clustering methods on the resulting Markov state model, Fig. 9a shows the corresponding metastabilities ob-

tained for a lag time τ = 50 ns). While the results from density-based and primitive clustering are again quite similar (maybe except for state 4, see above), the metastabilities from geometric-clustering-only are significantly lower. Calculating the population difference ||P (0)(T (nτ ) − T n (τ ))|| for n = 8 as defined by the Chapman-Kolmogorov equation (8), Fig. 9 reveals clearly that for BPTI even a high-quality geometric clustering method alone is not able to yield a suitable Markov state model.

4

Conclusion

We have outlined a robust, efficient and complete workflow to derive Markov state models from MD trajectories. At the heart of this workflow is a new density-based geometrical clustering algorithm that constructs microstates, which are well-separated at local free energy barriers, in a self-consistent manner. To characterize the resulting state-resolved conformational distribution, dihedral angle content color plots have been introduced which identify structural differences of protein states in a concise, yet expressive way. By combining the algorithm with existing dynamical clustering and coring methods, the potential of the approach has been shown by successfully constructing Markov state models for various representative model problems of protein conformational dynamics. The presented workflow is implemented in a freely available, highly optimized C++ application, downloadable at http://www.moldyn.unifreiburg.de/software/software.html. Acknowledgement We thank D. E. Shaw Research for sharing their trajectories of HP-35 and BPTI and Matthias Ernst, Sebastian Buchenberg and Abhinav Jain for numerous instructive and helpful discussions. Supporting Information Available:

ACS Paragon Plus Environment 10

State de-

Page 11 of 14

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(16) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Sch¨ utte, C.; Noe, F. Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 2011, 134, 174105. (17) Shukla, D.; Hern´ andez, C. X.; Weber, J. K.; Pande, V. S. Markov state models provide insights into dynamic modulation of protein function. Acc. Chem. Res. 2015, 48, 414 – 422. (18) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; Simoudis, E., Han, J., Fayyad, U. M., Eds.; AAAI Press, 1996; pp 226–231.

Page 12 of 14

(28) Schuler, L. D.; Daura, X.; van Gunsteren, W. F. An improved GROMOS96 force field for aliphatic hydrocarbons in the condensed phase. J. Comput. Chem. 2001, 22, 1205–1218. (29) Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 2006, 65, 712–725. (30) Best, R. B.; Hummer, G. Optimized Molecular Dynamics Force Fields Applied to the Helix-Coil Transition of Polypeptides. J. Phys. Chem. B 2009, 113, 9004–9015.

(19) Keller, B.; Daura, X.; van Gunsteren, W. F. Comparing geometric and kinetic cluster algorithms for molecular simulation data. J. Chem. Phys. 2010, 132, 074110.

(31) Lindorff-Larsen, K.; Piana, S.; Palmo, K.; Maragakis, P.; Klepeis, J. L.; Dror, R. O.; Shaw, D. E. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins 2010, 78, 1950 – 1958.

(20) Campos, S. R.; Baptista, A. M. Conformational analysis in a multidimensional energy landscape: Study of an arginylglutamate repeat. J. Phys. Chem. B 2009, 113, 15989–16001.

(32) Piana, S.; Lindorff-Larsen, K.; Shaw, D. E. Protein folding kinetics and thermodynamics from atomistic simulation. Proc. Natl. Acad. Sci. USA 2012, 109, 17845–17850.

(21) Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496.

(33) Shaw, D. E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y.; Wriggers, W. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330, 341 – 346.

(22) Hartigan, J. A.; Wong, M. A. A K-means clustering algorithm. Applied Statistics 1979, 28, 100– 108. (23) R¨oblitz, S.; Weber, M. Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. Adv. Data Analysis and Classification 2013, 7, 147–179. (24) Bowman, G. R.; Meng, L.; Huang, X. Quantitative comparison of alternative methods for coarsegraining biological networks. J. Chem. Phys. 2013, 139, 121905. (25) Jain, A.; Stock, G. Identifying metastable states of folding proteins. J. Chem. Theo. Comp. 2012, 8, 3810 – 3819. (26) Mu, Y.; Nguyen, P. H.; Stock, G. Energy Landscape of a Small Peptide Revealed by Dihedral Angle Principal Component Analysis. Proteins 2005, 58, 45 – 52. (27) Altis, A.; Otten, M.; Nguyen, P. H.; Hegger, R.; Stock, G. Construction of the free energy landscape of biomolecules via dihedral angle principal component analysis. J. Chem. Phys. 2008, 128, 245102.

(34) Sittel, F.; Jain, A.; Stock, G. Principal component analysis of molecular dynamics: On the use of Cartesian vs. internal coordinates. J. Chem. Phys. 2014, 141, 014111. (35) Jain, A.; Stock, G. Hierarchical folding free energy landscape of HP35 revealed by most probable path clustering. J. Phys. Chem. B 2014, 118, 7750 – 7760. (36) Epanechnikov, V. A. Non-Parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications 1969, 14, 153–158. (37) Grubm¨ uller, H. Predicting slow structural transitions in macromolecular systems: Conformational flooding. Phys. Rev. E 1995, 52, 2893–2906. (38) Laio, A.; Parrinello, M. Escaping free-energy minima. Proc. Natl. Acad. Sci. USA 2002, 99, 12562 – 12566. (39) Yao, Y.; Cui, R. Z.; Bowman, G. R.; Silva, D.-A.; Sun, J.; Huang, X. Hierarchical Nystr¨om methods for constructing Markov state models for conformational dynamics. The Journal of Chemical Physics 2013, 138 .

ACS Paragon Plus Environment 12

Page 13 of 14

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(40) Faradjian, A. K.; Elber, R. Computing time scales from reaction coordinates by milestoning. J. Chem. Phys. 2004, 120, 10880–10889. (41) Kabsch, W.; Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen bonded and geometrical features. Biopolymers 1983, 22, 2577–2637. (42) Frishman, D.; Argos, P. Knowledge-based protein secondary structure assignment. Proteins 1995, 23, 566–79. (43) Humphrey, W.; Dalke, A.; Schulten, K. VMD – Visual Molecular Dynamics. Journal of Molecular Graphics 1996, 14, 33–38.

ACS Paragon Plus Environment 13

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 14 of 14