A minimum variance clustering approach produces ... - ACS Publications

Nov 27, 2017 - agglomerative clustering with Ward's minimum variance objective function, and the similarity of the microstate dynamics are determined ...
0 downloads 0 Views 1MB Size
Article Cite This: J. Chem. Theory Comput. 2018, 14, 1071−1082

pubs.acs.org/JCTC

A Minimum Variance Clustering Approach Produces Robust and Interpretable Coarse-Grained Models Brooke E. Husic, Keri A. McKiernan, Hannah K. Wayment-Steele, Mohammad M. Sultan, and Vijay S. Pande* Department of Chemistry, Stanford University, Stanford, California 94305, United States ABSTRACT: Markov state models (MSMs) are a powerful framework for the analysis of molecular dynamics data sets, such as protein folding simulations, because of their straightforward construction and statistical rigor. The coarsegraining of MSMs into an interpretable number of macrostates is a crucial step for connecting theoretical results with experimental observables. Here we present the minimum variance clustering approach (MVCA) for the coarse-graining of MSMs into macrostate models. The method utilizes agglomerative clustering with Ward’s minimum variance objective function, and the similarity of the microstate dynamics is determined using the Jensen−Shannon divergence between the corresponding rows in the MSM transition probability matrix. We first show that MVCA produces intuitive results for a simple tripeptide system and is robust toward long-duration statistical artifacts. MVCA is then applied to two protein folding simulations of the same protein in different force fields to demonstrate that a different number of macrostates is appropriate for each model, revealing a misfolded state present in only one of the simulations. Finally, we show that the same method can be used to analyze a data set containing many MSMs from simulations in different force fields by aggregating them into groups and quantifying their dynamical similarity in the context of force field parameter choices. The minimum variance clustering approach with the Jensen−Shannon divergence provides a powerful tool to group dynamics by similarity, both among model states and among dynamical models themselves.

1. INTRODUCTION Molecular dynamics (MD) simulations provide the temporal and spatial resolution to investigate biomolecular dynamics in silico.1 Specialized hardware2 and distributed computing networks3,4 have enabled simulations at biologically relevant timescales up to tens of milliseconds. These data sets are so massive that they require complementary analysis procedures in order to process and understand the simulation results. Markov state models (MSMs)5,6 represent one framework that has successfully harnessed huge MD data sets to describe the dynamics of a simulated system. MSMs are a master equation approach to conformational dynamics that partition the phase space explored by a system into disjoint states. The states are assumed to be Markovian, or “memoryless”, where the probability of transitioning between two states is independent of the previously occupied states. The transition matrix of an MSM, often defined for a discrete Markovian time step, contains thermodynamic, kinetic, and pathway information and can propagate the system forward in time. Early MSM methods development for applications to MD focused on choosing disjoint states such that the system dynamics is approximately Markovian. 7,8 For the first quantitatively predictive studies of protein folding, tens or hundreds of thousands of states were used to describe the conformational dynamics.9,10 These states were determined by system experts and often relied on heuristic choices. In these © 2017 American Chemical Society

studies, macrostate analyses were performed on the transition matrices to reduce the number of states to a few thousand.9−11 Many methods developers advocated a two-step process: first, use heuristics or system expertise to develop a quantitative microstate model with many thousands of states; then, create a fewer-state coarse-grained model for interpretability.12−15 The variational principle for MSMs16,17 transitioned MSM methods development away from heuristic state choices toward systematic and objectively comparable state decompositions. This variational approach to conformational dynamics (VAC)16,17 thus introduced a systematic method for choosing among state decompositions. The VAC, combined with improvements in model selection18−20 and the clustering step of the state decomposition,21 has led to more statistically robust models with variationally optimized numbers of microstates, usually on the order of hundreds.22 Coarse-grained models for these MSMs often contain just a few states, representing a bridge between many-state theoretical models and generally coarser experimental observables. Thus, the transformation of an optimized MSM containing a few hundred microstates into an interpretable, coarse-grained model with two or three macrostates is crucial for communicating predictions or explanations derived from an MSM.14 Received: September 27, 2017 Published: December 18, 2017 1071

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

which is the Kullback−Leibler divergence from the “reference” distribution P to the “test” distribution Q,32 which is generally not equal to divKL(Q∥P). To compare two probability distributions in a symmetric way that does not require one distribution to be the test and the other to be the reference, the Jensen−Shannon divergence can be used:23

In this work, we adapt agglomerative hierarchical clustering to produce interpretable coarse-grained MSMs. By using a version of the Jensen−Shannon divergence23 to determine similarity, we can group states with similar dynamics using a minimum variance criterion.24 We begin by introducing the MSM transition matrix and discussing how to compare and cluster its constituent probability densities according to minimum variance. We then introduce a new minimum variance clustering approach for the coarse-graining of MSMs into interpretable macrostate models. After motivating the technique, we compare it to other coarse-graining algorithms for a proline-centered tripeptide system. We then demonstrate its performance on two protein folding simulations of the same system performed in different force fields, revealing a misfolded state present in only one of the simulations. Finally, we demonstrate how the method can be extended to the clustering of groups of transition matrices from separate MSMs by analyzing a protein folding simulation performed in nine different force fields. The combination of the Jensen−Shannon representation for assessing differences in dynamics with a minimum variance criterion is shown to be effective in generating robust, interpretable models.

divJS(P Q) ≡ 1

div

where the transition matrix T is row-stochastic, a are the eigenvectors, and the eigenvalues λi are real and indexed in decreasing order. The first eigenvector corresponds to the identity function, and the eigenvalue λ1 = 1 is unique. All subsequent eigenvalues are on the interval |λi>1| < 1 and correspond to dynamical processes in the time series. The timescales characterizing the decay of these processes to the stationary distribution can be obtained from the eigenvalues and the lag time at which T was defined by ti = −τ/ln λi.25 The MSM is obtained from simulation data by transforming the raw Cartesian coordinates into (integer) microstate labels, such as by clustering on the pairwise root-mean-square deviations between conformations, on an internal coordinate system such as residue contact distances or backbone dihedral angles, or on a transformed data set with reduced dimensionality, such as from time structure-based independent component analysis (tICA).26−28 These choices can be evaluated and optimized using the VAC.29,30 Then, given a discrete, disjoint state decomposition, the elements of the transition matrix can be obtained from observed transition counts. Finally, the maximum likelihood estimator (MLE) of the transition matrix that fulfills the detailed balance is used.31 2.2. Comparing Transition Probability Matrices. For an MSM represented by a row-stochastic transition probability matrix with N microstates, one way to write the similarity between row P and row Q is divKL(P Q) ≡

∑ P(i) log i

P(i) Q (i )

1

JS (P

Q) ≡

divJS(P Q)

(4)

Similar formulations using either the Kullbeck−Leibler or Jensen−Shannon divergence have been used to compare MSM transition matrices for adaptive sampling,32 coarse-graining,34 perturbations to dynamics,35−37 coupling to experiment,38 and surprisal39 and error40 analyses. 2.3. Agglomerative Clustering of Transition Matrices. Hierarchical agglomerative clustering algorithms partition data points into clusters according to a similarity function and an objective function. Initially, each data point is the sole element of a singleton cluster. At every step, the closest two clusters according to the similarity function are agglomerated into a new cluster according to the objective function. The clustering stops when the desired number of clusters or a threshold in the objective function is reached. Common objective functions for hierarchical agglomerative clustering include single, complete, and average linkages, which define the distance between clusters i and j as the smallest, greatest, or average pairwise distance between any point in i and any point in j, respectively. A more intricate objective function known as Ward’s method defines the distance between two clusters as the increase in intracluster variance resulting from the agglomeration of those clusters.24 At each agglomeration step, the distances between the clusters are updated according to the formula41,42

(1)

N

(3)

where M = 2 P + 2 Q . In other words, we separately evaluate how well the mean of P and Q represents each distribution and sum the normalized divergences. Endres and Schindelin33 showed that the square root of the Jensen−Shannon divergence obeys the triangle inequality, so we write the following symmetric distance metric:

2. THEORY BACKGROUND 2.1. The Markovian Transition Matrix. A discrete-time, discrete-space MSM can be created from trajectory data X under the assumption of the Markov property, which states that the conditional probability of transitioning from state Xt = i to state Xt+τ = j after a lag time of τ is independent of the state of the system at times before t. This transition matrix is used to solve the generalized eigenvalue problem,17 Ta = λa

1 1 divKL(P M) + divKL(Q M) 2 2

d(u , v) =

| v | + | s| |v | + |t | |v | d(v , t )2 − d(s , t )2 d(v , s)2 + T T T (5)

where the clusters s and t have been agglomerated to make a new cluster u; d(u,v) is the updated distance between the newly formed cluster u and another unchanged cluster v; |s|, |t|, and |v| are the numbers of data points in clusters s, t, and v, respectively; and T ≡ |s| + |t| + |v|. When performing hierarchical agglomerative clustering using the objective function shown in eq 5, we refer to this as using Ward’s minimum variance criterion. Ward’s method for agglomerative clustering has been used43−47 and shown to produce variationally optimal models21 for MSM state decomposition. In addition to the objective function for cluster agglomeration, the clustering algorithm also requires a pairwise similarity function such as Euclidean distance or RMSD.48 In this work, we use div JS defined in section 2.2 as the similarity function to cluster the rows of transition probability matrices to coarse-grain MSMs in an interpretable and robust way.

(2) 1072

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

3. CHOOSING MINIMUM-VARIANCE MACROSTATES

calculated from the eigendecomposition of the transition matrix. The PCCA algorithm assumes that the transition matrix has a block structure, which can be seen in the toy transition matrix above. Using the eigenvectors from the decomposition, the microstates are separated according to positive and negative elements of the eigenvectors. For example, the dominant (right) eigenvector of the transition matrix above would be

3.1. Motivation and Background. In this section, we use div JS and Ward’s minimum variance criterion to cluster rows within an MSM transition matrix to produce coarse-grained MSMs. As an example, consider an equally spaced four-state partition of a one-dimensional potential with two evenly spaced

[−1.1586 − 1.0192 0.8578 0.9812 ]

so the first two microstates would be assigned to one macrostate since the corresponding elements are both negative, and the third and fourth microstates would be assigned to another macrostate since those elements are both positive. Coarse-graining into more than two macrostates employs the signs of elements in subsequent eigenvectors to further split the existing macrostates. Since PCCA does not incorporate the magnitudes of the eigenvector elements, it is easy to see that elements very close to zero will affect the results of PCCA in a possibly arbitrary way. Robust PCCA, often called PCCA+,11,56,57 was thus developed as an augmentation of PCCA to improve treatment of eigenvector elements close to zero instead of classifying them according to their sign as in PCCA. The PCCA+ algorithm identifies representative states according to their distinctness and assigns to the remaining states degrees of membership (as opposed to “crisp” membership) to states in the representative set.58 The Bayesian agglomerative clustering engine (BACE)34 is a different approach in which the matrix of observed transition counts is analyzed. First, a Bayes factor between each pair of states is computed, which is a version of the Jensen−Shannon divergence where each term is weighted by the number of observed counts for that state.34 Then agglomerative clustering is performed by merging the two states with the smallest Bayes factor. At each agglomeration, new Bayes factors are recomputed, and the clustering continues until the desired number of macrostates is reached. This number can be chosen a priori, as in PCCA or PCCA+, or by evaluating the change in the Bayes factor of the merged states at each step of the algorithm. The minimum variance clustering analysis (MVCA) introduced in this work utilizes the transition probability matrix by coarse-graining the probability distributions represented by its rows. MVCA relies on the intuitive minimum variance objective function to group the rows of the transition matrix according to their similarity, which is measured by div JS . The change in the objective function for each agglomeration can also be used to determine how many macrostates are appropriate for a given system. In the following sections, we apply MVCA to proline dynamics and compare its performance with PCCA, PCCA+, and BACE. Finally, we demonstrate the ability of MVCA to determine the appropriate number of macrostates for two different models for the folding of the CLN025 miniprotein. Other MSM coarse-graining algorithms include the hierarchical Nyström extension graph (HNEG)59,60 and projected Markov models (PMMs).61 HNEG is discussed separately in Appendix C because, unlike the coarse-graining algorithms above, it does not permit a user-selected number of macrostates. PMMs are briefly discussed in section 3.4; they

Figure 1. Double-well potential partitioned into four equal microstates along the dotted lines. A two-macrostate model for this potential should map the left two microstates to the left basin and the right two microstates to the right basin.

wells (Figure 1). Numbering the four states from left to right might produce the following row-stochastic transition matrix: ⎡ 0.9733 0.0267 0. 0. ⎤ ⎢ ⎥ 0. ⎥ ⎢ 0.0250 0.9714 0.0036 ⎢ 0. 0.0030 0.9739 0.0231⎥ ⎢ ⎥ ⎣ 0. 0. 0.0255 0.9745⎦

where each row represents the probability of transitioning from a state to itself (diagonal) or any other state (off-diagonal). The div JS value between the first and second rows (Figure 1, left well) is 0.757, and the div JS value between the third and fourth rows (Figure 1, right well) is 0.761. The div JS value between any pair of rows with one state in each well is greater than 0.819. It is clear from this simple example that a reasonable two-state approximation to the four-state model represented by the transition matrix would involve merging the two states in the left basin and merging the two states in the right basin. To extend this analysis to larger systems with more states, we use Ward’s minimum variance objective function. The other agglomerative clustering objective functions enumerated in section 2.2 (single, average, and complete linkages) would not be appropriate for this type of analysis. Since single and complete linkages define the distance between any two clusters by only one pair of points, these options will lead to pathological examples since most of the data are not used when evaluating the objective function. Although average linkage does incorporate all of the data points, it has also been shown to produce results that are similarly pathological to single linkage on simple data sets.49 In contrast, Ward’s method has successfully been used to analyze MD simulations21,43−47,50 and data sets in other fields such as neuroimaging.51−53 Determining MSM macrostates via analysis of the transition matrix is not a new approach. An early method developed to coarse-grain MSMs is Perron cluster cluster analysis (PCCA).8,54,55 PCCA leverages the separation of timescales 1073

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

This mislabeled microstate is accessed only once during the simulation: in this instance, the system transitions to the mislabeled state from a state in the β-sheet basin at the previous time step and to a state in the α-helix basin at the next time step. The corresponding element of the first non-Perron right eigenvector, on which PCCA and PCCA+ operate for the first splitting, incorrectly dictates that this microstate should belong to the α-helix basin. The implementation of the BACE algorithm as described in ref 34 involves a preprocessing step in which each undersampled state is merged with its kinetically nearest neighbor before coarse-graining. It is in this step that the microstate near ψ ≈ 140° is misassigned. If the algorithm is adjusted such that no states are considered to be undersampled, BACE correctly assigns this microstate, but it mislabels a different state in the β-sheet basin.67 The MVCA algorithm uses the transition probability matrix to coarse-grain the MSM. The row of the transition matrix corresponding to this mislabeled microstate shows that it has a 49.8% chance of transitioning to the β-sheet region microstate from which it came in the original trajectory and a 50.2% chance of transitioning to the α-helix region microstate to which it transitions in the original trajectory. To understand why MVCA allocates this row to the β-sheet macrostate instead of the α-helix one, we note that most of the higher-energy microstates in the transition region are assigned to the pink βsheet macrostate (Figure 3). These higher-energy microstates are reasonably likely to transition to both the β-sheet (pink) and α-helical (blue) macrostates. Therefore, even though the microstate near ψ ≈ 140° that is mislabeled by the other algorithms is slightly more likely to transition to the α-helical macrostate according to the transition matrix probabilities, it contributes less variance to the sum of total intracluster variances (i.e., the Ward objective function) when it is merged into the β-sheet state, since the pink β-sheet state in the twomacrostate model contains most of the higher-energy microstates from the transition region. In MD simulations of large systems, it may be the case that some degrees of freedom are not fully sampled, and an event unrelated to the dynamics under study, such as a dihedral flip or a contracted distance between residues during a protein folding simulation, may occur on a much longer timescale than the process of interest.68−70 To explore the robustness of the macrostating algorithm toward this type of event, we also investigated how the addition of an artifact to the trajectory affects resulting coarse-grained model. To do this, we incorporate a new, previously unseen microstate into the raw trajectory at the step before the MSM is built. The system enters this new microstate at a randomly chosen time point, remains in the system for a set duration of time, and then exits.71 This analysis is designed to model the effect of a single, rare dihedral flip or other event that is orthogonal to the dynamics of interest or undersampled at the timescales of interest. Since this state is inserted after the data has been clustered into microstates, it has no physical meaning in ϕ × ψ space. By increasing the duration of time that the system stays in this “artifact” microstate, we can observe the duration at which each coarse-graining algorithm skews its macrostate model to accommodate this artifact. The results are shown in Figure 3 (bottom). PCCA+ is found to be the least robust toward the artifact, which affects the results of the macrostate model after a duration of 0.4 ns and leads to a macrostate model in which the artifact is a singleton macrostate and the second macrostate

are considered separately because they require a prior macrostate model. 3.2. Identifying Macrostates of Proline Dynamics. We first investigate the dynamics of the tripeptide Asp-Pro-Glu, comprising the third, fourth, and fifth residues of CLN025 and its predecessor chignolin, both of which are 10-residue βhairpin peptides designed to fold quickly.62,63 This segment is interesting because the proline is the last residue of the β-sheet and the glutamic acid is the first residue of the turn. Furthermore, proline occupies only two metastable basins with known secondary structure character.64 The tripeptide was simulated for 1 μs in the AMBER ff99SB-ILDN force field65 at 300 K with frames saved every 100 ps. To isolate the proline dynamics, we first partition the ϕ and ψ space of the central proline residue into 36 regularly spaced intervals for a total of 1296 square clusters. We then build an MSM from this state decomposition at a lag time of 100 ps, which was verified to be Markovian using an analysis of implied timescales.66 Figure 2 shows the free energy of each of the populated microstates. With the regular spatial clustering described above,

Figure 2. Free energy surface of the central proline residue in the AspPro-Glu tripeptide based on an MSM constructed from regular spatial clustering of the ϕ and ψ proline backbone angles. The boundaries are periodic, so the free energy surface identifies two basins. The basin at higher values of ψ represents β-sheet conformations, and the basin at lower values of ψ represents α-helical conformations. Of the 1296 possible microstates, 172 are occupied.

only 172 of 1296 possible ϕ × ψ microstates are occupied for this system. The transition matrix is relatively sparse; for each row, the number of nonzero transitions ranges from 1 to 70. The mean number of nonzero transitions is 22, and the median is 15. Since the edges of the free energy surface are periodic, it is qualitatively clear from Figure 2 that there are two basins, which is expected for proline.64 The basin at greater values of ψ is the β-sheet region, and the basin at smaller values of ψ is the α-helical region. An effective coarse-graining algorithm should identify these basins when two macrostates are used. PCCA, PCCA+, BACE, and MVCA are applied to this data set to create two-macrostate models; the last of these is applied by agglomerative clustering of the rows of the transition matrix using div JS . The four results are shown in Figure 3 (top row). All four methods successfully identify the two metastable basins in Figure 2 as macrostates. All of the methods have some trouble around the transition region, which is difficult to sample effectively. Additionally, PCCA, PCCA+, and BACE each mislabel a microstate near ψ ≈ 140° as α-helical even though it is in the β-sheet basin, while MVCA does not. 1074

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

Figure 3. Coarse-graining of 172 microstates into two macrostates with PCCA, PCCA+, BACE, and MVCA for the central proline residue of the Asp-Pro-Glu tripeptide (top) and robustness of each coarse-graining algorithm toward an artifact of variable duration (bottom). All of the coarsegraining algorithms identify the two metastable basins, but the PCCA, PCCA+, and BACE models mislabel a state near ψ ≈ 140°, whereas the MVCA model does not. The algorithms exhibit varying levels of robustness toward a random artifact inserted once in the trajectory for 100 ps or longer. The bottom row shows plots of the number of microstates contained in the two macrostates identified by the models in the top row. The artifact is not shown, since it is physically meaningless. The PCCA, PCCA+, and BACE plots show the artifact duration after which the pink macrostate contains only the artifact and the blue macrostate contains all of the other states. The dashed line represents the longest timescale in the microstate MSM without the artifact. PCCA and PCCA+ are the least robust toward this artifact and break down when the duration of the artifact approaches the longest MSM timescale. After the artifact duration is increased by 2 orders of magnitude, BACE also breaks down. MVCA is not affected by this type of artifact.

sampling or of a few starting positions when many short trajectories are used to model a system. It may be the case that at very long durations the “artifact” is physically meaningful. Monitoring the change in the Bayes factors with the artifact duration increasingly favors a threemacrostate model containing the two macrostates identified in the top row of Figure 3 and the artifact as a singleton cluster. This BACE result is replicated by MVCA if Bayes factors are used as the similarity function instead of div JS because the Bayes factors incorporate observed counts when assessing microstate similarity (the minimum variance criterion need not be modified). However, it may also be the case that a very longlasting artifact obscures the slow process of interest, as in Figure 9 of ref 70. The appropriate way to analyze such a longduration artifact should therefore be decided on a case-by-case basis. Both analysis methods, i.e., using unweighted and countsbased microstate similarity functions, are amenable to the minimum variance objective function by simply swapping out the standard div JS for a counts-weighted function in the latter case. We therefore recommend MVCA, which is shown here to be robust toward such artifacts and amenable to modularity if a different similarity function is preferred. 3.3. Analyzing the Folding of CLN025 in Different Force Fields. The folding of the CLN025 miniprotein was simulated by both Lindorff-Larsen et al.72 and McKiernan et al.73 in different force fields. The Lindorff-Larsen data set uses the CHARMM22* protein force field74 and the mTIP3P water model75 and comprises a single 106 μs simulation. The McKiernan simulation was performed in the AMBER-FB15 protein force field76 and the TIP3P-FB water model77 and

comprises all of the other states after an artifact duration of 2.9 ns. PCCA is also not robust toward artifact durations on the order of nanoseconds, and the coarse-graining is affected after an artifact duration of 4.3 ns and produces a singleton artifactonly macrostate after 9.3 ns. The BACE macrostate model is robust to the artifact for all durations shorter than 324 ns (nearly one-quarter of the modified trajectory length); at longer durations, the coarsegraining model yields one singleton macrostate containing the artifact. Since the BACE Bayes factors incorporate the number of observed transition counts from each state, as these counts increase the Bayes factors change until the artifact becomes the most dissimilar macrostate. In contrast, the MVCA model is robust toward any duration of the artifact, evaluated up to 100 ms. This is the case because at no duration of the artifact does the variance after merging it with either of the basins exceed the variance of merging the basins with each other. The longest timescale of the microstate MSM is 3.9 ns, and we see that PCCA and PCCA+ break down when the duration of the artifact approaches the length of the longest timescale because the first splitting is performed on the basis of the dominant eigenvector of the transition matrix eigendecomposition. PCCA and PCCA+ are therefore not expected to produce interpretable macrostate models for MSMs created from MD simulations when the timescale of interest is shorter than another timescale in the simulation, whereas BACE and MVCA are robust toward artifacts at moderate durations. This type of artifact can be common in large MD simulations because of modes that are rarely accessed for substantial durations of time, which might occur as a result of stochastic 1075

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation contains 144 short (on the order of 1 μs) trajectories for a total of 101 μs. The simulation details are described in the respective publications. For this example, we build an optimized MSM for each simulation data set. For both MSMs, we represent the system by its α-dihedral angles, defined by each set of four adjacent α-carbons along the protein backbone. We then use tICA27,28 to transform the data set, retaining all of the components after weighting with the kinetic mapping scheme;19 cluster from tICA space using mini-batch k-means; and build an MSM. For the CHARMM22* system we use a tICA lag time of 128 ns, 704 clusters, and an MSM lag time of 50 ns; for the AMBER-FB15 model we use a tICA lag time of 20 ns, 157 clusters, and an MSM lag time of 20 ns. The tICA and clustering parameters were selected through a variational analysis.18 The CHARMM22* analysis is described in the supplementary material of ref 73. The AMBER-FB15 analysis was performed using 200 trials, each with ten 50/50 crossvalidation splits, to select the number of k-means cluster centers for a tICA lag time of 20 ns. The best model was chosen as the MSM (20 ns lag time) with the highest median cross-validated sum of eigenvalues for five timescales. Although the protein system is the same, these two models were constructed independently with different protocols in different force fields. Here we demonstrate how analysis with MVCA shows that different numbers of macrostates are appropriate for the two models. We analyze the optimized models separately by using MVCA to choose the appropriate number of macrostates and then to coarse-grain each model. Figure 4 shows how the updated sum of intracluster variances (recall eq 5) changes with the number of macrostates (left), the coarse-grained model superimposed on the two-dimensional free energy surface (right), and example structures for each macrostate (bottom). For the CHARMM22* model, there is a sharp decrease in the Ward objective function in going from one to two macrostates, with much smaller decreases for models with more macrostates. In contrast, for the AMBER-FB15 model, there are substantial decreases in the objective function in going from both one to two macrostates and two to three macrostates, with shallower decreases thereafter.78 These plots justify a two-macrostate model for the CHARMM22* data and a three-macrostate model for the AMBER-FB15 data. Even though the same system was simulated, the MVCA analysis suggests a different number of macrostates is best for each force field. Analysis of the identified macrostates shows that the AMBER-FB15 model identifies an α-helical misfolded state79 in addition to the folded state and the extended denatured state. Analysis of the secondary structure distributions of the original microstates shows that the CHARMM22* model does not contain the α-helical state. Therefore, it is appropriate to choose a different number of macrostates for each simulation, which is consistent with the MVCA analysis. The differing results for the two force fields are likely due to the different protein and water model parameters. We see from this application that MVCA is useful for identifying an optimal number of macrostates through monitoring of the reduction in variance at different levels of coarse-graining. 3.4. Other Algorithms and Extensions. BACE (section 3.1) is an agglomerative clustering method for the coarsegraining of MSMs that quantifies the similarity between states by a Bayes factor, which is a reweighted form of the Jensen− Shannon divergence (eq 3).34 Both the MVCA and BACE

Figure 4. Coarse-grained models for the folding free energy landscape of CLN025 in the CHARMM22* force field (top) and AMBER-FB15 force field (middle). Reaction coordinates 1 and 2 correspond to the first and second tICs for each model. According to changes in the Ward objective function at each agglomerating step (left), MVCA identifies two macrostates for the CHARMM22* model and three macrostates for the AMBER-FB15 model. Variance values are given by the median variance from 200 rounds of bootstrapping with replacement. Error bars represent the 5th and 95th percentiles from the bootstrapping analysis and are too small to extend beyond the data point markers for greater numbers of macrostates. The three AMBERFB15 macrostates correspond to the folded β-hairpin structure (green), a misfolded α-helical structure (blue), and the extended denatured state (yellow), whereas the two CHARMM22* macrostates identify only the folded and extended structures (bottom).

algorithms produce a hierarchical tree that can be cut at any number of clusters, and extensions to MVCA are similar to those possible with BACE.34 For example, a state that results in high intracluster variance for any cluster it joins can be addressed as an outlier from the beginning using the pairwise distances. Whereas the BACE algorithm effectively forgets any original distance between any pair of states once they are merged, the MVCA algorithm retains the pairwise distances from the original microstate decomposition for the entirety of the clustering process, incorporating that information into each distance update and retaining a richer representation of the system. In a sense, the practical trade-off between MVCA and BACE is thus a choice between memory and speed. Since MVCA updates every pairwise distance at every step, it is important to note that MVCA will scale poorly with the addition of data points (see Appendix A). However, for a coarse-graining analysis, the number of data points is the number of rows of the transition matrix (i.e., the number of microstates). Since most MSMs contain only a few hundred states, we recommend MVCA since it integrates knowledge of the original state decomposition throughout the coarse-graining procedure. In the case of a prohibitive number of microstates, 1076

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

we calculate the divergences among separate MSMs by summing the divergences of corresponding pairs of rows for each pair of MSM transition matrices. In this way, an aggregation of MSMs can be separated into groups on the basis of the dynamics encoded in their transition matrices. Below we demonstrate this method by analyzing transition matrices originating from nine different force fields. 4.2. Comparing Protein and Water Models. Hierarchical clustering on the MSM transition matrices constructed from independent simulations of a protein using various force fields can provide insight into modeling choices by demonstrating which protein and water models yield similar dynamics for the model system. To investigate this, we examine nine folding simulations of the 10-residue CLN025 miniprotein.63 The first simulation, performed by Lindorff-Larsen et al.,72 is described in section 3.3 and uses the CHARMM22* protein force field74 and the mTIP3P water model.75 The remaining data sets, one of which is also analyzed in section 3.3, contain 136 to 160 short trajectories for a total of 100 to 107 μs. The simulation details are provided in the Supporting Information of ref 73. These data sets represent eight combinations of protein force field (AMBER ff99SB-ILDN65 or AMBER-FB1576) and water model (TIP3P,88 TIP3P-FB,77 TIP4P-ew,89 or TIP4P-FB77). To approximate a situation in which the modeler has performed some simulations and is investigating the system dynamics to motivate further study, we assume that we can use the CHARMM22* trajectory as a reference to investigate the dynamics of the eight AMBER simulations, since MSMs for long trajectories such as the Lindorff-Larsen data set are expected to be well-behaved and suitable for benchmarking.90 To analyze our aggregate data set, we first construct an unoptimized MSM for the CHARMM22* reference data by following recommendations for current modeling practices.30 We choose an MSM using the α-carbon contact distances, which are transformed into all 28 tICs weighted by their eigenvalues with a tICA lag time of 100 ns.19,27,28 From this tICA representation, we discretize conformation space using the mini-batch k-means clustering algorithm with 100, 200, 300, 400, and 500 cluster centers and construct an MSM with a lag time of 50 ns for each number of microstates.43 We then use these pretrained tICA and clustering models to create “projected” MSMs, which we denote in quotation marks to distinguish them from the PMM macrostating method. A description of the MSM “projection” procedure can be found in ref 73. “Projected” MSMs sometimes do not occupy all of the microstates used to create the baseline model. In our analysis, for 100 microstates, no states were unoccupied for any model. For microstate numbers 200 through 500, at most two states were unoccupied for any “projected” model, and the transition matrices were resized for the MVCA analysis. For more complicated systems in which microstate coverage is not complete, a combined MSM can be created by aggregating all of the data sets and constructing a single tICA model for the entire data set, after which separate MSMs are created.85−87 For each number of k-means microstates, we now have nine transition matrices to cluster using the D JS similarity function. By performing agglomerative clustering with Ward’s minimum variance objective function for all possible numbers of clusters, we can see a hierarchical representation of which aspects of the simulations partition the dynamics and at what clustering step certain attributes are agglomerated (Figure 5). Figure 5

landmark points can be used instead of all data points. For a detailed description of the differences between BACE and MVCA, see Appendix B. Two additional methods for macrostating have not yet been addressed: HNEG59,60 and PMMs.61 Since HNEG automatically chooses the appropriate number of microstates, it was not included in the above analysis and is instead discussed in Appendix C. PMMs as approximated by hidden Markov models (HMMs)61,80 or observable operator models (OOMs)81 represent an alternative method for the coarse-graining of MSMs. HMMs must be seeded with a reasonable starting model, for which the authors use PCCA+ on a microstate MSM in ref 61 and k-means on the raw trajectory data in ref 80. Seeding the HMM model with an initial coarse-grained model using MVCA could represent a valuable augmentation to this method. Agglomerative clustering requires two choices: the choice of a similarity function between data points (e.g., div JS ) and the choice of an objective function used to agglomerate clusters (e.g., Ward’s method). Future improvements to MVCA could be incorporated into either choice. For example, the pairwise div JS distances input into MVCA could be weighted to incorporate observed transition counts, similar to the weighting of Bayes factors in BACE (section 3.2). For the objective function, a standard error approach could be considered that is a function of both the variance and the number of data points of each cluster. Since the MVCA method introduced in this work is both effective and easy to implement out of the box, we suggest starting with the methods proposed herein and then exploring modifications to either the distance function or the objective function.

4. GROUPING AND COMPARING MANY MODELS 4.1. Quantifying Similarity among MSMs. We now extend the methods developed in this work to group the transition matrices of many MSMs in order to facilitate understanding of systems requiring many models. The introduction of the VAC16,17 and complementary updates to MSM building software29,82 have enabled the widespread utility of MSM analysis for complex, multisystem analyses such as the kinetics of protein−protein association,83,84 mutant effects on protein folding dynamics,85−87 merging of simulation and experimental data,38 and comparison of force field dynamics.73 These cutting-edge MSM applications involve aggregated analysis of many models. In section 2.2, we used div JS to quantify the similarity between two rows of the same transition matrix. We can extend this method to quantify the similarity between two transition matrices from separate MSMs with the same state definitions. For two MSMs with row-stochastic probability transition matrices A and B, which we assume were generated from the same state decomposition, we define D

JS



∑ div i

JS (A i

Bi )

(6)

where Ai is the ith row of the transition probability matrix A. In other words, we simply sum the div JS values for each corresponding row to produce the divergence D JS between the two MSMs. Thus, instead of calculating the divergences among rows within a single transition matrix as in the previous sections, here 1077

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

allows quantitative comparison of the free energy landscapes modeled by the different force fields, and using a variable number of groups allows one to make dynamical comparisons at variable levels of scale. Comparing the MSMs generated by a set of force fields is a useful approach for comparing the landscapes approximated by the different force fields. This is the case because an MSM summarizes the MD data set using a landscape of energy basins (state populations) and barrier heights (transition probabilities). When a macrostate model is built, the coarse-graining process has the effect of smoothing this landscape, averaging over short-range noise. Comparison of these coarse-grained landscapes created with MVCA enables a comparison of the force fields according to their descriptions of only the most salient dynamical processes.

Figure 5. Agglomerative clustering of nine transition matrices using Ward’s method and the D JS distance metric. The tree shows a hierarchy of which factors contribute to dissimilarity of the CLN025 dynamics for all five sets of microstate models used. The transition matrices are distinguished by the protein force field (CHARMM22* = 22*, AMBER ff99SB-ILDN = ILDN, AMBER-FB15 = FB15) when the fewest clusters are used. As more clusters are used, the transition matrices separate according to the water model in the AMBER-FB15 branch. Finally, water models in the AMBER ff99SB-ILDN branch are differentiated. In one case (100 microstates), the AMBER ff99SBILDN/TIP3P model is isolated before the last two AMBER-FB15 water models are separated. It is interesting that the dynamics under CHARMM22* and AMBER ff99SB-ILDN are more similar to each other than to the dynamics under AMBER-FB15. It is also noteworthy that the water models separate in the AMBER-FB15 branch before the AMBER ff99SB-ILDN branch. This illustrates the unique sensitivity of the AMBER-FB15 force field to the water model.

5. CONCLUSIONS The coarse-graining of MSMs for interpretability and further analysis is important for understanding what our models are telling us and for connecting theoretical models with experimental observables. Furthermore, the current state of the art in MSM methods and applications involves aggregated analysis of many MSMs.73,85−87 Here we adapt agglomerative hierarchical clustering with the square root of the Jensen− Shannon divergence and Ward’s minimum variance objective function to implement a method for comparing and clustering probability distributions within the transition matrix of a single MSM or the transition matrices themselves from a data set of many MSMs. In section 3, we introduce MVCA, which clusters probability densities within a transition matrix using div JS as the similarity function and Ward’s method as the objective function in order to produce robust coarse-grained MSMs. We show that MVCA can be used to determine the appropriate number of macrostates for a model and revealed a misfolded state of CLN025 present in one force field but not another. Furthermore, MVCA is straightforward and easy to execute with available implementations of Ward’s method for agglomerative clustering.42,82,91 Finally, in section 4, we show utility of the methods presented in this work for analyzing sets of many MSMs by comparing simulations of the same folding system in multiple force fields. Applying MVCA to groups of MSMs in order to cluster their transition matrices on the basis of dynamical similarity provides a versatile tool for multimodel analyses. For example, a set of simulations including a wild-type protein and several mutants can be clustered using MVCA when the same state decomposition is used for all of the models. The change in objective function can be monitored to determine the appropriate number of groups, and then the membership of those groups can aid further analysis: for example, determining what mutations are in the same dynamical group as the wild type and what mutations form separate groups can provide insight into interpreting and further analyzing the mutated system. We note that other methods for analyzing these types of mutated data sets have been presented in refs 85−87 and 92. The minimum variance criterion that characterizes Ward’s method enables an intuitive grouping of microstates or entire models on the basis of the variance in their dynamics, and the variance within each cluster can lend insight into that group of data points. We anticipate that the methods in this paper, namely, the union of the div JS and D JS similarity functions with agglomerative clustering using Ward’s minimum variance

presents the clustering results that are robust toward the number of microstates used to create the “projected” models. For all five sets of “projected” microstate models, the coarsest clustering separates the simulations by protein model. When three clusters are used to group the nine transition matrices, the clusters are characterized by the three protein force fields. When only two clusters are used, the tree shows that the CHARMM22* and AMBER ff99SB-ILDN dynamics are more similar to each other than to the dynamics in AMBER-FB15. At finer partitions of the nine transition matrices, the clusters are characterized by their water models. In the AMBER-FB15 branch, the four water models separate when additional clusters are used. The AMBER ff99SB-ILDN models separate according to water model only after the water models in the AMBERFB15 branch have separated, except in one case (100 microstates) where one AMBER ff99SB-ILDN model separates before the last AMBER-FB15 model. This analysis shows that the AMBER-FB15 simulations with different water models are more dissimilar to each other than the AMBER ff99SB-ILDN simulations with different water models. This provides further evidence that protein dynamics are uniquely sensitive to the water model when AMBER-FB15 is used as the protein model.76 Application of MVCA to these nine CLN025 folding data sets, which vary only with respect to force field, thus allows a comparison of the underlying free energy landscapes. Furthermore, variable levels of detail can be unpacked when this method is applied with a variable number of dynamical subgroups (Figure 5, different layers of the tree). Broadly, using a small number of clusters, we find that the data sets are distinguishable by protein model. When a larger number of clusters is used, the models become distinguishable by water model. This suggests that simulation dynamics is primarily guided by protein force field and that the influence of the water model is of a finer dynamical character. Overall we find that coarse-graining of the force field-dependent MSMs via MVCA 1078

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation

merged. The choice of whether to update all of the Bayes factors at every agglomeration is a choice of whether to retain outdated finer-grained information or to remove that information entirely. MVCA represents a richer approach wherein knowledge of the original distances between data points is maintained and incorporated throughout the clustering analysis. For comparison, in Figure 6 we present the MVCA analysis performed in section 3.3 using BACE, reproducing the plots in

objective function, will provide an easy-to-implement analysis tool for coarse-graining of MSM transition matrices. The same method can also be used to group many MSMs on the basis of their dynamics, which will become increasingly common as MSM methods continue to be developed. Free open-source software implementing the methods used in this work is available in the MDTraj,93 MSMBuilder,82 and Osprey94 packages available from mdtraj.org and msmbuilder.org. See Appendix D for a note on different implementations of agglomerative clustering.



APPENDIX A: COMPUTATION TIME FOR COARSE-GRAINING ANALYSES All coarse-graining was performed locally on a Mac mini (2.8 GHz Intel Core i5, four cores). Agglomerative clustering was performed using all data points without a landmark approximation. The coarse-grainings of 172 microstates to two macrostates in section 3.2 were recorded in seconds as follows: PCCA, 0.11; PCCA+, 0.19; BACE, 0.38; MVCA, 20.40. To produce the macrostate models in section 3.3, the coarse-graining of 704 microstates to two macrostates for the CHARMM22* model took 1589.35 s, and the coarse-graining of 157 microstates to three macrostates for the AMBER-FB15 model took 32.37 s.



APPENDIX B: COMPARISON WITH BACE The BACE coarse-graining method34 also uses agglomerative clustering to produce a macrostate model, but it differs in both its similarity function and objective function. In BACE, the most similar clusters are merged according to their Bayes factor, which is a form of the Jensen−Shannon divergence (eq 3) weighted by the observed counts. At each agglomeration, the Bayes factors are recomputed for all pairs involving the newly merged cluster. The MVCA approach is similar in that it also assesses similarity according to the (unweighted) Jensen− Shannon divergence; however, the minimum variance criterion optimized during agglomeration, which is a function of both the original div JS distances and the cardinality of the clusters (eq 5), differentiates MVCA from BACE in two key ways. First, MVCA is easier to execute in practice both because pairwise div JS distances need only be computed once and because existing implementations of Ward’s method are agnostic toward the distance metric and can be used out of the box (see Appendix D). For example, given an MSM transition matrix, MVCA can be performed with various functions from the SciPy Python package91 and does not require custom software. Second, the similarity update between clusters is performed differently after an agglomeration. In MVCA, the div JS values are computed at the outset and do not need to be updated at every step because the updated distance is calculated according to Ward’s method. This updated distance incorporates the original pairwise distances between all data points that were agglomerated. In contrast, at each agglomeration in the BACE approach, the original pairwise Bayes factors are not incorporated into further analysis, and subsequent agglomerative steps are performed without knowledge of the original separation between the merged states. In the implementation of BACE introduced in ref 34, the Bayes factors between states i and j are updated only if either i or j is agglomerated into another cluster. However, if two other states k and l have just been agglomerated, the Bayes factor between i and j will differ depending whether k and l are considered to be separate or

Figure 6. Coarse-grained models for the folding free energy landscape of CLN025 in the CHARMM22* force field (top) and AMBER-FB15 force field (bottom). Reaction coordinates 1 and 2 correspond to the first and second tICs for each model. The left panel shows the median change in Bayes factors with the number of macrostates from 200 rounds of bootstrapping with replacement. Error bars represent the 5th and 95th percentiles of the Bayes factors obtained from the bootstrapping analysis. These results can be contrasted with Figure 4 in the main text.

Figure 4. We use the default settings, i.e., the unsymmetrized counts matrix and an undersampled threshold of 1.1.34 We see that the uncertainties in the Bayes factors are much higher than the corresponding uncertainties in the variances in Figure 4 and that the macrostates are somewhat different for the AMBERFB15 three-macrostate model; namely, the folded (green) and misfolded (blue) macrostates are more diffuse. We note that the BACE macrostates become more compact and resemble the MVCA macrostates when an undersampled threshold of 0 is used (i.e., no states are considered to be undersampled).



APPENDIX C: ANALYSIS WITH THE HIERARCHICAL NYSTRÖ M EXTENSION GRAPH The hierarchical Nyström extension graph (HNEG)59,60 is a method for coarse-graining of MSMs that is designed to avoid pathologies due to poorly sampled microsates by building a population-based hierarchy of microstates from the observed transition counts matrix and utilizing spectral clustering within each layer of the hierarchy. Since HNEG automatically chooses the optimal number of macrostates, we discuss its advantages and disadvantages separately. First, we implement the HNEG algorithm for the proline data set in section 3.2. The HNEG model finds two macrostates and does not mislabel the 1079

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation microstate near ψ ≈ 140° that is incorrectly assigned by PCCA, PCCA+, and BACE in Figure 3. However, when the HNEG algorithm is implemented on the AMBER-FB15 data set for CLN025 presented in section 3.3, it recommends two macrostates and does not separate the folded state from the extended state (green and yellow, respectively, in Figure 4). The MVCA algorithm presented in this work offers a method for choosing how many macrostates best describe the data (Figure 4, left) while also enabling a user-determined choice of macrostates so that knowledge of the system can be incorporated. Additionally, we tested the HNEG algorithm for robustness by incorporating an artifact with variable duration as described in section 3.2. The HNEG macrostate model was robust toward the artifact up to 3.4 ns. At greater durations, HNEG either produced a variety of two-macrostate models or recommended one of several three-macrostate models with the artifact isolated as a singleton third macrostate. At some artifact durations greater than 3.4 ns, the original two-macrostate clustering was unaffected. Whether HNEG returned two or three macrostates for the data and which model it returned varied inconsistently with the duration of the artifact in an apparently random fashion. It therefore may not be straightforward (or possible) to determine when HNEG would or would not be robust toward similar artifacts. In contrast, MVCA is reliably robust toward this type of artifact. The HNEG algorithm can be downloaded from simtk.org/ home/hneg.

Notes

The authors declare the following competing financial interest(s): V.S.P. is a consultant and SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel Inc., Freenome Inc., Omada Health, Patient Ping, and Rigetti Computing, and is a General Partner at Andreessen Horowitz.



ACKNOWLEDGMENTS B.E.H. is extremely grateful to John Dabiri, Matt Harrigan, Xuhui Huang, Sukrit Singh, Greg Bowman, and Swetava Ganguli for insightful discussions. The authors thank Jade Shi and Nate Stanley for manuscript feedback and the anonymous reviewers for their suggestions. We are grateful to Matt Harrigan for providing the tripeptide data set. We graciously acknowledge Folding@home donors who contributed to the AMBER force field simulations and D. E. Shaw Research for providing CHARMM22* simulation data.



(1) Dror, R. O.; Dirks, R. M.; Grossman, J.; Xu, H.; Shaw, D. E. Biomolecular Simulation: A Computational Microscope for Molecular Biology. Annu. Rev. Biophys. 2012, 41, 429−452. (2) Shaw, D. E.; Deneroff, M. M.; Dror, R. O.; Kuskin, J. S.; Larson, R. H.; Salmon, J. K.; Young, C.; Batson, B.; Bowers, K. J.; Chao, J. C.; Eastwood, M. P.; Gagliardo, J.; Grossman, J. P.; Ho, C. R.; Lerardi, D. J.; Kolossváry, I.; Klepeis, J. L.; Layman, T.; McLeavey, C.; Moraes, M. A.; Mueller, R.; Priest, E. C.; Shan, Y.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S. C. Anton, a Special-purpose Machine for Molecular Dynamics Simulation. Commun. ACM 2008, 51, 91−97. (3) Shirts, M.; Pande, V. S. Screen Savers of the World Unite! Science 2000, 290, 1903−1904. (4) Buch, I.; Harvey, M. J.; Giorgino, T.; Anderson, D. P.; De Fabritiis, G. High-Throughput All-Atom Molecular Dynamics Simulations Using Distributed Computing. J. Chem. Inf. Model. 2010, 50, 397−403. (5) Schütte, C.; Sarich, M. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches; Courant Lecture Notes, Vol. 24; American Mathematical Society: Providence, RI, 2013. (6) An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation; Bowman, G. R., Pande, V. S., Noé, F., Eds.; Advances in Experimental Medicine and Biology, Vol. 797; Springer: Dordrecht, The Netherlands, 2014. (7) Chodera, J. D.; Singhal, N.; Pande, V. S.; Dill, K. A.; Swope, W. C. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys. 2007, 126, 155101. (8) Noé, F.; Horenko, I.; Schütte, C.; Smith, J. C. Hierarchical analysis of conformational dynamics in biomolecules: Transition networks of metastable states. J. Chem. Phys. 2007, 126, 155102. (9) Voelz, V. A.; Bowman, G. R.; Beauchamp, K.; Pande, V. S. Molecular Simulation of ab Initio Protein Folding for a Millisecond Folder NTL9(1−39). J. Am. Chem. Soc. 2010, 132, 1526−1528. (10) Bowman, G. R.; Voelz, V. A.; Pande, V. S. Atomistic Folding Simulations of the Five-Helix Bundle Protein λ6−85. J. Am. Chem. Soc. 2011, 133, 664−667. (11) Deuflhard, P.; Weber, M. Robust Perron cluster analysis in conformation dynamics. Lin. Alg. Appl. 2005, 398, 161−184. (12) Noé, F.; Fischer, S. Transition networks for modeling the kinetics of conformational change in macromolecules. Curr. Opin. Struct. Biol. 2008, 18, 154−162. (13) Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys. 2009, 131, 124101.



APPENDIX D: A NOTE ON THE AGGLOMERATIVE CLUSTERING CODE Clustering involves fitting a set of data points to clusters in an unsupervised fashion. Given a fit cluster model, the cluster assignments of new data points can then be predicted. For agglomerative hierarchical clustering, the “fit” cluster assignment of a data point may not be the same as the “predicted” cluster assignment for the exact same data point. This phenomenon occurs near cluster edges and is discussed for Ward’s method in an interactive example that can be found at github.com/msmbuilder/msmbuilder/blob/master/examples/ Ward-Clustering.ipynb. MSMBuilder algorithms are designed to be amenable to cross-validation;82 therefore, the MVCA output in the MSMBuilder implementation is the prediction output.21 This prediction algorithm was derived in ref 21 and is not implemented in SciPy, which returns the fit output.91 All of the results presented in this paper use the MSMBuilder convention. Though they are in general not identical, both the fit and predicted outputs are nonstochastic.



REFERENCES

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Brooke E. Husic: 0000-0002-8020-3750 Mohammad M. Sultan: 0000-0001-5578-6328 Funding

We acknowledge funding from the National Institutes of Health (NIH R01-GM62868). H.K.W.-S. acknowledges support from the NSF Graduate Research Fellowhip Program. 1080

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation (14) Pande, V. S.; Beauchamp, K.; Bowman, G. R. Everything you wanted to know about Markov State Models but were afraid to ask. Methods 2010, 52, 99−105. (15) Keller, B.; Daura, X.; van Gunsteren, W. F. Comparing geometric and kinetic cluster algorithms for molecular simulation data. J. Chem. Phys. 2010, 132, 074110. (16) Noé, F.; Nüske, F. A Variational Approach to Modeling Slow Processes in Stochastic Dynamical Systems. Multiscale Model. Simul. 2013, 11, 635−655. (17) Nüske, F.; Keller, B. G.; Pérez-Hernández, G.; Mey, A. S. J. S.; Noé, F. Variational Approach to Molecular Kinetics. J. Chem. Theory Comput. 2014, 10, 1739−1752. (18) McGibbon, R. T.; Pande, V. S. Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys. 2015, 142, 124105. (19) Noé, F.; Clementi, C. Kinetic Distance and Kinetic Maps from Molecular Dynamics Simulation. J. Chem. Theory Comput. 2015, 11, 5002−5011. (20) Wu, H.; Noé, F. Variational approach for learning Markov processes from time series data. 2017, arXiv:1707.04659. arXiv.org ePrint archive. https://arxiv.org/abs/1707.04659 (21) Husic, B. E.; Pande, V. S. Ward Clustering Improves CrossValidated Markov State Models of Protein Folding. J. Chem. Theory Comput. 2017, 13, 963−967. (22) Schwantes, C. R.; McGibbon, R. T.; Pande, V. S. Perspective: Markov models for long-timescale biomolecular dynamics. J. Chem. Phys. 2014, 141, 090901. (23) Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145−151. (24) Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236−244. (25) Negative eigenvalues correspond to dynamical processes that oscillate instead of decay to equilibrium. When computing the timescales in practice, we can take the absolute value of the argument of the logarithm to convert these numerical artifacts to real-valued, albeit meaningless, timescales. (26) Naritomi, Y.; Fuchigami, S. Slow dynamics in protein fluctuations revealed by timestructure based independent component analysis: The case of domain motions. J. Chem. Phys. 2011, 134, 065101. (27) Schwantes, C. R.; Pande, V. S. Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9. J. Chem. Theory Comput. 2013, 9, 2000−2009. (28) Pérez-Hernández, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 2013, 139, 015102. (29) Scherer, M. K.; Trendelkamp-Schroer, B.; Paul, F.; PérezHernández, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.H.; Noé, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. 2015, 11, 5525−5542. (30) Husic, B. E.; McGibbon, R. T.; Sultan, M. M.; Pande, V. S. Optimized parameter selection reveals trends in Markov state models for protein folding. J. Chem. Phys. 2016, 145, 194103. (31) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Schütte, C.; Noé, F. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys. 2011, 134, 174105. (32) Bowman, G. R.; Ensign, D. L.; Pande, V. S. Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models. J. Chem. Theory Comput. 2010, 6, 787−794. (33) Endres, D. M.; Schindelin, J. E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858−1860. (34) Bowman, G. R. Improved coarse-graining of Markov state models via explicit consideration of statistical uncertainty. J. Chem. Phys. 2012, 137, 134111. (35) McClendon, C. L.; Hua, L.; Barreiro, G.; Jacobson, M. P. Comparing Conformational Ensembles Using the Kullback-Leibler Divergence Expansion. J. Chem. Theory Comput. 2012, 8, 2115−2126.

(36) Dixit, P. Introducing prescribed biases in out of equilibrium Markov models. bioRχiv preprint server, 2017. https://www.biorxiv. org/content/early/2017/10/09/198697. (37) Dixit, P.; Dill, K. A. Caliber Corrected Markov Modeling (C2M2): Correcting Equilibrium Markov models. 2017, arXiv:1711:03043. arXiv.org e-Print archive. https://arxiv.org/abs/ 1711.03043. (38) Olsson, S.; Wu, H.; Paul, F.; Clementi, C.; Noé, F. Combining experimental and simulation data of molecular processes via augmented Markov models. Proc. Natl. Acad. Sci. U. S. A. 2017, 114, 8265−8270. (39) Voelz, V. A.; Elman, B.; Razavi, A. M.; Zhou, G. Surprisal Metrics for Quantifying Perturbed Conformational Dynamics in Markov State Models. J. Chem. Theory Comput. 2014, 10, 5716−5728. (40) Wu, H.; Paul, F.; Wehmeyer, C.; Noé, F. Multiensemble Markov models of molecular thermodynamics and kinetics. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, E3221−E3230. (41) Müllner, D. Modern hierarchical, agglomerative clustering algorithms. 2011, arXiv:1109.2378. arXiv e-Print archive. https://arxiv. org/abs/1109.2378. (42) Müllner, D. fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python. J. Stat. Software 2013, 53, 9. (43) Beauchamp, K. A.; McGibbon, R.; Lin, Y.-S.; Pande, V. S. Simple few-state models reveal hidden complexity in protein folding. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 17807−17813. (44) Dickson, A.; Brooks, C. L. Native States of Fast-Folding Proteins Are Kinetic Traps. J. Am. Chem. Soc. 2013, 135, 4729−4734. (45) Chang, H.-W.; Bacallado, S.; Pande, V. S.; Carlsson, G. E. Persistent Topology and Metastable State in Conformational Dynamics. PLoS One 2013, 8, e58699. (46) Weber, J. K.; Jack, R. L.; Pande, V. S. Emergence of Glass-like Behavior in Markov State Models of Protein Folding Dynamics. J. Am. Chem. Soc. 2013, 135, 5501−5504. (47) Lapidus, L. J.; Acharya, S.; Schwantes, C. R.; Wu, L.; Shukla, D.; King, M.; DeCamp, S. J.; Pande, V. S. Complex Pathways in Folding of Protein G Explored by Simulation and Experiment. Biophys. J. 2014, 107, 947−955. (48) We differentiate the objective function (“Ward’s minimum variance criterion”) from the full method (“Ward’s method”, i.e., agglomerative hierarchical clustering using Euclidean distances and the objective function in eq 5) to reflect the understanding that eq 5 no longer corresponds to Euclidean variance, which requires the l2-norm, when a non-Euclidean pairwise similarity function is used. (49) See Figure 1 of ref 21 for an example. (50) De Paris, R.; Quevedo, C. V.; Ruiz, D. D. A.; Norberto de Souza, O. An Effective Approach for Clustering InhA Molecular Dynamics Trajectory Using Substrate-Binding Cavity Features. PLoS One 2015, 10, e0133172. (51) Michel, V.; Gramfort, A.; Varoquaux, G.; Eger, E.; Keribin, C.; Thirion, B. A supervised clustering approach for fMRI-based inference of brain states. Pattern Recogn. 2012, 45, 2041−2049. (52) Varoquaux, G.; Gramfort, A.; Thirion, B. In Proceedings of the 29th International Confer- ence on Machine Learning, Edinburgh, Scotland, GB, June 25-July 1, 2012; Langford, J., Pineau, J., Eds.; Omnipress: New York, 2012; pp 1375−1382. (53) Thirion, B.; Varoquaux, G.; Dohmatob, E.; Poline, J.-B. Which fMRI clustering gives good brain parcellations? Front. Neurosci. 2014, 8, 167. (54) Deuflhard, P.; Huisinga, W.; Fischer, A.; Schü tte, C. Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra Appl. 2000, 315, 39−59. (55) Buchete, N.-V.; Hummer, G. Coarse Master Equations for Peptide Folding Dynamics. J. Phys. Chem. B 2008, 112, 6057−6069. (56) Weber, M. Improved Perron Cluster Analysis; Technical Report ZIB 03-04; Konrad-Zuse-Zentrum für Informationstechnik Berlin: Berlin, 2003. (57) Kube, S.; Weber, M. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys. 2007, 126, 024103. 1081

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082

Article

Journal of Chemical Theory and Computation (58) Röblitz, S. Statistical Error Estimation and Grid-free Hierarchical Refinement in Conformation Dynamics. Ph.D. Thesis, Freie Universität, Berlin, 2008. (59) Huang, X.; Yao, Y.; Bowman, G. R.; Sun, J.; Guibas, L. J.; Carlsson, G. E.; Pande, V. S. Constructing multi-resolution Markov state models (MSMs) to elucidate RNA hairpin folding mechanisms. Pac. Symp. Biocomput. 2010, 228−239. (60) Yao, Y.; Cui, R. Z.; Bowman, G. R.; Silva, D.-A.; Sun, J.; Huang, X. Hierarchical Nyström methods for constructing Markov state models for conformational dynamics. J. Chem. Phys. 2013, 138, 174106. (61) Noé, F.; Wu, H.; Prinz, J.-H.; Plattner, N. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys. 2013, 139, 184114. (62) Honda, S.; Yamasaki, K.; Sawada, Y.; Morii, H. 10 Residue Folded Peptide Designed by Segment Statistics. Structure 2004, 12, 1507−1518. (63) Honda, S.; Akiba, T.; Kato, Y. S.; Sawada, Y.; Sekijima, M.; Ishimura, M.; Ooishi, A.; Watanabe, H.; Odahara, T.; Harata, K. Crystal Structure of a Ten-Amino Acid Protein. J. Am. Chem. Soc. 2008, 130, 15327−15331. (64) Vitalini, F.; Noé, F.; Keller, B. Molecular dynamics simulations data of the twenty encoded amino acids in different force fields. Data Brief 2016, 7, 582−590. (65) Lindorff-Larsen, K.; Piana, S.; Palmo, K.; Maragakis, P.; Klepeis, J. L.; Dror, R. O.; Shaw, D. E. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins: Struct., Funct., Genet. 2010, 78, 1950−1958. (66) Swope, W. C.; Pitera, J. W.; Suits, F. Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory. J. Phys. Chem. B 2004, 108, 6571−6581. (67) For this implementation of BACE we have used the observed transition counts, but the misassignment also occurs when the transition counts are symmetrized by averaging the raw matrix with its transpose. When no states are considered to be undersampled, different mislabeled states appear. (68) Wedemeyer, W. J.; Welker, E.; Scheraga, H. A. Proline CisTrans Isomerization and Protein Folding. Biochemistry 2002, 41, 14637− 14644. (69) Banushkina, P. V.; Krivov, S. V. Nonparametric variational optimization of reaction coordinates. J. Chem. Phys. 2015, 143, 184108. (70) McGibbon, R. T.; Husic, B. E.; Pande, V. S. Identification of simple reaction coordinates from complex dynamics. J. Chem. Phys. 2017, 146, 044109. (71) At this stage, the trajectory is represented by a list of integers 0 through 172. To introduce the new microstate, we choose a random point in the trajectory and insert state 173 for a set number of time steps. (72) Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How Fast-Folding Proteins Fold. Science 2011, 334, 517−520. (73) McKiernan, K. A.; Husic, B. E.; Pande, V. S. Modeling the mechanism of CLN025 betahairpin formation. J. Chem. Phys. 2017, 147, 104107. (74) Piana, S.; Lindorff-Larsen, K.; Shaw, D. E. How Robust Are Protein Folding Simulations with Respect to Force Field Parameterization? Biophys. J. 2011, 100, L47−L49. (75) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiörkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 1998, 102, 3586−3616. (76) Wang, L.-P.; McKiernan, K. A.; Gomes, J.; Beauchamp, K. A.; Head-Gordon, T.; Rice, J. E.; Swope, W. C.; Martínez, T. J.; Pande, V. S. Building a More Predictive Protein Force Field: A Systematic and Reproducible Route to AMBER-FB15. J. Phys. Chem. B 2017, 121, 4023−4039.

(77) Wang, L.-P.; Martinez, T. J.; Pande, V. S. Building Force Fields: An Automatic, Systematic, and Reproducible Approach. J. Phys. Chem. Lett. 2014, 5, 1885−1891. (78) It should be noted that the variances obtained from the entire data set, analyzed without boostrapping, are expected to be higher than those of the bootstrapped samples since multiple instances of the same part of the data will generally decrease the variance (79) García, A. E.; Sanbonmatsu, K. Y. Exploring the energy landscape of a hairpin in explicit solvent. Proteins: Struct., Funct., Genet. 2001, 42, 345−354. (80) McGibbon, R. T.; Ramsundar, B.; Sultan, M. M.; Kiss, G.; Pande, V. S. Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models. In Proceedings of the 31st International Conference on Machine Learning; Association for Computing Machinery: New York, 2014; pp 1197−1205. (81) Wu, H.; Prinz, J.-H.; Noé, F. Projected metastable Markov processes and their estimation with observable operator models. J. Chem. Phys. 2015, 143, 144101. (82) Harrigan, M. P.; Sultan, M. M.; Hernändez, C. X.; Husic, B. E.; Eastman, P.; Schwantes, C. R.; Beauchamp, K. A.; McGibbon, R. T.; Pande, V. S. MSMBuilder: Statistical Models for Biomolecular Dynamics. Biophys. J. 2017, 112, 10−15. (83) Plattner, N.; Doerr, S.; De Fabritiis, G.; Noé, F. Complete protein-protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling. Nat. Chem. 2017, 9, 1005−1011. (84) Zhou, G.; Pantelopulos, G. A.; Mukherjee, S.; Voelz, V. A. Bridging Microscopic and Macroscopic Mechanisms of p53-MDM2 Binding with Kinetic Network Models. Biophys. J. 2017, 113, 785−793. (85) Razavi, A. M.; Voelz, V. A. Kinetic Network Models of Tryptophan Mutations in -Hairpins Reveal the Importance of NonNative Interactions. J. Chem. Theory Comput. 2015, 11, 2801−2812. (86) Zhou, G.; Voelz, V. A. Using Kinetic Network Models To Probe Non-Native Salt-Bridge Effects on -Helix Folding. J. Phys. Chem. B 2016, 120, 926−935. (87) Wan, H.; Zhou, G.; Voelz, V. A. A Maximum-Caliber Approach to Predicting Perturbed Folding Kinetics Due to Mutations. J. Chem. Theory Comput. 2016, 12, 5768−5776. (88) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983, 79, 926−935. (89) Horn, H. W.; Swope, W. C.; Pitera, J. W.; Madura, J. D.; Dick, T. J.; Hura, G. L.; Head-Gordon, T. Development of an improved four-site water model for biomolecular simulations: TIP4P-Ew. J. Chem. Phys. 2004, 120, 9665−9678. (90) Nüske, F.; Wu, H.; Prinz, J.-H.; Wehmeyer, C.; Clementi, C.; Noé, F. Markov state models from short non-equilibrium simulationsAnalysis and correction of estimation bias. J. Chem. Phys. 2017, 146, 094104. (91) SciPy: Open source scientific tools for Python (2001). http:// www.scipy.org/. (92) Hart, K. M.; Ho, C. M. W.; Dutta, S.; Gross, M. L.; Bowman, G. R. Modelling proteins’ hidden conformations to predict antibiotic resistance. Nat. Commun. 2016, 7, 12965. (93) McGibbon, R. T.; Beauchamp, K. A.; Harrigan, M. P.; Klein, C.; Swails, J. M.; Hernändez, C. X.; Schwantes, C. R.; Wang, L.-P.; Lane, T. J.; Pande, V. S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015, 109, 1528− 1532. (94) McGibbon, R. T.; Hernández, C. X.; Harrigan, M. P.; Kearnes, S.; Sultan, M. M.; Jastrzebski, S.; Husic, B. E.; Pande, V. S. Osprey: Hyperparameter Optimization for Machine Learning. J. Open Source Software 2016, 1, 34.

1082

DOI: 10.1021/acs.jctc.7b01004 J. Chem. Theory Comput. 2018, 14, 1071−1082