
Article

A minimum variance clustering approach produces robust and interpretable coarse-grained models
Brooke E. Husic, Keri A. McKiernan, Hannah K. Wayment-Steele, Mohammad Muneeb Sultan, and Vijay S. Pande
J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.7b01004 • Publication Date (Web): 18 Dec 2017
Downloaded from http://pubs.acs.org on December 19, 2017


Journal of Chemical Theory and Computation is published by the American Chemical Society.


A minimum variance clustering approach produces robust and interpretable coarse-grained models

Brooke E. Husic, Keri A. McKiernan, Hannah K. Wayment-Steele, Mohammad M. Sultan, and Vijay S. Pande

Department of Chemistry, Stanford University, Stanford CA 94305, USA
(Dated: 27 November 2017)

Markov state models (MSMs) are a powerful framework for the analysis of molecular dynamics datasets, such as protein folding simulations, due to their straightforward construction and statistical rigor. Coarse-graining MSMs into an interpretable number of macrostates is a crucial step for connecting theoretical results with experimental observables. Here, we present the minimum variance clustering approach (MVCA) for coarse-graining a MSM into a macrostate model. The method utilizes agglomerative clustering with Ward’s minimum variance objective function, and the similarity of the microstate dynamics is determined using the Jensen-Shannon divergence between the corresponding rows in the MSM transition probability matrix. We first show that MVCA produces intuitive results for a simple tripeptide system, and is robust to long-duration statistical artifacts. MVCA is then applied to two protein folding simulations of the same protein in different force fields to demonstrate that a different number of macrostates is appropriate for each model, revealing a misfolded state present in only one of the simulations. Finally, we show that the same method can be used to analyze a dataset containing many MSMs from simulations in different force fields by aggregating them into groups and quantifying their dynamical similarity in the context of force field parameter choices. The minimum variance clustering approach with the Jensen-Shannon divergence provides a powerful tool to group dynamics by similarity, both among model states and among dynamical models themselves.



I. INTRODUCTION

Molecular dynamics (MD) simulations provide the temporal and spatial resolution to

investigate biomolecular dynamics in silico.1 Specialized hardware2 and distributed computing networks3,4 have enabled simulations at biologically relevant timescales up to tens of milliseconds. These datasets are so massive that they require complementary analysis procedures in order to process and understand the simulation results.

Markov state models (MSMs)5,6 represent one framework that has successfully harnessed huge MD datasets to describe the dynamics of a simulated system. MSMs are a master equation approach to conformational dynamics that partition the phase space explored by a system into disjoint states. The states are assumed to be Markovian, or “memoryless,” where the probability of transitioning between two states is independent of the previously occupied states. The transition matrix of a MSM, often defined for a discrete, Markovian time step, contains thermodynamic, kinetic, and pathway information, and can propagate the system forward in time.

Early MSM methods development for applications to MD focused on choosing disjoint states such that the system dynamics are approximately Markovian.7,8 For the first quantitatively predictive studies of protein folding, tens or hundreds of thousands of states were used to describe the conformational dynamics.9,10 These states were determined by system experts and often relied on heuristic choices. In these studies, macrostate analyses were performed on the transition matrices to reduce the number of states to a few thousand.9–11 Many methods developers advocated a two step process: first, use heuristics or system expertise to develop a quantitative microstate model with many thousands of states, and then create a fewer-state coarse-grained model for interpretability.12–15 The variational principle for MSMs16,17 transitioned MSM methods development away from heuristic state choices towards systematic and objectively comparable state decompositions.
This variational approach to conformational dynamics (VAC)16,17 thus introduced a systematic method for choosing among state decomposition choices. The VAC, combined with improvements in model selection18–20 and the clustering step of the state decomposition,21 has led to more statistically robust models with variationally optimized numbers of microstates, usually on the order of hundreds.22 Coarse-grained models for these MSMs often contain just a few states, and represent a bridge between many-state theoretical models and generally coarser experimental observables. Thus, the transformation of an optimized MSM containing a few hundred microstates into an interpretable, coarse-grained model with two or three macrostates is crucial for communicating predictions or explanations derived from a MSM.14

In this work, we adapt agglomerative hierarchical clustering to produce interpretable coarse-grained MSMs. By using a version of the Jensen-Shannon divergence23 to determine similarity, we can group states with similar dynamics using a minimum variance criterion.24 We begin by introducing the MSM transition matrix and discussing how to compare and cluster its constituent probability densities according to minimum variance. We then introduce a new minimum variance clustering approach for coarse-graining MSMs into interpretable macrostate models. After motivating the technique, we compare it to other coarse-graining algorithms for a proline-centered tripeptide system. We then demonstrate its performance on two protein folding simulations of the same system performed in different force fields, revealing a misfolded state present in only one of the simulations. Finally, we demonstrate how the method can be extended to cluster groups of transition matrices from separate MSMs by analyzing a protein folding simulation performed in nine different force fields. The combination of the Jensen-Shannon representation for assessing differences in dynamics with a minimum variance criterion is shown to be effective in generating robust, interpretable models.

II. THEORY BACKGROUND

A. The Markovian transition matrix

A discrete time, discrete space MSM can be created from trajectory data X under the

assumption of the Markov property, which states that the conditional probability of transitioning from state X_t = i to state X_{t+τ} = j after a lag time of τ is independent of the state of the system at times before t. The MSM's transition matrix is used to solve the generalized eigenvalue problem,17

$$T a = \lambda a, \qquad(1)$$

where the transition matrix T is row-stochastic, a are the eigenvectors, and the eigenvalues λ_i are real and indexed in decreasing order. The first eigenvector corresponds to the identity function and the eigenvalue λ_1 = 1 is unique. All subsequent eigenvalues lie on the interval |λ_{i>1}| < 1 and correspond to dynamical processes in the time series. The timescales characterizing the decay of these processes to the stationary distribution can be obtained from the eigenvalues and the lag time at which T was defined by t_i = −τ / ln λ_i.25

The MSM is obtained from simulation data by transforming the raw Cartesian coordinates into (integer) microstate labels, such as by clustering on the pairwise root mean square deviations between conformations, on an internal coordinate system such as residue contact distances or backbone dihedral angles, or on a transformed dataset with reduced dimensionality, such as from time structure-based independent component analysis (tICA).26–28 These choices can be evaluated and optimized using the VAC.29,30 Then, given a discrete, disjoint state decomposition, the elements of the transition matrix can be obtained from observed transition counts. Finally, the maximum likelihood estimator (MLE) of the transition matrix that fulfills detailed balance is used.31
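As a concrete sketch of the estimation step, the snippet below, assuming only NumPy, counts transitions from a discrete trajectory, symmetrizes the count matrix (a simple estimator that satisfies detailed balance, not the full MLE of Ref. 31), and converts the eigenvalues into implied timescales via t_i = −τ/ln λ_i. The function names and the toy trajectory are illustrative.

```python
import numpy as np

def transition_matrix(dtraj, n_states, lag=1):
    """Count transitions at the given lag, symmetrize the counts so the
    resulting matrix satisfies detailed balance, and row-normalize.
    Assumes every state is visited at least once."""
    C = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1.0
    C = 0.5 * (C + C.T)                      # simple reversible estimator
    return C / C.sum(axis=1, keepdims=True)  # row-stochastic

def implied_timescales(T, tau):
    """t_i = -tau / ln(lambda_i) for the eigenvalues below the stationary one."""
    evals = np.sort(np.linalg.eigvals(T).real)[::-1][1:]
    return -tau / np.log(evals[evals > 0])   # negative eigenvalues carry no timescale

dtraj = [0, 0, 1, 1, 0, 1, 2, 2, 3, 3, 2, 3]
T = transition_matrix(dtraj, n_states=4)
ts = implied_timescales(T, tau=1.0)
```

Because the symmetrized counts S = (C + Cᵀ)/2 give π_i T_ij = S_ij / Σ S, the estimate is reversible by construction.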

B. Comparing transition probability matrices

For a MSM represented by a row-stochastic transition probability matrix with N microstates, one way to write the similarity between row P and row Q is,

$$\mathrm{div}_{\mathrm{KL}}(P\|Q) \equiv \sum_{i}^{N} P(i)\,\log\frac{P(i)}{Q(i)}, \qquad(2)$$

which is the Kullback-Leibler divergence from the “reference” distribution P to the “test” distribution Q,32 and is generally not equal to div_KL(Q‖P). To compare two probability distributions in a symmetric way that does not require one distribution to be the test and the other to be the reference, the Jensen-Shannon divergence can be used,23

$$\mathrm{div}_{\mathrm{JS}}(P\|Q) \equiv \tfrac{1}{2}\,\mathrm{div}_{\mathrm{KL}}(P\|M) + \tfrac{1}{2}\,\mathrm{div}_{\mathrm{KL}}(Q\|M), \qquad(3)$$

where M = (1/2)P + (1/2)Q. In other words, we separately evaluate how well the mean of P and Q represents each distribution and sum the normalized divergences. Endres and Schindelin33 showed that the square root of the Jensen-Shannon divergence obeys the triangle inequality, so we write the following symmetric distance metric,

$$\mathrm{div}_{\sqrt{\mathrm{JS}}}(P\|Q) \equiv \sqrt{\mathrm{div}_{\mathrm{JS}}(P\|Q)}. \qquad(4)$$

Similar formulations using either the Kullback-Leibler or Jensen-Shannon divergences have been used to compare MSM transition matrices for adaptive sampling,32 coarse-graining,34 perturbations to dynamics,35–37 coupling to experiment,38 and surprisal39 and error40 analyses.
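For concreteness, the quantities in Eqs. (2)–(4) can be checked numerically with SciPy, whose `jensenshannon` distance function returns exactly div√JS (with the natural logarithm by default); the two example distributions below are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr   # elementwise p * log(p / q), the summand of Eq. (2)

P = np.array([0.9733, 0.0267, 0.0, 0.0])
Q = np.array([0.0250, 0.9714, 0.0036, 0.0])

# Eq. (3): form the mean distribution M and average the two KL divergences to it
M = 0.5 * (P + Q)
div_js = 0.5 * rel_entr(P, M).sum() + 0.5 * rel_entr(Q, M).sum()

# Eq. (4): the square root obeys the triangle inequality (Endres and Schindelin)
div_sqrt_js = np.sqrt(div_js)

assert np.isclose(div_sqrt_js, jensenshannon(P, Q))
```

Note that `rel_entr(0, 0) = 0` by convention, so rows with matching zero entries pose no problem.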

C. Agglomerative clustering of transition matrices

Hierarchical agglomerative clustering algorithms partition data points into clusters according to a similarity function and an objective function. Initially, each data point is the sole element of a singleton cluster. At every step, the closest two clusters according to the similarity function are agglomerated into a new cluster according to the objective function. The clustering stops when the desired number of clusters or a threshold in the objective function is reached. Common objective functions for hierarchical agglomerative clustering include single, complete, and average linkages, which define the distance between clusters i and j as the smallest, greatest, or average pairwise distance between any point in i and any point in j, respectively. A more intricate objective function known as Ward’s method defines the distance between two clusters as the increase in intra-cluster variance resulting from the agglomeration of those clusters.24 At each agglomeration step, the distances between the clusters are updated according to the formula,41,42

$$d(u,v) = \sqrt{\frac{|v|+|s|}{T}\,d(v,s)^2 + \frac{|v|+|t|}{T}\,d(v,t)^2 - \frac{|v|}{T}\,d(s,t)^2}, \qquad(5)$$

where the clusters s and t have been agglomerated to make new cluster u, and d(u, v) represents the updated distance between the newly formed cluster u and another unchanged cluster v. |s|, |t|, and |v| represent the number of data points in clusters s, t, and v, and T ≡ |s| + |t| + |v|. When performing hierarchical agglomerative clustering using the objective function in (5), we refer to this as using Ward’s minimum variance criterion.

Ward’s method for agglomerative clustering has been used43–47 and shown to produce variationally optimal models21 for MSM state decomposition. In addition to the objective function for cluster agglomeration, the clustering algorithm also requires a pairwise similarity function such as Euclidean distance or RMSD.48 In this work, we use the div√JS defined in Sec. II B as the similarity function for clustering the rows of transition probability matrices to coarse-grain MSMs in an interpretable and robust way.
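The update in Eq. (5) is what standard libraries apply when Ward linkage is run on a precomputed condensed distance matrix. The check below, assuming SciPy's `linkage` accepts such input, compares the second merge height against Eq. (5) for three singleton clusters (|s| = |t| = |v| = 1, T = 3); the numeric distances are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed pairwise distances between singleton clusters s, t, v,
# in the order [d(s, t), d(s, v), d(t, v)]
d_st, d_sv, d_tv = 1.0, 4.0, 5.0
Z = linkage([d_st, d_sv, d_tv], method="ward")

# Eq. (5) after s and t (the closest pair) merge into u, with all sizes 1:
d_uv = np.sqrt((2 * d_sv**2 + 2 * d_tv**2 - d_st**2) / 3.0)

assert np.isclose(Z[0, 2], d_st)  # first merge joins the closest pair
assert np.isclose(Z[1, 2], d_uv)  # second merge height follows Ward's update
```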

III. CHOOSING MINIMUM VARIANCE MACROSTATES

A. Motivation and background

In this section, we use div√JS and Ward’s minimum variance criterion to cluster rows within a MSM transition matrix to produce coarse-grained MSMs. As an example, consider an equally spaced, four-state partition of a one-dimensional potential with two evenly spaced wells (Fig. 1). Numbering the four states from left to right might produce the following row-stochastic transition matrix,

$$\begin{pmatrix} 0.9733 & 0.0267 & 0 & 0 \\ 0.0250 & 0.9714 & 0.0036 & 0 \\ 0 & 0.0030 & 0.9739 & 0.0231 \\ 0 & 0 & 0.0255 & 0.9745 \end{pmatrix},$$

where each row represents the probability of transitioning from a state to itself (diagonal) or any other state (off-diagonal). The div√JS between the first and second rows (Fig. 1, left well) is 0.757 and the div√JS between the third and fourth rows (Fig. 1, right well) is 0.761. The div√JS between any pair of rows with one state in each well is greater than 0.819. It is clear from this simple example that a reasonable two-state approximation to the four-state model represented by the transition matrix would involve merging the two states in the left basin and merging the two states in the right basin. To extend this analysis to larger systems with more states, we use Ward’s minimum variance objective function.

The other agglomerative clustering objective functions enumerated in Sec. II C (single, average, or complete linkages) would not be appropriate for this type of analysis. Since single and complete linkages define the distance between any two clusters by only one pair of points, these options will lead to pathological examples since most data is not used when evaluating the objective function. Although average linkage does incorporate


FIG. 1. Double well potential partitioned into four equal microstates along the dotted lines. A two-macrostate model for this potential should map the left two microstates to the left basin and the right two microstates to the right basin.

all data points, it has also been shown to produce results that are similarly pathological to single linkage on simple datasets.49 In contrast, Ward’s method has successfully been used to analyze MD simulations21,43–47,50 and datasets in other fields such as neuroimaging.51–53

Determining MSM macrostates via analysis of the transition matrix is not a new approach. An early method developed to coarse-grain MSMs is Perron cluster cluster analysis (PCCA).8,54,55 PCCA leverages the separation of timescales calculated from the eigendecomposition of the transition matrix. The PCCA algorithm assumes the transition matrix has a block structure, which can be seen in the toy transition matrix above. Using the eigenvectors from the decomposition, states are separated according to positive and negative elements of the eigenvectors. For example, the dominant (right) eigenvector of the transition matrix above would be,

$$\begin{pmatrix} -1.1586 & -1.0192 & 0.8578 & 0.9812 \end{pmatrix},$$

so the first two microstates would be assigned to one macrostate, since the corresponding elements are both negative, and the third and fourth microstates would be assigned to another macrostate, since those elements are both positive. Coarse-graining into more than two macrostates employs the signs of elements in subsequent eigenvectors to further split the existing macrostates.

Since PCCA does not incorporate the magnitude of the eigenvector elements, it is easy to see that elements very close to zero will affect the results of PCCA in a possibly arbitrary way. Robust PCCA, often called PCCA+,11,56,57 was thus developed as an augmentation to PCCA to improve treatment of eigenvector elements close to zero instead of classifying them according to their sign as in PCCA. The PCCA+ algorithm identifies representative states according to their distinctness, and assigns remaining states degrees of membership (as opposed to “crisp” membership) to the representative set.58

The Bayesian agglomerative clustering engine (BACE)34 is a different approach in which the matrix of observed transition counts is analyzed. First, a Bayes factor between each pair of states is computed, which is a version of the Jensen-Shannon divergence where each term is weighted by the number of observed counts for that state.34 Then, agglomerative clustering is performed by merging the two states with the smallest Bayes factor. At each agglomeration, new Bayes factors are recomputed and the clustering continues until the desired number of macrostates is reached. This number can be chosen a priori, as in PCCA or PCCA+, or by evaluating the change in the Bayes factor of the merged states at each step of the algorithm.

The minimum variance clustering approach (MVCA) introduced in this work utilizes the transition probability matrix by coarse-graining the probability distributions represented by its rows. MVCA relies on the intuitive minimum variance objective function to group the rows of the transition matrix according to their similarity, which is measured by div√JS. The change in the objective function for each agglomeration can also be used to determine how many macrostates are appropriate for a given system. In the following sections, we apply MVCA to proline dynamics and compare its performance with PCCA, PCCA+, and BACE. Finally, we demonstrate the ability of MVCA to determine the appropriate number of macrostates for two different models for the folding of the CLN025 miniprotein.
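Both the MVCA pipeline and the PCCA first splitting can be sketched on the toy double-well transition matrix from Sec. III A using SciPy; the `pdist`/`linkage`/`fcluster` calls below are one possible realization, not a reference implementation of either method.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

T = np.array([[0.9733, 0.0267, 0.0,    0.0   ],
              [0.0250, 0.9714, 0.0036, 0.0   ],
              [0.0,    0.0030, 0.9739, 0.0231],
              [0.0,    0.0,    0.0255, 0.9745]])

# MVCA sketch: pairwise div_sqrtJS between rows, then Ward's minimum variance criterion
dists = pdist(T, metric="jensenshannon")
macro = fcluster(linkage(dists, method="ward"), t=2, criterion="maxclust")

# PCCA first splitting: sign structure of the first non-Perron (second) right eigenvector
evals, evecs = np.linalg.eig(T)
second = evecs[:, np.argsort(evals.real)[::-1][1]].real
pcca = (second > 0).astype(int)
```

On this well-separated matrix both groupings merge the two left-well rows and the two right-well rows, as argued in the text.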
Other MSM coarse-graining algorithms include the hierarchical Nyström extension graph (HNEG)59,60 and projected Markov models (PMMs).61 HNEG is discussed separately in Appendix C because, unlike the coarse-graining algorithms above, it does not permit a user-selected number of macrostates. PMMs are briefly discussed in Sec. III D, and are considered separately because they require a prior macrostate model.

B. Identifying macrostates of proline dynamics

We first investigate the dynamics of the tripeptide Asp-Pro-Glu, whose residues are the third, fourth, and fifth residues of CLN025 and its predecessor chignolin, both of which are 10-residue β-hairpin peptides designed to fold quickly.62,63 This segment is interesting because the proline is the last residue of the β-sheet and the glutamic acid is the first residue of the turn. Furthermore, proline occupies only two metastable basins with known secondary structure character.64 The tripeptide was simulated for 1 µs in the AMBER ff99SB-ILDN force field65 at 300 K with frames saved every 100 ps.

To isolate the proline dynamics, we first partition the φ and ψ space of the central proline residue into 36 regularly spaced intervals along each dimension for a total of 1296 square clusters. We then build a MSM from this state decomposition at a lag time of 100 ps, which was verified to be Markovian using an analysis of implied timescales.66 Figure 2 shows the free energy of each of the populated microstates. Using the regular spatial clustering described above, only 172 of 1296 possible φ × ψ microstates are occupied for this system. The transition matrix is relatively sparse; for each row, the number of nonzero transitions ranges from 1 to 70. The mean number of nonzero transitions is 22, and the median is 15. Since the edges of the free energy surface are periodic, it is qualitatively clear from Fig. 2 that there are two basins, which is expected for proline.64 The basin at greater values of ψ is the β-sheet region, and the basin at lower values of ψ is the α-helical region. An effective coarse-graining algorithm should identify these basins when two macrostates are used.

FIG. 2. Free energy surface of the central proline residue in the Asp-Pro-Glu tripeptide based on a MSM constructed from regular spatial clustering of the φ and ψ proline backbone angles. [Contour plot “Proline free energy surface”; color bar gives free energy in kcal/mol.] The boundaries are periodic, so the free energy surface identifies two basins. The basin at higher values of ψ represents β-sheet conformations and the basin at lower values of ψ represents α-helical conformations. 172 of 1296 possible microstates are occupied.
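The regular spatial clustering used here amounts to a fixed grid assignment over the periodic (φ, ψ) plane: 36 bins of 10° along each dihedral give 36 × 36 = 1296 square microstates. The helper below is an illustrative sketch; the function name and the degree convention on [−180°, 180°) are assumptions.

```python
import numpy as np

def grid_microstates(phi, psi, n_bins=36):
    """Map (phi, psi) pairs, in degrees on [-180, 180), to integer labels
    for a regular n_bins x n_bins partition of the dihedral plane."""
    edges = np.linspace(-180.0, 180.0, n_bins + 1)
    i = np.clip(np.digitize(phi, edges) - 1, 0, n_bins - 1)
    j = np.clip(np.digitize(psi, edges) - 1, 0, n_bins - 1)
    return i * n_bins + j

phi = np.array([-180.0, -63.0, 179.9])
psi = np.array([-180.0, 145.0, 179.9])
labels = grid_microstates(phi, psi)   # one integer microstate label per frame
```

A discrete trajectory of such labels is exactly the input needed for transition counting at the chosen lag time.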


PCCA, PCCA+, BACE, and MVCA are applied to this dataset to create a two-macrostate model, the latter of which is applied by agglomerative clustering of the rows of the transition matrix using div√JS. These four results are shown in Fig. 3 (top row). All four methods successfully identify the two metastable basins in Fig. 2 as macrostates. All methods have some trouble around the transition region, which is difficult to sample effectively. Additionally, PCCA, PCCA+, and BACE each mislabel a microstate near ψ ≈ 140 as α-helical even though it is in the β-sheet basin, while MVCA does not. This mislabeled microstate is only accessed once during the simulation: in this instance, the system transitions to the mislabeled state from a state in the β-sheet basin at the previous time step and to a state in the α-helix basin at the next time step. The corresponding element of the first non-Perron right eigenvector, on which PCCA and PCCA+ operate for the first splitting, incorrectly dictates that this microstate should belong to the α-helix basin.

The implementation of the BACE algorithm as described in Ref. 34 involves a preprocessing step in which undersampled states are merged with their kinetically nearest neighbor before coarse-graining. It is in this step that the microstate near ψ ≈ 140 is misassigned. If the algorithm is adjusted such that no states are considered to be undersampled, BACE correctly assigns this microstate, but mislabels a different state in the β-sheet basin.67

The MVCA algorithm uses the transition probability matrix to coarse-grain the MSM. The transition matrix row corresponding to this mislabeled microstate shows it has a 49.8% chance of transitioning to the β-sheet region microstate from which it came in the original trajectory and a 50.2% chance of transitioning to the α-helix region microstate to which it transitions in the original trajectory.
To understand why MVCA allocates this row to the β-sheet macrostate instead of the α-helix one, we note that most of the higher energy microstates in the transition region are assigned to the pink β-sheet macrostate (Fig. 3). These higher energy microstates are reasonably likely to transition to both the β-sheet (pink) and α-helical (blue) macrostates. Therefore, even though the microstate near ψ ≈ 140 that is mislabeled by the other algorithms is slightly more likely to transition to the α-helical macrostate according to the transition matrix probabilities, it contributes less variance to the sum of total intra-cluster variances (i.e., the Ward objective function) when it is merged into the β-sheet state, since the pink β-sheet state in the two-macrostate model contains most of the higher-energy microstates from the transition region.

FIG. 3. Coarse-graining of 172 microstates into two macrostates with PCCA, PCCA+, BACE, and MVCA for the central proline residue of the Asp-Pro-Glu tripeptide (top) and robustness of each coarse-graining algorithm to an artifact of variable duration (bottom). All coarse-graining algorithms identify the two metastable basins, but the PCCA, PCCA+, and BACE models identify a mislabeled state near ψ ≈ 140, whereas the MVCA model does not. The algorithms exhibit varying levels of robustness to a random artifact inserted once in the trajectory for 100 ps or longer. The bottom row shows plots of the number of microstates contained in the two macrostates identified by the models in the top row. The artifact is not shown, since it is physically meaningless. The PCCA, PCCA+, and BACE plots show after what artifact duration the pink macrostate contains only the artifact and the blue macrostate contains all other states. The dashed line represents the longest timescale in the microstate MSM without the artifact. PCCA and PCCA+ are the least robust to this artifact and break down when the duration of the artifact approaches the longest MSM timescale. After a two orders of magnitude increase in the artifact duration, BACE also breaks down. MVCA is not affected by this type of artifact.

In MD simulations of large systems, it may be the case that some degrees of freedom are not fully sampled and an event unrelated to the dynamics under study, such as a dihedral flip or a contracted distance between residues during a protein folding simulation, may occur on a much slower timescale than the process of interest.68–70 To explore the robustness of


the macrostating algorithm to this type of event, we also investigate how the addition of an artifact to the trajectory affects the resulting coarse-grained model. To do this, we incorporate a new, previously unseen microstate into the raw trajectory at the step before the MSM is built. The system enters this new microstate at a randomly chosen time point, remains in it for a set duration of time, and then exits.71 This analysis is designed to model the effect of a single, rare dihedral flip or other event that is orthogonal to the dynamics of interest or undersampled at the timescales of interest. Since this state is inserted after the data has been clustered into microstates, it has no physical meaning in φ × ψ space. By increasing the duration of time that the system stays in this “artifact” microstate, we can observe at what durations each coarse-graining algorithm skews its macrostate model to accommodate this artifact.

The results are shown in Fig. 3 (bottom). PCCA+ is found to be the least robust to the artifact, which affects the results of the macrostate model after a duration of 0.4 ns, and leads to a macrostate model where the artifact is a singleton macrostate and all other states comprise the second macrostate after an artifact duration of 2.9 ns. PCCA is also not robust to artifact durations on the order of nanoseconds: the coarse-graining is affected after an artifact duration of 4.3 ns and produces a singleton, artifact-only macrostate after 9.3 ns. The BACE macrostate model is robust to the artifact until a duration of 324 ns (nearly one quarter of the modified trajectory length), after which the coarse-grained model yields one singleton macrostate containing the artifact. Since the BACE Bayes factors incorporate the number of observed transition counts from each state, as these counts increase the Bayes factors change until the artifact becomes the most dissimilar macrostate.
In contrast, the MVCA model is robust to any duration of the artifact, evaluated up to 100 ms. This is because at no duration of the artifact does the variance after merging it with either of the basins exceed the variance of merging the basins with each other. The longest timescale of the microstate MSM is 3.9 ns, and we see that PCCA and PCCA+ break down when the duration of the artifact approaches the length of the longest timescale because the first splitting is performed based on the dominant eigenvector of the transition matrix eigendecomposition. PCCA and PCCA+ are therefore not expected to produce interpretable macrostate models for MSMs created from MD simulations when the timescale of interest is shorter than another timescale in the simulation, whereas BACE and MVCA are robust to artifacts at moderate durations. This type of artifact can be common ACS Paragon12 Plus Environment


in large MD simulations due to modes that are rarely accessed for substantial durations of time, which might occur due to stochastic sampling or a few starting positions when using many short trajectories to model a system. It may be the case that, at very long durations, the “artifact” is physically meaningful. Monitoring the change in the Bayes factors shows that, as the artifact duration grows, BACE increasingly favors a three-macrostate model containing the two macrostates identified in the top row of Fig. 3 and the artifact as a singleton cluster. This BACE result is replicated by MVCA if Bayes factors are used as the similarity function instead of div√JS, because the Bayes factors incorporate observed counts when assessing microstate similarity (the minimum variance criterion need not be modified). However, it may also be the case that a very long-lasting artifact obscures the slow process of interest, as in Ref. 70, Fig. 9. The appropriate way to analyze such a long-duration artifact should therefore be decided on a case-by-case basis. Both analysis methods, i.e. using unweighted or counts-based microstate similarity functions, are amenable to the minimum variance objective function by simply swapping out the standard div√JS for a counts-weighted function in the latter case. We therefore recommend MVCA, which is shown here to be robust to such artifacts and amenable to modularity if a different similarity function is preferred.
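The artifact test described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the random-number handling, and the discrete-trajectory representation are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def insert_artifact(labels, artifact_state, duration_steps):
    # Splice a previously unseen microstate label into the discrete
    # trajectory at a random time point for a fixed number of steps,
    # and return the modified trajectory. (Hypothetical helper.)
    t = int(rng.integers(0, len(labels) + 1))
    block = np.full(duration_steps, artifact_state, dtype=labels.dtype)
    return np.concatenate([labels[:t], block, labels[t:]])

# A short hypothetical microstate trajectory over states {0, 1, 2};
# state 3 plays the role of the artifact
traj = np.array([0, 0, 1, 1, 2, 2, 1, 0])
modified = insert_artifact(traj, artifact_state=3, duration_steps=4)
```

Sweeping `duration_steps` and re-running the coarse-graining on the modified trajectory reproduces the kind of robustness scan shown in Fig. 3 (bottom).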

C. Analyzing the folding of CLN025 in different force fields

The folding of the CLN025 miniprotein was simulated by both Lindorff-Larsen et al.72 and McKiernan et al.73 in different force fields. The Lindorff-Larsen dataset uses the CHARMM22* protein force field74 and the mTIP3P water model75 and comprises a single 106 µs simulation. The McKiernan simulation was performed in the AMBER-FB15 protein force field76 and the TIP3P-FB water model,77 and contains 144 short (order 1 µs) trajectories for a total of 101 µs. The simulation details are described in the respective publications. For this example, we build an optimized MSM for each simulation dataset. For both MSMs, we represent the system by its α-dihedral angles, defined by each set of four adjacent α-carbons along the protein backbone. We then use tICA27,28 to transform the dataset, retaining all components after weighting with the kinetic mapping scheme;19 cluster from tICA space using mini-batch k-means; and build a MSM. For the CHARMM22* system we use a tICA lag time of 128 ns, 704 clusters, and a MSM


lag time of 50 ns; for the AMBER-FB15 model we use a tICA lag time of 20 ns, 157 clusters, and a MSM lag time of 20 ns. The tICA and clustering parameters were selected through a variational analysis.18 The CHARMM22* analysis is described in the supplementary material of Ref. 73. The AMBER-FB15 analysis was performed using 200 trials, each with ten 50/50 cross-validation splits, to select the number of k-means cluster centers for a tICA lag time of 20 ns. The best model was chosen as the MSM (20 ns lag time) with the highest median cross-validated sum of eigenvalues over five timescales. Although the protein system is the same, these two models have been constructed independently, with different protocols, in different force fields. Here, we demonstrate how analysis with MVCA shows that different numbers of macrostates are appropriate for each model.

We analyze the optimized models separately by using MVCA to choose the appropriate number of macrostates and then to coarse-grain each model. Figure 4 shows the change in the updated sum of intra-cluster variances (recall (5)) with the number of macrostates (left), the coarse-grained model superimposed on the two-dimensional free energy surface (right), and example structures for each macrostate (bottom). For the CHARMM22* model, there is a sharp decrease in the Ward objective function from one to two macrostates, and much smaller decreases for models with more macrostates. In contrast, for the AMBER-FB15 model, there are substantial decreases in the objective function from both one → two and two → three macrostates, and shallower decreases thereafter.78 These plots justify a two-macrostate model for the CHARMM22* data and a three-macrostate model for the AMBER-FB15 data.
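This model-selection step — monitoring the decrease in the Ward objective function as macrostates are agglomerated — can be sketched directly from the SciPy linkage matrix. The 4-microstate transition matrix below is a hypothetical toy example with two clear kinetic basins, not data from the paper, and the paper additionally bootstraps these values to obtain error bars.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Hypothetical 4-microstate transition matrix with two kinetic basins,
# {0, 1} and {2, 3} (illustrative numbers only)
T = np.array([[0.70, 0.28, 0.01, 0.01],
              [0.28, 0.70, 0.01, 0.01],
              [0.01, 0.01, 0.70, 0.28],
              [0.01, 0.01, 0.28, 0.70]])

# Pairwise div_sqrtJS between rows; SciPy's 'jensenshannon' metric is
# already the square root of the Jensen-Shannon divergence
dists = pdist(T, metric='jensenshannon')

# Ward linkage on the precomputed condensed distance matrix
Z = linkage(dists, method='ward')

# Column 2 of the linkage matrix holds the cost of each merge; a large
# jump between successive merges (here, at the final 2 -> 1 merge)
# suggests the appropriate number of macrostates
merge_costs = Z[:, 2]
```

Reading the merge costs from last to first gives the left-hand panels of Fig. 4: the point where the cost stops dropping sharply marks the number of macrostates to keep.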

Even though the same system was simulated, the MVCA analysis suggests that a different number of macrostates is best for each force field. Analysis of the identified macrostates shows that the AMBER-FB15 model identifies an α-helical misfolded state79 in addition to the folded state and the extended denatured state. Analysis of the secondary structure distributions of the original microstates shows that the CHARMM22* model does not contain the α-helical state. Therefore, it is appropriate to choose a different number of macrostates for each simulation, which is consistent with the MVCA analysis. The differing results for each force field are likely due to the different protein and water model parameters. We see from this application that this method is useful for identifying an optimal number of macrostates by monitoring the reduction in variance at different levels of coarse-graining.


FIG. 4. Coarse-grained models for the folding free energy landscape of CLN025 in the CHARMM22* force field (top) and AMBER-FB15 force field (bottom). Reaction coordinates 1 and 2 correspond to the first and second tICs for each model. According to changes in the Ward objective function at each agglomerating step (left), MVCA identifies two macrostates for the CHARMM22* model and three macrostates for the AMBER-FB15 model. Variance values are given by the median variance from 200 rounds of bootstrapping with replacement. Error bars represent the 5th and 95th percentiles from the bootstrapping analysis, and are too small to extend beyond the data point markers for greater numbers of macrostates. The three AMBER-FB15 macrostates correspond to the folded β-hairpin structure (green), a misfolded α-helical structure (blue), and the extended denatured state (yellow), whereas the two CHARMM22* macrostates identify only the folded and extended structures.

D. Other algorithms and extensions

BACE (recall Sec. III A) is an agglomerative clustering method for coarse-graining MSMs that quantifies similarity between states by a Bayes factor, which is a reweighted form of the Jensen-Shannon divergence (3).34 Both the MVCA and BACE algorithms produce a


hierarchical tree that can be cut at any number of clusters, and extensions to MVCA are similar to those possible with BACE.34 For example, a state that results in high intra-cluster variance for any cluster it joins can be identified as an outlier from the beginning using the pairwise distances. Whereas the BACE algorithm effectively forgets the original distance between any pair of states once they are merged, the MVCA algorithm retains the pairwise distances from the original microstate decomposition for the entirety of the clustering process, incorporating that information into each distance update and retaining a richer representation of the system. In a sense, the practical tradeoff between MVCA and BACE is thus a choice between memory and speed. Since MVCA updates every pairwise distance at every step, it is important to note that MVCA will scale poorly with the addition of data points (see Appendix A). However, for a coarse-graining analysis, the number of data points is the number of rows of the transition matrix (i.e., the number of microstates). Since most MSMs contain only a few hundred states, we recommend MVCA because it integrates knowledge of the original state decomposition throughout the coarse-graining procedure. In the case of a prohibitive number of microstates, landmark points can be used instead of all data points. For a detailed description of the differences between BACE and MVCA, see Appendix B. Two additional methods for macrostating have not yet been addressed: HNEG59,60 and PMMs.61 Since HNEG automatically chooses the appropriate number of microstates, it was not included in the above analysis and is instead discussed in Appendix C. PMMs as approximated by hidden Markov models (HMMs)61,80 or observable operator models (OOMs)81 represent an alternative method for coarse-graining MSMs. HMMs must be seeded with a reasonable starting model, for which the authors use PCCA+ on a microstate MSM in Ref.
61 and k-means on the raw trajectory data in Ref. 80. Seeding the HMM with an initial coarse-grained model from MVCA could represent a valuable augmentation to this method. Agglomerative clustering requires two choices: the choice of a similarity function between data points (e.g., div√JS) and the objective function used to agglomerate clusters (e.g., Ward’s method). Future improvements to MVCA could be incorporated into either choice. For example, the pairwise div√JS distances input into MVCA could be weighted to incorporate observed transition counts, similar to the weighting of Bayes factors in BACE (recall Sec. III B). For the objective function, a standard error approach could be considered that is a function of both the variance and the number of data points of each cluster. Since the MVCA method introduced in this work is both effective and easy to implement out-of-the-box, we suggest starting with the methods proposed herein and then exploring modifications to either the distance function or objective function.
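To make the modularity point concrete, here is one possible counts-weighted similarity function that could be swapped in for div√JS. This is a hypothetical variant, not a definition from the paper: a generalized Jensen-Shannon divergence whose mixture weights come from the observed transition counts of each state, in the spirit of the BACE Bayes factor.

```python
import numpy as np
from scipy.special import rel_entr

def counts_weighted_js(p, q, c_p, c_q):
    # Hypothetical counts-weighted distance between two transition
    # probability rows p and q with observed counts c_p and c_q:
    # a generalized JS divergence with counts-based mixture weights.
    w_p = c_p / (c_p + c_q)
    w_q = c_q / (c_p + c_q)
    m = w_p * p + w_q * q                      # counts-weighted mixture
    js = w_p * rel_entr(p, m).sum() + w_q * rel_entr(q, m).sum()
    return np.sqrt(js)

p = np.array([0.90, 0.08, 0.02])
q = np.array([0.20, 0.60, 0.20])
d = counts_weighted_js(p, q, c_p=1000, c_q=50)
```

With equal counts this reduces to the standard square-root JS divergence, so the unweighted MVCA behavior is recovered as a special case; the Ward objective function itself is unchanged.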

IV. GROUPING AND COMPARING MANY MODELS

A. Quantifying similarity among MSMs

We now extend the methods developed in this work to group the transition matrices of many MSMs in order to facilitate understanding of systems requiring many models. The introduction of the VAC16,17 and complementary updates to MSM-building software29,82 have enabled the widespread utility of MSM analysis for complex, multi-system analyses such as the kinetics of protein–protein association,83,84 mutant effects on protein folding dynamics,85–87 the merging of simulation and experimental data,38 and the comparison of force field dynamics.73 These cutting-edge MSM applications involve an aggregated analysis of many models. In Sec. II B, we used div√JS to quantify the similarity between two rows of the same transition matrix. We can extend this method to quantify the similarity between two transition matrices from separate MSMs with the same state definitions. For two MSMs with row-stochastic probability transition matrices A and B, which we assume were generated from the same state decomposition,

D√JS ≡ Σi div√JS(Ai || Bi),    (6)

where Ai is the ith row of the transition probability matrix A. In other words, we simply sum the div√JS for each corresponding row to produce the divergence D√JS between the two MSMs. Thus, instead of calculating the divergences among rows within a single transition matrix as in the previous sections, we are now calculating the divergences among separate MSMs by summing the divergences of each corresponding pair of rows for each pair of MSM transition matrices. In this way, an aggregation of MSMs can be separated into groups based on


the dynamics encoded in their transition matrices. Below, we demonstrate this method by analyzing transition matrices originating from nine different force fields.
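Eq. (6) translates directly into a few lines of SciPy; the helper name and example matrices below are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def d_sqrt_js(A, B):
    # Eq. (6): sum the square-root Jensen-Shannon divergences of
    # corresponding rows of two row-stochastic transition matrices
    # sharing the same state decomposition. SciPy's jensenshannon
    # already returns the *square root* of the JS divergence.
    A, B = np.asarray(A), np.asarray(B)
    assert A.shape == B.shape
    return float(sum(jensenshannon(a, b) for a, b in zip(A, B)))

# Two hypothetical 2-state MSMs (illustrative numbers)
A = np.array([[0.90, 0.10], [0.20, 0.80]])
B = np.array([[0.80, 0.20], [0.30, 0.70]])
d_ab = d_sqrt_js(A, B)   # > 0: the models encode different dynamics
```

Identical transition matrices give a divergence of zero, so D√JS behaves as a sensible dissimilarity for the clustering that follows.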

B. Comparing protein and water models

Hierarchical clustering on the MSM transition matrices constructed from independent simulations of a protein using various force fields can provide insight into modeling choices by demonstrating which protein and water models yield similar dynamics for the model system. To investigate this, we examined nine folding simulations of the ten-residue CLN025 miniprotein.63 The first simulation, performed by Lindorff-Larsen et al.,72 is described in Sec. III C and uses the CHARMM22* protein force field74 and the mTIP3P water model.75 The remaining datasets, one of which was also analyzed in Sec. III C, contain 136 to 160 short trajectories for a total of 100 to 107 µs. The simulation details are provided in the supplementary material of Ref. 73. These datasets represent eight combinations of protein force field (AMBER ff99SB-ILDN65 or AMBER-FB1576) and water model (TIP3P,88 TIP3P-FB,77 TIP4P-ew,89 or TIP4P-FB77). To approximate a situation in which the modeler has performed some simulations and is investigating dynamics to motivate further study, we assume we can use the CHARMM22* trajectory as a reference to investigate the dynamics of the eight AMBER simulations, since MSMs for long trajectories such as the Lindorff-Larsen dataset are expected to be well-behaved and suitable for benchmarking.90 To analyze our aggregate dataset, we first construct an unoptimized MSM for the CHARMM22* reference data by following recommendations for current modeling practices.30 We choose a MSM using the α-carbon contact distances, which are transformed into all 28 tICs weighted by their eigenvalues with a tICA lag time of 100 ns.19,27,28 From this tICA representation, we discretize conformation space using the mini-batch k-means clustering algorithm with 100, 200, 300, 400, and 500 cluster centers and construct a MSM with a lag time of 50 ns for each number of microstates.43 We then use these pretrained tICA and clustering models to create “projected” MSMs, which we refer to in quotation marks to distinguish them from the PMM macrostating method.
A description of the MSM “projection” procedure can be found in Ref. 73. “Projected” MSMs sometimes do not occupy all microstates used to create the baseline model. In our analysis, for 100 microstates, no states were unoccupied for any model. For


FIG. 5. Agglomerative clustering of nine transition matrices using Ward’s method and the D√JS distance metric. The tree shows a hierarchy of which factors contribute to dissimilarity of CLN025 dynamics for all five sets of microstate models used. The transition matrices are distinguished by the protein force field (CHARMM22*=22*, AMBER ff99SB-ILDN=ILDN, AMBER-FB15=FB15) when the fewest clusters are used. As more clusters are used, the transition matrices separate according to water models (mTIP3P; TIP3P, TIP3P-FB, TIP4P-ew, TIP4P-FB) in the AMBER-FB15 branch. Finally, water models in the AMBER ff99SB-ILDN branch are differentiated. In one case (100 microstates), the AMBER ff99SB-ILDN/TIP3P model was isolated before the last two AMBER-FB15 water models were separated. It is interesting that the dynamics under CHARMM22* and AMBER ff99SB-ILDN are more similar to each other than to the dynamics under AMBER-FB15. It is also noteworthy that the water models separate in the AMBER-FB15 branch before the AMBER ff99SB-ILDN branch. This illustrates the unique sensitivity of the AMBER-FB15 force field to the water model.

microstate numbers 200 through 500, at most 2 states were unoccupied for any “projected” model, and the transition matrices were resized for the MVCA analysis. For more complicated systems in which microstate coverage is not complete, a combined MSM can be created by aggregating all the datasets, constructing a single tICA model for the entire dataset, and then creating separate MSMs.85–87 For each number of k-means microstates, we now have nine transition matrices to cluster using the D√JS similarity function. By performing agglomerative clustering with Ward’s minimum variance objective function for all possible numbers of clusters, we can see a hierarchical representation of which aspects of the simulations partition the dynamics and at what clustering step certain attributes are agglomerated (Fig. 5). Figure 5 represents


the clustering results that are robust to the number of microstates used to create the “projected” models. For all five sets of “projected” microstate models, the coarsest clustering separates the simulations by protein model. When three clusters are used to group the nine transition matrices, the clusters are characterized by the three protein force fields. When only two clusters are used, the tree shows that the CHARMM22* and AMBER ff99SB-ILDN dynamics are more similar to each other than to the dynamics in AMBER-FB15. At finer partitions of the nine transition matrices, the clusters are characterized by their water models. In the AMBER-FB15 branch, the four water models separate when additional clusters are used. The AMBER ff99SB-ILDN models separate according to water model only after the water models in the AMBER-FB15 branch have separated, except in one case (100 microstates) where one AMBER ff99SB-ILDN model separates before the last AMBER-FB15 model. This analysis shows that the AMBER-FB15 simulations with different water models are more dissimilar to each other than the AMBER ff99SB-ILDN simulations with different water models. This provides further evidence that protein dynamics are uniquely sensitive to the water model when AMBER-FB15 is used for the protein model.76 Application of MVCA to these nine CLN025 folding datasets, varying only with respect to force field, thus allows for the comparison of each underlying free energy landscape. Furthermore, variable levels of detail can be unpacked when applying this method with a variable number of dynamical subgroups (Fig. 5, different layers of the tree). Broadly, using a small number of clusters, we find that the datasets are distinguishable by protein model. Using a larger number of clusters, the models become distinguishable by water model. This suggests that simulation dynamics are primarily guided by protein force field, and that the influence of water model is of a finer dynamical character.
Overall, we find that coarse-graining the force field-dependent MSMs via MVCA allows the quantitative comparison of the free energy landscapes modeled by each force field, and that using a variable number of groups allows one to make dynamical comparisons at variable levels of scale. Comparing the MSMs generated by a set of force fields is a useful approach for comparing the landscape approximated by each force field. This is because a MSM summarizes the MD dataset using a landscape of energy basins (state populations) and barrier heights (transition probabilities). When a macrostate model is built, the coarse-graining process has the effect of smoothing this landscape, averaging over short-range noise. Comparison of these coarse-grained landscapes created with MVCA enables the comparison of the force fields according


to their description of only the most salient dynamical processes.

V. CONCLUSIONS

Coarse-graining MSMs for interpretability and further analysis is important for understanding what our models are telling us and for connecting theoretical models with experimental observables. Furthermore, the current state of the art in MSM methods and applications involves an aggregated analysis of many MSMs.73,85–87 Here, we adapt agglomerative hierarchical clustering with the square root of the Jensen-Shannon divergence and Ward’s minimum variance objective function to implement a method for comparing and clustering probability distributions within the transition matrix of a single MSM, or the transition matrices themselves from a dataset of many MSMs. In Sec. III, we introduce MVCA, which clusters probability densities within a transition matrix using div√JS as the similarity function and Ward’s method as the objective function in order to produce robust coarse-grained MSMs. We show that MVCA can be used to determine the appropriate number of macrostates for a model, and that it reveals a misfolded state of CLN025 present in one force field but not another. Furthermore, MVCA is straightforward and easy to execute with available implementations of Ward’s method for agglomerative clustering.42,82,91 Finally, in Sec. IV, we show the utility of the methods presented in this work for analyzing sets of many MSMs by comparing simulations of the same folding system in multiple force fields. Applying MVCA to groups of MSMs in order to cluster their transition matrices based on dynamical similarity provides a versatile tool for multi-model analyses. For example, a set of simulations including a wild-type protein and several mutants can be clustered using MVCA when the same state decomposition is used for all models.
The change in objective function can be monitored to determine the appropriate number of groups, and then the membership of those groups can aid further analysis: for example, determining which mutations are in the same dynamical group as the wild type, and which mutations form separate groups, can provide insight into interpreting and further analyzing the mutated system. We note that other methods for analyzing these types of mutated datasets have been presented in Refs. 85–87 and 92. The minimum variance criterion that characterizes Ward’s method enables an intuitive grouping of microstates or entire models based on the variance in their dynamics, and the


variance within each cluster can lend insight into that group of data points. We anticipate that the methods in this paper, namely the union of the div√JS and D√JS similarity functions with agglomerative clustering using Ward’s minimum variance objective function, will provide an easy-to-implement analysis tool for coarse-graining MSM transition matrices. The same method can also be used for grouping many MSMs based on their dynamics, which will become increasingly common as MSM methods continue to be developed. Free, open source software implementing the methods used in this work is available in the MDTraj,93 MSMBuilder,82 and Osprey94 packages available from mdtraj.org and msmbuilder.org. Please see Appendix D for a note on different implementations of agglomerative clustering.

ACKNOWLEDGMENTS

B.E.H. is extremely grateful to John Dabiri, Matt Harrigan, Xuhui Huang, Sukrit Singh, Greg Bowman, and Swetava Ganguli for insightful discussions. The authors thank Jade Shi and Nate Stanley for manuscript feedback, and the anonymous reviewers for their suggestions. We are grateful to Matt Harrigan for providing the tripeptide dataset. We acknowledge the National Institutes of Health under No. NIH R01-GM62868 for funding. H.K.W.S. acknowledges support from the NSF GRFP. We graciously acknowledge Folding@home donors who contributed to the AMBER force field simulations and D. E. Shaw Research for providing CHARMM22* simulation data. V.S.P. is a consultant and SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel Inc, Freenome Inc, Omada Health, Patient Ping, and Rigetti Computing, and is a General Partner at Andreessen Horowitz.

Appendix A: Computation time for coarse-graining analyses

All coarse-graining was performed locally on a Mac mini (2.8 GHz Intel Core i5, 4 cores). Agglomerative clustering was performed using all data points without a landmark approximation. The coarse-graining of 172 microstates to two macrostates in Sec. III B was timed in seconds as follows: PCCA, 0.11; PCCA+, 0.19; BACE, 0.38; MVCA, 20.40. To produce the macrostate models in Sec. III C, the coarse-graining of 704 microstates to 2 macrostates for the CHARMM22* model took 1589.35 seconds, and the coarse-graining of


157 microstates to 3 macrostates for the AMBER-FB15 model took 32.37 seconds.

Appendix B: Comparison with BACE

The BACE coarse-graining method34 also uses agglomerative clustering to produce a macrostate model, but it differs in both its similarity function and its objective function. When using BACE, the most similar clusters are merged according to their Bayes factor, which is a form of the Jensen-Shannon divergence (3) weighted by the observed counts. At each agglomeration the Bayes factors are recomputed for all pairs involving the newly merged cluster. The MVCA approach is similar in that it also assesses similarity according to the (unweighted) Jensen-Shannon divergence; however, the minimum variance criterion optimized during agglomeration, which is a function of both the original div√JS distances and the cardinality of the clusters (recall (5)), differentiates MVCA from BACE in two key ways. First, MVCA is easier to execute in practice, both because pairwise div√JS distances need only be computed once and because existing implementations of Ward’s method are agnostic to the distance metric and can be used out-of-the-box (see Appendix D). For example, given a MSM transition matrix, MVCA can be performed with various functions from the SciPy Python package91 and does not require custom software. Second, the similarity update between clusters is performed differently after an agglomeration. In MVCA, the div√JS values are computed at the outset and do not need to be updated at every step because the updated distance is calculated according to Ward’s method. This updated distance incorporates the original pairwise distances between all data points that were agglomerated. In contrast, at each agglomeration in the BACE approach, the original pairwise Bayes factors are not incorporated into further analysis, and subsequent agglomerative steps are performed without knowledge of the original separation between the merged states. In the implementation of BACE introduced in Ref.
34, the Bayes factors between states i and j are only updated if either i or j is agglomerated into another cluster. However, if two other states k and l have just been agglomerated, the Bayes factor between i and j will differ based on whether k and l are considered to be separate or merged. The choice of whether to update all Bayes factors at every agglomeration is a choice of whether to retain outdated finer-grained information or to remove that information entirely. MVCA represents a richer approach, wherein knowledge of the original distances


between data points is maintained and incorporated throughout the clustering analysis. For comparison, we present the MVCA analysis performed in Sec. III C using BACE, and reproduce the plots in Fig. 4. We use the default settings, i.e. the unsymmetrized counts matrix and an undersampled threshold of 1.1.34 We see that the uncertainties in the Bayes factors are much higher than the corresponding uncertainties in the variances in Fig. 4, and that the macrostates are somewhat different for the AMBER-FB15 3-macrostate model; namely, the folded (green) and misfolded (blue) macrostates are more diffuse. We note that the BACE macrostates become more compact and resemble the MVCA macrostates when

an undersampled threshold of 0 is used (i.e., no states are considered to be undersampled).

FIG. 6. Coarse-grained models for the folding free energy landscape of CLN025 in the CHARMM22* force field (top) and AMBER-FB15 force field (bottom), coarse-grained with BACE into 2 and 3 macrostates, respectively. Reaction coordinates 1 and 2 correspond to the first and second tICs for each model. The left panel shows the median change in Bayes factors with the number of macrostates from 200 rounds of bootstrapping with replacement. Error bars represent the 5th and 95th percentiles of the Bayes factors obtained from the bootstrapping analysis. These results can be contrasted with Fig. 4 in the main text.
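As a concrete illustration of the out-of-the-box SciPy route noted in this appendix, the following sketch performs the full MVCA pipeline on a hypothetical toy transition matrix; the function name and matrix are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def mvca_sketch(transition_matrix, n_macrostates):
    # Minimal SciPy-only sketch of MVCA: pairwise div_sqrtJS between
    # the rows of the transition matrix (computed once), Ward linkage,
    # and a cut of the tree at the requested number of macrostates.
    # Returns one macrostate label per microstate.
    dists = pdist(transition_matrix, metric='jensenshannon')
    Z = linkage(dists, method='ward')
    return fcluster(Z, t=n_macrostates, criterion='maxclust')

# Hypothetical 4-microstate transition matrix with two kinetic basins
T = np.array([[0.70, 0.28, 0.01, 0.01],
              [0.28, 0.70, 0.01, 0.01],
              [0.01, 0.01, 0.70, 0.28],
              [0.01, 0.01, 0.28, 0.70]])
labels = mvca_sketch(T, n_macrostates=2)
```

Note that, per Appendix D, SciPy's `fcluster` yields the "fit" assignments, whereas the MSMBuilder implementation used for the results in this paper returns the "predicted" assignments; the two can differ near cluster edges.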


Appendix C: Analysis with the hierarchical Nyström extension graph

The hierarchical Nyström extension graph (HNEG)59,60 is a method for coarse-graining MSMs that is designed to avoid pathologies due to poorly sampled microstates by building a population-based hierarchy of microstates from the observed transition counts matrix and utilizing spectral clustering within each layer of the hierarchy. Since HNEG automatically chooses the optimal number of macrostates, we discuss its advantages and disadvantages separately. First, we implement the HNEG algorithm for the proline dataset in Sec. III B. The HNEG model finds two macrostates and does not mislabel the microstate near ψ ≈ 140 that is incorrectly assigned by PCCA, PCCA+, and BACE in Fig. 3. However, when the HNEG algorithm is implemented on the AMBER-FB15 dataset for CLN025 presented in Sec. III C, it recommends two macrostates and does not separate the folded state from the extended state (Fig. 4; green and yellow, respectively). The MVCA algorithm presented in this work offers a method for choosing how many macrostates best describe the data (Fig. 4, left) while also enabling a user-determined choice of macrostates so that knowledge of the system can be incorporated. Additionally, we tested the HNEG algorithm for robustness by incorporating an artifact with variable duration as described in Sec. III B. The HNEG macrostate model was robust to the artifact up to 3.4 ns. At greater durations, HNEG either produced a variety of two-macrostate models or recommended one of several three-macrostate models with the artifact isolated as a singleton third macrostate. At some artifact durations greater than 3.4 ns, the original two-macrostate clustering was unaffected. Whether HNEG returned two or three macrostates for the data, and which model it returned, varied inconsistently with the duration of the artifact in an apparently random fashion.
It therefore may not be straightforward (or possible) to determine when HNEG would or would not be robust to similar artifacts. In contrast, MVCA is reliably robust to this type of artifact. The HNEG algorithm can be downloaded from simtk.org/home/hneg.
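As noted in the main text, MVCA can be performed with standard SciPy routines. The following sketch is our own illustration (not the authors' code, and the 4-microstate transition matrix is invented): it clusters the rows of a row-stochastic transition matrix under the Jensen–Shannon metric with Ward's minimum variance criterion and cuts the dendrogram at a chosen number of macrostates.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Toy row-stochastic transition matrix with two metastable blocks:
# microstates 0,1 interconvert rapidly, as do microstates 2,3.
T = np.array([[0.50, 0.40, 0.05, 0.05],
              [0.40, 0.50, 0.05, 0.05],
              [0.05, 0.05, 0.50, 0.40],
              [0.05, 0.05, 0.40, 0.50]])

# Pairwise Jensen-Shannon metric between rows (transition distributions).
n = T.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = jensenshannon(T[i], T[j], base=2)

# Agglomerative clustering with Ward's criterion on the condensed distance
# matrix; as discussed in the text, the objective no longer corresponds to
# Euclidean variance for this non-Euclidean metric.
Z = linkage(squareform(D), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # microstates {0,1} and {2,3} receive different macrostate labels
```

Cutting the dendrogram at different values of `t` reproduces the user-determined choice of macrostate number described above.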

Appendix D: A note on agglomerative clustering code

Clustering involves fitting a set of data points to clusters in an unsupervised fashion. Given a fit cluster model, the cluster assignments of new data points can then be predicted.


For agglomerative hierarchical clustering, the "fit" cluster assignment of a data point may not be the same as the "predicted" cluster assignment for the exact same data point. This phenomenon occurs near cluster edges and is discussed for Ward's method in an interactive example at github.com/msmbuilder/msmbuilder/blob/master/examples/Ward-Clustering.ipynb. MSMBuilder algorithms are designed to be amenable to cross-validation;82 therefore, the MVCA output in the MSMBuilder implementation is the prediction output.21 This prediction algorithm was derived in Ref. 21 and is not implemented in SciPy, which returns the fit output.91 All results presented in this paper use the MSMBuilder convention. Though in general not identical, both the fit and predict outputs are deterministic.
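The fit/predict distinction can be sketched as follows. The toy 2-D data are invented, and the `predict` rule below is our illustration of a Ward-objective assignment in the spirit of Ref. 21, not the MSMBuilder implementation itself: a new point x is assigned to the cluster k that minimizes the increase in the Ward objective, n_k/(n_k + 1) · ||x − μ_k||².

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs as toy training data.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# "Fit": Ward's method via SciPy, cut into two clusters.
Z = linkage(X, method="ward")
fit_labels = fcluster(Z, t=2, criterion="maxclust")

def predict(x, X, labels):
    """Assign x to the cluster minimizing the Ward-objective increase
    n_k / (n_k + 1) * ||x - mu_k||^2 (illustrative rule, not MSMBuilder code)."""
    ks = np.unique(labels)
    costs = []
    for k in ks:
        members = X[labels == k]
        mu, n_k = members.mean(axis=0), len(members)
        costs.append((n_k / (n_k + 1)) * np.sum((x - mu) ** 2))
    return ks[int(np.argmin(costs))]

# Far from the cluster edge, fit and predicted labels agree; near the
# edge between the blobs they may not.
print(predict(np.array([0.0, 0.0]), X, fit_labels))
```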

REFERENCES

1. Dror, R. O.; Dirks, R. M.; Grossman, J.; Xu, H.; Shaw, D. E. Biomolecular Simulation: A Computational Microscope for Molecular Biology. Annu. Rev. Biophys. 2012, 41, 429–452.

2. Shaw, D. E.; Deneroff, M. M.; Dror, R. O.; Kuskin, J. S.; Larson, R. H.; Salmon, J. K.; Young, C.; Batson, B.; Bowers, K. J.; Chao, J. C.; Eastwood, M. P.; Gagliardo, J.; Grossman, J. P.; Ho, C. R.; Ierardi, D. J.; Kolossváry, I.; Klepeis, J. L.; Layman, T.; McLeavey, C.; Moraes, M. A.; Mueller, R.; Priest, E. C.; Shan, Y.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S. C. Anton, a Special-purpose Machine for Molecular Dynamics Simulation. Commun. ACM 2008, 51, 91–97.

3. Shirts, M.; Pande, V. S. Screen Savers of the World Unite! Science 2000, 290, 1903–1904.

4. Buch, I.; Harvey, M. J.; Giorgino, T.; Anderson, D. P.; Fabritiis, G. D. High-Throughput All-Atom Molecular Dynamics Simulations Using Distributed Computing. J. Chem. Inf. Model. 2010, 50, 397–403.

5. Schütte, C.; Sarich, M. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches; American Mathematical Soc., 2013; Vol. 24.

6. Bowman, G. R.; Pande, V. S.; Noé, F. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation; Springer, 2014.

7. Chodera, J. D.; Singhal, N.; Pande, V. S.; Dill, K. A.; Swope, W. C. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys. 2007, 126.

8. Noé, F.; Horenko, I.; Schütte, C.; Smith, J. C. Hierarchical analysis of conformational dynamics in biomolecules: Transition networks of metastable states. J. Chem. Phys. 2007, 126.

9. Voelz, V. A.; Bowman, G. R.; Beauchamp, K.; Pande, V. S. Molecular Simulation of ab Initio Protein Folding for a Millisecond Folder NTL9(1-39). J. Am. Chem. Soc. 2010, 132, 1526–1528.

10. Bowman, G. R.; Voelz, V. A.; Pande, V. S. Atomistic Folding Simulations of the Five-Helix Bundle Protein λ6-85. J. Am. Chem. Soc. 2011, 133, 664–667.

11. Deuflhard, P.; Weber, M. Robust Perron cluster analysis in conformation dynamics. Lin. Alg. Appl. 2005, 398, 161–184.

12. Noé, F.; Fischer, S. Transition networks for modeling the kinetics of conformational change in macromolecules. Curr. Opin. Struct. Biol. 2008, 18, 154–162.

13. Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys. 2009, 131.

14. Pande, V. S.; Beauchamp, K.; Bowman, G. R. Everything you wanted to know about Markov State Models but were afraid to ask. Methods 2010, 52, 99–105.

15. Keller, B.; Daura, X.; van Gunsteren, W. F. Comparing geometric and kinetic cluster algorithms for molecular simulation data. J. Chem. Phys. 2010, 132.

16. Noé, F.; Nüske, F. A Variational Approach to Modeling Slow Processes in Stochastic Dynamical Systems. Multiscale Model. Simul. 2013, 11, 635–655.

17. Nüske, F.; Keller, B. G.; Pérez-Hernández, G.; Mey, A. S. J. S.; Noé, F. Variational Approach to Molecular Kinetics. J. Chem. Theory Comput. 2014, 10, 1739–1752.

18. McGibbon, R. T.; Pande, V. S. Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys. 2015, 142, 124105.

19. Noé, F.; Clementi, C. Kinetic Distance and Kinetic Maps from Molecular Dynamics Simulation. J. Chem. Theory Comput. 2015, 11, 5002–5011.

20. Wu, H.; Noé, F. Variational approach for learning Markov processes from time series data. arXiv preprint arXiv:1707.04659, 2017.

21. Husic, B. E.; Pande, V. S. Ward Clustering Improves Cross-Validated Markov State Models of Protein Folding. J. Chem. Theory Comput. 2017, 13, 963–967.

22. Schwantes, C. R.; McGibbon, R. T.; Pande, V. S. Perspective: Markov models for long-timescale biomolecular dynamics. J. Chem. Phys. 2014, 141, 090901.

23. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.

24. Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Amer. Statist. Assoc. 1963, 58, 236–244.

25. Negative eigenvalues correspond to dynamical processes that oscillate rather than decay to equilibrium. When computing the timescales in practice, we can take the absolute value of the argument of the logarithm to convert these numerical artifacts to real-valued, albeit meaningless, timescales.
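The convention in this footnote can be made concrete (a minimal sketch with invented eigenvalues): the implied timescales are t_i = −τ / ln λ_i, and taking |λ_i| keeps the result real when λ_i < 0.

```python
import numpy as np

def implied_timescales(eigvals, tau=1.0):
    """Implied timescales t_i = -tau / ln|lambda_i| from transition-matrix
    eigenvalues at lag time tau. The absolute value converts negative
    eigenvalues (oscillating processes) into real-valued, though
    physically meaningless, timescales."""
    ev = np.asarray(eigvals, dtype=float)
    return -tau / np.log(np.abs(ev))

# The eigenvalues -0.5 and +0.5 map to the same timescale, 1/ln 2.
print(implied_timescales([0.9, 0.5, -0.5]))
```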

26. Naritomi, Y.; Fuchigami, S. Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: The case of domain motions. J. Chem. Phys. 2011, 134.

27. Schwantes, C. R.; Pande, V. S. Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9. J. Chem. Theory Comput. 2013, 9, 2000–2009.

28. Pérez-Hernández, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 2013, 139, 015102.

29. Scherer, M. K.; Trendelkamp-Schroer, B.; Paul, F.; Pérez-Hernández, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.-H.; Noé, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. 2015, 11, 5525–5542.

30. Husic, B. E.; McGibbon, R. T.; Sultan, M. M.; Pande, V. S. Optimized parameter selection reveals trends in Markov state models for protein folding. J. Chem. Phys. 2016, 145, 194103.

31. Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Schütte, C.; Noé, F. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys. 2011, 134, 174105.

32. Bowman, G. R.; Ensign, D. L.; Pande, V. S. Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models. J. Chem. Theory Comput. 2010, 6, 787–794.

33. Endres, D. M.; Schindelin, J. E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.

34. Bowman, G. R. Improved coarse-graining of Markov state models via explicit consideration of statistical uncertainty. J. Chem. Phys. 2012, 137, 134111.

35. McClendon, C. L.; Hua, L.; Barreiro, G.; Jacobson, M. P. Comparing Conformational Ensembles Using the Kullback–Leibler Divergence Expansion. J. Chem. Theory Comput. 2012, 8, 2115–2126.

36. Dixit, P. Introducing prescribed biases in out of equilibrium Markov models. bioRxiv 2017.

37. Dixit, P. D.; Dill, K. A. Caliber Corrected Markov Modeling (C2M2): Correcting Equilibrium Markov Models. arXiv preprint arXiv:1711.03043, 2017.

38. Olsson, S.; Wu, H.; Paul, F.; Clementi, C.; Noé, F. Combining experimental and simulation data of molecular processes via augmented Markov models. Proc. Natl. Acad. Sci. 2017.

39. Voelz, V. A.; Elman, B.; Razavi, A. M.; Zhou, G. Surprisal Metrics for Quantifying Perturbed Conformational Dynamics in Markov State Models. J. Chem. Theory Comput. 2014, 10, 5716–5728.

40. Wu, H.; Paul, F.; Wehmeyer, C.; Noé, F. Multiensemble Markov models of molecular thermodynamics and kinetics. Proc. Natl. Acad. Sci. 2016, 113, E3221–E3230.

41. Müllner, D. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.

42. Müllner, D. fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python. J. Stat. Soft. 2013, 53, 1–18.

43. Beauchamp, K. A.; McGibbon, R.; Lin, Y.-S.; Pande, V. S. Simple few-state models reveal hidden complexity in protein folding. Proc. Natl. Acad. Sci. 2012, 109, 17807–17813.

44. Dickson, A.; Brooks, C. L. Native States of Fast-Folding Proteins Are Kinetic Traps. J. Am. Chem. Soc. 2013, 135, 4729–4734.

45. Chang, H.-W.; Bacallado, S.; Pande, V. S.; Carlsson, G. E. Persistent Topology and Metastable State in Conformational Dynamics. PLoS ONE 2013, 8, 1–10.

46. Weber, J. K.; Jack, R. L.; Pande, V. S. Emergence of Glass-like Behavior in Markov State Models of Protein Folding Dynamics. J. Am. Chem. Soc. 2013, 135, 5501–5504.

47. Lapidus, L. J.; Acharya, S.; Schwantes, C. R.; Wu, L.; Shukla, D.; King, M.; DeCamp, S. J.; Pande, V. S. Complex Pathways in Folding of Protein G Explored by Simulation and Experiment. Biophys. J. 2014, 107, 947–955.

48. We differentiate the objective function ("Ward's minimum variance criterion") from the full method ("Ward's method", i.e., agglomerative hierarchical clustering using Euclidean distances and the objective function in (5)) to reflect the understanding that (5) no longer corresponds to Euclidean variance, which requires the l2-norm, when a non-Euclidean pairwise similarity function is used.

49. See Ref. 21, Fig. 1 for an example.

50. De Paris, R.; Quevedo, C. V.; Ruiz, D. D. A.; Norberto de Souza, O. An Effective Approach for Clustering InhA Molecular Dynamics Trajectory Using Substrate-Binding Cavity Features. PLoS ONE 2015, 10, 1–25.

51. Michel, V.; Gramfort, A.; Varoquaux, G.; Eger, E.; Keribin, C.; Thirion, B. A supervised clustering approach for fMRI-based inference of brain states. Pattern Recogn. 2012, 45, 2041–2049.

52. Varoquaux, G.; Gramfort, A.; Thirion, B. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, GB, June 25–July 1, 2012; Langford, J., Pineau, J., Eds.; Omnipress: New York, NY, USA, 2012; pp 1375–1382.

53. Thirion, B.; Varoquaux, G.; Dohmatob, E.; Poline, J.-B. Which fMRI clustering gives good brain parcellations? Front. Neurosci. 2014, 8, 167.

54. Deuflhard, P.; Huisinga, W.; Fischer, A.; Schütte, C. Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra Appl. 2000, 315, 39–59.

55. Buchete, N.-V.; Hummer, G. Coarse Master Equations for Peptide Folding Dynamics. J. Phys. Chem. B 2008, 112, 6057–6069.

56. Weber, M. Improved Perron Cluster Analysis. Technical Report ZIB 03-04, 2003.

57. Kube, S.; Weber, M. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys. 2007, 126, 024103.

58. Röblitz, S. Statistical Error Estimation and Grid-free Hierarchical Refinement in Conformation Dynamics. Ph.D. thesis, Freie Universität Berlin, 2008.

59. Huang, X.; Yao, Y.; Bowman, G. R.; Sun, J.; Guibas, L. J.; Carlsson, G. E.; Pande, V. S. Constructing multi-resolution Markov state models (MSMs) to elucidate RNA hairpin folding mechanisms. Pac. Symp. Biocomput. 2010, 228–239.

60. Yao, Y.; Cui, R. Z.; Bowman, G. R.; Silva, D.-A.; Sun, J.; Huang, X. Hierarchical Nyström methods for constructing Markov state models for conformational dynamics. J. Chem. Phys. 2013, 138, 174106.

61. Noé, F.; Wu, H.; Prinz, J.-H.; Plattner, N. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys. 2013, 139, 184114.

62. Honda, S.; Yamasaki, K.; Sawada, Y.; Morii, H. 10 Residue Folded Peptide Designed by Segment Statistics. Structure 2004, 12, 1507–1518.

63. Honda, S.; Akiba, T.; Kato, Y. S.; Sawada, Y.; Sekijima, M.; Ishimura, M.; Ooishi, A.; Watanabe, H.; Odahara, T.; Harata, K. Crystal Structure of a Ten-Amino Acid Protein. J. Am. Chem. Soc. 2008, 130, 15327–15331.

64. Vitalini, F.; Noé, F.; Keller, B. Molecular dynamics simulations data of the twenty encoded amino acids in different force fields. Data in Brief 2016, 7, 582–590.

65. Lindorff-Larsen, K.; Piana, S.; Palmo, K.; Maragakis, P.; Klepeis, J. L.; Dror, R. O.; Shaw, D. E. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins: Struct., Funct., Bioinf. 2010, 78, 1950–1958.

66. Swope, W. C.; Pitera, J. W.; Suits, F. Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory. J. Phys. Chem. B 2004, 108, 6571–6581.

67. For this implementation of BACE we have used the observed transition counts, but the misassignment also occurs when the transition counts are symmetrized by averaging the raw matrix with its transpose. When no states are considered to be undersampled, different mislabeled states appear.
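The symmetrization mentioned in this footnote is the simple average of the count matrix with its transpose; a minimal sketch with an invented count matrix:

```python
import numpy as np

# Invented observed transition-count matrix (rows: from-state, cols: to-state).
C = np.array([[80.0, 15.0,  5.0],
              [10.0, 70.0, 20.0],
              [ 2.0, 25.0, 73.0]])

# Symmetrize by averaging the raw matrix with its transpose.
C_sym = 0.5 * (C + C.T)

# Row-normalize the symmetrized counts into a transition-probability matrix.
T = C_sym / C_sym.sum(axis=1, keepdims=True)
print(np.allclose(C_sym, C_sym.T))  # True: counts are now symmetric
```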

68. Wedemeyer, W. J.; Welker, E.; Scheraga, H. A. Proline Cis–Trans Isomerization and Protein Folding. Biochemistry 2002, 41, 14637–14644.

69. Banushkina, P. V.; Krivov, S. V. Nonparametric variational optimization of reaction coordinates. J. Chem. Phys. 2015, 143.

70. McGibbon, R. T.; Husic, B. E.; Pande, V. S. Identification of simple reaction coordinates from complex dynamics. J. Chem. Phys. 2017, 146, 044109.

71. At this stage, the trajectory is represented by a list of integers 0 through 172. To introduce the new microstate, we choose a random point in the trajectory and insert state 173 for a set number of time steps.
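A minimal sketch of this insertion procedure (the trajectory contents, artifact duration, and random seed are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# Microstate trajectory as a list of integer labels 0 through 172.
traj = rng.integers(0, 173, size=10_000).tolist()

# Insert the artificial microstate 173 for `duration` consecutive time
# steps at a randomly chosen point in the trajectory.
duration = 50
pos = int(rng.integers(0, len(traj)))
traj = traj[:pos] + [173] * duration + traj[pos:]

print(len(traj))  # original length plus the artifact duration
```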

72. Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How Fast-Folding Proteins Fold. Science 2011, 334, 517–520.

73. McKiernan, K. A.; Husic, B. E.; Pande, V. S. Modeling the mechanism of CLN025 beta-hairpin formation. J. Chem. Phys. 2017, 147, 104107.

74. Piana, S.; Lindorff-Larsen, K.; Shaw, D. E. How Robust Are Protein Folding Simulations with Respect to Force Field Parameterization? Biophys. J. 2011, 100, L47–L49.

75. MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiórkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 1998, 102, 3586–3616.

76. Wang, L.-P.; McKiernan, K. A.; Gomes, J.; Beauchamp, K. A.; Head-Gordon, T.; Rice, J. E.; Swope, W. C.; Martínez, T. J.; Pande, V. S. Building a More Predictive Protein Force Field: A Systematic and Reproducible Route to AMBER-FB15. J. Phys. Chem. B 2017, 121, 4023–4039.

77. Wang, L.-P.; Martinez, T. J.; Pande, V. S. Building Force Fields: An Automatic, Systematic, and Reproducible Approach. J. Phys. Chem. Lett. 2014, 5, 1885–1891.

78. Note that the variances obtained from the entire dataset, analyzed without bootstrapping, are expected to be higher than those of the bootstrapped samples, since multiple instances of the same part of the data will generally decrease the variance.

79. García, A. E.; Sanbonmatsu, K. Y. Exploring the energy landscape of a β hairpin in explicit solvent. Proteins: Struct., Funct., Bioinf. 2001, 42, 345–354.

80. McGibbon, R. T.; Ramsundar, B.; Sultan, M. M.; Kiss, G.; Pande, V. S. Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models. Proceedings of the 31st International Conference on Machine Learning, 2014; pp 1197–1205.

81. Wu, H.; Prinz, J.-H.; Noé, F. Projected metastable Markov processes and their estimation with observable operator models. J. Chem. Phys. 2015, 143, 144101.

82. Harrigan, M. P.; Sultan, M. M.; Hernández, C. X.; Husic, B. E.; Eastman, P.; Schwantes, C. R.; Beauchamp, K. A.; McGibbon, R. T.; Pande, V. S. MSMBuilder: Statistical Models for Biomolecular Dynamics. Biophys. J. 2017, 112, 10–15.

83. Plattner, N.; Doerr, S.; De Fabritiis, G.; Noé, F. Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling. Nat. Chem. 2017, advance online publication.

84. Zhou, G.; Pantelopulos, G. A.; Mukherjee, S.; Voelz, V. A. Bridging Microscopic and Macroscopic Mechanisms of p53-MDM2 Binding with Kinetic Network Models. Biophys. J. 2017, 113, 785–793.

85. Razavi, A. M.; Voelz, V. A. Kinetic Network Models of Tryptophan Mutations in β-Hairpins Reveal the Importance of Non-Native Interactions. J. Chem. Theory Comput. 2015, 11, 2801–2812.

86. Zhou, G.; Voelz, V. A. Using Kinetic Network Models To Probe Non-Native Salt-Bridge Effects on α-Helix Folding. J. Phys. Chem. B 2016, 120, 926–935.

87. Wan, H.; Zhou, G.; Voelz, V. A. A Maximum-Caliber Approach to Predicting Perturbed Folding Kinetics Due to Mutations. J. Chem. Theory Comput. 2016, 12, 5768–5776.

88. Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983, 79, 926–935.

89. Horn, H. W.; Swope, W. C.; Pitera, J. W.; Madura, J. D.; Dick, T. J.; Hura, G. L.; Head-Gordon, T. Development of an improved four-site water model for biomolecular simulations: TIP4P-Ew. J. Chem. Phys. 2004, 120, 9665–9678.

90. Nüske, F.; Wu, H.; Prinz, J.-H.; Wehmeyer, C.; Clementi, C.; Noé, F. Markov state models from short non-equilibrium simulations—Analysis and correction of estimation bias. J. Chem. Phys. 2017, 146, 094104.

91. Jones, E.; Oliphant, T.; Peterson, P.; et al. SciPy: Open source scientific tools for Python. 2001; http://www.scipy.org/.

92. Hart, K. M.; Ho, C. M. W.; Dutta, S.; Gross, M. L.; Bowman, G. R. Modelling proteins' hidden conformations to predict antibiotic resistance. Nat. Commun. 2016, 7, 12965.

93. McGibbon, R. T.; Beauchamp, K. A.; Harrigan, M. P.; Klein, C.; Swails, J. M.; Hernández, C. X.; Schwantes, C. R.; Wang, L.-P.; Lane, T. J.; Pande, V. S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015, 109, 1528–1532.

94. McGibbon, R. T.; Hernández, C. X.; Harrigan, M. P.; Kearnes, S.; Sultan, M. M.; Jastrzebski, S.; Husic, B. E.; Pande, V. S. Osprey: Hyperparameter Optimization for Machine Learning. J. Open Source Software 2016, 1, 34.

FIG. 7. TOC graphic: coarse-graining an MSM transition matrix with MVCA (variance vs. number of macrostates).