Machine Learning of Biomolecular Reaction Coordinates

defying variance-optimizing methods) and ex- hibit hierarchically coupled processes on various .... trees, a loss function is optimized that grows wit...
0 downloads 7 Views 1MB Size
Subscriber access provided by JAMES COOK UNIVERSITY LIBRARY

Biophysical Chemistry, Biomolecules, and Biomaterials; Surfactants and Membranes

Machine Learning of Biomolecular Reaction Coordinates Simon Brandt, Florian Sittel, Matthias Ernst, and Gerhard Stock J. Phys. Chem. Lett., Just Accepted Manuscript • DOI: 10.1021/acs.jpclett.8b00759 • Publication Date (Web): 09 Apr 2018 Downloaded from http://pubs.acs.org on April 9, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 10

Machine Learning of Biomolecular Reaction Coordinates Simon Brandt1 , Florian Sittel1 , Matthias Ernst and Gerhard Stock∗ Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany ( 1 contributed equally) E-mail: [email protected]

Abstract

MD coordinates



metastable states

machine learning model

We present a systematic approach to reduce the dimensionality of a complex molecular system. Starting with a data set of molecular coordinates (obtained from experiment or simulation) and an associated set of metastable conformational states (obtained from clustering the data), a supervised machine learning model is trained to assign unknown molecular structures to the set of metastable states. In this way, the model learns to determine the features of the molecular coordinates that are most important to discriminate the states. Using a new algorithm that exploits this feature importance via an iterative exclusion principle, we identify the essential internal coordinates (such as specific interatomic distances or dihedral angles) of the system, which are shown to represent versatile reaction coordinates that account for the dynamics of the slow degrees of freedom and explain the mechanism of the underlying processes. Moreover, these coordinates give rise to a free energy landscape that may reveal previously hidden intermediate states of the system. As it is neither possible nor desirable to follow the motion of a large molecule along its 3N atomic coordinates, a low-dimensional representation is needed which in some sense describes the system’s essential dynamics. For example, when we consider biomolecular processes such as protein folding, binding or aggregation, we want to explain the mechanism and

important coordinates

essential internal coordinates

exclusion loop

1.00 0.75 accuracy

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

0.50 0.25 0.00 50

55

60

65

iteration

the underlying structural rearrangements by using a low-dimensional reaction coordinate. 1–8 To this end, a number of efficient and systematic strategies of dimensionality reduction have been developed, 9–12 which aim to identify a few collective variables. Popular methods include principal component analysis 13,14 which represents a linear transformation to coordinates that maximize the variance of the first components, time-lagged independent component analysis 15 which aims to maximize the timescales of the first components, and full correlation analysis which tries to minimize the mutual information between components. 16 Assuming a timescale separation between the slow motion of the first few components (representing the “system”) and the fast motion of the remaining components (representing the “bath”), the first components of the transformation may serve as a multidimensional reaction coordinate, which can be used to calculate transition rates, 1–5 to construct a Langevin model, 17,18 or in various enhanced sampling techniques. 19

To whom correspondence should be addressed

ACS Paragon Plus Environment

1

The Journal of Physical Chemistry Letters

While these systematic dimensionality reduction methods are indispensable to achieve a general preprocessing of high-dimensional data, they often fall short to directly yield appropriate reaction coordinates when complex conformational rearrangements in macromolecules are considered. This is because these processes may involve small structural changes (thus defying variance-optimizing methods) and exhibit hierarchically coupled processes on various timescales 20 (that hamper methods maximizing timescales). Given as linear combinations of high-dimensional input coordinates, moreover, collective variables often do not necessarily point to the essential internal coordinates, i.e., important specific interatomic distances or dihedral angles. Nonetheless, given some reasonable set of collective variables, a clustering of the data in this reduced representation may be performed, 21–23 which indicates the metastable conformational states of the system. By calculating the transition probabilities between these states, a Markov model of the dynamics can be constructed that describes protein dynamics in terms of memory-less jumps. 24–27 Assuming that the metastable conformational states provide a well-defined representation of the lowenergy regions of the free energy landscape, they should also contain valuable information on the essential internal coordinates of the system. The question is just how to retrieve this information. In this work, we propose a systematic approach to identify a low-dimensional set of essential internal coordinates (ECs). The method is based on a combination of state-of-theart clustering and supervised machine learning techniques, see Fig. 1. We start with a data set of 3N -dimensional molecular coordinates, which can be obtained from experiment or molecular dynamics (MD) simulation. Due to inevitable mixing of overall and internal motion, Cartesian coordinates are in general not suitable for dimensionality reduction and are therefore converted to internal coordinates such as interatomic distances and angles. 14 In the following, this data set will be referred to as MD coordinates. By using a suitable dimensionality reduction technique like princi-

pal component analysis, we next construct a set of collective variables, which facilitate a low-dimensional representation of the dynamics. Employing density-based and dynamical clustering techniques, the metastable conformational states of the system are obtained. In the subsequent pivotal step, a supervised machine learning model 28 is trained to assign “new” MD structures (i.e., structures not used for training) to the set of metastable states. Hence the model learns to identify the features of the MD coordinates which are most important to discriminate the states. Extending previous works, 29–31 we propose a new algorithm that exploits this “feature importance” via an iterative exclusion principle (see below) in order to finally obtain the desired ECs. To demonstrate the potential and the performance of the approach, we adopt two well-established model problems, folding of villin headpiece and hinge-bending functional motion of T4 lysozyme. For both systems long MD trajectories and a detailed characterization of their metastable states are available. 32–34 MD coordinates

metastable states

machine learning model

important coordinates

exclusion loop

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 10

essential internal coordinates

Figure 1: Scheme of the machine learning algorithm to identify essential internal coordinates (ECs). Given a data set of MD coordinates and a partitioning into metastable states, a machine learning model assesses the importance of the MD coordinates to discriminate these states. Sorting all MD coordinates by their importance, we remove the coordinate with highest (or lowest) importance from the training set and reiterate the procedure (exclusion loop). The resulting most important coordinates constitute the ECs.

ACS Paragon Plus Environment

2

Page 3 of 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

Machine learning of essential internal coordinates

model and the model parameters employed. The overall accuracy of the model (i.e., the success rate to correctly assign MD structures to states) can be estimated by dividing the available data into a training set and a test set. While the former is used to train the model, the test set (containing new structures that were not used for training) is used to check whether the model is able to correctly classify the data. The model’s state-dependent accuracy is then the relative number of correctly identified structures for that state. A major advantage of decision tree algorithms is the simplicity by which the importance of input features (i.e., the MD coordinates) can be determined. (This is in clear contrast to, e.g., deep learning networks, whose typically high numbers of intermediate nodes, weighted sums and nonlinear transformations render this task extremely challenging. 36 ) For every class, the importance of a coordinate is given by the gain of the loss function value that is obtained from decision rules involving the coordinate. 35 Thus, a MD coordinate that is used in decisions that lead to high gains in the loss function –and therefore to a better classification of structures– is clearly more important for characterizing the state than others. This importance criterion, however, has the problem that it is weighted relative to the importance of all other coordinates. As a consequence, a coordinate might be considered far less important to distinguish one state than another coordinate, simply because the second one was used for the decision prior to the first one, albeit both might equally well describe the state difference. Thus, to attribute a more meaningful importance index to a given coordinate, it must be evaluated independently from the other coordinates considered as highly important. In extension of previous works employing machine learning techniques for the construction of reaction coordinates, 29,30 here we introduce a novel algorithm, that provides an improved estimate of the overall importance of a coordinate. The approach is straightforward: Given a trained model, sort all coordinates by their importance [as given by the XGBoost algorithm, see Eq. (9) in the SI], remove the coordinate with high-

Given a trajectory of MD coordinates and a set of metastable states defined by these coordinates, it is our goal to train a machine learning model that assigns new MD data to the state they most likely belong to. To this end, the model learns classification rules based on certain features of the MD coordinates, e.g., a certain distance should be bigger than a trained cut-off value, or some angles should lie in a certain region. Provided that these features can be retrieved from the model, we may get direct information about the importance of specific coordinates for discriminating the metastable states. A machine learning strategy particularly suited for this task is the decision tree family, which represents a group of algorithms with hierarchical decision rules to classify coordinates. 28 Here we use the XGBoost algorithm 35 which constructs for every given metastable state an ensemble of decision trees, that allow to assign new input data to the metastable state they most likely belong to. A decision tree can be represented by a tree-like graph, that consists of nodes associated with classification rules of the MD coordinates, edges (or branches) corresponding to different outcomes of the classifications, and end nodes (or leafes) which assign some score to the corresponding path. Starting at the tree’s root, a decision path is found by following the branches of the tree, until the leaf nodes finally yield the probability of a set of coordinates to belong to the state. In this way, coordinates can be classified by simply assigning them to the respective state of highest probability. 28 For training the trees, a loss function is optimized that grows with both the number of wrongly classified structures and the tree size (i.e., the number of decisions necessary to classify a state). While we obviously want to minimize the former, the latter is important as regularization to prevent overfitting, since the tree size directly reflects the number of fitting parameters which should be small to reduce bias in the model. See the SI for a more detailed description of the XGBoost

ACS Paragon Plus Environment

3

The Journal of Physical Chemistry Letters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

est (or lowest) importance from the training set and reiterate the procedure (Fig. 1) by retraining the model based on all remaining coordinates. At every iteration, recompute the importance of the remaining coordinates and estimate the accuracy of the trained model. This approach leads to important insights: When discarding the coordinates of highest importance, we can estimate their overall importance independent from the other coordinates. If the accuracy remains the same, discarded and remaining coordinates more or less describe the same dynamics. Furthermore, we can infer from the states whose accuracy change upon discarding a coordinate the dynamics described by the coordinate (and thus identify dynamical features that are important to separate states). On the other hand, when discarding the coordinate of least importance first, one can easily filter out all nonessential coordinates from the data set. These are the coordinates that, when discarded, do not lead to significant changes in the accuracy of the model. In this way, we identify the intrinsic dimensionality of the essential internal coordinates that is necessary to unambiguously discriminate the states. In effect, we have achieved a dimensionality reduction that does not rely on linear or nonlinear combinations of coordinates, but works by simply identifying the most important coordinates of the data.

Page 4 of 10

states, the “Ramacolor” plot in Fig. 2b assigns a unique color code to all (φ, ψ) configurations of a residue and averages over the color values of these configurations. 23 We find that most metastable states exhibit well-defined α-helices, while they clearly differ in the structure of loops and terminal residues. Given these states, we trained a XGBoost model on the full MD data set of dihedral angles, in order to identify those angles which best describe the state’s structural differences. Figure 2c shows the resulting feature importance plot that quantifies how much a single dihedral angle affects the classification of a given metastable state. The resemblance to the Ramacolor plot is remarkable. The most important dihedral angles identified by XGBoost are clearly the ones also highlighted as statedefining by Ramacolor. As explained above, the importance of the coordinates obtained from the feature importance plot are weighted relative to the importance of all other coordinates, which may introduce some bias to the selection scheme. To avoid this problem, the new algorithm sorts all coordinates by their importance, removes the coordinate with lowest importance from the training set and reiterates the procedure. Plotted as a function of the number of discarded coordinates, Fig. 2d shows the accuracy loss of the XGBoost classifier to identify the 12 metastable states. While the accuracy is found to be remarkably stable for all states when discarding up to 60 coordinates (with exception of state 11 whose accuracy drops faster than for the other states), it decreases sharply for most states when the remaining six coordinates are removed. These six variables, given by the backbone dihedral angles φ3 , φ2 , ψ13 , ψ2 , ψ10 and φ33 (ordered by decreasing importance), represent the desired ECs. Because they are necessary and sufficient to discriminate all metastable states of the system, these angles indeed describe the essential dynamics of the system. Interestingly, the accuracy loss plot in Fig. 2d shows that already the single coordinate φ3 is sufficient to identify the most populated states 1 and 2 with high accuracy and state 5 at least roughly. To verify that the six ECs are

The metastable states of HP-35 are determined by six backbone dihedral angles Villin headpiece (HP-35) is a 35-residues protein fragment that represents a standard model of ultrafast protein folding. It consists of a hydrophobic core with three helices that are connected via two unstructured loops (Fig. 2a). As detailed in the SI, we used a 300 µs trajectory of Piana et al., 32 performed a principal component analysis on all complete backbone dihedral angle pairs (dPCA+), 33 and used densitybased clustering 23 and the most probable path algorithm 37 to obtain twelve metastable states. To visualize the secondary structure of these

ACS Paragon Plus Environment

4

Page 5 of 10

a h3

residue

h2 13 3 10

33

h1

b

c

34 30 26 22 18 14 10 6 2

Φ33 0.6 0.4 0.2

Ψ13 Ψ10

0.0

Φ3 1

3

5

7

9

11

1

3

5

state

f

1.00

1.00

1.00

2 1

9

0.50

7

12 4

11

0.25

5 3

6

10 8

0.00 50

55

60

iteration

65

Ψ13

0.10

Ψ2 Φ 33 Φ3

Φ2

0.01 0

1

2

3

Ψ10

autocorrelation

e

0.75

7

9

11

state

d autocorrelation

accuracy

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

4

time/µs

PC 1 PC 4

0.10

PC 3

PC 7 0.01 PC 2

0

1

2

PC 5

3

4

time/µs

Figure 2: Identification of the essential internal coordinates (ECs) of HP-35. (a) Structure of the native state of HP-35, indicating three α-helices h1, h2 and h3 with residues 3-10, 14-19 and 22-32, respectively. (b) Ramacolor plot describing the secondary structure of the 12 metastable states, where green, red and blue indicate α-helical, β-extended and left-handed conformations, respectively. (c) Feature importance plot that indicates the importance of all MD coordinates on the classification of a given metastable state. (d) Accuracy loss of the XGBoost classifier plotted as a function of the number of discarded coordinates. While the 60 least important coordinates hardly matter, the accuracy to identify the 12 metastable states drops drastically for the remaining six coordinates, which therefore constitute the ECs of the system. The thick black line indicates the state-averaged accuracy. The autocorrelation functions of these coordinates shown in (e) overall decay slower than the autocorrelations of the first six principal components shown in (f). indeed sufficient to discriminate all metastable states of HP-35, we performed a reclustering of the MD data using only these coordinates. Comparing the structures of the resulting 12 metastable states to the structures of the initial metastable states, the Ramacolor plots in Fig. S1 reveal that the principal components and the ECs yield virtually the same states. We are now in a position to discuss to what extent the above found ECs represent suitable reaction coordinates to describe the folding of HP-35. Discriminating the metastable states of the system, for one, these coordinates are expected to elucidate the mechanism of the considered process. In fact, we find that φ3 , φ2 and ψ2 account for the structure of the N-terminus that is known to be crucial to discriminate the folded and the intermediate basins, 23 ψ13 and

ψ10 describe the structure of the loop between helices h1 and h2 that is important to discriminate the intermediate and the unfolded basin, and φ33 reports on C-terminal motion. While the functional relevance of the N- and C-termini as well as the h1-h2 loop was recognized in previous work, 23 these studies did not reveal the ECs. This is because collective variables such as the first principal components do not point to specific internal coordinates, but only yield linear combinations of numerous coordinates (Fig. S3). We note that our machine-learning based construction of ECs can also be applied to the case of nonlinear dimensionality reduction approaches such as diffusion maps, 38 whose collective variables are not straightforwardly described in terms of the input coordinates. As explained in the Introduction, further

ACS Paragon Plus Environment

5

The Journal of Physical Chemistry Letters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 10

Detection of a hidden locking mechanism of T4L

common requirements on reaction coordinates include to cover a significant part of the overall variance of the data (which is optimized in principal component analysis, here dPCA+ 33 ) and to achieve a timescale separation with respect to the fast bath fluctuations (which is the aim of time-lagged independent component analysis, TICA 15 ). To test if the ECs of HP-35 satisfy the former requirement, we calculated the variance of the six most important coordinates in the various approaches. We find that dPCA+ covers ∼ 45 % of the total variance, TICA ∼ 13 %, and the ECs ∼ 25 %. Since these numbers are quite different while the structural information given by the various coordinates is rather similar (Fig. S3), we conclude that a selection of ECs based purely on cumulative variance may be misleading. As a measure of the timescales described by the most important coordinates, Fig. 2 compares the autocorrelation functions obtained for (e) the ECs and (f) the first principal components of dPCA+. Remarkably, we find that the decay times of the ECs are in part significantly longer than the results for the principal components, and are overall also longer than the TICA decay times (Fig. S2). Finally we wish to highlight that the timescales of the ECs can be directly related to a corresponding physical process. For example, ψ13 is associated with the relative positions of helices h1 and h2 and as such describes the overall folding/unfolding transition, which corresponds to the slowest process of the system. 23 On the other hand, φ2 and ψ2 account for relatively fast fluctuations of the N-terminal which effect a twisting motion described by φ3 that leads to the destabilization of the protein. 39 The latter is associated with relatively slow rearrangements of ψ10 and φ33 and finally leads to unfolding involving again ψ13 . Highlighting the hierarchical coupling of fast and slow motions, ECs provide a direct view of the dynamics, which is very difficult to achieve from collective coordinates that exclusively maximize variance or timescales.

As an example of complex functional motion we consider T4 lysozyme (T4L), for which a 50 µs MD trajectory at 300 K is available, 34 see SI for details. T4L is a 164-residue enzyme whose interaction with the substrate involves a prominent hinge-bending motion of its two domains, which resembles the opening and closing of the mouth of a ”Pac-Man” (Fig. 3a). Employing some coordinate that reflects this motion (e.g., the distance d21,142 between residues Thr21 and Thr142), the associated free energy profile exhibits two minima corresponding to the open and closed states, that are separated by an energy barrier ∆G . 2kB T (Fig. 3b). However, when modeling the transition rate by the standard expression k = 1/τ = k0 e−∆G/kB T , it becomes obvious that the small energy barrier cannot account for the long (∼ 10 µs) observed transition time τ of T4L. In fact, a recent targeted MD simulation study revealed that trying to directly enforce the open-closed transition does not recover the two-state behavior of T4L, meaning that the opening distance does not represent a suitable reaction coordinate. 34 The comprehensive simulations and analyses rather showed that this transition is triggered by a locking mechanism, by which the side chain of Phe4 changes from a solvent-exposed to a hydrophobically-buried state. 34 This mechanism, which, e.g., can be represented by the distance d4,60 between residues Phe4 and Lys60, leads to a barrier of ∼ 6 kB T between the locked and unlocked states (Fig. 3c) and results in a four-state description of T4L. 34 It should be stressed that systematic dimensionality reduction approaches such as various types of PCA did not indicate this mechanism. In this respect, the hinge-bending motion of T4L represents a prime example of complex functional dynamics whose underlying mechanism is not straightforwardly described by some obvious reaction coordinate. To challenge the machine learning algorithm, here we intentionally adopt the naive view of T4L as a two-state problem and study if the approach can provide additional insight.

ACS Paragon Plus Environment

6

Page 7 of 10

b

>

>

opening distance

g kin loc nce a dist

>

∆G/kB T

a

c ∆G/kB T

>

is key residue of the locking mechanism mentioned above. Although these distances are not evidently associated with the open and closed states used to train the model, they are apparently almost as descriptive as the state defining opening distance.

8 6 4 2 0 0

10

20

A opening distance/˚ 8

Table 1: XGBoost ranking of the most important coordinates describing the open-closed transition of T4L. While the first three iterations of the algorithm yield distances that directly measure this transition, the following iterations reveal distances involving residue Phe4, which is located in the hinge of the protein. The last column lists the feature importance obtained by XGBoost in the first iteration, which ranks the relevance of distances involving Phe4 much lower.

6 4 2 0 5

10

15

locking distance/˚ A

d A locking distance/˚

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

15

10

4

4

3

2

2

0

5

1 5

10

15

20

opening distance/˚ A

iteration 1 2 3 4 5 6 7

Figure 3: (a) Structure of T4L in the open (orange) and closed (blue) state, indicating distances d21,142 and d4,60 which report on the opening and the locking of T4L, respectively. Free energy along (b) the opening distance d21,142 and (c) the locking distance d4,60 . (d) Two-dimensional representation of the free energy landscape (in units of kB T ), indicating four metastable conformational states.

discarded distance Thr21-Gln141 Glu22-Thr142 Glu22-Gln141 Phe4-Lys60 Phe4-Ala63 Phe4-Ile29 Arg8-Phe67

feature importance 1 2 10 50 42 56 65

It should be stressed that these non-obvious coordinates would be hardly discovered when only the standard feature importance is considered. To demonstrate this, Table 1 also shows the feature importance obtained by XGBoost in the first iteration. According to this criterion, the distances that measured directly the open-closed transition are ranked first (rank 1, 2, and 10), while the distances involving Phe4 only show up much later (rank 50, 42 and 56) and therefore would be most likely overlooked. The relevance of these coordinates is only evident from the iterative training of the model excluding the more important observables. Motivated by the findings of Table 1, we now investigate the relevance of the highlighted coordinates. Since the three most important coordinates are redundant as they account more or less for the same distance, in the following we only consider the distance between Thr21 and Thr142 as coordinate describing directly the open-closed transition. Showing the two-

To this end, we characterize the system using distance coordinates of all residue-residue distances which are considered contacts (i.e., A), define open have a distance of less than 4.5 ˚ and closed states of T4L based on the opening distance d21,142 and train an XGBoost model. In particular, at every iteration we remove the most important coordinate from the data set to force XGBoost to reclassify the importance of alternative descriptors. Table 1 shows the discarded distances of the first few iterations of the algorithm. As expected, the first few most important distances connect the two sides of the binding pocket (see Fig. 3a) and therefore directly monitor the open-closed transition of T4L. Surprisingly, though, the next three distances all involve residue Phe4, which is located in the “hinge” region of the protein and

ACS Paragon Plus Environment

7

The Journal of Physical Chemistry Letters

a 1.00

1 4

0.75 0.50

2

0.25

3

0.00

b

1.00

accuracy

dimensional free energy landscapes along opening distance d21,142 and the locking distance d4,60 , Fig. 3d reveals that the dynamics of T4L actually amounts to a four-state (rather than a two-state) system. Besides the main open and closed states, we find two intermediate states associated with the locking transition. Although these states are only weakly recognized in Fig. 3d, they are clearly identified by density-based clustering. 23 Comparison of the structures of these states with the structures of the four metastable states found in the targeted MD study by Ernst et al. 34 yields virtually perfect agreement (Fig. S4). However, while the latter authors spend considerable effort, the XGBoost algorithm found these hidden conformational states semiautomatically, starting from the naive assumption of a two-state system. Let us finally mention another useful application of the XGBoost algorithm, that is, to decide on the capability of some given set of coordinates to discriminate the metastable states of the system. As an example, Fig. 4 shows the accuracy loss of two XGBoost models assuming a four-state system, which were trained using either only backbone dihedral angles or only interresidue distances. While the latter plot clearly shows that a few distances are required to distinguish the intermediate states 2 and 3, the former readily reveals that dihedral angles are not well suited to describe the conformational dynamics of T4L, because they describe the intermediate states only with very low accuracy. The same conclusion was reached in Ref. 34 after a detailed discussion of PCA and clustering results using both types of coordinates.

accuracy

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 10

4 3

0.75 0.50

1

2

0.25 0.00

0

100

200

iteration

300

392

396

400

iteration

Figure 4: Accuracy loss plots of T4L obtained by XGBoost models assuming a four-state system, discarding the least important (a) backbone dihedral angles and (b) interresidue distances. Fig. 2), and the XGBoost-based algorithm using an iterative scheme to determine the most (or least) important coordinates (see Table 1). The final outcome of the procedure is an accuracy loss plot that reveals the essential internal coordinates (ECs) such as specific interatomic distances or dihedral angles. We note that commonly used collective variables in general cannot point out important internal coordinates. Using two well-established models of protein dynamics, HP-35 and T4L, we have shown that the ECs provide versatile reaction coordinates, which account for the dynamics of the slow degrees of freedom and directly describe the mechanism of the process. In the case of HP-35, for example, the ECs have been found to exhibit slow timescales (overall slower than available collective variables), which could be directly related to some functional rearrangement of the system. Finally we wish to point out that our machine-learning based approach is able to discover new features that are not obvious from the initially provided collective variables or metastable states. In the case of T4L, for example, the free energy landscape along the ECs reveals additional metastable states that account for a new reaction mechanism, which has only recently been detected independently with considerable effort. 34 Due to the general structure of the machine learning ansatz, our approach allows for a variety of extensions, including use of dynamical information in the training of the model, modeling of nonequilibrium data, and semiautomatic

Concluding remarks We have described a supervised machine learning approach to reduce the dimensionality of a complex molecular system. The method relies on three components: choice of appropriate input coordinates that allow to discriminate metastable conformational states (see Fig. 4), accurate definition of these states as obtained by density based and dynamical clustering (see

ACS Paragon Plus Environment

8

Page 9 of 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

detection of reaction coordinates by screening a catalogue of typical order parameters used in protein dynamics analysis.

(17) Micheletti, C.; Bussi, G.; Laio, A. Optimal Langevin modeling of out-of-equilibrium molecular dynamics simulations. J. Chem. Phys. 2008, 129, 074105. (18) Hegger, R.; Stock, G. Multidimensional Langevin modeling of biomolecular dynamics. J. Chem. Phys. 2009, 130, 034106. (19) Chipot, C.; Pohorille, A. Free Energy Calculations; Springer: Berlin, 2007. (20) Buchenberg, S.; Schaudinnus, N.; Stock, G. Hierarchical biomolecular dynamics: Picosecond hydrogen bonding regulates microsecond conformational transitions. J. Chem. Theo. Comp. 2015, 11, 1330–1336. (21) Keller, B.; Daura, X.; van Gunsteren, W. F. Comparing geometric and kinetic cluster algorithms for molecular simulation data. J. Chem. Phys. 2010, 132, 074110. (22) Bowman, G. R.; Meng, L.; Huang, X. Quantitative comparison of alternative methods for coarse-graining biological networks. J. Chem. Phys. 2013, 139, 121905. (23) Sittel, F.; Stock, G. Robust Density-Based Clustering to Identify Metastable Conformational States of Proteins. J. Chem. Theo. Comp. 2016, 12, 2426–2435. (24) Chodera, J. D.; Swope, W. C.; Pitera, J. W.; Dill, K. A. Obtaining long-time protein folding dynamics from shorttime molecular dynamics simulations. Multiscale Modeling & Simulation 2006, 5, 1214–1226. (25) Buchete, N.-V.; Hummer, G. Coarse master equations for peptide folding dynamics. J. Phys. Chem. B 2008, 112, 6057–6069. (26) Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys. 2009, 131, 124101. (27) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Sch¨ utte, C.; Noe, F. Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 2011, 134, 174105. (28) Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning: data mining, inference and prediction; Springer: New York, 2009. (29) Ma, A.; Dinner, A. R. Automatic method for Identifying Reaction Coordinates in Complex Systems. J. Phys. Chem. B 2005, 109, 6769–6779. (30) Sultan, M. M.; Kiss, G.; Shukla, D.; Pande, V. S. Automatic selection of order parameters in the analysis of large scale molecular dynamics simulations. J. Chem. Theory Comput. 2014, 10, 5217–5223. (31) Ballard, A. J.; Das, R.; Martiniani, S.; Mehta, D.; Sagun, L.; Stevenson, J. D.; Wales, D. J. Energy landscapes for machine learning. Phys. Chem. Chem. Phys. 2017, 19, 12585–12603. (32) Piana, S.; Lindorff-Larsen, K.; Shaw, D. E. Protein folding kinetics and thermodynamics from atomistic simulation. Proc. Natl. Acad. Sci. USA 2012, 109, 17845–17850. (33) Sittel, F.; Filk, T.; Stock, G. Principal component analysis on a torus: Theory and application to protein dynamics. J. Chem. Phys. 2017, 147, 244101. (34) Ernst, M.; Wolf, S.; Stock, G. Identification and validation of reaction coordinates describing protein functional motion: Hierarchical dynamics of T4 Lysozyme. J. Chem. Theo. Comp. 2017, 13, 5076 – 5088. (35) Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. CoRR 2016, abs/1603.02754 . (36) LeCun Yann,; Bengio Yoshua,; Hinton Geoffrey, Deep Learning. Nature 2015, 521, 436. (37) Jain, A.; Stock, G. Identifying metastable states of folding proteins. J. Chem. Theo. Comp. 2012, 8, 3810 – 3819. (38) Das, P.; Moll, M.; Stamati, H.; Kavraki, L. E.; Clementi, C. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc. Natl. Acad. Sci. USA 2006, 103, 9885–9890.

Acknowledgment We thank Benjamin Lickert for instructive and helpful discussions and D. E. Shaw Research for sharing their trajectories of HP-35. This work has been supported by the Deutsche Forschungsgemeinschaft (Sto 247/11).

Supplementary Material MD and XGBoost methods, details of dPCA+ and TICA of HP-35, comparison of metastable states obtained from principal components and from XGBoost.

References (1) Bolhuis, P. G.; Dellago, C.; Chandler, D. Reaction coordinates of biomolecular isomerization. Proc. Natl. Acad. Sci. USA 2000, 97, 5877 – 5882. (2) Faradjian, A. K.; Elber, R. Computing time scales from reaction coordinates by milestoning. J. Chem. Phys. 2004, 120, 10880–10889. (3) Best, R. B.; Hummer, G. Reaction coordinates and rates from transition paths. Proc. Natl. Acad. Sci. USA 2005, 102, 6732–6737. (4) Krivov, S. V.; Karplus, M. Diffusive reaction dynamics on invariant free energy profiles. Proc. Natl. Acad. Sci. USA 2008, 105, 13841 – 13846. (5) E, W.; Vanden-Eijnden, E. Transition-Path Theory and Path-Finding Algorithms for the Study of Rare Events. Annu. Rev. Phys. Chem. 2010, 61, 391–420. (6) Rohrdanz, M. A.; Zheng, W.; Clementi, C. Discovering Mountain Passes via Torchlight: Methods for the Definition of Reaction Coordinates and Pathways in Complex Macromolecular Reactions. Annu. Rev. Phys. Chem. 2013, 64, 295–316. (7) Bittracher, A.; Koltai, P.; Klus, S.; Banisch, R.; Dellnitz, M.; Sch¨ utte, C. Transition Manifolds of Complex Metastable Systems. J. Nonlin. Science 2018, 28, 471–512. (8) McGibbon, R. T.; Husic, B. E.; Pande, V. S. Identification of simple reaction coordinates from complex dynamics. J. Chem. Phys. 2017, 146, 044109. (9) Hyv¨ arinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis; John Wiley & Sons: New York, 2001. (10) Jolliffe, I. T. Principal Component Analysis; Springer: New York, 2002. (11) Benner, P.; Mehrmann, V.; Sorensen, D. C. Dimension reduction of large-scale systems; Springer: New York, 2005. (12) Lee, J. A.; Verleysen, M. Nonlinear dimensionality reduction; Springer: New York, 2007. (13) Amadei, A.; Linssen, A. B. M.; Berendsen, H. J. C. Essential dynamics of proteins. Proteins 1993, 17, 412–425. (14) Mu, Y.; Nguyen, P. H.; Stock, G. Energy Landscape of a Small Peptide Revealed by Dihedral Angle Principal Component Analysis. Proteins 2005, 58, 45 – 52. (15) Perez-Hernandez, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noe, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 2013, 139, 015102. (16) Lange, O. F.; Grubm¨ uller, H. Full correlation analysis of conformational protein dynamics. Proteins 2008, 70, 1294 – 1312.

ACS Paragon Plus Environment

9

The Journal of Physical Chemistry Letters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(39) Jain, A.; Stock, G. Hierarchical folding free energy landscape of HP35 revealed by most probable path clustering. J. Phys. Chem. B 2014, 118, 7750 – 7760.

ACS Paragon Plus Environment

10

Page 10 of 10