
Effect of Clustering Algorithm on Establishing Markov State Model for Molecular Dynamics Simulations

Yan Li* and Zigang Dong*

The Hormel Institute, University of Minnesota, Austin, Minnesota 55912, United States

ABSTRACT: Recently, the Markov state model has been applied for kinetic analysis of molecular dynamics simulations. However, discretization of the conformational space remains a primary challenge in model building, and it is not clear how the space decomposition by distinct clustering strategies exerts influence on the model output. In this work, different clustering algorithms are employed to partition the conformational space sampled in opening and closing of fatty acid binding protein 4 as well as inactivation and activation of the epidermal growth factor receptor. Various classifications are achieved, and Markov models are set up accordingly. On the basis of the models, the total net flux and transition rate are calculated between two distinct states. Our results indicate that geometric and kinetic clustering perform equally well. The construction and outcome of Markov models are heavily dependent on the data traits. Compared to other methods, a combination of Bayesian and hierarchical clustering is feasible in identification of metastable states.



INTRODUCTION

In living organisms, proteins work as miniaturized machines in complex interaction networks designed to execute their functions in response to environmental stimuli. Computational and experimental studies support the view that protein dynamics and conformational changes play critical roles in regulating biological activity.1 There is universal agreement that proteins sample an ensemble of conformational states (metastable states or macrostates) and cross the energy barriers between them over a spectrum of time scales. Generally, a protein dwells in a metastable state for a long time before switching to another one, and thereby, the transition is a rare event. Within each metastable state, there may exist a great number of microstates corresponding to energy minima on the free energy landscape. The transitions between microstates are fast and frequent due to low energy barriers. All-atom molecular dynamics (MD) simulation is a powerful tool and has been widely employed to investigate the relationship between the dynamical personality and functionality of proteins in microscopic detail.2 Even with the high performance computing currently available, however, it is still computationally intractable to reach the time scale on which the transition occurs. To overcome the sampling problem, the Markov state model (MSM) has been applied for analysis of large-scale MD simulations to acquire a quantitative description of protein dynamics.3,4 With an MSM, it is not necessary to run an extremely long MD simulation comparable with the time scale of interest. Consequently, the conformational changes happening on millisecond or longer time scales can be studied by stitching together swarms of MD trajectories in the nanosecond or microsecond regime. The quality of an MSM relies on correctly

recognizing metastable states and estimating interconversion between them. However, the identification of metastable states still remains a primary challenge in MSM building.5−8 Essentially, decomposition of the conformational space is a clustering question. A variety of clustering methods have been applied for the analysis of conformational searches,9 docking poses,10 and MD trajectories.11−16 The most popular approach is geometric clustering, including the classic algorithms of k-means, hierarchical clustering (HC), and Bayesian clustering (BC), which groups similar conformations together on the basis of structural features. An alternative method is kinetic clustering, which categorizes data on the basis of space discretization and transition probability estimation. Representative algorithms are Perron-cluster cluster analysis (PCCA) and its improvement, PCCA+. The k-means and PCCA+ are implemented in the MSM software packages Emma17,18 and MSMBuilder2.19 A combination of the BC and HC has been utilized to establish an MSM from extensive MD simulations.20 In kinetic clustering, the dynamic information stored in the data is taken into account, and thereby, kinetically similar conformations are lumped together. When geometric clustering is adopted for MSM construction, it assumes that separation of the conformational space by geometric similarity is also appropriate for kinetic analysis.16 Previously, the statistical errors introduced into MSMs by discretization,21 coarse-graining methods,22 and force fields23 have been well discussed. Here, we survey the effect of different clustering algorithms on MSM construction and output. Three k-means


algorithms (Hartigan−Wong, Lloyd/Forgy, and MacQueen), the BC, and the PCCA+ are compared. These methods are employed to divide the conformational space of two biomolecular systems earlier reported, fatty acid binding protein 4 (FABP4)24 and epidermal growth factor receptor (EGFR).25 With metastable states identified, MSMs are set up, and then, the total net flux and transition rate are estimated.



METHODS

Molecular Dynamics. The details of the MD simulations were described in prior reports.24,25 The MD simulations were carried out with Amber 11.26 The equations of motion were solved with the leapfrog integration algorithm with a time step of 2 fs. The lengths of all bonds involving hydrogen atoms were kept constrained with the SHAKE algorithm. The particle mesh Ewald (PME) method was applied for treating long-range electrostatic interactions. Periodic boundary conditions were used in all simulations. Three EGFR structures were taken from the Protein Data Bank (PDB): 2ITW,27 the DFG-in active form; 3W32,28 the Src-like inactive form; and 4I20,29 the DFG-out inactive form. The FABP4 structure was extracted from the PDB entry 1ALB.30 The protein was modeled using the Amber ff03 force field.31 The starting structure was explicitly solvated in a rectangular box of TIP3P water molecules with a minimal distance of 10 Å from the protein to the edges of the box. Counter ions (Na+/Cl−) were added to neutralize uncompensated charges. After equilibration, the production run was carried out in the NVT ensemble with the Langevin thermostat at 300 K using the parallel CUDA version of PMEMD32 on two GPUs. The trajectories were sampled at a time interval of 10 ps. For FABP4, the production run lasted for 1.2 μs. For EGFR, each of 47 MD trajectories ran for 200 ns.

Bayesian Clustering (BC). Bayesian clustering is an unsupervised method that seeks the optimal solution by calculating the maximum posterior parameters and the most probable classification. The clustering process starts with a series of seed numbers, which give an initial guess of the number of clusters. Then, a random classification is generated and refined until a local maximum is found. Thus, it eliminates the requirement to specify the number of clusters beforehand.33 The Bayesian clustering was performed using AutoClass C 3.3.634 with default seed numbers. In the FABP4 case, the clustering was also carried out with seed numbers randomly generated between 1 and 100. The maximal iteration step was 200 in the FABP4 case and 1000 in the EGFR case. The iteration step was raised in order to guarantee convergence so that consistency between distinct runs could be obtained. When the clustering did not converge, classifications from independent runs would be quite different. The data were assumed to follow the normal distribution. The classification with the highest probability was chosen as the optimal solution.

k-Means Clustering. The k-means method aims to partition a data set M = [x1, x2, ..., xn] into K homogeneous clusters with centers C = [c1, c2, ..., cK]; the partition is a solution to a minimization problem. The number K needs to be specified in advance as a parameter. The Hartigan−Wong algorithm tries to find a set of C that minimizes the within-cluster sum of squares of errors (SSE): SSE = Σ_i N_i Σ_j ||x_ij − c_i||^2/(N_i − 1). It first chooses the initial centroids. The data points are then assigned to the centroid nearest them, and the centroid is calculated as the mean of the assigned data. When the centroid has been

updated, the SSE is computed. If the SSE decreases, the data point is assigned to the new cluster. The iteration is repeated until the cluster assignments no longer change. The Lloyd/Forgy algorithm is a batch (or offline) centroid model. The difference between them is that the data distribution is considered to be discrete in the Lloyd algorithm but continuous in the Forgy algorithm. They search for the partition of the data space with the optimal solution

E = Σ_i Σ_j d(c_i, x_ij)   (discrete distribution)

E = Σ_i ∫ ρ(x) d(c_i, x) dx   (continuous distribution)

where ρ(x) is the probability density function and d is the distance function. The MacQueen algorithm is an incremental (or online) model. It is more efficient in that it updates the centroids more often and usually needs only one complete pass through the data to converge, although it can be computationally expensive on a large data set.35 The k-means clustering was performed using the statistical software R 3.1,36 in which the Lloyd and Forgy algorithms were implemented with the same procedure. The maximal iteration step was set to 100 in the FABP4 case and 500 in the EGFR case. Nearly identical classifications were achieved from the three algorithms after the clustering converged.

Hierarchical Clustering. Hierarchical clustering groups a set of data with a sequence of nested partitions. The agglomerative algorithm starts with each individual data point as a singleton cluster. A proximity matrix between clusters is calculated based on a distance function. In the proximity matrix, the minimal distance is searched for when combining two clusters into a new one. Then, the proximity matrix is updated by calculating the distances between the new cluster and the other clusters. This procedure is repeated until all data are assigned to one cluster.35 The hierarchical clustering was carried out using the Ward algorithm implemented in R 3.1.36 The dissimilarity between the classes obtained from Bayesian clustering was measured by the Euclidean distance between class means. The optimal number of clusters was determined by variation analysis of the height, which measures the closeness of clusters.

PCCA+. The PCCA+ lumping was performed using Emma 1.4.17 In the PCCA+ method, the stationary distribution and transition matrix are computed, given an initial Markov model. Representative microstates are recognized with maximum pairwise distances in the first eigenvectors. The similarity between each microstate and the representative ones is measured by the convex coordinates. Each microstate is then assigned to the closest representative, and the metastable states are formed.37 Like the k-means, the number of metastable states needs to be specified beforehand as a parameter.

Markov State Model (MSM). The MSM calculation was carried out using the software Emma 1.417 as described previously.20 The transition probability matrix was computed with a sliding window of the lag time from 5 to 120 ns on each MD trajectory. The reversible MSMs were constructed by means of maximum likelihood estimation. An alternative approach based on the Bayesian posterior and natural priors has been suggested to improve the statistical uncertainty.38,39 The Markovianity of models was checked by analyzing the behaviors of the implied time scale: If the implied time scale is


solutions have the cluster number of 4 or 5. Therefore, the optimal number of classes is determined to be 4 or 5 by the BC. The best solutions of the five runs resemble each other (Table 1). In each of them, there are two clusters centering around 6.0 and 8.0 Å, and another two clusters have the adjacent average of about 10.0−11.0 Å, while the standard deviations indicate diverse data distribution. The difference among them is that the five-cluster classification includes one cluster with the mean of 5.0 Å, which does not emerge in the four-cluster solution. Furthermore, we investigate if the seed numbers exert influence over the clustering. Five independent BC runs have been performed with the seed numbers randomly chosen. The optimal number of clusters is 5 in three runs and 4 in the other two runs. Similarly, the cluster numbers of 4 and 5 are preferred in the top 10 classifications of each run. In the best solutions listed in Table 2, the clusters revolve around 5.0, 6.0, 8.0, and 10.0−11.0 Å, respectively, which is consistent with the classification in Table 1. Accordingly, the seed numbers have little impact on the clustering. In our testing, distinct BC runs may generate the uniform classification. The clustering has been repeated until the optimal solutions are different, though they still look alike. In a variety of independent runs, no matter if the seed numbers are random or not, the optimal number of classes decided by BC is always 4 or 5, and the clusters have the similar mean value, though the size alters. The results clearly indicate that the BC is pretty stable in partitioning MD data. Our conclusions disagree with the prior report possibly because of the implementation of the algorithm. In this work, the latest version of AutoClass C 3.3.634 is employed. Given that the seed numbers’ influence is negligible, we only focus on the clustering with the default seed numbers in the following. 1.2. BC/HC and k-Means. 1.2.1. BC/HC. According to Figure 1, there should be three metastable states in our intuition, while the optimal number of classes obtained from the BC is 4 or 5. So, the clusters created by the BC are assembled into three classes using the HC. In the HC trees (Figure 2), clusters alike in the mean are gathered together. For example, the two clusters surrounding 10.0−11.0 Å are aggregated with each other, and the two clusters with the average of 5.0 and 6.0 Å are congregated. In the three-state solution (Table 3), the classes in all five BC runs center round 6.0, 8.0, and 10.0 Å, respectively, which agrees very well with Figure 1. 1.2.2. k-Means. The FABP4 data are then divided into 3, 4, and 5 classes using three k-means algorithms of Hartigan− Wong, Lloyd/Forgy, and MacQueen (Table 4). In the threeclass solution, the classes generated by the three algorithms are almost identical in the size and mean (6.1, 8.3, and 11.1 Å). In the four- and five-class solution, various classes are available. We compare the clustering results from the BC/HC and kmeans. When the data are separated into three classes, both

independent of the lag time, the model is Markovian; otherwise, the model is not Markovian. The implied time scale was calculated by τi = −τ/ln λi, where λi is the i-th eigenvalue of the transition matrix with the lag time of τ. After the lag time was determined, the transition probability matrix was calculated. The transition rates between the open and closed states (FABP4) as well as between the DFG-in active and DFG-out inactive states (EGFR) were estimated by the mean first passage time (MFPT). The MFPT, defined as the mean time (f_i) needed to reach the final state for the first time from the initial state, was obtained by solving the transition probability matrix as suggested.40
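The quantities described above can be illustrated with a short, self-contained sketch. This is not the Emma implementation used in this work; the function names and the discrete trajectory dtraj are hypothetical, and a row-stochastic transition matrix is assumed.

```python
import numpy as np

def transition_matrix(dtraj, n_states, lag):
    """Row-normalized transition matrix estimated from a discrete
    trajectory with a sliding window at the given lag (in frames)."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    counts += 1e-12                              # guard against empty rows
    return counts / counts.sum(axis=1, keepdims=True)

def implied_timescales(T, lag_time):
    """Implied time scales tau_i = -tau / ln(lambda_i) for the
    nontrivial eigenvalues of the transition matrix."""
    evals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return -lag_time / np.log(evals[1:])

def mfpt(T, targets, lag_time):
    """Mean first passage time to a set of target states, obtained by
    solving (I - T_oo) f = 1 over the non-target states; each Markov
    step corresponds to one lag time."""
    n = T.shape[0]
    others = [i for i in range(n) if i not in targets]
    A = np.eye(len(others)) - T[np.ix_(others, others)]
    f = np.linalg.solve(A, np.ones(len(others)))
    out = np.zeros(n)
    out[others] = f * lag_time
    return out
```

With 10 ps frames, a 110 ns lag corresponds to lag = 11,000 frames, and passing lag_time = 110.0 returns implied time scales and MFPTs in nanoseconds.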



RESULTS

1. FABP4. FABP4, responsible for transportation of fatty acids and other lipids, has been identified as a potential target for diabetes and cancer therapy.41,42 It has a buried binding pocket, and the opening/closing of the binding cavity has been investigated.24 The openness of FABP4 is measured by the distance between Thr29 and Phe57. From a 1.2 μs MD simulation, it is observed that there are three populated conformations (Figure 1). The peaks at 6.0, 8.0, and 11.0 Å

Figure 1. Distribution of the distance between Thr29 and Phe57 obtained from a 1.2 μs MD simulation of an apo FABP4. The density curve is colored red.

denote the closed, half-open, and open state, respectively. Intuitively, the FABP4 data should be divided into three classes. In this section, we evaluate the performance of geometric and kinetic clustering on the simple one-dimensional data. 1.1. Stability of BC. In an earlier report it has been pointed out that the BC results may be inconsistent and influenced by the seed numbers.13 Here, we examine the stability of the BC algorithm first. Five independent BC runs have been carried out with the default seed numbers. The optimal number of clusters is 4 in three runs and 5 in the other two runs. Inspection of the top 10 classifications in each run shows that almost all of the

Table 1. Size and Mean Distance (Å) between Thr29 and Phe57, with Standard Deviation, in the Optimal Classification Obtained from Five Independent Runs of Bayesian Clustering Using Default Seed Numbers

run 1: class 0: 36,159 frames, 6.3 ± 0.4; class 1: 37,818, 8.1 ± 0.5; class 2: 14,514, 11.0 ± 1.9; class 3: 20,540, 10.6 ± 0.5; class 4: 10,969, 5.3 ± 0.2
run 2: class 0: 37,659 frames, 9.8 ± 1.8; class 1: 31,084, 6.3 ± 0.3; class 2: 26,961, 8.1 ± 0.3; class 3: 13,970, 5.4 ± 0.3; class 4: 10,326, 10.7 ± 0.2
run 3: class 0: 48,056 frames, 6.1 ± 0.6; class 1: 28,463, 10.4 ± 1.7; class 2: 32,011, 8.1 ± 0.4; class 3: 11,470, 10.6 ± 0.2
run 4: class 0: 45,477 frames, 6.0 ± 0.5; class 1: 40,083, 8.1 ± 0.5; class 2: 16,344, 11.0 ± 1.8; class 3: 18,096, 10.6 ± 0.4
run 5: class 0: 48,315 frames, 6.1 ± 0.6; class 1: 25,740, 10.6 ± 1.6; class 2: 33,342, 8.1 ± 0.4; class 3: 12,603, 10.6 ± 0.3


Table 2. Size and Mean Distance (Å) between Thr29 and Phe57, with Standard Deviation, in the Optimal Classification Achieved from Five Independent Runs of Bayesian Clustering Using Random Seed Numbers

run 1: class 0: 34,398 frames, 10.0 ± 1.8; class 1: 28,322, 6.4 ± 0.3; class 2: 28,860, 8.1 ± 0.3; class 3: 17,176, 5.5 ± 0.3; class 4: 11,244, 10.7 ± 0.2
run 2: class 0: 47,760 frames, 6.1 ± 0.6; class 1: 20,463, 10.7 ± 1.7; class 2: 36,014, 8.1 ± 0.5; class 3: 15,763, 10.6 ± 0.3
run 3: class 0: 47,965 frames, 6.1 ± 0.6; class 1: 25,760, 10.6 ± 1.6; class 2: 33,981, 8.1 ± 0.4; class 3: 12,294, 10.6 ± 0.3
run 4: class 0: 41,099 frames, 8.2 ± 0.6; class 1: 32,549, 6.3 ± 0.4; class 2: 13,905, 11.3 ± 1.8; class 3: 13,712, 5.4 ± 0.3; class 4: 18,735, 10.5 ± 0.4
run 5: class 0: 20,791 frames, 10.5 ± 1.9; class 1: 38,136, 6.2 ± 0.4; class 2: 34,995, 8.1 ± 0.4; class 3: 17,341, 10.6 ± 0.4; class 4: 8737, 5.3 ± 0.2
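As a rough stand-in for the model-based clustering used above (AutoClass C itself is not shown here), a Gaussian mixture scored by the Bayesian information criterion can likewise pick the number of one-dimensional classes automatically. The distances array below is synthetic and only imitates the three populated basins of Figure 1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the Thr29-Phe57 distance trace (Angstrom)
rng = np.random.default_rng(0)
distances = np.concatenate([rng.normal(6.0, 0.5, 50000),
                            rng.normal(8.0, 0.5, 40000),
                            rng.normal(10.8, 1.5, 30000)])
X = distances.reshape(-1, 1)

# Fit mixtures with 1-8 components and keep the one with the lowest BIC,
# mimicking an automatic choice of the number of clusters
models = [GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
          for k in range(1, 9)]
best = min(models, key=lambda m: m.bic(X))
labels = best.predict(X)

for k in range(best.n_components):
    members = distances[labels == k]
    print(f"class {k}: size {members.size}, "
          f"mean {members.mean():.1f} +/- {members.std():.1f} A")
```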

methods create clusters with very similar centroids (Tables 3 and 4), which is consistent with the observation in Figure 1. When the data are partitioned into 4 or 5 clusters (Tables 1 and 4), the two methods display their own characteristics. With the k-means, the classes are evenly distributed, while with the BC some classes are closer to each other than others. It is intriguing that, with a class number of three, different algorithms lead to nearly the same results, which may suggest that the three-state assumption is reasonable.

1.3. Lag Time and PCCA+. To inspect the influence of space decomposition on the PCCA+, the FABP4 data are discretized into 32, 76, and 146 bins manually and grouped into 32, 76, and 146 classes by the three k-means algorithms, which are regarded as microstates. Then, the behaviors of the implied time scales as a function of the lag time (τ) are scrutinized. As shown in Figures 3−5, the model can be considered to be Markovian after 110 ns with any method (BC/HC, k-means, and bins). Therefore, the lag time is determined to be 110 ns, which is hardly affected by the clustering algorithm and the number of classes. With the lag time of 110 ns, the 32, 76, and 146 microstates are congregated into 3, 4, and 5 macrostates by means of the PCCA+, respectively. In most cases, the microstates cannot be assembled into 5 macrostates. With the MacQueen, the 146 microstates cannot be further lumped into 3 or 4 macrostates

Figure 2. Hierarchical clustering of classes obtained from independent Bayesian clustering runs. The runs 1−5 are arranged from left to right. The class number in the upper panel corresponds to that in Table 1. The three-state solution is drawn in a red box. The height measures the closeness of the clusters. Grouping into two clusters is not a good choice because of the large height values.

due to matrix singularity. The three-state solution (Table 5) is quite different from Figure 1, possibly because Figure 1 does not capture the kinetic information. Thus, the metastable states generated by geometric and kinetic clustering are inconsistent.

1.4. Effect on MSM. After various classifications have been acquired from the BC/HC, k-means, and PCCA+, MSMs are constructed and assessed by the total net flux between the open and closed state as well as the ratio of the opening and closing rates (r_open/r_close). With the BC (Figure 6A), distinct BC runs lead to similar results even though they have different classes (4 or 5 states). In five independent runs, the average flux is 0.7 ± 0.1 × 10^6 s^-1 and the average r_open/r_close is 0.5 ± 0.1. After lumping into three states with the HC, the flux rises to 1.3 ± 0.1 × 10^6 s^-1 and the rate proportion increases to 1.1 ± 0.1. Therefore, the uncertainty in the MSM outcome induced by the diversity of the BC results is not significant. With the k-means (Figure 6A), the MSM output from the three algorithms is nearly identical when the number of states is uniform, though the four- or five-state classifications are dissimilar, as aforementioned. The flux ascends from 0.9 to 1.1 × 10^6 s^-1 and the ratio r_open/r_close mounts from 0.75 to 0.88 when the number of classes decreases from five to three. Thus, the choice of k-means algorithm has little effect on the MSM. With the PCCA+ (Figure 6B), the flux ranges from 0.3 to 1.8 × 10^6 s^-1 and the rate ratio varies from 0.1 to 1.9, depending on the number of microstates/macrostates and the clustering method. With either the three- or four-state model, in most cases, the flux is over 1.0 × 10^6 s^-1 and the ratio is below 1.0. According to Figure 6, the MSM outcome is affected by both the number of states and the clustering algorithm. With geometric clustering (BC/HC and k-means), generally speaking, the more states the MSM has, the lower the flux and the fewer opening events we observe. With kinetic clustering (PCCA+), the relationship between the MSM output and the state number is less systematic. We then compare the flux and rate proportion from the three-state models generated by different clustering methods, which we consider the most trustworthy. Using the BC/HC and k-means, the flux is 1.3 ± 0.1 and 1.1 × 10^6 s^-1, respectively. With the PCCA+, in most cases the flux is between 1.0 and 2.0 × 10^6 s^-1. Thus, the net flux based on distinct clustering algorithms is fairly close. On the other hand, the rate ratio is 1.1 ± 0.1 and 0.88 using the BC/HC and k-means; with the PCCA+, the ratio

Table 3. Size and Mean Distance (Å) between Thr29 and Phe57, with Standard Deviation, in Three Metastable States Obtained from Five Bayesian Clustering Runs and Hierarchical Clustering

run 1: class 1: 47,128 frames, 6.1 ± 0.6; class 2: 37,818, 8.1 ± 0.5; class 3: 35,054, 10.7 ± 1.3
run 2: class 1: 45,054 frames, 6.0 ± 0.5; class 2: 26,961, 8.1 ± 0.3; class 3: 47,985, 10.0 ± 1.7
run 3: class 1: 48,056 frames, 6.1 ± 0.6; class 2: 32,011, 8.1 ± 0.4; class 3: 39,933, 10.5 ± 1.4
run 4: class 1: 45,477 frames, 6.0 ± 0.5; class 2: 40,083, 8.1 ± 0.5; class 3: 34,440, 10.8 ± 1.3
run 5: class 1: 48,315 frames, 6.1 ± 0.6; class 2: 33,342, 8.1 ± 0.4; class 3: 38,343, 10.6 ± 1.3
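A minimal sketch of the lumping step, using the run-1 class means and sizes of Table 1 and SciPy's Ward linkage in place of the R implementation used in this work:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Means (Angstrom) and sizes of the five BC classes from run 1 (Table 1)
class_means = np.array([[6.3], [8.1], [11.0], [10.6], [5.3]])
class_sizes = np.array([36159, 37818, 14514, 20540, 10969])

# Ward linkage on the Euclidean distance between class means,
# then cut the tree into three metastable states
Z = linkage(class_means, method="ward")
states = fcluster(Z, t=3, criterion="maxclust")

for s in np.unique(states):
    members = states == s
    mean = np.average(class_means[members, 0], weights=class_sizes[members])
    print(f"state {s}: classes {np.where(members)[0].tolist()}, "
          f"size {class_sizes[members].sum()}, mean {mean:.1f} A")
```

Cutting the tree at three clusters merges the 5.3 and 6.3 Å classes and the two 10−11 Å classes, which reproduces the run-1 grouping of Table 3.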


Table 4. Size and Mean Distance (Å) between Thr29 and Phe57 of Classes Obtained from Three k-Means Algorithms

3 classes:
  Hartigan−Wong: class 1: 47,645 frames, 6.1 ± 0.6; class 2: 27,655, 11.1 ± 1.2; class 3: 44,700, 8.3 ± 0.6
  Lloyd/Forgy: class 1: 28,067 frames, 11.1 ± 1.2; class 2: 44,634, 8.3 ± 0.6; class 3: 47,299, 6.1 ± 0.6
  MacQueen: class 1: 44,712 frames, 8.3 ± 0.6; class 2: 27,630, 11.1 ± 1.2; class 3: 47,658, 6.1 ± 0.6
4 classes:
  Hartigan−Wong: class 1: 26,554 frames, 10.4 ± 0.6; class 2: 6019, 13.0 ± 1.3; class 3: 46,319, 6.0 ± 0.5; class 4: 41,108, 8.2 ± 0.6
  Lloyd/Forgy: class 1: 25,583 frames, 7.2 ± 0.5; class 2: 33,956, 8.8 ± 0.6; class 3: 35,645, 5.8 ± 0.4; class 4: 24,816, 11.2 ± 1.2
  MacQueen: class 1: 46,241 frames, 6.0 ± 0.5; class 2: 26,493, 10.4 ± 0.6; class 3: 40,977, 8.2 ± 0.5; class 4: 6289, 12.9 ± 1.3
5 classes:
  Hartigan−Wong: class 1: 21,745 frames, 10.7 ± 0.6; class 2: 22,532, 9.0 ± 0.4; class 3: 29,064, 7.7 ± 0.4; class 4: 42,299, 6.0 ± 0.5; class 5: 4360, 13.4 ± 1.3
  Lloyd/Forgy: class 1: 19,700 frames, 11.5 ± 1.2; class 2: 29,847, 7.9 ± 0.4; class 3: 26,517, 6.4 ± 0.3; class 4: 23,971, 9.4 ± 0.6; class 5: 19,965, 5.5 ± 0.3
  MacQueen: class 1: 23,977 frames, 6.8 ± 0.4; class 2: 28,511, 5.7 ± 0.4; class 3: 6681, 12.8 ± 1.3; class 4: 35,579, 8.3 ± 0.5; class 5: 25,252, 10.4 ± 0.6
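For comparison, a k-means partition of the same kind of one-dimensional data can be generated with scikit-learn (a Lloyd-type batch implementation, not the R routines used above). The synthetic trace only mimics the three-peak distribution of Figure 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D Thr29-Phe57 distance trace (Angstrom)
rng = np.random.default_rng(1)
distances = np.concatenate([rng.normal(6.1, 0.6, 48000),
                            rng.normal(8.3, 0.6, 45000),
                            rng.normal(11.1, 1.2, 27000)])
X = distances.reshape(-1, 1)

for k in (3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    order = np.argsort(km.cluster_centers_[:, 0])
    sizes = np.bincount(km.labels_, minlength=k)[order]
    centers = km.cluster_centers_[order, 0]
    print(f"K = {k}:",
          ", ".join(f"{s} frames @ {c:.1f} A" for s, c in zip(sizes, centers)))
```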

DFG-in active form and the DFG-out inactive form via the Srclike inactive form is more energetically preferred than the direct transition between the DFG-in and DFG-out form. Structural features of these forms have been discussed.25 The conformational changes of EGFR during this process can be depicted by the root-mean-square deviations (RMSDs) of the activation loop and the motif DFG as well as the distance between the ion pair (Lys745 and Glu762). Figure 7 plots the population of the variables during the transition simulated by multiple MD trajectories. For the EGFR data, it becomes infeasible to easily make a decision how many metastable states there should be. 2.1. BC/HC and k-Means. Since the BC can automatically determine the optimal number of clusters, the data are first partitioned by the BC. In the top 10 classifications, the number of clusters varies from 52 to 55, and the optimal number is 55. The 55 classes from the best solution are aggregated using the HC. In Figure 8, the closeness within the cluster (that is, the height) starts to decline slowly after the class number of 5. To examine the influence of diverse classifications on the MSM, eight solutions are selected with 5, 6, 7, 8, 9, 17, 23, and 55 states, respectively. According to the above results, the EGFR data are then grouped into 5, 6, 7, 8, 9, 17, 23, and 55 classes using the three k-means algorithms (Hartigan−-Wong, Lloyd/Forgy, and MacQueen). The classifications acquired are nearly identical when there are 5, 6, 7, and 8 clusters. With the class number of 9, 17, 23, and 55, various clusters are obtained. It is consistent with the observation in the FABP4 case that the three k-means algorithms tend to generate almost the same classification with a small class number. As an example, the five-state solution from both methods is listed in Table 6. In five-state classifications, the BC/HC and k-means results resemble each other. Since the classifications from the three kmeans algorithms are alike in size and mean, we only discuss the BC/HC and Lloyd/Forgy algorithm here. As shown in Table 6, the DFG-in active state is represented by class 3 (loop RMSD 3.7 Å, DFG RMSD 2.3 Å, and ion distance 8.1 Å) from the BC/HC and class 4 (4.2, 2.5, and 8.3 Å) from the Lloyd/ Forgy. The Src-like inactive state is described by class 1 (11.0, 4.6, and 15.2 Å) from the BC/HC and class 2 (10.5, 4.1, and 15.4 Å) from the Lloyd/Forgy. The DFG-out inactive state is depicted by class 5 (4.2, 6.4, and 14.2 Å) from the BC/HC and class 3 (4.4, 6.4, and 14.2 Å) from the Lloyd/Forgy. Classes 2 and 4 from the BC/HC as well as classes 1 and 5 from the Lloyd/Forgy are regarded as intermediate states. Class 4 (4.1, 3.5, and 13.6 Å) from the BC/HC corresponds to class 1 (5.1,

Figure 3. Evolution of the slowest time scales as a function of the lag time based on classifications obtained from the BC/HC. Runs 1−5 are arranged from top to bottom. The right column is the two slowest time scales from the three-state classifications. The left column is the four/three slowest time scales from the five-/four-state solutions.

fluctuates around 1.0 and is usually below 1.0. We deduce that the actual ratio may be 1; that is, the opening and closing rates are equal. If so, the opening rate may be overestimated by the BC/HC and underestimated by the k-means and PCCA+. 2. EGFR. In this section, we further assess the impact of clustering algorithms on the MSM with a more complicated case, EGFR. Abnormal activation of EGFR is closely related to cancer development, and EGFR has been a clinically valid target for cancer treatment. The activation/inactivation pathway of EGFR has been compared,25 and the path between the


Figure 4. Evolution of the slowest time scales as a function of the lag time based on classifications achieved from the k-means. The three algorithms (Hartigan−Wong, Lloyd/Forgy, and MacQueen) are arranged from left to right. The three-, four-, and five-state solutions are arranged from top to bottom.

Figure 5. Evolution of the slowest time scales as a function of the lag time based on microstates for the PCCA+. The discretization method (bins, Hartigan−Wong, Lloyd/Forgy, and MacQueen) are arranged from top to bottom. For 32, 76, and 146 microstates arranged from left to right, only the four slowest time scales are plotted.

3.5, and 13.6 Å) from the Lloyd/Forgy while class 2 (8.2, 4.6, and 15.0 Å) from the BC/HC and class 5 (10.7, 6.4, and 15.1 Å) from the Lloyd/Forgy are dissimilar. Sometimes the Hartigan−Wong algorithm does not converge because of data proximity. In the FABP4 case, this problem can be solved by slight rounding of the data. However, in the EGFR case, rounding does not help to avoid the problem. In contrast, the Lloyd/Forgy and MacQueen algorithm will converge by simply increasing the iteration step. In our opinion, the Lloyd/Forgy and MacQueen algorithm are more appropriate for MD data than the

Hartigan−Wong algorithm. Therefore, the Hartigan−Wong algorithm is excluded from the following MSM analysis. 2.2. Lag Time and PCCA+. The EGFR data are discretized into 81, 194, and 1013 bins manually and divided into 64, 125, 216, 512, and 1000 classes by means of the Lloyd/Forgy and MacQueen algorithms. In almost all cases, the PCCA+ fails in agglomeration, suggesting that the PCCA+ is sensitive to decomposition of the phase space for high-dimensional data like the EGFR case. The PCCA+ is also employed to assemble the 55 states generated by the BC, Lloyd/Forgy, and MacQueen. Analysis of the implied time scales (Figure 9)


Table 5. Size and Mean Distance (Å) between Thr29 and Phe57, with Standard Deviation, in Three Metastable States Obtained from the PCCA+ Algorithm

32 microstates:
  bins: state 1: 62,986 frames, 6.5 ± 0.9; state 2: 40,337, 10.0 ± 1.3; state 3: 16,677, 9.6 ± 1.9
  Hartigan−Wong: state 1: 45,844, 8.8 ± 2.2; state 2: 47,594, 6.5 ± 0.7; state 3: 26,562, 9.7 ± 1.7
  Lloyd/Forgy: state 1: 39,787, 9.8 ± 1.5; state 2: 69,465, 6.7 ± 1.1; state 3: 10,748, 10.5 ± 1.9
  MacQueen: state 1: 45,604, 6.6 ± 0.7; state 2: 57,719, 8.8 ± 2.3; state 3: 16,677, 9.5 ± 1.5
76 microstates:
  bins: state 1: 2 frames, 4.3 ± 0.0; state 2: 66,331, 9.6 ± 1.6; state 3: 53,667, 6.2 ± 0.7
  Hartigan−Wong: state 1: 54,541, 9.4 ± 1.8; state 2: 56,292, 6.5 ± 1.0; state 3: 9167, 10.0 ± 1.8
  Lloyd/Forgy: state 1: 44,385, 6.6 ± 0.8; state 2: 52,869, 8.8 ± 2.3; state 3: 22,746, 9.5 ± 1.6
  MacQueen: state 1: 14,424, 9.5 ± 1.5; state 2: 52,010, 9.6 ± 1.6; state 3: 53,566, 6.2 ± 0.7
146 microstates:
  bins: state 1: 63,835 frames, 7.4 ± 1.3; state 2: 22,751, 6.5 ± 1.7; state 3: 33,414, 10.5 ± 1.4
  Hartigan−Wong: state 1: 24,599, 9.4 ± 2.5; state 2: 48,533, 9.1 ± 1.5; state 3: 46,868, 6.3 ± 0.7
  Lloyd/Forgy: state 1: 68,797, 7.5 ± 1.5; state 2: 37,474, 10.2 ± 1.5; state 3: 13,729, 5.4 ± 0.5
  MacQueen: not available (the 146 microstates cannot be lumped into three macrostates)
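The PCCA+ lumping itself was performed with Emma. The sketch below is only a simplified spectral stand-in: it groups microstates by k-means in the space of the slow right eigenvectors of the transition matrix, which conveys the idea of kinetic lumping but is not the PCCA+ algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def lump_microstates(T, n_macrostates):
    """Group microstates by k-means on the dominant right eigenvectors
    of a row-stochastic transition matrix (a spectral stand-in for PCCA+)."""
    evals, evecs = np.linalg.eig(T)               # right eigenvectors in columns
    order = np.argsort(-np.abs(evals))
    # Skip the stationary eigenvector; keep the next n_macrostates - 1
    sig = np.real(evecs[:, order[1:n_macrostates]])
    return KMeans(n_clusters=n_macrostates, n_init=10,
                  random_state=0).fit_predict(sig)

# Toy 6-microstate chain with three weakly connected blocks
T = np.array([
    [0.90, 0.09, 0.005, 0.005, 0.0, 0.0],
    [0.09, 0.90, 0.005, 0.005, 0.0, 0.0],
    [0.005, 0.005, 0.90, 0.09, 0.0, 0.0],
    [0.005, 0.005, 0.09, 0.89, 0.005, 0.005],
    [0.0, 0.0, 0.0, 0.01, 0.90, 0.09],
    [0.0, 0.0, 0.0, 0.01, 0.09, 0.90],
])
print(lump_microstates(T, 3))   # e.g. [0 0 1 1 2 2], up to label permutation
```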

Figure 6. Total net flux and ratio of ropen/rclose calculated from a variety of MSMs. (A) Geometric clustering. The MSM with 3, 4, and 5 metastable states is colored black, red, and green, respectively. (B) Kinetic clustering. The MSM with 32, 76, and 146 microstates is colored black, red, and green. The microstates are lumped into 3 (left column) and 4 (right column) macrostates by the PCCA+. With the MacQueen, the 146 microstates cannot be grouped into 3 or 4 macrostates due to matrix singularity. HW, LF, and MQ represent the three k-means algorithms (Hartigan−Wong, Lloyd/Forgy, and MacQueen).

Figure 8. Hierarchical clustering of 55 classes obtained from the best solution using Bayesian clustering. The red points denote the solution with 5, 6, 7, 8, 9, 17, 23, and 55 clusters.

Figure 7. Distribution of loop RMSD, DFG RMSD, and the distance between the ion pair (Lys745 and Glu762) extracted from 47 MD simulations of EGFR, each lasting for 200 ns. The sampling is along the pathway between the DFG-in active form and the DFG-out inactive form via the Src-like inactive form.

11.9 Å) is close to the DFG-in active state. Class 4 (4.2, 6.4, and 14.2 Å) is the DFG-out inactive state. Class 5 (10.0, 4.1, and 14.7 Å) is the Src-like inactive state. Out of two intermediate states (classes 1 and 3), class 3 (10.3, 6.6, and 15.7 Å) is close to class 5 (10.7, 6.4, 15.1 Å) from the Lloyd/Forgy, while class 1 has no equivalent from the BC/HC and k-means. Figure 9 indicates that for the three-dimensional EGFR data the lag time depends on the clustering algorithm and the number of states, which is opposite to the one-dimensional FABP4 data. With the BC/HC, the model is Markovian after

shows that the model is Markovian within 120 ns only when the data are divided into 55 clusters with the BC. Then, the 55 clusters from the BC are lumped together by use of the PCCA+. The agglomerative 5, 6, 7, 8, and 9 states are successfully produced, while the PCCA+ cannot congregate the data into 17 or 23 states. The five-state model is summarized in Table 6. Similar to the BC/HC and k-means, the three conformational states are recognizable. Class 2 (3.4, 2.8, and


PCCA+. The rate proportion increases rapidly from 34.6 to 332.2, which is usually much higher than that with the BC/HC and BC/PCCA+. For the three-dimensional EGFR data, decomposition of the conformational space is an obstacle in building a valid MSM, especially with kinetic clustering (PCCA+). Most solutions generated by the various clustering methods are unsuitable for the PCCA+ analysis. The convergence of implied time scales is also a problem in many classifications. Comparatively speaking, the BC/HC is the most effective method for MSM construction. Similar to the FABP4 case, using any method, the net flux tends to decrease as the state number increases. When there are only a few states, the ratio of transition rates is reasonably stable with the BC/HC and BC/PCCA+ but varies greatly with the k-means. In this sense, the BC/HC and BC/PCCA+ are more reliable than the k-means. However, the stability of the MSM output is necessary, but insufficient, for correctly estimating the transition rate. Although the BC/HC and BC/PCCA+ both can lead to stable output, the values are substantially different. As mentioned in the FABP4 case, the ratio might be overestimated by the BC/HC and underestimated by the PCCA+. In the EGFR case, we infer that the actual ratio may be between ∼40.0 (the BC/HC) and ∼4.0 (the BC/PCCA+). In spite of the uncertainty, our results suggest that the inactivation of EGFR is faster than the activation.
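The total net flux quantities compared in this section follow transition path theory. A minimal sketch is given below, assuming a row-stochastic transition matrix T estimated at a given lag time, a reversible chain (so the backward committor is 1 − q+), and source/sink state sets A and B; it is illustrative, not the Emma routine used here.

```python
import numpy as np

def stationary_distribution(T):
    """Stationary distribution as the dominant left eigenvector of T."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

def committor(T, A, B):
    """Forward committor q+: probability of reaching B before A."""
    n = T.shape[0]
    q = np.zeros(n)
    q[list(B)] = 1.0
    inter = [i for i in range(n) if i not in A and i not in B]
    M = np.eye(len(inter)) - T[np.ix_(inter, inter)]
    rhs = T[np.ix_(inter, list(B))].sum(axis=1)
    q[inter] = np.linalg.solve(M, rhs)
    return q

def total_net_flux(T, A, B, lag_time):
    """Total A->B net flux per unit time for a reversible chain."""
    pi = stationary_distribution(T)
    qp = committor(T, A, B)
    qm = 1.0 - qp
    F = pi[:, None] * qm[:, None] * T * qp[None, :]   # f_ij = pi_i q-_i T_ij q+_j
    np.fill_diagonal(F, 0.0)
    Fnet = np.maximum(F - F.T, 0.0)
    return Fnet[list(A), :].sum() / lag_time
```

Passing the lag time in seconds returns the flux in s^-1; the rate ratio discussed above can then be formed from the corresponding MFPTs in the two directions.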

Table 6. Size and Mean with Standard Deviation of Loop RMSD, DFG RMSD, and Distance between the Ion Pair in the Five-State Model Constructed Using the BC, k-Means, and PCCA+ (a)

method  class  size     loop RMSD (Å)  DFG RMSD (Å)  ion distance (Å)
BC      1      529,349  11.0 ± 1.2     4.6 ± 1.1     15.2 ± 1.6
BC      2      143,316  8.2 ± 1.0      4.6 ± 1.2     15.0 ± 1.2
BC      3      66,631   3.7 ± 1.5      2.3 ± 0.8     8.1 ± 1.0
BC      4      91,949   4.1 ± 0.9      3.5 ± 0.6     13.6 ± 1.7
BC      5      108,755  4.2 ± 1.0      6.4 ± 0.5     14.2 ± 1.3
HW      1      76,457   4.2 ± 2.0      2.5 ± 0.9     8.3 ± 1.0
HW      2      478,420  10.5 ± 1.4     4.1 ± 0.5     15.4 ± 1.3
HW      3      143,347  10.7 ± 1.6     6.4 ± 0.6     15.1 ± 1.7
HW      4      114,131  4.4 ± 1.2      6.4 ± 0.5     14.2 ± 1.3
HW      5      127,645  5.1 ± 1.9      3.5 ± 0.6     13.6 ± 1.5
LF      1      127,666  5.1 ± 1.9      3.5 ± 0.6     13.6 ± 1.5
LF      2      478,411  10.5 ± 1.4     4.1 ± 0.5     15.4 ± 1.3
LF      3      114,131  4.4 ± 1.2      6.4 ± 0.5     14.2 ± 1.3
LF      4      76,446   4.2 ± 2.0      2.5 ± 0.9     8.3 ± 1.0
LF      5      143,346  10.7 ± 1.6     6.4 ± 0.6     15.1 ± 1.7
MQ      1      127,644  5.1 ± 1.9      3.5 ± 0.6     13.6 ± 1.5
MQ      2      478,420  10.5 ± 1.4     4.1 ± 0.5     15.4 ± 1.3
MQ      3      76,458   4.2 ± 2.0      2.5 ± 0.9     8.3 ± 1.0
MQ      4      143,347  10.7 ± 1.6     6.4 ± 0.6     15.1 ± 1.7
MQ      5      114,131  4.4 ± 1.2      6.4 ± 0.5     14.2 ± 1.3
PC      1      20,348   11.2 ± 0.5     6.6 ± 0.2     12.9 ± 0.5
PC      2      111,624  3.4 ± 0.6      2.8 ± 0.9     11.9 ± 3.2
PC      3      100,555  10.3 ± 1.6     6.6 ± 0.5     15.7 ± 1.5
PC      4      108,755  4.2 ± 1.0      6.4 ± 0.5     14.2 ± 1.3
PC      5      598,718  10.0 ± 2.1     4.1 ± 0.7     14.7 ± 2.1



CONCLUSION

In this report, geometric and kinetic clustering methods are employed to partition the conformational space of FABP4 opening/closing and EGFR activation/inactivation. Among the clustering algorithms, the k-means assigns the nearest data points into clusters whose number has been predefined. The BC searches for the optimal classification on the basis of posterior probability. After discretization of the space, the PCCA+ lumps kinetically similar clusters together by calculating the stationary distribution and transition matrix. Despite the difference in algorithm theory, comparison of clustering algorithms indicates that geometric clustering performs as well as kinetic clustering in both the FABP4 and EGFR cases. Moreover, geometric clustering displays better adaptability than kinetic clustering. The results are somewhat unanticipated, while supporting the assumption that conformations that are geometrically similar are also kinetically close to each other. This work provides helpful evidence for the application of clustering algorithms in MSM analysis of MD simulations. The influence of clustering algorithms on the MSM is closely related to the data traits. With the one-dimensional FABP4 data, MSMs are set up easily using any clustering algorithm. Although the flux and rate proportion vary with the clustering method, they fluctuate within the same order of magnitude. Therefore, the clustering algorithm has a limited impact on the MSM outcome, especially when an appropriate number of states is predefined. With the three-dimensional EGFR data, it becomes difficult to build valid MSMs due to convergence and matrix issues, especially with the PCCA+. The estimated flux and rate proportion based on distinct clustering algorithms vary greatly, up to 3 orders of magnitude. Thus, the clustering method needs to be carefully chosen. In addition, the MSM outcomes are affected by the number of states to some degree. For either the FABP4 or EGFR case, the overall trend is that the net flux reduces and the ratio of transition rates (r_open/r_close and r_ina/r_act) rises with the number of

(a) BC, HW, LF, MQ, and PC represent the Bayesian clustering, Hartigan−Wong, Lloyd/Forgy, MacQueen, and PCCA+ algorithm, respectively.

110 ns when there are 5, 6, 7, 8, 9, and 55 states. With the k-means, the model looks Markovian after 100 ns when there are 5, 6, and 8 states. When the number of states is 7, the time scale converges after 110 ns with the MacQueen algorithm but not with the Lloyd/Forgy. When there are 9, 17, 23, and 55 states, the model acquired from both k-means algorithms cannot be regarded as Markovian within 120 ns. The implied time scale does not converge in 120 ns when the data are separated into 17 or 23 classes using any method. 2.3. Effect on MSM. On the basis of classifications obtained from diverse clustering approaches, MSMs are built and the total net flux is calculated as well as the ratio (r_ina/r_act) of the inactivation and activation rate (Table 7). When the number of states rises, using any clustering method, the flux tends to decrease, while the rate proportion tends to climb. With the BC/HC, when the state number increases from 5 to 8, the flux diminishes from 90.0 to 50.5 s^-1 and the ratio (r_ina/r_act) stabilizes around 40. With the BC/PCCA+, when the number of states ranges between 5 and 9, the flux reduces from 33.9 to 12.3 × 10^3 s^-1 except for a sharp rise at the number of 8, and the ratio fluctuates slightly between 3.0 and 4.3. The flux with the BC/HC is almost 3 orders of magnitude lower than that with the BC/PCCA+, while the rate ratio with the BC/HC is 1 order of magnitude higher than that with the BC/PCCA+. The two k-means algorithms (Lloyd/Forgy and MacQueen) lead to almost the same MSM outcomes. As the number of states ascends, the flux descends from 1.12 to 0.18 × 10^3 s^-1, which is at least 1 order of magnitude lower than that with the BC/


Figure 9. Evolution of the four slowest time scales as a function of the lag time (τ). (A) The implied time scale is calculated from classifications with 5, 6, 7, 8, and 9 clusters (from top to bottom) generated by the BC, Lloyd/Forgy, and MacQueen algorithm (from left to right). (B) The implied time scale is calculated from classifications with 17, 23, and 55 clusters (from top to bottom) generated by the BC, Lloyd/Forgy, and MacQueen algorithm (from left to right). The Hartigan−Wong algorithm is excluded due to the convergence problem. For clarity, only the four slowest time scales are plotted.

number, while with the BC/HC and BC/PCCA+ the rate proportion is relatively stable when the number of states is between 5 and 9. Here, we survey the clustering performance with low-dimensional (one and three) data using the distance and RMSD as the metric. For high-dimensional (≫3) data,

states increasing. An exception is that, for the FABP4 data with kinetic clustering, the tendency is not discernible. On the other hand, we notice that when a few states are defined in the FABP4 case the MSM output does not alter much using different clustering methods. In the EGFR case, with the k-means, the rate proportion is greatly influenced by the state


Table 7. Total Net Flux (×10^3 s^-1) and Ratio r_ina/r_act Computed from Various MSMs

no. of states   BC/HC flux   BC/HC ratio   LF flux   LF ratio   MQ flux   MQ ratio   BC/PCCA+ flux   BC/PCCA+ ratio
5               0.090        38.9          1.12      34.6       1.12      34.6       33.9            3.0
6               0.076        41.7          0.20      192.2      0.20      192.2      33.2            3.2
7               0.073        41.5          −         −          0.19      190.6      12.5            3.0
8               0.050        39.8          0.18      332.2      0.18      331.5      80.4            4.3
9               0.044        47.3          −         −          −         −          12.3            4.2
55              0.0002       1988.6        −         −          −         −          −               −

(8) Schutte, C.; Sarich, M. A Critical Appraisal of Markov State Models. Eur. Phys. J.: Spec. Top. 2015, 224, 2445−2462. (9) Shenkin, P. S.; McDonald, D. Q. Cluster Analysis of Molecular Conformations. J. Comput. Chem. 1994, 15, 899−916. (10) Bottegoni, G.; Cavalli, A.; Recanatini, M. A Comparative Study on the Application of Hierarchical-Agglomerative Clustering Approaches to Organize Outputs of Reiterated Docking Runs. J. Chem. Inf. Model. 2006, 46, 852−862. (11) Li, Y. Bayesian Model Based Clustering Analysis: Application to A Molecular Dynamics Trajectory of the HIV-1 Integrase Catalytic Core. J. Chem. Inf. Model. 2006, 46, 1742−1750. (12) Noe, F.; Horenko, I.; Schutte, C.; Smith, J. C. Hierarchical Analysis of Conformational Dynamics in Biomolecules: Transition Networks of Metastable States. J. Chem. Phys. 2007, 126, 155102. (13) Shao, J. Y.; Tanner, S. W.; Thompson, N.; Cheatham, T. E. Clustering Molecular Dynamics Trajectories: 1. Characterizing the Performance of Different Clustering Algorithms. J. Chem. Theory Comput. 2007, 3, 2312−2334. (14) Keller, B.; Daura, X.; van Gunsteren, W. F. Comparing Geometric and Kinetic Cluster Algorithms for Molecular Simulation Data. J. Chem. Phys. 2010, 132, 074110. (15) Haack, F.; Fackeldey, K.; Roblitz, S.; Scharkoi, O.; Weber, M.; Schmidt, B. Adaptive Spectral Clustering with Application to Tripeptide Conformation Analysis. J. Chem. Phys. 2013, 139, 194110. (16) McGibbon, R. T.; Pande, V. S. Learning Kinetic Distance Metrics for Markov State Models of Protein Conformational Dynamics. J. Chem. Theory Comput. 2013, 9, 2900−2906. (17) Senne, M.; Trendelkamp-Schroer, B.; Mey, A. S. J. S.; Schutte, C.; Noe, F. EMMA: A Software Package for Markov Model Building and Analysis. J. Chem. Theory Comput. 2012, 8, 2223−2238. (18) Scherer, M. K.; Trendelkamp-Schroer, B.; Paul, F.; PerezHernandez, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J. H.; Noe, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. 2015, 11, 5525−5542. (19) Beauchamp, K. A.; Bowman, G. R.; Lane, T. J.; Maibaum, L.; Haque, I. S.; Pande, V. S. MSMBuilder2: Modeling Conformational Dynamics on the Picosecond to Millisecond Scale. J. Chem. Theory Comput. 2011, 7, 3412−3419. (20) Li, Y.; Li, X.; Dong, Z. Exploration of Gated Ligand Binding Recognizes An Allosteric Site for Blocking FABP4-Protein Interaction. Phys. Chem. Chem. Phys. 2015, 17, 32257−67. (21) Prinz, J. H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Schutte, C.; Noe, F. Markov Models of Molecular Kinetics: Generation and Validation. J. Chem. Phys. 2011, 134, 174105. (22) Bowman, G. R.; Meng, L. M.; Huang, X. H. Quantitative Comparison of Alternative Methods for Coarse-Graining Biological Networks. J. Chem. Phys. 2013, 139, 121905. (23) Vitalini, F.; Mey, A.; Noe, F.; Keller, B. G. Dynamic Properties of Force Fields. J. Chem. Phys. 2015, 142, 084101. (24) Li, Y.; Li, X.; Dong, Z. Concerted Dynamic Motions of An FABP4Model and Its Ligands Revealed by Microsecond Molecular Dynamics Simulations. Biochemistry 2014, 53, 6409−6417. (25) Li, Y.; Li, X.; Ma, W.; Dong, Z. Conformational Transition Pathways of Epidermal Growth Factor Receptor Kinase Domain from Multiple Molecular Dynamics Simulations and Bayesian Clustering. J. Chem. Theory Comput. 2014, 10, 3503−3511.

dimension reduction usually needs to be performed to choose principal components, and then clustering can be carried out to partition the data in the low-dimensional space. Other structural features such as the change of secondary structures are also important measurements, depending on the system studied. Selection of appropriate components and metrics is a major challenge in the analysis of MD simulations, too. In clustering, it is well known that determining the optimal number of classes in advance is notoriously hard. Both the k-means and PCCA+ need the state number as an input parameter. In a previous report, a cost function, similar to the height value in the HC (Figure 8), is introduced in the k-means to choose the optimal number.43 In this work, we examine the stability of the BC in automatically determining the microstates by posterior probability, and subsequently, the HC is utilized to group the microstates into macrostates. The cost function and BC/HC both look reasonable and practicable approaches to settle the question.
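As a hypothetical illustration of this strategy, high-dimensional features can first be reduced with PCA and then clustered, and an inertia (within-cluster sum of squares) scan plays the role of the cost function mentioned above. The feature matrix below is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic (n_frames, n_features) matrix standing in for high-dimensional
# structural descriptors (e.g. pairwise distances or backbone dihedrals)
rng = np.random.default_rng(2)
features = rng.normal(size=(5000, 60))
features[:2500, :3] += 4.0          # imitate two broad conformational basins

# Reduce to a few principal components before clustering
pcs = PCA(n_components=3).fit_transform(features)

# Elbow-style scan: the inertia serves as a cost function for the state number
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs)
    print(f"{k} states: inertia = {km.inertia_:.1f}")
```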



AUTHOR INFORMATION

Corresponding Authors

*(Yan Li) E-mail: [email protected]. Phone: (507) 437-9600. Fax: (507) 437-9606. *(Zigang Dong) E-mail: [email protected]. Phone: (507) 437-9600. Fax: (507) 437-9606. Funding

This work was supported by The Hormel Foundation and National Institutes of Health, Grants CA172457, CA166011, and R37 CA081064. Notes

The authors declare no competing financial interest.




REFERENCES

(1) Henzler-Wildman, K.; Kern, D. Dynamic Personalities of Proteins. Nature 2007, 450, 964−972. (2) Hospital, A.; Goni, J. R.; Orozco, M.; Gelpi, J. L. Molecular Dynamics Simulations: Advances and Applications. Adv. Appl. Bioinf. Chem. 2015, 10, 37−47. (3) Malmstrom, R. D.; Lee, C. T.; Van Wart, A. T.; Amaro, R. E. Application of Molecular-Dynamics Based Markov State Models to Functional Proteins. J. Chem. Theory Comput. 2014, 10, 2648−2657. (4) Shukla, D.; Hernandez, C. X.; Weber, J. K.; Pande, V. S. Markov State Models Provide Insights into Dynamic Modulation of Protein Function. Acc. Chem. Res. 2015, 48, 414−422. (5) Chodera, J. D.; Noe, F. Markov State Models of Biomolecular Conformational Dynamics. Curr. Opin. Struct. Biol. 2014, 25, 135−144. (6) Klippenstein, S. J.; Pande, V. S.; Truhlar, D. G. Chemical Kinetics and Mechanisms of Complex Systems: A Perspective on Recent Theoretical Advances. J. Am. Chem. Soc. 2014, 136, 528−546. (7) Schwantes, C. R.; McGibbon, R. T.; Pande, V. S. Perspective: Markov Models for Long-Timescale Biomolecular Dynamics. J. Chem. Phys. 2014, 141, 090901. J


Journal of Chemical Information and Modeling (26) Case, D. A.; Darden, T. A.; Cheatham, T. E., III; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Walker, R. C.; Zhang, W.; Merz, K. M.; Roberts, B.; Wang, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Kolossváry, I.; Wong, K. F.; Paesani, F.; Vanicek, J.; Liu, J.; Wu, X.; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Cai, Q.; Ye, X.; Wang, J.; Hsieh, M.-J.; Cui, G.; Roe, D. R.; Mathews, D. H.; Seetin, M. G.; Sagui, C.; Babin, V.; Luchko, T.; Gusarov, S.; Kovalenko, A.; Kollman, P. A. AMBER 11, University of California, San Francisco, 2010. (27) Yun, C. H.; Boggon, T. J.; Li, Y. Q.; Woo, M. S.; Greulich, H.; Meyerson, M.; Eck, M. J. Structures of Lung Cancer-Derived EGFR Mutants and Inhibitor Complexes: Mechanism of Activation and Insights into Differential Inhibitor Sensitivity. Cancer Cell 2007, 11, 217−227. (28) Kawakita, Y.; Seto, M.; Ohashi, T.; Tamura, T.; Yusa, T.; Miki, H.; Iwata, H.; Kamiguchi, H.; Tanaka, T.; Sogabe, S.; Ohta, Y.; Ishikawa, T. Design and Synthesis of Novel Pyrimido[4,5-B]Azepine Derivatives as HER2/EGFR Dual Inhibitors. Bioorg. Med. Chem. 2013, 21, 2250−2261. (29) Gajiwala, K. S.; Feng, J. L.; Ferre, R.; Ryan, K.; Brodsky, O.; Weinrich, S.; Kath, J. C.; Stewart, A. Insights into the Aberrant Activity of Mutant EGFR Kinase Domain and Drug Recognition. Structure 2013, 21, 209−219. (30) Xu, Z. H.; Bernlohr, D. A.; Banaszak, L. J. Crystal-Structure of Recombinant Murine Adipocyte Lipid-Binding Protein. Biochemistry 1992, 31, 3484−3492. (31) Duan, Y.; Wu, C.; Chowdhury, S.; Lee, M. C.; Xiong, G. M.; Zhang, W.; Yang, R.; Cieplak, P.; Luo, R.; Lee, T.; Caldwell, J.; Wang, J. M.; Kollman, P. A Point-Charge Force Field for Molecular Mechanics Simulations of Proteins Based on Condensed-Phase Quantum Mechanical Calculations. J. Comput. Chem. 2003, 24, 1999−2012. (32) Salomon-Ferrer, R.; Götz, A. W.; Poole, D.; Le Grand, S.; Walker, R. C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald. J. Chem. Theory Comput. 2013, 9, 3878−3888. (33) Cheeseman, P.; Stutz, J. Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining; Usama, M. F., Gregory, P.-S., Padhraic, S., Ramasamy, U., Eds.; American Association for Artificial Intelligence, 1996; pp 153−180. (34) Cook, D.; Potts, J.; Taylor, W. AutoClass C. National Aeronautics and Space Administration. http://ti.arc.nasa.gov/tech/ rse/synthesis-projects-applications/autoclass/autoclass-c/ (accessed May 12, 2016). (35) Xu, R.; Wunsch, D. Clustering; Wiley-IEEE Press, 2009; p 358. (36) R. The R Project for Statistical Computing. http://www.rproject.org/ (accessed May 12, 2016). (37) Deuflhard, P.; Weber, M. Robust Perron Cluster Analysis in Conformation Dynamics. Linear Algebra Appl. 2005, 398, 161−184. (38) Bacallado, S.; Chodera, J. D.; Pande, V. Bayesian Comparison of Markov Models of Molecular Dynamics with Detailed Balance Constraint. J. Chem. Phys. 2009, 131, 045106. (39) Trendelkamp-Schroer, B.; Wu, H.; Paul, F.; Noe, F. Estimation and Uncertainty of Reversible Markov Models. J. Chem. Phys. 2015, 143, 174101. (40) Singhal, N.; Pande, V. S. Error Analysis and Efficient Sampling in Markovian State Models for Molecular Dynamics. J. Chem. Phys. 2005, 123, 204909−1. (41) Furuhashi, M.; Hotamisligil, G. S. Fatty Acid-Binding Proteins: Role in Metabolic Diseases and Potential as Drug Targets. Nat. Rev. Drug Discovery 2008, 7, 489−503. (42) Nieman, K. M.; Kenny, H. A.; Penicka, C. 
V.; Ladanyi, A.; Buell-Gutbrod, R.; Zillhardt, M. R.; Romero, I. L.; Carey, M. S.; Mills, G. B.; Hotamisligil, G. S.; Yamada, S. D.; Peter, M. E.; Gwin, K.; Lengyel, E. Adipocytes Promote Ovarian Cancer Metastasis and Provide Energy for Rapid Tumor Growth. Nat. Med. 2011, 17, 1498−1503. (43) Decherchi, S.; Berteotti, A.; Bottegoni, G.; Rocchia, W.; Cavalli, A. The Ligand Binding Mechanism to Purine Nucleoside Phosphorylase Elucidated via Molecular Dynamics and Machine Learning. Nat. Commun. 2015, 6, 6155.
