Performance Comparison of Systematic Methods for Rigorous

Jan 27, 2017 - Some research groups have attempted to develop rigorous methods to derive the CG representations for proteins.(21, 34-37) Voth and co-w...
1 downloads 21 Views 3MB Size
Article pubs.acs.org/jcim

Performance Comparison of Systematic Methods for Rigorous Definition of Coarse-Grained Sites of Large Biomolecules Yuwei Zhang,†,‡ Zexing Cao,† John Zenghui Zhang,‡,§ and Fei Xia*,‡,§ †

State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemistry Engineering, Xiamen University, Xiamen 361005, China ‡ School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China § NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China S Supporting Information *

ABSTRACT: Construction of coarse-grained (CG) models for large biomolecules used for multiscale simulations demands a rigorous definition of CG sites for them. Several coarse-graining methods such as the simulated annealing and steepest descent (SASD) based on the essential dynamics coarse-graining (EDCG) or the stepwise local iterative optimization (SLIO) based on the fluctuation maximization coarse-graining (FM-CG), were developed to do it. However, the practical applications of these methods such as SASD based on ED-CG are subject to limitations because they are too expensive. In this work, we extend the applicability of ED-CG by combining it with the SLIO algorithm. A comprehensive comparison of optimized results and accuracy of various algorithms based on ED-CG show that SLIO is the fastest as well as the most accurate algorithm among them. ED-CG combined with SLIO could give converged results as the number of CG sites increases, which demonstrates that it is another efficient method for coarse-graining large biomolecules. The construction of CG sites for Ras protein by using MD fluctuations demonstrates that the CG sites derived from FM-CG can reflect the fluctuation properties of secondary structures in Ras accurately.



INTRODUCTION The development of coarse-grained (CG) models1−7 for the multiscale simulation of biomolecules8−10 is a promising way to study their functions, including the protein folding,11 conformational change,12 and protein−protein interaction,13 due to the high efficiency of CG models in computation. Some CG models such as these in the MARTINI force field14−16 have already been developed and successfully applied to the investigations of functions and mechanisms of membrane proteins. In these higher resolution CG models, each residue of protein is usually represented by a few CG beads.13,14,17−20 In order to simulate the function of biological macromolecules such as ribosome, Zhang et al.21 have developed low resolution CG models to characterize its dynamic and structural features, where each CG bead represents several tens of residues in ribosomal proteins. In general, construction of CG models based on the relatively fine-grained models such as the all-atom models or the coarser elastic network models22−29 requires a scheme of CG mapping30 to be defined. The CG mapping regulates how to transform the configuration of a fine-grained model to the configuration of a CG model, which has an impact on simulation results. The recent studies performed by Noid and co-workers31,32 reveal that a proper CG mapping could render © 2017 American Chemical Society

CG models to reproduce the properties of all-atom models accurately. The CG mapping of the small molecules such as lipids is usually determined according to their functional groups.33 However, it is more difficult to decide the CG mapping properly for large biomolecules, since too many atoms are involved in them. Therefore, it requires the systematic methods to be developed for the definition of CG sites of large biomolecules. Some research groups have attempted to develop rigorous methods to derive the CG representations for proteins.21,34−37 Voth and co-workers proposed a systematic method called essential-dynamics coarse-graining (ED-CG) method,38 which could construct the CG sites for proteins with their sequences known. The physical spirit of the ED-CG scheme lies in the fact that the atoms that move in the highly similar directions in protein are grouped into the same CG site. A schematic illustration of the principle of ED-CG is shown in Figure 1a. The vectors Δri and Δrj denote the displacements of atoms i and j around their equilibrium positions and the similarity of two vectors is measured by the magnitude of their difference Δri − Δrj. The ED-CG method counts the sum of all the Received: November 7, 2016 Published: January 27, 2017 214

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling

system with nearly 1800 residues into 500 CG sites only cost 28 s. By using SLIO based on FM-CG, we even could derive the CG sites for the huge P22 portal protein with 8304 residues at the computational cost of 19 min, while it is a mission impossible by SASD. However, several limitations about these developed methods and algorithms remain so far. For instance, SASD based on EDCG could be only applied to a target system with the number N of CG sites determined, which extremely limits the application of ED-CG to large biomolecules. In this work, we implement and combine the SASD, SOBC and SLIO algorithms with the ED-CG method and make a systematic comparison between them. We mainly focus on answering the several questions shown below. First, how about the performance of ED-CG compared to FM-CG if the function defined in ED-CG is optimized by SLIO? Second, which algorithm among SASD, SOBC and SLIO can yield the best results if combined with ED-CG or FM-CG? Third, how about the rank of three algorithms as far as the computational cost and the accuracy are concerned? In addition, we used the fluctuations calculated from elastic network models to construct the CG sites of biomolecules in the previous study. In order to construct the CG sites in a more reasonable way, we attempt to use the fluctuations generated from MD trajectories of Ras protein43−45 to do it in this work.

Figure 1. Schematic illustration of the principles of (a) ED-CG and (b) FM-CG methods. The blue solid curves represent contours of proteins, the red dashed ellipses represent the divided CG sites, and the hollow circles denote the positions of atoms in the proteins. In ED-CG, the vectors Δrx (x = i, j, k, l) denote the displacement vectors of atoms i, j, k, and l at equilibrium positions and the Δri − Δrj denotes the difference of motion directions of atoms i and j in a certain CG site. In FM-CG, sij and si′j′ denote the equilibrium and instantaneous distances of atoms i and j. The notation Δsij denotes the mean-square distance fluctuation that is calculated as si′j′ − sij.



displacement differences of pairwise atoms in each CG site to ensure that the total sum over all CG sites possesses a small value. In order to obtain the minimum of the sum, Zhang et al. employed the combined optimization algorithm of simulated annealing39 and steepest descent (SASD).38 The SASD algorithm works very well for the small-size proteins such as the G-actin monomer to define their CG representations. However, the number of samplings for large proteins by SASD is enormous, which leads to a considerable computational cost at the time scale of hours.40 In order to speed up the process of deriving CG sites for large biomolecules, Xia and co-workers developed a new algorithm called stepwise optimization imposed with the boundary-constraint (SOBC),41 based on the ED-CG. The SOBC algorithm reduces the computational cost by a half of that consumed by SASD for large systems such as IgG1 b12, as well as gives better results meanwhile. Further, Xia and coworkers proposed a new coarse-graining method, which is called fluctuation maximization coarse-graining (FM-CG).42 As shown in Figure 1b, sij and si′j′ denote the equilibrium and instantaneous distances of atoms i and j, respectively. In FMCG, if the Δsij between the atoms i and j is large, the i and j are naturally separated into different CG sites. FM-CG aims at maximizing the sum of distance fluctuations of pairwise atoms between all CG sites. Thus, the maximization of the sum defined in FM-CG gives rise to a suite of CG sites. It is emphasized that the physical meanings of ED-CG and FM-CG for coarse-graining actually are completely different in nature. As shown in Figure 1, the coarse-graining results of ED-CG reflect the property that the atoms in the same function domain of protein have the similar motion directions, while that of FMCG reflects that the different function domain in protein have large relative motions as executing their functions. Based on FM-CG, Xia and co-workers designed a robust algorithm called stepwise local iterative optimization (SLIO)42 to speed up the coarse-graining process of large biomolecules. By using SLIO, the computational cost for large systems reduces to the time scale of seconds. For instance, coarse-graining the F-actin

THEORY AND METHODS ED-CG and FM-CG Methods. In the ED-CG method,38 a residual χED‑CG2 is defined to count the sum of difference of displacement vectors of pairwise atoms i and j in the finegrained model, as shown in eq 1: χED ‐ CG 2 =

1 3N

N

∑∑ ∑

[⟨Δri 2⟩ + ⟨Δrj 2⟩

I=1 i∈I j≥i∈I

− 2⟨Δri·Δrj⟩]

(1)

where N is the total number of CG sites, I represents the Ith CG site, and Δri as well as Δrj correspond to the displacement vectors of atoms i and j, respectively. The first and second terms represent the averaged mean-square fluctuations of atoms i and j around their equilibrium positions and the third term ⟨Δri·Δrj⟩ denotes their cross-correlation. When the residual χED‑CG2 is optimized by employing the algorithms such as SASD to get its minimum, the CG division for the fine-grained model is then determined. In the FM-CG method,42 a residual χFM‑CG2 is defined as the sum of mean-square distance fluctuations (MSDFs) of pairwise atoms i and j between all CG sites, with its expression shown in eq 2: N

χFM ‐ CG 2 =

N

∑ ∑ (∑ ∑ ⟨Δsij 2⟩) I=1 J>I

i∈I j∈J

(2)

where N is the total number of CG sites, i and j represent the atoms belong to the ith and jth CG sites, respectively. The notation ⟨Δsij2⟩ denotes the averaged MSDF between the pairwise i and j, the term ∑i∈I∑j∈J⟨Δsij2⟩ represents the sum of MSDFs between the pairwise CG sites and χFM‑CG2 counts the total MSDFs of N CG sites. Opposite to the residual χED‑CG2, χFM‑CG2 possesses a maximum of the function defined in eq 2. Thus, the coarse-graining by FM-CG is actually a process of maximizing eq 2. We have developed the SLIO algorithm to 215

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling

Figure 2. Schematic illustration of the principles of SASD, SOBC, and SLIO algorithms. (a) In SASD, the protein sequence is divided into N segments by the randomly generated N − 1 boundaries. The simulated annealing is used to locate the global minimum of target function, which is followed by a steepest descent optimization. (b) In SOBC, a first boundary b1 is optimized and located with the steepest descent algorithm. Then, the b2 is optimized with b1 fixed and the b2 is determined by the smaller residual. The N − 1 boundaries are located step by step in such a way. (c) In SLIO, the target function is scanned by evaluating all possible values and the position of b1 corresponds to the extremum. In this manner, b2 also can be determined with b1 fixed. Subsequently, an iterative optimization for the positions of b1 and b2 is performed until the maximum of target function is obtained.

SASD Algorithm. SASD38 is a combination of simulated annealing39 and steepest descent algorithms, which was originally used by Zhang et al. to optimize N CG sites. The principle of SASD is schematically illustrated in Figure 2a. In the process of simulated annealing, the N − 1 boundaries are randomly generated, with an estimated χnew2. Whether χnew2 is rejected or accepted is determined by Metropolis51 criterion 2 with the probability e−Δχ /T, where Δχ2 = χnew2 − χold2 and T is a pseudotemperature to control the ratio of acceptance. The temperature T is initially assigned with a reasonable value and then reduces gradually, until the minimum of target function is located. Subsequently, the steepest descent optimization is performed with 2000 steps to ensure better results obtained further. In order to sample sufficiently and accurately, SASD optimization needs to be executed repeatedly for samplings and the default number is set to be 100 in our work. Finally, the CG division with the minimal residual value is chosen as the CG representation of target system. SOBC Algorithm. In contrast to SASD, SOBC41 is a boundary-constrained optimization algorithm that attempts to determine the N − 1 boundaries stepwise. As shown in Figure 2b, the first boundary b1 is generated randomly to divide the sequence into two segments and its optimal position is determined by executing a steepest descent optimization. Next, the temporary boundaries b2′ and b2″ for b2 are generated

maximize the residual in eq 2, and more details are provided in the previous study by Li et al.42 Displacement Vectors and Distance Fluctuations. When construct the CG sites with low resolution, a finegrained model on which it is based needs to be defined. By using ED-CG for coarse-graining, the displacement vectors of atoms in the fine-grained model should be calculated accurately. In ED-CG, we choose the anisotropic network model (ANM)22 as the fine-grained model to represent the coarse-grained biological systems. The details about ANM are further provided in the section S1 of the Supporting Information. The displacement vectors of Cα atoms in ANM are obtained by performing the normal-mode analysis46 on ANM. The mean-square distance fluctuations ⟨Δsij2⟩ between the pairwise Cα atoms i and j used in FM-CG could be also evaluated from the eigenvectors generated from the normalmode analysis on ANM, by employing the source code written by Brooks et al.47 Alternatively, ⟨Δsij2⟩ can be also evaluated from all-atom MD simulations numerically, which is analogous to what have been done by Lyman et al.48 In this work, an allatom MD simulation for Ras protein is carried out in the NPT ensembles with the Gromacs 4.5.5 software.49,50 The Cartesian coordinates of all Cα atoms in Ras are extracted from the MD trajectory for the evaluation of ⟨Δsij2⟩ that is used for coarsegraining in FM-CG. 216

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling and then optimized under the condition of b1 being fixed. At this step, the residual values χ′b22 and χ″b2 2 for the divisions determined by b2′ and b2″ are evaluated and the final position of b2 is determined by the smaller one. Subsequently, the next boundary b3 is optimized, with both b1 and b2 fixed. The process mentioned above needs to be repeated until all the N-1 boundaries are obtained. It is emphasized that, if SOBC is used for the optimization of eq 1 in ED-CG and eq 2 in FM-CG, the increments of eqs 1 and 2 are always negative and positive respectively, as a certain CG site is divided into two parts. Thus, the SOBC algorithm can guarantee that the residuals in eqs 1 and 2 vary monotonically with the progression of number of CG sites. Because SOBC needs not to optimize N − 1 variables simultaneously, its computational cost decreases drastically compared to that of SASD, especially for large system such as IgG1 b12. SLIO Algorithm. The SLIO algorithm42 can be regarded as an improved version of SOBC. It also follows the idea of stepwise optimization similar to SOBC. The first step of SLIO in Figure 2c shows that all the possibilities of b1 that divides the sequence into two segments are considered by moving along the sequence and evaluating the corresponding residual values. The final position of b1 is precisely determined by the maximum of function. In the second step, b2 is first temporarily determined by the scanned maximum with b1 fixed. Then, SLIO optimizes the positions of boundaries b1 and b2 iteratively by taking into account their correlations.42 The positions of b1 and b2 keep on changing, until the value of target function reaches its maximum. In such a way, the total N − 1 boundaries for N CG sites could be determined step by step eventually. Also, we emphasize that SLIO algorithm optimizes these boundaries by increasing the residual value of eq 2 in FM-CG or decreasing that of eq 1 in ED-CG, so that the optimized residual values of FM-CG and ED-CG evolve monotonically with respect to the number of CG sites. We have implemented the SASD, SOBC, and SLIO algorithms with the ED-CG method by means of using C++ language, as well as SLIO based on FM-CG.

Figure 3a presents the results of ED-CG and FM-CG for the Gactin monomer that is coarse-grained into 40 CG sites. In

Figure 3. (a) Residuals ΔχED‑CG2 and ΔχFM‑CG2 vary with the number of CG sites. (inset) Calculated residual values in the first three steps of minimization by SLIO based on ED-CG. The arrows indicate the positions of minima located by SLIO at each step. (b) Increments ΔχED‑CG2 and ΔχFM‑CG2 vary with the number of CG sites. (inset) 40 CG sites derived by SLIO based ED-CG.



RESULTS AND DISCUSSION Convergence of FM-CG and ED-CG. In the previous studies, Zhang et al.38,40 and Li et al.41,42 derived the CG representations of biomolecules using SASD based on ED-CG or SLIO based on FM-CG, respectively. It should be noted that the number of CG sites N has to be predefined in employing SASD based on ED-CG. In most practical applications, the definition of N is somewhat arbitrary and its magnitude depends on the target systems studied by researchers. However, the number N does not need to be defined in SLIO based on FM-CG. The advantage of SLIO algorithm lies in that, it could generate the relationship of the residual values varying with the CG number N, which exhibits us a curve in which the function values approach a maximum as the CG number N increases. Based on this relationship, an optimal number n* could be defined for the coarse-grained system instead of an arbitrary N, whereas SASD cannot derive this relationship since SASD is difficult to get a converged result and also expensive even to perform once for a specific N. An interesting question about ED-CG is that, how about the performance of ED-CG if it works with SLIO? To explore it, we performed coarse-graining tests on the G-actin monomer52,53 by using SLIO based on ED-CG and FM-CG, respectively.

contrast to the maximization of eq 2 in FM-CG, SLIO executes the minimization of eq 1 in ED-CG. The solid black squares and red dots show how the residual values of ED-CG and FMCG vary with the increased N. The inset in Figure 3a shows the evaluated residual values of ED-CG in the first three steps of optimization by SLIO. In the first step, SLIO evaluates all possible values of eq 1 and locates the minimum at b1, as indicated by the black arrow. Next step, SLIO optimizes the position of b2 and b1 iteratively, which leads to the new position of b1, as pointed by the red arrow. Similarly, the third boundary b3 could be located iteratively by considering the correlations between b1, b2, and b3. The minimization processes continue in such a way until the total 39 boundaries along the protein sequence are determined. Similarly for FM-CG, the coarsegraining process is a maximization of eq 2, where the details have been reported in the previous study.42 Figure 3b shows how the increments of residuals of eqs 1 and 2 vary with the number of CG sites, denoted as ΔχED‑CG2 and ΔχFM‑CG2 respectively. Both the black and red curves indicate that the increments of ΔχED‑CG2 and ΔχFM‑CG2 change slightly after the number of CG sites is beyond a certain number. For 217

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling example, the increments of residuals approach zero beyond N = 20. Both of the two functions of eqs 1 and 2 exhibit the good behavior of convergence as N increases. Thus, it is also possible to define an optimal number n* of CG sites in ED-CG by setting a threshold using a ratio of Δχ2/χ2 to control the coarsegraining process, analogous to what we have done in FM-CG. For example, if the threshold for the convergence of ED-CG is set as 0.01%, we then obtained the 40-site model for G-actin, with its cartoon representation shown in the inset of Figure 3b. The test demonstrates that the combination of SLIO instead of SASD with ED-CG makes the application of the ED-CG to large systems possible. Another coarse-graining test on the large protein Formolase54 by using SLIO based on ED-CG is shown in Figure S1 of the Supporting Information. Comparison of Iterative Values of SLIO and SOBC in Optimization. Besides SASD and SLIO, there exists another coarse-graining algorithm called SOBC41 that was proposed by Xia and co-workers previously. The common feature of SOBC and SLIO is that both of them follow the idea of optimizing N CG sites in a stepwise way. It was expected that the optimized results of SLIO should be better than that of SOBC, due to two remarkable aspects: first, the positions of old boundaries located by SOBC in last steps are kept fixed in the new steps, whereas SLIO performs an iterative optimization between them to get more optimal values. Second, the accuracy of SOBC relies on the random samplings, but SLIO can locate the extrema during the optimization processes in FM-CG and EDCG. In order to show the difference between the SOBC and SLIO clearly, we plot the optimized results of six proteins53,55−59 of first ten steps of SOBC and SLIO in Figure 4 and make a direct

regularly for the proteins with different sizes. For the large proteins such as 3L76,58 4XEI,59 and 3J8I,55 the optimized residual values appear to be slightly larger than that of SLIO, which means that SOBC is less efficient in locating the minima than SLIO for large proteins. The reason might be the fact that the potential energy surfaces of large proteins are rugged and complicated, so that it is difficult to locate the global minima of target functions. Thus, SLIO is the better choice than SOBC for larger systems. Comparison of Optimized Results and Time Cost. Further, we wonder how about the results of SOBC and SLIO compared to that of SASD? It is known that SASD is a global optimization method, while both SOBC and SLIO are probably categorized as constrained methods. In principle, SASD should be better than SOBC and SLIO in searching for the extrema of optimized functions. However, the previous results of the IgG1 b12 protein optimized by SOBC appear to be better than those of SASD. It raises the question which algorithm is the best among SASD, SOBC and SLIO for optimization? In order to answer it, we coarse-grained a series of proteins using the three algorithms above based on ED-CG and compared their optimal values with each other. In order to make the comparison clear, we set the values of SLIO as the references and calculated the relative errors of SASD and SOBC to them. Figure 5a−f shows the calculated relative errors of SOBC and SASD for six proteins respectively, with the different CG divisions from N = 4−45. The detailed values optimized by SASD, SOBC, and SLIO for the example of F-actin with N = 5 are listed in the Table S1 of the Supporting Information. It can be judged that the positive relative errors indicate the larger values obtained by SASD or SOBC, while the negative values mean smaller ones. At the first glance, it can be seen that all the calculated relative errors of SOBC and SASD for each protein are almost positive, which indicate that SLIO yields the better optimization results. With the more CG sites to be divided, the relative errors of SASD become larger and larger, increasing linearly with N. SOBC also performs better than SASD in optimization almost for all the adopted numbers N, with the values slightly larger than SLIO. However, the exceptions exist among the optimized results. For instance, we found that the results of SASD and SOBC for N = 4 in Figure 5b are better than that for SLIO, as well as the result of SOBC for N = 30 in Figure 5c. The computational cost of coarse-graining is another important property to verify an efficient method except the optimized values mentioned above. Table 1 lists the counted computational cost for SASD, SOBC, and SLIO to coarse-grain six proteins into 20 CG sites randomly, respectively. It should be noticed that SLIO only needs to be performed once in principle, while both SOBC and SASD require a great number of samplings. The data in Table 1 reveal that SASD is the most inefficient method among them. Even for the smallest G-actin with PDB ID 1ATN, it takes more than 1 min to get 20 CG sites. For larger systems, the time-consuming is increasing linearly. For instance, the F-actin with PDB ID 3J8I consumes nearly 26 min to finish the job of CG division, which is so expensive compared to other two algorithms SOBC and SLIO. SOBC speeds up the coarse-graining process faster than SASD by several times. Even the cost of F-actin is less than 8 min. SLIO is the fastest algorithm among them and faster than SASD by 2 orders of magnitude. For the F-actin, the coarsegraining only takes 2.4 s by SLIO.

Figure 4. Calculated differences χSOBC2 − χSLIO2 in each iteration of SOBC and SLIO as coarse-graining six proteins into 20 CG sites. (inset) Detailed PDB IDs of tested proteins.

comparison with each other. Here, we set the results of SLIO as references and plotted the differences χSOBC2 − χSLIO2 with the CG number N. At the first glance, we notice that the differences χSOBC2 − χSLIO2 in each step for each protein are always greater than zero, which mean that the values optimized by SOBC are always greater than these obtained by SLIO. It demonstrates the fact that SLIO did yield better results than SOBC in the iterations due to the correlated optimization of boundaries adopted in SLIO. The variations in the curves indicate that the differences χSOBC2 − χSLIO2 for every protein do not change 218

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling

Figure 5. Calculated relative errors of optimized values for the six proteins with their PDB IDs (a) 1ATN,53 (b) 4XE7,56 (c) 5ET0,57 (d) 3L76,58 (e) 4XEI,59 and (f) 3J8I,55 at different CG numbers. The relative errors are calculated as (χi2 − χSLIO2)/χSLIO2 (i = SASD or SOBC), where χi2 denotes the best result among the one hundred samplings of SASD or SOBC based on ED-CG. The hollow red dots and black squares denote the data of SASD and SOBC, with their linear fittings in red and black, respectively.

1. For instance, the best value of SOBC for 1ATN is 0.62 nm2 in Table 1 compared to 0.60 nm2 of SLIO, while the best of SASD is 0.74 nm2. The further tests indicate that the enhancement with 1000 samplings of SASD leads to the slightly better value 0.73 nm2, still much less accurate than that of SOBC. The comprehensive comparison of optimal values as well as the time cost in Table 1 demonstrates that SLIO is the best method among them, with the least time cost and the most accurate results. Definition of CG Sites from MD Fluctuations. In the previous study,42 we constructed the CG sites using the MSDFs evaluated from normal-mode analysis on ANM. ANM is actually a structure-based model which cannot account for the anharmonic effect of atoms far away from equilibrium. In this work, we attempt to construct the CG sites directly using the MSDFs evaluated from MD trajectories. The MD trajectories incorporate the anharmonic effects of atoms and could yield the more realistic CG division. We chose the native structure of the oncogenic Ras protein that has 166 residues as a model system,60 and performed the NPT simulation to get a 100 ns equilibrium trajectory. Then, we calculated the MSDFs

Table 1. Computational Costs of Coarse-Graining Six Proteins into 20 CG Sites, by Using SLIO, SOBC, and SASD Algorithms Based on ED-CGa PDB IDs

number of residues

SLIO

SOBC

SASD

1ATN53 4XE756 5ET057 3L7658 4XEI59 3J8I55

371 419 693 1190 1721 1875

0.3(0.60) 0.4(0.85) 0.9(1.60) 1.4(6.53) 2.0(9.78) 2.4(12.14)

8.4(0.62) 11.1(0.86) 35.6(1.61) 170.5(6.68) 395.8(10.02) 482.7(12.36)

82.3(0.74) 99.2(1.06) 226.5(1.96) 799.5(9.422) 1311.0(12.36) 1595.9(14.99)

The first two columns list the PDB IDs of six proteins and their corresponding number of residues. The estimated computational costs are in the units of seconds and followed by the minimal values shown in the parentheses. Notice that SLIO is only performed once in principle, while SOBC and SASD are run with 100 samplings.

a

On the other hand, we found that even if we performed 100 times samplings by SOBC and SASD, the best optimized results obtained by SOBC and SASD for each protein are still worse than SLIO, as shown by the values in the parentheses of Table 219

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling

Figure 6. (a) Comparison of CG division by FM-CG and the secondary structure division by the DSSP program. The first row shows the consecutive numbering of derived 20 three CG sites and each rectangle in the second row means a residue of Ras protein. The different colors denote the different secondary structures such as α-helices, β-sheets, and other loops or turns which are defined in the DSSP program. (b) Yellow CG sites and the cartoon representation of Ras protein, with the switch I and switch II regions colored in green and red, respectively. (c) Yellow arrows denote the first primary principle component calculated from the principle component analysis on the MD trajectory of Ras.

60−63 in the switch II are represented by CG9, and residues 64−67 are grouped into the bead of CG10. The CG results based on MD fluctuations demonstrate that the CG sites constructed by FM-CG accurately reflect the functionally relevant fluctuations in Ras, as well as preserve the important information on secondary structures.

numerically from the Ras trajectory and derived the CG sites for Ras protein by FM-CG. To verify the validity of CG sites reflecting relative fluctuations, we performed the secondary structure analysis on Ras with the DSSP program61 and compared it with the derived CG division. Figure 6a shows that the optimal number of CG sites to represent the Ras protein is n* = 23. We notice that nearly each crucial secondary structure in Ras protein such as the α-helices, β-sheets, and other turns or loops explicitly correspond to a or more CG sites. For example, the first β-sheet, loop, and α-helix with the residues 2−25 are mainly coarse-grained as CG1, CG2, and CG3, respectively, which indicate that these derived CG sites reflect the important fluctuations incorporated in the allatom trajectory. In the analysis of Ras, we mainly focus on the crucial functional regions switch I and switch II with the residues 30−36 and 61−72, respectively. The switch I and switch II correspond to the green loop and red α-helix shown in Figure 6b. They are coarse-grained into the beads CG5, CG9, CG10, and CG11. In Ras, the regions of switch I, switch II and the residues 11−15 constitute the active pocket for GTP hydrolysis and undergo remarkable conformational changes as executing its function. In order to show the motions of secondary structure in Ras clearly, we performed the principle component analysis on the 100 ns trajectory of Ras and plotted the first primary principle component (PC) in Figure 6c. The arrows of PC in switch I and switch II of Figure 6c show their motion directions, which are in good agreement with the experimental observations. For instance, the loop in switch I fluctuates drastically in solution to switch Ras between the “on” and “off” states for signaling.62 The switch II also moves off the active center after the GTP hydrolysis occurs. Correspondingly in Figure 6b, the residues 32−35 in the switch I are coarse-grained into CG5, residues



CONCLUSIONS In this work, we apply the SASD, SOBC, and SLIO algorithms based on the ED-CG method to coarse-grain large systems and make a comprehensive comparison of their optimized results and accuracy. The coarse-graining test on G-actin monomer shows that SLIO based on ED-CG yields the converged results as the number of CG sites increases, analogous to the convergence exhibited by SLIO based on FM-CG. However, it is noted that the residual defined in FM-CG approaches a maximum in the infinity of N, whereas the residual in ED-CG reaches a minimum. Similarly, the optimal number of CG sites can be derived in the optimization process. The test on G-actin monomer manifests that, both ED-CG and FM-CG combined with SLIO are systematically efficient and fast methods for coarse-graining large biomolecules. Further, we compare the residual values optimized by SOBC and SLIO in the iterations. The comparison of results in each iteration of optimization indicates that the optimized residuals by SLIO are always smaller than these obtained by SOBC, since the strong correlations of boundaries have been considered in the iterative optimization of SLIO. The comprehensive comparison of optimized results by SASD, SOBC, and SLIO over six proteins reveals that SLIO is the most accurate algorithm for locating the minima of residuals and SOBC is the second accurate one. The comparison of computational cost of 220

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling

(3) Peter, C.; Kremer, K. Multiscale Simulation of Soft Matter Systems − from the Atomistic to the Coarse-Grained Level and Back. Soft Matter 2009, 5, 4357−4366. (4) Riniker, S.; Allison, J. R.; van Gunsteren, W. F. On Developing Coarse-Grained Models for Biomolecular Simulation: A Review. Phys. Chem. Chem. Phys. 2012, 14, 12423−12430. (5) Sherwood, P.; Brooks, B. R.; Sansom, M. S. Multiscale Methods for Macromolecular Simulations. Curr. Opin. Struct. Biol. 2008, 18, 630−640. (6) Trylska, J. Coarse-Grained Models to Study Dynamics of Nanoscale Biomolecules and Their Applications to the Ribosome. J. Phys.: Condens. Matter 2010, 22, 453101. (7) Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A. E.; Kolinski, A. Coarse-Grained Protein Models and Their Applications. Chem. Rev. 2016, 116, 7898−7936. (8) Takada, S. Coarse-Grained Molecular Simulations of Large Biomolecules. Curr. Opin. Struct. Biol. 2012, 22, 130−137. (9) Saunders, M. G.; Voth, G. A. Coarse-Graining of Multiprotein Assemblies. Curr. Opin. Struct. Biol. 2012, 22, 144−150. (10) Izvekov, S.; Voth, G. A. A Multiscale Coarse-Graining Method for Biomolecular Systems. J. Phys. Chem. B 2005, 109, 2469−2473. (11) Hills, R. D., Jr.; Brooks, C. L., 3rd Insights from Coarse-Grained Go Models for Protein Folding and Dynamics. Int. J. Mol. Sci. 2009, 10, 889−905. (12) Chu, J. W.; Voth, G. A. Coarse-Grained Free Energy Functions for Studying Protein Conformational Changes: A Double-Well Network Model. Biophys. J. 2007, 93, 3860−3871. (13) Baaden, M.; Marrink, S. J. Coarse-Grain Modelling of ProteinProtein Interactions. Curr. Opin. Struct. Biol. 2013, 23, 878−886. (14) Monticelli, L.; Kandasamy, S. K.; Periole, X.; Larson, R. G.; Tieleman, D. P.; Marrink, S. J. The Martini Coarse-Grained Force Field: Extension to Proteins. J. Chem. Theory Comput. 2008, 4, 819− 834. (15) Marrink, S. J.; Risselada, H. J.; Yefimov, S.; Tieleman, D. P.; De Vries, A. H. The Martini Force Field: Coarse Grained Model for Biomolecular Simulations. J. Phys. Chem. B 2007, 111, 7812−7824. (16) Lopez, C. A.; Bellesia, G.; Redondo, A.; Langan, P.; Chundawat, S. P.; Dale, B. E.; Marrink, S. J.; Gnanakaran, S. Martini CoarseGrained Model for Crystalline Cellulose Microfibers. J. Phys. Chem. B 2015, 119, 465−473. (17) Rader, A. J. Coarse-Grained Models: Getting More with Less. Curr. Opin. Pharmacol. 2010, 10, 753−759. (18) Wu, C.; Shea, J.-E. Coarse-Grained Models for Protein Aggregation. Curr. Opin. Struct. Biol. 2011, 21, 209−220. (19) Saunders, M. G.; Voth, G. A. Coarse-Graining Methods for Computational Biology. Annu. Rev. Biophys. 2013, 42, 73−93. (20) Morriss-Andrews, A.; Shea, J.-E. Simulations of Protein Aggregation: Insights from Atomistic and Coarse-Grained Models. J. Phys. Chem. Lett. 2014, 5, 1899−1908. (21) Zhang, Z.; Sanbonmatsu, K. Y.; Voth, G. A. Key Intermolecular Interactions in the E. Coli 70s Ribosome Revealed by Coarse-Grained Analysis. J. Am. Chem. Soc. 2011, 133, 16828−16838. (22) Atilgan, A. R.; Durell, S. R.; Jernigan, R. L.; Demirel, M. C.; Keskin, O.; Bahar, I. Anisotropy of Fluctuation Dynamics of Proteins with an Elastic Network Model. Biophys. J. 2001, 80, 505−515. (23) Bahar, I.; Lezon, T. R.; Bakan, A.; Shrivastava, I. H. Normal Mode Analysis of Biomolecular Structures: Functional Mechanisms of Membrane Proteins. Chem. Rev. 2010, 110, 1463−1497. (24) Bahar, I.; Lezon, T. R.; Yang, L. W.; Eyal, E. Global Dynamics of Proteins: Bridging between Structure and Function. Annu. Rev. Biophys. 2010, 39, 23−42. (25) Haliloglu, T.; Bahar, I.; Erman, B. Gaussian Dynamics of Folded Proteins. Phys. Rev. Lett. 1997, 79, 3090−3093. (26) Kondrashov, D. A.; Cui, Q.; Phillips, G. N., Jr. Optimization and Evaluation of a Coarse-Grained Model of Protein Motion Using X-Ray Crystal Data. Biophys. J. 2006, 91, 2760−2767. (27) Tirion, M. M. Large Amplitude Elastic Motions in Proteins from a Single-Parameter, Atomic Analysis. Phys. Rev. Lett. 1996, 77, 1905− 1908.

them reveals that SLIO is much faster than SOBC and SASD in coarse-graining, which highlights the speed of SLIO. Finally, we construct the CG sites by means of using the fluctuations generated from MD trajectory in FM-CG. The pairwise fluctuations of atoms evaluated numerically from MD trajectory incorporate the anharmonic effects of dynamics of proteins in solution, which is more realistic and reasonable than that calculated from ANM. The derived 23-site model for Ras reflects its characters of secondary structures properly, especially for the reasonable CG representation of the functionally important regions of switch I and switch II. Therefore, the test on Ras demonstrates that the CG representation of biomolecules should reflect the properties of all-atom model accurately, rather than being determined arbitrarily and empirically. In particular, the accurate CG representation of secondary structures of proteins should be considered as constructing the corresponding CG model for the description of structurally related functions of proteins. Through the comprehensive comparison of the coarsegraining methods FM-CG and ED-CG as well as the SASD, SOBC, and SLIO algorithms, we draw the conclusions that both FM-CG and ED-CG methods could give rise to physically meaningful results for CG division efficiently, as they are combined with the SLIO algorithm. With the robustness of SLIO, we can obtain the optimal CG representation for target systems. Further, we suggest that the CG sites derived from FM-CG method can be parametrized using the multiscale fluctuation matching method to construct the low-resolution CG models for the functional study of large biological systems.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00683. Details of ANM; coarse-grained results obtained by SLIO based on ED-CG; optimized function values by SASD, SOBC, and SLIO for F-actin (PDF)



AUTHOR INFORMATION

Corresponding Author

*Tel: +86-(0)21-20596009. E-mail: [email protected]. ORCID

Fei Xia: 0000-0001-9458-9175 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was supported by the National Natural Science Foundation of China (Grants No. 21473056, 21433004, 21373164, and 21133007) and Natural Science Foundation of Shanghai (14ZR1411800). We acknowledge the support of the NYU-ECNU Center for Computational Chemistry at NYU Shanghai. We also thank the supercomputer center of ECNU for providing computer time.



REFERENCES

(1) Clementi, C. Coarse-Grained Models of Protein Folding: Toy Models or Predictive Tools? Curr. Opin. Struct. Biol. 2008, 18, 10−15. (2) Kamerlin, S. C.; Vicatos, S.; Dryga, A.; Warshel, A. CoarseGrained (Multiscale) Simulations in Studies of Biophysical and Chemical Systems. Annu. Rev. Phys. Chem. 2011, 62, 41−64. 221

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222

Article

Journal of Chemical Information and Modeling (28) Xia, F.; Tong, D.; Lu, L. Robust Heterogeneous Anisotropic Elastic Network Model Precisely Reproduces the Experimental BFactors of Biomolecules. J. Chem. Theory Comput. 2013, 9, 3704− 3714. (29) Xia, F.; Tong, D.; Yang, L.; Wang, D.; Hoi, S. C.; Koehl, P.; Lu, L. Identifying Essential Pairwise Interactions in Elastic Network Model Using the Alpha Shape Theory. J. Comput. Chem. 2014, 35, 1111− 1121. (30) Arkhipov, A.; Freddolino, P. L.; Schulten, K. Stability and Dynamics of Virus Capsids Described by Coarse-Grained Modeling. Structure 2006, 14, 1767−1777. (31) Dunn, N. J.; Noid, W. G. Bottom-up Coarse-Grained Models That Accurately Describe the Structure, Pressure, and Compressibility of Molecular Liquids. J. Chem. Phys. 2015, 143, 243148. (32) Rudzinski, J. F.; Noid, W. G. Investigation of Coarse-Grained Mappings Via an Iterative Generalized Yvon-Born-Green Method. J. Phys. Chem. B 2014, 118, 8295−8312. (33) Marrink, S. J.; De Vries, A. H.; Mark, A. E. Coarse Grained Model for Semiquantitative Lipid Simulations. J. Phys. Chem. B 2004, 108, 750−760. (34) Chu, J. W.; Voth, G. A. Allostery of Actin Filaments: Molecular Dynamics Simulations and Coarse-Grained Analysis. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 13111−13116. (35) Chu, J. W.; Voth, G. A. Coarse-Grained Modeling of the Actin Filament Derived from Atomistic-Scale Simulations. Biophys. J. 2006, 90, 1572−1582. (36) Pfaendtner, J.; Lyman, E.; Pollard, T. D.; Voth, G. A. Structure and Dynamics of the Actin Filament. J. Mol. Biol. 2010, 396, 252−263. (37) Sinitskiy, A. V.; Saunders, M. G.; Voth, G. A. Optimal Number of Coarse-Grained Sites in Different Components of Large Biomolecular Complexes. J. Phys. Chem. B 2012, 116, 8363−8374. (38) Zhang, Z.; Lu, L.; Noid, W. G.; Krishna, V.; Pfaendtner, J.; Voth, G. A. A Systematic Methodology for Defining Coarse-Grained Sites in Large Biomolecules. Biophys. J. 2008, 95, 5073−5083. (39) Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P. Optimization by Simmulated Annealing. Science 1983, 220, 671−680. (40) Zhang, Z.; Pfaendtner, J.; Grafmuller, A.; Voth, G. A. Defining Coarse-Grained Representations of Large Biomolecules and Biomolecular Complexes from Elastic Network Models. Biophys. J. 2009, 97, 2327−2337. (41) Li, M.; Zhang, J. Z.; Xia, F. A New Algorithm for Construction of Coarse-Grained Sites of Large Biomolecules. J. Comput. Chem. 2016, 37, 795−804. (42) Li, M.; Zhang, J. Z.; Xia, F. Constructing Optimal CoarseGrained Sites of Huge Biomolecules by Fluctuation Maximization. J. Chem. Theory Comput. 2016, 12, 2091−2100. (43) Downward, J. Targeting Ras Signalling Pathways in Cancer Therapy. Nat. Rev. Cancer 2003, 3, 11−22. (44) Goodsell, D. S. The Molecular Perspective: The Ras Oncogene. Stem Cells 1999, 17, 235−236. (45) Malumbres, M.; Barbacid, M. Ras Oncogenes: The First 30 Years. Nat. Rev. Cancer 2003, 3, 459−465. (46) Bahar, I.; Rader, A. J. Coarse-Grained Normal Mode Analysis in Structural Biology. Curr. Opin. Struct. Biol. 2005, 15, 586−592. (47) Brooks, B. R.; Janežič, D.; Karplus, M. Harmonic Analysis of Large Systems. I. Methodology. J. Comput. Chem. 1995, 16, 1522− 1542. (48) Lyman, E.; Pfaendtner, J.; Voth, G. A. Systematic Multiscale Parameterization of Heterogeneous Elastic Network Models of Proteins. Biophys. J. 2008, 95, 4183−4192. (49) Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. Gromacs: Fast, Flexible, and Free. J. Comput. Chem. 2005, 26, 1701−1718. (50) Pronk, S.; Pall, S.; Schulz, R.; Larsson, P.; Bjelkmar, P.; Apostolov, R.; Shirts, M. R.; Smith, J. C.; Kasson, P. M.; van der Spoel, D.; Hess, B.; Lindahl, E. Gromacs 4.5: A High-Throughput and Highly Parallel Open Source Molecular Simulation Toolkit. Bioinformatics 2013, 29, 845−854.

(51) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087−1092. (52) Graceffa, P.; Dominguez, R. Crystal Structure of Monomeric Actin in the Atp State. Structural Basis of Nucleotide-Dependent Actin Dynamics. J. Biol. Chem. 2003, 278, 34172−34180. (53) Kabsch, W.; Mannherz, H. G.; Suck, D.; Pai, E. F.; Holmes, K. C. Atomic Structure of the Actin:Dnase I Complex. Nature 1990, 347, 37−44. (54) Siegel, J. B.; Smith, A. L.; Poust, S.; Wargacki, A. J.; Bar-Even, A.; Louw, C.; Shen, B. W.; Eiben, C. B.; Tran, H. M.; Noor, E.; Gallaher, J. L.; Bale, J.; Yoshikuni, Y.; Gelb, M. H.; Keasling, J. D.; Stoddard, B. L.; Lidstrom, M. E.; Baker, D. Computational Protein Design Enables a Novel One-Carbon Assimilation Pathway. Proc. Natl. Acad. Sci. U. S. A. 2015, 112, 3704−3709. (55) Galkin, V. E.; Orlova, A.; Vos, M.; Schröder, G.; Egelman, E. Near-Atomic Resolution for One State of F-Actin. Structure 2015, 23, 173−182. (56) Jiang, S.; Narita, A.; Popp, D.; Ghoshdastider, U.; Lee, L. J.; Srinivasan, R.; Balasubramanian, M. K.; Oda, T.; Koh, F.; Larsson, M.; Robinson, R. C. Novel Actin Filaments from Bacillus Thuringiensis Form Nanotubules for Plasmid DNA Segregation. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, E1200−E1205. (57) Liu, H.; Li, J.; Raval, M. H.; Yao, N.; Deng, X.; Lu, Q.; Nie, S.; Feng, W.; Wan, J.; Yengo, C. M.; Liu, W.; Zhang, M. Myosin IiiMediated Cross-Linking and Stimulation of Actin Bundling Activity of Espin. eLife 2016, 5, e12856. (58) Robin, A. Y.; Cobessi, D.; Curien, G.; Robert-Genthon, M.; Ferrer, J.-L.; Dumas, R. A New Mode of Dimerization of Allosteric Enzymes with Act Domains Revealed by the Crystal Structure of the Aspartate Kinase from Cyanobacteria. J. Mol. Biol. 2010, 399, 283− 293. (59) Robinson, R. C.; Turbedsky, K.; Kaiser, D. A.; Marchand, J.-B.; Higgs, H. N.; Choe, S.; Pollard, T. D. Crystal Structure of Arp2/3 Complex. Science 2001, 294, 1679−1684. (60) Rosnizeck, I. C.; Spoerner, M.; Harsch, T.; Kreitner, S.; Filchtinski, D.; Herrmann, C.; Engel, D.; Konig, B.; Kalbitzer, H. R. Metal-Bis(2-Picolyl)Amine Complexes as State 1(T) Inhibitors of Activated Ras Protein. Angew. Chem., Int. Ed. 2012, 51, 10647−10651. (61) Kabsch, W.; Sander, C. Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22, 2577−2637. (62) Kalbitzer, H. R.; Spoerner, M.; Ganser, P.; Hozsa, C.; Kremer, W. Fundamental Link between Folding States and Functional States of Proteins. J. Am. Chem. Soc. 2009, 131, 16714−16719.

222

DOI: 10.1021/acs.jcim.6b00683 J. Chem. Inf. Model. 2017, 57, 214−222