Prediction of Protein Configurational Entropy ... - ACS Publications

tion/evaluation in which configurational entropy is currently neglected for performance reasons. Software of ... Consequently, software tools based on...
0 downloads 10 Views 2MB Size
Subscriber access provided by READING UNIV

Article

Prediction of Protein Configurational Entropy (Popcoen) Martin Goethe, Jan Gleixner, Ignacio Fita, and J. Miguel Rubi J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.7b01079 • Publication Date (Web): 19 Jan 2018 Downloaded from http://pubs.acs.org on January 22, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Prediction of Protein Configurational Entropy (Popcoen) Martin Goethe,∗,† Jan Gleixner,‡ Ignacio Fita,¶ and J. Miguel Rubi Department of Condensed Matter Physics, University of Barcelona, Carrer Mart´ı i Franqu´es 1, Barcelona, Spain, Faculty of Biosciences, Heidelberg University, Im Neuenheimer Feld 234, 69120 Heidelberg, Germany, and Molecular Biology Institute of Barcelona (IBMB-CSIC, Maria de Maeztu Unit of Excellence), Carrer Baldiri Reixac 4-8, 08028 Barcelona, Spain E-mail: [email protected]



To whom correspondence should be addressed University of Barcelona ‡ Heidelberg University ¶ Molecular Biology Institute of Barcelona (IBMB-CSIC) †

1

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Abstract TOC picture:

A knowledge-based method for configurational entropy prediction of proteins is presented which is extremely fast compared to previous approaches because it does not involve any kind of configurational sampling. Instead, the configurational entropy of a query fold is estimated by evaluating an artificial neural network which was trained on molecular-dynamics simulations of about 1000 proteins. The predicted entropy can be incorporated into a large class of protein software based on cost-function minimization/evaluation in which configurational entropy is currently neglected for performance reasons. Software of this type is used for all major protein tasks such as structure predictions, proteins design, NMR and x-ray refinement, docking, and mutation effect predictions. Integrating the predicted entropy can yield a significant accuracy increase as we show exemplarily for native-state identification with the prominent protein software FoldX. The method has been termed Popcoen for Prediction of Protein Configurational Entropy. An implementation is freely available at http://fmc.ub.edu/popcoen/.

Introduction Predicting the protein structure from the amino-acid (AA) sequence is a challenging task, especially for larger proteins without sequence homologues in the pdb database. 1 For these cases, ab initio methods are employed to derive the protein structure from first principles. ˆ relating a A major class of such software is based on some cost (or scoring) function G given structure (or conformation) of the protein to a number that is intended to take lower 2

ACS Paragon Plus Environment

Page 2 of 29

Page 3 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

values for structures which are “closer” to the native state of the protein. 2 In a nutshell, these programs construct various trial conformations and select one of them as the predicted ˆ native fold, where the whole process is guided by G. These cost-function methods are based on a well-known statistical physics principle stating that the equilibrium state of a system is the one with lowest free energy G. 3 This ˆ exhibits its minimum for the native state if G ˆ represents a sufficiently guarantees that G accurate approximation to the free energy G of the protein in solvent. Following Lazaridis and Karplus, 4 one can decompose G into three major contributions, namely the average intramolecular energy, the average solvation free energy, and the configurational entropy ˆ They are mod(times (−T )). The first two contributions are usually considered within G. eled as functions of the average atom coordinates whereupon fluctuation-induced effects on the averages are ignored. 5 ˆ mainly In contrast, the third contribution, configurational entropy, is often neglected in G because this quantity depends strongly on information beyond the average protein structure. In detail, configurational entropy is given by Z Sconf = −kB

dx ρ(x) log ρ(x)

(1)

where the integral is performed over the entire configurational space. Here, x denotes the set of protein-atom coordinates (in a body fixed coordinate system), ρ(x) represents their joint probability distribution (in the presence of the solvent), and kB is the Boltzmann constant. 4 From Eq. (1), it follows that Sconf depends crucially on the spatial fluctuations of the protein ˆ since this quantity is atoms and their correlations. This information is not present in G ˆ is only a function of the average atom coordinates. Therefore, incorporating Sconf into G cumbersome. ˆ account for Sconf only rudiConsequently, software tools based on a cost-function G mentarily 6,7 or neglect it completely 8–13. This praxis might be justified for some specially

3

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

compact proteins for which configurational entropy plays a minor role, while it is highly problematic in general, since Sconf has major impact on the native-state selection of most proteins. 14–17 Other software does include more elaborated estimates of Sconf in its costˆ where Sconf is obtained via sampling of the configurational space in the vicinity function G of the given structure. 18,19 This, however, changes dramatically the runtime of such software and, consequently, its applicability since the sampling process involves many evaluations of the energy function instead of just one. ˆ Here, we suggest an alternative approach for incorporating configurational entropy into G where the missing information about fluctuations and correlations is estimated with probabilistic rules instead of by sampling. Let us illustrate this idea using an elementary example. It is well-known that an AA which is buried in the bulk of the protein usually exhibits weaker fluctuations than an AA on the surface of the protein, 20 mainly due to strong steric constraints in the bulk. For this reason, surface AA’s usually contribute stronger to Sconf ˆ which favors than buried AA’s and it might be reasonable to include an entropic term in G proteins having a large surface. This term could easily be expressed in terms of the available ˆ average atom coordinates and its incorporation would generally improve the accuracy of G. In this work, we performed a more elaborated approach along these lines, which alˆ From lows to exploit more complex fluctuation and correlation patterns for improving G. molecular-dynamics (MD) simulations of about 1000 proteins, we measured an estimator of configurational entropy, as well as various features of the protein structure (such as solvent accessibility, local density, and hydrogen bonding). With these data, we trained an artificial neural network (NN) on predicting Sconf from the features as illustrated in Fig. 1 (top panel). The NN can now be used to construct a knowledge-based estimator for Sconf from a query structure via feature measurement and NN evaluation (see bottom panel of Fig. 1). This ˆ of existing protein is extremely fast and allows to incorporate Sconf into cost-functions G software without compromising its runtime.

4

ACS Paragon Plus Environment

Page 4 of 29

Page 5 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 1: Workflow illustration. Top panel: From MD simulations of 961 proteins, we measured 108 structural features per amino acid of the protein structures, as well as the partial entropies {Si } which are the configurational entropies of the residues. With these data, we trained a neural network (termed Popcoen) on predicting the Si ’s from the features. Bottom panel: Popcoen allows to estimate configurational entropy Sconf just from a query structure via feature measurement and Popcoen evaluation. The approach is extremely fast as it does not involve any kind of configurational sampling.

Results We trained a NN with simulation data to predict configurational entropy of proteins. First, we outline the underlying entropy approximation, the measurements, and the training process. Then, we report a test for the usefulness of the NN for native-state identification of proteins.

5

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 29

Entropy approximation Configurational entropy Sconf is the 3Natoms − 6 dimensional integral of Eq. (1) for a protein containing Natoms atoms. Due to its high dimension, brute force evaluation from simulation data is intractable and approximations need to be employed to obtain a feasible expression. 21 Using well-known approaches 22–24 (with slight adaptations), we obtained the decomposition

Sconf ≈

Nres X

Si + const.

(2)

i=1

in terms of so-called “partial” entropies {Si } (i = 1 . . . Nres ) which can be related to the entropy contributions of the Nres residues (see Methods). The Si ’s are functions of specific marginal entropies of torsion angles and mutual informations which do not involve integrals of dimension larger than two, enabling their reliable measurement from simulations. It turns out that the decomposition given by Eq. (2) is extremely valuable for entropy predictions because local quantities are considerably easier to be predicted than global ones. Note that a similar decomposition property does not hold for Eq. (1). 25

Measurements from simulations We analyzed molecular dynamic simulations of 961 proteins which were available from a public database of MD trajectories (MoDEL-database 26 ). The simulation details and the data-selection process are outlined in Methods. From the simulation of each protein, we measured the partial entropies {Si } of all AA’s, as well as the centroid structure which is the snapshot of the simulation which is most similar to all other snapshots. From the centroid structures, we then derived 108 features per residue (see Methods). The features contain global properties of the protein such as Nres , the total solvent accessible surface area (SASA), the total number of hydrogen bonds, and properties of the gyration tensor, as well as local properties of the residue and its neighbors such as the residue type, the burial level, the local density, the relative SASA, the average torsion angles, and the number of hydrogen bonds. 6

ACS Paragon Plus Environment

Page 7 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 1 lists the coefficient of determination R2 between Si and a subset of features showing that the features are predictive for Si (complete list in Supporting Information (SI)). A particularly large value of R2 = 0.210(7) is found for the relative solvent accessibility of the residues. This confirms the illustrative example discussed in the Introduction: a properly defined surface term might be used as an estimator for Sconf of low accuracy. However, by combining the information of all features, we were able to construct a substantially better estimator (see below).

7

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 1: Coefficient of determination R2 = 1 − MSE/Var(Si ) between the partial entropies {Si } and selected features (or combinations of features) where MSE represents the mean squared error when the neural network was trained on predicting Si just from a single feature, and Var represents the variance. The full list for all features is given in SI. The R2 values are significantly larger than zero which proves that the features are predictive for Si . The most predictive feature is the relative SASA of the residues (relSASA) which alone allows to explain 21% of the variance. The features were used for training a neural network resulting in an improved accuracy of R2 = 0.547(6) by combining the information of all features. † MSE and Var measured for data reduced to given residue type. Feature(-combination) i1/2 hQ 3 eval totalSASA/ j j=1

R2

Description

0.029(9)

relSASA ResType(−1) ResType(0) ResType(1) |dist|/Rg [evec1 · dist]2 /eval1 [evec2 · dist]2 /eval2 [evec3 · dist]2 /eval3 c(6) c(10) c(14) c(18) c(22) totalHbs/Nres Hbs(−1) Hbs(0) Hbs(1) ψ(−1) for ALA ψ(0) for ALA ψ(1) for ALA φ(0) for GLY χ1 (0) for LEU χ2 (0) for ILE ω(0) for SER

0.210(7) 0.018(3) 0.17(1) 0.014(2) 0.12(1) 0.026(6) 0.026(6) 0.043(5) 0.122(7) 0.19(1) 0.20(1) 0.162(7) 0.11(1) 0.016(3) 0.056(5) 0.068(4) 0.032(7) 0.25(2)† 0.33(1)† 0.20(2)† 0.17(2)† 0.090(8)† 0.13(1)† 0.04(1)†

totalSASA = total SASA; evalj = jth eigenvalue of gyration tensor in ascending order; relSASA = relative SASA of residue i ResType(s) = residue-type label (binary coded) for residue i + s dist = vector from Cα atom to center of mass; Rg = radius of gyration; evecj = eigenvector associated to evalj c(rc ) = number Cα atoms in sphere of radius rc around residue i

totalHbs = total number hydrogen bonds; Nres = number residues; Hbs(s) = number of hydrogen bonds of residue i + s ψ(s) = torsion angle ψ of residue i + s of the centroid (or query) structure if residue i is of given residue type; analogously for other torsion angles

8

ACS Paragon Plus Environment

Page 8 of 29

Page 9 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Training of the neural network Predicting the partial entropy Si from the 108 features constitutes a standard regression problem of supervised learning, which can be tackled with various machine-learning approaches. We chose to implement an artificial neural network (NN) for their excellent scalability and their highly successful application to diverse machine-learning problems in recent years. 27 By extensive search over network architectures, learning methods and other hyperparameters, we identified a 6-layer NN with shortcut connections that provided excellent validation performance when trained with a stochastic gradient descent variant and regularized by dropout and early-stopping (see Methods). On a separate test set, the trained network predicted Si with a coefficient of determination R2 = 0.547(6) (corresponding to an RMSE of 0.706(7)kB and a Pearson correlation coefficient of 0.740(5)) which is substantially better than the prediction performance obtained for single features (see Tab. 1) or with a linear model (R2 = 0.421(7)). This results in an excellent performance for the prediction of the accumulated entropy

SPC =

Nres X

Si

(3)

i=1

(referred to as Popcoen entropy hereafter) as confirmed by an R2 value of 0.66(5) (Pearson = 0.81(4)) for the prediction of SPC /Nres on the test set. Scatter plots of measured vs. predicted values are shown in Fig. S2 of SI. The Popcoen prediction for SPC /Nres is more accurate than for the individual Si ’s (bootstrapped p-value = 0.016). This stems from the fact that the Si ’s inside a protein are typ√ ically positively correlated which broadens the SPC distribution, i.e. std-dev(SPC / Nres ) = η1 · std-dev(Si ) with η1 = 3.5(3) > 1. As the feature set also contains global features of the √ proteins, Popcoen is able to partially capture these correlations, such that RMSE(SPC / Nres ) = 2.4(3)kB = η2 · RMSE(Si ) with η2 = 2.8(3) < η1 .

9

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Entropy prediction tool Popcoen The entropy prediction program Popcoen has been made available at http://fmc.ub.edu/ popcoen/. It allows to estimate the configurational entropy of a protein structure via two calculation steps (see bottom panel of Fig. 1). First, the query structure is loaded and the 108 features per residue are calculated. Second, the NN is evaluated for the features which yields predictions for the partial entropies {Si } of the residues. The program outputs the Si ’s and the Popcoen entropy SPC (of Eq. (3)) which represents the Popcoen estimate (in units of kB ) for Sconf + C where C is constant for a given AA sequence. Additionally, a reliability number λ ∈ [0, 1) is output which reflects the fact that the Popcoen entropy is reliable only when the given query structure is not too different from the training data (see SI). Popcoen entropies can be obtained in two ways. First, a web-server at http://fmc.ub. edu/popcoen/ allows easy but slow queries of Popcoen with no installation required. This access is suited for single requests while large scans cannot be carried out due to the limited computational resources of the server. Second, Popcoen can fully be downloaded and run as a separate server process on local architecture where input–output communication is realized via socket communication. In this way, incorporating Popcoen into existing protein software reduces to elementary socket communication which requires about ten lines of additional source code. Besides, file-based communication is also supported. Strictly speaking, querying Popcoen for non-equilibrium states (decoy structures) is based on the mild assumption that Sconf of native states is not fundamentally different than Sconf of other states, since Popcoen was exclusively derived from equilibrium simulations. More precisely, we assume that there is no hidden mechanism in nature which causes that the fluctuation and correlation patterns of native states are somehow special. A similar assumption is also made for software which estimates Sconf via sampling, namely that the used force field can also be applied to non-equilibrium states although force fields have only been parametrized from information accessible on experimental time scales. 10

ACS Paragon Plus Environment

Page 10 of 29

Page 11 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Performance test for native-state identification In the following, we show that the accuracy of existing protein software can be improved by incorporating Popcoen entropy. We focus on the task of native-state identification where protein software aims to identify the native-state of a protein between many decoy states of ˆ values of all strucequal AA sequence but distinct structure. This is done by ranking the G ˆ to be the native one. The accuracy of this tures and predicting the structure with lowest G ˆ task can be interpreted as a measure for the accuracy of the used free-energy expression G. The test was performed for the prominent protein software FoldX. 6 FoldX is based on the ˆ FX which contains, among others, a contribution for configurational free-energy expression G entropy, denoted by SFX here (see below). We replaced SFX by SPC , keeping the remaining ˆ FX unchanged, and compared the performances of the two cost-functions contributions of G for native-state identification. This procedure offered an indirect way for comparing the accuracies of SPC and SFX . It was used since it does not require the knowledge of exact configurational entropies. For a direct evaluation of SPC and SFX , we would need exact Sconf values which cannot be estimated reliably for proteins of realistic size (see Method). FoldX was chosen since it arguably includes the most elaborated Sconf estimator between those programs which do not require configurational sampling (see Introduction). Moreover, since SFX has clearly been declared to model Sconf inside FoldX, we could perform a meaningful test by simply replacing SFX with SPC . Performing a similar test for a free-energy expressions which does not contain a well-defined contribution for Sconf requires a cleaning process of all energy terms from remaining entropy contributions. This cleaning procedure could not be performed unambigously by us since it requires additional information about the cost-function which has not been published to full extent. Our test was based on a data-set compiled from the structure-prediction competition CASP. 1 It contains the experimental structures of 459 AA sequences together with decoy states for those sequences created by the CASP participants. Like in CASP, the experimental structures were interpreted as native states of the proteins, neglecting experimental artifacts. 11

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The decoys were selected to be mutually distinct and also distinct from the corresponding native states (see Methods). The number of decoys per sequence ranged from 6 to 315, with an average of 104. With these data, we tested how reliably FoldX 6 identified the native states between the ˆ FX by applying the FoldX protocols RepairPDB decoys. We computed FoldX’ free energy G and Stability to all structures. 28 We then ranked all structures with equal sequence in ˆ FX i.e. we ordered the structures from lowest to largest G ˆ FX values and assigned terms of G its rank in the sorted list to each structure. We further counted how often the native state was identified by the cost-function i.e. how often rank 1 was assigned to the native. FoldX identified the native in 78% of the cases (358 of 459) which is much better than the 2% success probability obtained by picking a fold randomly. We then calculated the Popcoen ˆ FX was obtained entropy SPC for all structures. First evidence that Popcoen can improve G by noticing that SPC /Nres is specifically large for those natives which could not be identify by FoldX (see Fig. 2 A, p-value < 10−10 , Mann–Whitney–Wilcoxon test). ˆ FX is Then, we modified the FoldX free-energy expression as mentioned above. FoldX’ G a sum of various enthalpic and entropic contributions, modeling all significant contributions of G. Configurational entropy (times (−T )) is modelled by two terms which are referred to as ”Wmc T ∆Smc + Wsc T ∆Ssc ” by the authors 29 and which together are referred to as (−T )SFX here. SFX accounts for configurational entropy stemming from backbone and sidechain motion. Its precise formula has not been published since FoldX is proprietary software but (−T )SFX values are output by FoldX for given input structure. We replaced (−T )SFX S /kB (with T ≈ 300K) which did not involve any kind of fitting by (−T )SPC = −0.6 kcal mol PC since FoldX energies are given in units of kcal/mol. The ranking of the structures was ˆ FX+PC = EFX − T SPC where EFX denotes repeated with the modified free energy function G ˆ FX , i.e. EFX = G ˆ FX + T SFX . The substitution did not affect all remaining contributions of G the ranks of 368 natives. With both free-energy expressions, 351 natives were identified (i.e. the natives had rank 1) while 17 natives had unchanged rank larger than 1. This is

12

ACS Paragon Plus Environment

Page 12 of 29

Page 13 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 2: Panel A: Normalized histograms of SPC /Nres for all 459 CASP natives and for the reduced set of 101 CASP natives which were not identified by FoldX. The associated medians are indicated by the arrows. FoldX fails to identify the native state preferentially for samples with large Popcorn entropy density (p-value < 10−10 , Mann–Whitney–Wilcoxon test). Panel B: Histograms of the differences ∆EFX /Nres and (−T )∆SPC /Nres between the decoy of lowest EFX and the native showing that the Popcoen entropy varies less between samples than the remaining energy contributions of FoldX. Inset: Scatter plot of the standardized values of EFX /Nres vs. (−T )SPC /Nres for the CASP natives shows an anti-correlation with correlation coefficient −0.3 ± 0.1. a consequence of the fact that the values of EFX /Nres varied stronger between structures than those of T SPC /Nres (see histograms of Fig. 2 B). Precisely, the standard-deviation of the differences ∆EFX /Nres and T ∆SPC /Nres between the decoy of lowest EFX and the native is 0.32(2) kcal/mol and 0.052(3) kcal/mol, respectively. Interestingly, ∆EFX /Nres and (−T )∆SPC /Nres are anti-correlated due to an observed anti-correlation between EFX /Nres and (−T )SPC /Nres (see inset of Fig. 2 B) which might be interpreted as evidence for energy– entropy compensation inside proteins. However, further investigation is needed to decide whether the anti-correlation indeed represents intrinsic properties of proteins, or whether it is rather an artifact of the imprecisions of Popcoen and/or FoldX. The ranks of the remaining 91 natives did change: they improved in 55 cases but worsened ˆ FX+PC than for G ˆ FX , and in 36 cases (i.e. in 55 cases, the rank of the native was smaller for G larger in 36 cases). This shows that incorporating SPC into FoldX results in a mild precision 13

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 3: Scatter plot of inverse ellipsoidal density ρ−1 ellip vs. ellipsoidal aspect ratio Aellip for the centroid structures of the training and validation set (864 proteins), as well as for the ˆ FX and G ˆ FX+PC . Natives whose rank 91 CASP natives which were ranked differently by G worsened due to Popcoen are located preferentially in regions with low training-data density. This shows that a neural network gives reliable estimates for cases “close” to the training data. Popcoen accounts for this fact via the reliability number λ. increase (p-value = 0.045, generalized likelihood ratio test). The total number of identified ˆ FX+PC but not with G ˆ FX natives increased by 7 because 14 natives were identified with G ˆ FX (see Table 2). while 7 samples were only identified with G The main reason of why the ranking of some natives worsened due to Popcoen can be traced back to the fact that Popcoen’s neural network was evaluated for structures which are very different from all training data. This is shown in the scatter plot of Fig. 3 where two geometrical observables of the proteins are shown for the centroid structures of the training ˆ data as well as for the 92 CASP samples which were differently ranked for both G’s. The p 4π observables are the inverse ellipsoidal density ρ−1 det(G)/Nres and the ellipsoidal ellip = 3 p aspect ratio Aellip = eval3 /eval1 defined via the gyration tensor G and its largest and smallest eigenvalues eval3 , eval1 . The plot shows that natives whose ranking improved due to Popcoen accumulate in regions of high training data density, while natives whose rank 14

ACS Paragon Plus Environment

Page 14 of 29

Page 15 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

worsened due to Popcoen are located preferentially in regions of low training data density. This behavior is expected as, roughly speaking, a neural network is a reliably “intrapolation machine” but usually fails to “extrapolate” into regions of the feature space with low training data density. 30 A detailed analysis confirmed this observation. We dropped all CASP structures (natives and decoys) with reliability number λ < 0.5 which reduced the data to those structures   with local training-data density larger 10/ std-dev(ρ−1 ellip ) · std-dev(Aellip ) (see SI, std-dev = standard-deviation of training data). For the remaining data (containing 415 natives with in average 77 decoys per native in the range of 6 to 243), we repeated the ranking ˆ FX and G ˆ FX+PC . Now, the ranks of 49 natives improved due to Popcoen while using G they worsened only for 18 natives (with 348 native ranks unchanged), which represents a highly significant performance increase (p-value = 0.00013, generalized likelihood ratio test). Related, also native-state identification improved significantly (p-value = 0.0015): from the 415 natives, FoldX identified 336 (81 %) while FoldX+Popcoen found 348 (84 %) because FoldX+Popcoen could identify additional 14 natives while native-state identification was lost only in two cases (see Table 2). To further assess statistical significance of this result, we randomly permuted the Popcoen entropies between all structures of equal AA sequence. ˆ FX+PC . This yielded with large probability (> 0.9993) less than 336 identified natives using G Hence, the permuted Popcoen entropies act as noise while the un-permuted ones improve the detection rate significantly.

Discussion We presented the knowledge-based tool Popcoen for the prediction of protein configurational entropy. In contrast to previous approaches, Popcoen does not require any kind of configurational sampling which allows to include Sconf also in software based on cost-function minimization/evaluation where no configurational sampling is carried out to save computa-

15

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 29

Table 2: Summary of the performance test for native-state identification. The FoldX costˆ FX is compared to G ˆ FX+PC which contains Popcoen entropy for Sconf . function G Top: The rank of the native state is compared for the entire CASP data-set and for a reduced data-set containing only structures with reliability number λ > 0.5. The numbers (and percentages) of natives are given which are ranked equally by both cost-functions, or ˆ FX+PC performs significantly better than G ˆ FX better for one or the other. This shows that G (p-values obtained with generalized likelihood ratio test). Bottom: Number (and percentage) of natives which have been identified (3) by both cost-functions, have not been identified ˆ FX+PC (7) or have been identified by only one cost-function. For the reduced data-set, G ˆ FX . performs significantly better than G Data-set All CASP data (459 samples) λ filtered (415 samples)

Data-set All CASP data (459 samples) λ filtered (415 samples)

Comparison of the rank of the native state ˆ FX smaller for equal for G smaller for ˆ FX+PC ˆ FX+PC ˆ FX and G G G 368 = 80.2% 55 = 12.0% 36 = 7.8% 348 = 83.9%

49 = 11.8%

18 = 4.3%

ˆ FX+PC : 3, G ˆ FX G : 3 351 = 76.5%

Native-state ˆ GFX+PC : 7, ˆ FX G : 7 87 = 19.0%

identification ˆ FX+PC : 3, G ˆ FX G : 7 14 = 3.1%

334 = 80.5%

65 = 15.7%

14 = 3.4%

p-value 0.045 0.00013

ˆ FX+PC : 7, G ˆ FX G : 3 7 = 1.5% 2 = 0.5%

p-value not significant 0.0015

tional time. Entropy is estimated by evaluating a neural network which was trained on simulation data of 961 representative proteins. This allows reliable Sconf predictions for most protein structures. For cases which are very distinct from the training proteins, Popcoen entropies may not be reliable, as indicated by a small value of the reliability number λ . 0.5. We plan to re-parametrize Popcoen on a larger set of proteins when the simulation database contains substantially more trajectories. This will further amplify Popcoen’s applicability. In an explicit test, we have shown that incorporating Popcoen into the FoldX’ costfunction improves the accuracy of FoldX for native-state identification. We expect that a similar precision gain can be obtained for all protein software based on cost-function 16

ACS Paragon Plus Environment

Page 17 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

minimization/evaluation which currently neglect Sconf or account for Sconf only rudimentarily. Software of this type has been developed for all major tasks related to proteins such as structure predictions, proteins design, NMR and x-ray refinement, docking, and mutation effect predictions. Therefore, Popcoen may be helpful for many aspects of protein science. We have designed Popcoen to be very easily integrable into existing software since Popcoen can be run as a separate server process. This enables researches to test whether Popcoen improves their protein software with very limited effort as only socket communication needs to be established. However, to prevent double-counting, it might additionally be necessary to modify the original cost-function in order to eliminate partial effects of Sconf already captured ˆ within G. Due to the large entropic changes upon docking, 31 considering Sconf is particularly important for protein-docking predictions. In collaboration with the developers of pyDock, 10 we currently investigate the impact of Popcoen on this task. We further work on a similar knowledge-based approach also for the average solvation free energy. This may result in a more accurate estimator than current approaches (such as Generalized Born or GBSA 32 ), since it allows to account for the heterogeneous surface characteristics of proteins.

Methods Entropy expression Configurational entropy Sconf is the integral of Eq. (1) where x denotes the 3Natoms − 6 atom coordinates in a body fixed coordinate system and ρ(x) is their joint probability density in the presence of the solvent. This definition stems from a formal decomposition of the full phase space integral over all degrees of freedom (dofs) of the protein-solvent system into protein dofs and solvent dofs, introduced by Lazaridis and Karplus 4 (for details, see SI). Eq. (1) is better expressed in bond-angle-torsion coordinates to facilitate the isolation of dofs which are 17

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 29

strongly confined by the covalent structure of the protein. 22 These coordinates are basically “frozen” at room temperature 20 so that their contribution to Sconf can be neglected and Eq. (1) reduces to an integral over the flexible torsion angles usually denoted {ψi }, {φi }, and {χ1i }...{χ4i } (see SI). The remaining integral is still of very high dimension (namely about 3.8Nres -dimensional) rendering its brute-force evaluation from simulation data unfeasible (curse of dimensionality). Therefore, an approximation needs to be employed. Here, we used an approach similar to the second-order Maximum information spanning tree (MIST) approximation of entropy 23,24 which allowed us to decompose Sconf into a sum of partial entropies {Si } which could reliably be measured from the available simulation data as only one- and two-dimensional integrals are involved. We obtained the expression

Sconf ≈

Nres X

Si + C(T, AA sequence)

(4)

i=1

where C is a constant for different conformations of a given AA sequence and constant temperature, pressure, and pH. The partial entropies (except S1 and SNres )

Si :=

X

H(θ) − I(θ; κ(θ))

(5)

θ∈Θi

are defined in terms of the marginal entropies Z

π

H(θ) := −kB

dθ ρ(θ) log ρ(θ)

(6)

−π

(here denoted by H to avoid confusion with Si ) and the mutual informations

I(θ; τ ) := H(θ) + H(τ ) − H(θ, τ ).

18

ACS Paragon Plus Environment

(7)

Page 19 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Here, ρ(θ) is the marginal distribution of a single torsion angle θ; and Z

π

Z

π

H(θ, τ ) := −kB

dθ dτ ρ(θ, τ ) log ρ(θ, τ ) −π

(8)

−π

is the entropy of the two-dimensional joint distribution ρ(θ, τ ) of two torsion angles θ and τ . The sum of Eq. (5) runs over the set Θi containing all unfrozen torsion angles of residue i, i.e. Θi is the set {φi , ψi } extended by those side-chain angles χ1i , χ2i , χ3i , χ4i which exist in residue i. The function κ encodes adjacent torsion angles along the covalent structure of the protein via the definitions κ(φi ) := ψi−1 , κ(ψi ) := φi , κ(χ1i ) := ψi , κ(χ2i ) := χ1i , κ(χ3i ) := χ2i , κ(χ4i ) := χ3i . The detailed calculation leading to Eq. (5) together with the slightly modified formulas for S1 and SNres can be found in SI. The dependence of C on the AA sequence is also discussed in SI. This dependency must be considered when entropy for different sequences is compared, as e.g. in mutation studies.

MD data We trained the neural network with data measured from MD simulations. The trajectories were obtained from the MoDEL database which contains simulation trajectories for a large and representative set of proteins covering all major fold classes. 26 All simulations used for this study had been performed with AMBER 8.0 (parm99 force field) at ambient conditions (T = 300 K, p = 1 atm) and constant volume. Each simulation was performed for a single protein in explicit solvent (TIP3P water). After careful equilibration to the simulation conditions, production runs lasted for 10ns using a time step of 2 fs where snapshots were taken every ps. All used trajectories had passed all eight quality criteria of the MoDEL database 26 which guarantees that unfolding events did not occur. In total, we analyzed simulation data for 961 proteins ranging from 21 to 539 AA’s with an average size of 151.4 AA’s.

19

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 29

The simulation data were available from the MoDEL database in compressed form where a non-reversible compression algorithm had been employed. 33 Enormous compression rates were achieved by reducing the fluctuation spectra of the Cartesian coordinates to the most contributing modes. At first glance, this might be problematic for the present work as configurational entropy depends crucially on fluctuations. However, as we show in SI, the compression causes only a negligible error when the partial entropies are rescaled appropriately (see Eq. (S19) of SI).

Measurements From the MD trajectories, we measured all marginal entropies and mutual informations needed to construct the partial entropies {Si }. The corresponding integrals were calculated using histograms which is efficient and sufficiently accurate. 34 400 bins of equal size were used for the one-dimensional integral of H(θ) while 20 bins per angle were used for the oneand two-dimensional integrals needed for computing I, to assure that I ≥ 0 holds exactly for the estimator. Finally, the rescaling (Eq. (S19) of SI) accounting for the data compression was applied. We further derived the centroid structure defined as the snapshot k with largest

P

l

 exp −dkl /d¯

where dkl is the root-mean-square deviation of the snapshots k and l and d¯ is the mean of all dkl encountered for the trajectory. The alignment is performed with mdtraj. 35 The calculations were accelerated with the cost of negligible imprecision. First, a centroid was calculated for all 100 consecutive sub-trajectory of 100 ps length. Then, the centroid of these centroids was derived. From the centroid structures, as well as from the CASP data (see below), we computed the following 108 features for all residues (labeled by i, 1 ≤ i ≤ Nres ): Feature 1: Nres = number of residues. Features 2–4: eval1 , eval2 , eval3 = eigenvalues of the gyration tensor, sorted in ascending order. 20

ACS Paragon Plus Environment

Page 21 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Features 5–7: evec1 · dist, evec2 · dist, evec3 · dist where evecj is the j th eigenvector of the gyration tensor associated to eigenvalue evalj and directed such that its first component is non-negative; dist is the vector from the center-of-mass to the Cα atom of residue i. The center-of-mass is calculated as the (unweighted) average position of all protein atoms except hydrogens. Note that the features 2–7 contain information about the position of residue i inside an ellipsoid representing the protein shape. Features 8–18: c(2), c(4), c(6), ..., c(22), where c(rc ) = Number of Cα atoms with an Euclidean distance of less than rc [in ˚ A] or a chain distance of less than three to the Cα atom of residue i. These features represent a measure for the local density profile. The c(rc ) values were not normalized by the associated volumes 4πrc3 /3 since such normalization would not affect the NN performance while integer values are more human-readable. Feature 19: relSASA = solvent accessible surface area (SASA) of residue i relative to the standard accessibility of the residue type of residue i. SASA calculations are performed with the mdtraj 35 implementation of the Shrake-Rupley algorithm (probe-radius = 0.14˚ A). The standard accessibility values of NACCESS 36 are used for normalization. Normalization was introduced since it allows for better human-interpretability. It is irrelevant for the network performance (due to feature ResType(0), see below). 2 Feature 20: totalSASA = total SASA of the protein in ˚ A.

Features 21–41: torsions(−1), torsions(0), torsions(1) where torsions(s) = { ψ(s), φ(s), χ1 (s), χ2 (s), χ3 (s), χ4 (s), ω(s) } which are the torsion-angle values of residue i + s of the given protein structure, shifted by −π/2, −π, −5π/3, −π/4, −π/3, −π, and −π/2, respectively. Angles are defined on (−π, π]. Nonexistent angles are assigned the value nan. Due to the constant shifts, there is typically small weight at the ends of the interval. Feature 42: totalHbs = total number of hydrogen bonds of the protein where hydrogen bonds are identified using the mdtraj implementation of the Wernet-Nilsson algorithm. 37 Features 43–45: Hbs(−1), Hbs(0), Hbs(1) where Hbs(s) = number of hydrogen bonds of residue i + s (acting as either donor or acceptor).

21

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

X

W6 b6

108 features

L1

L2

L3

0.1

Li

=

L4

L5

Wi bi

t y

100 nodes

0.1 0.3

Σ( - )²

Page 22 of 29

= dropout = addition = dot product = concatenation = ELU Σ( - )² = squared loss

Figure 4: Schematic depiction of the computational graph underlying the NN architecture. The values of the 108 features are passed through 6 feed-forward layers (5 hidden, one output) to obtain an estimate y for Si . Each hidden layer Li applies an affine transformation (i.e. multiplication with weight matrix Wi and addition of bias bi ) and an exponentiallinear transformation. The output layer only applies an affine transformation. Shortcut connections from the input layer to the output of the hidden layers were provided. 10% and 30% dropout was applied to the output of the input layer and hidden layers, respectively. All parameters were optimized to minimize the squared loss between y and the targets t = partial entropies Si calculated from the MD simulations. Features 46–108: ResType(−1), ResType(0), ResType(1) where ResType(s) represents 21 binary variables encoding the residue type of residue i + s. All features are predictive for Si as confirmed by the associated R2 values (see Sec. S7 of SI). We further showed that reasonable groups of features contain additional information which is not inherent in the remaining features (see Sec. S6 of SI). Adding a feature for the residue’s secondary structure did not improve the performance. Hence, the features compose a reasonable minimal feature set, though, we cannot discard that some subset might provide similar prediction accuracy.

Training of the neural network The MD data were randomly divided into three sets: a training set containing 780 proteins (118062 residues), a validation set of 84 proteins (13113 residues) and a test set of 97 proteins (14311 residues). Optimization of NN architecture, learning method and other hyperparameters was performed by training various NNs on the training set and evaluating their

22

ACS Paragon Plus Environment

Page 23 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

performance in terms of RMSE on the validation set. The manual and random search included architectural variants (simple convolutional architectures, skip-connections, 38 batchnormalization layers 39 ), different activation functions (rectified-linear unit (ReLU), Parametric ReLU, exponential-linear unit (ELU) 40 ), as well as learning methods (stochastic gradientdescent variants Adam 41 and RMSprop 42 ). The random search sampled independently between 0 and 14 hidden layers, between 8 and 1024 neurons per layer, between 0 and 50% dropout, and between 10−2 and 10−5 for the initial learning rate. In all cases, features and targets were standardized for training, parameters were initialized in a data dependent way with Gaussian noise and within-layer weight-normalization, 43 and learning rate was reduced by a factor of 10 at 80 and at 160 epochs. All NNs and their training were implemented in Python using the Theano library. 44 Based on this extensive search, superior performance was found for a 6-layer NN with short-cut connections (re-feeding the inputs to deeper layers), and in total 94709 weight and bias parameters trained with Adam for 200 epochs in mini-batches of size 200, with an initial step size of 5 · 10−4 and otherwise default parameters. The precise feed-forward architecture is depicted in Fig. 4. It consists of 5 hidden layers each of 100 nodes which are fully connected to both the previous layer and the input layer with 30% and 10% dropout, respectively. ELUs serve as non-linearity for all layers except the output-layer, for which a linear activation function was used. The learning curve of the final NN is shown in Fig. S3 of SI.

CASP data CASP is a competition for blind predictions of protein structure. In a nutshell, the organizers publish AA sequences for which the participants submit models. Then, the experimental structures (called targets) are published and compared with the models. From CASP’s historical data repository, we downloaded all target files of CASP5–CASP11 together with the associated models submitted in the category tertiary structure predictions. 23

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Entirely prior to the performance test reported in Results, the data were cleaned and filtered in various ways as described in SI. Summarizing, we selected those samples which were not canceled by the organizers and whose target pdb-files did not contain errors. Further, we dropped erroneous models, unfolded models, models for improper AA sequences, and models containing insufficient side-chain information. Then we performed hierarchical clustering on the models (of given target) and dropped all models per cluster except a single random one. The remaining models were sequence-aligned with the associated targets and reduced to their common atoms. Finally, very accurate models were dropped (to assure that the remaining models represent decoy states of the protein) and targets with less than 5 remaining models were dropped. In this way, we obtained 459 targets (=natives) with in average 104 models (=decoys) per target ranging from 6 to 315.

Error estimation The reported errors represent the standard error of the mean, computed with the bootstrap method. Errors are given in concise notation, i.e. numbers in parenthesis represent the error on the last significant figure(s).

Acknowledgments We are grateful to the authors of Ref. 26 (especially Adam Hospital, Josep Llu´ıs Gelp´ı, and Modesto Orozco) for setting up the MoDEL database and for providing their simulation data. We thank Mareike Poehling for drawing Fig. 1.

Supporting Information Available The Supporting Information (pdf-file of 30 pages) contains the following information: (1) the full derivation of the used entropy expression; (2) additional figures referred to in the paper; (3) an analysis of the effects caused by the compression algorithm of the MoDEL 24

ACS Paragon Plus Environment

Page 24 of 29

Page 25 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

database; (4) details about the reliability number λ; (5) a report of how the CASP dataset was compiled; (6) an analysis confirming the importance of the features; (7) a list of the R2 values of all features.

This material is available free of charge via the Internet at

http://pubs.acs.org/.

References (1) Moult, J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology 2005, 15, 285–289. (2) Bonneau, R.; Baker, D. Ab Initio Protein Structure Prediction: Progress and Prospects. Annual Review of Biophysics and Biomolecular Structure 2001, 30, 173–189. (3) Reichl, L. E. A Modern Course in Statistical Physics, 3rd Edition; WILEY–VCH, 2009. (4) Lazaridis, T.; Karplus, M. Effective Energy Function for Proteins in Solution. Proteins: Struct., Funct., Genet. 1999, 35, 133–152. (5) Goethe, M.; Fita, I.; Rubi, J. M. Thermal motion in proteins: Large effects on the time-averaged interaction energies. AIP Advances 2016, 6, 035020. (6) Schymkowitz, J.; Borg, J.; Stricher, F.; Nys, R.; Rousseau, F.; Serrano, L. The FoldX web server: an online force field. Nucleic Acids Res. 2005, 33, W382–W388. (7) Pokala, N.; Handel, T. M. Energy Functions for Protein Design: Adjustment with Protein-Protein Complex Affinities, Models for the Unfolded State, and Negative Design of Solubility and Specificity. J. Mol. Biol. 2005, 347, 203–227. (8) Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D. Protein Structure Prediction using Rosetta. Methods Enzymol. 2004, 383, 66–93. (9) Schwieters, C. D.; Kuszewski, J. J.; Tjandra, N.; Clore, G. M. The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 2003, 160, 65–73. 25

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(10) Cheng, T. M.; Blundell, T. L.; Fern´andez-Recio, J. pyDock: Electrostatics and desolvation for effective scoring of rigid-body protein–protein docking. Proteins: Structure, Function, and Bioinformatics 2007, 68, 503–515. (11) Su´arez, M.; Tortosa, P.; Jaramillo, A. PROTDES: CHARMM toolbox for computational protein design. Syst. Synth. Biol. 2008, 2, 105–113. (12) Pierce, B.; Weng, Z. ZRANK: Reranking protein docking predictions with an optimized energy function. Proteins: Struct., Funct., Bioinf. 2007, 67, 1078–1086. (13) Th´evenet, P.; Shen, Y.; Maupetit, J.; Guyon, F.; Derreumaux, P.; Tuff´ery, P. PEPFOLD: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides. Nucleic Acids Res. 2012, 40, W288–W293. (14) Sch¨afer, H.; Smith, L. J.; Mark, A. E.; van Gunsteren, W. F. Entropy Calculations on the Molten Globule State of a Protein: Side-Chain Entropies of α-Lactalbumin. Proteins: Struct., Funct., Genet. 2002, 46, 215–224. (15) Berezovsky, I. N.; Chen, W. W.; Choi, P. J.; Shakhnovich, E. I. Entropic Stabilization of Proteins and Its Proteomic Consequences. PLoS Comput. Biol. 2005, 1, 322–332. (16) Zhang, J.; Liu, J. S. On Side-Chain Conformational Entropy of Proteins. PLoS Comput. Biol. 2006, 2, 1586–1591. (17) Goethe, M.; Fita, I.; Rubi, J. M. Vibrational Entropy of a Protein: Large Differences between Distinct Conformations. J. Chem. Theory Comput. 2015, 11, 351–359. (18) Benedix, A.; Becker, C. M.; de Groot, B. L.; Caflisch, A.; B¨ockmann, R. A. Predicting free energy changes using structural ensembles. Nat. Methods 2009, 6, 3–4. (19) Xu, D.; Zhang, Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge–based force field. Proteins: Structure, Function, and Bioinformatics 2012, 80, 1715–1735. 26

ACS Paragon Plus Environment

Page 26 of 29

Page 27 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(20) McCammon, J. A.; Harvey, S. C. Dynamics of proteins and nucleic acids; Cambridge University Press: Cambridge, 1987. (21) Su´arez, D.; D´ıaz, N. Direct methods for computing single-molecule entropies from molecular simulations. WIREs Comput. Mol. Sci. 2014, doi: 10.1002/wcms.1195. (22) Hnizdo, V.; Gilson, M. K. Thermodynamic and Differential Entropy under a Change of Variables. Entropy 2010, 12, 578–590. (23) King, B. M.; Tidor, B. MIST: Maximum Information Spanning Trees for dimension reduction of biological data sets. Bioinformatics 2009, 25, 1165–1172. (24) King, B. M.; Silver, N. W.; Tidor, B. Efficient Calculation of Molecular Configurational Entropies Using an Information Theoretic Approximation. J. Phys. Chem. B 2012, 116, 2891–2904. (25) Mark, A. E.; van Gunsteren, W. F. Decomposition of the Free Energy of a System in Terms of Specific Interactions: Implications for Theoretical and Experimental Studies. J. Mol. Biol. 1994, 240, 167–176. (26) Meyer, T.; D’Abramo, M.; Hospital, A.; Rueda, M.; Ferrer-Costa, C.; P´erez, A.; Carrillo, O.; Camps, J.; Fenollosa, C.; Repchevsky, D. et al. MoDEL (Molecular Dynamics Extended Library): A Database of Atomistic Molecular Dynamics Trajectories. Structure 2010, 18, 1399–1409. (27) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 2015, 61, 85–117. (28) Delgado, J. private communication. (29) Guerois, R.; Nielsen, J. E.; Serrano, L. Predicting Changes in the Stability of Proteins and Protein Complexes: A Study of More Than 1000 Mutations. J. Mol. Biol. 2002, 320, 369–387. 27

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(30) Bishop, C. M. Neural networks for pattern recognition; Oxford university press, 1995. (31) Chang, C.-E.; Chen, W.; Gilson, M. K. Ligand configurational entropy and protein binding. Proc. Natl. Acad. Sci. 2007, 104, 1534–1539. (32) Cramer, C. J.; Truhlar, D. G. Implicit Solvation Models: Equilibria, Structure, Spectra, and Dynamics. Chem. Rev. 1999, 99, 2161–2200. (33) Meyer, T.; Ferrer-Costa, C.; P´erez, A.; Rueda, M.; Bidon-Chanal, A.; Luque, F. J.; Laughton, C. A.; Orozco, M. Essential Dynamics: A Tool for Efficient Trajectory Compression and Management. J. Chem. Theory Comput. 2006, 2, 251–258. (34) Lange, O. F.; Grubm¨ uller, H. Full correlation analysis of conformational protein dynamics. Proteins: Struct., Funct., Bioinf. 2008, 70, 1294–1312. (35) McGibbon, R. T.; Beauchamp, K. A.; Harrigan, M. P.; Klein, C.; Swails, J. M.; Hern´andez, C. X.; Schwantes, C. R.; Wang, L.-P.; Lane, T. J.; Pande, V. S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015, 109, 1528 – 1532. (36) Hubbard, S. J.; Thornton, J. M. ’NACCESS’, Computer Program. Department of Biochemistry and Molecular Biology, University College London 1993, (37) Wernet, P.; Nordlund, D.; Bergmann, U.; Cavalleri, M.; Odelius, M.; Ogasawara, H.; N¨aslund, L.; Hirsch, T. K.; Ojam¨ae, L.; Glatzel, P. et al. The structure of the first coordination shell in liquid water. Science 2004, 304, 995–999. (38) Huang, G.; Liu, Z.; Weinberger, K. Q. Densely Connected Convolutional Networks. CoRR 2016, abs/1608.06993 . (39) Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015; pp 448–456. 28

ACS Paragon Plus Environment

Page 28 of 29

Page 29 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(40) Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). CoRR 2015, abs/1511.07289 . (41) Kingma, D. P.; Ba, J. Adam: A Method for Stochastic Optimization. CoRR 2014, abs/1412.6980 . (42) Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 2012, (43) Kr¨ahenb¨ uhl, P.; Doersch, C.; Donahue, J.; Darrell, T. Data-dependent Initializations of Convolutional Neural Networks. CoRR 2015, abs/1511.06856 . (44) Al-Rfou, R.; Alain, G.; Almahairi, A.; Angerm¨ uller, C.; et al., D. B. Theano: A Python framework for fast computation of mathematical expressions. CoRR 2016, abs/1605.02688 . (45) Li, D.-W.; Meng, D.; Br¨ uschweiler, R. Short–Range Coherence of Internal Protein Dynamics Revealed by High-Precision in Silico Study. J. Am. Chem. Soc. 2009, 131, 14610–14611. (46) Su´arez, E.; D´ıaz, N.; D. Su´arez, D. Entropy Calculations of Single Molecules by Combining the Rigid-Rotor and Harmonic-Oscillator Approximations with Conformational Entropy Estimations from Molecular Dynamics Simulations. J. Chem. Theory Comput. 2011, 7, 2638–2653. (47) Sokal, R.; Michener, C. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 1958, 38, 1409–1438. (48) Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003, 31, 3370–3374.

29

ACS Paragon Plus Environment