
Cite This: J. Chem. Theory Comput. 2018, 14, 1811−1819. DOI: 10.1021/acs.jctc.7b01079
Received: October 26, 2017. Published: January 19, 2018.

Prediction of Protein Configurational Entropy (Popcoen)

Martin Goethe,*,† Jan Gleixner,‡ Ignacio Fita,¶ and J. Miguel Rubi†

† Department of Condensed Matter Physics, University of Barcelona, Carrer Martí i Franqués 1, 08028 Barcelona, Spain
‡ Faculty of Biosciences, Heidelberg University, Im Neuenheimer Feld 234, 69120 Heidelberg, Germany
¶ Molecular Biology Institute of Barcelona (IBMB-CSIC, Maria de Maeztu Unit of Excellence), Carrer Baldiri Reixac 4-8, 08028 Barcelona, Spain


ABSTRACT: A knowledge-based method for configurational entropy prediction of proteins is presented. The method is extremely fast compared to previous approaches because it does not involve any type of configurational sampling. Instead, the configurational entropy of a query fold is estimated by evaluating an artificial neural network that was trained on molecular-dynamics simulations of ∼1000 proteins. The predicted entropy can be incorporated into a large class of protein software based on cost-function minimization/evaluation, in which configurational entropy is currently neglected for performance reasons. Software of this type is used for all major protein tasks such as structure prediction, protein design, NMR and X-ray refinement, docking, and mutation-effect prediction. Integrating the predicted entropy can yield a significant accuracy increase, as we demonstrate for native-state identification with the prominent protein software FoldX. The method has been termed Popcoen for Prediction of Protein Configurational Entropy. An implementation is freely available at http://fmc.ub.edu/popcoen/.



INTRODUCTION

Predicting the protein structure from the amino acid (AA) sequence is a challenging task, especially for larger proteins without sequence homologues in the PDB database.1 For these cases, ab initio methods are employed to derive the protein structure from first principles. A major class of such software is based on a cost (or scoring) function Ĝ that maps a given structure (or conformation) of the protein to a number which is intended to take lower values for structures that are "closer" to the native state of the protein.2 In a nutshell, these programs construct various trial conformations and select one of them as the predicted native fold, with the entire process guided by Ĝ. These cost-function methods rest on the well-known statistical-physics principle that the equilibrium state of a system is the one with the lowest free energy G.3 This guarantees that Ĝ exhibits its minimum at the native state, provided Ĝ is a sufficiently accurate approximation to the free energy G of the protein in solvent. Following Lazaridis and Karplus,4 one can decompose G into three major contributions, namely, the average intramolecular energy, the average solvation free energy, and the configurational entropy (multiplied by −T). The first two contributions are usually considered within Ĝ. They are modeled as functions of the average atom coordinates, whereby fluctuation-induced effects on the averages are ignored.5 In contrast, the third contribution, configurational entropy, is often neglected in Ĝ, mainly because this quantity depends strongly on information beyond the average protein structure. In detail, configurational entropy is given by

Sconf = −kB ∫ dx⃗ ρ(x⃗) log ρ(x⃗)    (1)

where the integral is performed over the entire configurational space. Here, x⃗ denotes the set of protein−atom coordinates (in a body-fixed coordinate system), ρ(x⃗) represents their joint probability distribution (in the presence of the solvent), and kB is the Boltzmann constant.4 From eq 1, it follows that Sconf depends crucially on the spatial fluctuations of the protein atoms and their correlations. This information is not present in Ĝ, since this quantity is only a function of the average atom coordinates. Therefore, incorporating Sconf into Ĝ is cumbersome. Consequently, software tools based on a cost function Ĝ account for Sconf only rudimentarily6,7 or neglect it completely.8−13 This practice may be justified for some especially compact proteins for which configurational entropy plays a minor role, but it is highly problematic in general, since Sconf has a major impact on the native-state selection of most proteins.14−17 Other software does include more elaborate estimates of Sconf in its cost function Ĝ, where Sconf is obtained via sampling of the configurational space in the vicinity of the given structure.18,19 This, however, dramatically increases the runtime of such software and, consequently, limits its applicability, since the sampling process involves many evaluations of the energy function instead of just one.




Figure 1. Workflow illustration. (Top panel) From MD simulations of 961 proteins, we measured 108 structural features per amino acid of the protein structures, as well as the partial entropies {Si}, which are the configurational entropies of the residues. With these data, we trained a neural network (termed Popcoen) on predicting the Si values from the features. (Bottom panel) Popcoen allows one to estimate configurational entropy Sconf just from a query structure via feature measurement and Popcoen evaluation. The approach is extremely fast, because it does not involve any type of configurational sampling.
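The caption above refers to 108 structural features measured per amino acid. As a rough illustration of how such per-residue descriptors can be computed from a single structure, the sketch below derives a handful of analogous quantities (per-residue SASA, radius of gyration, Cα neighbor counts, and backbone torsions) with mdtraj. This is a minimal example under stated assumptions, not the actual Popcoen feature set; the function name, the cutoff value, and the choice of descriptors are illustrative only (mdtraj reports lengths in nm).

```python
# Minimal per-residue feature sketch (illustrative only; not Popcoen's actual
# 108-feature set). Assumes a single-frame structure file readable by mdtraj.
import numpy as np
import mdtraj as md

def toy_residue_features(structure_file, cutoff_nm=1.0):
    traj = md.load(structure_file)            # one frame: the query structure
    top = traj.topology

    # Global feature: radius of gyration (nm).
    rg = md.compute_rg(traj)[0]

    # Per-residue SASA (nm^2); Popcoen uses *relative* SASA, i.e., SASA
    # normalized by residue-type reference values, which are omitted here.
    sasa = md.shrake_rupley(traj, mode='residue')[0]

    # Local-density proxy: number of C-alpha atoms within `cutoff_nm` of each
    # residue's C-alpha (analogous to the c(r_c) features of the paper).
    ca_idx = top.select('name CA')
    ca_xyz = traj.xyz[0, ca_idx]              # (n_CA, 3) coordinates in nm
    dists = np.linalg.norm(ca_xyz[:, None, :] - ca_xyz[None, :, :], axis=-1)
    neighbor_counts = (dists < cutoff_nm).sum(axis=1) - 1   # exclude self

    # Backbone torsions (radians); undefined for the terminal residues.
    _, phi = md.compute_phi(traj)
    _, psi = md.compute_psi(traj)

    return {
        'Rg': rg,
        'sasa_per_residue': sasa,
        'ca_neighbor_counts': neighbor_counts,
        'phi': phi[0],
        'psi': psi[0],
    }
```

In the actual workflow, descriptors of this kind are assembled, per residue, into a fixed-length feature vector that also encodes the residue type and properties of the neighboring residues (see the Methods section).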



Here, we suggest an alternative approach for incorporating configurational entropy into Ĝ, in which the missing information about fluctuations and correlations is estimated with probabilistic rules instead of by sampling. Let us illustrate this idea with an elementary example. It is well-known that an AA buried in the bulk of the protein usually exhibits weaker fluctuations than an AA on the surface of the protein,20 mainly because of strong steric constraints in the bulk. For this reason, surface AAs usually contribute more strongly to Sconf than buried AAs, and it might be reasonable to include an entropic term in Ĝ which favors proteins having a large surface. This term could easily be expressed in terms of the available average atom coordinates, and its incorporation would generally improve the accuracy of Ĝ. In this work, we pursued a more elaborate approach along these lines, which allows one to exploit more-complex fluctuation and correlation patterns for improving Ĝ. From molecular-dynamics (MD) simulations of ∼1000 proteins, we measured an estimator of configurational entropy, as well as various features of the protein structure (such as solvent accessibility, local density, and hydrogen bonding). With these data, we trained an artificial neural network (NN) on predicting Sconf from the features, as illustrated in Figure 1 (top panel). The NN can now be used to construct a knowledge-based estimator for Sconf from a query structure via feature measurement and NN evaluation (see the bottom panel of Figure 1). This is extremely fast and allows one to incorporate Sconf into cost functions Ĝ of existing protein software without compromising their runtime.

RESULTS

We trained a NN with simulation data to predict the configurational entropy of proteins. First, we outline the underlying entropy approximation, the measurements, and the training process. Then, we report a test of the usefulness of the NN for native-state identification of proteins.

Entropy Approximation. Configurational entropy Sconf is the (3Natoms − 6)-dimensional integral of eq 1 for a protein containing Natoms atoms. Because of its high dimension, brute-force evaluation from simulation data is intractable, and approximations must be employed to obtain a feasible expression.21 Using well-known approaches22−24 (with slight adaptations), we obtained the decomposition

Sconf ≈ ∑_{i=1}^{Nres} Si + constant    (2)

in terms of so-called "partial" entropies {Si} (i = 1, ..., Nres), which can be related to the entropy contributions of the Nres residues (see the Methods section). The Si values are functions of specific marginal entropies of torsion angles and mutual informations that do not involve integrals of dimension larger than two, enabling their reliable measurement from simulations. It turns out that the decomposition given by eq 2 is extremely valuable for entropy prediction, because local quantities are considerably easier to predict than global ones. Note that a similar decomposition property does not hold for eq 1.25

Measurements from Simulations. We analyzed molecular-dynamics (MD) simulations of 961 proteins that were available from a public database of MD trajectories (MoDEL database26). The simulation details and the data-selection process are outlined in the Methods section. From the simulation of each protein, we measured the partial entropies {Si} of all AAs, as well as the centroid structure, which is the snapshot of the simulation that is most similar to all other snapshots. From the centroid structures, we then derived 108 features per residue (see the Methods section). The features contain global properties of the protein, such as Nres, the total solvent-accessible surface area (SASA), the total number of hydrogen bonds, and properties of the gyration tensor, as well as local properties of the residue and its neighbors, such as the residue type, the burial level, the local density, the relative SASA, the average torsion angles, and the number of hydrogen bonds. Table 1 lists the coefficient of determination R2 between Si and a subset of features, showing that the features are predictive for Si (the complete list is given in the Supporting Information (SI)). A particularly large value of R2 = 0.210(7) is found for the relative solvent accessibility of the residues. This confirms the illustrative example discussed in the Introduction: a properly defined surface term might serve as a low-accuracy estimator for Sconf. However, by combining the information of all features, we were able to construct a substantially better estimator (see the discussion below).

Training of the Neural Network. Predicting the partial entropy Si from the 108 features constitutes a standard regression problem of supervised learning, which can be tackled with various machine-learning approaches. We chose to implement an artificial neural network (NN) because of the excellent scalability of NNs and their highly successful application to diverse machine-learning problems in recent years.27 By extensive search over network architectures, learning methods, and other hyper-parameters, we identified a six-layer NN with shortcut connections that provided excellent validation performance when trained with a stochastic-gradient-descent variant and regularized by dropout and early stopping (see the Methods section). On a separate test set, the trained network predicted Si with a coefficient of determination of R2 = 0.547(6) (corresponding to an RMSE of 0.706(7) kB and a Pearson correlation coefficient of 0.740(5)), which is substantially better than the prediction performance obtained for single features (see Table 1) or with a linear model (R2 = 0.421(7)).
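For readers who want to experiment with a comparable setup, the following is a minimal sketch of such a regression problem using Keras. It is a deliberately simplified stand-in: the layer sizes, dropout rate, optimizer settings, and the single residual shortcut are assumptions for illustration, not the authors' actual six-layer architecture or hyper-parameters, and the file names `features.npy` and `partial_entropies.npy` are hypothetical placeholders for the MD-derived training data.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_toy_popcoen_like_net(n_features):
    """A small feed-forward regressor with dropout and one shortcut connection."""
    inp = keras.Input(shape=(n_features,))
    x = layers.Dense(128, activation='relu')(inp)
    x = layers.Dropout(0.3)(x)
    h = layers.Dense(128, activation='relu')(x)
    h = layers.Dropout(0.3)(h)
    x = layers.Add()([x, h])          # simple shortcut (residual) connection
    out = layers.Dense(1)(x)          # predicted partial entropy S_i (in k_B)
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='mse')
    return model

# X: (n_residues_total, n_features) feature matrix; y: measured S_i values.
# Hypothetical files standing in for the per-residue training data.
X = np.load('features.npy')
y = np.load('partial_entropies.npy')

model = build_toy_popcoen_like_net(X.shape[1])
early = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.1, epochs=200, batch_size=256,
          callbacks=[early], verbose=0)
```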

Table 1. Coefficient of Determination (R2) between the Partial Entropies {Si} and Selected Features (or Combinations of Features) (a)

feature (-combination)                coefficient of determination, R2 (b)   description
totalSASA/[∏_{j=1..3} evalj]^{1/2}    0.029(9)     totalSASA = total SASA; evalj = jth eigenvalue of the gyration tensor, in ascending order
relSASA                               0.210(7)     relSASA = relative SASA of residue i
ResType(−1)                           0.018(3)
ResType(0)                            0.17(1)
ResType(1)                            0.014(2)     ResType(s) = residue-type label (binary coded) for residue i + s
|dist|/Rg                             0.12(1)      dist = vector from the Cα atom to the center of mass; Rg = radius of gyration
[evec1 · dist]^2/eval1                0.026(6)
[evec2 · dist]^2/eval2                0.026(6)
[evec3 · dist]^2/eval3                0.043(5)     evecj = eigenvector associated with evalj
c(6)                                  0.122(7)
c(10)                                 0.19(1)
c(14)                                 0.20(1)
c(18)                                 0.162(7)
c(22)                                 0.11(1)      c(rc) = number of Cα atoms in a sphere of radius rc around residue i
totalHbs/Nres                         0.016(3)     totalHbs = total number of hydrogen bonds; Nres = number of residues
Hbs(−1)                               0.056(5)
Hbs(0)                                0.068(4)
Hbs(1)                                0.032(7)     Hbs(s) = number of hydrogen bonds of residue i + s
ψ(−1) for ALA                         0.25(2) (c)
ψ(0) for ALA                          0.33(1) (c)
ψ(1) for ALA                          0.20(2) (c)
ϕ(0) for GLY                          0.17(2) (c)
χ1(0) for LEU                         0.090(8) (c)
χ2(0) for ILE                         0.13(1) (c)
ω(0) for SER                          0.04(1) (c)  ψ(s) = torsion angle ψ of residue i + s of the centroid (or query) structure if residue i is of the given residue type; analogously for the other torsion angles

(a) The full list for all features is given in the Supporting Information (SI). The R2 values are significantly larger than zero, which proves that the features are predictive for Si. The most predictive feature is the relative SASA of the residues (relSASA), which alone allows one to explain 21% of the variance. The features were used for training a neural network, resulting in an improved accuracy of R2 = 0.547(6) by combining the information of all features. (b) R2 is defined as R2 = 1 − MSE/Var(Si), where MSE represents the mean squared error when the neural network was trained on predicting Si just from a single feature, and Var represents the variance. (c) MSE and Var measured for data reduced to the given residue type.

The network's accuracy on Si results in an excellent performance for the prediction of the accumulated entropy

SPC = ∑_{i=1}^{Nres} Si    (3)

(referred to as Popcoen entropy hereafter), as confirmed by an R2 value of 0.66(5) (Pearson = 0.81(4)) for the prediction of SPC/√Nres on the test set. Scatter plots of measured versus predicted values are shown in Figure S2 in the SI. The Popcoen prediction for SPC/√Nres is more accurate than for the individual Si values (bootstrapped p-value = 0.016). This stems from the fact that the Si values inside a protein are typically positively correlated, which broadens the SPC distribution, i.e., std-dev(SPC/√Nres) = η1 · std-dev(Si), with η1 = 3.5(3) > 1. Since the feature set also contains global features of the proteins, Popcoen is able to partially capture these correlations, such that RMSE(SPC/√Nres) = 2.4(3) kB = η2 · RMSE(Si), with η2 = 2.8(3) < η1.

Entropy Prediction Tool Popcoen. The entropy prediction program Popcoen has been made available via the Internet at http://fmc.ub.edu/popcoen/. It allows one to estimate the configurational entropy of a protein structure via two calculation steps (see the bottom panel of Figure 1). First, the query structure is loaded and the 108 features per residue are calculated. Second, the NN is evaluated for these features, which yields predictions for the partial entropies {Si} of the residues. The program outputs the Si values and the Popcoen entropy SPC (eq 3), which represents the Popcoen estimate (in units of kB) for Sconf + C, where C is constant for a given AA sequence. In addition, a reliability number λ ∈ [0, 1) is output, which reflects the fact that the Popcoen entropy is reliable only when the given query structure is not too different from the training data (see the SI).
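Schematically, the two calculation steps described above (feature extraction, then network evaluation and summation per eq 3) can be expressed as follows. This is a sketch, not the tool itself: `compute_features` is a hypothetical callable returning an (Nres, 108) feature matrix, `model` is any trained regressor with a scikit-learn/Keras-style `predict` method, and the computation of the reliability number λ (described in the SI) is not reproduced here.

```python
import numpy as np

def predict_popcoen_entropy(structure_file, model, compute_features):
    """Illustrative two-step evaluation: features -> per-residue S_i -> S_PC."""
    features = compute_features(structure_file)         # step 1: 108 features per residue
    s_i = np.asarray(model.predict(features)).ravel()    # step 2: partial entropies (k_B)
    s_pc = s_i.sum()                                      # eq 3: Popcoen entropy S_PC
    return s_i, s_pc
```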




Figure 2. (A) Normalized histograms of SPC/Nres for all 459 CASP natives and for the reduced set of 101 CASP natives that were not identified by FoldX. The associated medians are indicated by the arrows. FoldX fails to identify the native state preferentially for samples with large Popcoen entropy density (p-value of 0.5. The numbers (and percentages) of natives are given that are ranked equally by both cost functions, or better by one or the other. This shows that ĜFX+PC performs significantly better than ĜFX (p-values obtained with a generalized likelihood ratio test). In the bottom portion of the table, the number (and percentage) of natives is given that have been identified (+) by both cost functions, have not been identified (−), or have been identified by only one cost function. For the reduced dataset, ĜFX+PC performs significantly better than ĜFX. (a)

evaluation where no configurational sampling is carried out, to save computational time. Entropy is estimated by evaluating a neural network that was trained on simulation data of 961 representative proteins. This allows reliable Sconf predictions for most protein structures. For cases that are very distinct from the training proteins, Popcoen entropies may not be reliable, as indicated by a small value of the reliability number (λ ≲ 0.5). We plan to reparametrize Popcoen on a larger set of proteins when the simulation database contains substantially more trajectories. This will further broaden Popcoen's applicability.

In an explicit test, we have shown that incorporating Popcoen into the FoldX cost function improves the accuracy of FoldX for native-state identification. We expect that a similar precision gain can be obtained for all protein software based on cost-function minimization/evaluation which currently neglects Sconf or accounts for it only rudimentarily. Software of this type has been developed for all major tasks related to proteins, such as structure prediction, protein design, NMR and X-ray refinement, docking, and mutation-effect prediction. Therefore, Popcoen may be helpful for many aspects of protein science.

We have designed Popcoen to be easy to integrate into existing software, since Popcoen can be run as a separate server process. This enables researchers to test whether Popcoen improves their protein software with very limited effort, as only socket communication must be established. However, to prevent double counting, it might additionally be necessary to modify the original cost function to eliminate partial effects of Sconf already captured within Ĝ.

Because of the large entropic changes upon docking,31 considering Sconf is particularly important for protein-docking predictions. In collaboration with the developers of pyDock,10 we are currently investigating the impact of Popcoen on this task. We are further working on a similar knowledge-based approach for the average solvation free energy. This may result in a more accurate estimator than current approaches (such as Generalized Born or GBSA32), since it will account for the heterogeneous surface characteristics of proteins.

METHODS

Entropy Expression. Configurational entropy Sconf is the integral of eq 1, where x⃗ denotes the 3Natoms − 6 atom coordinates in a body-fixed coordinate system and ρ(x⃗) is their joint probability density in the presence of the solvent. This definition stems from a formal decomposition of the full phase-space integral over all degrees of freedom (dofs) of the protein−solvent system into protein dofs and solvent dofs, introduced by Lazaridis and Karplus4 (for details, see the SI). Equation 1 is better expressed in bond-angle-torsion coordinates to facilitate the isolation of dofs which are strongly confined by the covalent structure of the protein.22 These coordinates are essentially "frozen" at room temperature,20 so that their contribution to Sconf can be neglected and eq 1 reduces to an integral over the flexible torsion angles, usually denoted {ψi}, {ϕi}, and {χ1i}, ..., {χ4i} (see the SI). The remaining integral is still of very high dimension (namely, ∼3.8 Nres-dimensional), rendering its brute-force evaluation from simulation data unfeasible (curse of dimensionality). Therefore, an approximation must be employed. Here, we used an approach similar to the second-order Maximum Information Spanning Tree (MIST) approximation of entropy,23,24 which allowed us to decompose Sconf into a sum of partial entropies {Si} which could reliably be measured from the available simulation data, since only one- and two-dimensional integrals are involved. We obtained the expression

Sconf ≈ ∑_{i=1}^{Nres} Si + C(T, AA sequence)    (4)

where C is a constant for different conformations of a given AA sequence at constant temperature, pressure, and pH. The partial entropies (except S1 and SNres), given as

Si ≔ ∑_{θ∈Θi} [H(θ) − I(θ; κ(θ))]    (5)

are defined in terms of the marginal entropies

H(θ) ≔ −kB ∫_{−π}^{π} dθ ρ(θ) log ρ(θ)    (6)

(denoted here by H to avoid confusion with Si) and the mutual informations

I(θ; τ) ≔ H(θ) + H(τ) − H(θ, τ)    (7)

Here, ρ(θ) is the marginal distribution of a single torsion angle θ, and

H(θ, τ) ≔ −kB ∫_{−π}^{π} ∫_{−π}^{π} dθ dτ ρ(θ, τ) log ρ(θ, τ)    (8)



is the entropy of the two-dimensional joint distribution ρ(θ, τ) of two torsion angles θ and τ. The sum of eq 5 runs over the set Θi containing all unfrozen torsion angles of residue i; i.e., Θi is the set {ϕi, ψi} extended by those side-chain angles χ1i, χ2i, χ3i, and χ4i that exist in residue i. The function κ encodes adjacent torsion angles along the covalent structure of the protein via the definitions κ(ϕi) ≔ ψi−1, κ(ψi) ≔ ϕi, κ(χ1i) ≔ ψi, κ(χ2i) ≔ χ1i, κ(χ3i) ≔ χ2i, and κ(χ4i) ≔ χ3i. The detailed calculation leading to eq 5, together with the slightly modified formulas for S1 and SNres, can be found in the SI. The dependence of C on the AA sequence is also discussed in the SI. This dependence must be considered when entropies of different sequences are compared, e.g., in mutation studies.

MD Data. We trained the neural network with data measured from MD simulations. The trajectories were obtained from the MoDEL database, which contains simulation trajectories for a large and representative set of proteins covering all major fold classes.26 All simulations used for this study had been performed with AMBER 8.0 (parm99 force field) under ambient conditions (T = 300 K, p = 1 atm) and constant volume. Each simulation was performed for a single protein in explicit solvent (TIP3P water). After careful equilibration to the simulation conditions, production runs lasted for 10 ns using a time step of 2 fs, and snapshots were taken every picosecond. All trajectories used had passed all eight quality criteria of the MoDEL database,26 which guarantees that unfolding events did not occur. In total, we analyzed simulation data for 961 proteins, ranging from 21 AAs to 539 AAs in length, with an average size of 151.4 AAs.

The simulation data were available from the MoDEL database in compressed form, where a nonreversible compression algorithm had been employed.33 Enormous compression rates were achieved by reducing the fluctuation spectra of the Cartesian coordinates to the most contributing modes. At first glance, this might seem problematic for the present work, because configurational entropy is crucially dependent on fluctuations. However, as we show in the SI, the compression causes only a negligible error when the partial entropies are rescaled appropriately (see eq (S19) in the SI).

Measurements. From the MD trajectories, we measured all marginal entropies and mutual informations needed to construct the partial entropies {Si}. The corresponding integrals were calculated using histograms, which is efficient and sufficiently accurate.34 Four hundred (400) bins of equal size were used for the one-dimensional integral of H(θ), while 20 bins per angle were used for the one- and two-dimensional integrals needed for computing I, to ensure that I ≥ 0 holds exactly for the estimator. Finally, the rescaling that accounts for the data compression (eq (S19)) was applied.

We further derived the centroid structure, defined as the snapshot k with the largest ∑l exp(−dkl/d̄), where dkl is the root-mean-square deviation between snapshots k and l, and d̄ is the mean of all dkl encountered for the trajectory. The alignment was performed with mdtraj.35 The calculation was accelerated at the cost of negligible imprecision: first, a centroid was calculated for each of the 100 consecutive subtrajectories of 100 ps length; then, the centroid of these centroids was derived.
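A minimal numpy sketch of the histogram-based estimators described in the Measurements paragraph above: H(θ) of eq 6 from a histogram over (−π, π), I(θ; τ) of eq 7 from matching one- and two-dimensional histograms, and the partial entropy of eq 5 as a sum over a residue's torsion angles. The bin counts follow the text; everything else (the array and function names, the dictionary-based bookkeeping of the κ pairing, and the omission of the compression rescaling of eq S19) is an illustrative simplification.

```python
import numpy as np

KB = 1.0   # entropies reported in units of k_B

def marginal_entropy(theta, n_bins=400):
    """H(theta) of eq 6 from samples of one torsion angle (radians)."""
    p, edges = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi), density=True)
    width = edges[1] - edges[0]
    nz = p > 0
    return -KB * np.sum(p[nz] * np.log(p[nz])) * width

def mutual_information(theta, tau, n_bins=20):
    """I(theta; tau) of eq 7; using the same binning for the 1D and 2D
    histograms keeps the plug-in estimator non-negative, as noted in the text."""
    h_theta = marginal_entropy(theta, n_bins)
    h_tau = marginal_entropy(tau, n_bins)
    p2, xe, ye = np.histogram2d(theta, tau, bins=n_bins,
                                range=[[-np.pi, np.pi], [-np.pi, np.pi]],
                                density=True)
    area = (xe[1] - xe[0]) * (ye[1] - ye[0])
    nz = p2 > 0
    h_joint = -KB * np.sum(p2[nz] * np.log(p2[nz])) * area
    return h_theta + h_tau - h_joint

def partial_entropy(torsions, kappa):
    """S_i of eq 5. `torsions` maps the angle names of residue i to sample
    arrays; `kappa` maps each angle name to the samples of its partner angle
    as defined by the kappa function in the text."""
    return sum(marginal_entropy(torsions[name]) -
               mutual_information(torsions[name], kappa[name])
               for name in torsions)
```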
From the centroid structures, as well as from the CASP data (see below), we calculated the following 108 features for all residues (labeled by i, 1 ≤ i ≤ Nres):
Feature 1: Nres = number of residues.

Features 2−4: eval1, eval2, eval3 = eigenvalues of the gyration tensor, sorted in ascending order.
Features 5−7: evec1 · dist, evec2 · dist, evec3 · dist, where evecj is the jth eigenvector of the gyration tensor associated with eigenvalue evalj and directed such that its first component is nonnegative; dist is the vector from the center of mass to the Cα atom of residue i. The center of mass is calculated as the (unweighted) average position of all protein atoms except hydrogens. Note that features 2−7 contain information about the position of residue i inside an ellipsoid representing the protein shape.
Features 8−18: c(2), c(4), c(6), ..., c(22), where c(rc) is the number of Cα atoms with a Euclidean distance of less than rc [in Å] or a chain distance of