Article
A New Knowledge-based Scoring Function with Inclusion of Backbone Conformational Entropies from Protein Structures Xinxiang Wang, Di Zhang, and Sheng-You Huang J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00601 • Publication Date (Web): 14 Feb 2018 Downloaded from http://pubs.acs.org on February 16, 2018
Journal of Chemical Information and Modeling
A New Knowledge-based Scoring Function with Inclusion of Backbone Conformational Entropies from Protein Structures Xinxiang Wang, Di Zhang, and Sheng-You Huang∗
School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
∗Email: [email protected]; Phone: +86-27-87543881; Fax: +86-027-87556576
Abstract

Accurate prediction of a protein’s structure requires a reliable free energy function that consists of both enthalpic and entropic contributions. Although considerable progress has been made in the calculation of potential energies in protein structure prediction, the computation of protein entropies has lagged far behind because estimating entropies often requires expensive conformational sampling. In this study, we have used a knowledge-based approach to estimate the backbone conformational entropies from experimentally determined native structures. Instead of conducting computationally expensive MD/MC simulations, we obtained the entropies of protein structures from the normalized probability distributions of backbone dihedral angles observed in the native structures. Our new knowledge-based scoring function with inclusion of the backbone entropies, which is referred to as ITDA, was extensively evaluated on 16 commonly used decoy sets and compared with 50 other published scoring functions. ITDA was significantly superior to the other tested scoring functions in selecting native structures from decoys. The present study suggests the importance of backbone conformational entropies in protein structures and provides a way for fast estimation of the entropic effect.
1 INTRODUCTION

Knowledge of a protein’s structure is necessary to understand its dynamics and function. As such, protein structure prediction has been an active area in contemporary computational biology and biophysics. According to Anfinsen’s thermodynamic hypothesis [1], the native structure of a protein is determined only by its sequence and corresponds to a unique, stable and kinetically accessible minimum on the free energy landscape. Thus, a critical part of protein structure prediction is to develop an energy scoring function that can accurately give the free energy of a given protein structure [2–5]. The native structure of a protein can then be predicted by searching for the global minimum of its free energy landscape.

The free energy ∆G of a protein structure can be expressed as ∆G = ∆E − T∆S, where ∆E and ∆S are the enthalpic and entropic components, respectively, and T is the absolute temperature. In real applications, the enthalpic contribution ∆E may be approximated by the total potential energy, summed over all the interaction potentials between the atoms of the protein structure. Therefore, for simplicity, ∆E will refer to the potential energy in this study. The entropic term ∆S requires extensive sampling of the protein conformations. Both terms are challenging to calculate because each combines multiple contributions.

Over the years, a number of scoring functions have been developed to calculate the enthalpic term ∆E. These scoring functions can be grouped into two broad categories: physics-based and knowledge-based. In physics-based scoring functions, the potential energy of a structure is defined as the sum of multiple components, including van der Waals interactions, electrostatic interactions, and bond stretching, bending and torsional forces. Their force field parameters are normally derived from quantum mechanical calculations according to the principles of physics [6–11].
Despite its lucid physical meaning, the physics-based scoring function has not been widely adopted in protein structure prediction, mostly because of its high computational cost. In contrast, owing to its good balance between accuracy and speed, the knowledge-based scoring function has been widely used in protein structure prediction and appears to be the most successful approach [12–14]. The energy terms of knowledge-based scoring functions are directly converted from the occurrence frequencies $w_{ij}(r)$ of atom pair $ij$ in the native structures by using an inverse Boltzmann relation, $u_{ij}(r) = -k_B T \ln w_{ij}(r)$ [15–22]. Since the pioneering work of Tanaka and Scheraga [23], the knowledge-based scoring function has advanced considerably. A number of knowledge-based scoring functions have been developed in the past three decades [24–39]. Various strategies have also been used to improve their accuracy, such as including multi-body potentials, improving the reference state, and considering the orientation of interactions. Recently, to overcome the long-standing reference state problem [15, 21], we developed a statistical mechanics-based iterative method to determine a set of distance-dependent all-atom potentials, further pushing the accuracy limit of knowledge-based scoring functions [40].

Compared with the advances in potential energies, the computation of entropies has lagged and received relatively little investigation in protein structure prediction, although entropy is crucial for a protein’s folding and function as well as its binding with other molecules [41–45]. The challenges in predicting entropies come from two aspects. One is that the protein entropy is a combination of contributions from multiple sources, including the protein, the solvent and its binding partners [43]. The other is that accurate calculation of entropy often requires extensive sampling of the conformational states of a protein structure. Such sampling is computationally very expensive because of the huge number of degrees of freedom in the protein system. Although NMR relaxation experiments make it possible to quantitatively determine conformational entropies based on the estimated motions of the backbone (BB) and side chains (SC) [46, 47], measuring entropies directly from experiments is still difficult for proteins.
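The inverse Boltzmann relation $u_{ij}(r) = -k_B T \ln w_{ij}(r)$ quoted in this section can be illustrated with a minimal sketch. The function name `inverse_boltzmann`, the small `floor` guard against taking the logarithm of zero, and the example frequencies are all illustrative assumptions and are not taken from the paper.

```python
import math

def inverse_boltzmann(frequencies, kBT=1.0, floor=1e-6):
    """Convert relative occurrence frequencies w_ij(r) into potentials
    u_ij(r) = -kBT * ln w_ij(r).

    `floor` is an assumed guard so that unobserved bins (w = 0) do not
    produce an infinite energy; it is not part of the relation itself.
    """
    return [-kBT * math.log(max(w, floor)) for w in frequencies]

# Hypothetical relative frequencies for one atom-pair type in four distance bins
w = [0.5, 1.0, 2.0, 1.0]
u = inverse_boltzmann(w)
# Bins observed more often than the reference (w > 1) get negative (favorable)
# energies; bins observed less often (w < 1) get positive energies.
```

With $k_B T = 1$, a bin whose frequency equals the reference value (w = 1) receives zero energy, which is why the choice of reference state matters so much for this family of potentials.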
Therefore, current computational methods for conformational entropy normally rely on molecular dynamics (MD) and Monte Carlo (MC) simulations to sample possible conformations. With the generated conformations or trajectories, entropies can then be computed by various approaches, such as the local states method [48], the quasi-harmonic method [49–52], hypothetical scanning [53], and methods using the covariance matrix of atomic fluctuations [54, 55] or dihedral angles [56–61]. However, despite some successes on a limited number of structures, these methods are not practical for evaluating a huge number of conformations for protein structure prediction because of the high computational cost of simulations.

In this work, we have taken a new approach to estimate the conformational entropy of a protein structure through the probability distribution of backbone dihedral angles based on a knowledge-based method. Instead of conducting computationally expensive MD/MC simulations, we have computed the protein’s conformational entropy according to the frequencies of the backbone (φ, ψ) dihedral angles observed in experimentally determined protein structures from the Protein Data Bank (PDB) [62]. Thus, our approach is computationally efficient enough to be used for a large number of structures in protein structure prediction. Extensive tests on 16 diverse decoy sets from the literature showed that our new scoring function with the backbone entropy significantly enhanced the predictive power of the knowledge-based scoring function in selecting native protein structures from decoys, outperforming 50 other scoring functions by a significant margin. The results suggest the efficacy of our approach for estimating the backbone conformational entropy of protein structures.
2 METHODS

2.1 Entropy calculations

The conformational entropy of a protein comes from its backbone and side chains, which may be estimated from the frequencies of the backbone $(\phi, \psi)$ dihedral angles and side chain rotameric angles $[\chi_n]$ [56]. As the backbone entropy dominates the total change in entropy in protein folding [60], we will focus on the backbone entropy in this study. Let’s consider a protein chain of $N$ amino acids. The total number of conformational states of the backbone for the protein can be expressed as

$$\Omega_0 \approx \prod_{i=2}^{N-1} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \rho_i(\phi_i, \psi_i)\, d\phi_i\, d\psi_i \tag{1}$$
where $\rho_i(\phi_i, \psi_i)$ is the number density of backbone conformational states for the $i$-th residue at a microstate $(\phi_i, \psi_i)$ in the $\phi$-$\psi$ dihedral angle space. Strictly speaking, Eq. (1) is only valid if the backbone angles of all the other residues of the protein are independent of those of the $i$-th residue. However, the backbone conformation of a residue always correlates with that of its neighbors to some extent. Therefore, Eq. (1) is only an approximation of the number of conformational states for a protein. The number density $\rho_i(\phi_i, \psi_i)$ here reflects both the Ramachandran basin populations and the torsional vibrations in the backbone $\phi$-$\psi$ space [60]. Thus, the total entropy $S_0$ for the protein chain can be calculated according to the Boltzmann formula as

$$S_0 = k_B \ln \Omega_0 = k_B \sum_{i=2}^{N-1} \ln \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \rho_i(\phi_i, \psi_i)\, d\phi_i\, d\psi_i \tag{2}$$
where $k_B$ is the Boltzmann constant. Thus, the backbone entropy $S_1$ for a specific structure of the protein can be approximated as

$$S_1 = k_B \ln \Omega_1 = k_B \sum_{i=2}^{N-1} \ln \left[ \rho_i(\phi_i, \psi_i)\, \Delta\phi_i\, \Delta\psi_i \right] \tag{3}$$
where $\phi_i$ and $\psi_i$ are the backbone dihedral angles of the $i$-th residue for the protein structure. Here, each factor $\rho_i(\phi_i, \psi_i)\, \Delta\phi_i\, \Delta\psi_i$ of $\Omega_1$ stands for the number of backbone conformational states in a $(\Delta\phi_i, \Delta\psi_i)$ local region at $(\phi_i, \psi_i)$, a proxy for entropy in that region of conformational space. Correspondingly, the loss of backbone entropy for this structure is

$$\Delta S = S_1 - S_0 = k_B \sum_{i=2}^{N-1} \ln \left[ \rho_i(\phi_i, \psi_i)\, \Delta\phi_i\, \Delta\psi_i \right] - k_B \sum_{i=2}^{N-1} \ln \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \rho_i(\phi_i, \psi_i)\, d\phi_i\, d\psi_i \tag{4}$$
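Then, we can rewrite Eq. (4) as

$$\Delta S = k_B \sum_{i=2}^{N-1} \ln \frac{\rho_i(\phi_i, \psi_i)\, \Delta\phi_i\, \Delta\psi_i}{\int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \rho_i(\phi_i, \psi_i)\, d\phi_i\, d\psi_i} = k_B \sum_{i=2}^{N-1} \ln p_i(\phi_i, \psi_i) \tag{5}$$

where

$$p_i(\phi_i, \psi_i) = \frac{\rho_i(\phi_i, \psi_i)\, \Delta\phi_i\, \Delta\psi_i}{\int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \rho_i(\phi_i, \psi_i)\, d\phi_i\, d\psi_i} \tag{6}$$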
is the probability or fraction of the conformational state for the $i$-th residue at backbone $(\phi_i, \psi_i)$. Then, the issue becomes how to calculate the probability distribution $p_i(\phi_i, \psi_i)$ of backbone $(\phi_i, \psi_i)$ for the $i$-th residue. Theoretically, the probability distribution $p_i(\phi_i, \psi_i)$ needs to be calculated through an extensive MD simulation for a protein structure. Fortunately, recent studies of entropy reduction in unfolded peptides showed that the conformational distribution of tripeptides like GxG correlated well with the conformational distribution of the corresponding amino acid residues in the native structure in terms of entropic loss, where x stands for the studied residue [61]. This finding suggests two things. One is that the conformational distribution of the $i$-th residue largely depends on its neighbors and is relatively independent of its environment. Therefore, we can adopt a neighbor-dependent Ramachandran probability distribution, $p_i(\phi_i, \psi_i) = p_i(\phi_i, \psi_i \mid L, C, R)$, where C is the central residue $i$ whose $\phi_i$, $\psi_i$ are being calculated and L and R are the identities of its left and right neighbors [63]. Thus, for a given protein structure, the entropy can be rewritten as

$$\Delta S = k_B \sum_{i=2}^{N-1} \ln p_i(\phi_i, \psi_i \mid L, C, R) \tag{7}$$
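As a rough illustration of Eq. (7), the sketch below sums the log-probabilities of the interior residues of a chain. The dictionary layout of `pdf`, the helper `bin_index`, the 30° default bin width, and the `p_min` floor for unobserved cells are assumptions made for this example only; the paper does not specify how the lookup table is stored or how empty bins are handled.

```python
import math

def backbone_entropy(sequence, dihedrals, pdf, bin_width=30.0, kB=1.0, p_min=1e-6):
    """Eq. (7): dS = kB * sum over residues 2..N-1 of ln p(phi_i, psi_i | L, C, R).

    sequence  : string of one-letter residue codes, length N
    dihedrals : list of (phi, psi) tuples in degrees, one per residue
    pdf       : dict mapping (L, C, R, phi_bin, psi_bin) -> probability
                (hypothetical storage layout)
    p_min     : assumed floor for cells missing from the table
    """
    def bin_index(angle):
        # Map an angle in degrees onto a bin index, wrapping at +/-180
        return int(((angle + 180.0) % 360.0) // bin_width)

    total = 0.0
    # Python indices 1..N-2 correspond to residues 2..N-1 in the paper's numbering
    for i in range(1, len(sequence) - 1):
        phi, psi = dihedrals[i]
        key = (sequence[i - 1], sequence[i], sequence[i + 1],
               bin_index(phi), bin_index(psi))
        total += math.log(max(pdf.get(key, 0.0), p_min))
    return kB * total

# Toy example: a GAG tripeptide whose central residue sits in one known cell
toy_pdf = {("G", "A", "G", 4, 4): 0.25}
dS = backbone_entropy("GAG", [(0.0, 0.0), (-60.0, -45.0), (0.0, 0.0)], toy_pdf)
```

The terminal residues are skipped because they lack one of the two neighbors (and one of the two dihedral angles) needed for the tripeptide lookup.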
The other is that the neighbor-dependent probability distribution function (PDF) $p_i(\phi_i, \psi_i \mid L, C, R)$ is relatively general and may be calculated from experimentally determined native structures instead of conducting MD/MC simulations [61, 64], given that experimental structures represent snapshots of the dynamic protein in a real environment. In the present study, the general PDF for a central residue C with neighbors L and R, $p(\phi, \psi \mid L, C, R)$, was calculated from experimental protein structures in the PDB as follows:

$$p(\phi, \psi \mid L, C, R) = \frac{\sum_{m=1}^{M} n^{m}(\phi, \psi \mid L, C, R)}{\sum_{m=1}^{M} \sum_{\phi=-\pi}^{\pi} \sum_{\psi=-\pi}^{\pi} n^{m}(\phi, \psi \mid L, C, R)} \cdot f(\phi, \psi) \tag{8}$$
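where

$$n^{m}(\phi, \psi \mid L, C, R) = \sum_{\substack{i=2 \\ i \in C;\ i-1 \in L,\ i+1 \in R}}^{N-1} \ \sum_{\phi_i = \phi - \Delta\phi/2}^{\phi + \Delta\phi/2} \ \sum_{\psi_i = \psi - \Delta\psi/2}^{\psi + \Delta\psi/2} 1 \tag{9}$$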
is the number of backbone conformation states for a central residue of type C with neighbors L and R, within a dihedral angle interval from $(\phi - \Delta\phi/2, \psi - \Delta\psi/2)$ to $(\phi + \Delta\phi/2, \psi + \Delta\psi/2)$ for the $m$-th native structure with $N$ residues in the training set of $M$ proteins. The $f(\phi, \psi)$ is a residue-independent normalization factor and has the following form

$$f(\phi, \psi) = 1 \Bigg/ \frac{\sum_{m=1}^{M} N^{m}(\phi, \psi)}{\sum_{m=1}^{M} \sum_{\phi=-\pi}^{\pi} \sum_{\psi=-\pi}^{\pi} N^{m}(\phi, \psi)} \tag{10}$$

where

$$N^{m}(\phi, \psi) = \sum_{i=2}^{N-1} \ \sum_{\phi_i = \phi - \Delta\phi/2}^{\phi + \Delta\phi/2} \ \sum_{\psi_i = \psi - \Delta\psi/2}^{\psi + \Delta\psi/2} 1 \tag{11}$$
is the total number of backbone conformation states for all the residues of the $m$-th protein within a dihedral angle interval from $(\phi - \Delta\phi/2, \psi - \Delta\psi/2)$ to $(\phi + \Delta\phi/2, \psi + \Delta\psi/2)$. Therefore, $p(\phi, \psi \mid L, C, R)$ represents an estimate of the backbone entropy of a tripeptide in the accessible conformational space allowed by the neighbors of its central residue.
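The counting scheme of Eqs. (8)–(11) can be sketched as follows. This is one reading of the equations, assuming that $f(\phi, \psi)$ is the reciprocal of the residue-independent frequency of each $(\phi, \psi)$ cell; the function names, the input layout (a list of sequence and dihedral-angle pairs), and the toy data are hypothetical.

```python
from collections import Counter

def bin_index(angle, bin_width=30.0):
    # Map an angle in degrees onto a bin index, wrapping at +/-180
    return int(((angle + 180.0) % 360.0) // bin_width)

def build_pdf(structures, bin_width=30.0):
    """structures: iterable of (sequence, dihedrals) pairs, where dihedrals is a
    list of (phi, psi) tuples in degrees (assumed input layout)."""
    n = Counter()       # triplet counts per cell, summed over structures (Eq. 9)
    n_tot = Counter()   # triplet counts summed over all cells (denominator of Eq. 8)
    big_n = Counter()   # residue-independent counts per cell (Eq. 11)
    big_n_tot = 0       # residue-independent counts over all cells
    for seq, dih in structures:
        for i in range(1, len(seq) - 1):      # residues 2..N-1
            cell = (bin_index(dih[i][0], bin_width),
                    bin_index(dih[i][1], bin_width))
            triplet = (seq[i - 1], seq[i], seq[i + 1])
            n[triplet + cell] += 1
            n_tot[triplet] += 1
            big_n[cell] += 1
            big_n_tot += 1
    pdf = {}
    for key, count in n.items():
        triplet, cell = key[:3], key[3:]
        p_triplet = count / n_tot[triplet]       # ratio in Eq. (8)
        p_background = big_n[cell] / big_n_tot   # ratio inside Eq. (10)
        pdf[key] = p_triplet / p_background      # multiply by f = 1 / p_background
    return pdf

# Toy training set with a single five-residue chain
toy = [("GAGAG", [(0, 0), (-60, -45), (-120, 130), (-60, -45), (0, 0)])]
pdf = build_pdf(toy)
```

Under this reading the tabulated value is a ratio of a neighbor-dependent frequency to a background frequency, so it can exceed 1 for cells that a given tripeptide favors more strongly than residues in general.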
2.2 Data set

For the calculation of the backbone PDFs, we used a very large training set of non-redundant protein chains. The set was generated through the PISCES server [65], using a sequence identity cutoff of 30%, a resolution cutoff of 2.0 Å, and an R-factor cutoff of 0.25. To avoid possible bias in the derived entropies, we removed those PDB entries that overlap with our test sets from the PISCES-generated database. This yielded a final set of 10504 protein structures. The large number of experimental structures in the data set is expected to cover the whole conformational space of backbone $(\phi, \psi)$ well. The $k_B T$ was set to 1, as in our previous study. The angle intervals $\Delta\phi$ and $\Delta\psi$ were both set to 30° [64] so that enough backbone conformations were obtained for good statistics in the probability distribution.
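To make the statistics argument for 30° intervals concrete, the short calculation below counts how many $(\phi, \psi)$ cells a neighbor-dependent table has to populate. The figure of 20 residue types is an assumption (the standard amino acids); the text does not state how residue types are enumerated.

```python
BIN_WIDTH_DEG = 30      # from the text: both angle intervals are 30 degrees
N_RESIDUE_TYPES = 20    # assumption: the 20 standard amino acids

bins_per_angle = 360 // BIN_WIDTH_DEG       # bins along phi (and along psi)
cells_per_triplet = bins_per_angle ** 2     # (phi, psi) cells for one L-C-R triplet
n_triplets = N_RESIDUE_TYPES ** 3           # possible L-C-R combinations
total_cells = cells_per_triplet * n_triplets

print(bins_per_angle, cells_per_triplet, n_triplets, total_cells)
# prints: 12 144 8000 1152000
```

Halving the bin width would quadruple the number of cells per triplet, which is one way to see why a fairly coarse grid helps keep the counts per cell meaningful.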
2.3 Iterative knowledge-based function with backbone entropy

For a given protein structure with $N$ amino acids, its backbone conformational entropy $\Delta S$ can be calculated according to Eq. (7) as

$$\Delta S = k_B \sum_{i=2}^{N-1} \ln p_i(\phi_i, \psi_i \mid L, C, R) \tag{12}$$
where $p_i(\phi_i, \psi_i \mid L, C, R)$ is the probability of the backbone conformation for the $i$-th residue with type C and backbone dihedral angles $(\phi_i, \psi_i)$, and L and R are the types of its left and right neighbors. The potential energy ∆E of the structure was calculated using our pair interaction potentials from an improved iterative knowledge-based scoring function. Specifically, the potential energy was obtained by summing up the interaction energies over all the atom pairs in the protein as ∆E =
X
uij (r)
(13)
i