
Integrating bonded and nonbonded potentials in the knowledge-based scoring function for protein structure prediction. Xinxiang Wang and Sheng-You Huang. J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 02 May 2019


Journal of Chemical Information and Modeling

Integrating bonded and nonbonded potentials in the knowledge-based scoring function for protein structure prediction Xinxiang Wang and Sheng-You Huang∗

School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China



Email: [email protected]; Phone: +86-27-87543881; Fax: +86-027-87556576


Abstract

An accurate energy scoring function is crucial for protein structure prediction. Given the increasing number of experimentally determined structures, knowledge-based approaches have been widely used to develop scoring functions for protein structure prediction over the past three decades. However, current scoring functions often consider only nonbonded interactions and neglect bonded potentials such as covalent bonds and angles for the sake of speed and simplicity. Although such scoring functions may be successful on fully relaxed conformations, they have difficulty ranking decoys with distorted bonds or angles, especially when used for conformational sampling in structure prediction. Therefore, such a scoring function may perform well on one or several decoy sets, but it often has limited accuracy on large, diverse sets. To address this limitation, we have developed a composite knowledge-based scoring function, named ITCPS, by integrating bonded and nonbonded potentials as well as orientation-dependent and hydrophobic interactions. ITCPS was extensively evaluated on 18 decoy sets of 927 proteins, including three sets of 3DRobot, the AMBER benchmarking set, HR, CASP5-8, CASP9-13, eight sets of Decoy ‘R’ Us, MOULDER, ROSETTA, and the I-TASSER set, and compared with 51 other scoring functions. Overall, ITCPS performed the best among the 52 scoring functions and achieved a good performance on all the test sets. Of 927 proteins, ITCPS recognized the native structures for 842 proteins, giving a success rate of 90.8% and an average Z-score of 3.36. Moreover, ITCPS also exhibited a strong ability to distinguish the best near-native structure among decoys and achieved a significantly better performance than the other tested scoring functions. The present model is expected to be beneficial for the development of scoring functions for other interactions.


1 INTRODUCTION

The structure of a protein is necessary for investigating its dynamics and functions [1–4]. However, due to the technical difficulty and cost of experimental methods, the number of experimentally determined structures is still limited [5] compared to the huge number of protein sequences. Therefore, various computational approaches have been developed to predict protein structures from their sequences, in which a great challenge is the development of an accurate energy scoring function [6–8]. According to Anfinsen’s dogma, the native structure of a protein is determined only by its sequence, and corresponds to a unique, stable and kinetically accessible minimum of the free energy [9]. Thus, an ideal scoring function should be able to give the native structure the lowest free energy compared to nonnative decoys. Over the years, a number of scoring functions have been developed for protein structure prediction, which can be grouped into two broad categories: physics-based force fields and knowledge-based empirical potentials. Physics-based approaches approximate the energy function through a set of physics-based terms associated with bond lengths, angles, torsional angles, van der Waals (VDW), and electrostatic interactions [10–15]. Despite its lucid physical meaning, the physics-based scoring function is computationally expensive and normally used in molecular dynamics (MD) simulations. The other group of approaches are knowledge-based scoring functions whose energy terms are directly converted from the occurrence frequencies of interacting atoms observed in the experimentally determined native structures by using an inverse Boltzmann relationship [16–22]. Compared to physics-based force fields, knowledge-based scoring functions have shown a good balance between speed and accuracy and have been more successful in protein structure prediction [23–26].
Since the pioneering work of Tanaka and Scheraga [27], various approaches have been proposed in order to obtain an accurate knowledge-based scoring function [28–45]. To address the reference state problem [16, 22], Zhou and Zhou introduced a volume correction in the derivation of their all-atom pairwise statistical energy function (DFIRE) by using a distance-scaled, finite ideal gas approach [36]. Zhang and Zhang presented a pairwise distance-dependent, atomic statistical potential function (RW), using an ideal random-walk and no amino acid-specific chain as reference state [39]. Including a side-chain orientation-dependent energy term into RW (RWplus) was found to further improve the decoy recognition ability of the statistical potential [39]. Recently, we have developed a statistical mechanics-based iterative method to derive a set of distance-dependent, all-atom knowledge-based potentials from experimental structures [46]. Our iterative method circumvents the long-standing reference state problem by optimizing the pair potentials iteratively through comparisons of the predicted and experimental pair distribution functions until the potentials can discriminate the native structures from decoys in the training set [46]. In addition to the pairwise potentials in traditional knowledge-based scoring functions, orientation-dependent statistical potentials were also included to consider the directionality of some interactions like hydrogen bonding and dipolar effects. Yang and Zhou proposed a dipolar DFIRE (dDFIRE) energy function based on the orientation angles involved in dipole-dipole interactions, in which each polar atom was treated as a dipole [37]. The inclusion of dipolar interactions improved the performance of dDFIRE over DFIRE for ab initio folding of protein terminal regions with secondary structures. Lu et al. have developed an orientation-dependent statistical potential based on side-chain packing (PSP), named OPUS-PSP [38]. It was designed to bridge the gap between all-atom and residue-based potentials. Twenty residues were decomposed into 19 rigid-body blocks, which constitute the foundation of angular analysis. Zhou and Skolnick also proposed a generalized orientation-dependent, all-atom statistical potential (GOAP) by associating each heavy (nonhydrogen) atom with a plane using its two nearest-neighboring heavy atoms [40].
Based on a similar idea, rotamer state indexes were introduced into GOAP to form the rotamer-dependent energy scoring function (ROTAS) [47]. In addition, to design realistically packed protein backbones, Chu and Liu introduced a tetrahedron-based protein backbone-based orientational statistical energy (tetraBASE) model [48]. The model has been applied to optimize the tertiary organizations of protein secondary structure elements of pre-designated types with Monte Carlo (MC) simulated annealing starting from artificial initial conformations [48].


Despite the significant advancement and considerable successes of knowledge-based scoring functions in protein structure prediction, current knowledge-based scoring functions often consider only the nonbonded (or noncovalent) potentials that describe long-range electrostatic and van der Waals (VDW) interactions, and ignore the bonded terms for interactions like covalent bonding, for the sake of speed and simplicity. Such scoring functions with nonbonded potentials normally work well in selecting near-native conformations from decoy structures where the bonds and angles have been fully optimized and relaxed [49]. However, in many cases, the non-native decoys may include severely distorted bonds or angles, though their nonbonded atoms are well packed in terms of electrostatic and VDW interactions. In such cases, knowledge-based scoring with nonbonded potentials only would not be able to discriminate the native structures from decoys [50]. The bonded energy terms are especially important when a scoring function is used for an ab initio prediction that involves both sampling and evaluation of protein conformations. Therefore, the bonded energy terms are necessary in both native structure selection and ab initio protein structure prediction. To meet this need, we have developed a composite scoring function that includes both bonded and nonbonded interactions, named ITCPS, in which the bonded terms were derived based on the knowledge-based approach and the nonbonded pair potentials were determined through a statistical mechanics-based iterative method. Orientation-dependent potentials and a desolvation term were also included to consider the directionality of dipolar interactions and the hydrophobic effect of non-polar atoms.
Extensive evaluations of ITCPS on 17 commonly used decoy sets of 750 proteins and the CASP9-13 set of 177 proteins showed the necessity of considering bonded, orientational and hydrophobic energies in the knowledge-based scoring function and the accuracy of our scoring function ITCPS.

2 MATERIALS AND METHODS

Our composite scoring function, ITCPS, consists of four types of interactions:

$$E_{\mathrm{total}} = E_{\mathrm{bonded}} + E_{\mathrm{nonbonded}} + E_{\mathrm{orientation}} + E_{\mathrm{hydrophobic}} \tag{1}$$

where

$$E_{\mathrm{bonded}} = E_{\mathrm{bond}} + E_{\mathrm{angle}} + E_{\mathrm{dihedral}} \tag{2}$$

is the bonded energy associated with bond ($E_{\mathrm{bond}}$), angle ($E_{\mathrm{angle}}$), and dihedral/torsional ($E_{\mathrm{dihedral}}$) interactions; $E_{\mathrm{nonbonded}}$ is the contribution from nonbonded interactions, whose potentials were taken directly from our previous study [46]; $E_{\mathrm{orientation}}$ accounts for the orientation-dependent interactions between two noncovalent atoms; and $E_{\mathrm{hydrophobic}}$ accounts for the hydrophobic effect of nonpolar atoms. The details of calculating the different energies are described as follows.

2.1 Bonded energy $E_{\mathrm{bonded}}$

The bonded interactions $E_{\mathrm{bonded}}$ in our scoring function consist of three components: bond energy $E_{\mathrm{bond}}$, angle energy $E_{\mathrm{angle}}$, and dihedral energy $E_{\mathrm{dihedral}}$, whose potentials were derived from a large training set of experimental structures based on a knowledge-based approach. For simplicity, we only focused on the bonded energies involving the three main-chain atoms in this study, i.e. N, CA, and C, as the backbone of a protein directly determines the topology of its three-dimensional structure. Thus, given the 20 standard amino acids, we have a total of 3 × 20 = 60 atom types in our derived potentials for bonded energies.

2.1.1 Bond potential

The interaction potential for a bond of length $b$ between atoms $i$ and $j$ was calculated as follows:

$$V_{ij}(b) = -k_B T \ln \frac{n^{\mathrm{obs}}_{ij}(b)}{\langle n^{\mathrm{obs}}_{ij}(b) \rangle} \tag{3}$$

where $n^{\mathrm{obs}}_{ij}(b)$ is the number of bonds between atoms $i$ and $j$ within a length bin from $b - \Delta b/2$ to $b + \Delta b/2$ observed in a training set of experimental structures, and $\langle n^{\mathrm{obs}}_{ij}(b) \rangle$ is the arithmetic average over all nonzero $n^{\mathrm{obs}}_{ij}(b)$. The bin size $\Delta b$ for bond length is set to 0.01 Å in this study. $k_B$ is the Boltzmann constant and $T$ is the temperature of the system. Without loss of generality, $k_B T$ was set to unity in this study.
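The inverse Boltzmann conversion of Eq. (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bond_potential`, the dictionary-of-bins layout, and the toy bond lengths are assumptions made for illustration. Only the bin width of 0.01 Å, the averaging over nonzero bins, and the choice $k_B T = 1$ come from the text.

```python
import math
from collections import Counter

def bond_potential(lengths, bin_size=0.01, kT=1.0):
    """Return {bin_index: V} for one atom-pair type from observed bond lengths.

    bin_index k labels the bin centred at k * bin_size (hypothetical layout).
    """
    # Count observations per length bin; empty bins never enter the Counter.
    counts = Counter(round(b / bin_size) for b in lengths)
    # Arithmetic average over the nonzero bins only, as stated for Eq. (3).
    mean_count = sum(counts.values()) / len(counts)
    return {k: -kT * math.log(n / mean_count) for k, n in counts.items()}

# Toy bond lengths in angstroms (hypothetical values, not training data).
lengths = [1.33, 1.33, 1.33, 1.32, 1.34, 1.33]
potential = bond_potential(lengths)
```

In this toy case the most populated bin (1.33 Å) receives a negative potential and the sparsely populated bins receive positive ones, which is the qualitative behaviour expected from Eq. (3).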


2.1.2 Angle potential

Similarly, the interaction potential for a bond angle $\theta$ formed by three covalently linked atoms $i$, $j$, and $k$ can be calculated as follows:

$$V_{ijk}(\theta) = -k_B T \ln \frac{n^{\mathrm{obs}}_{ijk}(\theta)}{\langle n^{\mathrm{obs}}_{ijk}(\theta) \rangle} \tag{4}$$

where $n^{\mathrm{obs}}_{ijk}(\theta)$ is the number of bond angles formed by three consecutive atoms $i$, $j$, and $k$ within an angle bin from $\theta - \Delta\theta/2$ to $\theta + \Delta\theta/2$ observed in a training set of protein structures, and $\langle n^{\mathrm{obs}}_{ijk}(\theta) \rangle$ is the arithmetic average over all nonzero $n^{\mathrm{obs}}_{ijk}(\theta)$. The bin size $\Delta\theta$ for the bond angle is set to 1° in the present study.

2.1.3 Dihedral potential

The interaction potentials for a residue with dihedral/torsional angles $\phi$ and $\psi$ were directly taken from our previous study [51], which were derived using the following formula:

$$V(\phi, \psi \,|\, L, C, R) = -k_B T \ln p(\phi, \psi \,|\, L, C, R) \tag{5}$$

where $p(\phi, \psi \,|\, L, C, R)$ is the general probability distribution function of backbone torsional angles for a residue $C$ with left residue $L$ and right residue $R$. It can be calculated from experimental protein structures in the PDB as follows [51]:

$$p(\phi, \psi \,|\, L, C, R) = \frac{n(\phi, \psi \,|\, L, C, R) \cdot f(\phi, \psi)}{\sum_{\phi=-\pi}^{\pi} \sum_{\psi=-\pi}^{\pi} n(\phi, \psi \,|\, L, C, R)} \tag{6}$$

where $n(\phi, \psi \,|\, L, C, R)$ is the number of the backbone torsional states for three consecutive amino acid residues, $L$, $C$, $R$, with the dihedral angle intervals from $(\phi - \Delta\phi/2, \psi - \Delta\psi/2)$ to $(\phi + \Delta\phi/2, \psi + \Delta\psi/2)$ in the native structures of the training set. The $f(\phi, \psi)$ is a normalization factor and has the following form:

$$f(\phi, \psi) = \frac{1}{\dfrac{N(\phi, \psi \,|\, C)}{\sum_{\phi=-\pi}^{\pi} \sum_{\psi=-\pi}^{\pi} N(\phi, \psi \,|\, C)}} \tag{7}$$


where N (φ, ψ|C) is the total number of backbone torsional states for a residue of type C within the (∆φ, ∆ψ) intervals at the (φ, ψ) dihedral angles. With the potentials Vij (b), Vijk (θ), and V (φ, ψ|L, C, R), the bonded energies for bonds, angles, and dihedral/torsional angles can be then calculated by a sum of the potentials over all possible bonds, bond angles, and dihedral angles on the backbone of a protein.
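A short Python sketch may help make the dihedral formulas concrete. It reads Eq. (7) as the inverse of the background frequency of a $(\phi, \psi)$ bin for residue type $C$, and Eq. (6) as the triplet count weighted by that factor and divided by the total triplet count; this reading, the function name `dihedral_potential`, the nested-list grid layout, and the use of `None` for bins without statistics are all assumptions for illustration, not the published implementation.

```python
import math

def dihedral_potential(n_lcr, n_c, kT=1.0):
    """Sketch of Eqs. (5)-(7) on a (phi, psi) grid.

    n_lcr[i][j]: counts n(phi, psi | L, C, R) for one residue triplet.
    n_c[i][j]:   counts N(phi, psi | C) for the central residue type.
    Returns V[i][j]; bins lacking statistics are returned as None
    (how such bins are treated in practice is an assumption here).
    """
    total_lcr = sum(map(sum, n_lcr))
    total_c = sum(map(sum, n_c))
    potential = []
    for row_lcr, row_c in zip(n_lcr, n_c):
        row = []
        for n, big_n in zip(row_lcr, row_c):
            if n == 0 or big_n == 0:
                row.append(None)
                continue
            f = total_c / big_n            # normalization factor, Eq. (7)
            p = n * f / total_lcr          # probability function, Eq. (6)
            row.append(-kT * math.log(p))  # potential, Eq. (5)
        potential.append(row)
    return potential

# Toy 2 x 2 grids (hypothetical counts).
background = [[10, 30], [40, 20]]
triplet = [[2, 2], [4, 0]]
v_same = dihedral_potential(background, background)
v_toy = dihedral_potential(triplet, background)
```

Under this reading, a triplet whose distribution matches the background of its central residue gets a potential of zero everywhere, so the potential measures how the neighbours $L$ and $R$ shift the torsional preferences of $C$.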

2.2 Nonbonded energy $E_{\mathrm{nonbonded}}$

The nonbonded pair potentials taken from our previous study were used to calculate the nonbonded energy in this study [51]. Specifically, we obtained the nonbonded energy $E_{\mathrm{nonbonded}}$ by summing the interaction potentials over all the pairs of non-covalent atoms in a protein as follows:

$$E_{\mathrm{nonbonded}} = \sum u_{ij}(r) \tag{8}$$

where $u_{ij}(r)$ are a set of effective knowledge-based potentials between two non-covalent atoms $i$ and $j$ at distance $r$, which were derived through a statistical mechanics-based iterative method [46, 51]. The iteration starts from a set of initial potentials as follows:

$$u^{0}_{ij}(r) = -k_B T \ln \frac{\rho^{\mathrm{obs}}_{ij}(r)}{\rho^{\mathrm{bulk}}_{ij}} \tag{9}$$

where $\rho^{\mathrm{obs}}_{ij}(r)$ is the number density of atom pair $ij$ from non-neighboring residues in a spherical shell of radius from $r - \Delta r/2$ to $r + \Delta r/2$ observed in the native structures and $\rho^{\mathrm{bulk}}_{ij}$ is the average number density of atom pair $ij$ in a reference sphere of radius $R_{\max}$. Here, the bin size $\Delta r$ was set to 0.2 Å and the radius of the sphere $R_{\max}$ was set to 15 Å. With the initial potentials $u^{0}_{ij}(r)$, a statistical mechanics-based iteration was conducted to improve the potentials step by step by comparing the predicted pair distribution function $g_{ij}(r)$ and the experimentally observed pair distribution function $g^{\mathrm{obs}}_{ij}(r)$ using the following formula [46]:

$$u^{n+1}_{ij}(r) = u^{n}_{ij}(r) + \frac{1}{2} k_B T \left[ g_{ij}(r) - g^{\mathrm{obs}}_{ij}(r) \right] \tag{10}$$

until the potentials $u^{n}_{ij}(r)$ can discriminate the native structures from decoys in the training set, thus circumventing the calculation of the reference state in traditional knowledge-based scoring functions. The details regarding the iteration algorithm and the calculation of pair distribution functions can be found in our previous study [46].
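The starting point and one update step of the iteration can be sketched in Python as follows. This is a toy illustration only: the function names, the list-per-distance-bin layout, the treatment of empty bins, and the numerical values are assumptions, and the computation of the predicted pair distribution function (which requires scoring decoys with the current potentials, see [46]) is not shown. The sign convention follows the reading that a predicted density above the observed one raises the potential in that bin.

```python
import math

def initial_potential(rho_obs, rho_bulk, kT=1.0):
    """Eq. (9): u0(r) = -kT * ln(rho_obs(r) / rho_bulk), one value per bin."""
    # Bins with no observations get +inf here; that choice is an assumption.
    return [-kT * math.log(x / rho_bulk) if x > 0 else math.inf
            for x in rho_obs]

def update_potential(u, g_pred, g_obs, kT=1.0):
    """Eq. (10): u_{n+1}(r) = u_n(r) + 0.5 * kT * [g_pred(r) - g_obs(r)]."""
    return [ui + 0.5 * kT * (gp - go)
            for ui, gp, go in zip(u, g_pred, g_obs)]

# Toy values for three distance bins (hypothetical, for illustration only).
u0 = initial_potential([0.5, 2.0, 1.0], rho_bulk=1.0)
u1 = update_potential(u0, g_pred=[0.4, 2.4, 1.0], g_obs=[0.5, 2.0, 1.0])
```

In the second bin the predicted distribution overshoots the observed one, so the update makes the potential there less attractive; where the two distributions agree, the potential is left unchanged.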

2.3 Orientation-dependent energy $E_{\mathrm{orientation}}$

To obtain the orientation between two non-covalent atoms A and B, each heavy atom is associated with a plane that is determined by the atom and its two neighboring bonded atoms (Figure 1) [40]. When an atom has more than two bonded heavy atoms, two of its bonded heavy atoms will be selected to create the associated plane. Then, for each atom like A here, one local coordinate system can be created by defining

$$\vec{r}_z = \frac{\vec{r}_2 \times \vec{r}_1}{|\vec{r}_2 \times \vec{r}_1|}, \quad \vec{r}_y = \frac{\vec{r}_1}{|\vec{r}_1|}, \quad \vec{r}_x = \vec{r}_z \times \vec{r}_y \tag{11}$$

as the unit vectors of the z, y, and x axes, respectively, where $\vec{r}_1 = \vec{r}(A_1) - \vec{r}(A)$ and $\vec{r}_2 = \vec{r}(A_2) - \vec{r}(A)$ (Figure 1). With the defined coordinate system for atom A, the orientation of the vector $\vec{r}_{AB}$ from atom A to atom B can be expressed by its polar angles $(\theta_a, \phi_a)$ in the A-coordinate system. Similarly, we can define a local coordinate system for atom B, and the orientation of the vector $\vec{r}_{BA}$ from atom B to atom A can be expressed by its polar angles $(\theta_b, \phi_b)$ in the B-coordinate system. When an atom only has one bonded heavy atom, the local coordinate system for its nearest bonded atom will be used instead. The relative orientation of the two planes is measured by the dihedral/torsional angle $\eta$ between the two z-axes for the A- and B-coordinate systems. With the defined angles, the orientation-dependent potential for two atoms A and B at distance $r_{ab}$ can be calculated as

$$u_{ab}(\theta_a, \phi_a, \theta_b, \phi_b, \eta \,|\, r_{ab}) = -k_B T \ln \frac{n^{\mathrm{obs}}_{ab}(\theta_a, \phi_a, \theta_b, \phi_b, \eta \,|\, r_{ab})}{\langle n^{\mathrm{obs}}_{ab}(\theta_a, \phi_a, \theta_b, \phi_b, \eta \,|\, r_{ab}) \rangle} \tag{12}$$

where $n^{\mathrm{obs}}_{ab}(\mathrm{angles} \,|\, r_{ab})$ is the number of atom pairs A-B with a set of orientational angles at distance $r_{ab}$ in a training set of experimental structures, and $\langle n^{\mathrm{obs}}_{ab}(\mathrm{angles} \,|\, r_{ab}) \rangle$ is the arithmetic average over all nonzero $n^{\mathrm{obs}}_{ab}(\mathrm{angles} \,|\, r_{ab})$. Similar to other studies [40], we also assumed that the dependences of the potential on the different angles $\theta_a$, $\phi_a$, $\theta_b$, $\phi_b$, $\eta$ are independent of each other at a given distance $r_{ab}$ to overcome the problem of insufficient statistics [40]. Thus, the above equation can be approximated as

$$u_{ab}(\theta_a, \phi_a, \theta_b, \phi_b, \eta \,|\, r_{ab}) = u_{ab}(\theta_a, \theta_b \,|\, r_{ab}) + u_{ab}(\phi_a, \phi_b \,|\, r_{ab}) + u_{ab}(\eta \,|\, r_{ab}) \tag{13}$$

where $u_{ab}(\theta_a, \theta_b \,|\, r_{ab})$, $u_{ab}(\phi_a, \phi_b \,|\, r_{ab})$, and $u_{ab}(\eta \,|\, r_{ab})$ can be calculated in a similar way as Eq. (12). During the derivation of orientational potentials, we binned the cos(θ, φ, η)-values with a bin size of 0.25, and used all the 167 heavy atoms of the 20 amino acids. We only considered two interacting atoms in different residues that are five or more residues apart to eliminate the secondary structure effects. A cutoff distance of 6.5 Å was used for the two atoms considered in the angular calculation, as orientation relevance can be neglected at large distances.
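The construction of the local coordinate system and the polar angles can be sketched in Python. The sketch assumes that the z axis is the normal of the plane spanned by $\vec{r}_1$ and $\vec{r}_2$, that the y axis points along $\vec{r}_1$, and that $\theta$ is measured from the z axis with $\phi$ taken in the x-y plane; these conventions, the helper names, and the toy coordinates are assumptions for illustration and may differ from the authors' code.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)

def local_frame(a, a1, a2):
    """Local axes (x, y, z) of atom A from its bonded neighbours A1 and A2."""
    r1, r2 = sub(a1, a), sub(a2, a)
    z = unit(cross(r2, r1))  # assumed: normal of the plane through A, A1, A2
    y = unit(r1)             # assumed: direction of the bond A -> A1
    x = cross(z, y)          # third axis, as in x = z cross y
    return x, y, z

def polar_angles(frame, a, b):
    """Polar angles (theta, phi) of the vector A -> B in the frame of A."""
    x, y, z = frame
    r = unit(sub(b, a))
    theta = math.acos(max(-1.0, min(1.0, dot(r, z))))  # clamp rounding error
    phi = math.atan2(dot(r, y), dot(r, x))
    return theta, phi

# Toy coordinates (hypothetical): A at the origin with two bonded neighbours.
A, A1, A2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
frame = local_frame(A, A1, A2)
theta_z, _ = polar_angles(frame, A, (0.0, 0.0, -2.0))
theta_x, phi_x = polar_angles(frame, A, (0.0, -3.0, 0.0))
```

The same two functions applied to atom B with the vector from B to A would give $(\theta_b, \phi_b)$; the dihedral angle $\eta$ between the two z axes is not computed in this sketch.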

2.4 Hydrophobic effect $E_{\mathrm{hydrophobic}}$

In addition to inter-atomic potentials, the desolvation term is also an important contribution to the free energy of a protein structure. Therefore, we also included the desolvation effect in our scoring function. Specifically, we considered the hydrophobic effect of the nonpolar atom C, as it is known to be the driving force of protein folding [52]. The desolvation energy for the hydrophobic effect was calculated by a sum of residue-based singlet potentials as follows [53]:

$$E_{\mathrm{hydrophobic}} = \sum \sigma_i \cdot SA_i \tag{14}$$

where $SA_i$ is the nonpolar solvent-accessible surface area (SA) for a residue $i$ that was calculated using the NACCESS program [54]. The $\sigma_i$ is the nonpolar desolvation parameter for the residue of type $i$ and can be calculated using a knowledge-based approach as follows:

$$\sigma_i = -k_B T \ln \frac{\sum SA_i}{n_i \cdot SA^{0}_{i}} \tag{15}$$

where $n_i$ is the number of residues with type $i$, and $SA^{0}_{i}$ is the standard nonpolar solvent-accessible surface area (SA) for a residue of type $i$ [54].
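The two desolvation formulas can be illustrated with a small Python sketch. The function names, the `(residue_type, surface_area)` tuple layout, and all numerical values below are hypothetical; in practice the surface areas would come from a program such as NACCESS, which is not called here.

```python
import math
from collections import defaultdict

def desolvation_parameters(observations, standard_sa, kT=1.0):
    """Eq. (15): sigma_i = -kT * ln(sum of SA_i / (n_i * SA0_i)).

    observations: iterable of (residue_type, nonpolar SA) pairs collected
    from training structures; standard_sa maps residue_type to SA0.
    """
    total_sa = defaultdict(float)
    count = defaultdict(int)
    for res_type, sa in observations:
        total_sa[res_type] += sa
        count[res_type] += 1
    return {t: -kT * math.log(total_sa[t] / (count[t] * standard_sa[t]))
            for t in count}

def hydrophobic_energy(residues, sigma):
    """Eq. (14): sum of sigma_i * SA_i over the residues of one structure."""
    return sum(sigma[t] * sa for t, sa in residues)

# Toy data (hypothetical areas in square angstroms).
training = [("LEU", 20.0), ("LEU", 40.0), ("ALA", 50.0)]
standard = {"LEU": 120.0, "ALA": 100.0}
sigma = desolvation_parameters(training, standard)
energy = hydrophobic_energy([("LEU", 10.0), ("ALA", 5.0)], sigma)
```

A residue type that is, on average, strongly buried in the training structures gets a large positive parameter, so exposing its nonpolar surface in a candidate structure is penalized more heavily.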


2.5 Training data set

Similar to our previous study, the potentials for $E_{\mathrm{bonded}}$, $E_{\mathrm{orientation}}$, and $E_{\mathrm{hydrophobic}}$ in this work were derived from a data set generated from the PISCES server [55], using the following criteria: sequence percentage identity ≤ 30%, resolution ≤ 3.0 Å, R-factor ≤ 0.3, sequence length 40-10000, non X-ray entries “excluded”, CA-only entries “excluded”, and cull PDB by chain. We also removed those PDB entries that overlap with our benchmark test sets from the PISCES-generated database to avoid possible bias in the derived statistical potentials. Finally, a set of 16563 diverse protein structures was obtained. All the structures were used to derive the knowledge-based potentials for $E_{\mathrm{bonded}}$ and $E_{\mathrm{orientation}}$, of which 10822 structures with complete side chains were used to construct the potentials for $E_{\mathrm{hydrophobic}}$.

2.6 Test data sets

To evaluate the performance and generality of our integrated scoring function ITCPS, we have tested it on a variety of benchmark data sets taken from published studies. These data sets include the 3DRobot decoy sets [56], AMBER benchmarking decoy set [60], HR decoy set [49], CASP5-8 decoy set [67], Decoy ‘R’ Us set [69], MOULDER [58], ROSETTA [57], and I-TASSER [59] data sets. In addition, we also prepared a realistic decoy set of 177 proteins based on the human and server predictions from the recent CASP9-CASP13 experiments.

3 RESULTS AND DISCUSSION

3.1 Cross-validation

In order to investigate the performance robustness of ITCPS and to avoid over-training, a 5-fold cross-validation experiment has been performed to optimize the scoring function parameters. Namely, we trained our scoring function on 4/5 of the proteins from the full training set, and applied the trained scoring function on the remaining 1/5 of the data set (927 proteins). If the scoring function is robust, the success rates in native structure recognition should not differ significantly between the scoring functions based on the full and the reduced training set. As observed in our validation, this is indeed the case. Our 5-fold cross-validation gave an average success rate of 90.7% on the evaluation set, which is very close to the 90.8% obtained with the full training set, suggesting the general applicability and robustness of ITCPS.
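The partitioning step of a 5-fold cross-validation can be sketched in Python as below. Only the split into five folds is shown; the function name `k_fold_splits`, the fixed random seed, and the use of integer identifiers in place of PDB entries are assumptions for illustration, and the derivation and evaluation of the potentials on each split are not reproduced.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (training, held_out) lists for each of k folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

# Integer stand-ins for the 16563 training structures mentioned in the text.
splits = list(k_fold_splits(range(16563), k=5))
```

Each structure appears in exactly one held-out fold, so every set of potentials is derived from roughly 4/5 of the training structures.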

3.2 Native structure recognition from decoys

3.2.1 Three 3DRobot Decoy Sets

We first tested our scoring function ITCPS on the 3DRobot decoy sets constructed by the Zhang group. Free fragment assembly and simulations have been used to ensure an even distribution and structural diversity in the generated decoys [56]. The local structure features of the native structure were reinforced to eliminate the correlation between the root mean square deviation (RMSD) and the local structural characteristics so as to enhance the difficulty of native structure recognition by trivial potentials in these test sets. In the evaluation, we have used three 3DRobot decoy sets, the Rosetta set (3DR) with 100 decoys for each protein, the Modeller set (3DR) with 200 decoys for each protein, and the I-TASSER set (3DR) with 400 decoys for each protein, which were generated by the 3DRobot program based on 58, 20, and 56 proteins from the original Rosetta [57], Modeller [58], and I-TASSER [59] decoy sets. Due to the improvement of the hydrogen bond network in the decoy conformations, the 3DRobot decoy sets have better packed local structures, which renders the native configurations more challenging to discriminate. Thus, a variety of knowledge-based scoring functions showed a poor performance on the three 3DRobot decoy sets, giving an average success rate of less than 3% in native structure recognition [56]. Table 1 shows the success rates of ITCPS in native structure recognition on the three 3DRobot decoy sets. For reference, the table also lists the corresponding results of ITDA and six other scoring functions on these three sets [51, 56]. It can be seen from the table and figure that ITCPS performed significantly better than the other seven scoring functions and identified 131 native structures of 134 proteins with a success rate of 97.8%, compared to 107 native structures of 134 proteins with a success rate of 79.9% for ITDA, while the other six scoring functions only recognized a few native structures of 134 proteins. Correspondingly, our new scoring function ITCPS gave the highest Z-score of 4.34, compared to 2.47 for ITDA, and 1.15 for DFIER-REF. As ITCPS and ITDA share the same set of torsional potentials and nonbonded iterative knowledge-based pair potentials, the significantly better performance of ITCPS than ITDA suggests the importance of including the bonded and orientation-dependent potentials as well as solvation effects in the scoring function.
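The two evaluation measures quoted above can be sketched in Python. The success rate is simply the percentage of proteins whose native structure is ranked first. For the Z-score, the sketch assumes the commonly used definition, the mean decoy score minus the native score divided by the standard deviation of the decoy scores, with lower scores being better; the exact definition is not stated in this section, so this is an assumption, as are the function names and the toy scores.

```python
import math
import statistics

def native_ranked_first(native_score, decoy_scores):
    """True if the native structure scores lower than every decoy."""
    return all(native_score < s for s in decoy_scores)

def z_score(native_score, decoy_scores):
    """Assumed definition: (mean decoy score - native score) / decoy std."""
    mean = statistics.mean(decoy_scores)
    std = statistics.pstdev(decoy_scores)
    return (mean - native_score) / std

def success_rate(n_success, n_total):
    """Percentage of proteins whose native structure is ranked first."""
    return 100.0 * n_success / n_total

# Counts quoted in the text for the three 3DRobot sets (134 proteins).
rate_itcps = success_rate(131, 134)
rate_itda = success_rate(107, 134)

# Toy scores for one protein (hypothetical values).
toy_decoys = [-4.0, -6.0, -5.0, -5.0]
toy_z = z_score(-10.0, toy_decoys)
```

Rounded to one decimal place, the two rates reproduce the 97.8% and 79.9% reported for ITCPS and ITDA on these sets.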

3.2.2 AMBER benchmarking decoy set

The AMBER benchmarking decoy set constructed by Wroblewska and Skolnick is another challenging set for scoring functions [60]. The set contains a total of 47 proteins with 1040 decoy conformations for each protein. All the structures, both natives and decoys, have been relaxed by a 2 ns MD simulation using the AMBER/GBSA force field. This benchmark set was originally presented to check the effectiveness of the AMBER/GBSA force field in discriminating native protein structures from decoy configurations. Due to the 2 ns simulation with AMBER, all the atoms in the native or decoy conformations form good contacts in terms of electrostatic and VDW interactions. Thus, this test set is challenging for a scoring function to discriminate the native structure from its decoys. Figure 2 shows the success rate of our scoring function ITCPS in discriminating the native structures from their corresponding decoys on the AMBER benchmarking decoy set. For comparison, the figure also lists the corresponding results of nine other scoring functions, ITDA [51], dDFIRE [37], ITScore/Pro [46], OPUS-PSP [38], MODELLER/DOPE [20], PMF, DFIRE 2.0 [61], AMBER/GBSA [60], and ITScore/PP [62]. It can be seen from the figure that ITCPS again performed the best and achieved a success rate of 87.2% in ranking native protein structures as No. 1, compared to 83.0% for ITDA, 57.5% for dDFIRE, 55.3% for ITScore/Pro, 42.6% for OPUS-PSP, 34.0% for MODELLER/DOPE, 31.9% for PMF, 29.8% for DFIRE 2.0, 20.0% for AMBER/GBSA, and 8.5% for ITScore/PP. Figure 3 shows the score versus RMSD relationships of ITCPS and seven other scoring functions for the protein 1bho1, on which ITCPS succeeded in discriminating the native structure from decoys while the other scoring functions failed. It can be seen from the figure that the performance of ITCPS benefits from two aspects. Namely, ITCPS is able not only to discriminate the native structure from decoys in terms of energy score, but also to yield a higher score-RMSD correlation for the decoys (Figure 3). It is also noted that the scoring functions ITCPS, ITDA, ITScore/Pro, and ITScore/PP all contain nonbonded pair potentials derived using the same statistical mechanics-based iteration method. The worst performance of ITScore/PP suggests that the pair potentials of ITScore/PP, which were derived for protein-protein interactions, may not necessarily be efficient for protein structure prediction. In addition, the relative performances of the three scoring functions ITCPS, ITDA, and ITScore/Pro also indicate that energy terms like bonded interactions, orientation-dependent potentials, and hydrophobic effects are important in estimating the free energy of a protein structure.

3.2.3 HR decoy set

The High-Resolution (HR) decoy set constructed by Rajgaria et al. contains 148 non-homologous proteins with 500-1600 HR decoy conformations for each protein target [63]. The majority of the decoy conformations have an RMSD of less than 6-7 Å. Table 2 lists the test results of ITCPS, ITDA, and 12 other scoring functions [46, 49, 51, 62, 64–66]. ITCPS performed very well on this set, with a high success rate of 98.0% and the highest average Z-score of 6.03, compared to 98.7% and 5.61 for ITDA, 98.7% and 4.48 for ITScore/Pro, 98.7% and 4.61 for ITScore/PP, and 96.0% and 4.29 for DFIRE 2.0. The strong performance of ITCPS on this test set suggests the robustness of our new scoring function.
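The Z-score quoted here measures how far the native score lies from the decoy scores in units of the decoy spread. The sketch below assumes the common form Z = (⟨E_decoy⟩ − E_native)/σ_decoy, which is consistent with "more positive is better" in the tables; the exact definition used for ITCPS is given elsewhere in the paper and may differ in detail. The function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(native_score, decoy_scores):
    """Separation of the native from its decoys in units of the decoy spread.

    Assumed form: Z = (<E_decoy> - E_native) / sigma_decoy, so a larger
    positive Z means a better-separated native structure.
    """
    return (mean(decoy_scores) - native_score) / pstdev(decoy_scores)


def average_z_score(targets):
    """Average Z over a list of (native_score, decoy_scores) pairs."""
    return mean([z_score(n, d) for n, d in targets])


# Toy example (hypothetical scores): the decoy mean is -4 and the native is -10.
toy_z = z_score(-10.0, [-2.0, -4.0, -6.0])
```

Because the Z-score is normalized by the decoy spread, it can be averaged across targets whose raw scores are on different scales.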


3.2.4 Decoy ‘R’ Us et al. 11 decoy sets

We further tested ITCPS on the eight decoy sets downloaded from Decoy ‘R’ Us at http://compbio.buffalo.edu/dd/ [69]: 4state_reduced, fisa, fisa_casp3, lmds, lattice_ssfit, hg_structal, ig_structal, and ig_structal_hires. In addition, we used three other widely used decoy sets: the MOULDER set constructed by John and Sali [58] (http://salilab.org/decoys/), the ROSETTA set generated by Baker and coworkers [57] (http://depts.washington.edu/bakerpg/decoys/), and the I-TASSER ab initio decoy set prepared by the Zhang group [59] (http://zhanglab.ccmb.med.umich.edu/). The 11 decoy sets contain a total of 278 proteins, ranging from 4 proteins for the ‘fisa’ set to 61 proteins for the ‘ig_structal’ set (Table 3). Table 3 gives the success rates of ITCPS and six other scoring functions in native structure identification on these 11 sets of 278 proteins. Overall, ITCPS performed the best and recognized the natives for 245 of 278 proteins, compared to 244/278 for ITDA, 226/278 for GOAP, and 128/278 for DFIRE. These results again demonstrate the robustness of our composite scoring function ITCPS.
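The pooled counts quoted for the 11 sets can be checked by summing the per-set entries of Table 3. The sketch below copies the number of targets and the number of natives ranked first for four of the scoring functions from that table; the variable and function names are illustrative only.

```python
# Set order follows Table 3.
SETS = [
    "4state_reduced", "fisa", "fisa_casp3", "lmds", "lattice_ssfit",
    "hg_structal", "ig_structal", "ig_structal_hires",
    "MOULDER", "ROSETTA", "I-TASSER",
]

# Number of targets per set (last column of Table 3).
N_TARGETS = [7, 4, 5, 10, 8, 29, 61, 20, 20, 58, 56]

# Natives ranked first per set, copied from Table 3.
FIRSTS = {
    "DFIRE": [6, 3, 4, 7, 8, 12, 0, 0, 19, 20, 49],
    "GOAP": [7, 3, 5, 7, 8, 22, 47, 18, 19, 45, 45],
    "ITDA": [7, 3, 5, 7, 8, 21, 48, 18, 19, 52, 56],
    "ITCPS": [6, 3, 5, 8, 8, 20, 48, 18, 19, 57, 53],
}


def pooled(firsts_per_set, targets_per_set):
    """Return (natives ranked first, total targets) pooled over all sets."""
    return sum(firsts_per_set), sum(targets_per_set)


pooled_counts = {name: pooled(counts, N_TARGETS) for name, counts in FIRSTS.items()}
pooled_rates = {name: hits / total for name, (hits, total) in pooled_counts.items()}
```

Pooling by protein rather than averaging per-set rates means that the larger sets, such as ig_structal and ROSETTA, carry most of the weight in the overall figure.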

3.2.5 CASP5-8 decoy set

The CASP5-8 decoy set (http://www.fiserlab.org/potentials/casp decoys) consists of 2628 predicted models for 143 protein targets collected from the CASP5-CASP8 experiments [67]. At CASP, participants are invited to submit predicted protein structures for given sequences whose experimental structures are not yet known to them. CASP therefore provides a realistic experiment to assess the performance of conformational sampling algorithms and scoring functions in protein structure prediction. Table 4 shows the performance of our scoring function ITCPS in native structure recognition on this test set. For comparison, the table also lists the corresponding results of 33 other scoring functions [51, 68]. ITCPS again performed the best and identified the native structures for 140 and 143 of 143 proteins within the top 1 and top 5 predictions, respectively, compared to 136/143 and 143/143 for ITDA, 97/143 and 138/143 for KBF, and 110/143 and 136/143 for RW. ITCPS also gave a good Z-score of 1.85, compared to 1.65 for ITDA, 1.92 for KBF, and 1.69 for RW.

3.2.6 CASP9-13 decoy set

In addition to using the public test sets, we also constructed a realistic decoy set of 177 protein targets based on the human and server predictions from the CASP9-CASP13 experiments. Specifically, the decoys of a protein target were selected from the submitted CASP predictions by using the following criteria [67]: (1) The experimental structure for the protein target is available; (2) The predictions for the target should include at least one model with a GDT_TS score of 65.0 or higher compared to the experimental structure; (3) All the models for each target were clustered according to their lengths and the models from the largest cluster were used; (4) The models were binned by their GDT_TS scores with an increment of 1.0 and the model with the highest GDT_TS score was kept as the representative from each bin. This yielded a final decoy set of 6389 models for 177 proteins from the CASP9-CASP13 experiments. Table 5 shows the success rate of ITCPS in discriminating native structures from decoys on the CASP9-13 decoy set of 177 proteins when the top prediction was considered. For comparison, the table also lists the results of four other scoring functions, ITDA, GOAP, DFIRE 2.0, and OPUS-PSP, on this test set. Table 5 shows that ITCPS again achieved the best performance among the tested scoring functions and recognized the native structures for 140 of 177 proteins, yielding a success rate of 79.1%, compared to 76.3% for OPUS-PSP, 70.1% for ITDA, 63.3% for GOAP, and 35.2% for DFIRE 2.0. Comparing the results on this and the other test sets also reveals that the CASP9-13 decoy set is more challenging and led to a lower success rate for all the tested scoring functions. This can be understood because the CASP9-13 decoy set consists of the submitted models from the most recent CASP experiments, which have been carefully built with the most advanced algorithms by various experienced groups around the world.
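Selection criteria (2) to (4) can be sketched as a small filtering routine. This is only an illustration of the described procedure, not the script used to build the set: the dictionary keys `id`, `length`, and `gdt_ts` are hypothetical, and "clustered according to their lengths" is assumed here to mean grouping models of identical length.

```python
from collections import defaultdict
from math import floor


def select_decoys(models, min_best_gdt=65.0, bin_width=1.0):
    """Pick representative decoys for one target.

    `models` is a list of dicts with hypothetical keys 'id', 'length',
    and 'gdt_ts'. Returns one model per GDT_TS bin, or [] if the target
    fails criterion (2).
    """
    # Criterion (2): at least one model with GDT_TS >= 65.0.
    if not models or max(m["gdt_ts"] for m in models) < min_best_gdt:
        return []

    # Criterion (3): keep the largest group of models sharing the same length
    # (assumption: clusters are defined by identical model length).
    by_length = defaultdict(list)
    for m in models:
        by_length[m["length"]].append(m)
    largest = max(by_length.values(), key=len)

    # Criterion (4): bin by GDT_TS and keep the best model in each bin.
    best_in_bin = {}
    for m in largest:
        b = floor(m["gdt_ts"] / bin_width)
        if b not in best_in_bin or m["gdt_ts"] > best_in_bin[b]["gdt_ts"]:
            best_in_bin[b] = m
    return [best_in_bin[b] for b in sorted(best_in_bin)]


# Toy target (hypothetical models).
toy_models = [
    {"id": "a", "length": 100, "gdt_ts": 70.2},
    {"id": "b", "length": 100, "gdt_ts": 70.8},
    {"id": "c", "length": 100, "gdt_ts": 55.1},
    {"id": "d", "length": 90, "gdt_ts": 80.0},
]
toy_selected = [m["id"] for m in select_decoys(toy_models)]
```

Keeping one model per GDT_TS bin spreads the decoys evenly over model quality, so a target with many near-identical submissions does not dominate the set.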


3.3 Comparison between ITCPS and ITDA

To further investigate the performance of our new scoring function ITCPS, we conducted an extensive comparison between ITDA and ITCPS on all 18 decoy sets: the three 3DRobot sets, the AMBER benchmarking set, the HR set, the CASP5-8 set, the eight Decoy ‘R’ Us sets, the MOULDER, ROSETTA, and I-TASSER sets, and the CASP9-13 set. The 18 test sets contain a total of 927 proteins with various types of decoys that were generated by various approaches from different groups. Table 6 lists the test results of ITDA and ITCPS on the 18 decoy sets in terms of both the number of recognized native structures and the average Z-score. The table shows that overall ITCPS performed significantly better than ITDA in native structure recognition. Specifically, ITCPS performed better in both success rate and Z-score for 7 of the 18 test sets, namely ROSETTA, AMBER, CASP5-8, I-TASSER set(3DR), Rosetta set(3DR), Modeller set(3DR), and CASP9-13, whereas ITDA performed better in both measures for only 3 of the 18 test sets, namely 4state_reduced, hg_structal, and I-TASSER. Overall, ITCPS successfully identified the native structures for 842 of the total 927 proteins with a good Z-score of 3.36, compared to 796 of 927 proteins with a Z-score of 3.02 for the second-best ITDA. The significantly better performance of ITCPS than ITDA again confirms the importance of bonded, orientational, and hydrophobic interactions in a scoring function.
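The per-set comparison counts a set for one scoring function only when it is better in both the number of recognized natives and the Z-score. A minimal sketch of that rule follows, using four rows of Table 3 as input. It assumes that "better" means strictly better in both measures, which the text does not state explicitly; the function name is hypothetical.

```python
def head_to_head(a, b):
    """Compare two (natives_recognized, z_score) pairs for one decoy set.

    Returns 'A' if `a` is strictly better in both measures, 'B' if `b` is,
    and 'mixed' otherwise (assumption: ties count for neither side).
    """
    if a[0] > b[0] and a[1] > b[1]:
        return "A"
    if b[0] > a[0] and b[1] > a[1]:
        return "B"
    return "mixed"


# (ITCPS, ITDA) entries copied from Table 3 for four of the 11 sets.
ROWS = {
    "ROSETTA": ((57, 3.36), (52, 2.86)),
    "I-TASSER": ((53, 3.99), (56, 6.63)),
    "lmds": ((8, 4.64), (7, 5.17)),
    "ig_structal": ((48, 2.03), (48, 1.66)),
}

outcome = {name: head_to_head(itcps, itda) for name, (itcps, itda) in ROWS.items()}
```

Under this rule ROSETTA counts for ITCPS and I-TASSER for ITDA, in line with the sets named in the text, while lmds and ig_structal count for neither because the two measures disagree or tie.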

3.4 Near-native structure recognition among decoys

As a basic requirement, a good scoring function should be able to discriminate the native structure from decoys. This is the most straightforward criterion and also the primary assessment of a scoring function, as done in the previous sections. In realistic applications, however, the native structure is unknown, so a useful scoring function should also be able to recognize near-native structures among decoys. Therefore, we further checked the capability of our scoring function ITCPS in distinguishing near-native structures among decoys.


For a reasonable assessment, the selected test set should possess several properties. First, the decoys in the test set should include some near-native structures. Second, the decoys should reflect some of the key features of native structures, such as hydrogen bonding and compactness, so as to increase the difficulty of the decoy set. Third, the decoys should be structurally diverse with a wide range of RMSDs from the native structure. Among the present test sets, the 3DRobot decoy set, which meets the three requirements, was used to test the performance of ITCPS in near-native structure recognition among decoys. Table 7 gives the average rank and the success rates for the top 1 and top 5 predictions of ITCPS in distinguishing the best near-native structure (i.e., the one with the smallest RMSD) among decoys on the 3DRobot test set of 134 proteins. For comparison, the table also lists the corresponding results of six other scoring functions. Table 7 shows that ITCPS still performed the best in recognizing near-native structures and yielded the lowest average rank of 6.51 for the best near-native structures, compared to 7.25 for ITDA, 9.60 for GOAP, 9.75 for OPUS-PSP, 9.87 for RW, 10.03 for DFIRE 2.0, and 10.16 for RWplus. Correspondingly, ITCPS also obtained the highest success rates of 52.2% and 82.8% for the top 1 and top 5 predictions, followed by 47.0% and 73.9% for RWplus, 42.5% and 73.1% for RW, and 42.5% and 71.6% for OPUS-PSP. Interestingly, GOAP had the lowest success rates of 36.6% and 67.9% for the top 1 and top 5 predictions, respectively. These results demonstrate the strong ability of ITCPS in distinguishing near-native structures among decoys.
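The near-native assessment can be sketched as follows: for each target, find the decoy with the smallest RMSD, determine its rank when all decoys are sorted by score, and then summarize the ranks over the targets. The sketch assumes a lower score is better and that ties in score do not worsen the rank; the function names and toy decoys are hypothetical.

```python
def rank_of_best_decoy(decoys):
    """Rank (1-based, by score) of the decoy with the smallest RMSD.

    `decoys` is a list of (score, rmsd) pairs for one target.
    """
    best_score, _ = min(decoys, key=lambda d: d[1])
    # Count the decoys that score strictly better than the best near-native one.
    return 1 + sum(1 for score, _ in decoys if score < best_score)


def summarize(targets, top_n=(1, 5)):
    """Return (average rank, {n: top-n success rate}) over all targets."""
    ranks = [rank_of_best_decoy(t) for t in targets]
    avg_rank = sum(ranks) / len(ranks)
    rates = {n: sum(r <= n for r in ranks) / len(ranks) for n in top_n}
    return avg_rank, rates


# Toy data (hypothetical): the best decoy is ranked 2nd for the first target
# and 1st for the second target.
toy_targets = [
    [(-5.0, 1.0), (-7.0, 3.0), (-1.0, 6.0)],
    [(-9.0, 0.8), (-3.0, 2.5)],
]
toy_avg_rank, toy_rates = summarize(toy_targets)
```

A lower average rank and higher top-1 and top-5 rates both indicate better near-native recognition, which is why Table 7 reports the three numbers together.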

3.5 Roles of different potentials

As ITCPS is a composite scoring function that integrates bonded and nonbonded potentials, including our iterative knowledge-based pairwise potentials, with several other energy components, an important question is to what extent the different energy components contribute to the performance of ITCPS in discriminating native structures from decoys. To answer this question, we evaluated the performances of five sub-scoring functions obtained by combining the nonbonded pairwise potentials with other energy components as follows:

\begin{equation}
\begin{aligned}
E_0 &= E_{\mathrm{nonbonded}} \\
E_1 &= E_{\mathrm{nonbonded}} + E_{\mathrm{hydrophobic}} \\
E_2 &= E_{\mathrm{nonbonded}} + E_{\mathrm{bond}} + E_{\mathrm{angle}} \\
E_3 &= E_{\mathrm{nonbonded}} + E_{\mathrm{dihedral}} \\
E_4 &= E_{\mathrm{nonbonded}} + E_{\mathrm{orientation}}
\end{aligned}
\tag{16}
\end{equation}

We calculated the success rates of the five sub-scoring functions by 5-fold cross-validation on the data set of 927 proteins; the results are shown in Figure 4. Compared to the success rate of 82.8% for the original nonbonded pair potentials $E_0$, the scoring function $E_4$ obtained the largest improvement, with a success rate of 87.7%, followed by 86.5% for $E_3$, 84.8% for $E_2$, and 83.9% for $E_1$. These findings highlight the importance of including orientational potentials in a scoring function. Interestingly, despite the critical importance of hydrophobic interactions in protein folding, inclusion of hydrophobic effects did not yield a significant improvement. One possible reason is that the hydrophobic contributions of the native structure and of some decoys largely cancel each other because their surfaces are similar in both size and composition.
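The composition of the five sub-scoring functions in Eq. 16 can be written down as a small lookup table. The sketch below uses plain unweighted sums exactly as the equation is printed; any relative weighting of the terms and the 5-fold cross-validation protocol are not reproduced here. The term names and toy energies are hypothetical.

```python
# Energy terms entering each sub-scoring function, following Eq. 16.
SUB_FUNCTIONS = {
    "E0": ("nonbonded",),
    "E1": ("nonbonded", "hydrophobic"),
    "E2": ("nonbonded", "bond", "angle"),
    "E3": ("nonbonded", "dihedral"),
    "E4": ("nonbonded", "orientation"),
}


def sub_score(terms, name):
    """Sum the energy terms of one structure that belong to sub-function `name`.

    `terms` maps a term name to its precomputed energy for that structure.
    """
    return sum(terms[t] for t in SUB_FUNCTIONS[name])


# Toy per-term energies for one structure (hypothetical values).
toy_terms = {
    "nonbonded": -10.0,
    "hydrophobic": -1.0,
    "bond": 0.5,
    "angle": 0.25,
    "dihedral": -2.0,
    "orientation": -3.0,
}
toy_scores = {name: sub_score(toy_terms, name) for name in SUB_FUNCTIONS}
```

Because every sub-function shares the nonbonded term, differences in success rate between $E_1$ to $E_4$ and $E_0$ can be attributed to the single added component.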

4 CONCLUSION

We have developed a composite knowledge-based scoring function for protein structure prediction, named ITCPS, by integrating the bonded potentials for covalent bonds, bond angles, and dihedral angles, the nonbonded potentials for long-range interactions like electrostatic and VDW interactions, the orientation-dependent potentials for directional interactions like hydrogen bonding and dipolar effects, and the desolvation energy for hydrophobic interactions. ITCPS was extensively tested on 17 publicly available decoy sets consisting of 750 proteins and one combined decoy set of 177 proteins from the CASP9-CASP13 experiments. ITCPS achieved the best performance among the 52 tested scoring functions and performed well on all 18 test sets in discriminating native structures from decoys. Considering all the decoy sets of 927 proteins, ITCPS obtained a high success rate of 90.8%, compared to 85.8% for the second-best scoring function ITDA. ITCPS is especially successful on challenging sets like 3DRobot and CASP9-13, on which it performed significantly better than the other 51 scoring functions. In addition to native structure recognition, ITCPS also performed well in distinguishing the best near-native structure among decoys and obtained a significantly higher success rate than six other scoring functions on the 3DRobot decoy set. The significantly better performance of ITCPS than other scoring functions suggests the importance of including bonded interactions, orientation-dependent potentials, and hydrophobic effects in scoring functions for protein structure prediction.

Competing interests

The authors declare that they have no competing interests.

Author contributions

S.-Y.H. initiated and supervised the project. X.W. and S.-Y.H. developed the model. X.W. performed the computations. All authors interpreted the results and wrote the manuscript.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (grant No. 31670724), the National Key R&D Program of China (grant Nos. 2016YFC1305800 and 2016YFC1305805), the Thousand Youth Talents Plan of China, and the startup grant of Huazhong University of Science and Technology.


References

1. Simons, K. T.; Kooperberg, C.; Huang, E.; Baker, D. Assembly of Protein Tertiary Structures from Fragments with Similar Local Sequences Using Simulated Annealing and Bayesian Scoring Functions. J. Mol. Biol. 1997, 268, 209-225.
2. Zhang, Y.; Arakaki, A. K.; Skolnick, J. TASSER: An Automated Method for the Prediction of Protein Tertiary Structures in CASP6. Proteins 2005, 61(S7), 91-98.
3. Xia, Y.; Huang, E. S.; Levitt, M.; Samudrala, R. Ab Initio Construction of Protein Tertiary Structures Using a Hierarchical Approach. J. Mol. Biol. 2000, 300, 171-185.
4. Shmygelska, A.; Levitt, M. Generalized Ensemble Methods for De Novo Structure Prediction. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 1415-1420.
5. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235-242.
6. Skolnick, J.; Fetrow, J. S.; Kolinski, A. Structural Genomics and Its Importance for Gene Function Analysis. Nat. Biotechnol. 2000, 18, 283-287.
7. Baker, D. Protein Structure Prediction and Structural Genomics. Science 2001, 294, 93-96.
8. Zhang, Y. Progress and Challenges in Protein Structure Prediction. Curr. Opin. Struct. Biol. 2008, 18, 342-348.
9. Anfinsen, C. B. Principles That Govern Folding of Protein Chains. Science 1973, 181, 223-230.
10. Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A Program for Macromolecular Energy Minimization and Dynamic Calculations. J. Comput. Chem. 1983, 4, 187-217.
11. Lazaridis, T.; Karplus, M. Discrimination of the Native from Misfolded Protein Models with an Energy Function Including Implicit Solvation. J. Mol. Biol. 1998, 288, 477-487.
12. Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. Semi Analytical Treatment of Solvation for Molecular Mechanics and Dynamics. J. Am. Chem. Soc. 1990, 112, 6127-6129.
13. Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and Testing of a General Amber Force Field. J. Comput. Chem. 2004, 25, 1157-1174.
14. Case, D. A.; Cheatham, T. E., 3rd; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M., Jr.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. The Amber Biomolecular Simulation Programs. J. Comput. Chem. 2005, 26, 1668-1688.


15. Liwo, A.; Arłukowicz, P.; Czaplewski, C.; Ołdziej, S.; Pillardy, J.; Scheraga, H. A. A Method for Optimizing Potential-Energy Functions by a Hierarchical Design of the Potential-Energy Landscape: Application to The UNRES Force Field. Proc. Natl. Acad. Sci. U. S. A. 2002, 99, 1937-1942.
16. Thomas, P. D.; Dill, K. A. Statistical Potentials Extracted from Protein Structures: How Accurate Are They? J. Mol. Biol. 1996, 257, 457-469.
17. McQuarrie, D. A. Statistical Mechanics; University Science Books, 2000.
18. Poole, A. M.; Ranganathan, R. Knowledge-Based Potentials in Protein Design. Curr. Opin. Struct. Biol. 2006, 16, 508-513.
19. Zhou, Y.; Zhou, H.; Zhang, C.; Liu, S. What Is a Desirable Statistical Energy Function for Proteins and How Can It Be Obtained? Cell Biochem. Biophys. 2006, 46, 165-174.
20. Shen, M.-Y.; Sali, A. Statistical Potential for Assessment and Prediction of Protein Structures. Protein Sci. 2006, 15, 2507-2524.
21. Benkert, P.; Tosatto, S. C.; Schomburg, D. QMEAN: A Comprehensive Scoring Function for Model Quality Assessment. Proteins 2008, 71, 261-277.
22. Koppensteiner, W. A.; Sippl, M. J. Knowledge-Based Potentials – Back to the Roots. Biochemistry (Moscow) 1998, 63, 247-252.
23. Skolnick, J. In Quest of an Empirical Potential for Protein Structure Prediction. Curr. Opin. Struct. Biol. 2006, 16, 166-171.
24. Li, X.; Liang, J. Knowledge-Based Energy Functions for Computational Studies of Proteins. In Computational Methods for Protein Structure Prediction and Modeling; Xu, Y., Xu, D., Liang, J., Eds.; Springer, 2006; Vol. 1, pp 71-124.
25. Zhu, J.; Fan, H.; Periole, X.; Honig, B.; Mark, A. E. Refining Homology Models by Combining Replica-Exchange Molecular Dynamics and Statistical Potentials. Proteins 2008, 72, 1171-1188.
26. Huang, S.-Y.; Zou, X. Mean-Force Scoring Functions for Protein-Ligand Binding. Annu. Rep. Comput. Chem. 2010, 6, 281-296.
27. Tanaka, S.; Scheraga, H. A. Medium- and Long-Range Interaction Parameters between Amino Acids for Predicting Three-Dimensional Structures of Proteins. Macromolecules 1976, 9, 945-950.
28. Jacobson, M.; Sali, A. Comparative Protein Structure Modeling and Its Applications to Drug Discovery. In Annual Reports in Medicinal Chemistry; Overington, J., Ed.; Inpharmatica Ltd.: London, 2004; Vol. 39, pp 259-276.
29. Ginalski, K.; Grishin, N. V.; Godzik, A.; Rychlewski, L. Practical Lessons from Protein Structure Prediction. Nucleic Acids Res. 2005, 33, 1874-1891.
30. Zhou, H.; Skolnick, J. Protein Structure Prediction by Pro-Sp3-TASSER. Biophys. J. 2009, 96, 2119-2127.


31. Petrey, D.; Honig, B. Protein Structure Prediction: Inroads to Biology. Mol. Cell 2005, 20, 811-819.
32. Lazaridis, T.; Karplus, M. Effective Energy Functions for Protein Structure Prediction. Curr. Opin. Struct. Biol. 2000, 10, 139-145.
33. Buchete, N. V.; Straub, J. E.; Thirumalai, D. Development of Novel Statistical Potentials for Protein Fold Recognition. Curr. Opin. Struct. Biol. 2004, 14, 225-232.
34. Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: I. Derivation of Interaction Potentials. J. Comput. Chem. 2006, 27, 1865-1875.
35. Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: II. Validation of The Scoring Function. J. Comput. Chem. 2006, 27, 1876-1882.
36. Zhou, H.; Zhou, Y. Distance-Scaled, Finite Ideal-Gas Reference State Improves Structure-Derived Potentials of Mean Force for Structure Selection and Stability Prediction. Protein Sci. 2002, 11, 2714-2726.
37. Yang, Y.; Zhou, Y. Specific Interactions for Ab Initio Folding of Protein Terminal Regions with Secondary Structures. Proteins 2008, 72, 793-803.
38. Lu, M.; Dousis, A. D.; Ma, J. OPUS-PSP: An Orientation-Dependent Statistical All-Atom Potential Derived from Side-Chain Packing. J. Mol. Biol. 2008, 376, 288-301.
39. Zhang, J.; Zhang, Y. A Novel Side-Chain Orientation Dependent Potential Derived from Random-Walk Reference State for Protein Fold Selection and Structure Prediction. PLoS ONE 2010, 5, e15386.
40. Zhou, H.; Skolnick, J. GOAP: A Generalized Orientation-Dependent, All-Atom Statistical Potential for Protein Structure Prediction. Biophys. J. 2011, 101, 2043-2052.
41. Anishchenko, I.; Kundrotas, P. J.; Vakser, I. A. Contact Potential for Structure Prediction of Proteins and Protein Complexes from Potts Model. Biophys. J. 2018, 115, 809-821.
42. Hoque, M. T.; Yang, Y.; Mishra, A.; Zhou, Y. SDFIRE: Sequence-Specific Statistical Energy Function for Protein Structure Prediction by Decoy Selections. J. Comput. Chem. 2016, 37, 1119-1124.
43. Xu, G.; Ma, T.; Zang, T.; Wang, Q.; Ma, J. OPUS-CSF: A C-Atom-Based Scoring Function for Ranking Protein Structural Models. Protein Sci. 2018, 27, 286-292.
44. Xu, G.; Ma, T.; Zang, T.; Sun, W.; Wang, Q.; Ma, J. OPUS-DOSP: A Distance- and Orientation-Dependent All-Atom Potential Derived from Side-Chain Packing. J. Mol. Biol. 2017, 429, 3113-3120.
45. Wang, X.; Huang, S.-Y. Optimizing the Atom Types of Proteins through Iterative Knowledge-based Potentials. Chin. Phys. B 2018, 27, 020503.
46. Huang, S.-Y.; Zou, X. Statistical Mechanics-Based Method to Extract Atomic Distance-Dependent Potentials from Protein Structures. Proteins 2011, 79, 2648-2661.


47. Park, J.; Saitou, K. ROTAS: A Rotamer-Dependent, Atomic Statistical Potential for Assessment and Prediction of Protein Structures. BMC Bioinformatics 2014, 15, 307.
48. Chu, H.; Liu, H. TetraBASE: A Side Chain-Independent Statistical Energy for Designing Realistically Packed Protein Backbones. J. Chem. Inf. Model. 2018, 58, 430-442.
49. Rajgaria, R.; McAllister, S. R.; Floudas, C. A. Development of a Novel High Resolution Cα-Cα Distance Dependent Force Field Using a High Quality Decoy Set. Proteins 2006, 65, 726-741.
50. Handl, J.; Knowles, J.; Lovell, S. C. Artefacts and Biases Affecting the Evaluation of Scoring Functions on Decoy Sets for Protein Structure Prediction. Bioinformatics 2009, 25, 1271-1279.
51. Wang, X.; Zhang, D.; Huang, S.-Y. New Knowledge-Based Scoring Function with Inclusion of Backbone Conformational Entropies from Protein Structures. J. Chem. Inf. Model. 2018, 58, 724-732.
52. Mirzaie, M. Hydrophobic Residues Can Identify Native Protein Structures. Proteins 2018, 86, 467-474.
53. Eisenberg, D.; McLachlan, A. D. Solvation Energy in Protein Folding and Binding. Nature 1986, 319, 199-203.
54. Hubbard, S. J.; Thornton, J. M. ‘NACCESS’, Computer Program, Department of Biochemistry and Molecular Biology, University College London, 1993.
55. Wang, G.; Dunbrack, R. L., Jr. PISCES: A Protein Sequence Culling Server. Bioinformatics 2003, 19, 1589-1591.
56. Deng, H.; Jia, Y.; Zhang, Y. 3DRobot: Automated Generation of Diverse and Well-Packed Protein Structure Decoys. Bioinformatics 2016, 32, 378-387.
57. Qian, B.; Raman, S.; Das, R.; Bradley, P.; McCoy, A. J.; Read, R. J.; Baker, D. High-Resolution Structure Prediction and the Crystallographic Phase Problem. Nature 2007, 450, 259-264.
58. John, B.; Sali, A. Comparative Protein Structure Modeling by Iterative Alignment, Model Building and Model Assessment. Nucleic Acids Res. 2003, 31, 3982-3992.
59. Wu, S.; Skolnick, J.; Zhang, Y. Ab Initio Modeling of Small Proteins by Iterative TASSER Simulations. BMC Biol. 2007, 5, 17.
60. Wroblewska, L.; Skolnick, J. Can a Physics-Based, All-Atom Potential Find a Protein’s Native Structure among Misfolded Structures? I. Large Scale AMBER Benchmarking. J. Comput. Chem. 2007, 28, 2059-2066.
61. Yang, Y.; Zhou, Y. Ab Initio Folding of Terminal Segments with Secondary Structures Reveals the Fine Difference Between Two Closely Related All-Atom Statistical Energy Functions. Protein Sci. 2008, 17, 1212-1219.


62. Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function for Protein-Protein Recognition. Proteins 2008, 72, 557-579.
63. Rajgaria, R.; McAllister, S. R.; Floudas, C. A. Distance Dependent Centroid to Centroid Force Fields Using High Resolution Decoys. Proteins 2008, 70, 950-970.
64. Tobi, D.; Elber, R. Distance-Dependent, Pair Potential for Protein Folding: Results from Linear Optimization. Proteins 2000, 41, 40-46.
65. Hinds, D. A.; Levitt, M. Exploring Conformational Space with a Simple Lattice Model for Protein Structure. J. Mol. Biol. 1994, 243, 668-682.
66. Loose, C.; Klepeis, J. L.; Floudas, C. A. A New Pairwise Folding Potential Based on Improved Decoy Generation and Side-Chain Packing. Proteins 2004, 54, 303-314.
67. Rykunov, D.; Fiser, A. New Statistical Potential for Quality Assessment of Protein Models and a Survey of Energy Functions. BMC Bioinformatics 2010, 11, 128.
68. Sankar, K.; Jia, K.; Jernigan, R. L. Knowledge-Based Entropies Improve the Identification of Native Protein Structures. Proc. Natl. Acad. Sci. U. S. A. 2017, 114, 2928-2933.
69. Samudrala, R.; Levitt, M. Decoys ‘R’ Us: A Database of Incorrect Conformations to Improve Protein Structure Prediction. Protein Sci. 2000, 9, 1399-1401.


Figure Captions:

Figure 1: Definition of the orientation angles $\theta_a$, $\phi_a$, $\theta_b$, $\phi_b$, and $\eta$ between two non-covalent interacting heavy atoms A and B that are associated with two planes A1-A-A2 and B1-B-B2, where $\theta_a$ and $\phi_a$ are the polar angles of $\vec{r}_{AB}$ in the local A-$xyz$ coordinate system, $\theta_b$ and $\phi_b$ are the polar angles of $\vec{r}_{BA}$ in the local B-$x'y'z'$ coordinate system, and $\eta$ is the torsional angle of A1-A-B-B1.

Figure 2: Success rates of ITCPS and nine other scoring functions, ITDA, dDFIRE, ITScore/Pro, OPUS-PSP, MODELLER/DOPE, PMF, DFIRE 2.0, AMBER/GBSA, and ITScore/PP in discriminating native structures from decoys on the AMBER benchmarking set constructed by the Skolnick group [60]. The results for the scoring functions other than ITCPS and OPUS-PSP were taken from the literature [51].

Figure 3: The score-RMSD scatter plots for ITCPS and seven other scoring functions on the decoys of protein 1bho1 from the AMBER benchmarking set.

Figure 4: Success rates of the five sub-scoring functions calculated by 5-fold cross-validation based on the data set of 927 proteins, where $E_0 = E_{\mathrm{nonbonded}}$, $E_1 = E_0 + E_{\mathrm{hydrophobic}}$, $E_2 = E_0 + E_{\mathrm{bond}} + E_{\mathrm{angle}}$, $E_3 = E_0 + E_{\mathrm{dihedral}}$, and $E_4 = E_0 + E_{\mathrm{orientation}}$, respectively.


Table 1: Performances of ITCPS and seven other knowledge-based scoring functions in recognizing native structures and average Z-score on three 3DRobot decoy sets of 134 proteins.

| Function | Decoy set | No. of firsts^a | Avg Z-score |
|---|---|---|---|
| ITCPS | Rosetta set(3DR) | 58/58 | 4.36 |
| | Modeller set(3DR) | 20/20 | 4.04 |
| | I-TASSER set(3DR) | 53/56 | 4.41 |
| | Average | 131/134 | 4.34 |
| ITDA^b | Rosetta set(3DR) | 53/58 | 2.61 |
| | Modeller set(3DR) | 15/20 | 2.28 |
| | I-TASSER set(3DR) | 39/56 | 2.40 |
| | Average | 107/134 | 2.47 |
| RAPDF-REF^c | Rosetta set(3DR) | 0/41 | 0.94 |
| | Modeller set(3DR) | 1/20 | 1.41 |
| | I-TASSER set(3DR) | 0/56 | 1.67 |
| | Average | 1/117 | 1.34 |
| KBP-REF^c | Rosetta set(3DR) | 1/41 | 1.12 |
| | Modeller set(3DR) | 1/20 | 1.15 |
| | I-TASSER set(3DR) | 3/56 | 1.21 |
| | Average | 5/117 | 1.16 |
| DFIRE-REF^c | Rosetta set(3DR) | 0/41 | 1.06 |
| | Modeller set(3DR) | 3/20 | 1.13 |
| | I-TASSER set(3DR) | 0/56 | 1.25 |
| | Average | 3/117 | 1.15 |
| Dope-REF^c | Rosetta set(3DR) | 0/41 | 1.04 |
| | Modeller set(3DR) | 2/20 | 1.47 |
| | I-TASSER set(3DR) | 0/56 | 1.82 |
| | Average | 2/117 | 1.44 |
| SRS-REF^c | Rosetta set(3DR) | 0/41 | 0.96 |
| | Modeller set(3DR) | 1/20 | 1.40 |
| | I-TASSER set(3DR) | 0/56 | 1.67 |
| | Average | 1/117 | 1.34 |
| RW-REF^c | Rosetta set(3DR) | 0/41 | 1.07 |
| | Modeller set(3DR) | 2/20 | 1.16 |
| | I-TASSER set(3DR) | 0/56 | 1.26 |
| | Average | 2/117 | 1.16 |

a Number of proteins with their native structures ranked as first versus the total number of tested proteins.
b Results taken from literature [51].
c Results taken from literature [56].


Table 2: Summary of the test results of ITCPS and 13 other scoring functions on the high resolution (HR) decoy set prepared by Rajgaria et al. [49]. The results for the 13 scoring functions other than ITCPS were taken from previous studies [51, 63].

| Scoring function | Avg rank^a | No. of firsts^b | Avg Z-score^c | Avg CC^d | Avg Cα-rmsd^e (Å) | Avg Cα-rmsd^f (Å) |
|---|---|---|---|---|---|---|
| ITCPS | 1.22 | 145/148 (98.0%) | 6.03 | 0.77 | 0.058 | 1.68 |
| ITDA [51] | 1.01 | 146/148 (98.7%) | 5.61 | 0.77 | 0.037 | 1.72 |
| ITScore/Pro [46] | 1.05 | 146/148 (98.7%) | 4.48 | 0.82 | 0.028 | 1.64 |
| ITScore/PP [62] | 1.09 | 146/148 (98.7%) | 4.61 | 0.76 | 0.041 | 2.02 |
| DFIRE 2.0 [61] | 8.59 | 142/148 (96.0%) | 4.29 | 0.81 | 0.095 | 1.65 |
| dDFIRE [37] | 9.26 | 140/148 (94.6%) | 6.02 | 0.72 | 0.117 | 1.64 |
| DOPE [20] | 18.18 | 134/148 (90.5%) | 4.76 | 0.72 | 0.201 | 1.68 |
| 6bin-HRSC [63] | 2.49 | 128/148 (86.5%) | 3.62 | 0.70 | 0.298 | 1.82 |
| 7bin-HRSC [63] | 2.01 | 125/148 (84.5%) | 3.39 | 0.70 | 0.321 | 1.83 |
| PMF^g | 48.47 | 112/148 (75.7%) | 3.30 | 0.40 | 0.60 | 1.86 |
| HR [49] | 1.87 | 113/150 (75.3%) | 2.11 | 0.80 | 0.451 | 1.76 |
| TE13 [64] | 19.94 | 92/148 (62.2%) | 3.15 | 0.63 | 0.813 | 1.89 |
| HL [65] | 44.93 | 70/150 (46.7%) | 2.34 | 0.59 | 1.092 | 1.84 |
| LKF [66] | 39.45 | 17/150 (11.3%) | 1.55 | 0.52 | 1.721 | 1.93 |

a The average rank of the native conformations. The best rank is 1.
b The number of the proteins with native structures as rank #1 in terms of the calculated energy scores.
c The average Z-score for all the tested proteins, measuring the relative energetic separation of the native structure of a protein with respect to its decoys.
d The average Pearson correlation coefficients (CC) between the energy scores and the RMSDs of decoys.
e The average RMSD of the best predicted structures with the lowest energy scores (native structures included).
f The average RMSD of the best predicted structures with the lowest energy scores with native structures excluded.
g PMF is a knowledge-based scoring function derived with an atom-randomized reference state.


Table 3: Performances of ITCPS and six other scoring functions in native structure recognition and their average Z-scores of natives on 11 test sets from Decoy ‘R’ Us and other sources, where the results for the scoring functions other than ITCPS were taken from the literature [40, 51].

| Decoy sets | DFIRE | RWplus | dDFIRE | OPUS-PSP | GOAP | ITDA | ITCPS | No. of targets |
|---|---|---|---|---|---|---|---|---|
| 4state_reduced | 6(3.48) | 6(3.51) | 7(4.15) | 7(4.49) | 7(4.38) | 7(6.11) | 6(4.19) | 7 |
| fisa | 3(4.87) | 3(4.79) | 3(3.80) | 3(4.24) | 3(3.97) | 3(4.00) | 3(4.40) | 4 |
| fisa_casp3 | 4(4.80) | 4(5.17) | 4(4.83) | 5(6.33) | 5(5.27) | 5(5.08) | 5(5.60) | 5 |
| lmds | 7(0.88) | 7(1.03) | 6(2.44) | 8(5.63) | 7(4.07) | 7(5.17) | 8(4.64) | 10 |
| lattice_ssfit | 8(9.44) | 8(8.85) | 8(10.12) | 8(6.75) | 8(8.38) | 8(5.88) | 8(6.87) | 8 |
| hg_structal | 12(1.97) | 12(1.74) | 16(1.33) | 18(1.87) | 22(2.73) | 21(1.86) | 20(1.80) | 29 |
| ig_structal | 0(0.92) | 0(1.11) | 26(1.02) | 20(0.69) | 47(1.62) | 48(1.66) | 48(2.03) | 61 |
| ig_structal_hires | 0(0.17) | 0(0.32) | 16(2.05) | 14(0.77) | 18(2.35) | 18(2.06) | 18(2.29) | 20 |
| MOULDER | 19(2.97) | 19(2.84) | 18(2.74) | 19(4.84) | 19(3.58) | 19(3.12) | 19(3.48) | 20 |
| ROSETTA | 20(1.82) | 20(1.47) | 12(0.83) | 39(3.00) | 45(3.70) | 52(2.86) | 57(3.36) | 58 |
| I-TASSER | 49(4.02) | 56(5.77) | 48(5.03) | 55(7.43) | 45(5.36) | 56(6.63) | 53(3.99) | 56 |
| No. total (Z-score) | 128(1.94) | 135(2.13) | 164(2.52) | 196(2.86) | 226(3.57) | 244(3.52) | 245(3.19) | 278 |

The number of proteins whose native structure is ranked first is listed outside the parentheses. The average Z-scores of the native configurations are listed in parentheses; more positive is better. The numbers in bold indicate the best performance.
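The "No. total (Z-score)" row of Table 3 can be reproduced from the per-set entries. The sketch below does this for the ITCPS column. Weighting each set's average Z-score by its number of targets is an inference from the reported numbers rather than something the text states, and the variable and function names are hypothetical.

```python
# ITCPS entries of Table 3: decoy set -> (natives ranked first, avg Z-score, targets)
ITCPS_TABLE3 = {
    "4state_reduced": (6, 4.19, 7),
    "fisa": (3, 4.40, 4),
    "fisa_casp3": (5, 5.60, 5),
    "lmds": (8, 4.64, 10),
    "lattice_ssfit": (8, 6.87, 8),
    "hg_structal": (20, 1.80, 29),
    "ig_structal": (48, 2.03, 61),
    "ig_structal_hires": (18, 2.29, 20),
    "MOULDER": (19, 3.48, 20),
    "ROSETTA": (57, 3.36, 58),
    "I-TASSER": (53, 3.99, 56),
}

def totals(rows):
    """Return (total firsts, total targets, target-weighted mean Z-score)."""
    firsts = sum(f for f, _, _ in rows.values())
    targets = sum(n for _, _, n in rows.values())
    # Assumption: the overall Z-score is the mean over all targets, i.e. each
    # set's average is weighted by how many targets that set contains.
    weighted_z = sum(z * n for _, z, n in rows.values()) / targets
    return firsts, targets, round(weighted_z, 2)

print(totals(ITCPS_TABLE3))  # (245, 278, 3.19), matching the table's last row
```

The same function applied to another column's entries gives a quick consistency check on a transcribed table.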


Table 4: Performance of ITCPS and 33 other scoring functions on the decoy sets of 143 protein targets from CASP5-CASP8 under three criteria: number of natives within rank 5, number of natives with rank 1, and average Z-score of native structures.

| Function (a) | Rank ≤ 5 | Rank = 1 | Avg Z-score |
|---|---|---|---|
| ITCPS | 143/143 | 140/143 | 1.85 |
| ITDA | 143/143 | 136/143 | 1.65 |
| KBF | 138/143 | 97/143 | 1.92 |
| RW | 136/143 | 110/143 | 1.69 |
| RWplus | 135/143 | 106/143 | 1.69 |
| 4BOPT | 131/143 | 72/143 | 1.46 |
| dDFIRE | 128/143 | 100/143 | 1.41 |
| SKOb | 125/143 | 48/143 | 1.45 |
| fourBody | 124/143 | 53/143 | 1.30 |
| SKOa | 122/143 | 49/143 | 1.27 |
| SKJG | 120/143 | 55/143 | 1.37 |
| 4BOPT GNM | 119/143 | 46/143 | 1.06 |
| MJ3 | 114/143 | 46/143 | 1.36 |
| HLPL | 113/143 | 50/143 | 1.28 |
| Qp | 112/143 | 57/143 | 1.32 |
| BFKV | 112/143 | 57/143 | 1.52 |
| MS | 112/143 | 42/143 | 1.34 |
| VD | 111/143 | 53/143 | 1.37 |
| TEs | 111/143 | 46/143 | 1.39 |
| Qm | 111/143 | 42/143 | 1.21 |
| Qa | 111/143 | 37/143 | 1.16 |
| genFour | 110/143 | 53/143 | 1.25 |
| Tel | 108/143 | 46/143 | 1.36 |
| shortRange | 108/143 | 39/143 | 1.05 |
| BT | 106/143 | 58/143 | 1.49 |
| MJ3h | 106/143 | 51/143 | 1.34 |
| TD | 105/143 | 57/143 | 1.30 |
| GKS | 103/143 | 38/143 | 1.18 |
| MJPL | 100/143 | 45/143 | 1.02 |
| MJ2h | 96/143 | 45/143 | 1.05 |
| TS | 95/143 | 36/143 | 1.00 |
| MSBM | 61/143 | 17/143 | 0.11 |
| RO | 57/143 | 7/143 | 0.36 |
| MJ1 | 14/143 | 0/143 | 0.76 |

(a) The results for the scoring functions other than ITCPS were taken from the literature [51, 68].


Table 5: Performances of ITCPS and four other scoring functions on the decoy sets of 177 proteins from CASP9 to CASP13 in discriminating native structures from decoys.

| Function | No. of firsts (a) | Success rate (%) | Avg Z-score |
|---|---|---|---|
| ITCPS | 140/177 | 79.1 | 1.61 |
| OPUS-PSP | 135/177 | 76.3 | 1.73 |
| ITDA | 124/177 | 70.1 | 1.36 |
| GOAP | 112/177 | 63.3 | 1.60 |
| DFIRE2.0 (b) | 62/176 | 35.2 | 1.06 |

(a) The number of the proteins with native structures as rank 1 versus the total number of proteins.
(b) DFIRE2.0 failed on one protein due to too many atoms in the protein.


Table 6: Performances of ITCPS and ITDA on 18 decoy sets in terms of the number of recognized native structures and average Z-scores (in parentheses), where the numbers in bold indicate a better performance.

| Decoy sets | ITDA | ITCPS | No. of targets |
|---|---|---|---|
| 4state_reduced | 7(6.11) | 6(4.19) | 7 |
| fisa | 3(4.00) | 3(4.40) | 4 |
| fisa_casp3 | 5(5.08) | 5(5.60) | 5 |
| lmds | 7(5.17) | 8(4.64) | 10 |
| lattice_ssfit | 8(5.88) | 8(6.87) | 8 |
| hg_structal | 21(1.86) | 20(1.80) | 29 |
| ig_structal | 48(1.66) | 48(2.03) | 61 |
| ig_structal_hires | 18(2.06) | 18(2.29) | 20 |
| MOULDER | 19(3.12) | 19(3.48) | 20 |
| ROSETTA | 52(2.86) | 57(3.36) | 58 |
| I-TASSER | 56(6.63) | 53(3.99) | 56 |
| HR | 146(5.61) | 145(6.03) | 148 |
| AMBER | 39(3.99) | 41(4.41) | 47 |
| CASP5-8 | 136(1.65) | 140(1.85) | 143 |
| (3DR)I-TASSER set | 39(2.40) | 53(4.42) | 56 |
| (3DR)Rosetta set | 53(2.61) | 58(4.36) | 58 |
| (3DR)Modeller set | 15(2.28) | 20(4.04) | 20 |
| CASP9-13 | 124(1.36) | 140(1.61) | 177 |
| Total targets | 796(3.02) | 842(3.36) | 927 |


Table 7: The average rank and the success rates within the top 1 and 5 predictions by ITCPS and six other scoring functions in distinguishing the best near-native structure among decoys on the 3DRobot test set of 134 proteins.

| Scoring function | Average rank | Success rate (%), Top 1 | Success rate (%), Top 5 |
|---|---|---|---|
| ITCPS | 6.51 | 52.24 | 82.84 |
| ITDA | 7.25 | 41.79 | 73.13 |
| GOAP | 9.60 | 36.57 | 67.91 |
| OPUS-PSP | 9.75 | 42.54 | 71.64 |
| RW | 9.87 | 42.54 | 73.13 |
| DFIRE | 10.03 | 41.04 | 72.39 |
| RWplus | 10.16 | 47.01 | 73.88 |
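The three columns of Table 7 are simple summaries of one number per target. As a rough sketch, assume that for each protein we already know the rank that a scoring function assigns to the best near-native decoy (1 = lowest energy). The ranks below are invented for illustration, and the function name is hypothetical. Reading "success within top N" as "that rank is at most N" is also an assumption about how the table was computed.

```python
def summarize_ranks(best_ranks, top_n=(1, 5)):
    """Average rank and top-N success rates (%) over a list of per-target ranks."""
    n = len(best_ranks)
    average_rank = sum(best_ranks) / n
    # A target counts as a success at cutoff k if its rank is k or better.
    success = {k: 100.0 * sum(1 for r in best_ranks if r <= k) / n for k in top_n}
    return average_rank, success

# Toy ranks for eight hypothetical targets (not taken from the paper).
toy_ranks = [1, 1, 3, 5, 12, 1, 2, 40]
avg_rank, success = summarize_ranks(toy_ranks)
print(avg_rank, success[1], success[5])
```

With 134 targets, as in the 3DRobot set, a Top 1 success rate of 52.24% corresponds to 70 targets, since 70/134 ≈ 0.5224.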


Figure 1: schematic with atoms A, A1, A2, B, B1, and B2; coordinate axes x, y, z and x', y', z'; distance rab; and angles θa, φa, θb, φb, and η.


Figure 2: bar chart of success rate (%) for ITCPS, ITDA, dDFIRE, ITScore/Pro, OPUS-PSP, DOPE, PMF, DFIRE2.0, Amber/GBSA, and ITScore/PP. Bar values shown, in descending order: 87.2, 83.0, 57.5, 55.3, 42.6, 34.0, 31.9, 29.8, 20.0, 8.5.


Figure 3: energy score versus RMSD (Å) plots for eight scoring functions: (a) PMF, (b) ITScorePro, (c) DOPE, (d) ITScorePP, (e) OPUS-PSP, (f) dDFIRE, (g) DFIRE2.0, (h) ITCPS. X-axis: RMSD (Å), from 0 to 4; y-axis: Score.


Figure 4: bar chart of success rate (%) for E0, E1, E2, E3, and E4 (y-axis from 80 to 90). Bar values shown: 87.7, 86.5, 84.8, 83.9, 82.8.

TOC figure