HotLig: A Molecular Surface-Directed Approach to Scoring Protein

Jul 17, 2013 - In addition to molecular surface distance, ligand-contacting areas and hydrogen-bond angles were also introduced to the scoring functio...
0 downloads 0 Views 5MB Size
Article pubs.acs.org/jcim

HotLig: A Molecular Surface-Directed Approach to Scoring Protein− Ligand Interactions Sheng-Hung Wang,†,§ Ying-Ta Wu,‡ Sheng-Chu Kuo,∥ and John Yu*,†,‡,§ †

Institute of Cellular and Organismic Biology, ‡Genomics Research Center, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan § Center of Stem Cell and Translational Cancer Research, Chang Gung Memorial Hospital at Linkou, Taoyuan 333, Taiwan ∥ Graduate Institute of Pharmaceutical Chemistry, China Medical University, 91 Hsueh-Shih Road, Taichung 404, Taiwan S Supporting Information *

ABSTRACT: Accurate prediction of ligand-binding poses is crucial for understanding molecular interactions and is very important for drug discovery, structural design, and optimization. In this study, we developed a novel scoring program, HotLig, which applies the Connolly surface of a protein to calculate hydrophobic interaction and paired pharmacophore interactions with ligands. In addition to molecular surface distance, ligand-contacting areas and hydrogen-bond angles were also introduced to the scoring functions in HotLig. Four individual energy scoring functions for H-bonds, ionic pairs, metal coordination, and hydrophobic effects were derived from 600 protein−ligand complexes, and then, their weighting factors were optimized through an interactioncharacterized training set. Success rates of ligand-binding-pose predictions (with a root mean squared deviation of ≤2 Å) for the Wang, GOLD, and Cheng data sets were respectively validated to be 91.0%, 87.0%, and 85.6%. HotLig was found to possess equally good predictive powers for the hydrophilic (88.6%) and hydrophobic subsets (87.5%), and the success rate for the mixed subset was as high as 96.9%. The Spearman correlation coefficients were as good as 0.609 to 0.668, which indicates HotLig also has satisfactory predictive power for binding affinities. These results suggested that the HotLig can analyze diverse ligands, including peptides, and is expected to be a powerful tool for drug design and discovery.



INTRODUCTION As numbers of experimentally determined biological macromolecules are rapidly increasing in the Protein Data Bank (http://www.rcsb.org), molecular docking has become a canonical strategy in drug discovery.1−4 Molecular docking possesses the ability to predict interactions and binding poses of a ligand within the binding pocket of its target macromolecule. Therefore, molecular docking is usually applied to in silico screening of small-molecule databases for new lead discoveries. One report has shown that the hit rate of in silico virtual screening was improved about 1700-fold compared to random high-throughput screening in their search for inhibitors of protein tyrosine phosphatase.5 In addition, molecular docking offers a rational strategy which is a time- and costsaving method for structure-based lead optimization. Moreover, molecular docking is also applicable to prediction of putative drug targets.6,7 Integrated molecular docking includes two major tasks: sampling and scoring. The first, sampling, aims to search for conformations and orientations of docking ligands against target macromolecules, while scoring will assess molecular interactions of the assembled molecular complexes. Molecular interactions of modeled complexes are analyzed by scoring functions, which play key roles in determining the predictive power by molecular docking. Generally, the principles of © XXXX American Chemical Society

algorithms of known scoring functions can be grouped into three classes: force-field-based, empirical-based, and knowledge-based scoring functions.11−13 Force-field-based methods refer to scoring functions that apply energy functions employing classical molecular mechanics. Programs such as Dock,14,15 GOLD,16 AutoDock,17 D-Score,18 and G-Score18 are force-field-based scoring functions. Empirical-based scoring functions are composed of various weighted structural parameters, and the functions are trained using protein−ligand complexes whose experimental binding constants are known. Example programs are X-Score,19 ChemScore,20 PLP,21 LigScore,22 LUDI,23 F-Score,18 and Glide.24 Knowledge-based scoring functions are derived from the occurrence frequency of protein−ligand atom pair interactions in an experimentally resolved structural database.25−27 The distance-dependent potential functions are converted from probability distributions of distances between paired atoms. Representative programs are DrugScore,26,27 PMF,28,29 DFIRE,30 SMoG,31 BLEEP,32 ITScore,33,34 and ASP.35 The predictive power of the scoring function is still one issue of great concern in molecular docking studies. To improve the success rates of prediction, many of the scoring functions described above, such as X-Score,19 DrugReceived: May 21, 2013

A

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Score,26,27 ITScore,33,34 etc., are used for rescoring after docking. Accurate prediction of ligand binding pose is important to understand how a ligand binds to a protein and affects its function. In addition, modeling the molecular interactions involved in the binding pocket is crucial for structural optimization of the lead. Improvement of the power of binding pose prediction should also be helpful for reducing false positives when molecular docking is used for virtual screening of ligand databases.8−10 The strength of the binding force between two atoms should be reflected in their relative distance which can be measured from experimentally determined protein−ligand complexes. Hence, the statistical potentials derived from knowledge-based methods might be very suitable to be used in prediction of ligand-binding poses. However, there are still challenges with current knowledge-based scoring functions. For example, it is difficult to determine an accurate reference state for all types of pair potentials.13,36 In fact, it is also unreasonable to apply a universal reference-state function for all paired atom types, since this will result in improper weights for each function of paired atom types. Moreover, the probability of finding protein−ligand paired atoms within a radius is not homogeneous in all directions, because it is influenced by solvent molecules, ligand size, shape, location, etc.36 These factors thus affect the quality of potential functions and also the reference states. Furthermore, some important factors, such as angles in hydrogen bonds (H-bonds), and the molecular surface area and distance in hydrophobic effects, are usually deficient in such knowledge-based scoring functions. These factors are interaction-type dependent, therefore, developing individual scoring functions for each type of molecular interaction is essential. On the other hand, it has been observed that the success rates for predicting binding poses of hydrophobic ligands are relatively poor in many conventional scoring functions.11 Through calculating the area of solvent-accessible surface (SAS) occupied by a binding ligand, the number of water molecules excluded from the ligand pocket could be estimated for scoring hydrophobic effects.26 Therefore, a solvent-accessible surface is commonly used to assess hydrophobic effects.26 However, the functions of a solvent-accessible surface are usually not distance-dependent functions, which should be greatly helpful for prediction of ligand-binding poses. In this study, we combined both knowledge-based and empirical-based methods to develop a novel scoring function, HotLig, for the prediction of protein−ligand interactions. Generally, the knowledge-based methods were used to derive potential functions for molecular interactions, and subsequently, the weighting factors for these knowledge-derived descriptors were determined by an empirical strategy. Instead of using pair potentials based on atom types, we simplified the types of potentials to four classes according to general molecular interactions, including H-bonds, ionic pairs, metal coordination, and hydrophobic effects. We applied the Connolly surface37 to calculate molecular surface distances and ligand-contacted surface areas and, thus, derived a molecular surface-distance-dependent scoring function for hydrophobic effects. The H-bond angles were also compiled in the H-bond potential function. In addition, in order to overcome the problem that the knowledge-based potentials are affected by ligand size, location, etc.,36 the pharmacophore pairs located in the region of nonhomogeneous distribution were ignored in the calculation of occurrence frequency. The

reference state of each type of potential thus can be identified at the region of noninteractive distance where the curve of occurrence frequency presents a leveling-off state. Moreover, to provide an optimal balance among these four types of potential functions, weighting factors were optimized through a special interaction-characterized training set. Finally, the performance of HotLig in predicting ligand binding poses and ligand binding affinities was validated.



METHODS Collection of a Protein−Ligand Statistical Set. In order to derive statistical potential functions, a statistical set which contained 600 protein−ligand complexes (Supporting Information Table S1) was collected for the statistical analysis of molecular interactions. First, the ligand database released by the Protein Data Bank (PDB) was downloaded from the Ligand Expo Web site (http://ligand-expo.rcsb.org/). The logP values of the ligands were calculated using XLogP.38 The ligands in database were further grouped into 30 subsets according to their logP values and molecular weights (MW), i.e. six groups of logP subsets (4), and each logP subset contains 5 groups of MW subsets (500). Then, we used the computer to randomly pick up ligands from each subset. Therefore, we can get a set of diverse ligands which covers a broad range of molecular weights and logP. Additionally, we manually searched 200 protein−ligand complexes (Supporting Information Table S1) from the AffinDB (http://pc1664. pharmazie.uni-marburg.de/affinity/index.php) in order to enrich the ligands with pKD ranging from 1.48 to 11.52. Furthermore, the Wang data set11 (100 complexes) and GOLD data set16 (100 complexes), which have been well examined in other studies, were also included. Overall, we collected the 600 complexes (Supporting Information Table S1) as our statistical set. As shown in Supporting Information Figure S1, several molecular characteristics were used to demonstrate ligand diversity, such as chemical elements, molecular weights, logP, numbers of rotatable bonds, and numbers of H-bond donors and acceptors. Chemical elements, including H, C, N, O, F, Cl, Br, I, P, and S, were involved in the ligands in the data set. The molecular weights of the ligands mainly ranged from 100 to 1000 Da. The calculated values of logP38 were from −6 to 9, which would cover both hydrophilic and hydrophobic ligands in the statistical data set. The flexibility of a molecule is estimated from the number of rotatable single bonds it has. While counting the rotatable bond numbers, single bonds which do not change the molecular conformation were neglected. For example, single bonds connecting to a hydrogen atom or to an end group of single heavy atom (such as −CH3, −NH3, or −OH) were not counted. Numbers of rotatable bonds of ligands in the statistical set were found to range from 0 (rigid) to 20. Furthermore, numbers of H-bond donors covered from 0 to 12, and the numbers of acceptors were from 0 to 20. Finally, the structural diversity of ligands contained in the statistical set was calculated using a centroid diversity sorting algorithm39 as briefly discussed below. N

diversity(A) =

N

∑I = 1 ∑ J = 1 dissimilarity(I , J ) N (N − 1)

(1)

The diversity coefficient of a data set A with size N is calculated from the average value of dissimilarity coefficient of all pairs of compounds I and J. The dissimilarity coefficient is obtained B

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Figure 1. Detection of binding pockets and water-contactable atoms by PscanMS to construct protein surfaces. (A) Grid box which contained the region of a protein to be scanned was set up to detect binding cavities. (B) Scanning box used to select the atoms to be scanned by the probe in a single scanning. (C) Size of the scanning box, which was set to fit the diameter of the probe (2r Å). The probe moves along the scanning direction in a step length of dS Å. At positions A and B, the probe contacted with one or more protein atom(s), but the positions located between A and B did not contact any protein atoms. As long as the length between A and B is ≥2dS Å, the protein atoms contacted by the probe at positions A and B are defined as water-contactable atoms. Since water-contactable atoms were identified, the protein surface could thus be calculated using these superficial atoms and their neighboring atoms according to their atomic radii. (D) Resulting Connolly surface, which possesses three types of surface area, the convex (red), the concave (cyan), and the saddle (yellow) area.

from the equation, 1 − similarity(I, J), in which the similarity coefficient is calculated through comparing structural fragments using a function of centroid vector.39 This diversity coefficient ranges from 0 to 1, with the larger values representing the component structures more dissimilar with each other within the set. The diversity coefficient of ligands in the statistical set is 0.897. The annotations of gene ontology for these protein complexes were also searched using UniProtKB (http://www. uniprot.org/). As shown in Supporting Information Figure S2− S4, the proteins in the statistical set represented a broad range of biological processes, cellular components, and molecular functions. The quality of X-ray crystal structures can be assessed by those measures, such as resolution, R-factor, Rfree, etc.40 These parameters reported with the 600 structures are shown in Supporting Information Table S2. The average resolution of the 600 complexes is 2.04 ± 0.42 (mean ± SD; 95% confidence interval (CI) = 2.01−2.08). The average Rfactor is 0.185 ± 0.029 (mean ± SD; 95% CI = 0.182−0.187; 588 structures), and the average Rfree is 0.236 ± 0.038 (mean ± SD; 95% CI = 0.232−0.241; 281 structures). Generally, the

coordinate error is positively correlated with the resolution, and it was suggested to use the R-factor < 0.4 (Rfree < 0.45) as a selection criteria.40 Therefore, these values indicate that the quality of the 600 structures in the statistical set are good enough for the development of scoring functions. Collection of an Interaction-Characterized Training Set. In order to optimize the weighting factors for each type of interaction, a training set of 200 protein−ligand complexes (Supporting Information Table S3) was manually selected from the statistical set based on the types of protein−ligand interactions. Structural complexes in the statistical set overlapping with those in the test sets (see below) were not selected for this training set. As listed in Supporting Information Table S3, this training set consisted of three subsets: simple set (94 complexes), ionic set (52 complexes), and metal set (54 complexes). In the simple set, neither charged ionic interactions nor metal coordination were involved in interactions between the protein and ligand in the complex. Structural complexes which possessed ionic interactions between a protein and ligand, but without mediating metal coordination, were classified as the ionic set. The metal set contained complexes C

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

in which the protein bound its ligand through metal coordination. Molecular characteristics of these ligands in the training set are shown in Supporting Information Figure S5, and the structural diversity was also calculated using the centroid diversity sorting algorithm39 as described above. The ligand diversity coefficient of the training set was calculated to be 0.905. The average resolution of the 200 structures is 1.90 ± 0.42 (mean ± SD; 95% CI = 1.84−1.96). The average R-factor is 0.192 ± 0.033 (mean ± SD; 95% CI = 0.188−0.197; 199 structures), and the average Rfree is 0.233 ± 0.041 (mean ± SD; 95% CI = 0.227−0.238; 187 structures). Calculation of Protein Surface. The program, PscanMS, was developed to calculate protein surfaces, such as solventaccessible surfaces and Connolly surfaces. To calculate the protein surface, the water-contactable atoms and cavities of a protein were detected using a probe moving in a grid box as described below. As shown in Figure 1A, a grid box defined the overall region to be calculated, whereas a scanning box defined the local region to be scanned in a single scanning. To discover the features of any buried cavity or groove in a protein structure, the probe scanned through all three dimensions in the grid box. In a single scanning, only the atoms within the scanning box were used to calculate the distances with the probe for the sake of saving time (Figure 1B). As shown in Figure 1C, the width of the scanning box was set to be equal to the diameter of the probe (2r Å, default radius r of probe is 3 Å). Radii of protein atoms were ignored in the scanning process, and the definition of “contact” here means the distance between the probe and protein atom was less than or equal to r Å. When the scanning began, the probe moved along the scanning direction in the scanning box with a step length of dS Å (default value is 0.5 Å) as the trajectory shown in Figure 1C. As shown in the zoom-in view (Figure 1C), the coordinates of A and B are the positions where the probe contacted at least one atom of protein, but the positions between A and B did not contact any protein atoms. As long as the length between A and B is ≥2dS Å, the protein atoms contacted by the probe at the A and B positions are identified as “water-contactable atoms”. As a result, the minimum distance between protein atom and point A (or B) ranges from r − dS to r Å. The minimum length of a cavity that can be detected between any two atoms is 2r + dS Å. Therefore, all of the water-contactable atoms of the protein and their neighbor atoms were able to be identified after complete scanning in all three dimensions. These atoms were then used to calculate protein surfaces based on their van der Waals radii41 or metal ionic radii.42 The Connolly surface defines three types of component molecular surface, the convex, the concave, and the saddle areas.37 The spherical surface (the convex and concave surfaces) was constructed from a regular octahedron. At the first recursion, the edges of the spherical triangles on the regular octahedron were bisected and, thus, generated 4-fold number of spherical triangles. Therefore, the levels of recursion were able to determine the density of surface dots. In order to allocate a spherical triangle area to one surface dot, only the centroid points of spherical triangles were used to represent the Connolly surface. The unit normal vectors located on the centroid points can also be determined. The saddle surface was constructed using the set of parametric equations as that for a torus:

x = cos(ω0)[r0 + r1 cos(ω1)] y = sin(ω0)[r0 + r1 cos(ω1)] z = r1 sin(ω1)

(2)

where r0 is the distance from the center of the torus to the center of the tube and r1 is the radius of the tube. The angles of ω0 and ω1 make the circles of the torus and the tube, respectively. The increment of the angles was able to control the density of saddle surface dots. Similarly, only the centroid points on the surface elements were used to represent the Connolly surface, and their areas and unit normal vectors can also be calculated. A representative dotted Connolly surface of three atoms was shown in Figure 1D. The convex area represents the van der Waals surface of the atoms (red surface in Figure 1D). The concave and the saddle area (cyan and yellow surfaces in Figure 1D, respectively), where the interatom space is unable to be accessed by an entire solvent sphere, were represented by the van der Waals surface of the solvent sphere. Therefore, the Connolly surface of a binding pocket possesses good complementarity to the configuration of ligand molecules. The protein surface was calculated using simulated water of 1.4 Å in radius by default. The Connolly surface dots located at the concave area or saddle area were assigned to the nearest protein atoms. The resulting Connolly surface was output as a dot surface in QCPE-compatible MS format,37 which recorded an area value and a unit normal vector with each surface dot. These normal vectors and their associated area values on the Connolly surface were then applied in HotLig to evaluate molecular interactions (see below). Identification of Pharmacophore Attributes. A pharmacophore is a description of attributes for chemical functional groups essential for specific molecular interactions between a ligand and its target. Molecular interactions of H-bonds, ionpairs, metal-coordination, and hydrophobic effects were specifically identified by the HotLig program. To calculate binding scores according to molecular interactions, each atom (except hydrogen atoms) in both ligand and protein was assigned appropriate pharmacophore attributes according to its connection with other atoms as discussed below. H-bond interactions refer to a paired H-bond donor and acceptor. Electronegative atoms (such as N, O, and S) which covalently bond with at least one hydrogen, were identified as H-bond donors. A heteroatom which possesses a lone pair of electrons was assigned as an H-bond acceptor. Ionic interactions refer to positively and negatively charged atom pairs, rather than metal coordination. Metal coordination refers to protein−ligand interactions that are mediated by a coordinate covalent bond with a metal cation, such as Mg, Ca, Mn, Fe, Co, Ni, Cu, Zn ions, etc. Similar to the PMF scoring function,29 the contributions from these metals have been taken into consideration, except for alkali metals, such as Na and K ions, which were treated as ionic interaction in our study. A hydrophobic atom means a non-H-bonding and noncharged atom, such as noncharged carbon atom, halogen atom, etc. The molecular surface of H-donor (only where donor-H−R angle < 90°), H-acceptor (only where electron-acceptor−R angle > 90°) or neutralized ionic pair of atoms in a protein was also treated as hydrophobic area. But the hydrophobic area that is contacted with a charged atom was neglected in HotLig. A single atom could be given more than one type of pharmacophore attribute. For example, an oxygen atom on a D

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

conventional methods,25−27 can be represented by the following equation:

carboxylic acid group was assigned as both an H-acceptor and a negatively charged atom. A magnesium ion was assigned as a positively charged atom and also a metal cation for coordination. Definitions of Parameters for Analysis of Molecular Interactions. Definitions of parameters used to evaluate molecular interactions are illustrated in Figure 2. To assess

ΔWi , j(d) = −ln

g i , j (d ) gref (d)

(3)

The potential function, ΔWi,j(d), between atoms of type i and j at a atomic distance d can be derived from the probability distribution function, gi,j(d), of paired i and j, and the reference distribution function, gref(d), which is the mean of all or defined pair distribution functions.25−29 Atom types could be defined according to the chemical elements, hybridization of atoms, or any other chemical characteristics. In our approach, we introduced the molecular surface distance, r, to the pharmacophore-paired potential function, ΔWi,j(r): ΔWi , j(r ) = Wi , j(r ) − Wi , j(rref ) = −ln

Figure 2. Parametric analysis of protein−ligand interactions in HotLig. HotLig is a molecular surface-directed scoring function which applies the Connolly surface to analyze molecular interactions. To assess the energy potentials of polar interactions, the atomic surface distance (i.e., between atoms N and O), donor-H−acceptor angle (θ) and electronacceptor−H angle (ϕ) were measured from pharmacophore-pair atoms. To estimate hydrophobic interactions, normal vectors (only representative vectors are shown) associated with dots of Connolly surface of a protein were applied to calculate the molecular surface distance and ligand-contacted area. Normal vectors were vectors perpendicular to the protein surface, pointing toward the ligand side.

g i , j (r ) gi , j(rref )

(4)

and gi , j(r ) = Ni , j(r )/4πr 2

(5)

As shown in eq 4, the potential function, ΔWi,j(r), for a specific interaction between pharmacophore types i and j at the molecular surface distance, r, was compiled from the normalized radial pharmacophore-paired distribution function, gi,j(r), and the reference state, gi,j(rref). The distance, rref, for the reference state was obtained from the normalized occurrence distribution, gi,j(r), in the region of a noninteractive distance where the curve of the occurrence frequency presents a leveling-off state. The number Ni,j(r) represents counts of specific paired pharmacophores found in the radius of r to r + dr (dr = 0.1 Å). The counts increase with the radius in a space of a homogeneous distribution of atoms. But the occurrence frequency of paired pharmacophores varies with angles (see Results and Discussion). For H-bond interaction, only the pharmacophore pair whose angle θ is larger than 90° and angle ϕ is smaller than 130° were counted in Ni,j(r). For ionic interaction and metal coordination, only the pharmacophore pair whose angle ϕ is smaller than 130° were counted. Since dr is equal to 0.1 Å, which is much smaller than the diameter of an atom, the increasing counts of the paired atoms are thus significantly correlated with the spherical area. Therefore, Ni,j(r) should be normalized according to the sphere area using eq 5 to obtain a comparable occurrence-frequency function, gi,j(r), for the specific paired pharmacophores i and j. It should be pointed out that the repulsive functions (ΔW > 0) were derived from relative rare cases and might be affected dramatically by bad quality of the structures. In the hydrophobic-effect potential function, the molecular surface distance, r, is obtained from the adapted length of normal vectors on the protein’s Connolly surface where there is van der Waals contact with the ligand (r < 6 Å). Since the distribution of lengths of normal vectors is independent of the radial distribution, the potential ΔWvdW(r) for hydrophobic effects can be directly derived from the counts of the normal vector, Nvector(r), and the reference state, Nvector(rref), using the following equation:

the energy potentials of polar interactions (H-bond, ion-pairs, and metal coordination), the parameter of the atomic surface distance (δ; not labeled in Figure 2) was introduced in distance-dependent functions. The atomic surface was calculated from the atomic center distance minus the sum of van der Waals radii of the paired atoms. In H-bond paired pharmacophores, the coordinate of hydrogen atom which is connected to a rotatable donor group was determined according to the minimum distance available between the hydrogen and acceptor atoms. Additionally, the donor-H− acceptor angle (θ) and electron-acceptor−H angle (ϕ) were also calculated. The lone pair of electrons was modeled by a pseudoelectron vector to calculate the angle ϕ. On the other hand, as described above, the Connolly surface was composed of the following descriptors: surface-dot coordinates, areas, and unit normal vectors. Normal vectors were vectors perpendicular to the protein surface, pointing toward the ligand side. Herein, the molecular surface distance was estimated by the adapted length of normal vectors (λ) which fit between the surface of the protein and ligand atoms. Hundreds of thousands of normal vectors on a Connolly surface of a protein were involved in calculating the molecular surface contact. Additionally, the ligand-contacted area (A) on the Connolly surface can be integrated according to normal vectors which were touched by a ligand (λ < 6 Å). As a result, the molecular surface distance and ligand-contacted area could be measured and applied to estimating the binding-energy contribution of the hydrophobic effect. Derivation of Molecular Surface-Distance-Dependent Potentials. The principle of knowledge-based strategy, which was used for the derivation of atom-pair potentials by the

ΔWvdW(r ) = −ln E

Nvector(r ) Nvector(rref )

(6)

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Figure 3. Scheme of development of the HotLig scoring function. HotLig is a knowledge-based scoring function for predicting protein−ligand interactions. The molecular surface-distance-dependent potential of H-bonds, ion-pairs, metal coordination, and hydrophobic effects were derived from 600 protein−ligand complexes according to the occurrence frequency of each kind of molecular interaction. Furthermore, the Connolly surface of binding pockets was calculated by PscanMS and was then applied to estimate hydrophobic effects through calculation of molecular surface distances and ligand-contacted areas. Subsequently, weighting factors of each type of interaction were determined by the training set. This training set was classified into three subsets and arranged in the series: simple set (94 complexes), ionic set (52 complexes), and metal set (54 complexes). Finally, the sum of all types of potential functions was used to give scores to protein−ligand complexes.

of 1−3. The parameter of “maximum iterations” was set to 100−200. Then, the sample conformers were analyzed and scored by the HotLig program. The topmost ranked ligand pose was used to validate the success rate of the prediction. Test Sets Used for Validation. Three well-known protein−ligand structural data sets, the GOLD,16 Wang,11 and Cheng44 data sets which contain 100, 100, and 195 structural complexes, separately, were used to validate the predictive power of HotLig with respect to the prediction of ligand-binding poses and ligand-binding affinities. The Cheng data set is an updated version of the Wang data set for comparative assessment of the scoring functions.44 The overlapping structures among these three data sets are shown in Supporting Information Figure S6 and Table S4. Decoy conformers of ligands for the GOLD and Cheng data sets were generated by the DOCK software through flexible docking as described above. On the other hand, the Wang data set, which was used in a comparative study of various scoring functions, had provided well-sampled conformers of ligands (http://sw16. im.med.umich.edu/software/xtool/). Success of the prediction

Finally, the minimum of the molecular surface−surfacedependent function, ΔW, for each type of interaction was normalized to −1 for applications. The radii of atoms were defined in the HotLig program. van der Waals radii for nonmetallic elements were taken from Bondi’s compilation.41 The ionic radii of metal atoms were applied using Shannon’s crystal data.42 Molecular Docking. The 2006 version of DOCK v5.1.114,15 was used as the flexible docking engine to produce the sample conformations and orientations of ligands against their binding pockets in protein structures. Kollman partial charges18 were applied to atoms of protein models. The Gasteiger partial charges of ligands were calculated using Open Babel.43 The cognate ligands and all solvent molecules were removed from the protein structures before docking. To evaluate potentials with respect to metal coordination, the metal ions were retained in the protein structures. Parameters for the DOCK program were generally set to generate 2000− 5000 orientations and 50−200 conformers per cycle of conformation search in a binding pocket with “anchor size” F

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Figure 4. Derivation of potential functions for H-bond interaction. (A) Density map for the distribution of donor-H−acceptor angles (θ) vs atomic surface distances (δ) of H-bonding pairs. The density shows average dots per 0.02 Å per degree. (B) Density map for the distribution of electronacceptor−H angles (ϕ) vs atomic surface distances (δ). The density shows average dots per 0.02 Å per degree. (C) Statistical potential for θ. The N90−100° was used as reference state, and then the maximum value was normalized to 1. The resulting curve was able to be fitted by the function of cos(180 − θ). (D) Plot shows the molecular surface-distance-dependent potential for H-bonds (Whbond) which was derived using eq 4. (E) Plot of H-bond energy scores using eq 8 which simulates the distribution of H-bonding pairs in part A. (F) Plot of the potential functions for H-bonds which is corresponding to Figure 4B.

the statistical potential functions, we prepared a statistical set with 600 protein−ligand complexes as described above. Instead of analyzing pair potentials according to atom types, potential functions in HotLig were derived from the probability distribution of paired pharmacophores based on the types of molecular interactions. Four types of potential functions with respect to H-bonds (ΔEhbond), ionic pairs (ΔEion), metal coordination (ΔEmetal), and hydrophobic effects (ΔEvdW) were derived from the statistical set. The identification of

of a ligand-binding pose was qualified by the root-mean-square deviation (RMSD) of the retrieved ligand pose with reference to the corresponding crystal structure. Spearman’s rank order correlation coefficient (Rs) was used to validate the ranking power with respect to the ligand binding affinity using similar methods as reported in previous studies.11 Scheme of the HotLig Algorithm. Development of the HotLig algorithm is illustrated in Figure 3. In order to derive G

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

was plotted as a density map in Figure 4A. On the other hand, the density map in Figure 4B shows the distribution of the angle electron-acceptor−H (ϕ) vs the atomic surface distance (δ). Figures 4A and B show a remarkable distribution of Hbond pairs when δ is in the range of −0.5−0 Å. This reflects that the H-bond length was shorter than the summation of the van der Waals radii of H-bond donor and acceptor atoms. In addition, the H-bonds were mainly distributed in the region where θ is larger than 90°, in particular in 140−180°, when δ is smaller than 0 Å (Figure 4A). On the other hand, the distribution of H-bonds with the angle ϕ presented a normal distribution centered at about 40−60° when δ is smaller than 0 Å (Figure 4B). The distribution of H-bonds with the angle ϕ ranging from 90° to 140° seems to be affected by hindrance effects when δ is smaller than 1 Å, because the limitation of angle ϕ is obviously corrected with distance δ (Figure 4B). A statistical potential for the angle θ is shown in Figure 4C. The curve of the counts of θ was calculated using the data whose δ is smaller than 0 Å, then the maximum value was normalized to 1. The resulting curve was able to be fitted by the function of cos(180 − θ) as shown in Figure 4C. On the other hand, according to Figure 4B, the angle ϕ should be smaller than about 110° for H-bond interactions (δ < 0 Å); and it should be smaller than about 130° when δ < 1 Å. Figure 4D shows the molecular surface-distance-dependent potential (ΔWhbond) for H-bond pairs which was derived from the statistical distribution with δ using eq 4 as described in Methods. The average occurrence frequency of the H-bond pair at the atomic surface distance of 0.7−1.5 Å presents a leveling-off distribution and was used as the reference state for the H-bond potential (Figure 4D). The potential curve shows that H-bond paired pharmacophores began attracting each other at an atomic surface distance of about 0.7 Å. Interestingly, a surface distance ranging 0.1−0.7 Å seemed to be a minor peak, whereas the major peak was located at about −0.6−0.1 Å. The two-peak distribution possibly indicates that there were two types of interactions within paired H-bond pharmacophores. One was certainly the H-bond interaction distributed in the major peak, and the other should be the polar−polar interaction which comprised the minor peak. The major peak of the H-bond potential presented a sharp curve at −0.2−0.1 Å, and achieved an optimal binding state at −0.4 to −0.2 Å in atomic surface distance. Another smaller population distributed at about 1.5 to 2 Å was noted. This population should have been caused by Hbonds that were mediated by a water molecule. This watermediated H-bond at the far distance was ignored in calculating the potentials, because whether or not to retain water molecules was decided before molecular docking. Water molecules can be retained in the protein structure when calculating the Connolly surface, and thus, the water-mediated H-bonds can be calculated. The occurrence frequency at δ > 2 Å that did not exhibit a leveling-off distribution should have been due to the size limitation of the ligands. Compared to the potential function of H-bonding pairs in PMF,29 there is unfavorable interaction near about 4 Å (corresponding to about 1 Å of atomic surface distance in Figure 4D), where is within the leveling-off reference state of our function. This difference might be attributed to the use of different reference state. In our method, we ignored the H-bonding pair whose θ is < 90° and ϕ is > 130°, because the distributions of which are remarkably nonhomogenous compared to the other parts within the same distance (Figures 4A and B), including the noninteractive distance (1−4 Å). This nonhomogenous distribution might be

pharmacophore attributes and the definitions of statistical parameters applied in HotLig were described above (Figure 2). Such an arrangement enabled HotLig to take into account of some important parameters which are deficient in conventional paired-atom potentials, for example, angles in H-bonds, molecular surface distance, and area in hydrophobic effects. In order to eliminate the interferences of different atom radii, we introduced the parameter of molecular surface distance for the derivation of distance-dependent potential. For example, atomic surface distances were counted as the molecular surface distance in the assessment of H-bond, ion pair, and metal coordination. On the other hand, we applied the normal vector on the Connolly protein surface37 to measure molecular surface distance and area for hydrophobic contact with the ligand. Assuming that the binding energy (ΔE) of a protein−ligand complex is the sum of energy contributed by H-bonds (ΔEhbond), ionic pairs (ΔEion), metal coordination (ΔEmetal), and hydrophobic effects (ΔEvdW), the equation is shown below ΔE = FhbondΔE hbond + FionΔE ion + FmetalΔEmetal + FvdW ΔEvdW

(7)

In order to compile together the four energy potential functions with appropriate weighting factors, 200 complexes were manually collected from the statistical set by selecting the interactions involved between proteins and ligands as described above. Thus, this interaction-characterized training set of 200 protein−ligand complexes was employed to determine the weighting factors for each potential function. They were further classified into three subsets: simple set (94 complexes), ionic set (52 complexes), and metal set (54 complexes). The order of using of these three training sets (simple, ionic, and metal sets) is important for the determination of weighting factors (Figure 3). The Fhbond was set to 1 first. Since the ligands in simple set do not possess any ionic interactions or metal coordination, the contribution of ΔEion and ΔEmetal to ligand-binding energy were null. Therefore, the weighting factor, FvdW, could be optimized when prediction gave the maximal success rate for simple set. Similarly, the ionic set did not contain any metal coordination for ligand binding, so the ΔEmetal in binding energy was null. Since the Fhbond and FvdW are known, the optimal value of factor Fion could be subsequently determined through the evaluation of the ionic set. Finally, the factor Fmetal was then available from the evaluation using the metal set after all the other weighting factors were obtained. Thus, the sum of all individual functions with optimized weighting factors was the final equation for the assessment of molecular interactions (eq 7).



RESULTS AND DISCUSSION Molecular Surface-Distance-Dependent Potential Functions. To give a quantitative score for a protein−ligand complex, the Connolly surface of protein was first calculated by PscanMS as described above and was then applied to analyze molecular interactions. The pharmacophore attributes of atoms were identified as described in Methods. On the basis of the analysis of molecular interactions using the parameters defined in Figure 2 (see Methods), the occurrence frequencies of the parameters in each type of protein−ligand interaction were calculated from the statistical set. The statistical results of the distribution of parameters and the derivation of H-bond potential are shown in Figure 4. The distribution of the angle of the donor-H−acceptor (θ) vs the atomic surface distance (δ) H

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

mainly caused by steric hindrance effect from ligands and is influenced by solvent molecules and ligand properties (see Methods).36 To calculate the energy score of H-bonds (ΔEhbond), the distance, δ, and angle, θ, on an H-bond pair (h) were compiled in the following equation: ΔE hbond =

∑ (cos(180 − θh)ΔWhbond(δh)); h

θ = 90−180

(8)

The effects of angle θ were ignored for calculating the repulsive score (ΔW > 0). As the plot of eq 8 in Figure 4E shows, this Hbond energy-scoring function was able to simulate the distribution of H-bonds as presented in Figure 4A, the distribution of which is experimentally given by crystallized protein−ligand complex structures. On the other hand, the Hbond energy-scoring function plotted with the angle ϕ was shown in Figure 4F. A limitation of angle ϕlimit was set to be smaller than 130° using the equation, ϕlimit = 20δh + 110, while the δh is shorter than 1 Å (Figure 4F). The derived molecular surface-distance-dependent potential function, ΔWion, for ionic pairs is shown in Figure 5. Similar to

Figure 6. Derived potential function for metal coordination. The distance-dependent potential for metal coordination (Wmetal) was derived using eq 4. The major peak showed a wide range of −1.1 to 0.1 Å in atomic surface distances.

from the distribution, NvdW, using eq 6. The average occurrence frequency at 5.5−6.0 Å was counted as the reference state for the hydrophobic-effect potential. The optimal molecular surface distance was about 1 Å. The area (Av) which is associated with the normal vector (v) on the Connolly surface was then compiled with ΔWvdW in the energy-score function of the hydrophobic effect. Therefore, the equation for scoring hydrophobic interactions (ΔEvdW) is ΔEvdW =

∑ (A νΔWvdW(λν)) ν

(9)

Since the minimum of ΔWvdW was normalized to −1, the lowest value of ΔEvdW approximated the negative sum of Av and was dependent on the molecular surface distance but not the dot density of Connolly surface (see Methods). An example case presented in Figures 7C and D shows the ability of ΔEvdW to discriminate a hydrophobic ligand. The complex structure of retinol, an active form of vitamin A, with serum retinol-binding protein was crystallographically determined by X-ray diffraction (PDB entry: 1RBP).45 As shown in Figure 7C, retinol is a hydrophobic diterpene compound trapped by a hydrophobic pocket of a retinol-binding protein. The Connolly surface (dot surface in Figure 7C) calculated by PscanMS showed that the retinol molecule binds in this pocket with some distances between the molecular surfaces of the retinol and the pocket. This is reasonable because the optimal molecular-surface distance determined by the length of normal vector is about 1 Å as described above. The only hydroxyl group (colored in red), which was oriented toward the opening of pocket and exposed to the solvent side, did not interact with the retinolbinding protein through H-bonding (Figure 7C). Therefore, the molecular shape complementarity and hydrophobic effect were the major factors for molecular recognition of retinol. To test the ability of this molecular surface-directed scoring function (ΔEvdW) to discriminate hydrophobic interactions, rigid docking was performed to sample 5000 near-native poses of retinol within its binding pocket using the structure of 1RBP. Therefore, the binding energy of a ligand pose was expected to be correlated with its RMSD compared to the native pose because these near-native poses (RMSD < 1.2 Å) possess the same conformational energy. Then, HotLig was applied to retrieve the native pose of retinol from all sampled poses through calculating their binding energy scores. Figure 7D

Figure 5. Derived potential function for ion-pair interaction. The distance-dependent potential for ion-pair interaction (Wion) was derived using eq 4. The optimal atomic surface distance for ion-pair interaction was located at −0.4 to −0.2 Å.

the curve of the potential function of H-bonds, the optimal atomic surface distance was also located at −0.4 to −0.2 Å in the major peak. Compared to the minor peak of the H-bond potential function, ΔWhbond, the minor peak in ΔWion was more obviously distributed at 0.1−1.1 Å (Figure 5). On the other hand, the major peak of ΔWmetal shown in Figure 6 presents a wide range of −1.1 to 0.1 Å in atomic surface distances. This indicates the binding force of metal coordination was stronger than that of H-bonds or ionic interactions. Similarly, a minor peak was also evident near 0.5 Å of atomic surface distance (Figure 6). To assess hydrophobic effects that occur at the interface of paired hydrophobic pharmacophores, normal vectors on the protein Connolly surface were applied to estimate the molecular surface distance and ligand-contacted area. Figure 7A shows the distribution of the adapted lengths (λ) of normal vectors (λ < 6 Å) that fit the interfaces of the protein and ligand (i.e., molecular surface distance). As shown in Figure 7B, the potential of hydrophobic interactions (ΔWvdW) was derived I

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Figure 7. Molecular surface-distance-dependent potential for hydrophobic effect. The hydrophobic effect was analyzed using normal vectors on the protein Connolly surface where the ligand has van der Waals contact with its binding pocket. The length (λ) of a normal vector that fits in the interface of a protein and ligand was counted as the molecular surface distance. (A) A plot shows the occurrence-frequency distribution of molecular surface distances. (B) Molecular surface-distance-dependent potential for hydrophobic effects was derived from part A using eq 6. (C) Surface model of the retinol molecule bound in the binding pocket (PDB entry: 1RBP). (D) Native pose of retinol in structure 1RBP. This was the no. 1 ranked pose retrieved by HotLig from 5000 sample poses. A limit of discrimination was found at about 0.15 Å; and the binding energy scores were linearly correlated with the root mean squared deviations (RMSDs) when the RMSD was >0.15 Å.

shows that the native pose of retinol was the topmost ranked pose according to the calculated score. Furthermore, a limit of discrimination was found for HotLig as about 0.15 Å, the pose energy score of which was approximately that of the native pose (with a difference of ΔE ≤ 0.1). When the RMSD was >0.15 Å, binding energy scores were linearly correlated with the RMSDs. These 5000 conformers were also scored by the DSX46 and as the plot of DSX scores versus RMSDs with these 5000 conformers is shown in Supporting Information Figure S7, the limit of discrimination of DSX is 0.51 Å. These results of comparisons (0.15 versus 0.51 Å) indicated that the hydrophobic potentials in HotLig might possess better discriminability than the DSX for scoring hydrophobic ligands. Calculation of the Total Binding Energy Score for Molecular Interaction Predictions. To compile the final scoring function with the four potential functions (eq 7), weighting factors for each kind of interaction were optimized as described in Methods and Figure 3. Generally, Fhbond was set to 1, and then, the others, FvdW, Fion, and Fmetal, were optimized one by one through iterative molecular docking using the three interaction-characterized training sets (in the order of simple, ionic, and metal set) as illustrated in Methods and Figure 3. Classification of the training set and molecular flexible docking

was as described in Methods. The RMSD was used to judge whether the retrieved ligand conformer was a near-native-pose conformer. A value of the RMSD of 5 is commonly considered as “active” when screening chemical compounds. In Figure 9, the average pKD of the ligands whose HotLig scores are above −20, is 5.07 ± 1.98 (mean ± SD; 95% CI = 4.68−5.47). In contrast, the average pKD of the ligands whose HotLig scores are below −20, is 7.67 ± 2.01 (mean ± SD; 95% CI = 7.27−8.06), which is significantly higher than five and can be considered as “active”. M

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

(5) Doman, T. N.; McGovern, S. L.; Witherbee, B. J.; Kasten, T. P.; Kurumbail, R.; Stallings, W. C.; Connolly, D. T.; Shoichet, B. K. Molecular Docking and High-Throughput Screening for Novel Inhibitors of Protein Tyrosine Phosphatase-1B. J. Med. Chem. 2002, 45, 2213−2221. (6) Gupta, S. K.; Dhawan, A.; Shanker, R. In Silico Approaches: Prediction of Biological Targets for Fullerene Derivatives. J. Biomed. Nanotechnol. 2011, 7, 91−92. (7) Gautam, B.; Singh, G.; Wadhwa, G.; Farmer, R.; Singh, S.; Singh, A. K.; Jain, P. A.; Yadav, P. K. Metabolic Pathway Analysis and Molecular Docking Analysis for Identification of Putative Drug Targets in Toxoplasma Gondii: Novel Approach. Bioinformation 2012, 8, 134− 141. (8) Krovat, E. M.; Steindl, T.; Langer, T. Recent Advances in Docking and Scoring. Curr. Comput.-Aided Drug Des. 2005, 1, 93−102. (9) Graves, A. P.; Brenk, R.; Shoichet, B. K. Decoys for Docking. J. Med. Chem. 2005, 48, 3714−3728. (10) Perola, E. Minimizing False Positives in Kinase Virtual Screens. Proteins 2006, 64, 422−435. (11) Wang, R.; Lu, Y.; Wang, S. Comparative Evaluation of 11 Scoring Functions for Molecular Docking. J. Med. Chem. 2003, 46, 2287−2303. (12) Cole, J. C.; Murray, C. W.; Nissink, J. W.; Taylor, R. D.; Taylor, R. Comparing Protein-Ligand Docking Programs Is Difficult. Proteins 2005, 60, 325−332. (13) Huang, S. Y.; Grinter, S. Z.; Zou, X. Scoring Functions and Their Evaluation Methods for Protein-Ligand Docking: Recent Advances and Future Directions. Phys. Chem. Chem. Phys. 2010, 12, 12899−12908. (14) Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R. L. A Geometric Approach to Macromolecule-Ligand Interactions. J. Mol. Biol. 1982, 161, 269−288. (15) Moustakas, D. T.; Lang, P. T.; Pegg, S.; Pettersen, E.; Kuntz, I. D.; Brooijmans, N.; Rizzo, R. C. Development and Validation of a Modular, Extensible Docking Program: DOCK 5. J. Comput.-Aided Mol. Des. 2006, 20, 601−619. (16) Jones, G.; Willet, P.; Glen, R. C.; Leach, A. R. Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 1997, 267, 727−748. (17) Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J. Automated Docking Using a Lamarckian Genetic Algorithm and Empirical Binding Free Energy Function. J. Comput. Chem. 1998, 19, 1639−1662. (18) SYBYL, version 6.8; Tripos International: St. Louis, Mo, 2001. (19) Wang, R.; Lai, L.; Wang, S. Further Development and Validation of Empirical Scoring Functions for Structure-Based Binding Affinity Prediction. J. Comput.-Aided Mol. Des. 2002, 16, 11−26. (20) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. Empirical Scoring Functions: I. The Development of a Fast Empirical Scoring Function to Estimate The Binding Affinity Of Ligands In Receptor Complexes. J. Comput.-Aided Mol. Des. 1997, 11, 425−445. (21) Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.; Fogel, D. B.; Fogel, L. J.; Freer, S. T. Molecular Recognition of The Inhibitor AG-1343 by HIV-1 Protease: Conformationally Flexible Docking by Evolutionary Programming. Chem. Biol. 1995, 5, 317−324. (22) Cerius2, version 4.6; Accelrys: San Diego, CA, 2008. (23) Böhm, H. J. Prediction of Binding Constants of Protein Ligands: A Fast Method for The Prioritization of Hits Obtained from de novo Design or 3D Database Search Programs. J. Comput.-Aided Mol. Des. 1998, 12, 309−323. (24) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: a New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739−1749. (25) Sippl, M. J. Knowledge-Based Potentials for Proteins. Curr. Opin. Struct. Biol. 1995, 5, 229−235.

protein. Thus, we were able to design cancer targeting peptides for cancer therapy and tumor imaging.49 The binding affinity of two molecular partners is determined by the equilibrium of their association and dissociation rates. Even though molecular interaction is just one factor that affects the protein−ligand binding affinity, it is still a significant factor that is positively correlated with binding affinities. Hence, knowledge-derived potentials, such as HotLig, ASP, and DrugScore, may also possess reasonable predictive power for ranking binding affinities as shown in the validation results using Spearman correlation coefficients. Since the ligandbinding affinity is still the ultimate goal of molecular-interaction predictions, combining knowledge-based with other types of scoring functions in future work is an interesting strategy. For example, the score obtained from HotLig, which estimates molecular interactions, can become one descriptor for an empirical-based scoring function to further predict ligandbinding affinities. Overall, HotLig provides a novel algorithm and a better choice for a protein−ligand scoring function in molecular docking studies. This program is expected to lead to more successful drug discoveries in the future.



ASSOCIATED CONTENT

S Supporting Information *

Results, figures, and tables further detailing this work. One example of application for structure-based design of cancertargeting peptides using HotLig was briefly introduced in the Results section. In addition, the analysis of molecular characteristics for the statistical set is shown in Figure S1. The related biological processes, cellular components, and molecular functions of the protein in the statistical set are shown in Figures S2−S4. On the other hand, the analysis of molecular characteristics for the training set is shown in Figure S5. Figure S6 shows the numbers of overlapping structures among the GOLD, Wang, and Cheng data sets. Figure S7 shows the discriminability of DSX. The detailed PDB codes of protein−ligand complexes in the statistical set and training set are listed in Table S1 and S3. Table S2 shows the resolution, Rfactor, and Rfree of the 600 structures in statistical set. Table S4 lists the overlapping structures among the data sets. Table S5 shows the scoring functions ranked by the Pearson correlation coefficients using the Wang data set. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Tel.: 886-2-27899500. Fax: 886-2-27899503. E-mail: johnyu@ gate.sinica.edu.tw. Notes

The authors declare no competing financial interest.



REFERENCES

(1) Alvarez, J. C. High-Throughput Docking as a Source of Novel Drug Leads. Curr. Opin. Chem. Biol. 2004, 4, 365−370. (2) Huang, S. Y.; Zou, X. Advances and Challenges in Protein-Ligand Docking. Int. J. Mol. Sci. 2010, 11, 3016−3034. (3) Meng, X. Y.; Zhang, H. X.; Mezei, M.; Cui, M. Molecular Docking: A Powerful Approach for Structure-Based Drug Discovery. Curr. Comput.-Aided Drug Des. 2011, 7, 146−157. (4) Verdonk, M. L.; Giangreco, I.; Hall, R. J.; Korb, O.; Mortenson, P. N.; Murray, C. W. Docking Performance of Fragments and Druglike Compounds. J. Med. Chem. 2011, 54, 5422−5431. N

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article

Quantum Chemical Charge Models. J. Chem. Inf. Model. 2011, 51, 2528−2537. (49) Wang, S. H.; Chen, I. J.; Chang, N. C.; Yu, A. L. T.; Wu, H. C.; Yu, H. M.; Chang, Y. J.; Lee, T. W.; Yu, J. C.; Lee, A. C. L.; Yu, J. Structure-Based Design of Cancer-Targeting Peptides for Therapy and Imaging, submitted for publication.

(26) Gohlke, H.; Hendlich, M.; Klebe, G. Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions. J. Mol. Biol. 2000, 295, 337−356. (27) Velec, H. F.; Gohlke, H.; Klebe, G. Drugscore(CSD)Knowledge-Based Scoring Function Derived from Small Molecule Crystal Data with Superior Recognition Rate of Near-Native Ligand Poses and Better Affinity Prediction. J. Med. Chem. 2005, 48, 6296− 6303. (28) Muegge, I.; Martin, Y. C. A General and Fast Scoring Function for Protein-Ligand Interactions: A Simplified Potential Approach. J. Med. Chem. 1999, 42, 791−804. (29) Muegge, I. PMF Scoring Revisited. J. Med. Chem. 2006, 49, 5895−5902. (30) Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y. A. Knowledge-Based Energy Function for Protein-Ligand, Protein-Protein, and ProteinDNA Complexes. J. Med. Chem. 2005, 48, 2325−2335. (31) Ishchenko, A. V.; Shakhnovich, E. I. SMall Molecule Growth 2001 (SMoG2001): an Improved Knowledge-Based Scoring Function for Protein-Ligand Interactions. J. Med. Chem. 2002, 45, 2770−2780. (32) Mitchell, J. B. O.; Laskowski, R. A.; Alex, A.; Thornton, J. M. BLEEP − Potential of Mean Force Describing Protein-Ligand Interactions: I. Generating Potential. J. Comput. Chem. 1999, 20, 1165−1176. (33) Huang, S. Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: I. Derivation of Interaction Potentials. J. Comput. Chem. 2006, 27, 1866−1875. (34) Huang, S. Y; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: II. Validation of The Scoring Function. J. Comput. Chem. 2006, 27, 1876−1882. (35) Mooij, W. T. M.; Verdonk, M. L. General and Targeted Statistical Potentials for Protein-Ligand Interactions. Proteins: Struct. Funct., Bioinf. 2005, 61, 272−287. (36) Muegge, I. A Knowledge-Based Scoring Function for ProteinLigand Interactions: Probing The Reference State. Perspect. Drug Discovery Des. 2000, 20, 99−114. (37) Connolly, M. L. Solvent-Accessible Surfaces of Proteins and Nucleic Acids. Science 1983, 221, 709−713. (38) Wang, R.; Fu, Y.; Lai, L. A New Atom-Additive Method for Calculating Partition Coefficients. J. Chem. Inf. Comput. Sci. 1997, 37, 615−621. (39) Trepalin, S. V.; Gerasimenko, V. A.; Kozyukov, A. V.; Savchuk, N. P.; Ivaschenko, A. A. New Diversity Calculations Algorithms Used for Compound Selection. J. Chem. Inf. Comput. Sci. 2002, 42, 249−258. (40) Warren, G. L.; Do, T. D.; Kelley, B. P.; Nicholls, A.; Warren, S. D. Essential Considerations for Using Protein-Ligand Structures in Drug Discovery. Drug Discovery Today. 2012, 17, 1270−1281. (41) Bondi, A. van der Waals Volumes and Radii. J. Phys. Chem. 1964, 68, 441−451. (42) Shannon, R. D. Revised Effective Ionic Radii and Systematic Studies of Interatomic Distances in Halides and Chalcogenides. Acta Crystallogr. 1976, A32, 751−767. (43) Guha, R.; Howard, M. T.; Hutchison, G. R.; Murray-Rust, P.; Rzepa, H.; Steinbeck, C.; Wegner, J.; Willighagen, E. L. The Blue Obelisk-Interoperability in Chemical Informatics. J. Chem. Inf. Model. 2006, 46, 991−998. (44) Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R. Comparative Assessment of Scoring Functions on a Diverse Test Set. J. Chem. Inf. Model. 2009, 49, 1079−1093. (45) Cowan, S. W.; Newcomer, M. E.; Jones, T. A. Crystallographic refinement of human serum retinol binding protein at 2A resolution. Proteins 1990, 8, 44−61. (46) Neudert, G.; Klebe, G. DSX: a knowledge-based scoring function for the assessment of protein-ligand complexes. J. Chem. Inf. Model. 2011, 51, 2731−2745. (47) Xie, Z. R.; Hwang, M. J. An Interaction-Motif-Based Scoring Function for Protein-Ligand Docking. BMC Bioinf. 2010, 11, 298. (48) Wang, J. C.; Lin, J. H.; Chen, C. M.; Perryman, A. L.; Olson, A. J. Robust Scoring Functions for Protein-Ligand Interactions with O

dx.doi.org/10.1021/ci400302d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX