Predicting Binding Poses and Affinities in the ... - ACS Publications

Nov 16, 2015 - Convex-PL: a novel knowledge-based potential for protein-ligand interactions deduced from structural databases using convex optimizatio...
0 downloads 12 Views 981KB Size
Subscriber access provided by UNIV OF LETHBRIDGE

Article

Predicting binding poses and affinities in the CSAR 2013-2014 docking exercises using the knowledge-based Convex-PL potential Sergei Grudinin, Petr Popov, Emilie Neveu, and Georgy Cheremovskiy J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00339 • Publication Date (Web): 16 Nov 2015 Downloaded from http://pubs.acs.org on November 17, 2015

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Predicting binding poses and affinities in the CSAR 2013-2014 docking exercises using the knowledge-based Convex-PL potential Sergei Grudinin,∗,†,‡,¶ Petr Popov,†,‡,¶,§ Emilie Neveu,†,‡,¶ and Georgy Cheremovskiy†,‡,¶,§ †Univ. Grenoble Alpes, LJK, F-38000 Grenoble, France ‡CNRS, LJK, F-38000 Grenoble, France ¶Inria §Moscow Institute of Physics and Technology, Dolgoprudniy, Russia E-mail: [email protected] Phone: +33 (0)4 38 78 16 91

Abstract The CSAR docking exercise 2013-2014 was the opportunity to assess the performance of the novel knowledge-based potential we are developing, named Convex-PL. The data used to derive the potential consists only of structural information from protein-ligand interfaces found in the PDBBind database. As expected, our potential proved to be very efficient in the near-native pose detection exercises, where we correctly predicted two near-native poses in the 2013 exercise and also ranked 22 near-native poses first and 2 second in the 2014 exercise. Somewhat more surprisingly, we obtained a fair performance in some of CSAR affinity ranking exercises, where the Spearman correlation coefficient between our predictions and the experiments are greater than 0.5

1 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

for several protein-ligand sets. Nonetheless, affinity prediction exercises turned out to be a challenge, and significant progress in the development of our method is needed before we can successfully predict the binding constants.

Introduction Drug discovery is a multidisciplinary field, w hich i ncludes m olecular b iology, biophysics, biochemistry, and pharmacology. It usually deals with the identification of a biological target that is known to play a critical role in a particular disease. In drug discovery, computational methods are increasingly used for the structure-based drug design from target identification and validation to the designing of new molecules. To identify molecules that inhibit a certain activity, hundreds of thousands candidates generated with docking protocols are virtually screened to filter out top-scoring hits.1–10 The latter molecules are tested in appropriate biological assays, and many cycles of optimization are performed to obtain the candidates for further clinical trials. The Community Structure-Activity Resource (CSAR) was created in 2008 from the desire to provide high-quality experimental data to the community that is developing new computational tools. Its main goal is to bring together scientists in order to better assess the computational methods. This helps the community to direct their efforts where needed in order to improve the predictions. CSAR organized the first benchmark exercises in 2010 that highlighted the difficulties of predicting binding free energies of protein-ligand complexes, and discovered that some properties such as hydrogen bonding cause more trouble than the ligand size. 11 The goal of the 2011-2012 blind exercises was to compare improvements for pose prediction, enrichment, and relative ranking of congeneric series of compounds with a set of four protein targets. One of its conclusion was that correct ranking and correct pose prediction are not necessarily correlated. 8 This leaded to the CSAR Benchmark Exercise 2013 and 2014, which focused on the further improvement of the ability of computational methods to correctly predict binding poses and affinities. 2 ACS Paragon Plus Environment

Page 2 of 28

Page 3 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Here we present the performance of Convex-PL - a knowledge-based scoring function for protein-ligand interactions (PLIs) in the 2013-2014 exercises. We start with a brief introduction of basic concepts related to the prediction of PLIs. In practice, one is interested in the estimation of the binding free energy ∆Gbind = ∆H − T ∆S, where ∆H is the enthalpic difference between the bound and the unbound states of the complex, T is the temperature, and ∆S is the entropic difference upon binding. The binding free energy is a much more sophisticated function compared to the potential energy, and involves not only interaction energy between the partners, but also changes in the internal energy of monomers, interactions with solvent, rearrangement of solvent molecules and changes of conformational degrees of freedom corresponding to the entropic loss upon binding. Direct computation of the binding free energy of molecular complexes is generally an intractable problem due to its high computational cost. Instead, to solve for putative docking poses and virtual screening candidates, there has been an extensive development of different scoring functions as an approximation to the binding free energy. 1,3,5,8–14 Scoring functions for the virtual screening and the selection of putative binding poses can be generally categorized into four groups. Even though many methods can be seen as hybrids, these four groups are commonly referred in literature, 14 i.e. physics-based methods, empirical scoring functions, knowledge-based potentials and descriptor-based scoring functions. Physics-based methods 15–19 aim to reproduce the physical phenomena implied in protein-ligand interactions. They take advantage of the progress in force-field developments, quantum mechanics and solvation models, 20,21 but typically suffer from a high computational cost and often need some empirical scaling to set some parameters. Empirical scoring functions consist of a linear combination of terms that are known to reflect important factors of binding, e.g. hydrophobic contacts, hydrogen bonding, accessible and buried surface area, etc. 22–25 Using multivariate regression analysis, all the terms are supplied with the corresponding weights to provide a good agreement with the training set of complexes for which experimentally determined binding constants are available. These

3 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

terms are flexible, being easily adapted to a specific type of in teractions 26 and are calibrated to reproduce binding affinities. However, they depend on the data set and can be biased with the experimental errors. Knowledge-based potentials are typically a sum of pairwise statistical potentials that have been computed using structural information from databases of molecular complexes. 27–32 The main assumption behind these potentials is that the native molecular complexes possess distinct structural features with respect to the non-native structures. For example, the common assumption is that more frequently observed interactions are more important for the complex stability. They are often derived from the inverse Boltzmann statistics but can be combined with entropic terms. 33,34 Scoring with these potentials is generally more computationally efficient and less time demanding compared to the other methods. Hence, the knowledge-based potentials are very suitable for the docking and virtual screening protocols and even though they are derived to reproduce binding poses rather than binding energies, some methods seem to have a good correlation with binding affinities. 13 The last class of scoring functions, the descriptor-based methods, is inspired by machine learning technics. 35–37 These methods use different approaches such as neural networks, random forest or support vector machines to learn from a large set of data. The training process provides a relationship between the observed binding affinity data and the molecular descriptors that can be of any nature, e.g. geometrical, topological, electronic structure-based, etc. However, this new trend provokes some debate. First, the reasons for selecting a certain combination of descriptors are not clearly specified. Second, the not necessarily linear relationship between the descriptors is not physically interpretable. 38 In general, modern scoring functions demonstrate relatively high success rates for the identification of correct binding poses for different targets. 39 However, no high-degree correlation between predicted and experimental binding affinities is achieved. 40 Thus, accurate prediction of putative activity of various compounds remains to be a challenge. 5,10,31,41,42 In order to assess the predictive quality of a scoring function, one can validate and evaluate its performance on the structural or affinity benchmarks. Thus, we took the opportunity

4 ACS Paragon Plus Environment

Page 4 of 28

Page 5 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

of the CSAR 2013-2014 Benchmark Exercise to test our new knowledge-based scoring function Convex-PL. The benchmark exercise was focused on the assessment of the abilities of various methods to predict binding poses and affinities for a number of protein-ligand complexes. More specifically, it consisted in three phases in 2013 and two phases in 2014. For the first phase of 2014 and the second phase of 2013, participants were given a set of docking decoys from different proteins (two in 2013, 24 in 2014) and were asked to identify the near-native poses. Phase I of 2013 consisted in ranking predicted structures starting from their sequences and the goal of Phase III was to predict the binding affinity of the three top poses of ten complexes. Finally, Phase II of 2014 was about docking and scoring the same proteins as in Phase I with a congeneric series of compounds, starting from a PDB structure and a list of SMILES codes. More generally, this paper focuses on the assessment of the ability of our knowledgebased function called Convex-PL deduced solely from the structural data to predict not only the correct docking poses, but also the binding affinities. Convex-PL is an atomlevel linear pairwise distance-dependent scoring function derived such that its value can discriminate native binding poses from the non-native ones. It is deduced from a structural database of protein-ligand complexes by solving an optimization problem, which maximizes the energy separation between the native and non-native poses. Being inspired by support vector machines technics and already applied to protein-protein docking problems, 32 our scoring function can be seen as an hybrid method, which combines physically interpretable descriptors with convex optimization of a well-defined problem.

Method The Convex-PL scoring function used for the pose and binding affinity predictions is a pair-wise sum of distance-dependent interactions between atoms forming the interface, as explained below in more detail. The interactions are not only distance-dependent, but also

5 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 28

take into account the atom types. Depending on chemical properties, we define 48 types, and thus we use 48 × 47/2 = 1128 different distance-dependent interactions, which are deduced from structural databases with the help of an optimization procedure. The first step of the procedure is the generation of the non-native complexes (decoys). The next step is the extraction of geometric features from the structural data of native and non-native complexes. These features are the number density of interactions of a certain type at a certain distance. Then, an optimization algorithm is used to compute the optimal potentials such that the derived scoring functions can correctly discriminate the native complexes from the non-native ones generated at the first step of the procedure. Figure 1 summarizes the steps of the training process. Below, we first explain the design of the optimization problem along with the underlying physical assumptions. Then, we describe how we generated the data used to learn the potentials. Finally, we describe the typization of the protein–ligand interactions. classificator indicating natives/ non natives

rmax: cut-off distance

structural data of known complexes structural data of generated decoys

Extraction of features Pre-processing

X : (xij , yij )

, regularization parameter

Optimization of the Potentials

w knowledge-based potentials

geometric features

Figure 1: Schematic representation of the procedure for the computation of the knowledgebased potentials.

Theoretical Foundation Let us consider N native protein–ligand complex configurations Pi0nat , i = 1...N . For each native complex configurations Pi0nat we can generate D non-native configurations (decoys), Pijnonnat , j = 1...D, where the first index runs over different protein complexes and the second 6 ACS Paragon Plus Environment

Page 7 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

index runs over generated decoys. Our aim is to find a scoring functional F , defined for all possible configurations (the set P ), such that for each native complex i0 and its nonnative decoy ij (j > 0) the following inequality holds,

F (Pi0nat ) < F (Pijnonnat )

(1)

Generally, this is a very difficult problem. However, it can be simplified greatly under certain assumptions. First, we assume that F depends only on the interface between the protein and the ligand. The interface is a set of atom pairs at a distance smaller than a certain cutoff distance rmax such that the first atom in each pair belongs to the protein and the second atom in each pair belongs to the ligand. We use the value of the cutoff distance of 10 Å, which was adapted from previous works. 43–50 It has been previously demonstrated that even smaller distances of 6 Å work somewhat well and that the quality of the scoring function does not improve after the cutoff distance of 12 Å. Second, we assume that both a protein and a ligand can be represented as a set of discrete interaction sites located at the centers of atomic nuclei. All interaction sites are divided into M types according to the properties of corresponding atomic nuclei, such as element symbol, aromaticity, hybridization, and polarity, resulting in a total of M × (M + 1)/2 pairs of interactions. An interaction between two sites can be also regarded as an interaction between two atoms of certain types. Third, we assume that F depends only on the distribution of the distances between the interaction sites, i.e. the number of site pairs at a certain distance,

F (P ) ≡ F (n11 (r), .., nkl (r), .., nM M (r)) ≡ F (n(r)),

(2)

where nkl (r) is the number density of site-site pairs at a distance r between two sites k and l, with site k on the protein, and site l on the ligand, where M is the total number of interaction types. Finally, we assume that F is a linear functional, F (αn1 (r) + βn2 (r)) = 7 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 28

αF (n1 (r)) + βF (n2 (r)). We write the functional F (n(r)) fulfilling these assumptions as

11

kl

F (n(r)) ≡ F (n (r), .., n (r), .., n

MM

M X M rZmax X (r)) = nkl (r)U kl (r) dr, k=1 l=k

(3)

0

where U kl (r) are unknown functions that are deduced from the training set of protein-ligand complexes. From now on, we will call these functions scoring potentials. We compute the sitesite number densities for each protein-ligand complex nkl (r) as a sum of Gaussian-distributed distances between interaction sites of types k and l as

nkl (r) =

X

e−

(r−rij )2 2σ 2

,

(4)

uv

where σ is the standard deviation, which is constant for all distributions. The sum is taken over all pairs of sites uv of types k and l separated by a distance rij smaller than a certain threshold distance rmax, with the first site located on the protein, and the second site located on the ligand. The goal of the Gaussian distribution in Eq. 4 is to smooth the structural statistics of pair-wise distances rij , which can otherwise have sharp features and various inaccuracies. In order to determine unknown scoring potentials U kl (r) (Eq. 3), we decompose them along with the number densities nkl (r) in a polynomial basis, U kl (r) =

X

wqkl ψq (r),

r ∈ [0; rmax ]

q kl

n (r) =

X

(5) xkl q ψq (r),

r ∈ [0; rmax ],

q

where ψq (r) are orthogonal basis functions on the interval [0; rmax ], and wqkl with xkl q are the expansion coefficients of U kl (r) and nkl (r), respectively. Given this, the scoring functional

8 ACS Paragon Plus Environment

Page 9 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

F can be expanded up to the order Q as

F (P ) ≈

Q M X M X X k=1 l=k

wqkl xkl q = (w · x),

w, x ∈ RQ×M ×(M +1)/2

(6)

q

We will refer to the vector w as to the scoring vector, whose value is to be determined, and to the vector x as to the structure vector that is computed from the structural data. Finally, we can rewrite the set of inequalities (1) as a convex optimization problem, which consists in finding w that minimizes the empirical risk with a regularization penalty 51 as

min w

λ Ω(w) + L(X; w), 2

(7)

where X = (xij , yij ) is the training set of labeled structure vectors, yij indicating whether the structure Pij is native or not, L(X; w) is the empirical risk function, and Ω(w) is the regularization penalty, which is aimed to prevent overfitting and to compensate the lack of statistics for rare events. We used the hinge-loss function to describe the empirical risk and the two-norm for the regularization. Parameter λ is the only adjustable parameter in our model that determines the importance of the regularization term with respect to the empirical risk. Its optimal value is determined using the cross validation procedure. We solved problem (6) using an iterative block sequential minimal optimization algorithm as it is explained elsewhere. 32 This is a robust convex optimization procedure proved to converge towards the optimum independently of the starting configuration, since our problem (6) is convex by construction, which guarantees the existence of the single minimum. 51,52

Training Solution of optimization problem (7) requires a generation of native and non-native structure vectors. We collected all the structural information for the problem (7) from the PDBBind 53,54 database, which provides experimentally determined protein-ligand complexes de9 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 28

posited in the Protein Data Bank supplied with the measured binding affinity data. For the training, we used 6004 complexes from the ‘general set’ of PDBBind release 2011, which contains three-dimensional structures of resolution equal to or better than 2.5 Å for 6051 protein-ligand complexes along with the corresponding binding data, which includes Kd , Ki , and IC50 values. In the current version of our potential we used only the three-dimensional information from this database, i.e. no experimental binding affinity data was used. Generation of the decoy conformations is based on the idea from our recent work,32 where we demonstrated that structural information collected from native and near-native proteinprotein complexes allows to construct powerful predictive models of protein-protein interactions. The near-native complex was defined as a complex with a small ligand-RMSD value, typically about 1 Å, with respect to the native structure. For the protein-protein case, we generated near-native protein-protein conformations using deformations of the na-tive structure along some finite number of collective motions computed using the Normal Mode Analysis.32 For the protein-ligand case, however, in order to generate near-native conformations, we consider ligand molecules as rigid bodies and rotate them about some axes defined such that the ligand-RMSD from the native pose is kept constant. To do so, we chose six axes inside a unit sphere corresponding to its icosahedral tessellation. More pre-cisely, to generate the axes, we first aligned the principal axes of inertia of the icosahedron with the coordinate axes, and then connected its opposite vertices. The weighted RMSD for a pure rotation about axis n by an angle α is55

RMSD2 =

α 4 sin2 I(n), M 2

(8)

where the ligand molecule of mass M is considered as a rigid body, whose inertia tensor relative to the axis n passing through its center of mass is given as

I(n) = nT In

10 ACS Paragon Plus Environment

(9)

Page 11 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

To obtain a set of decoys with a certain ligand-RMSD from the native structure, we first rotated the ligand about each rotational axis by an angle of ±α and then translated along coordinate axes by the lengths of ±RMSD. Thus, for each native structure we generated 18 decoy conformations, which means that the total amount of the training structure vectors was (18 + 1) × 6004 = 114, 076. In order to determine the optimal value of decoy RMSD, we carried out the two-fold cross-validation procedure using the training data. More precisely, we split the training data into two sets, and solved optimization problem (7) while scanning through different parameters of RMSD and the regularization parameter λ. We found that, in principle, we can use any value of RMSD for decoy generation inside the [0.2 Å, 1.0 Å] interval, provided that the value of the regularization parameter λ is chosen correspondingly.

Atom Types Our knowledge-based potential requires definition of the atom types. We assigned the types to the atoms of both proteins and small molecules according to their physical and chemical properties as well as their functional groups. To do so, we started with 158 internal atom types provided by the fconv library 56 and grouped them by measuring the statistical similarity of distance distribution functions between different atom types in the training data set. Overall, our parameterization consists of 48 atom types. More precisely, we have 17 types for nitrogen, 9 types for oxygen, 8 types for carbon, 4 types for sulphur, 2 types for phosphorus, and 8 types for halogens. Table S1 from Supporting Information lists all the used atom types along with the chemical properties of the corresponding atoms.

Pose and Affinity Predictions This sub-section presents the prediction methods used during the CSAR experiments. Apart from the deduced Convex-PL scoring function, our docking pipeline also includes alreadyexisting software packages as described below.

11 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Generation of Candidate Poses To generate putative binding poses of a ligand with a protein, we used an in-house modification of the AutoDock Vina software package with the default scoring function. 6 Our modifications concern the exhaustiveness of its sampling and allowed us to generate 10,000 docking solutions for the subsequent re-scoring. To start the Vina docking, we detected the binding pockets of each protein-ligand complexes by a rough search over the protein surface. We chose all the ligands and some of the residues inside the binding pockets to be flexible during the docking procedure. If a ligand was given using the SMILES code, we used the Open Babel library 57 to create the initial three-dimensional model. If no three-dimensional protein structure was available, we modeled it using the MODELLER package. 58 We re-scored the docking solutions with the Convex-PL scoring function as described below, without any further structure optimization.

Scoring Procedure We scored the protein-ligand binding poses using an approximation to the binding free energy as given by Eq. 6. More technically, for each of the poses, we first computed the corresponding structure vector x using Eqs. 4 and 5, which describes the interfaces of each of the complexes. Then, using Eq. 6, we computed the approximation to the binding free energy as a scalar product between vectors x and w. Finally, we ranked the poses according to the computed values. To see if our scoring function can be used in binding affinity predictions, we also computed a set of control protein-ligand complex’ descriptors. 5,10 These are the number of atoms in the ligand, the difference of the solvent accessible surface (SAS) upon binding, and the difference of the molecular surface upon binding. We computed these descriptors using PyMOL ’get_area’ and ’count_atoms’ commands. 59 For the surface calculations, we used the default PyMOL approximation of the atomic volumes with dots.

12 ACS Paragon Plus Environment

Page 12 of 28

Page 13 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Details of the Implementation We implemented the scoring method using the C++ programming language and compiled it using the clang compiler. We ran the program on a MacBook Pro laptop.

Results and Discussion In the Pose Prediction section we present results of the Phase II of CSAR Exercise 2013 and results of the Phases I and II of CSAR Exercise 2014, where the goal was to determine the structure or to predict which decoys correspond to the near-native poses for several proteinligand complexes. In the Affinity Prediction section we present results of the Phases I and III of CSAR Exercise 2013 and results of the Phases II of CSAR Exercise 2014, where the goal was to predict binding affinities for a set of protein-ligand complexes. Table 1 summarizes the goals and the methods of each exercise. More precisely, we introduce three types of methods, (i) generation of a set of decoys for the subsequent scoring, (ii) scoring of the provided decoys with the Convex-PL potential, and (iii) affinity prediction of the provided decoys with the Convex-PL potential. In each exercise, in order to generate docking poses or score the decoys, we used the same structure-based methodology, as described in Methods section. Table 1: Summary of the CSAR 2013-2014 Exercises. We also specify which target proteins and which exercises correspond to the different phases. CSAR Exercise Associated targets Pose prediction (docking) Pose prediction (scoring) Affinity prediction

Phase I

2013 Phase II

Phase III

DIG 1-10/12-14/17-19

DIG 18/20

DIG 10

×

2014 Phase I Phase II FXa/Syk/TrmD

FXa/Syk/TrmD

×

× × ×

× ×

×

×

13 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 28

Pose Predictions Second Phase 2013 Here we present results of the second phase of the 2013 exercise. Given two protein structures of a derivative of the steroid digoxigenin (DIG) and the corresponding ligand decoys, the goal was to determine which decoys correspond to the near-native poses. Decoys were generated using the DOCK (version 6.5) program.60 A diverse set of 200 poses was then selected with MOE 2011.10.61 To make a clear distinction between native and decoy poses, all the poses with RMSD < 2 Å were discarded by the organizers except for the near-native pose (RMSD < 1.0 Å). To determine the native poses, we only use our standard scoring procedure. Figure 2 shows the obtained scores of the decoys versus their RMSD from the native pose. As we can see, we correctly determined the near-native poses for the two sets of decoys. To see if our scoring function can be potentially used for a subsequent gradient-based refinement of binding poses, we did a further analysis and computed the Pearson r correlation coefficient, which measures a linear relationship, and Spearman ρ correlation coefficient, which measures a monotonic relationship, between the scores and the corresponding RMSD from the native pose only for the decoys with RMSD < 10 Å. We keep only these decoys in the analysis to see if our scoring function can potentially improve the quality of the near-native predictions by the subsequent pose refinement. Table 2 lists the obtained correlation c oefficients. As we can see from this table, the correlation coefficients are fairly high, especially for the DIG18 complex, showing an effective linear or non-linear correlation between the two sets of values. Thus, the Convex-PL scoring function can also be potentially used for pose refinement.

First Phase 2014 We repeated the same procedure in the first phase of the 2014 exercises. The goal of this first phase was the same as in the previous year. More precisely, given three Factor Xa complexes

14 ACS Paragon Plus Environment

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Score

Page 15 of 28

10 0 10 20 30 40

DIG18

0

5

RMSD,

10

0 10 20 30 40 50

15

DIG20

0

5

10 RMSD,

15

20

Figure 2: Analysis of pose predictions in the second phase of CSAR 2013 exercise. The value of the Convex-PL potential is plotted for each decoy as a function of the RMSD to the native ligand pose, which has RMSD of 0 Å. A vertical line separates the near-native decoys with RMSD lower than 10 Å from the other decoys. Table 2: Analysis of near-native pose predictions in the second phase of CSAR 2013 exercise. Pearson and Spearman correlation coefficients between the values of the Convex-PL potential and the corresponding RMSD to the native ligand pose are listed keeping only the near-native decoys with RMSD < 10 Å. Complex DIG18 DIG20

Pearson r 0.61 0.33

Spearman ρ 0.68 0.2

sample size 23 58

(FXa), five human spleen tyrosine kinase (Syk) complexes, and 14 tRNA methyltransferase (TrmD) complexes with 200 pre-generated decoys of the ligand, the goal was to determine the native decoy for each complex. The proteins were prepared using the MOE 2011.10 61 with MMFF94x 62 force field and AM1-BCC charges for the l igands. The docked p oses were generated with DOCK (version 6.5). 60 Similarly to the 2013 exercise, we did not do any pose refinement and only scored the given decoys using our scoring function. Figure 3 shows the scores of the poses versus their RMSD from the native pose, which is the point with the null RMSD. Here, we correctly determined the native poses for 22 out of 24 complexes. In two other cases, we ranked the native pose second. To analyze the behavior of the scoring function in more details, we computed the Pearson r and Spearman ρ correlation coefficients between the scores and the corresponding RMSD values for all the decoys with RMSD lower than 10 Å to test if our method can be used in pose refinement. Table 3 lists the obtained correlation coefficients for these near-native decoys. Again, we obtained rather high correlation coefficients generally 15 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 28

greater than 0.5 with a few exceptions for the FXA_gtc398 complex and several TrmD complexes. This analysis again suggests that our scoring function can, in principle, be used for a gradient-based binding pose optimization.

Figure 3: Analysis of pose predictions in the first phase of CSAR 2014 e xercise. The value of the Convex-PL potential is plotted for each decoy as a function of the RMSD to the native ligand pose, which has RMSD of 0 Å. A vertical line separates the near-native decoys with RMSD lower than 10 Å from the other decoys. In the first row, results for the three FXa targets with different bound ligands are plotted. In the second row, results for the five Syk targets with different bound ligands are plotted. In the bottom three rows, results for the 14 TrmD targets with different bound ligands are plotted.

Second Phase 2014 The goal of the second phase of the 2014 exercise was to predict the structures of several receptor-ligand complexes and to rank the compounds within congeneric series by the binding 16 ACS Paragon Plus Environment

Page 17 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table 3: Analysis of pose predictions in the first p hase o f C SAR 2 014 e xercise. Pearson and Spearman correlation coecients between the values of the Convex-PL potential and the corresponding RMSD to the native ligand pose are listed keeping only the near-native decoys with RMSD < 10 Å. Data for FXa, Syk, and TrmD complexes with different bound ligands are listed. Complex Pearson r FXA_gtc101 0.56 FXA_gtc398 0.46 0.53 FXA_gtc401 SYK_gtc224 0.84 SYK_gtc225 0.77 0.71 SYK_gtc233 SYK_gtc249 0.65 0.59 SYK_gtc250 TRMD_gtc445 0.32 TRMD_gtc446 0.33 TRMD_gtc447 0.3 0.5 TRMD_gtc448 TRMD_gtc451 0.67 TRMD_gtc452 0.42 0.54 TRMD_gtc453 TRMD_gtc456 0.66 0.31 TRMD_gtc457 TRMD_gtc458 0.39 TRMD_gtc459 0.51 TRMD_gtc460 0.64 0.4 TRMD_gtc464 TRMD_gtc465 0.43

Spearman ρ 0.53 0.43 0.49 0.84 0.65 0.69 0.63 0.35 0.15 0.19 0.18 0.49 0.6 0.38 0.45 0.6 0.07 0.31 0.46 0.47 0.25 0.33

17 ACS Paragon Plus Environment

sample size 139 145 111 112 97 111 107 83 88 86 94 79 73 75 72 95 44 91 80 63 78 98

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 28

affinity. Overall, we succeeded in the pose prediction for the targets FXa and Syk (with mean RMSD of ligand poses to the experimental structure 1.73 Å and 1.9 Å, respectively). However, we missed the binding site for the TmrD target (the corresponding mean RMSD is 10.6 Å). This is due to the blind search methodology that we used without any additional information about homologous complexes or the binding sites of the same protein. Table 4 lists the detailed information about the RMSD of the highest-scored poses with respect to the experimental structure of the ligand. Table 4: RMSD of the best-scoring pose with respect to the experimental structure of the ligand in the second phase of CSAR 2014 exercise. Data for FXa, Syk, and TrmD complexes with different bound ligands are listed. Complex FXA_gtc101 SYK_gtc224 SYK_gtc225 SYK_gtc233 SYK_gtc249 SYK_gtc250 TRMD_gtc445 TRMD_gtc446 TRMD_gtc447 TRMD_gtc448 TRMD_gtc451 TRMD_gtc452 TRMD_gtc453 TRMD_gtc456 TRMD_gtc457 TRMD_gtc458 TRMD_gtc459 TRMD_gtc460 TRMD_gtc464 TRMD_gtc465

RMSD, Å 1.73 1.94 1.58 0.34 1.74 4.05 9.71 9.23 9.34 9.93 11.87 9.70 9.75 13.81 9.91 10.33 9.90 10.92 9.77 13.79

18 ACS Paragon Plus Environment

Page 19 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Affinity Predictions First Phase 2013 Given 16 sequences that had been designed to bind a derivative of the steroid digoxigenin (DIG), the goal of this exercise was to predict which ones bind DIG and to rank the predictions by affinity. Here, we used our standard structure-based methodology. First, we modeled the structure of the proteins. Then, we docked the DIG ligand on the protein models and kept 10,000 poses for the final re-scoring. After, we re-scored the putative binding poses with our scoring function and selected the best binding pose for each protein. Finally, we ranked the proteins by the value of the best binding pose. As the results, we gave the active sequences the following ranks, 2, 4, 6, and 7.

Third Phase 2013 Here the goal was to predict the binding affinity (or relative ranking) for each ligand with a protein and to submit the top three poses for each protein-ligand complex (for a total of 30 files) in a PDB format file. For the affinity predictions, we used the standard structurebased methodology. Table 5 lists the Pearson r and the Spearman ρ correlation coefficients between the predicted and the experimental affinities for the three best binding poses. We obtained a significant correlation with the Spearman ρ greater than 0.5. To see if our scoring procedure is any better compared to very naive approaches that measure the binding affinity using the ligand weight, 5 or the difference of the solvent-accessible surface upon binding (∆ SAS), 10 we performed a control set of computational experiments. More precisely, we measured the correlation of the observed binding affinities with the number of atoms in the ligands, ∆ SAS , and the difference of molecular surface upon binding (∆ Molecular Surface) for the predicted binding poses. Table 5 demonstrates that the Convex-PL correlations are significantly better compared to the control values.

19 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 28

Table 5: Analysis of affinity predictions in the third phase of CSAR 2013 exercise. The Pearson and Spearman correlation coefficients between the values of the Convex-PL potential and the experimental pKd constants are listed for the best three putative binding poses. As a reference, control correlation coefficients for the number of ligand atoms, ∆ SAS and ∆ Molecular Surface are listed. In the last column, correlation coefficients between the sum of the scores over the three poses and the experimental pKd constants are listed. Method Convex-PL ∆ Molecular Surface ∆ SAS Number of ligand atoms

Pose 1 Pearson r Spearman ρ 0.464 0.6 0.11 -0.09 0.36 0.27 0.32 0.45

Pose 2 Pearson r Spearman ρ 0.466 0.685 0.11 -0.13 0.39 0.31 0.32 0.45

Pose 3 Pearson r Spearman ρ 0.448 0.6 0.10 -0.09 0.02 -0.38 0.32 0.45

Sum over three poses Pearson r Spearman ρ 0.462 0.700 0.11 -0.09 0.27 0.09 0.32 0.45

Second Phase 2014 In this phase, the structures of the target in the PDB format from Phase I along with the SMILES strings for the ligands were provided. The goal of this exercise was to predict the structures of the receptor-ligand complexes and to rank the compounds within congeneric series. Here we discuss the ability of the Convex-PL potential to predict the binding affinity for five sets of complexes, FXa (3 sets), Syk and TrmD. Figure 4 presents scatter plots of the log-values of our scoring potential as a function of measured pIC50 constants for each of the sets. The detailed correlation characteristics along with the control values are listed in Table 6 for Pearson r and Spearman ρ values. Despite rather good predictions of the correct poses for the FXa complexes (see Table 4), the affinity predictions appeared to be much more challenging, especially for the two first sets where the obtained correlation coefficients are around zero. However, for FXa set3, Syk, and TrmD, the value of Spearman ρ is respectively 0.35, 0.27, 0.52, showing a correlation, even weak, between the values of the scoring potential and experimental affinities. Interestingly, we obtained a high correlation for the TrmD complex, for which no correct binding poses were produced. We should draw readers attention, however, to the fact that only for the FXa set3 our scoring method produced correlations superior to the control values. The bad results on the FXa sets, especially the set1 and set2, can be partially explained by the specificity of the complex, which has solvent-exposed binding pockets. Furthermore, all FXa

20 ACS Paragon Plus Environment

Page 21 of 28

crystal structures are missing several domains.

Overall, we believe that more accurate

pose predictions improve the correlation between the predicted and experimental affinities, as suggested by the results of the 2013 experiment.

FXA set1 log(Score)

7.4 7.2

8.0

7.8

7.8

7.6

7.6

7.4

7.4

7.2

7.2

6.0 6.5 7.0 7.5 8.0 8.5

7.0 4

SYK

7.0

7.5

FXA set2

8.0

7.6

log(Score)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

6

8

10

7.0 4

FXA set3

6

8

10

TRMD

6.5

7.0

6.0 6.5 6.0 4

5.5 6

pIC50

8

10

5.0 2

4

6 pIC50

8

10

Figure 4: Analysis of affinity predictions in the second phase of CSAR 2014 exercise. The log-value of the Convex-PL potential is plotted as a function of experimentally measured pIC50 constant for three sets of FXa complexes, one set of Syk complexes and one set of TrmD complexes.

Conclusions We have assessed the performance of our knowledge-based potential Convex-PL on the CSAR docking exercise 2013-2014. The potential was derived using only structural information from 6,004 training protein-ligand examples found in the PDBBind database. Not surprisingly, our potential demonstrated a very good performance in near-native pose detection exercises. To a certain level of accuracy, our method can also predict the near-native binding poses using the exhaustive blind search. Somewhat more surprisingly, it has also a fair performance 21 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 28

Table 6: Analysis of affinity predictions in the second phase of CSAR 2014 exercise. The Pearson and Spearman correlation coefficients between the values of the Convex-PL potential and the experimentally measured pIC50 constants are listed for each set of the complexes. As a reference, control correlation coefficients for the number of ligand atoms, ∆ SAS and ∆ Molecular Surface are listed. Complex FXa set1 FXa set2 FXa set3 Syk TrmD

Convex-PL Pearson r Spearman ρ -0.06 0.02 -0.05 -0.04 0.3 0.35 0.29 0.27 0.54 0.52

Number of ligand atoms Pearson r Spearman ρ 0.43 0.39 -0.06 -0.05 -0.13 -0.09 0.13 0.11 0.65 0.48

∆ Molecular Surface Pearson r Spearman ρ 0.29 0.17 0.05 0.09 0.03 0.08 0.30 0.28 0.66 0.48

∆ SAS Pearson r Spearman ρ 0.42 0.40 0.08 0.10 -0.11 -0.06 0.02 0.07 -0.39 -0.28

in some of the exercices that rank affinities. More precisely, the values of our potential for the near-native poses provide a fair correlation with the experimentally measured binding affinities, as the 2013 exercise demonstrated (Spearman ρ=0.6). In the 2014 exercise we only obtained a significant correlation for the TrmD set with ρ=0.52, where we did not dock the ligands into the right binding site with the mean RMSD of 10.6 Å, for the Syk set with ρ=0.27, and for the FXa set3 with ρ=0.35. However, only in the 2013 exercise and FXa set3 of the 2014 exercise the obtained values were significantly better compared to the control set of correlations. For two other FXa sets correlations were zero. We should also note that we only submitted binding affinities for 174 ligands out of 276 in the Syk set, however, we do not think it affects the overall behavior of our ranking. We should emphasize that we trained our potential only with the structural information, i.e. no binding affinity data was used. Nonetheless, the Convex-PL potential demonstrated a fair predictive power when also applied to the prediction of binding affinities. However, we believe that the performance of our ranking method was limited by the inaccuracies in the pose predictions. Thus, in the future, we will study if a gradient-based binding pose optimization using our potential improves the quality of the binding poses and if it also improves the ranking of poses by affinity.

22 ACS Paragon Plus Environment

Page 23 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Acknowledgement This work was supported by the Agence Nationale de la Recherche, grants ANR-2010-JCJC0206-01 and ANR-11-MONU-006-01. GC and PP thank the 5TOP100 program of the Ministry for Education and Science of the Russian Federation and the framework of ERA.Net RUS PLUS grant (ID 323, Russian Federal Target Program "Research and Development" contract 14.587.21.0011, RFMEFI58715X0011).

23 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

References (1) Wang, R.; Lu, Y.; Wang, S. J. Med. Chem. 2003, 46, 2287–2303. (2) Jorgensen, W. L. Science 2004, 303, 1813–1818. (3) Warren, G. L.; Andrews, C. W.; Capelli, A.-M.; Clarke, B.; LaLonde, J.; Lambert, M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S. et al. J. Med. Chem. 2006, 49, 5912–5931. (4) Jorgensen, W. L. Acc. Chem. Res. 2009, 42, 724–733. (5) Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R. J. Chem. Inf. Model. 2009, 49, 1079–1093. (6) Trott, O.; Olson, A. J. J. Comput. Chem. 2010, 31, 455–461. (7) Cheng, T.; Li, Q.; Zhou, Z.; Wang, Y.; Bryant, S. H. The AAPS J. 2012, 14, 133–141. (8) Damm-Ganamet, K. L.; Smith, R. D.; Dunbar Jr, J. B.; Stuckey, J. A.; Carlson, H. A. J. Chem. Inf. Model. 2013, 53, 1853–1870. (9) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Nat. Rev. Drug Discov. 2004, 3, 935–949. (10) Li, Y.; Han, L.; Liu, Z.; Wang, R. J Chem Inf Model 2014, 54, 1717–36. (11) Smith, R. D.; Dunbar, J., J. B.; Ung, P. M.; Esposito, E. X.; Yang, C. Y.; Wang, S.; Carlson, H. A. J. Chem. Inf. Model. 2011, 51, 2115–2131. (12) McInnes, C. Curr. Opin. Chem. Biol. 2007, 11, 494–502. (13) Zheng, Z.; Merz, K. M. J. Chem. Inf. Model. 2013, 53, 1073–1083, PMID: 23560465. (14) Liu, J.; Wang, R. J. Chem. Inf. Model. 2015, 55, 475–482, PMID: 25647463.

24 ACS Paragon Plus Environment

Page 24 of 28

Page 25 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(15) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. J. Comput. Chem. 1983, 4, 187–217. (16) Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J. J. Am. Chem. Soc. 1996, 118, 11225–11236. (17) Ewing, T. J.; Makino, S.; Skillman, A. G.; Kuntz, I. D. J. Comput.-Aided Mol. Des. 2001, 15, 411–428. (18) Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. J. Comput. Chem. 2005, 26, 1668–1688. (19) Hess, B.; Kutzner, C.; Van Der Spoel, D.; Lindahl, E. J. Chem. Theory Comput. 2008, 4, 435–447. (20) Kuhn, B.; Gerber, P.; Schulz-Gasch, T.; Stahl, M. J. Med. Chem. 2005, 48, 4040–4048, PMID: 15943477. (21) Chaskar, P.; Zoete, V.; Röhrig, U. F. J. Chem. Inf. Model. 2014, 54, 3137–3152, PMID: 25296988. (22) Böhm, H.-J. J. Comput.-Aided Mol. Des. 1994, 8, 243–256. (23) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. J. Comput.Aided Mol. Des. 1997, 11, 425–445. (24) Wang, R.; Lai, L.; Wang, S. J. Comput.-Aided Mol. Des. 2002, 16, 11–26. (25) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K. et al. J. Med. Chem. 2004, 47, 1739–1749. (26) Korb, O.; Stutzle, T.; Exner, T. E. J. Chem. Inf. Model. 2009, 49, 84–96. 25 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(27) Muegge, I.; Martin, Y. C. J. Med. Chem. 1999, 42, 791–804. (28) Mooij, W.; Verdonk, M. L. Proteins: Struct., Funct., Bioinf. 2005, 61, 272–287. (29) Huang, S.-Y.; Zou, X. Annu. Rep.- Comp. Chem. 2010, 6, 280–296. (30) Zhou, H.; Skolnick, J. Biophys. J. 2011, 101, 2043–2052. (31) Huang, S.-Y.; Zou, X. J. Chem. Inf. Model. 2011, 51, 2097–2106. (32) Popov, P.; Grudinin, S. J. Chem. Inf. Model. 0, 0, null, PMID: 26353078. (33) Gohlke, H.; Hendlich, M.; Klebe, G. J. Mol. Biol. 2000, 295, 337–356. (34) Huang, S.-Y.; Zou, X. J. Chem. Inf. Model. 2010, 50, 262–273, PMID: 20088605. (35) Kinnings, S. L.; Liu, N.; Tonge, P. J.; Jackson, R. M.; Xie, L.; Bourne, P. E. J. Chem. Inf. Model. 2011, 51, 408–419, PMID: 21291174. (36) Zilian, D.; Sotriffer, C. A. J. Chem. Inf. Model. 2013, 53, 1923–1933, PMID: 23705795. (37) Li, G.-B.; Yang, L.-L.; Wang, W.-J.; Li, L.-L.; Yang, S.-Y. J. Chem. Inf. Model. 2013, 53, 592–600, PMID: 23394072. (38) Gabel, J.; Desaphy, J.; Rognan, D. J. Chem. Inf. Model. 2014, 54, 2807–2815, PMID: 25207678. (39) Sotriffer, C. In Protein-Ligand Interactions, First Edition; Wiley Online Library, 2012; Chapter Scoring Functions for Protein–Ligand Interactions, pp 237–263. (40) Sotriffer, C.; Matter, H. In Virtual Screening: Principles, Challenges, and Practical Guidelines, 10th ed.; Wiley Online Library, 2011; Chapter The Challenge of Affinity Prediction: Scoring Functions for Structure-Based Virtual Screening, pp 177–221. (41) Cross, J. B.; Thompson, D. C.; Rai, B. K.; Baber, J. C.; Fan, K. Y.; Hu, Y.; Humblet, C. J. Chem. Inf. Model. 2009, 49, 1455–1474. 26 ACS Paragon Plus Environment

Page 26 of 28

Page 27 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(42) Hsieh, J.-H.; Yin, S.; Liu, S.; Sedykh, A.; Dokholyan, N. V.; Tropsha, A. J. Chem. Inf. Model. 2011, 51, 2027–2035. (43) Huang, S.-Y.; Zou, X. Proteins: Struct., Funct., Bioinf. 2008, 72, 557–579. (44) Chuang, G.-Y.; Kozakov, D.; Brenke, R.; Comeau, S. R.; Vajda, S. Biophys. J. 2008, 95, 4217–4227. (45) Maiorov, V. N.; Grippen, G. M. J. Mol. Biol. 1992, 227, 876–888. (46) Qiu, J.; Elber, R. Proteins: Struct., Funct., Bioinf. 2005, 61, 44–55. (47) Rajgaria, R.; McAllister, S.; Floudas, C. Proteins: Struct., Funct., Bioinf. 2006, 65, 726–741. (48) Tobi, D.; Bahar, I. Proteins: Struct., Funct., Bioinf. 2006, 62, 970–981. (49) Ravikant, D.; Elber, R. Proteins: Struct., Funct., Bioinf. 2010, 78, 400–419. (50) Chae, M.-H.; Krull, F.; Lorenzen, S.; Knapp, E.-W. Proteins: Struct., Funct., Bioinf. 2010, 78, 1026–1039. (51) Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press, New York, 2004. (52) Nocedal, J.; Wright, S. Numerical Optimization; Springer Series in Operations Research and Financial Engineering; Springer New York, 2000. (53) Wang, R.; Fang, X.; Lu, Y.; Yang, C.-Y.; Wang, S. J. Med. Chem. 2005, 48, 4111–9. (54) Wang, R.; Fang, X.; Lu, Y.; Wang, S. J. Med. Chem. 2004, 47, 2977–80. (55) Popov, P.; Grudinin, S. J. Comput. Chem. 2014, 35, 950–956. (56) Neudert, G.; Klebe, G. Bioinformatics 2011, 27, 1021–1022.

27 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(57) OLBoyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. J. Cheminform. 2011, 3, 33. (58) Webb, B.; Sali, A. Curr. Protoc. Bioinformatics 2014, 5–6. (59) Schrödinger, LLC, 2011. (60) Moustakas, D. T.; Lang, P. T.; Pegg, S.; Pettersen, E.; Kuntz, I. D.; Brooijmans, N.; Rizzo, R. C. J. Comput.-Aided Mol. Des. 2006, 20, 601–619. (61) Molecular Operating Environment (MOE), 2013.08 ; Chemical Computing Group Inc.: 1010 Sherbooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2015. (62) Halgren, T. A. J. Comput. Chem. 1996, 17, 490–519.

28 ACS Paragon Plus Environment

Page 28 of 28