Generation of Pairwise Potentials Using Multidimensional Data Mining

Sep 5, 2018 - The rapid development of molecular structural databases provides the chemistry community access to an enormous array of experimental dat...
0 downloads 0 Views 8MB Size
Subscriber access provided by University of South Dakota

Statistical Mechanics

Generation of Pairwise Potentials Using Multi-Dimensional Data Mining Zheng Zheng, Jun Pei, Nupur Bansal, Hao Liu, Lin Frank Song, and Kenneth M. Merz J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.8b00516 • Publication Date (Web): 05 Sep 2018 Downloaded from http://pubs.acs.org on September 8, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Generation of Pairwise Potentials Using MultiDimensional Data Mining Zheng Zheng, Jun Pei, Nupur Bansal, Hao Liu, Lin Frank Song and Kenneth M. Merz Jr.* Department of Chemistry, Michigan State University, 578 S. Shaw Lane, East Lansing, MI 48824 Keywords: Structure-Derived Potential, Graphical Model, Bayesian Field Theory, ProteinLigand Binding.

Abstract The rapid development of molecular structural databases provides the chemistry community access to an enormous array of experimental data that can be used to build and validate computational models. Using radial distribution functions (RDFs) collected from experimentally available X-ray and NMR structures, a number of so-called statistical potentials have been developed over the years using the structural data mining strategy. These potentials have been developed within the context of the two-particle Kirkwood Equation by extending its original use for isotropic monoatomic systems to anisotropic biomolecular systems. However, the accuracy and the unclear physical meaning of statistical potentials have long formed the central arguments against such methods. In this work, we present a new approach to generate molecular energy

ACS Paragon Plus Environment

1

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 92

functions using structural data mining. Instead of employing the Kirkwood Equation and introducing the “reference state” approximation, we model the multi-dimensional probability distributions of the molecular system using graphical models, and generate the target pairwise Boltzmann probabilities using the Bayesian field theory (BFT). Different from the current statistical potentials that mimic the “knowledge-based” PMF based on the 2-particle Kirkwood Equation, the graphical-model-based structure-derived potential developed in this study focuses on the generation of lower-dimensional Boltzmann distributions of atoms through reduction of dimensionality. We have named this new scoring function GARF and in this work we focus on the mathematical derivation of our novel approach followed by validation studies on its ability to predict protein-ligand interactions. The raw GARF potential distributions and the resultant fitted parameter set can be obtained at: https://www.dropbox.com/sh/efcjy5n36ji5y2x/AAApznftbm1SgOjj-QVm5Z7a?dl=0.

Introduction Given the size of biological systems, calculations using quantum mechanics to accurately estimate the energies of these systems, rapidly becomes computationally challenging.1 This realization has long driven the development and application of simple energy models based on classical mechanics or numerical simulations to macromolecular systems.2-10 A simple question, that has long held the attention of theoreticians, is can structural data directly derive energy functions, since the answer to the reverse is clearly yes? X-ray crystallography has long been regarded as the gold standard for structural information to evaluate the performance of energy functions.11-13 Amongst all molecular energy models, structure-derived potentials use data

ACS Paragon Plus Environment

2

Page 3 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

mining methods against molecular structure databases to generate geometric probability distributions to derive representations of molecular systems.

The much debated statistical potential, or knowledge-based potential, is a representative category of structure-derived potentials, which employs the two-particle Kirkwood Equation to generate a potential of mean force (PMF) like distributions representing atom or residue pairs which are then used to estimate the preferences of molecular configurations.14-30 Despite broad application and success in protein design31-36,65-67 and protein-ligand binding studies37-39,68, it remains that current statistical potentials are limited to one-dimensional, or pairwise geometric distributions. In other words, all terms employed in the two-particle Kirkwood Equation, or any variant, are either constants, or single-variable functions using the geometry of the target particles as input. The two-particle Kirkwood Equation can be rewritten as:

g(

2)

( 2) r , r ( 1 2)

( r1 , r2 ) = e− β w

(1)

with w(2)( r1, r2) defined as the PMF for the two particles in the canonical ensemble. With the particles being identical and isotropic, the PMF for two particles can also be evaluated using the Boltzmann distribution (equation 2).  V 2N !  w(2) ( r1 , r2 ) = − RT log  2  − RT log ( P ( r1 , r2 ) )  N ( N − 2 )! 

(2)

where V is the configuration space for each particle, N the number of particles and P(r1,r2) the probability of the degenerate energy levels. Given its probabilistic nature, a two-particle Kirkwood Equation is conceptually a conditioned probability. For instance, an atom pairwise statistical potential is a structural spatial probability

ACS Paragon Plus Environment

3

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 92

distribution under the constraint of the atom pairwise contacts, which can be described using Bayes’ theorem.40 P ( structure | condition ) = =

P ( condition | structure ) P ( structure )

(3)

P ( condition )

P (C | S ) P ( S ) P (C )

Equation 3 illustrates the general functional form of Bayes’ theorem applied to a statistical potential. P(S) is the prior probability of the unconditioned structural feature under study, in many cases it is simulated as a constant or a probability distribution integrating over all particle pairwise probabilities observed in structural database. P(C) describes how nature characterizes the molecular systems under study and has normally been regarded as a constant meaning that structural databases follow uniform connection types and structural characteristics. The likelihood function P(C|S) is the key term in this method. An expansion using Bayes’ theorem against the P(C|S) derives terms that can be generated from structural databases. P (C | S ) =

P ( S | C )0 P ( C )0 P ( S )0



P ( S | C )0 P ( S )0

= ∏∏ i

j

P ( sij | cij ) P ( sij )

0

(4)

0

P(sij|Cij)0 and P(sij)0 in equation 4 are the conditional and unconditional probability functions of each individual atom pair i and j obtained from a database of structures. When compared to the “master equation” for the statistical potential (see equation 5) first proposed by Sippl,35 P(sij|cij)0 represents the radial probability P(rij) directly collected from the structural database, and P(sij)0 represents the radial probability QR(rij) in the “reference state”:

ACS Paragon Plus Environment

4

Page 5 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

 P ( rij )  Z   − RT log  ijR  Eij ( rij ) = − RT log  R  Zij   Q ( rij )       P ( rij )   ∝ − RT log  R  Q ( rij )   

(5)

 P ( sij | cij )   ≈ − RT log   P ( sij )    Given this formulation, the energy is decomposed into individual probabilities representing each atom pair ij, a.k.a. the statistical potentials. Nonetheless, for each target atom pair ij, the condition term cij in equation 4 and 5 is a multivariable function describing not only the geometry of the target atom pair ij, but also other atoms in the training data set, namely the background atoms. The pairwise distribution of sij in such cases is conditioned on the geometry of the entire molecular system, SN, including all the contacts with respect to ij, and the contacts among the background atoms:

(

) (

)

P ( sij | cij ) = P sij | C/ ( SN )ij = P si | c1 ( sij ) , c2 ( sα −ij ) , c3 ( sβ −ij )L, cn ( sα −β ) ,L

(6)

where c1 ( sij ) represents the pairwise contact between the target atoms i and j; c2 ( sα −ij ) and c3 ( sβ −ij ) represent the contact between the background atom, namely α or β, and the target atom

pair ij; cn ( sα −β ) represent the contacts between two background atoms α and β. Modeling of the term P ( sij | cij ) using one or multiple independent one-dimensional functions ignores the correlation between the pairwise contact with respect to the target particles, and the contact from any background atom α regarding the target particle pair. Hence, the two-particle Kirkwood Equation applied to statistical potentials inevitably generates PMF-like distributions for particle pairs, where the background contacts are entangled

ACS Paragon Plus Environment

5

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 92

with the pairwise potentials for the target particles, which renders these potentials training-set dependent. 41 In order to explicitly model a geometric distribution in the presence of a large number of background contacts requires the collection of high-dimensional structural data to study the correlation of all contact types from a structural database. However, due to the “curse of dimensionality” for data analysis, the scarcity of available data soon becomes severe due to the rapid increase of the sampling space volume. No current protein-ligand structural database can provide a large enough data size to support such high-dimensional data mining. In light of these considerations the key for structure-derived potential development is to create new data mining procedures to facilitate multi-dimensional geometric distributions in order to study the correlation between pairwise contacts, while introducing appropriate assumption that relies on lower-dimensional data mining, for the purpose of generating an energy-based potential function using structural data mining. Graph theory provides a powerful tool to model pairwise relationships between objects and we introduce this approach in this study for the purpose of developing a new framework for the development of structure-derived potential functions. This approach also offers a new practical data mining procedure beyond the use of the two-body Kirkwood Equation employing the “reference state” approximation.

Method In this section, we first introduce a graphical representation for molecular systems and formulate the molecular Boltzmann probability using a group of lower-dimensional joint probabilities via introduction of the distribution decomposition theorem. Then we introduce the Bayesian Field Theory in order to provide a method to model a pairwise energy-driven distribution function using these lower-dimensional joint probabilities derived by data mining against structural database containing protein-ligand structures.

ACS Paragon Plus Environment

6

Page 7 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Notations and Definitions We use a single letter (e.g. a or B) for a single variable or node name, a single letter with a slash (e.g. a/ or B/ ) or a single letter in braces (e.g. {n} or {N}) for a set of variables, and a specific notation scheme representing several variable sets with certain variable numbers, defined as in Table 1.

Table 1. Notations and their corresponding representations used in this paper. Notation

Definition

M /

a set of 1 variable

B/

a set of 2 variables

T/

a set of 3 variables

Q/

a set of 4 variables

P/

a set of 5 variables

N / or { N}

a set of N variables

When the variables in Table 1 are bolded, each of them represents a union of variable sets with arithmetic progression numbers of variables in each set. Probabilities for such unions can be C12

C23

C12

i

j

k

/ ) , P ( B ) = P ( B/ ) ∏ P ( M written as: P ( M ) = P ( M / k ) and / i ) , P ( T ) = P ( T/ i ) ∏ P ( B/ j )∏ P ( M so on, with Ckn as the number of k-combinations given a set of n elements. When applied to a

/ ) represents the Boltzmann probability molecular system of N atoms, the joint probability P ( N of the molecule containing all pairwise potentials and all many body effects among the N atoms. With Z as the molecular partition function and V as the inter-particle potential, we have:

ACS Paragon Plus Environment

7

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(

− RT log ( P ( N / ) × Z ) = − RT log e N

− β V ( r1 , r2 ,LrN )

= ∑V2 ( ri , rj ) + i≠ j

) = V ( r , r ,L r ) 1

2

N

N

∑ V3 ( ri , rj , rk ) + L +

i≠ j ≠k

Page 8 of 92

N



i ≠ j ≠ k ,L, n

Vn ( ri , rj , rk ,L rn ) (7)

with V2 representing the pairwise potentials, V3 for three body potentials, etc.

A Graphical Representation for Molecular Systems A graphical model is a powerful tool to study a set of connected variables with complicated dependencies. Any group of connected variables can be modeled as a graph, e.g. a molecular system can be represented using atomic coordinates as vertices and edges as the atom pairwise potentials. This subsection illustrates a procedure to reduce the dimensionality of a molecular probability distribution using low-dimensional structural probabilities. We describe a dimensionality reduction procedure using the decomposition properties of Markov networks, and analyzed the error induced using this approximation (see Supporting Information). Because any molecular pairwise contact is bi-directional, without a need to further specify the direction of the variable dependencies, we introduce a Markov network to represent a molecular system. This means that a molecule is an undirected graphical model (UG), in which nodes (atoms) are under the influence (contact energies) from all the direct connections in the network. The Boltzmann probability of a molecular energy state is then modeled as the joint probability of all the atoms given a set of geometries: P ( si , s1 , s2 ,L, sN ) . In this work, we use the decomposable feature of the Markov network to separate the joint probability (molecular energy) of the molecular network thereby simplifying the data structure used to train a structure-derived potential.

ACS Paragon Plus Environment

8

Page 9 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Any UG or part of a UG such that every two distinct nodes are connected is defined as a “clique”. Given the inherent nature of a molecular system that pairwise contacts exist between every two atoms, a molecular system is a union of joint cliques with the smallest clique having only two nodes to the largest clique containing the entire molecule. Initially proposed and subsequently proven by Pearl42 the joint probability of a Markov network with such connections can be decomposed as a product of all intra-clique probabilities of the network, divided by a product of the probabilities of the intersections between cliques. Using a five-atom molecule as an example, the Boltzmann probability of any molecular configuration of the system can be decomposed as: C45

P ( P/ ) =

∏ P (Q ) i

i

C35

(8)

C25

∏ P ( T )∏ P ( B ) j

j

k

k

Any joint probability within any product from equation 8 can be expanded in the following manner: C45

C45

C35

C45

C35

C25

∏ P ( Q ) = ∏ P ( Q/ )∏ P ( T ) = ∏ P ( Q/ )∏ P ( T/ )∏ P ( B/ ) i

i

i

i

j

j

i

i

j

j

k

(9)

k

/ k ) is the Boltzmann probability containing the pairwise potential between the two where P ( B particles in subset k, P ( T/ j ) contains all pairwise potentials and the three-body effects within the 3 atoms in subset j, and so forth. With the knowledge that N-body effects become less important as the number of particles N increases, we ignore the 4-body effects and above in our molecular system:

ACS Paragon Plus Environment

9

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 92

C35

P ( P/ ) ≈

∏ P (T ) i

i

(10)

C25

∏ P (B ) j

j

A detailed deduction from equation 8 to equation 10 is included in the Supporting Information. Merging equation 8 and equation 10 generates:

  ∏i P ( Qi ) ≈  ∏j P ( Tj )   

2

(11)

∏ P ( Q/ ) ≈ ∏ P ( T ) i

j

i

j

By ignoring 4-body effects in a five-atom molecular system, the product of a higher dimensional joint probability,

∏ P ( Q/ ) , is approximated as the product of a union of the lower i

i

∏ P (T ) .

dimensional joint probability,

j

Generalizing this approximation to an N-atom

j

molecular system, the dimensional reduction of a molecular Boltzmann probability from an N-1 joint probability to an N-2 joint probability is performed by ignoring the N-1 body effect in the molecular system. CNN−1

CNN−2

∏ P ({N − 1} ) ≈ ∏ P ({N - 2} i

i

j

j

)

(12)

We can also generalize the functional form of equation 8 as follow: CNN−1

CNN−1

CNN− 2

∏ P ({N -1} ) = ∏ P ({N − 1} )∏ P ({N - 2} i

i

i

i

j

j

)

(13)

Hence the Boltzmann probability of an N-body molecular system can be decomposed into lower-dimensional joint probabilities through dimensional reduction by repeatedly applying equations 12 and 13 to give the general form for the UG decomposition function:

ACS Paragon Plus Environment

10

Page 11 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

C NN−1

P(N / )=

∏ P ({N - 1} ) i

i

C NN− 2

C NN−3

C4N

C3N

C2N

∏ P ({N - 2} )∏ P ({N - 3} )L ∏ P ( Q )∏ P ( T )∏ P ( B ) j

r

k

j

k

r

s

s

(14)

t

t

For instance, when only including 3-body effects among particles, while ignoring all higher dimensional many-body effects, the molecular Boltzmann probability simplifies to: C3N

P(N / )≈

∏ P (T ) i

i

C2N

∏ P (B )

C3N

= ∏ P ( T/ i )

(15)

i

j

j

To form the molecular Boltzmann probability for two-particle joint probabilities, we only use the pair potentials amongst the atoms: C2N

C2N

i

i

P(N / ) ≈ ∏ P ( Bi ) = ∏ P ( B/ i )

(16)

In the above we described a molecular Boltzmann probability using a graphical model, which provides a very important dimensional reduction feature, such that any high-dimensional joint probability can be decomposed iteratively until reaching lower-dimensional probabilities that can be directly quantified through training on a structural database. The expansion of a joint probability follows the chain rule of basic probability theory:

P ( T/ ) = P ( a1 , a2 , a3 ) = P ( a1 | a2 , a3 ) P ( a2 | a3 )

(17)

P ( B/ ) = P ( a1 , a2 ) = P ( a1 | a2 )

(18)

From a structural database the radial probability with the same number of “target atoms” is conditioned by a mix of background contacts, e.g. P ( T/ ') = P ( a1 , a2 , a3 | C30 )

(19)

ACS Paragon Plus Environment

11

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

P ( B/ ' ) = P ( a1 , a2 | C20 )

Page 12 of 92

(20)

As mentioned in the introduction, it is challenging to generate a converged conditional term C0 from available protein-ligand databases given the chemical space of biomolecules and the distinct molecular configurations presented by each structure in the training set. Hence, our approach is to circumvent this issue by cancelling the conditional term C0 during the data mining procedure, so the Boltzmann probabilities of the “target atoms” can be generated.

A Structural Database Network Simulation Using Bayesian Field Theory Bayesian networks, with the richer language of directed graphs, provide better tools for studying the conditional probability distributions of the geometries of the “target atoms” using directed edges to indicate variable dependencies. By definition, a Bayesian network has to be a directed acyclic graph (DAG), a graph with directed edges (conditions) while not forming a directed cycle, i.e. a “cause” will never be the cause of itself through any causal chains in the network. In a molecular DAG, because (1) pairwise contacts exist between every two atoms in a molecular system, i.e. a molecular graphical network contains no uncoupled vertices, and (2) the particle number and pairwise contact number/types of a molecular system remain constant in a certain equilibrium state, the direction of any atom pairwise arrow is exchangeable while still remaining a DAG. In other words, all DAGs representing a molecular system satisfying the two criteria above are Markov equivalent according to the theorem of Pearl.42 Figure 1 shows all 6 equivalent DAGs with different arrow directions for a triatomic molecule. Therefore it allows the application of Bayesian networks to, for example, protein-ligand systems with specifically designed arrow directions representing each target atom pair. Each independent pairwise Boltzmann distribution is then generated using the probabilistic molecular graphic model.

ACS Paragon Plus Environment

12

Page 13 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 1. Equivalence class of models for a triatomic molecule. Expansion of the network by adding contact nodes in each coordinate → coordinate chain still keeps the Markov equivalence of the networks.

A pairwise Boltzmann probability is a conditional radial probability dependent on the pairwise contact energy. To apply a more detailed DAG model against molecular systems, we separate the atomic vertices into two categories: the atomic coordinates ri and the atom pairwise contacts Ci,j. The arrows indicate the correlation between the contacts and atomic coordinates. Every two atomic coordinates ri and rj are connected through a chain node Ci,j in the molecular network. A pairwise contact probability P(Ci,j) is 1 or 0 indicating that a contact exists/does not exist

ACS Paragon Plus Environment

13

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 92

between atom pair i, j. The conditioned probability P(ri ,rj|Ci,j) represents the Boltzmann probability of the atom pair driven by its pairwise contact. Its likelihood probability is then generated according to Bayes’ rule:

P ( Ci , j | ri , rj ) =

P ( ri , rj | Ci , j ) P ( Ci , j ) P ( ri , rj )

∝ P ( ri , rj | Ci , j )

(21)

P ( ri , rj ) is a constant for any ri and rj, because this probability represents the unconditioned

probability distribution of the atom pair i, j with no restraints; hence, P ( Ci , j | ri , rj ) is proportional to P ( ri , rj | Ci , j ) in a molecular system. A basic triatomic molecular graph is expanded in this way and shown in Figure 1. To study the radial distribution of the atom pair (r0 − r1) under consideration, a Bayesian network is established using a target atom in the pair (r0) as the fork node (ancestor of all nodes) and the other (r1) as the “end node” with no descendants (i.e. all paths start from the target atom r0 and meet at the target atom r1 in the end). Figure 2 gives a graphical representation of a molecular contact map for a “target” atom pair. The target atom pair is circled in red and further separated into two vertices: the pairwise contact (C1,0) and the atomic 3-D coordinates (r0 and r1). All other atoms contacting the target atom pair are classified as the background vertices in orange circles. Full expansion of the molecular network in Figure 2B makes a complicated graph with every atomic coordinate node connected with each other through the corresponding contact nodes. For a better representation, the molecular network is shown as a 3-D nautilus surface in Figure 2D, with the target atom pair nodes r0, r1 and three background atom nodes r2, r3 and r4, linked via pairwise contact nodes.

ACS Paragon Plus Environment

14

Page 15 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 2. Converting the representation of a molecular contact map regarding the target atom pair (in red circle) to a Bayesian network, with the background atoms and contacts marked in orange. A 3-D nautilus shell molecular network is presented in 3C and 3D with all atom coordinate nodes chained up through the contact nodes (semitransparent nodes). The target atom pairs (coordinates and the contact) are marked in red and the background atoms in orange.

ACS Paragon Plus Environment

15

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 92

0 For convenience, we use one variable, r1 ∈ ( R1 )0 , instead of two, r0 ∈ R0 and r1 ∈ R1 , to

represent the target atom pair coordinate by setting r0 as the fixed point of reference for the molecular geometry, and the configurations of all other atoms are relative to the location of r0 . Thus the conditional probability regarding r10 is expressed as: P ( r , c | C/ , R/ ) = 0 1

=

0 1

P ( C/ , R/ | r10 , c10 ) P ( r10 , c10 ) P ( C/ , R/ )

P ( c/ v , r/v , c/ v +1 , r/v +1 , c/ v + 2 , r/v + 2 ,L , c/ n , r/n | r10 , c10 ) P ( r10 | c10 ) P ( c10 )

(22)

P ( c/ v , r/v , c/ v +1 , r/v +1 , c/ v + 2 , r/v + 2 ,L , c/ n , r/n )

P ( c/ v , r/v , c/ v +1 , r/v +1 , c/ v + 2 , r/v + 2 ,L , c/ n , r/n ) = P ( c/ v , r/v | c/ v +1L , r/v+1L , c/ n , r/n ) × P ( c/ v +1 , r/v +1 | c/ v+ 2L , r/v + 2L , c/ n , r/n ) ×L × P ( c/ n , r/n )

(23)

According to the chain rule of probability, the expansion of the numerator and denominator in equation 22 both go through the "contact layers" from the direct contact path "target ← v" with the atom set r/v (coordinates of atom set v with respect to the coordinate r10 ) and contact set c/ v , to longer paths "target ← v ← v+1" until the longest path in the molecular network "target ← v ← v+1 ← … ← n-1 ← n", as shown in equation 23. P ( r10 | c10 ) , in the numerator in equation 22, is the Boltzmann probability of the target atom pair, that the radial distribution is exclusively driven by its atom pairwise contact. Our goal in this study is to generate the term P ( r10 | c10 ) , from the joint probability P ( r10 , c10 | C/ , R/ ) . As a matter of fact, due to the inseparable nature of the atom-pairwise contact layers in a molecular network, in which each atom is included in multiple layers, and atoms with different types are connected under different manners (bond, angle, dihedral and nonbonding contacts), quantifying each individual atom pairwise probability distribution via training against a structural database is almost impossible due to the “curse of

ACS Paragon Plus Environment

16

Page 17 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

dimensionality” given such a complicated data structure. Simplification against the network is crucial for any data mining procedure. The best option is to eliminate the requirement to train the complicated C/ and R/ variable sets, and develop a data mining procedure to directly quantify the Boltzmann distribution term P ( r10 | c10 ) . The Bayesian field theory (BFT), first introduced by J. C. Lemm,43 is used herein to address this issue. In a molecular network, the atomic coordinates are directly observable given the structure yet the underlying potential governing them is a challenge to perceive. BFT simplifies the molecular network by introducing hidden nodes that represent all the indirect contacts among the background atoms. The molecular network is modified such that all the direct paths leading to the “target” atom pair are separated from the indirect background contacts, and each background path is formulated using two models, likewise from Bayes' theorem, a prior model, providing the direct probabilistic description of the background atoms contacting on the target atom pair, and a likelihood model, providing the information necessary to simulate the effects of the background contacts on the conditional probability of the target atom pair. In the molecular network, each background atom i affects the target atom pairwise (1−0) probability distribution through a direct contact path, and multiple longer indirect paths by impacting other background atoms. The BFT assumes a "background contact field" set of nodes (

D / ) with each node Di including all the other background pairwise contacts against the background atom i, thus replacing the complicated paths among the background atoms. The background atom i then goes through each individual path, "target(1−0) ← vi ← Di", and meet at the target atom pair (1−0) in the simplified molecular network. This is accomplished iteratively, by selecting each background particle i with the remaining particles assigned to the Di node to form every "target(1−0) ← vi ← Di" path until all particles in the molecular system are selected.

ACS Paragon Plus Environment

17

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 92

Figure 3 illustrates such a molecular network transformation. The background probabilities of

(

)

the network regarding r10 is modeled using a prior probability, P r10 | ( ci ,1 )0 , ( ri )0 , and a likelihood function, P ( ( ri )0 | Di ) , for each Di → (ri)0 → (ci,1)0 → r10 path. In a directed graph, a BFT-modified multiple conditioned probability, namely P ( r10 | C/ , R/ , D / ) , can be expanded following the chain rule of probability including the probabilities of all the chain nodes tracing back along each path until reaching the end or a collider node (i.e, a node with arrows that collide "head-to-head"). The approximation applied to the D nodes assumes each Di → (ri)0 →

(ci,1)0 → r10 path is independent from each other. Hence, this approximation ignores many body effects among the background contacts in the D nodes and ignores the correlation between pairs of D nodes.

ACS Paragon Plus Environment

18

Page 19 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 3. Illustration for the transformation from a regular molecular network to a BFT modified network. A→B: from the 3D to 2D representation of a molecular network; B→C: using relative coordinates referring to the atom r0; C→D: transformation from the regular molecular network to a BFT modified network, where indirect contacts are represented using an independent “hidden node” D, all contacts regarding the target atom pair (1−0) are then independently connecting to the node (r1)0.

ACS Paragon Plus Environment

19

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 92

Using the BFT-modified molecular network in Figure 3D, the conditioned probability of r10 can be expanded as in equation 24:

P ( r10 | c10 , C/ , R/ , D / ) = P ( r10 | c10 ) P ( r10 | C/ , R/ , D /) Nv

= P (r | c 0 1

0 1

)∏ P (r

0 1

| ci , ri , Di )

i

= P (r | c 0 1

Nv

0 1

)∏ P (r

0 1

| ci ) P ( ci | ri )P ( ri | Di )

(24)

i

Nv

∝ P ( r10 | c10 ) ∏ P ( r10 | ci ) P ( ri | ci )P ( ri | Di ) i

Compared to equation 22, such transformation yields a much simpler functional form, as shown in equation 24, with the D set containing all the details concerning the indirect variables contained in the molecular network. To eliminate the need to explicitly generate the probabilities of the D nodes, we perform a series of algebraic manipulations and solve for the Boltzmann probability of the target atom pair P ( r10 | c10 ) benefitting from the structure of the simplified network we derived. From the BFT-modified molecular network, we can also prepare the conditional probabilities for each background atom coordinate ri according to the connections in the network (Figure 4): P ( ri | ci , r10 , Di ) = P ( ci , r10 | ri ) P ( ri | Di ) =

P ( ri | ci , r10 ) P ( ci , r10 ) P ( ri

)

P ( ri | Di )

(25)

∝ P ( ri | ci ) P ( r10 | ci ) P ( ri | Di ) Given that equation 25 is the same as the term inside the product notation of equation 24, we can directly generate the independent atom pairwise probability of the target atom pair (1−0) using equation 24 divided by equation 25 including all the background atoms i through their

ACS Paragon Plus Environment

20

Page 21 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

individual paths, with Nv being the total number of background atoms with respect to the “target” atom pair. P ( r10 | c10 , C/ , R/ , D /) Nv

∏ P ( ri | ci , r , Di )

P (r | c 0 1

=

0 1

0 1

Nv

)∏ P (r

0 1

Nv

∏ P (r

i

i

| ci , ri , Di )

i

| ci , r10 , Di )

i

(26)

Nv



P ( r10 | c10 ) ∏ P ( r10 | ci ) P ( ri | ci )P ( ri | Di ) i

Nv

∏ P (r

i

| ci ) P ( r10 | ci ) P ( ri | Di )

=P ( r10 | c10 )

i

Figure 4. Two conditional probabilities formulated in (A) equation 24 and (B) equation 25, with

the circled node as the target variable and squared nodes as the conditional variables. Probabilities observed for the circled node is conditioned by the squared nodes depending on the connections observed in different situations.

ACS Paragon Plus Environment

21

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 92

When looking at the left-hand side of equation 26, the numerator P ( r10 | c10 , C/ , R/ , D / ) contains multiple conditions with respect to r10 , including all the background coordinate variables distinguished by atom and interaction type e.g. bond, angle, torsion or nonbonded; While each term, P ( ri | ci , r10 , Di ) , in the denominator, is a two-dimensional radial probability of two spatial variables: (1) ri, the coordinate of a particular background atom i, joint with the target atom pair and (2) r10 , the “target” atom pairwise distance. As an independent atom pairwise probability generation protocol, every time a 2D spatial probability P ( ri | ci , r10 , Di ) is prepared - from a structural database - for each background contact i with respect to the target atom pair (1-0), which is then eliminated by the joint probability P ( r10 | c10 , C/ , R/ , D / ) using equation 26, ultimately derives the pairwise probability P ( r10 | c10 ) . Given we assume that the D nodes are independent, this procedure bypasses the obstacles associated with modeling the background contact probabilities of the D nodes. This is accomplished by using the ratio of different two-dimensional radial distribution functions to cancel these terms as given equation 26 and represented in Figures 5 and 6 below. Because of this the particle pairwise probabilities we generate are not knowledge-based potentials containing PMF-like structural information, but are radial distributions driven by the pairwise contacts to the target particle. In this study, we used the PDBBind database v2015 for generating the protein-ligand nonbonded heavy atom pairwise probabilities. All of the 11742 X-ray crystal/NMR structures excluding the test cases applied in the later validations were utilized for the structural data collection step. Due to the unreliability of the hydrogen atom locations in the crystal structures, only heavy atoms were studied and parameterized for, with the hydrogen atoms only used to

ACS Paragon Plus Environment

22

Page 23 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

predict the protonation state (see Table 2). 20 heavy atom types for protein atoms and 24 heavy atom types for ligand atoms were defined and are listed in Table 2. This yielded a total of 480 protein-ligand non-bonded atom-type pairs. Other atom types, especially ligand atom types, were merged into similar types in this list because of the paucity of experimental structural data.

Table 2. List of atom types in the GARF energy function. Atom Type

Description Protein Atom Types

C

sp2 carbonyl carbon and aromatic carbon with hydroxyl substituent in tyrosine

C*

sp2 aromatic carbon in 5-membered ring with 1 substituent

CA

sp2 aromatic carbon in 6-membered ring with 1 substituent

CB

sp2 aromatic carbon at junction between 5- and 6-membered rings

CC

sp2 aromatic carbon in 5-membered ring with 1 substituent and next to a nitrogen

CN

sp2 aromatic junction carbon in between 5- and 6-membered rings

CR

sp2 aromatic carbon in 5-membered ring between 2 nitrogens and bonded to 1 H (in HIS)

CT

sp3 carbon with 4 explicit substituents

CV

sp2 aromatic carbon in 5-membered ring bonded to 1 N and bonded to an explicit hydrogen

CW

sp2 aromatic carbon in 5-membered ring bonded to 1 N-H and bonded to an explicit hydrogen

N

sp2 nitrogen in amide group

N2

sp2 nitrogen in base NH2 group or arginine NH2

N3

sp2 nitrogen with 4 substituents

NA

sp2 nitrogen in 5-membered ring with hydrogen attached

NB

sp2 nitrogen in 5-membered ring with lone pairs

O

carbonyl oxygen

O2

carboxyl oxygen

OH

alcohol oxygen

S

sulfur in disulfide linkage or methionine

ACS Paragon Plus Environment

23

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

SH

Page 24 of 92

sulfur in cysteine Ligand Atom Types

C.3

sp3 carbon without polar group substituent

C.2

sp2 carbonyl carbon without polar group substituent

C.1

sp carbon

C.ar

sp2 aromatic carbon without polar group substituent

O.3

alcohol oxygen

O.3p

ether oxygen

O.2

carbonyl oxygen

O.co2

carboxylate oxygen

O.2v

sulfate/phosphate oxygen

N.2

sp/sp2/aromatic nitrogen

N.1h

sp3 nitrogen with 1 hydrogen attached

N.2h

sp3 nitrogen with 2 hydrogens attached

N.3h

sp3 nitrogen with 3 hydrogens attached

P

phosphorus

F

fluorine

Cl

chlorine

Br

bromine

I

iodine

C.cat

carbon cation

S.3

thiol/thioether sulfur

S.o

sulfoxide sulfur

C.3X

sp3 carbon with polar group substituent

C.2X

sp2 carbonyl carbon with polar group substituent

C.arX

sp2 aromatic carbon with polar group substituent

In order to generate the two-dimensional molecular configuration probabilities, the “target” atom pairwise distance and the configurations of all its corresponding background atoms need to

ACS Paragon Plus Environment

24

Page 25 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

be recorded simultaneously. In this study, for every “target” atom pairwise distance collected from a protein-ligand crystal structure, all background atoms in both the protein and ligand within 6Å from either of the “target” atom pairs were also collected. In order to differentiate these background contacts based on their contact types against the “target” atom pair, ci (see equation 26), we classified these background atomic configurations into different categories, i.e. bond, angle, torsion contacts with respect to the “target” atom pair, or nonbonded contacts with either of the “target” atoms. The configuration of a background atom relative to the “target” atom pair (rα) was calculated as rα = dα sin (θ ) , where dα is the distance between atom α and the centroid of atom 1 and atom 2, and θ the angle between the vector atom 0→atom 1 and Ct10 →atom α. Two-dimensional radial distribution functions (RDF) were then generated using the procedure described above and transformed into the probabilities P ( r10 | c10 , ri 0 , C/ v ≠ i , R/ v ≠i , D / v ≠i ) and P ( ri | ci , r10 , Di ) through normalizations along the two vectors, r10 and ri, respectively (see Figure Nv

5). Subtracting the

∏ P (r

i

| ci , r10 , Di ) term from the P ( r10 | c10 , C/ , R/ , D / ) term then generates the

i

Boltzmann probability distribution of the “target” atom pair (see Figure 6). Finally, in order to generate a smooth potential function, the generated probability distributions were fit to a Lennard-Jones (LJ) form:

 σ  ε1  r 0 0 P ( r1 | c1 ) = exp     

α β  σ    − ε2     r   − RT  

(27)

The reason that we apply an α−β LJ functional form instead of a 12−6 LJ function plus an electrostatic term is that, an “α” exponential term is more flexible in fitting the repulsive distance

ACS Paragon Plus Environment

25

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 92

range, than using a fixed power; and the fitting process is facilitated with only one repulsive term and one attractive term in the α−β LJ functional form.

Figure 5. 2D-probability distributions targeting the atom pair of an amide nitrogen (type “N”)

from the protein and a carbonyl oxygen (type “O2”) from the ligand with sp3 carbon atoms (type “CT”) as the background atoms contacting the amide nitrogen atoms through bond interactions. The intensity is stronger in lighter regions while weaker in darker regions (A) 2D-RDF generated from collecting the PDBBind v2015 database, along two radial vectors, r10 and rα0 , where rα0 is 0 0 calculated using rα = dα sin (θ ) , with θ the angle between the vector atom0→atom1 and the

vector Centroid of atom0 & atom1 ( Ct10 )→atomα. (B) Conditioned probability distribution

ACS Paragon Plus Environment

26

Page 27 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

P ( r10 | c10 , rα0 , C/ v ≠α , R/ v ≠α , D / v ≠α ) obtained by normalizing the 2D-RDF along r10 . (C) Conditioned

probability distribution P ( rα0 | ci , r10 , Di ) by normalizing the 2D-RDF along rα0 .

ACS Paragon Plus Environment

27

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 92

ACS Paragon Plus Environment

28

Page 29 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 6. The computational procedure illustrated in Figure 5 was performed for all background

atom contacts on both the protein and ligand sides with respect to the “target” atom-pair type until the pairwise Boltzmann probability P ( r10 | c10 ) was finally generated, which is then fitted against the Lennard-Jones α−β function, in order to generate a smooth potential function.

Figure 7. Comparison of the 1-D radial distribution function (RDF) for the N−O2 atom pair

collected from the PDBBind database to the probability distribution generated using the Bayesian Field Theory. Noise in the raw RDF from the background contacts is effectively eliminated.

Using the approach detailed above we created the GARF energy function. From Figure 7, we can clearly see the background-induced noise in the radial distribution function is effectively eliminated using such approach. The major difference between GARF and traditional statistical potentials lies in the fact that for the latter for each atom pairwise contact, (r10), all background contacts were merged into an averaged one-dimensional conditional function in a “black box” manner. The problem with such an assumption is, that for a particular atom pair in a molecular

ACS Paragon Plus Environment

29

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 92

system, the background condition could be very different from the conditional function. On the other hand, the BFT-based structural-derived potential generation method assumes that every atom pairwise probability distribution is under a group of independent conditions modeled as different pairwise contacts using orthogonal 2-D radial distributions. The “black box” nature of the reference state is then quantified and eliminated using multi-dimensional atom pairwise RDFs conditioned on the target contact. Even though both methodologies rely heavily on the quality of the database used in training, the BFT-based method removes the background effects present in the structural database; hence, the so-generated distribution is not biased to the training set. This method, in the authors’ opinion, goes one step further to capture the physics contained in the structural database by generating independent atom pairwise probability distributions.

BFT-based Structural-Derived Potentials from Using Different Structural Data Sources In order to explore the BFT-based method further, we applied it to two different structural data sources, and then compared the atom pairwise distributions from these two data sources. Specifically, the goal of this analysis was to show that this approach is able to generate similar atom pairwise Boltzmann distributions for the same atom type pair from multiple structural data sources. Two structural data sources were used. The first is the PDBBind protein-ligand structural database containing 10000+ protein-ligand complexes. The second was derived from a MD simulation of alanine capped with N-terminal acetyl (ACE) and C-terminal N-Me amide (NME) in a mixed solvent system of water and methanol (250 snapshots were used in total). We used a

ACS Paragon Plus Environment

30

Page 31 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

mixed solvent system to create a more complicated “background” to challenge the BFT-based method. The AMBER ff14sb force field was used in the simulation and AMBERTools16 was used to generate the topologies. The particle mesh Ewald (PME) method was used to treat the long-range electrostatic interactions. All bonds involving hydrogen atoms were constrained using SHAKE. The Nonbond cutoff was set to 8 Å and a 2 fs time step was used. PMEMD.CUDA was utilized to run the MD simulations on a K80 GPU. The capped alanine was placed into the center of a box of water molecules, which was then placed into a box of methanol molecules. The size of the solvent boxes were such that a ratio of ~ 1:1 water/methanol was achieved. This resulted in 36Å cube containing 667 TIP3P water molecules and 671 methanol molecules surrounding the capped alanine. Minimization was performed to remove bad steric clashes before the simulation was begun. This was followed by a 1 ns NVT simulation going from 0 K to 300 K where a restraint of 10 kcal/(mol·Å2) was placed on the peptide. The density of the system was equilibrated via another 3 ns of NPT MD simulation, after which a 50 ns NPT simulation was performed to collect snapshots every 0.2 ns, yielding 250 snapshots in total generating our MD derived structural database. In this comparison, we targeted atom pairs that both existed in the two structural data sources; a hydrophobic contact, CT ↔ C.3X, and a polar-polar contact, O ↔ O.3. Figure 8A compares the raw RDF generated directly using the structural data sources while Figure 8B gives the BFT generated atom pairwise distribution where all background effects have been removed. Figure 8A shows that, from different structural data sources the obtained RDFs are quite different for a given atom pair. This is not unexpected because the two data sources present different background contacts that affect the derived RDFs accordingly. Traditional statistical potentials

ACS Paragon Plus Environment

31

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 92

ignore such correlations making it difficult to generate independent pairwise potentials. Our work eliminates these two-body contact correlations between the target particle pair and the background contacts. Even though many body effects were ignored, the results in Figure 8B show that irrespective of the data source employed, once the background effects were removed using the BFT approach, good-quality pairwise probability distributions were obtained. Importantly, this method generates maxima in probabilities at 3.7 ~ 3.9 Å for both CT ↔ C.3X nonbonding distributions and 2.8 ~ 2.9 Å for both the O ↔ O.3 nonbonding distributions. Using pairwise distributions generated from the PDB the maxima at shorter distances relative to the AMBER-MD derived data set differed, indicating basic geometric differences between experiment and force field derived data sets.

Figure 8. Comparison of the (A) raw radial distribution functions (RDFs) generated directly

using two different structural data sources for the CT ↔ C.3X atom type pair and the O ↔ O.3

ACS Paragon Plus Environment

32

Page 33 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

atom type pair, to the (B) probability distribution generated using the Bayesian Field Theory. The distributions from the alanine-mixed solvent data source are illustrated in blue color and the distributions from the PDBBind database are in orange color.

BFT-based Method Backtracking Force Field Reversible Work Beyond removing background effects from different structural data sources, it is also important to demonstrate that this method can backtrack to the force field using MD-derived structural information. For this analysis, we studied the O ↔ O.3 pairwise distribution obtained from the MD simulation of the peptide mixed solvent simulation. By removing the background effects on a pairwise distribution derived from MD-generated structures, the BFT method generates a distribution representing the reversible work done by the target particle pair in vacuum. In particular, we generate the reversible work that describes the pairwise distribution ( g ( rO ↔O.3 ) in equation 28) for the O ↔ O.3 pair from the AMBER ff14sb force field, and compare it to the BFT-generated pairwise distribution over 250 MD snapshots (see Figure 8). g ( rO ↔O.3 ) = e

− β w( rO ↔O .3 )

(28)

The BFT method generates a heavy-atom-based pairwise distribution, so that the O ↔ O.3 distribution curves generated in Figure 8B are united atom distributions with the hydroxyl hydrogen merged into the O.3 atom type. For an apples-to-apples comparison, we “pulled” the hydroxyl group away from the carbonyl oxygen atom in vacuum using the ff14sb force to generate the reversible work profile. The force field based reversible work with respect to two chemical groups in vacuum requires an estimate of the work done by each pairwise force term along with all the associated translational and rotational degrees of freedom.

ACS Paragon Plus Environment

33

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

w=∫

RO↔O .3

rO ↔O .3

Page 34 of 92

electric RO↔O .3 dE dEOvdw O ↔O .3 ( rO ↔ O .3 ) ↔O .3 ( rO ↔ O .3 ) drO ↔O.3 + ∫ drO ↔O.3 rO↔O .3 drO ↔O.3 drO ↔O.3

 ΘO−O .3−H dwOelectric − β wOelectric ↔H ↔H e dθO −O.3− H  RO↔ H ∫θO−O .3− H dr O↔ H  +∫ ΘO−O .3− H − β welectric rO↔ H  e O↔H dθO −O.3− H ∫ θO−O .3− H  

   drO ↔ H   rO↔H

(29)

tric where wOelec ↔ H is the electric potential energy for each rO ↔ H and θO −O.3− H .

electric O↔ H

w

c ∂EOelectri ↔ H ( rO ↔ H ) cos (θ O −O .3− H ) drO ↔ H ( rO ↔ H ,θO −O.3− H ) = ∂rO ↔ H

(30)

Using equation 29 we calculated the work by pulling the hydroxyl group and the carbonyl oxygen atom away from each other along the O ↔ O.3 distance from 2 Å to 10 Å. The potential energy of the hydroxyl hydrogen atom and the carbonyl O has two degrees of freedom along this pathway (equation 30). We first examined the 250 MD snapshots to understand the integral ranges for all the independent geometric variables in equation 29. Knowing that the vibration of the H-O.3 bond length in the hydroxyl group has very limited contribution to the total reversible work, and rO↔H can be generated using the law of cosines given the values of rO↔O.3 , θO −O.3− H along with the bond length of H-O.3 in the hydroxyl group (equation 31), the integral space for equation 29 can be interpreted by the degrees of freedom of rO↔O.3 and θO −O.3− H . rO ↔ H =

( rO ↔O.3 ) + ( lH −O.3 ) 2

2

− 2 × rO ↔O .3 × lH −O .3 × cos (θO −O .3− H )

(31)

Figure 9A shows the values of rO↔O.3 and the corresponding θO −O.3− H for each hydroxyl group and carbonyl oxygen atom pair sampled with the rO↔O.3 values distributed from 2.5 Å to 5 Å. This plot clearly illustrates the restrained distribution for θO −O.3− H when O ↔ O.3 is at close range (< ~3.3 Å). Moreover, a two dimensional ( rO↔O.3 , θO −O.3− H ) distribution plot (see Figure

ACS Paragon Plus Environment

34

Page 35 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

9B) shows the geometric preference for the formation of a hydrogen bond with a O ↔ O.3 distance distributed roughly between 2.5 Å and 3.2 Å, θO −O.3− H and between 0 and 25 degrees. From this analysis, the integral ranges for equation 29 are clearly revealed using the MD snapshots. The reversible work is then calculated and used to estimate g ( rO ↔O.3 ) using equation 28, which is then compared to the BFT generated distributions. Figure 10 shows the agreement between the two distributions with a RMSD of 0.28 and a Pearson’s R value of 0.86. Both have maxima in their probabilities at 2.9 Å for the O ↔ O.3 interaction. Using the BFT-based method, we compared the pairwise distributions generated against different data sources, and backtracked out the force field driven reversible work, which provides further evidence that the BFT-based method is able to reproduce the pairwise Boltzmann distribution by removing background effects.

ACS Paragon Plus Environment

35

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 92

Figure 9. Structural information regarding the hydroxyl group and carbonyl oxygen from the

250 MD snapshots with the capped alanine in the mixed solvent system of water and methanol.

ACS Paragon Plus Environment

36

Page 37 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(A) rO↔O.3 and θO −O.3− H for each sampled hydroxyl group and carbonyl oxygen atom pair. (B) a two-dimensional number distribution with respect to rO↔O.3 and θO −O.3− H .

Figure 10. Probability distributions for the hydroxyl group and carbonyl oxygen from 2 Å to 10

Å. The blue curve represents the BFT generated distribution with the background effects removed. The red curve represents the distribution of the reversible work generated using the ff14sb force field.

In this study, we introduced the BFT approach to generate the pairwise Boltzmann probability using available structural databases. Many-body effects can also be modeled using the BFT approach with, instead of 2-dimensional RDFs, hyper-dimensional probability distributions

ACS Paragon Plus Environment

37

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 92

collected from structural databases. However, the estimation of many-body effects suffers from the “curse of dimensionality”, and as a result would require a larger database representing atom pairwise interactions. When going from two-dimensional probabilities, which is used in GARF, to 3-dimensional probabilities while keeping the same data quality as realized in this study, the data collection would need to cover at least the size of the training set in this study raised to the power of 3/2, or ~40 million examples of protein-ligand complex structures. This data requirement exceeds the size of the current Protein Data Bank or CSD. Given this realization, GARF represents a reasonable compromise given the available data.

Method Validation and Discussion Herein, we apply GARF as an energy function to protein-ligand binding affinity studies. Binding affinity estimation is an attempt to predict the free energy change associated with the formation of a protein-ligand complex. However, an accurate free energy estimation protocol requires sophisticated sampling procedures to obtain a converged energy landscape, which is too expensive for typical scoring function evaluation. Different simplifications have been introduced over the past several decades for fast estimation of protein-ligand binding free energies. In this work, we introduce several protocols with different sets of approximations to explore both enthalpy and entropy evaluations using the GARF energy function.

End-State Free Energy Method An end-state free energy strategy uses either only the structure of the optimized bound state, or the two end states (free and bound states) of the protein and ligand. Even though neither procedure considers structural flexibility for both the protein and ligand, it is still a fast, practical and broadly used approach in the computer-aided drug discovery field. However, it is

ACS Paragon Plus Environment

38

Page 39 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

appreciated that not considering the conformational entropy brought by the structural flexibility of the ligand and/or active site introduces a significant error source.44-46 In order to control for the influence of entropy on the prediction of the binding affinity, a relative binding free energy change study is a good way to circumvent such issues via the study of systems with similar structures, which are more likely to share similar microstate distributions in the entropy estimation. Notably the successes of FEP approaches applied to relative free energy calculations against protein targets involving congeneric ligand series involve, in part, cancelation of enthalpic and entropic errors given the structural similarity of the ligands involved.47-49 Herein, not only were large scale test sets involving diverse protein and ligand structures explored to evaluate the robustness of binding mode prediction and binding affinity reproduction of our new potential, but test sets with selected homologous protein families and congeneric ligands with respect to common receptors were also examined, to illustrate the calculation of enthalpy using the end-state methods. Validations were performed using a series of docking and scoring studies. Docking was performed using the Heatmap docking program developed in our group,50 which applies a gridbased technique to place the ligand. This program first builds and evenly deploys grid points within the binding site of the protein receptor. The grid points are then probed with different ligand atom types and scored using the GARF energy function accounting for all of the surrounding protein atoms. The scored grid points for the various ligand atoms are further used to guide the placement of the ligand. Scoring was performed using equation 32 to calculate the binding free energies, where the free energy change was divided into enthalpy and entropy components associated with the proteinligand complexation. Because only the end state structures were employed in this study, the free

ACS Paragon Plus Environment

39

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 40 of 92

energy equation was modified to assume that the enthalpies and entropies for both apo-state and holo-state structures were the same. The calculation of ∆Gbinding was therefore approximated by including the protein-ligand binding enthalpies and the solvation enthalpy change before and after binding. By ignoring the pressure-volume work against the molecular systems, the change in enthalpy was further approximated by the change in potential energies.

∆Gbinding = ∆H binding − T ∆Sbinding holoprotein hololigand holoprotein − hololigand apoprotein apoligand = ( H intra + H intra + H inter − H intra − H intra ) protein − ligand complex apoprotein apoligand + ( H solvation − H solvation − H solvation ) − T ( S protein−ligand complex − S apoprotein − S apoligand ) holoprotein − hololigand protein −ligand complex apoprotein apoligand ≈ H inter + ( H solvation − H solvation − H solvation ) holoprotein − hololigand protein −ligand complex apoprotein apoligand ≈ Einter + ( Esolvation − Esolvation − Esolvation )

(32) The gas phase binding potential energy was directly calculated using the GARF energy function, while the solvation energy was calculated using the KMTISM model,51 an implicit solvation model previously developed in our group, with the GARF energy function replacing the old KECSA energy function52 applied in the original publication.51 We compared the performance using the GARF-KMTISM model to the KECSA-KMTISM model against the solvation free energy validation set used in our previous publication.51 The results and data analysis for this new solvation model are included in the Supporting Information. Briefly, against charged and neutral solutes, the GARF-KMTISM model generated an R2 of 0.98 and a RMSE of 2.14 kcal/mol versus a R2 of 0.97 and a RMSE of 2.86 kcal/mol using the KECSA-KMTISM model.51 Thus, the GARF-KMTISM model represents a significant improvement over the KECSA-KMTISM solvation model. The first benchmark contains ~900 high-resolution protein-ligand X-ray crystal structures with emphasis on the diversity in structure and binding affinity distributions – this type of analysis

ACS Paragon Plus Environment

40

Page 41 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

will be most sensitive to inaccurate entropy estimations. All crystal structures were selected from the PDBBind v2015 database and the validation structures were spit into two test sets. One contains 159 test cases chosen from the “core set” of the PDBBind v2015 database53 while excluding all metalloproteins. The second set increased the diversity of the complexes examined by selecting 795 test cases from the “refined set” of the PDBBind v2015 database,53 with the experimental binding affinity evenly distributed from -3 kcal/mol to -16 kcal/mol excluding all metalloproteins. All crystal structures had resolutions better than 2.5 Å and no missing residues in the binding sites. Structures were simply prepared by adding the missing hydrogen atoms. We understand that the protein-ligand complex crystal structures may not best represent the binding modes in solution; however, for the use of large-scale validation, they still provide one of the best validation benchmarks available. Scoring against the crystal structures was first examined to assess our model’s ability to reproduce the binding affinity at the “correct” binding mode. Heatmap docking was then performed followed by scoring of the top-five poses. The top-5-scored docked poses and their corresponding structural root-mean-square deviation (RMSD) values were collected separately and compared, to study the capability of GARF to identify the “global minimum” provided as the crystal complex structures. For crystal structure scoring, GARF scoring against the 159 “core set” test cases had a correlation coefficient R of 0.64, a Kendall's tau of 0.45 and a root-mean-square error (RMSE) of 2.74 kcal/mol, when compared to the experimental binding free energies. Against the 795 “refined set”, GARF gave a correlation coefficient R of 0.73, a Kendall's tau of 0.54 and a RMSE of 2.56 kcal/mol.

ACS Paragon Plus Environment

41

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 42 of 92

The Heatmap docking procedure successfully generated the binding modes for all of the 159 test cases in the “core set”, and 792 out of 795 test cases in the “refined set” within its top-5 best poses. When looking at the best-scored docked poses, an average structural RMSD of 2.82 Å was obtained for the “core set”, and an average structural RMSD as 2.57Å was obtained for the “refined set”. Taking the top 5 poses, another evaluation was done by selecting the binding mode with the lowest structural RMSD out of the top-5-scored docked poses for each test case, the average RMSD for the “core set” test cases is 1.93 Å and the average RMSD for the “refined set” test cases is 1.62 Å. The scoring results against the best-scored docked poses showed a correlation coefficient R of 0.62, a Kendall's tau of 0.44 and a RMSE of 2.86 kcal/mol for the “core set”, and a correlation coefficient R of 0.68, a Kendall's tau of 0.50 and a RMSE of 2.66 kcal/mol for the “refined set”. Such results were quite close to the scoring performance against crystal structures (Figure 11). For a better evaluation, we compare our results to Glide, a widely used docking and scoring program. In this work, we applied both Standard Precision (SP) and Extra Precision (XP) docking methodologies using GLIDE version 6.9 in the Schrodinger 2015-4 suite54-56. Prior to the computation, the receptor structures were prepared using the Protein Preparation wizard utility from the Maestro interface54, 57 . Protonation was performed at pH 7.0 with the PROPKA utility program and energetic penalties of alternate protonation states were assigned using Epik58, 59

. Crystal waters, if present, were removed from the receptor structures. The OPLS 2005 force

field was used to minimize the hydrogen atom positions while the heavy atoms were kept fixed8, 60

. The top-5-scored docked poses were retained for each test case to compare with the

Heatmap/GARF docking and scoring results.

ACS Paragon Plus Environment

42

Page 43 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Both Glide SP and XP procedures successfully docked all the 159 test cases from the “core set”, and the SP procedure successfully docked 794 out of 795 test cases, while the XP procedure successfully docked 793 out of 795 test cases from the “refined set”. Against the best-scored docked poses, the SP procedure generated an average RMSD of 2.86 Å for the “core set”, and 2.81 Å for the “refined set”; the XP procedure generated an average RMSD of 2.46 Å for the “core set”, and 2.54 Å for the “refined set”. Taking the lowest RMSD binding modes among the top-5-scored docked poses, the SP procedure generated an average RMSD of 2.17 Å for the “core set” and 2.19 Å for the “refined set”; the XP procedure generated an average RMSD of 2.27 Å for the “core set” and 2.39 Å for the “refined set”. For scoring against the best docked poses, the SP algorithm generated a correlation coefficient R of 0.39, a Kendall's tau of 0.30 and a RMSE of 2.97 kcal/mol for the “core set”, and a correlation coefficient R of 0.46, a Kendall's tau of 0.33 and a RMSE of 2.65 kcal/mol for the “refined set”; the XP algorithm generated a correlation coefficient R of 0.41, a Kendall's tau of 0.28 and a RMSE of 3.19 kcal/mol for the “core set”, and a correlation coefficient R of 0.42, a Kendall's tau of 0.29 and a RMSE of 3.04 kcal/mol for the “refined set”. We summarize all of the Heatmap/GARF and Glide SP/XP docking and scoring results in Table 3 for ease of comparison. The GARF/Heatmap method, employed as an end-point method, showed comparable docking performance with both Glide SP and XP. In the case of the top-5-scored docked poses, the GARF/Heatmap method was able to generate binding modes closer to the crystal structures, according to the “Ave. lowest RMSD (Å) for top 5 docked poses” for both test sets (see Table 3). For scoring and ranking the binding affinities, the GARF/Heatmap method showed an overall better performance when compared to both Glide SP and XP for both test sets.

ACS Paragon Plus Environment

43

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 44 of 92

Table 3. Statistical Results for the GARF/Heatmap docking and scoring, compared with the

Glide SP and XP procedures. Docking results Methods

GARF/Heatmap

Glide SP

Glide XP

Core set (159 test cases) Ave. RMSD (Å) for the best docked poses

2.82

2.86

2.46

Ave. lowest RMSD (Å) for top 5 docked poses

1.93

2.17

2.27

Refined set (795 test cases) Ave. RMSD (Å) for the best docked poses

2.57

2.81

2.54

Ave. lowest RMSD (Å) for top 5 docked poses

1.62

2.19

2.39

Glide SP

Glide XP

Scoring results for the best docked poses Methods

GARF Score Core set (159 test cases)

Correlation coefficient R

0.62

0.39

0.41

Correlation coefficient Kendall's tau

0.44

0.30

0.28

RMSE (kcal/mol)

2.86

2.97

3.19

Refined set (795 test cases) Correlation coefficient R

0.68

0.46

0.42

Correlation coefficient Kendall's tau

0.50

0.33

0.29

RMSE (kcal/mol)

2.66

2.65

3.04

ACS Paragon Plus Environment

44

Page 45 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 11. Plot of GARF scoring vs. experimental binding affinity values (∆G) in kcal/mol. (A)

shows scoring against the 159 “core set” test cases using the experimental crystal structures; (B) scoring against the 159 “core set” test cases using the best scored GARF docked poses; (C) scoring against the 795 “refined set” test cases using the experimental crystal structures; (D) scoring against the 795 “refined set” test cases using the best GARF scored docked poses.

The validation results from the first benchmark revealed a general picture of the docking and scoring performance using the GARF energy function against a large number of protein-ligand

ACS Paragon Plus Environment

45

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 46 of 92

complexes with diverse structural features and binding data. However, regardless of how fast the end-state binding affinity prediction scheme may be, a potential function, which most scoring functions use, only predicts the internal energy change between the initial and final states, while the entropy changes are not properly incorporated using these methods because of the lack of a conformational sampling procedure to estimate entropy changes. As a fast estimation scheme, the end-state binding affinity scoring method against a general test set is not the best way to assess such an energy function, while, in the authors’ opinion, the better way to explore the ability of a potential based scoring function is to compare amongst protein-ligand complexes from the same protein families with congeneric ligands. This reduces, somewhat, the errors associated with the neglect of entropy effects. In light of this, we introduced our second benchmark focusing on congeneric ligands bound to the same receptor. Homologous proteins were selected from the first validation benchmark. Out of the two large test sets, we extracted 8 sets representing protein families: the heat shock protein (HSP) (14 test cases), mitogen-activated protein kinase (p38) (15 test cases), glutamate receptor (15 test cases), urokinase (22 test cases), protein-tyrosine phosphatase 1B (PTP1B) (25 test cases), thrombin (26 test cases), trypsin (61 test cases) and HIV-1 protease (133 test cases). In the prediction of binding affinity, we made ∆∆G comparisons by referencing the weakest binder in each test set, whose binding ∆Gs were shifted to 0 kcal/mol for both the experimental and calculated value, and the binding ∆Gs for all other test cases in the same test set were shifted analogously. Such an approach cancels out the common energy among the proteins, as well as the common energy error during calculation, which better illustrates the capability of the GARF energy function to capture the relative free energy change in each protein test set. The docking and scoring results are summarized in Table 4.

ACS Paragon Plus Environment

46

Page 47 of 92 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 4. Statistical results for GARF scoring against several protein families. heat shock proteins

p38

glutamate receptor

urokinase

PTP1B

thrombin

trypsin

HIV-1 protease

14

15

15

22

25

26

61

132

R (crystal structure scoring)

0.59

0.63

0.66

0.78

0.80

0.81

0.84

0.41

Kendall's τ (crystal structure scoring)

0.21

0.41

0.39

0.59

0.60

0.59

0.66

0.18

∆∆G RMSE (crystal structure scoring) in kcal/mol

2.65

1.55

2.62

1.54

1.53

1.68

1.43

3.03

ave. RMSD (docking) in Å

1.89

1.48

1.14

0.87

2.52

2.11

1.88

0.80

R (docked poses scoring)

0.65

0.78

0.60

0.71

0.51

0.75

0.80

0.40

Kendall's τ (docked poses scoring)

0.23

0.60

0.30

0.49

0.38

0.47

0.63

0.16

∆∆G RMSE (docked poses scoring) in kcal/mol

2.32

1.15

2.31

1.73

1.95

1.90

1.56

2.92

Homologous family no. of test cases

Without further classification referencing ligand similarities, the binding affinity prediction results for most of the homologous subsets, except for the HIV-1 protease set, are relatively good, with correlation coefficient R values higher than 0.5, and RMSE values for the ∆∆G’s