Protein–Ligand Informatics Force Field (PLIff) - ACS Publications

Jun 29, 2016 - Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, United Kingdom. ‡. Dipartimento Farmaco-Chimico ...
0 downloads 0 Views 1MB Size
Subscriber access provided by University of Sussex Library

Article

Protein-Ligand-Informatics force field (PLIff): towards a fully knowledge driven “force field” for biomolecular interactions Marcel L Verdonk, R. Frederick Ludlow, Ilenia Giangreco, and Prakash Chandra Rathi J. Med. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.jmedchem.6b00716 • Publication Date (Web): 29 Jun 2016 Downloaded from http://pubs.acs.org on July 5, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Medicinal Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

Protein-Ligand-Informatics force field (PLIff): towards a fully knowledge driven “force field” for biomolecular interactions Marcel L. Verdonk,*,ǂ R. Frederick Ludlowǂ, Ilenia Giangrecoǂ,§,† and Prakash Chandra Rathiǂ ǂ Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, United Kingdom § Dipartimento Farmaco-Chimico, University of Bari, Via Orabona 4, I-70125 Bari, Italy * Corresponding author † Current address: Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, United Kingdom

ABSTRACT The Protein Data Bank (PDB) represents a wealth of data on non-bonded biomolecular interactions, and if this information could be distilled down to non-bonded interaction potentials then these would have some key advantages over standard force fields. However, there are some important outstanding issues to address in order to do this successfully. This paper introduces the Protein-Ligand-Informatics “force field”, PLIff, which begins to address these key challenges. As a result of their knowledge-based nature, the next-generation non-bonded potentials that make up PLIff automatically capture a wide range of interaction types, including special interactions that are often poorly described by standard force fields. We illustrate how PLIff may be used in structure-based design applications, including interaction fields, fragment mapping and proteinligand docking. PLIff performs at least as well as state-of-the art scoring functions in terms of pose predictions, as well as in terms of ranking compounds in a virtual screening context.

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

INTRODUCTION Intermolecular interactions are at the heart of biological processes, and understanding and predicting such interactions is of paramount importance to many aspects of rational drug discovery. Force fields and scoring functions are typically central to any methodology attempting to predict non-bonded interactions. Molecular Mechanics (MM) based force fields (e.g. AMBER1 or OPLS2) have the advantage that they are designed to produce true potential energies, which means that they can be used for simulation-based methods like molecular dynamics. Therefore, in theory, these methods are able to capture the enthalpic as well as the entropic contributions to binding affinity. However, such calculations require long time scale simulations and rely heavily on the accuracy of the force field. Two important downsides to force fields are that the functional form of their energy terms is fixed, and that each new type of interaction requires separate parameterisation. Empirical scoring functions like the one developed by Böhm,3 the Chemscore function4 or Glidescore,5 on the other hand are much more coarse grained and are unable to provide precise predictions for specific interactions. However, most empirical scoring functions do have an entropic component built into their non-bonded potentials, allowing them to deal with some aspects of the hydrophobic effect implicitly. The wealth of X-ray structures available in the Protein Data Bank6 (PDB, www.rcsb.org) is a great resource of non-bonded interactions that are of direct relevance to the drug discovery field. If all this information could be distilled down into statistical interaction potentials, then in theory these would have two key advantages. Firstly, they would require virtually no parameterisation and, given enough examples in the PDB, all types of interactions would automatically be represented; and as the PDB continues to grow, the coverage of interaction types would further improve with time. Secondly, as the statistics directly reflect interactions in an aqueous environment, the potentials should implicitly cover some aspects of the hydrophobic effect (see, e.g. Boer et al.7). However, in practice it is far from trivial to convert the structural data from the PDB into potentials that are

ACS Paragon Plus Environment

Page 2 of 37

Page 3 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

meaningful and widely applicable. over the past 20 years, many examples of such statistical potentials, widely known as knowledge-based potentials, have been reported in the literature – e.g. PMF,8 DrugScore,9 ASP,10 and DSX11 – and they are a popular choice for protein fold predictions and virtual screening. It is beyond the scope of this work to review all the methodologies described in the literature, but we feel that there are some key outstanding issues that, to the best of our knowledge none of them fully address. One of these issues is the choice of the reference state, which traditionally is a distance-dependent function that can introduce a significant bias. Also, secondary contacts contributing to the potentials introduce further bias. Additionally, in the vast majority of cases, only distance-based potentials are provided, i.e. the potentials are independent of the relative orientation of the interacting functional groups. Finally, assigning atom types for protein-ligand complexes is highly error prone. Here, we introduce a set of next generation statistical non-bonded interaction potentials, which together form the Protein-Ligand Informatics “force field” (PLIff). For the construction of the potentials, we addressed each of the key outstanding issues outlined above: (i) Distance-dependent bias introduced by the reference state is avoided by separating the geometry parts of the potentials from the atom-type specific part; (ii) Secondary contacts are removed by using a Voronoi tessellation algorithm; (iii) As well as distance-dependent potentials, PLIff also contains potentials that depend on the relative orientation of the interactingons groups; (iv) Sophisticated atom typing protocols are used – including the automatic assignment of charge states, tautomeric forms and side chain flips – as well as an elaborate atom-type generalisation scheme to optimise the use of atom-type-specific data. PLIff clearly is not a force field in the traditional sense, in that it is not designed to produce true potential energies. However, its potentials are highly atom type specific and have full angular dependence, which are characteristics one would associate with a force field. PLIff is a surface-area based function, where only primary contacts are considered, and its potentials are smooth, interpretable and make intuitive sense. The PLIff source code is publicly available from https://bitbucket.org/AstexUK/pli. As the name suggests, the PLIff potentials are derived from

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

interactions between proteins and ligands as observed in PDB X-ray structures, so its main applications will be in structure-based drug design. Here, we show that the PLIff potentials capture a wide range of interaction types, including special interactions that are often poorly characterized in MM force fields. In addition, we discuss the use of PLIff for generating interaction fields and for fragment mapping, and we validate the use of PLIff in protein-ligand docking and virtual screening.

METHOD The section below describes the fundamentals of the methodologies we used to derive the nonbonded PLIff potentials and how they are combined to calculate the non-bonded PLIff energy. More detailed descriptions of our methodologies are provided in the Supplementary InformationSupplementary Information.

A

B

C

E

F

ACS Paragon Plus Environment

Page 4 of 37

Page 5 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

D Figure 1. Deriving PLIff potentials. (A) A Voronoi tessellation algorithm identifies the primary interactions between protein and ligand. (B) The methodology introduced by McConkey and colleagues projects Voronoi contact areas on the Van-der-Waals sphere around each atom, and caps solvent exposed parts with a spherical 12

cap; Note that the Voronoi tessellation removes the secondary contacts between atom 1 and atoms 3 and 4; secondary contacts cause significant bias in most standard knowledge-based potentials as they result in energy minima that do not represent directly the interaction between two contacting atoms. (C) Definition of the nonbonded contact geometric parameters R, α1 and β1 for a secondary amide NH contacting a water molecule; R is the distance between the nitrogen atom and the water oxygen atom; α1 and β1 are the polar coordinate angles of the NH→O vector with respect to the two axes u and v, u being the direction of the N-H bond and v being the normal to the plane through the C-N(H)-C system. (D) Heat map for the type propensities ( ln  ). Atom types are only included here if there is sufficient data to calculate the propensities with reasonable precision. Highly favourable contacts are shown in dark blue, whereas highly unfavourable contacts are shown -1

-2

in dark red. The colour legend is also shown (kcal.mol .Å ). A complete spreadsheet with all type propensities is provided in the Supplementaryy InformationSupplementary Information. (E) and (F) Geometry propensity functions for R and α1, respectively, for four different types of contacts; For >NH(am) and >CH(ar) atoms, α1 is …

the angle between the N/C-H vector and the N/C X vector (where X is the position of the contacted atom); For …

CH3 atoms, α1 is the angle between the C-CH3 vector and the CH3 X vector. Atom type definitions are provided in the Supplementary InformationSupplementary Information.

Atom typing. Atom types covering all key functional groups occurring in proteins and ligands were defined, and this list of atom type definitions is provided in the Supplementary Information. In brief, the atom typing process involved the following steps: (i) Bond types were assigned using a combination of the PDB dictionary files and a set of predefined rules; (ii) For amino acids, the atom name and residue name were checked against a lookup table; (iii) If no atom type had been assigned yet, the atom was checked against a number of tautomeric atom types (for example an N atom in an imidazole ring); these are ambiguous atom types for which it is not a priori known whether the atom

ACS Paragon Plus Environment

Formatted: Space After: 0 pt

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

is a donor or an acceptor; (iv) If no atom type had been set yet, the atom was checked against a list of atom type definitions, based on connectivity rules; (v) Ambiguous, tautomeric atom types were assigned by assessing the hydrogen-bonding networks they are involved in; this also involved flipping of histidine, asparagine and glutamine side chains in cases where this clearly improves hydrogen-bonding.

Data set. All protein-ligand PDB structures with a resolution of 3Å or better and that passed our additional quality filters (see Supplementary Information) were processed. All hydrogen atoms in the structures were ignored. All atoms from symmetry related molecules within 10Å were added to the structure. Atom types were assigned to all atoms in each included structure using the procedure described above. Next, a Voronoi tessellation procedure was run on all unique atoms (i.e. excluding atoms generated by expanding the crystallographic symmetry) in the structure. Voronoi tessellation partitions a set of points in space, giving each their own polyhedron volume. This method can be applied to proteinligand complexes, where the atoms are the points (see Figure 1A). The approach we used here is the one published by McConkey and colleagues12 (see Figure 1B). Briefly, the method applies the following steps to each atom in the system: (i) find all atoms in the vicinity of the central atom; (ii) place planes between the central atom and each of the surrounding atoms; (iii) find all intersecting lines and points between the different contact planes – together these make up the polyhedron containing the central atom; (iv) remove contacts to atoms for which the contact plane falls outside the polyhedron; (v) cap solvent exposed parts of the polyhedron with a spherical cap. Each face of the polyhedron now represents a contact of the central atom with another atom, as well as the surface area involved in the contact. Because the Voronoi method only produces the faces (contacts) between atoms that are in direct contact, it is very effective at removing secondary contacts (see

ACS Paragon Plus Environment

Page 6 of 37

Page 7 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

Figures 1A and 1B). The McConkey approach has the additional advantage that it also provides the solvent exposed surface area of each atom (see Figure 1B). Statistics were recorded for protein-ligand contacts as well as for protein-water, ligand-water, ligand-ligand and water-water contacts, and include contacts to symmetry-related molecules. When an atom is involved in a contact, three parameters – the polar coordinates R, α1 and β1 – are used to define the position of the second atom, relative to the first (see Figure 1C); a further two parameters – α2 and β2 – are used to define the location of the first atom, relative to the second atom. Depending on the atom types, β1 and/or β2 may be omitted; For example, if the first atom is a halogen atom, β1 is omitted (see also Tables 3.1 and 6.1 in the Supplementary Information). Hence, each contact is characterised by its contact area and up to five geometry parameters (R, α1,

β1, α2 and β 2). PLIff functional form. PLIff aims to provide complete, interpretable, fully knowledge-derived potentials for non-bonded interactions. The key features of PLIff are (i) PLIff is entirely derived from structural data on protein-ligand complexes; (ii) Only primary contacts, determined via a Voronoi tessellation algorithm, are considered; (iii) Each primary contact is weighted according to the surface area involved in the contact; (iv) The potentials describe the full geometric dependence of the interactions; (v) All hydrogen atoms are implicit. Mostly for reasons of convenience we will refer to the PLIff terms as “energies”. In its simplest form, the non-bonded PLIff energy for a system is defined as:

 =  ()

(1)



where the summation is over all  primary non-bonded contacts in the system and  () is the non-bonded PLIff energy for contact c. In the PLIff formalism the non-bonded energy for a contact between two atoms is defined as:

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

 =  ∙  ∙ ln 

Page 8 of 37

(2)

where  is the surface area involved in the contact and  is the non-bonded contact propensity for the contact. k is a global proportionality constant, which was derived to be -0.0034 (kcal.mol-1.Å-2) from a regression against >4000 protein-ligand complexes with affinity data from our internal database. This functional form is analogous to the one used by McConkey and colleagues,13 with the difference that our version also describes the geometry of the interaction. It is useful to split  into two parts as follows:

 (, , ) =  (, ) ∙  (|, )

(3)

where  and  are the atom types of the two atoms involved in the contact and  represents the

geometry of the contact.  (, ) is the type propensity for the contact, i.e. the propensity to observe

a contact for this combination of two atom types, regardless of the contact geometry.  (|, ) is

the geometry propensity for the contact, i.e. the propensity that a contact has geometry  , given

that the contact is between two atoms of types  and . A key advantage of separating the type

propensity from the geometry propensity is that it eliminates the need for a geometry-dependent reference state, and hence prevents significant biases in the geometry propensityterms. Type propensities. Analogously to the definition used by McConkey and colleagues,13 the type propensity for contacts between atoms of types  and  is simply defined as:

 (, ) =

 (, )  !" (, )

(4)

where  (, ) is the total observed surface area, summed over all contacts in the PDB between

atoms of types  and . 

!" (,

) is the total expected surface area for contacts between atoms of

types  and . The type propensities are derived from surface areas, rather than from simple contact counts, in order to weigh each contact according to its contact area, and to make the propensities consistent with the way they are used in equation (2).  (, ) and 

!" (,

) can be written as:

ACS Paragon Plus Environment

Page 9 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

      (, ) =  #$ (, ) + "& (, ) + &" (, ) + $# (, ) + &$ (, ) + $& (, )



!" (,

!" !" !" !" !" !" ) = #$ (, ) + "& (, ) + &" (, ) + $# (, ) + &$ (, ) + $& (, )

(5a) (5b)

where  #$ (, ) is the total observed surface area of protein atoms of type i that is buried by

forming contacts to ligand atoms of type j, and  "& (, ) is the total observed surface area of ligand atoms of type j that is buried by forming contacts to protein atoms of type i (i.e. these two terms both reflect the same set of contacts, but the first term represents the protein surface area and the second the ligand surface area). The remaining terms in equation (5a) are defined analogously, and each term has a counterpart in equation (5b), where instead of an observed surface area, each term now represents an expected surface area. Note that we do not keep separate sets of atom types for protein and ligand, making the type propensity matrix symmetric (see below). Note also that we include ligand-ligand contacts. This generally only makes a small contribution to the overall statistics, but it does mean that we can also generate type propensities for certain ligand specific contacts like Cl…Cl. The observed surface areas in equation (5a) are directly obtained from our analysis of all proteinligand complexes in the PDB. The expected surface area terms in equation (5b) are derived from the total available surface area for the two atom types, relative to other atom types. The method assumes that the protein molecules in the crystal lattice are fixed entities between which ligands and water molecules can move around and pick interaction partners. Hence, we do not consider protein-protein contacts, but we do consider protein-ligand, protein-water, ligand-water, ligandligand and water-water contacts. For contacts between atom types i and j: !" #$ (, ) = '# () ∙ '& () ∙ (&) ∙ ( ∙ #

(6a)

!" "& (, ) = '# () ∙ '& () ∙ (#$) ∙ ( ∙ &

(6b)

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

!" &$ (, ) = '& () ∙ '& () ∙ (&") ∙ ( ∙ &

Page 10 of 37

(6c)

where # and & are the total available protein and ligand surface areas, respectively. '# () is the

fraction of # that is of atom type i, . '& () is the fraction of & that is of atom type i, etc. ( is

fraction of the overall surface area that is involved in the formation of contacts. Finally, (&) , (#$)

and (&") are defined as follows:

(&) =

& & + *

(#$) = (&") =

(7a)

# # + & + *

(7b)

& # + & + *

(7c)

!" !" where * is the total available area of explicit water molecules. The &" (, ), $# (, ) and

!" !" !" !" $& (, ) terms in equation (5b) are defined analogously to #$ (, ), "& (, ) and &$ (, ),

respectively. More detail, in particular on calculating type propensities for contacts involving water molecules, is provided in the Supplementary Information. When, for a given combination of atom types, there is insufficient data in the PDB to calculate the type propensity, an elaborate atom type generalisation scheme is used. This scheme relies on a number of pre-defined atom typing groups, ranging from very specific (e.g. a secondary amide oxygen) to very generic (e.g. any acceptor). For each combination of specific atom types, data from increasingly more generic atom types is mixed in, using a weighting scheme that depends on the already achieved precision in the type propensity. This scheme is designed to achieve a pre-defined level of precision in the calculated type propensities, whilst keeping the generalised atom types as close as possible to the original atom types. Atom type generalizing and weighting schemes are described in more detail with an example in the Supplementary Information.

ACS Paragon Plus Environment

Page 11 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

Geometry propensities. For all combinations of atom types, separate histograms were produced for each recorded geometric parameter (up to five: R, α1, β1, α2, β2). Apart from some empirical corrections (see below), we treat each geometric parameter independently, which is an unavoidable approximation as in the vast majority of cases there would be insufficient data to produce multidimensional histograms. Each histogram for geometric parameter + is then converted into a geometry propensity function:

 (+|, ) =

' (, , +)

'

!"

(, , +)

(8)

where ' (, , +) is the frequency with which contacts between atom types  and  are observed to

have geometry +; ' (, , +) has been corrected to remove bias originating from the same ligand interacting with the same protein multiple times (see Supplementary Information). '

!"

(, , +) is

the corresponding expected frequency, and is effectively the average frequency over the accessible range of this geometric parameter, corrected for geometric factors where necessary, i.e.:

'

!"

(, , +) =

,(+) ∙ -+

!/!1(2,3)

.

!/!(2,3)

,(+ / ) ∙ -+′

(9)

where ,(+) is a geometrical factor that describes what the distribution is expected to look like if the data were randomly distributed. +1(, ) and +2(, ) represent the expected accessible range of the

parameter x. Radial distribution functions normally need to be normalised for the increasing volume available to surrounding atoms as R increases (i.e. ,(6) = 6 1 ). However, because we measure the total contact area that is projected onto a sphere of constant radius, this correction is not applicable here. Hence we use ,(6) = 1. The β terms are also expected to be uniform, so we also use

,(7) = 1 for both β1 and β2. The α terms on the other hand need a conical correction to account

for the fact that as α gets smaller, the accessible area becomes smaller, hence ,(8) = sin 8 for both

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 37

α1 and α2. The range parameters +1(, )and +2(, ) are set empirically, and depend on the geometry – and therefore the type – of the atom. It is clear from equations (4) and (8) that, whereas thea type propensity for contacts between atom types  and  is externally normalised relative to the propensity of observing contacts between other atom types, the geometry propensity is internally normalised against the propensity of observing other geometries for contacts between the same atom types. Atoms that have high B-factors, occupancies less than 100%, or for which there is insufficient support in the electron density map, were excluded from the geometry propensity analyses. A similar atom type generalisation scheme is used in cases where there is insufficient data for the specific atom types. Each geometry propensity function is then smoothed using an algorithm that is described in detail in the Supplementary Information, but briefly works as follows. A Gaussian function with a given width, σ, is used to smooth the raw distribution, which is stored at high granularity. The algorithm starts with a very small value for σ, which is then iteratively increased until one of the following criteria are fulfilled: (i) The average standard error in the propensities in the smoothed distribution is below 5%, or (ii) The smoothed distribution differs by more than three standard deviations from the original raw distribution, or (iii) σ exceeds a maximum, predefined value (0.3Å for R distributions and 15° for α and β distributions. This process generally results in smooth, interpretable and well behaved functions that balance the precision of the propensities against the consistency with the original raw distributions. To combine the geometry propensity functions for the (up to) five individual geometric parameters, we define the geometry propensity function for a contact between two atoms of types  and  as:  (|, ) =  (6|, ) ∙  (81|, , 6) ∙  (71|, , 6, 81) ∙  (82|, , 6) ∙  (72|, , 6, 82)

It can be seen from equation (10) that the geometric propensity functions for the individual geometric parameters are multiplied to obtain the overall geometric propensity. Note, however,

ACS Paragon Plus Environment

(10)

Page 13 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

that the α and β terms are not the original functions from equation (8). Instead, they have been corrected to account for cross-parameter correlations. For example, any dependency on α1 can be expected to become weaker with increasing R, and we apply the following empirical correction to account for this effect:

ln  (81|, , 6) = ℎ< (6) ∙ ln  (81|, )

(11)

where  (81|, ) is the geometry propensity for 81 at short distance, and ℎ< (6) is a simple attenuation factor which we define as:

1, ℎ< (6) = =(62< − 6)⁄(62< − 61< ) , 0,

6 < 61< 61< ≤ 6 < 62< 6 ≥ 62
NH(am)) than it is for contacts between two methyl groups (-CH3). A clear shortening of the preferred contact distance and a sharpening of the propensity function is seen for contacts between aromatic CH donors (>CHD(ar), these are aromatic CH atoms adjacent to nitrogen atoms in aromatic rings) and amide oxygens, in comparison to CH(ar)…=O(am) contacts, confirming the weak H-bonding donating potential of >CHD(ar) atoms.

ACS Paragon Plus Environment

Page 14 of 37

Page 15 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

To illustrate the effect angular preferences can have on the PLIff potentials Figure 1F shows the α1 geometry propensity functions for the same four types of contacts shown in Figure 1E. For -CH3…CH3 contacts, the propensity function is almost uniform, but for >NH(am)…=O(am) contacts, by contrast, there is a clear preference to occur at α1=0°. This is consistent with a hydrogen-bond formation where the amide proton points directly towards the amide oxygen atom. A similar, but less pronounced preference is also seen for >CHD(ar)…=O(am) contacts and the trend is even weaker for >CH(ar)…=O(am) contacts. Figures 1D, 1E and 1F show that, due to the knowledge-based nature of the PLIff function, many types of interactions and their geometrical preferences are all represented automatically. As long as there is sufficient data in the PDB, any interaction will be “parameterised” automatically within PLIff. This is a significant advantage over standard force fields where each new type of interaction needs separate parameterisation. Even interactions like CH…O bonds, which are omnipresent in biological systems, are rarely parameterised adequately in existing force fields. We will illustrate the automatic coverage of a wide range of interaction types further using two examples: (i) the hydrogen-bond accepting strengths of different atom types and (ii) an example of a special interaction: halogen bonds. Two key characteristics of hydrogen bonds are that they are directional and that they occur at distances shorter than their VdW radii would suggest. Figure 2 plots both these effects for >NH(am) atoms interacting with a range of atom types. We used the geometry propensity at α1=0° as an indicator of how directional the interaction is (see Figure 1F). The shortening of the optimum contact distance was derived from distance propensity functions like those displayed in Figure 1E; For each atom type its optimum contact distance to a standard hydrogen bond donor (amide >NH) is compared to its optimum contact distance to a standard non-hydrogen bond forming atom type (CH3). There is a good correlation between distance shortening and directionality, consistent with hydrogen bonds of increasing strength. The position in the graph in Figure 2 is indicative of the

ACS Paragon Plus Environment

Journal of Medicinal Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

hydrogen bond accepting strength of the different atom types plotted. The order of the atom types is in line with experimental hydrogen bond acceptor strengths14 and those calculated by high level ab initio calculations,15 indicating that the acceptor strengths are captured effectively by the PLIff potentials.

Figure 2. Hydrogen-bonding strength captured in PLIff. Analysis of the hydrogen-bond accepting nature of a number of selected atom types X. Results shown are for NH(am) X contacts. The shortening, ∆∆G , of the …

optimum contact distance is plotted against the geometry propensity for contacts to occur at 81 = 0°. ∆∆G is defined as ∆∆G = ∆G (CH3…X) − ∆G (NH(am)…X), where ∆G (NH(am)…X) = G (NH(am)…X) − ENH(am) − EX and ∆G (CH3…X) = G (CH3…X) − ECH3 − EX . Atom type definitions are provided in the Supplementary

Information.

An example of a special type of interaction that is observed quite regularly in drug-protein interactions are halogen bonds, which are short, favourable interactions between chloro, bromo or iodine substituents and electronegative atoms, e.g. carbonyl oxygen atoms (see for example

ACS Paragon Plus Environment

Page 16 of 37

Page 17 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

Scholfield et al.16). These interactions are primarily driven by electrostatics and are directional in nature because of the so-called sigma hole – an electropositive area located on the halogen atom, opposite the point where it is attached to its parent carbon atom. The sigma hole becomes more electropositive as the halogen atom becomes more polarisable, i.e. -Cl