Knowledge-Based Optimization of Molecular ... - ACS Publications

Mar 15, 2016 - investigated with iterated racing using irace,22 a freely available ... entries via ConQuest or by using the CSD-System Python API. The...
0 downloads 0 Views 2MB Size
Subscriber access provided by CITY UNIV LIB

Article

Knowledge Based Optimization of Molecular Geometries using Crystal Structures Jason Cole, Colin R. Groom, Oliver Korb, Patrick McCabe, and Gregory P Shields J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00712 • Publication Date (Web): 15 Mar 2016 Downloaded from http://pubs.acs.org on March 20, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Knowledge Based Optimization of Molecular Geometries using Crystal Structures Jason C. Cole, Colin R. Groom, Oliver Korb, Patrick McCabe*, Gregory P. Shields Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, United Kingdom

KEYWORDS Cambridge Structural Database, CSD, knowledge base, bond length, valence angle, torsion angle, PDF, minimization, optimization

ABSTRACT

This paper describes a novel way to use the structural information contained in the Cambridge Structural Database (CSD) to drive geometry optimization of organic molecules. We describe how CSD structural information is transformed into objective functions for gradient based optimization to provide good quality geometries for a large variety of organic molecules. Performance is assessed by minimizing different sets of organic molecules reporting RMSD movements for bond lengths, valence angles, torsion angles and heavy atom positions.

ACS Paragon Plus Environment

1

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 39

Introduction The Cambridge Structural Database (CSD1) contains hundreds of thousands of crystal structures with molecules whose atomic coordinates have been reliably determined and so represents an invaluable source of molecular geometry information. In Mogul2, the geometric data implicit in the CSD has been developed into multiple libraries including libraries for bond lengths, valence angles and torsion angles, and presented in the form of histograms. This paper describes a method that utilizes the raw geometric data to generate probability density function (PDFs), using the method of kernel density estimation (KDE)3. The generated PDFs are continuous and smooth functions. Peaks in the PDFs are indicative of favorable values for the underlying geometric attribute, reflecting the high frequency of occurrence of this attribute with this value. Conversely, regions where the PDFs are smaller in value represent regions where the frequency of occurrence is lower and correspond to less likely values for the geometric attribute. Thus, the PDFs contain information about the average geometric preferences of a molecule according to the data contained in the CSD. In principle, it should be possible to exploit this information in an optimization framework by converting the PDF into an objective function which can be supplied to an optimization (minimization) algorithm. We attempt just that. One of the most common methods for generating good geometries of small organic molecules is energy minimization using a force field. Numerous force fields exist based on energetics (e.g. Tripos4, MMFF945, GAFF6.) Knowledge-based potentials also exist relating to protein folding7 and to protein-ligand complexes8. These generally make use of statistical Boltzmann type equations and/or radial pair distribution functions for calculating energetics of the differences between an object (protein) and a reference state of that object, the reference state often being

ACS Paragon Plus Environment

2

Page 3 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

difficult to define. An intramolecular knowledge based field (KBF) based upon distributions of internal degrees of freedom, as opposed to radial pair distribution functions, and which makes no assumptions about functional forms is developed. Such a KBF would provide an additional and perhaps valuable tool in the computational chemist’s toolbox for generation of optimal molecular structures. In this context, optimum would be according to the large array of structural distributions contained within the CSD. Mogul2 is well known for providing structural information in the form of histograms for bond length, valence angles and torsion angles. A user may compare his or her molecule of interest against the Mogul distributions9,10. This may highlight structural features in the molecule that lie in regions which are significantly displaced away from the most populated regions. This “molecular health check” can signal poorly modelled molecules, errors and/or regions of strain. The latter case can be used to indicate opportunities for optimization of molecular geometry in the context of molecular recognition. Of course, structural features in a molecule are often interdependent. For instance changing a particular valence angle (e.g. atoms A or B change positions in valence angle A-B-C) may cause a corresponding change in another valence angle (A-B-D). It is entirely possible to have competing objectives in this sense where the optimum for one structural feature may cause another feature to be sub-optimal. A chemist’s intuition may provide the ideal answer in some situations but it is of course not practical for a chemist to individually examine every bond length, valence angle and torsion angle in every molecule. Also, given the existence of large repositories of high quality data such as the CSD it would be prudent to develop methods that make use of this data to guide the decisions about optimality. Many techniques exist for algorithmically determining solutions to optimization problems11,12. Local optimization methods include interval reduction methods, gradient descent

ACS Paragon Plus Environment

3

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 39

methods (e.g. Newton-Raphson, BFGS) etc. Global optimization methods include the various branch and bound methods, simulated annealing, evolutionary methods (e.g. genetic, ant colony) etc. It is the intent of this paper to combine the data in the CSD and a local optimization algorithm, the gradient based Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, and describe a methodology for automatically finding a locally optimum structure. It is important to note this CSD-KBF is not a traditional force field based upon energies. We deliberately do not use the term “energy” or “force” since the objective functions employed are knowledge based, not energy based. Some correlation may be expected between energy minima and populated regions of geometric space which provides additional rationale of the method, see for example the paper by Allen et al.13, Figure 10 in the paper by Cruz-Cabeza and Bernstein14 and Figure 3 in the paper by Kazantsev et al15. These figures show explicit examples of ab initio energy minima coinciding with maxima in CSD distributions.

Materials and Methods Objective Functions Transformation of a PDF into an objective function could be achieved in many ways. For example, taking the negative of the PDF would generate a well in the regions corresponding to high frequency of occurrence. A minimizing algorithm could then be applied to drive the geometric attribute to the minimum of the well, as in traditional energy minimization methods. This particular approach would however suffer from the drawback of exerting little influence in extended regions of low probability (close to zero) density. Whereas, in reality, such regions represent very unlikely domains. There should, in fact, be a strong influence from any such

ACS Paragon Plus Environment

4

Page 5 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

objective function in this case. A transformation that overcomes this difficulty is achieved by taking the negative of the (natural) log of the PDF i.e. − log  ( ) and several figures

illustrating this transformation will be presented in later sections of this paper (see Case Study Tolfenamic Acid). In this case the regions of probability density closer to zero are transformed to very large positive values. Regions of high probability density are transformed in to wells. Given these two properties of the negative log transformation we chose to adopt this to transform the PDFs into objective functions. Since the PDFs themselves are smooth and the transformation generates smooth functions, the objective functions so generated are differentiable and can be used in gradient-based minimization frameworks. The data used to generate the PDFs comes from the Mogul libraries. Mogul in fact contains four data libraries, one for bond lengths, one for valence angles, one for torsion angles and one for unfused, unbridged rings. Recently a customized version (“CV”) was developed for the highthroughput generation of conformers16. The customized version of the Mogul libraries is used in this study. The CV contains multiple libraries of molecular dimensions, built from CSD entries published before 2011, limited, for performance reasons, to 500 observations for bond lengths, 250 for valence angles and 1000 for torsion angles (more precisely “rotamers” described below.) The multiple libraries form a hierarchy with respect to precision ranging from the most precise description of the fragment to the most generic description. In the CV there are four bond-length libraries and four valence-angle libraries. The CV also introduced the notion of “rotamer” libraries where each central bond X-Y in the fictitious fragment RA(RB)X-Y(RC)RD contributes just one distribution instead of four torsion distributions. Taking bond lengths as an example and a molecule with bonds then in principle it is possible to extract up to bond length distributions from the bond length libraries. Each distribution would have a corresponding PDF

ACS Paragon Plus Environment

5

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 39

for the bond in question and each PDF would be transformed into an objective function.

Denoting the length of the  bond by  , the sum total of these transformed PDFs would form the overall bond length objective function  ( , … .  ) which is a function of each bond length.   , … ,   =  − log  !" #$% ()( &$'

#$% &$')

(1)

Similarly, using the same notation, we can form overall objective functions for valence angles

( + ), torsion angles ( , ) and out-of-plane atom distances (  ) giving the objective functions

below,

-.&$/ + , … , +0  =  −1 log  !"-.&$/ (+)( .$'&

-.&$/ .$'&)

.$'&

#3)#$ , , … , ,4  =  −5 log  !"#3)#$ (,)( .$'&

#3)#$ .$'&)

##7  , … , 8  =



#; #< 7&.$ .#=)

−9 log  :!"

.$'&

#; #< ()> 7&.$ .#=

(2)

(3)

(4)

where each  is a weighting factor. In formula (3) the summation includes objective functions

based on rotamer distributions and objective functions based on the remaining torsion angles in

ACS Paragon Plus Environment

6

Page 7 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

the molecule. In this work, out-of-plane distributions for 3-coordinate atoms were modelled with a single normal, i.e. Gaussian kernel, distribution. The out-of-plane distributions for acyclic nitrogen atoms exist in the form of two-dimensional data but this study concentrates entirely on one-dimensional distributions. In this first model such atoms were not assigned out-of-plane objective functions. Of course the Mogul library data represents local information; atoms and groups far from the geometric attribute under consideration could be quite different from molecule to molecule. Application of local information without considering the global conformation may generate molecules with unrealistically short interatomic distances. Thus there is a potential for intramolecular clashes and a term is required to model the clash. We adopted the following, non-probabilistic, simple functional form to model the intramolecular clash between atom pair i-j /&.)

A 1 = @  A B1 − 1 D , A ≤ CA CA /&.) 7.3)

1

(5)

where @ is an atom independent weighting factor, A is a weighting factor in principle

dependent on the atom pair, CA is a cut-off range (sum of VDW radii – 0.2) and A the distance

between atom i and atom j. VDW radii are taken from Bondi17, except the value for H, which is

taken from Rowland and Taylor18. Radii that are not available in either of these publications have

VDW radius = 2Å. This functional form represents a purely repulsive interaction of finite range.

For A ≥ CA the clash value is zero. A typical clash profile is displayed in Figure 1.

ACS Paragon Plus Environment

7

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 39

Figure 1. Typical profile of a clash function for one clash pair with @ = 1, CA = 2, A = 50.

See equation (5).

We can form an overall CSD probabilistic knowledge-based intramolecular objective function determined by the bond length, valence angle, torsion angle and out-of-plane terms, and an additional term to control clashing interactions.

 , … ,  , + , … , + - , , , … , ,  ,  , … ,  #  =  #$%  , … ,   J -.&$/ + , … , +0  &$'

.$'&

J #3)#$ , , … , ,4  J ##7  , … , 8  J /&.) .$'&

(6)

ACS Paragon Plus Environment

8

Page 9 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The KDE approach to approximating the underlying PDF centers a suitable kernel function at each observation with an appropriate bandwidth. This procedure was described in detail in an earlier publication3 and the reader is referred there for further details. For those bond lengths and valence angles involving terminal hydrogen bonds, being so numerous in the database, the underlying observations are not exposed by the Mogul libraries but instead a mean and standard deviation are supplied which summarize the distribution with just two numbers. This is of course less informative than the actual observations, but PDFs can still be constructed from these parameters. In this case the PDF for the population is modelled as a normal (Gaussian) distribution centered on the mean and with standard deviation equal to the sample standard deviation. In other instances the Mogul libraries may provide no data about the geometric attribute in question. This may not imply there are no corresponding instances in the CSD but rather that the data has not been included in the Mogul library. This can occur for several reasons. There may be too few observations in the data, implying they cannot be considered as significant, or the appropriate search has not been carried out. One advantage of this knowledge based approach to molecular structure optimization is that “missing” data can be included as it becomes available in the database from which the geometry libraries were generated (the CSD in this case). Searches that may not have been carried out can be performed as and when they become apparent. An important question to address is what strategy to adopt if there are neither observations nor sample parameters available upon which to model a PDF and thus generate an objective function. We have adopted the strategy of using the value of the geometric attribute contained in the input molecular structure, with a small degree of variation permitted around that value. Figures will be presented in later sections quantifying the number of PDFs based on the input structure alone.

ACS Paragon Plus Environment

9

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 39

Each PDF based on actual observations is constructed by placing a suitable kernel function at every data point and summing over the kernels. A further step necessary for gradient-based minimization is the evaluation of the derivative of each kernel function and a further summation over the derivatives. When many data points are involved this is inefficient and to overcome this piecewise cubic splines are subsequently constructed to model the KDE of the PDF. Cubic splines are used for calculating the KDE of the PDF for all internal coordinates except out-ofplane distances. Analytic derivatives19,20 with respect to the Cartesian coordinates of the atoms for each objective function have been implemented and we used a limited memory version of the commonly used Broyden-Fletcher-Goldfarb-Shanno (BFGS) minimization algorithm21 (denoted as L-BFGS). Solutions to the optimization problem can be determined to different degrees of

accuracy. The parameter epsilon, L, supplied to the L-BFGS code controls this accuracy. A

minimization will terminate when the norm of the gradient vector is sufficiently small on a scale defined by L. Initially the value of L is set to 0.01. However, if the optimization fails to converge

at this level, L is relaxed geometrically, doubling in value, and a further attempt at optimization

performed, up to an upper limit currently set to 10. All of the structures in the two test sets

described below optimize using the CSD-KBF with L = 0.01 so allowing a geometric relaxation of L may be an unnecessary precaution, in addition it may be questionable how optimal the

structures would be if many doublings of L had been required to achieve successful termination of the optimization algorithm.

Equations (1)-(4) above contain parameters  which can be used to control the

weightings of the various objective functions. It may be argued that given each term is a summation over probability density functions that each PDF is naturally scaled and no further

ACS Paragon Plus Environment

10

Page 11 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

weighting is required. This premise was investigated with iterated racing using irace22, a freely available R23 package. Data Sets Several sets of molecules and one specific molecule were used for investigating the utility of this knowledge based optimization approach to molecular geometry. One set comprised druglike molecules extracted from the CSD and a matching conformation generated by Corina24 version 3.49. Properties used to define druglike are given in the Supporting Information. The druglike set comprised 5990 molecules and was randomly subdivided into a training set of 3594 molecules and a test set of 2396 molecules. The training set was used in experiments to

determine optimum  values and the test set was used for assessing the outcome of applying this

knowledge based field. In 1989 the general purpose Tripos force field, which has been widely used6, was described and validated4. We used molecules based on the set described therein as

another set. Three molecules were not included from the original Tripos set. The published diagram of AAXTHP in the Tripos paper4 appears to have an extra CH2 group compared to AAXTHP in the CSD. In order to remove any ambiguity in results, AAXTHP was excluded from this study. ACTHCP could not be located and ABINOR02, after addition of H atoms, was a different enantiomer of ABINOS01 so both ACTHCP and ABINOR02 were also excluded. The resulting set of 73 molecules, which will be referred to as the Tripos set, is in fact very similar to the 74 molecules used by Wang et al.6 The original structures corresponding to the 73 molecules in the Tripos set can be downloaded freely through the CSD-Community Access Structures service at www.ccdc.cam.ac.uk. Alternatively, users of the CSD-System can extract entries via ConQuest or by using the CSD-System Python API. The refcode list is provided in the

Supporting Information. First we describe the experiments performed to determine the  values ACS Paragon Plus Environment

11

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 39

and then describe the application of the resulting knowledge based field to test sets and describe some individual cases. CSD-KBF minimized coordinates are also provided in the supporting information as mol2 files. Refcodes of the 5990 compounds in the druglike set including training set/test set indicator and RMSD values obtained are also included in the Supporting Information. Parameter Optimization Automatic configuration of optimization algorithms can be achieved using irace, finding the most appropriate settings given a set of input instances of a problem. The goal is to find parameter sets that perform well over a large set of instances. A cost measure is required by irace which assigns a cost value to one run of a particular algorithm configuration on a particular instance. We used both heavy atom overlay RMSD and combinations of RMSDs of internal degrees of freedom as cost measures. The definition of internal degree-of-freedom RMSD will be given below. The basic idea is to minimize a molecule using the knowledge based field and compare the minimized structure against a reference structure. The reference structure is the CSD crystal structure. Assuming the reference molecule represents a local minimum then the knowledge based field should generate a new molecular configuration very close to the reference measured in terms of an RMSD (either heavy atom or based on internal degrees of freedom). The druglike training set of Corina generated conformations were minimized and overlaid on to the original un-minimized CSD molecule using heavy atom RMSD as the cost measure. Additional experiments with the cost measure taken as different combinations of internal RMSDs, such as the sum of bond length plus valence angle plus torsion RMSDs or a weighted sum of these RMSDs, were performed. A Lennard-Jones functional form for the clash was also investigated in the irace experiments. A clash is an unfavorable, repulsive interaction,

ACS Paragon Plus Environment

12

Page 13 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

so on physical grounds the Lennard-Jones form with its attractive well was not used further in this study. The internal degree-of-freedom RMSDs and “Maximum”, reported in the Results and Discussion section, for a set of molecules are defined as follows. For bond lengths MNO"() = P

1

#$%)

$V4XYZ4YX[V $8TUV(V)





)W

W

(=$=)%) 1

Q

− 

(/3R).&)

S

(7)

where ])3;/;3) is the number of molecules in the set, and the total number of bonds

considered, #$%) , is the number of bonds in each individual structure summed over the number of structures.

#$%) =

$V4XYZ4YX[V

 )W

] #$%) (^)

(8)

There are corresponding formulae for valence angle and torsion RMSDs. The “Maximum” value reported below is defined as follows. First, the maximum absolute difference in bond lengths is determined for a particular structure.

`ab `c`(^) = max g #$%) 

(/3R).& )

− 

(=$=)%)

g

(9)

Second, the maximum of this set of values across all structures is determined. `ab `c` =

max

)3;/;3) )

`ab `c`(^)

(10)

ACS Paragon Plus Environment

13

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 39

There are corresponding formulae for valence angle and torsion maxima.

Results and Discussion It should be stressed that the purpose of this paper is not to prove the superiority or inferiority of any specific force field or methodology. It can be difficult to objectively compare force fields due, for example, to differences in atom type and/or bond type conventions, optimization algorithms used and optimization termination criteria. The purpose of the paper is to illustrate a novel use of crystal structure databases for optimizing molecular geometries, the problems one may encounter and the kind of results one may expect to obtain. Determination of  and clash parameters

In all the experiments using irace to train the  and clash parameter values there was no

discernable effect on the resulting RMSD distributions compared to a default set of parameters. The chosen default values were  = 1 in equations (1)-(4) and @ = 1, A = 50 in equation (5).

All subsequent computations carried out in this work used these parameters unless otherwise stated.

Minimization results for the Tripos set Table XII published in the Tripos paper4 contains summary statistics for differences between crystallographic geometries and geometries minimized with the energy based Tripos force field. The relevant items from this table are shown below in Table 1.

ACS Paragon Plus Environment

14

Page 15 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table 1. Subset of Tripos Table XII showing the number of features and the corresponding RMSD and maximum deviation for that feature. Feature

N

RMSD

Maximum

Valence angles °

1483

0.025

0.127

2174

2.50

12.24

2888

9.54

56.26

Coordinates Å

76

0.25

1.088

Bond lengths Å

Torsion angles °

Table 2 shows the corresponding results when the CSD-KBF is applied to the Tripos set. The convergence parameter for minimization used throughout this study for the Tripos force field is L = 0.01 (see Supporting Information).

Table 2. Results for the CSD-KBF applied to the Tripos set. Feature

n

RMSD

Maximum

Valence angles °

1436

0.015

0.15

2108

1.59

23.31

2808

7.9

75.34

Coordinates Å

73

0.30

1.612

Bond lengths Å

Torsion angles °

We use an expression equivalent to equation (7) for the “Coordinates” RMSD and equivalent to equation (10) for the “Coordinates” Maximum, including heavy atoms only. For consistency with Table 1, n in this case refers to the number of structures rather the number of heavy atom pairs. The CSD-KBF compares favorably with the Tripos energy based force field (which was selected as perhaps the best known benchmark) with respect to bond length and valence angle

ACS Paragon Plus Environment

15

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 39

RMSDs and is comparable with respect to torsion angle RMSDs. The complete distributions of the internal degree-of-freedom RMSD values for the individual structures are summarized in the boxplots in Figure 2. This figure shows the improvement obtained by applying the CSD-KBF to this particular set of molecules compared to the Tripos force field implementation used in this study.

Figure 2. Distributions of RMSDs of internal degrees-of-freedom for the Tripos set minimized by the CSD-KBF and the Tripos force field. For comparison, heavy atom RMSD distributions for overlaying the minimized structures on the CSD structure are shown in Figure 3 below. The boxplots in Figure 3 illustrate heavy atom

ACS Paragon Plus Environment

16

Page 17 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

RMSDs may be improved by using the CSD-KBF compared to the Tripos force field in that the resulting range is smaller and has a smaller maximum value, and all the corresponding quartiles are also lower. We include a brief discussion of statistical significance in the Supporting Information.

Figure 3. Distributions of heavy atom RMSDs for the Tripos set minimized by the CSD-KBF and the Tripos force field.

It is important to emphasize that the CSD-KBF does not produce energies, but scores. The scores of the molecules in the Tripos set prior to and post minimization with the CSD-KBF have

ACS Paragon Plus Environment

17

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 39

been examined and the scores for all 73 molecules are displayed in Figure 4. It is desirable that scores, like energies, are reduced during optimization. Each point in this figure corresponds to one of the 73 molecules. All points lie below the diagonal line representing equal final and initial scores. This tells us the score for every structure was decreased by the CSD-KBF. It is conceivable for the score to decrease yet the structure fail to optimize according to the termination criteria imposed on the algorithm. For the set of 73 Tripos molecules however every structure successfully optimized with L = 0.01.

Figure 4. Final versus initial score for the set of 73 molecules in the Tripos set minimized with the CSD-KBF.

ACS Paragon Plus Environment

18

Page 19 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Minimization Results for the Corina druglike test set The 2396 Corina generated conformations in the druglike test set are independent of the CSD structure and thus provide a test of the CSD-KBF when applied to non-CSD structures. The Corina generated conformation was overlaid on to the CSD molecule and the resulting distributions of the internal degrees-of-freedom RMSDs computed. These distributions provide a baseline indication of the structural difference between the Corina conformation and the CSD molecule. The Corina conformation was subsequently minimized with the CSD-KBF and the Tripos force field. The distributions of the internal degree-of-freedom RMSDs of the minimized conformation relative to the CSD molecule were computed in each case and Figure 5 shows the distributions.

ACS Paragon Plus Environment

19

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 39

Figure 5. Distributions of internal degree-of-freedom RMSDs for the druglike set of molecules without minimization (Corina) and with minimization by either the CSD-KBF or Tripos force field. The distribution of torsion angle RMSDs for the original Corina conformation and the conformations minimized with either the CSD-KBF or the Tripos force field were effectively the same. This of course does not imply the effect of the CSD-KBF and the Tripos force fields were the same on each molecule individually, but the set of molecules as a whole were affected equally with respect to torsional freedom. For valence angles, the CSD-KBF distribution has a smaller median, a smaller lower quartile and a smaller upper quartile than the distribution from

ACS Paragon Plus Environment

20

Page 21 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Corina structures and the distribution from Tripos force field minimized structures. In this case the value of the upper quartile of the CSD-KBF valence angle distribution is less than the corresponding lower quartiles from the Corina and Tripos minimized structures. For bond length distributions the results are very similar. The bond length distribution after CSD-KBF minimization has a smaller median, a smaller lower quartile and a smaller upper quartile than the distribution from Corina structures and the distribution from Tripos force field minimized structures. For comparison, heavy atom RMSD distributions resulting from overlaying on to the CSD structure were computed for the Corina structures and the CSD-KBF and Tripos minimized structures and boxplots of the distributions are shown in Figure 6. The figure shows that heavy atom RMSD distributions vary in the range of approximately 0 to 5Å and do not differ in any substantial way from each other. The internal degree-of-freedom RMSD distributions above provide a much better discrimination of the effect of minimization.

ACS Paragon Plus Environment

21

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 39

Figure 6. Distributions of heavy atom RMSDs for the druglike set of molecules without minimization (Corina) and with minimization by either the CSD-KBF or Tripos force field. As for the Tripos set, we also examine the final versus initial score for all 2396 molecules in the set pre- and post-minimization with CSD-KBF. The result is displayed in Figure 7. In this case the final score is again lower than the initial score for all 2396 structures in the set indicating that all structures have been improved with respect to the objective functions.

ACS Paragon Plus Environment

22

Page 23 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 7. Final versus initial score for the set of 2396 druglike molecules, generated with Corina and minimized with the CSD-KBF. The CSD bond length and valence angle libraries from which the objective functions are constructed exclude the individual observations for bonds involving terminal H atoms but instead provide summary statistics describing them viz. mean and standard deviation. Such two-number summary statistics would not be meaningful for multimodal circular distributions which in general is precisely what the torsion angle distributions are. As explained earlier in the Materials and Methods section, PDFs may be observation based, but may also be mean/standard-deviation

(k, l) based, or even input structure based. For the 2396 Corina generated conformations none of

ACS Paragon Plus Environment

23

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 39

the bond length distributions are based on the input structure and only 0.1% of the valence angle distributions are based on input structure values. For the torsion PDFs the situation is a little different. Some strategy is required when no rotamer distribution is provided by the rotamer libraries. We have considered all sets of bonded quartets of atoms that form a torsion angle and in the absence of a rotamer distribution, a PDF based on the input torsion angle, has been generated and included in the model. This increases the number of PDFs based on the input structure. An additional benefit in allowing a degree of flexibility in these torsions is allowing these torsions to relax as other degrees-of-freedom in the structure vary. Of course, given explicit data in the CSD libraries these torsions will be modelled by the underlying observations in the CSD instead of being based on the value in the starting structure. Table 3 shows counts across the 2396 set of what the PDFs were based upon split by structural feature. For bond lengths, marginally more PDFs were based on observations than on mean/sd descriptive statistics and none of the PDFs were based on the input structure. For valence angles more PDFs were based on descriptive statistics than on actual observations and a negligible fraction were based on the input structure. For torsion angles, slightly more PDFs were based on actual observations than on the input structure. Table 3. Counts of PDFs constructed from observations, mean/standard-deviation, or input structure split by bond lengths, valence angles and torsion angles for the 2396 Corina druglike molecules. Feature

Observation

Mean/sd

Input Structure

Bond Length

54518

48036

0

Valence Angle

64148

110908

227

Torsion Angle

47064

0

36335

ACS Paragon Plus Environment

24

Page 25 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Case Study: Application to Tolfenamic Acid Recently, the conformational profiles of several fenamates were studied by Uzoh et al.25 where various analytic representations of the torsion profiles were investigated. Tolfenamic acid (CSD refcode KAXXAI), shown in Figure 8, was taken as a typical example. The CSD-KBF was applied to this molecule and the results investigated.

Figure 8. Tolfenamic acid, CSD refcode KAXXAI, showing two intramolecular distances and 1 torsion angle prior to minimization. After application of the CSD-KBF the O2-H6 and H5-H7 through space distances increased by 0.033 Å and 0.019 Å respectively and the C7-N1-C8-C13 torsion relaxed to -146.11°,

ACS Paragon Plus Environment

25

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 39

representing a change of 3.48°. The knowledge based torsional objective function for the torsion defined by atoms C7-N1-C8-C13 was examined and is shown in Figure 9 along with histogram data, the PDF and value of the torsion angle prior to and post-minimization. The colors of the histogram bars range from green to red where green represents the regions with the most observations and red the regions with the least observations.

Figure 9. Torsion histogram, PDF (solid line) and torsion objective function (dot-dash line) for C7-N1-C8-C13 in KAXXAI showing the initial value (dashed blue vertical line) and final value (dashed red vertical line) of the torsion angle after application of the CSD-KBF. The PDF and histogram have been scaled up by a factor of two to better reveal the detail and the y axis is dimensionless.

ACS Paragon Plus Environment

26

Page 27 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Note the shape of the objective function reflects the underlying structural data used to generate

it. There are large histogram maxima at ±180° and corresponding deep minima in the objective

function. There are significantly fewer observations in the remaining angular range and

consequently the objective function value rises indicating this is a less favored region. Note

however there are small local peaks in the histogram near 0° and on either side of and close to

±60° and in between we see local minima in the objective function. The conclusion from the torsion library is that the most favorable values for this particular torsion angle are near to

±180°. The KBF did not, in this instance, drive this torsion angle to the deepest minimum of the

objective function corresponding to the most populated region of the histogram. This is a multiobjective optimization where a balance needs to reached between all objective functions present

and in this case the clash term prevents this torsion from reaching an angle at the minimum of the objective function. In the absence of the clash term, the torsion angle is indeed driven to the

optimum region indicated by the CSD (≈ −180°). However having no term to model the intramolecular clashes leads, of course, to unphysical close contacts. This highlights the general

principle that although the CSD data may indicate regions particularly favorable on average across many structures, this may not be achievable in practice for the particular structure under consideration. A further experiment was carried out to compare the MP2 rigid conformational scan for the

C7-N1-C8-C9 torsion angle, denoted by o and in the range 0° to 180°, of tolfenamic acid

performed by Uzoh et al.25 (see their Fig. 4) with the result from the KBF. Figure 10 shows the rigid knowledge-based conformational scan comprised of the sum of the KBF torsion plus clash

objective functions as o varies. There are two distinct minima and three maxima, a result very

similar to the MP2 calculations in Figure 4 of Uzoh et al. Moreover, the location of the minimum

ACS Paragon Plus Environment

27

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 39

at o ≈ 40° agrees very well with the MP2 result, and the minimum at o ≈ 110° is close to the

shoulder in the MP2 results at o ≈ 100° which evidently corresponds to the minimum at

o ≈ 110° in the relaxed PBE0 DFT calculation also reported by Uzoh et al. The locations of the

three maxima correspond similarly well. Given the knowledge based nature of the torsion objective function and simple heuristic nature of the clash objective function this represents good agreement with a relatively high level ab-initio calculation.

Figure 10. Conformational scan of the C7 N1 C8 C9 torsion angle (o) in tolfenamic acid using the sum of #3)#$ and /&.) from the CSD-KBF.

ACS Paragon Plus Environment

28

Page 29 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The effect of the CSD-KBF on the valence angle, C7-N1-C8 (with N1 being the amine nitrogen in the center of the molecule) is illustrated in Figure 11. Again the shape of the objective function reflects the underlying data used to generate it, hence we observe a strong local (and global) minimum where the majority of the valence angles are observed (120° −

130°). We also see some evidence of a small unfavorable local minimum close to 150° where

there is a single observation and potential outlier in the library. The valence angle is driven towards the main minimum lying between 120° − 130° by the CSD-KBF.

Figure 11. PDF (solid line) based on the CSD distribution (histogram) and objective function (dot dash line) for valence angle C7-N1-C8 in KAXXAI showing the initial value of the valence

ACS Paragon Plus Environment

29

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 39

angle (dotted blue vertical line) and final value of the valence angle (dotted red vertical line) after application of the CSD-KBF. The effect of the CSD-KBF on the valence angle, C2-C3-H2, part of an aromatic ring, is shown in Figure 12. In this case the observations relate to valence angles with terminal hydrogen atoms. These are so numerous that the actual observations are not provided explicitly rather the mean angle and standard deviations of the corresponding distribution are. The PDF constructed in this case is based on a normal distribution centered on the mean value with standard deviation set to the actual standard deviation unless this falls below a threshold (5°), in which case the threshold value is used as the standard deviation. This measure prevents the distributions becoming too narrow which causes problems for the gradient based optimization, but at the same time is sufficiently restrictive to approximate the value from the data. The valence angle in this case is driven to the ideal angle represented by the minimum of the objective function, corresponding to the peak of the PDF.

ACS Paragon Plus Environment

30

Page 31 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 12. PDF (solid line) and objective function (dot dash line) for valence angle C2-C3-H2 in KAXXAI with initial value of the valence angle (dotted blue vertical line) and final value of the valence angle (dotted red vertical line) after application of the CSD-KBF. The effect of the CSD-KBF on the bond length of N1-C7 is displayed in Figure 13. Here the initial bond length is quite close to the peak of the histogram and the corresponding minimum of the objective function. After minimization the bond length is still close to this optimum but it is not achieved exactly. This serves as a reminder of the balance the optimization algorithm needs to find between the many objective functions involved in the optimization and that if the minimum is rather shallow, the location of the identified minimum may be subject to some uncertainty.

ACS Paragon Plus Environment

31

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 39

Figure 13. PDF (solid line) based on the CSD distribution (histogram) and objective function (dot dash line) for bond length N1-C7 in KAXXAI with initial value of the bond length (dotted blue vertical line) and final value of the bond length (dotted red vertical line) after application of the CSD-KBF. For comparison, the effect of the CSD-KBF on the bond length of H6-N1 is displayed in Figure 14. Here a terminal H atom is involved and these are too numerous in the CSD to provide the actual observations explicitly. Instead a normal distribution is constructed centered on the mean and with the same standard deviation as the underlying distribution, unless this value is less than a threshold value (0.05 Å). Values below this threshold make the PDF too narrow, ACS Paragon Plus Environment

32

Page 33 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

making optimization difficult, but 0.05 Å is still sufficiently narrow to be effective. The initial

length of the bond appears to be very short compared to the mean value in the CSD and the optimization modifies the bond length to the mean value observed in the CSD.

Figure 14. PDF (solid line) and objective function (dot dash line) for bond length H6-N1 in KAXXAI with initial value of the bond length (dotted blue vertical line) and final value of the bond length (dotted red vertical line) after application of the CSD-KBF. For tolfenamic acid the counts of the data types on which the PDFs and objective functions are based are displayed in Table 4. None of the bond length or valence angle PDFs are based on the input structure but all are based on CSD data, either actual observations or summary statistics

ACS Paragon Plus Environment

33

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 39

(mean/standard-deviation.) For the torsion angles, about two-thirds of the PDFs are based on observations and about one-third based on the value in the input structure.

Table 4. Counts of PDFs constructed from observations, mean/standard-deviation, or input structure split by bond lengths, valence angles and torsion angles for tolfenamic acid, KAXXAI. Feature

Observation

Mean/sd

Input Structure

Bond Length

19

12

0

Valence Angle

25

24

0

Torsion Angle

15

0

8

The behavior of the component objective functions comprising the overall objective function as the optimization of KAXXAI proceeds from initialization to termination has been examined and is discussed in the Supporting Information. Our final point of note concerns the domain of applicability of the CSD-KBF. The knowledge based field presented here is intended for structures whose geometry may already be considered to be reasonable but for which the user wishes to generate a structure with a geometry modelled according to the distributions represented in the CSD. Highly distorted structures have some

geometric attributes lying in regions of approximately zero density and lim7%