Subscriber access provided by CITY UNIV LIB
Article
Knowledge Based Optimization of Molecular Geometries using Crystal Structures Jason Cole, Colin R. Groom, Oliver Korb, Patrick McCabe, and Gregory P Shields J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00712 • Publication Date (Web): 15 Mar 2016 Downloaded from http://pubs.acs.org on March 20, 2016
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Knowledge Based Optimization of Molecular Geometries using Crystal Structures Jason C. Cole, Colin R. Groom, Oliver Korb, Patrick McCabe*, Gregory P. Shields Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, United Kingdom
KEYWORDS Cambridge Structural Database, CSD, knowledge base, bond length, valence angle, torsion angle, PDF, minimization, optimization
ABSTRACT
This paper describes a novel way to use the structural information contained in the Cambridge Structural Database (CSD) to drive geometry optimization of organic molecules. We describe how CSD structural information is transformed into objective functions for gradient based optimization to provide good quality geometries for a large variety of organic molecules. Performance is assessed by minimizing different sets of organic molecules reporting RMSD movements for bond lengths, valence angles, torsion angles and heavy atom positions.
ACS Paragon Plus Environment
1
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 39
Introduction The Cambridge Structural Database (CSD1) contains hundreds of thousands of crystal structures with molecules whose atomic coordinates have been reliably determined and so represents an invaluable source of molecular geometry information. In Mogul2, the geometric data implicit in the CSD has been developed into multiple libraries including libraries for bond lengths, valence angles and torsion angles, and presented in the form of histograms. This paper describes a method that utilizes the raw geometric data to generate probability density function (PDFs), using the method of kernel density estimation (KDE)3. The generated PDFs are continuous and smooth functions. Peaks in the PDFs are indicative of favorable values for the underlying geometric attribute, reflecting the high frequency of occurrence of this attribute with this value. Conversely, regions where the PDFs are smaller in value represent regions where the frequency of occurrence is lower and correspond to less likely values for the geometric attribute. Thus, the PDFs contain information about the average geometric preferences of a molecule according to the data contained in the CSD. In principle, it should be possible to exploit this information in an optimization framework by converting the PDF into an objective function which can be supplied to an optimization (minimization) algorithm. We attempt just that. One of the most common methods for generating good geometries of small organic molecules is energy minimization using a force field. Numerous force fields exist based on energetics (e.g. Tripos4, MMFF945, GAFF6.) Knowledge-based potentials also exist relating to protein folding7 and to protein-ligand complexes8. These generally make use of statistical Boltzmann type equations and/or radial pair distribution functions for calculating energetics of the differences between an object (protein) and a reference state of that object, the reference state often being
ACS Paragon Plus Environment
2
Page 3 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
difficult to define. An intramolecular knowledge based field (KBF) based upon distributions of internal degrees of freedom, as opposed to radial pair distribution functions, and which makes no assumptions about functional forms is developed. Such a KBF would provide an additional and perhaps valuable tool in the computational chemist’s toolbox for generation of optimal molecular structures. In this context, optimum would be according to the large array of structural distributions contained within the CSD. Mogul2 is well known for providing structural information in the form of histograms for bond length, valence angles and torsion angles. A user may compare his or her molecule of interest against the Mogul distributions9,10. This may highlight structural features in the molecule that lie in regions which are significantly displaced away from the most populated regions. This “molecular health check” can signal poorly modelled molecules, errors and/or regions of strain. The latter case can be used to indicate opportunities for optimization of molecular geometry in the context of molecular recognition. Of course, structural features in a molecule are often interdependent. For instance changing a particular valence angle (e.g. atoms A or B change positions in valence angle A-B-C) may cause a corresponding change in another valence angle (A-B-D). It is entirely possible to have competing objectives in this sense where the optimum for one structural feature may cause another feature to be sub-optimal. A chemist’s intuition may provide the ideal answer in some situations but it is of course not practical for a chemist to individually examine every bond length, valence angle and torsion angle in every molecule. Also, given the existence of large repositories of high quality data such as the CSD it would be prudent to develop methods that make use of this data to guide the decisions about optimality. Many techniques exist for algorithmically determining solutions to optimization problems11,12. Local optimization methods include interval reduction methods, gradient descent
ACS Paragon Plus Environment
3
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 39
methods (e.g. Newton-Raphson, BFGS) etc. Global optimization methods include the various branch and bound methods, simulated annealing, evolutionary methods (e.g. genetic, ant colony) etc. It is the intent of this paper to combine the data in the CSD and a local optimization algorithm, the gradient based Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, and describe a methodology for automatically finding a locally optimum structure. It is important to note this CSD-KBF is not a traditional force field based upon energies. We deliberately do not use the term “energy” or “force” since the objective functions employed are knowledge based, not energy based. Some correlation may be expected between energy minima and populated regions of geometric space which provides additional rationale of the method, see for example the paper by Allen et al.13, Figure 10 in the paper by Cruz-Cabeza and Bernstein14 and Figure 3 in the paper by Kazantsev et al15. These figures show explicit examples of ab initio energy minima coinciding with maxima in CSD distributions.
Materials and Methods Objective Functions Transformation of a PDF into an objective function could be achieved in many ways. For example, taking the negative of the PDF would generate a well in the regions corresponding to high frequency of occurrence. A minimizing algorithm could then be applied to drive the geometric attribute to the minimum of the well, as in traditional energy minimization methods. This particular approach would however suffer from the drawback of exerting little influence in extended regions of low probability (close to zero) density. Whereas, in reality, such regions represent very unlikely domains. There should, in fact, be a strong influence from any such
ACS Paragon Plus Environment
4
Page 5 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
objective function in this case. A transformation that overcomes this difficulty is achieved by taking the negative of the (natural) log of the PDF i.e. − log ( ) and several figures
illustrating this transformation will be presented in later sections of this paper (see Case Study Tolfenamic Acid). In this case the regions of probability density closer to zero are transformed to very large positive values. Regions of high probability density are transformed in to wells. Given these two properties of the negative log transformation we chose to adopt this to transform the PDFs into objective functions. Since the PDFs themselves are smooth and the transformation generates smooth functions, the objective functions so generated are differentiable and can be used in gradient-based minimization frameworks. The data used to generate the PDFs comes from the Mogul libraries. Mogul in fact contains four data libraries, one for bond lengths, one for valence angles, one for torsion angles and one for unfused, unbridged rings. Recently a customized version (“CV”) was developed for the highthroughput generation of conformers16. The customized version of the Mogul libraries is used in this study. The CV contains multiple libraries of molecular dimensions, built from CSD entries published before 2011, limited, for performance reasons, to 500 observations for bond lengths, 250 for valence angles and 1000 for torsion angles (more precisely “rotamers” described below.) The multiple libraries form a hierarchy with respect to precision ranging from the most precise description of the fragment to the most generic description. In the CV there are four bond-length libraries and four valence-angle libraries. The CV also introduced the notion of “rotamer” libraries where each central bond X-Y in the fictitious fragment RA(RB)X-Y(RC)RD contributes just one distribution instead of four torsion distributions. Taking bond lengths as an example and a molecule with bonds then in principle it is possible to extract up to bond length distributions from the bond length libraries. Each distribution would have a corresponding PDF
ACS Paragon Plus Environment
5
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 39
for the bond in question and each PDF would be transformed into an objective function.
Denoting the length of the bond by , the sum total of these transformed PDFs would form the overall bond length objective function ( , … . ) which is a function of each bond length. , … , = − log !" #$% ()( &$'
#$% &$')
(1)
Similarly, using the same notation, we can form overall objective functions for valence angles
( + ), torsion angles ( , ) and out-of-plane atom distances ( ) giving the objective functions
below,
-.&$/ + , … , +0 = −1 log !"-.&$/ (+)( .$'&
-.&$/ .$'&)
.$'&
#3)#$ , , … , ,4 = −5 log !"#3)#$ (,)( .$'&
#3)#$ .$'&)
##7 , … , 8 =
#; #< 7&.$ .#=)
−9 log :!"
.$'&
#; #< ()> 7&.$ .#=
(2)
(3)
(4)
where each is a weighting factor. In formula (3) the summation includes objective functions
based on rotamer distributions and objective functions based on the remaining torsion angles in
ACS Paragon Plus Environment
6
Page 7 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
the molecule. In this work, out-of-plane distributions for 3-coordinate atoms were modelled with a single normal, i.e. Gaussian kernel, distribution. The out-of-plane distributions for acyclic nitrogen atoms exist in the form of two-dimensional data but this study concentrates entirely on one-dimensional distributions. In this first model such atoms were not assigned out-of-plane objective functions. Of course the Mogul library data represents local information; atoms and groups far from the geometric attribute under consideration could be quite different from molecule to molecule. Application of local information without considering the global conformation may generate molecules with unrealistically short interatomic distances. Thus there is a potential for intramolecular clashes and a term is required to model the clash. We adopted the following, non-probabilistic, simple functional form to model the intramolecular clash between atom pair i-j /&.)
A 1 = @ A B1 − 1 D , A ≤ CA CA /&.) 7.3)
1
(5)
where @ is an atom independent weighting factor, A is a weighting factor in principle
dependent on the atom pair, CA is a cut-off range (sum of VDW radii – 0.2) and A the distance
between atom i and atom j. VDW radii are taken from Bondi17, except the value for H, which is
taken from Rowland and Taylor18. Radii that are not available in either of these publications have
VDW radius = 2Å. This functional form represents a purely repulsive interaction of finite range.
For A ≥ CA the clash value is zero. A typical clash profile is displayed in Figure 1.
ACS Paragon Plus Environment
7
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 39
Figure 1. Typical profile of a clash function for one clash pair with @ = 1, CA = 2, A = 50.
See equation (5).
We can form an overall CSD probabilistic knowledge-based intramolecular objective function determined by the bond length, valence angle, torsion angle and out-of-plane terms, and an additional term to control clashing interactions.
, … , , + , … , + - , , , … , , , , … , # = #$% , … , J -.&$/ + , … , +0 &$'
.$'&
J #3)#$ , , … , ,4 J ##7 , … , 8 J /&.) .$'&
(6)
ACS Paragon Plus Environment
8
Page 9 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
The KDE approach to approximating the underlying PDF centers a suitable kernel function at each observation with an appropriate bandwidth. This procedure was described in detail in an earlier publication3 and the reader is referred there for further details. For those bond lengths and valence angles involving terminal hydrogen bonds, being so numerous in the database, the underlying observations are not exposed by the Mogul libraries but instead a mean and standard deviation are supplied which summarize the distribution with just two numbers. This is of course less informative than the actual observations, but PDFs can still be constructed from these parameters. In this case the PDF for the population is modelled as a normal (Gaussian) distribution centered on the mean and with standard deviation equal to the sample standard deviation. In other instances the Mogul libraries may provide no data about the geometric attribute in question. This may not imply there are no corresponding instances in the CSD but rather that the data has not been included in the Mogul library. This can occur for several reasons. There may be too few observations in the data, implying they cannot be considered as significant, or the appropriate search has not been carried out. One advantage of this knowledge based approach to molecular structure optimization is that “missing” data can be included as it becomes available in the database from which the geometry libraries were generated (the CSD in this case). Searches that may not have been carried out can be performed as and when they become apparent. An important question to address is what strategy to adopt if there are neither observations nor sample parameters available upon which to model a PDF and thus generate an objective function. We have adopted the strategy of using the value of the geometric attribute contained in the input molecular structure, with a small degree of variation permitted around that value. Figures will be presented in later sections quantifying the number of PDFs based on the input structure alone.
ACS Paragon Plus Environment
9
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 39
Each PDF based on actual observations is constructed by placing a suitable kernel function at every data point and summing over the kernels. A further step necessary for gradient-based minimization is the evaluation of the derivative of each kernel function and a further summation over the derivatives. When many data points are involved this is inefficient and to overcome this piecewise cubic splines are subsequently constructed to model the KDE of the PDF. Cubic splines are used for calculating the KDE of the PDF for all internal coordinates except out-ofplane distances. Analytic derivatives19,20 with respect to the Cartesian coordinates of the atoms for each objective function have been implemented and we used a limited memory version of the commonly used Broyden-Fletcher-Goldfarb-Shanno (BFGS) minimization algorithm21 (denoted as L-BFGS). Solutions to the optimization problem can be determined to different degrees of
accuracy. The parameter epsilon, L, supplied to the L-BFGS code controls this accuracy. A
minimization will terminate when the norm of the gradient vector is sufficiently small on a scale defined by L. Initially the value of L is set to 0.01. However, if the optimization fails to converge
at this level, L is relaxed geometrically, doubling in value, and a further attempt at optimization
performed, up to an upper limit currently set to 10. All of the structures in the two test sets
described below optimize using the CSD-KBF with L = 0.01 so allowing a geometric relaxation of L may be an unnecessary precaution, in addition it may be questionable how optimal the
structures would be if many doublings of L had been required to achieve successful termination of the optimization algorithm.
Equations (1)-(4) above contain parameters which can be used to control the
weightings of the various objective functions. It may be argued that given each term is a summation over probability density functions that each PDF is naturally scaled and no further
ACS Paragon Plus Environment
10
Page 11 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
weighting is required. This premise was investigated with iterated racing using irace22, a freely available R23 package. Data Sets Several sets of molecules and one specific molecule were used for investigating the utility of this knowledge based optimization approach to molecular geometry. One set comprised druglike molecules extracted from the CSD and a matching conformation generated by Corina24 version 3.49. Properties used to define druglike are given in the Supporting Information. The druglike set comprised 5990 molecules and was randomly subdivided into a training set of 3594 molecules and a test set of 2396 molecules. The training set was used in experiments to
determine optimum values and the test set was used for assessing the outcome of applying this
knowledge based field. In 1989 the general purpose Tripos force field, which has been widely used6, was described and validated4. We used molecules based on the set described therein as
another set. Three molecules were not included from the original Tripos set. The published diagram of AAXTHP in the Tripos paper4 appears to have an extra CH2 group compared to AAXTHP in the CSD. In order to remove any ambiguity in results, AAXTHP was excluded from this study. ACTHCP could not be located and ABINOR02, after addition of H atoms, was a different enantiomer of ABINOS01 so both ACTHCP and ABINOR02 were also excluded. The resulting set of 73 molecules, which will be referred to as the Tripos set, is in fact very similar to the 74 molecules used by Wang et al.6 The original structures corresponding to the 73 molecules in the Tripos set can be downloaded freely through the CSD-Community Access Structures service at www.ccdc.cam.ac.uk. Alternatively, users of the CSD-System can extract entries via ConQuest or by using the CSD-System Python API. The refcode list is provided in the
Supporting Information. First we describe the experiments performed to determine the values ACS Paragon Plus Environment
11
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 39
and then describe the application of the resulting knowledge based field to test sets and describe some individual cases. CSD-KBF minimized coordinates are also provided in the supporting information as mol2 files. Refcodes of the 5990 compounds in the druglike set including training set/test set indicator and RMSD values obtained are also included in the Supporting Information. Parameter Optimization Automatic configuration of optimization algorithms can be achieved using irace, finding the most appropriate settings given a set of input instances of a problem. The goal is to find parameter sets that perform well over a large set of instances. A cost measure is required by irace which assigns a cost value to one run of a particular algorithm configuration on a particular instance. We used both heavy atom overlay RMSD and combinations of RMSDs of internal degrees of freedom as cost measures. The definition of internal degree-of-freedom RMSD will be given below. The basic idea is to minimize a molecule using the knowledge based field and compare the minimized structure against a reference structure. The reference structure is the CSD crystal structure. Assuming the reference molecule represents a local minimum then the knowledge based field should generate a new molecular configuration very close to the reference measured in terms of an RMSD (either heavy atom or based on internal degrees of freedom). The druglike training set of Corina generated conformations were minimized and overlaid on to the original un-minimized CSD molecule using heavy atom RMSD as the cost measure. Additional experiments with the cost measure taken as different combinations of internal RMSDs, such as the sum of bond length plus valence angle plus torsion RMSDs or a weighted sum of these RMSDs, were performed. A Lennard-Jones functional form for the clash was also investigated in the irace experiments. A clash is an unfavorable, repulsive interaction,
ACS Paragon Plus Environment
12
Page 13 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
so on physical grounds the Lennard-Jones form with its attractive well was not used further in this study. The internal degree-of-freedom RMSDs and “Maximum”, reported in the Results and Discussion section, for a set of molecules are defined as follows. For bond lengths MNO"() = P
1
#$%)
$V4XYZ4YX[V $8TUV(V)
)W
W
(=$=)%) 1
Q
−
(/3R).&)
S
(7)
where ])3;/;3) is the number of molecules in the set, and the total number of bonds
considered, #$%) , is the number of bonds in each individual structure summed over the number of structures.
#$%) =
$V4XYZ4YX[V
)W
] #$%) (^)
(8)
There are corresponding formulae for valence angle and torsion RMSDs. The “Maximum” value reported below is defined as follows. First, the maximum absolute difference in bond lengths is determined for a particular structure.
`ab `c`(^) = max g #$%)
(/3R).& )
−
(=$=)%)
g
(9)
Second, the maximum of this set of values across all structures is determined. `ab `c` =
max
)3;/;3) )
`ab `c`(^)
(10)
ACS Paragon Plus Environment
13
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 39
There are corresponding formulae for valence angle and torsion maxima.
Results and Discussion It should be stressed that the purpose of this paper is not to prove the superiority or inferiority of any specific force field or methodology. It can be difficult to objectively compare force fields due, for example, to differences in atom type and/or bond type conventions, optimization algorithms used and optimization termination criteria. The purpose of the paper is to illustrate a novel use of crystal structure databases for optimizing molecular geometries, the problems one may encounter and the kind of results one may expect to obtain. Determination of and clash parameters
In all the experiments using irace to train the and clash parameter values there was no
discernable effect on the resulting RMSD distributions compared to a default set of parameters. The chosen default values were = 1 in equations (1)-(4) and @ = 1, A = 50 in equation (5).
All subsequent computations carried out in this work used these parameters unless otherwise stated.
Minimization results for the Tripos set Table XII published in the Tripos paper4 contains summary statistics for differences between crystallographic geometries and geometries minimized with the energy based Tripos force field. The relevant items from this table are shown below in Table 1.
ACS Paragon Plus Environment
14
Page 15 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Table 1. Subset of Tripos Table XII showing the number of features and the corresponding RMSD and maximum deviation for that feature. Feature
N
RMSD
Maximum
Valence angles °
1483
0.025
0.127
2174
2.50
12.24
2888
9.54
56.26
Coordinates Å
76
0.25
1.088
Bond lengths Å
Torsion angles °
Table 2 shows the corresponding results when the CSD-KBF is applied to the Tripos set. The convergence parameter for minimization used throughout this study for the Tripos force field is L = 0.01 (see Supporting Information).
Table 2. Results for the CSD-KBF applied to the Tripos set. Feature
n
RMSD
Maximum
Valence angles °
1436
0.015
0.15
2108
1.59
23.31
2808
7.9
75.34
Coordinates Å
73
0.30
1.612
Bond lengths Å
Torsion angles °
We use an expression equivalent to equation (7) for the “Coordinates” RMSD and equivalent to equation (10) for the “Coordinates” Maximum, including heavy atoms only. For consistency with Table 1, n in this case refers to the number of structures rather the number of heavy atom pairs. The CSD-KBF compares favorably with the Tripos energy based force field (which was selected as perhaps the best known benchmark) with respect to bond length and valence angle
ACS Paragon Plus Environment
15
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 16 of 39
RMSDs and is comparable with respect to torsion angle RMSDs. The complete distributions of the internal degree-of-freedom RMSD values for the individual structures are summarized in the boxplots in Figure 2. This figure shows the improvement obtained by applying the CSD-KBF to this particular set of molecules compared to the Tripos force field implementation used in this study.
Figure 2. Distributions of RMSDs of internal degrees-of-freedom for the Tripos set minimized by the CSD-KBF and the Tripos force field. For comparison, heavy atom RMSD distributions for overlaying the minimized structures on the CSD structure are shown in Figure 3 below. The boxplots in Figure 3 illustrate heavy atom
ACS Paragon Plus Environment
16
Page 17 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
RMSDs may be improved by using the CSD-KBF compared to the Tripos force field in that the resulting range is smaller and has a smaller maximum value, and all the corresponding quartiles are also lower. We include a brief discussion of statistical significance in the Supporting Information.
Figure 3. Distributions of heavy atom RMSDs for the Tripos set minimized by the CSD-KBF and the Tripos force field.
It is important to emphasize that the CSD-KBF does not produce energies, but scores. The scores of the molecules in the Tripos set prior to and post minimization with the CSD-KBF have
ACS Paragon Plus Environment
17
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 18 of 39
been examined and the scores for all 73 molecules are displayed in Figure 4. It is desirable that scores, like energies, are reduced during optimization. Each point in this figure corresponds to one of the 73 molecules. All points lie below the diagonal line representing equal final and initial scores. This tells us the score for every structure was decreased by the CSD-KBF. It is conceivable for the score to decrease yet the structure fail to optimize according to the termination criteria imposed on the algorithm. For the set of 73 Tripos molecules however every structure successfully optimized with L = 0.01.
Figure 4. Final versus initial score for the set of 73 molecules in the Tripos set minimized with the CSD-KBF.
ACS Paragon Plus Environment
18
Page 19 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Minimization Results for the Corina druglike test set The 2396 Corina generated conformations in the druglike test set are independent of the CSD structure and thus provide a test of the CSD-KBF when applied to non-CSD structures. The Corina generated conformation was overlaid on to the CSD molecule and the resulting distributions of the internal degrees-of-freedom RMSDs computed. These distributions provide a baseline indication of the structural difference between the Corina conformation and the CSD molecule. The Corina conformation was subsequently minimized with the CSD-KBF and the Tripos force field. The distributions of the internal degree-of-freedom RMSDs of the minimized conformation relative to the CSD molecule were computed in each case and Figure 5 shows the distributions.
ACS Paragon Plus Environment
19
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 20 of 39
Figure 5. Distributions of internal degree-of-freedom RMSDs for the druglike set of molecules without minimization (Corina) and with minimization by either the CSD-KBF or Tripos force field. The distribution of torsion angle RMSDs for the original Corina conformation and the conformations minimized with either the CSD-KBF or the Tripos force field were effectively the same. This of course does not imply the effect of the CSD-KBF and the Tripos force fields were the same on each molecule individually, but the set of molecules as a whole were affected equally with respect to torsional freedom. For valence angles, the CSD-KBF distribution has a smaller median, a smaller lower quartile and a smaller upper quartile than the distribution from
ACS Paragon Plus Environment
20
Page 21 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Corina structures and the distribution from Tripos force field minimized structures. In this case the value of the upper quartile of the CSD-KBF valence angle distribution is less than the corresponding lower quartiles from the Corina and Tripos minimized structures. For bond length distributions the results are very similar. The bond length distribution after CSD-KBF minimization has a smaller median, a smaller lower quartile and a smaller upper quartile than the distribution from Corina structures and the distribution from Tripos force field minimized structures. For comparison, heavy atom RMSD distributions resulting from overlaying on to the CSD structure were computed for the Corina structures and the CSD-KBF and Tripos minimized structures and boxplots of the distributions are shown in Figure 6. The figure shows that heavy atom RMSD distributions vary in the range of approximately 0 to 5Å and do not differ in any substantial way from each other. The internal degree-of-freedom RMSD distributions above provide a much better discrimination of the effect of minimization.
ACS Paragon Plus Environment
21
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 22 of 39
Figure 6. Distributions of heavy atom RMSDs for the druglike set of molecules without minimization (Corina) and with minimization by either the CSD-KBF or Tripos force field. As for the Tripos set, we also examine the final versus initial score for all 2396 molecules in the set pre- and post-minimization with CSD-KBF. The result is displayed in Figure 7. In this case the final score is again lower than the initial score for all 2396 structures in the set indicating that all structures have been improved with respect to the objective functions.
ACS Paragon Plus Environment
22
Page 23 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Figure 7. Final versus initial score for the set of 2396 druglike molecules, generated with Corina and minimized with the CSD-KBF. The CSD bond length and valence angle libraries from which the objective functions are constructed exclude the individual observations for bonds involving terminal H atoms but instead provide summary statistics describing them viz. mean and standard deviation. Such two-number summary statistics would not be meaningful for multimodal circular distributions which in general is precisely what the torsion angle distributions are. As explained earlier in the Materials and Methods section, PDFs may be observation based, but may also be mean/standard-deviation
(k, l) based, or even input structure based. For the 2396 Corina generated conformations none of
ACS Paragon Plus Environment
23
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 24 of 39
the bond length distributions are based on the input structure and only 0.1% of the valence angle distributions are based on input structure values. For the torsion PDFs the situation is a little different. Some strategy is required when no rotamer distribution is provided by the rotamer libraries. We have considered all sets of bonded quartets of atoms that form a torsion angle and in the absence of a rotamer distribution, a PDF based on the input torsion angle, has been generated and included in the model. This increases the number of PDFs based on the input structure. An additional benefit in allowing a degree of flexibility in these torsions is allowing these torsions to relax as other degrees-of-freedom in the structure vary. Of course, given explicit data in the CSD libraries these torsions will be modelled by the underlying observations in the CSD instead of being based on the value in the starting structure. Table 3 shows counts across the 2396 set of what the PDFs were based upon split by structural feature. For bond lengths, marginally more PDFs were based on observations than on mean/sd descriptive statistics and none of the PDFs were based on the input structure. For valence angles more PDFs were based on descriptive statistics than on actual observations and a negligible fraction were based on the input structure. For torsion angles, slightly more PDFs were based on actual observations than on the input structure. Table 3. Counts of PDFs constructed from observations, mean/standard-deviation, or input structure split by bond lengths, valence angles and torsion angles for the 2396 Corina druglike molecules. Feature
Observation
Mean/sd
Input Structure
Bond Length
54518
48036
0
Valence Angle
64148
110908
227
Torsion Angle
47064
0
36335
ACS Paragon Plus Environment
24
Page 25 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Case Study: Application to Tolfenamic Acid Recently, the conformational profiles of several fenamates were studied by Uzoh et al.25 where various analytic representations of the torsion profiles were investigated. Tolfenamic acid (CSD refcode KAXXAI), shown in Figure 8, was taken as a typical example. The CSD-KBF was applied to this molecule and the results investigated.
Figure 8. Tolfenamic acid, CSD refcode KAXXAI, showing two intramolecular distances and 1 torsion angle prior to minimization. After application of the CSD-KBF the O2-H6 and H5-H7 through space distances increased by 0.033 Å and 0.019 Å respectively and the C7-N1-C8-C13 torsion relaxed to -146.11°,
ACS Paragon Plus Environment
25
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 26 of 39
representing a change of 3.48°. The knowledge based torsional objective function for the torsion defined by atoms C7-N1-C8-C13 was examined and is shown in Figure 9 along with histogram data, the PDF and value of the torsion angle prior to and post-minimization. The colors of the histogram bars range from green to red where green represents the regions with the most observations and red the regions with the least observations.
Figure 9. Torsion histogram, PDF (solid line) and torsion objective function (dot-dash line) for C7-N1-C8-C13 in KAXXAI showing the initial value (dashed blue vertical line) and final value (dashed red vertical line) of the torsion angle after application of the CSD-KBF. The PDF and histogram have been scaled up by a factor of two to better reveal the detail and the y axis is dimensionless.
ACS Paragon Plus Environment
26
Page 27 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Note the shape of the objective function reflects the underlying structural data used to generate
it. There are large histogram maxima at ±180° and corresponding deep minima in the objective
function. There are significantly fewer observations in the remaining angular range and
consequently the objective function value rises indicating this is a less favored region. Note
however there are small local peaks in the histogram near 0° and on either side of and close to
±60° and in between we see local minima in the objective function. The conclusion from the torsion library is that the most favorable values for this particular torsion angle are near to
±180°. The KBF did not, in this instance, drive this torsion angle to the deepest minimum of the
objective function corresponding to the most populated region of the histogram. This is a multiobjective optimization where a balance needs to reached between all objective functions present
and in this case the clash term prevents this torsion from reaching an angle at the minimum of the objective function. In the absence of the clash term, the torsion angle is indeed driven to the
optimum region indicated by the CSD (≈ −180°). However having no term to model the intramolecular clashes leads, of course, to unphysical close contacts. This highlights the general
principle that although the CSD data may indicate regions particularly favorable on average across many structures, this may not be achievable in practice for the particular structure under consideration. A further experiment was carried out to compare the MP2 rigid conformational scan for the
C7-N1-C8-C9 torsion angle, denoted by o and in the range 0° to 180°, of tolfenamic acid
performed by Uzoh et al.25 (see their Fig. 4) with the result from the KBF. Figure 10 shows the rigid knowledge-based conformational scan comprised of the sum of the KBF torsion plus clash
objective functions as o varies. There are two distinct minima and three maxima, a result very
similar to the MP2 calculations in Figure 4 of Uzoh et al. Moreover, the location of the minimum
ACS Paragon Plus Environment
27
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 28 of 39
at o ≈ 40° agrees very well with the MP2 result, and the minimum at o ≈ 110° is close to the
shoulder in the MP2 results at o ≈ 100° which evidently corresponds to the minimum at
o ≈ 110° in the relaxed PBE0 DFT calculation also reported by Uzoh et al. The locations of the
three maxima correspond similarly well. Given the knowledge based nature of the torsion objective function and simple heuristic nature of the clash objective function this represents good agreement with a relatively high level ab-initio calculation.
Figure 10. Conformational scan of the C7 N1 C8 C9 torsion angle (o) in tolfenamic acid using the sum of #3)#$ and /&.) from the CSD-KBF.
ACS Paragon Plus Environment
28
Page 29 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
The effect of the CSD-KBF on the valence angle, C7-N1-C8 (with N1 being the amine nitrogen in the center of the molecule) is illustrated in Figure 11. Again the shape of the objective function reflects the underlying data used to generate it, hence we observe a strong local (and global) minimum where the majority of the valence angles are observed (120° −
130°). We also see some evidence of a small unfavorable local minimum close to 150° where
there is a single observation and potential outlier in the library. The valence angle is driven towards the main minimum lying between 120° − 130° by the CSD-KBF.
Figure 11. PDF (solid line) based on the CSD distribution (histogram) and objective function (dot dash line) for valence angle C7-N1-C8 in KAXXAI showing the initial value of the valence
ACS Paragon Plus Environment
29
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 30 of 39
angle (dotted blue vertical line) and final value of the valence angle (dotted red vertical line) after application of the CSD-KBF. The effect of the CSD-KBF on the valence angle, C2-C3-H2, part of an aromatic ring, is shown in Figure 12. In this case the observations relate to valence angles with terminal hydrogen atoms. These are so numerous that the actual observations are not provided explicitly rather the mean angle and standard deviations of the corresponding distribution are. The PDF constructed in this case is based on a normal distribution centered on the mean value with standard deviation set to the actual standard deviation unless this falls below a threshold (5°), in which case the threshold value is used as the standard deviation. This measure prevents the distributions becoming too narrow which causes problems for the gradient based optimization, but at the same time is sufficiently restrictive to approximate the value from the data. The valence angle in this case is driven to the ideal angle represented by the minimum of the objective function, corresponding to the peak of the PDF.
ACS Paragon Plus Environment
30
Page 31 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Figure 12. PDF (solid line) and objective function (dot dash line) for valence angle C2-C3-H2 in KAXXAI with initial value of the valence angle (dotted blue vertical line) and final value of the valence angle (dotted red vertical line) after application of the CSD-KBF. The effect of the CSD-KBF on the bond length of N1-C7 is displayed in Figure 13. Here the initial bond length is quite close to the peak of the histogram and the corresponding minimum of the objective function. After minimization the bond length is still close to this optimum but it is not achieved exactly. This serves as a reminder of the balance the optimization algorithm needs to find between the many objective functions involved in the optimization and that if the minimum is rather shallow, the location of the identified minimum may be subject to some uncertainty.
ACS Paragon Plus Environment
31
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 32 of 39
Figure 13. PDF (solid line) based on the CSD distribution (histogram) and objective function (dot dash line) for bond length N1-C7 in KAXXAI with initial value of the bond length (dotted blue vertical line) and final value of the bond length (dotted red vertical line) after application of the CSD-KBF. For comparison, the effect of the CSD-KBF on the bond length of H6-N1 is displayed in Figure 14. Here a terminal H atom is involved and these are too numerous in the CSD to provide the actual observations explicitly. Instead a normal distribution is constructed centered on the mean and with the same standard deviation as the underlying distribution, unless this value is less than a threshold value (0.05 Å). Values below this threshold make the PDF too narrow, ACS Paragon Plus Environment
32
Page 33 of 39
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
making optimization difficult, but 0.05 Å is still sufficiently narrow to be effective. The initial
length of the bond appears to be very short compared to the mean value in the CSD and the optimization modifies the bond length to the mean value observed in the CSD.
Figure 14. PDF (solid line) and objective function (dot dash line) for bond length H6-N1 in KAXXAI with initial value of the bond length (dotted blue vertical line) and final value of the bond length (dotted red vertical line) after application of the CSD-KBF. For tolfenamic acid the counts of the data types on which the PDFs and objective functions are based are displayed in Table 4. None of the bond length or valence angle PDFs are based on the input structure but all are based on CSD data, either actual observations or summary statistics
ACS Paragon Plus Environment
33
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 34 of 39
(mean/standard-deviation.) For the torsion angles, about two-thirds of the PDFs are based on observations and about one-third based on the value in the input structure.
Table 4. Counts of PDFs constructed from observations, mean/standard-deviation, or input structure split by bond lengths, valence angles and torsion angles for tolfenamic acid, KAXXAI. Feature
Observation
Mean/sd
Input Structure
Bond Length
19
12
0
Valence Angle
25
24
0
Torsion Angle
15
0
8
The behavior of the component objective functions comprising the overall objective function as the optimization of KAXXAI proceeds from initialization to termination has been examined and is discussed in the Supporting Information. Our final point of note concerns the domain of applicability of the CSD-KBF. The knowledge based field presented here is intended for structures whose geometry may already be considered to be reasonable but for which the user wishes to generate a structure with a geometry modelled according to the distributions represented in the CSD. Highly distorted structures have some
geometric attributes lying in regions of approximately zero density and lim7%