Knowledge-Based Conformer Generation Using the Cambridge

Publication Date (Web): February 9, 2018. Copyright © 2018 American Chemical Society. *J. C. Cole. E-mail: [email protected]. ACS AuthorChoice - Th...
2 downloads 6 Views 672KB Size
Subscriber access provided by UNIVERSITY OF TOLEDO LIBRARIES

Article

Knowledge-Based Conformer Generation Using the Cambridge Structural Database Jason C. Cole, Oliver Korb, Patrick McCabe, Murray Read, and Robin Taylor J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00697 • Publication Date (Web): 09 Feb 2018 Downloaded from http://pubs.acs.org on February 12, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Knowledge-Based Conformer Generation Using the Cambridge Structural Database Jason C. Cole*, Oliver Korb, Patrick McCabe, Murray G. Read and Robin Taylor Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, United Kingdom *E-mail: [email protected].

ABSTRACT

Fast generation of plausible molecular conformations is central to molecular modelling. This paper presents an approach to conformer generation that makes extensive use of the information available in the Cambridge Structural Database. By using geometric distributions derived from the Cambridge Structural Database, it is possible to create biologically relevant conformations in the majority of cases analysed. The paper compares the performance of the approach with previously published evaluations, and presents some cases where the method fails. The method appears to show significantly improved performance in reproduction of the conformations of structures observed in the Cambridge Structural Database and the Protein Data Bank as compared to other published methods of a similar speed.

ACS Paragon Plus Environment

1

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 48

KEYWORDS Conformations, Conformer generation, Modelling, Cambridge Structural Database. Introduction The three-dimensional structure of a molecule is fundamental in many scientific domains, including protein-ligand docking, 3D QSAR, ligand-based virtual screening and crystal structure prediction. Consequently, good methods for generating plausible three-dimensional molecular structures (conformations) of small molecules in both unbound and bound states are central to molecular modelling. A good process for conformer generation should produce conformations that are chemically plausible and include, to a reasonable approximation, any conformations known to be bioactive or that have been observed in the solid state, in solution or in the gas phase. Therefore, the ensemble of generated conformations must not only account for the intramolecular preferences of a molecule, but also in some way account for the broad effects that a condensed-phase environment may have on molecular conformations (e.g. extended geometries can become more favoured).1 This is a challenging problem, as conformer generators do not generally have detailed information about local molecular environments. Another challenge for any method in this area is to generate conformations on a rapid timescale: typically, users wish to generate an ensemble of conformations in seconds rather than minutes, though clearly such requirements are domain specific. Speed is critical in a virtual screen of millions of compounds, whereas in a large-scale crystal structure prediction study, where a user may wish to generate initial conformations for a rigid-molecule global search of packing arrangements, the quality of the conformers far out-weighs the length of time needed to create them, as the global search may take upwards of 100,000 CPU hours.2,3

ACS Paragon Plus Environment

2

Page 3 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

We have set out to develop a conformer generator that operates relatively rapidly, yet produces conformations that are chemically plausible. The problem domains we would perceive as being tackled by such a conformer generator would be numerous: for example, input for 3D virtual screening methods, to provide an overview of conformational space, or to generate discrete start points and conformational bounds for algorithms that undertake more rigorous global searches, such as, crystal structure prediction.3 A further intent with this conformer generator is to minimise the parameter choices required by the user, so the only key choice a user has to make is the number of conformers they wish to explore, which will vary depending on problem domain. There are numerous commercial and free programs that attempt to solve the problem of conformer generation (e.g. Omega4,5, confab6, confect7, CAESAR8, Trixx Confomer Generator9, Frog210, RDKit11, BCL::CONF12) using a variety of approaches. Some systems use force fields to infer intramolecular geometries, whereas others rely directly on crystal-structure data derived from the Cambridge Structural Database (CSD)13 or the Worldwide Protein Data Bank (wwPDB).14 Typically, these methods will be validated by the authors of the program but some have been evaluated in independent tests using test sets of varying size.15–18 These independent evaluations serve both as impartial viewpoints of the relative success of each program and as excellent reviews of the choice of methods in use. Our method is knowledge-based, based on CSD data as encapsulated in (a) derived data libraries (the Mogul system)19,20 and (b) heuristic rules. We use rotamer libraries to characterise preferred rotatable-bond geometries, and ring template libraries to describe ring geometries. The bond lengths and valence angles of an input 3D molecular structure are initially minimised based on average values taken from Mogul. The molecule is then partitioned into a tree based on probable rotamer values and sampled using a depth first approach, with pruning of unlikely paths according

ACS Paragon Plus Environment

3

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 48

to Mogul data and clash score. The software differs from other knowledge-driven generators in a number of ways, the most important of which are as follows. Firstly, we use dynamic rotamer libraries that can be automatically updated with new data rather than static snapshots. Secondly, we have used a cascading library approach to tackle the critical problem of data coverage whilst using the most relevant information. Thirdly we have ensured that sparse distributions are treated with appropriate levels of smoothing to avoid over-emphasis based on a small number of observations and fourthly, we have attempted to tackle ring flexibility using CSD-derived knowledge comprehensively. The software has been evaluated extensively using the ligands in the Platinum data set,18 a set of high quality protein ligand complexes. We emphasise that the Platinum data set was used only for evaluation, not for training of the algorithm. There is further evaluation data for CSD crystal structure conformations and comparative performance in the Supporting Information. These evaluations show that the software is applicable to and effective at the generation of conformers in the solid state. Materials and Methods The overall workflow is shown in Figure 1.

ACS Paragon Plus Environment

4

Page 5 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 1. Workflow of conformer generation in this paper. The knowledge base. Mogul19 is a collection of data libraries that describe molecular geometric preferences. For the work discussed herein, we used a modified version of the system that provides libraries of bond lengths, valence angles, rotamers (i.e. the torsional geometries of rotatable bonds)

ACS Paragon Plus Environment

5

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 48

and ring geometries. This modified Mogul has been described in detail elsewhere.20 Hence, we confine ourselves here to a brief summary of essential features, together with details of significant improvements we have made since our earlier publications. Each library contains distributions based on crystal structures in the CSD. Each distribution relates to a particular chemical fragment. The fragment is described by a set of keys; for example, the entry in a rotamer library pertaining to a rotatable bond A-B will have keys describing the nature of A and B and, to a greater or lesser extent, their neighbouring atoms and bonds. Typically, there are several different libraries that describe fragments to differing levels of chemical detail (e.g. ‘fine’, ‘medium’, ‘coarse’ and ‘very coarse’). In ‘fine’ libraries, a high degree of chemical detail is captured by the keys used, whereas in other libraries, more rudimentary keys are used, thus leading to a coarser description of the chemical fragment. The libraries are typically searched in a cascading fashion, using the fine library in the first instance and resorting to sequentially coarser libraries if necessary to retrieve a distribution with sufficient observations. The geometry of a rotamer is described by a single reference torsion angle, chosen in a consistent manner for all members of a distribution. The distribution will span a full 360° range and will be asymmetric about zero if the fragment is chiral. On the other hand, Mogul takes account of any rotamer symmetry; for example, the distribution returned for the C-C bond in R-CH2-CH2-R is made symmetric by adding, for each crystallographically-observed value x, its symmetryequivalent value -x. The symmetry-generated observations are not counted when assessing whether a distribution has sufficient observations. Apart from making use of the Mogul rotamer distributions, manually-derived distributions can be employed. They need not be restricted to fragments containing only one rotatable bond. These distributions may be used if the torsion angle for a rotatable bond should be constrained to a

ACS Paragon Plus Environment

6

Page 7 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

specific value, or for fragments containing sequences of rotatable bonds whose torsion angles are highly correlated for a reason (e.g. strong intramolecular hydrogen bonding) that is unlikely to be reproduced simply by use of a clash term. The CSD Conformer Generator uses a small library of pre-defined rotamer distributions, which are listed in the Supporting Information. Only 11 of these are used because we find that a simple clash term is adequate to model most correlations between adjacent rotatable bonds. For flexible ring systems, pre-clustered template conformations can be extracted from Mogul. This covers isolated, fused, spiro-linked and bridged ring systems. The techniques for creating fused and spiro-linked ring systems have been described previously.20 An extension of the method to cover bridged rings is described in the Supporting Information. Templates can be produced for ring systems containing up to six component rings. If an assembly contains more than this number of rings, the conformer generator will not attempt to generate templates for the assembly and will instead use the input geometry. There is also functionality to deal with rings for which no template is obtainable from Mogul; the templates are then generated on the fly, using Mogul rotamer distributions for cyclic bonds. Details of the procedure are described later in the paper and in the Supporting Information. If ring generation fails, and no template structure can be generated, the ring conformation from the threedimensional input structure is used. In a randomly selected subset of 11700 molecules from the ZINC “clean drug-like” database,21,22 278 molecules contained rings for which no templates were found in Mogul. Ring generation from cyclic-rotamer data was successful for all 278 molecules, leaving no cases where the input structure had to be used. Pre-minimisation. Before rotatable-bond geometries and ring conformations are sampled, the 3D input structure is locally minimised in vacuo. The aim of this step is to optimise bond lengths

ACS Paragon Plus Environment

7

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 48

and valence angles which are, except as part of ring template sampling, not changed during the conformer generation process. A memory-limited BFGS algorithm23,24 minimises the given structure in Cartesian coordinates using a custom force field based on the functional forms and parameters of the Tripos force-field,25 which have, however, been highly modified. A purely repulsive clash pair potential has been developed, which replaces the vdW term in the Tripos forcefield. The potential is evaluated for all atom pairs that are separated by at least four bonds. For two atoms separated by a distance the clash contribution is calculated as

1

if

(1)

0 otherwise

where

(set at 50.0) is the penalty value at zero distance and

is the cut-off distance beyond

which there is no contribution. The cut-off distance is calculated by summing the vdW radii26 of the participating atoms reduced by a distance , which has a default value of 0.2 and a value of 1.0 for atom pairs which can form hydrogen bonds, i.e. (2) The function and its analytic derivative with respect to the atomic positions can be evaluated efficiently. No electrostatic term is used. While the Tripos force-field assigns target bond-length and valence-angle values based on tables indexed by the mol2 atom typing scheme,27 in this work these parameters are derived from the CSD via Mogul. If, for a particular bond length or valence angle, there are no observations in the CSD (an extremely rare occurrence), the input bond length and valence angle are used as the target value, respectively. In the Tripos force-field, the force constant for a specific bond length or

ACS Paragon Plus Environment

8

Page 9 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

valence angle varies according to the nature of the atoms involved. Our approach, however, uses only one force constant constant

(set at 0.02 kcal mol-1 degree-2) for valence angles and one force

(set at 600 kcal mol-1 angstrom-2) for bond lengths, which are the default values

in the Tripos force field. Degrees of freedom in conformational sampling. The degrees of freedom considered during conformational sampling are rotatable bonds and replaceable ring systems. A bond is defined as rotatable if it is single and acyclic with neither atom being metal or bonded to metal, and both atoms must be bonded to at least one other non-hydrogen atom (i.e. “hydrogen rotors” are not considered). We define a ring system as replaceable if it has one or more templates created from Mogul. Sampling of rotatable-bond geometries. At the conformational sampling stage, each rotatable bond is allowed to adopt any of a set of discrete torsion angles, each assigned an empirical probability based on the distribution that has been retrieved for the bond. To prepare the retrieved distribution for representative angle selection, Kernel Density Estimation (KDE) is applied to generate a smooth Probability Density Function (PDF) .28,29,30 The von Mises kernel, which is suitable for circular distributions, is used. Methods for estimating the kernel bandwidth29 can yield values that are sometimes too high or too low, leading to over- or under-fitting. This can be improved by setting upper and lower bounds on the bandwidth which are dependent on the number of observations and detail level of the Mogul library from which the distribution came.

The representative angles are generated using a greedy algorithm,31 as follows. The more important peaks in the PDF are identified, where a peak is considered important if its height is at

ACS Paragon Plus Environment

9

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 48

least 50% of the highest peak. In the peak identification, we examine the PDF on a discrete grid of torsion angles i.e. every integer degree. The resulting set of peaks is processed in height order and corresponding peak angles are selected one at a time and added to the representative set provided they are at least 15° from any angle already chosen. All angles within a 15° range of each selected peak angle are excluded from further consideration. Once the list of important peak angles is exhausted, there will be an initial list of representative angles

and corresponding

15°

exclusion zones. Any angles on the grid outside of the exclusion zones that have not yet been considered are then processed in PDF height order and considered for addition to the representative set. However, as each new angle

is added, a new exclusion zone of

30° is introduced and

the next angle considered has the next highest PDF that lies outside of all exclusion zones. The process continues until the full 360° range has been covered. An approximate probability mass function is generated for the selected representative torsion angles by normalizing their PDF values to sum to 1.

The empirical probabilities are used to guide conformer construction in the sampling stage, i.e. each torsion angle will be sampled in decreasing order of probability. Reference torsion angles with a near zero empirical probability (less than 1/10001) are not considered in the sampling stage. Furthermore, if a rotatable bond has an n-fold symmetry, only the asymmetric part of the distribution will be considered. If manually-derived distributions for fragments with more than one rotatable bond are employed, conditional empirical probabilities are derived. Manually-derived rotamer distributions are matched first and take precedence over the automatically derived Mogul distributions. If several manually-derived sets of distributions covering a single, or multiple, rotatable bond(s) are available, a backtracking algorithm is used to find the set of non-overlapping

ACS Paragon Plus Environment

10

Page 11 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

sets covering the most rotatable bonds in the molecule. Although the set-covering algorithm has an exponential run-time, this is usually not a problem in practice due to the moderate size of molecules used in conformer generation applications. If the empirical probability for a specific reference torsion angle is lower than a pre-defined limit 0.05, this setting is classified as unusual, a classification that is used later to speed up searching by tree pruning.

Sampling of flexible-ring geometries. As noted earlier, pre-clustered template conformations are extracted from Mogul for flexible ring systems (including fused and bridged) or, if necessary, generated on the fly. In the former case, an empirical probability is assigned to each template, calculated as the number of observations of the template in Mogul (therefore, in the CSD) divided by the total number of observations of the ring structure. Again, this empirical probability is used in the sampling stage to guide the search by sampling ring templates in decreasing order of probability.

Ring generation. Ring generation is used when Mogul cannot supply flexible-ring geometries. The process starts with the ring being isolated from the rest of the molecule. The ring is then divided into fragments connected by rotatable bonds, and one of the bonds is broken, resulting in an open chain molecule. Mogul rotamer distributions for cyclic bonds give the torsional preferences for these rotatable bonds. The CSD Conformer Generator algorithm is run on this set of fragments, with constraints to guide the search towards closed ring structures with a valid bond length and torsion over the broken bond. The ring structures thereby found are closed and then optimised using Mogul bond length, valence angle and torsion angle preferences. Empirical

ACS Paragon Plus Environment

11

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 48

probabilities for generated rings are calculated from the probabilities of the torsion angle settings chosen during the search phase. More details of the new ring generation function, and its performance, are given in the Supporting Information.

Search-space pruning. At this point, the number of geometries of non-zero empirical probability (henceforth, “settings”) that are available to each degree of freedom (rotatable bond or flexible ring system) has been determined. Given the number,

, available to the ith degree of

freedom, the theoretical maximum number of conformers can be calculated by ∏

. This does not consider that some conformers will be eliminated by pruning techniques at

the conformational sampling stage (e.g. close-contact checking). Nevertheless, it can be used as a guide to the size of the sampling problem and therefore to whether the total number of settings should be reduced up front. If the value of

exceeds a pre-defined threshold value

(default 50 million), the number of settings to be sampled is reduced according to a protocol which performs one reduction step at a time and checks if the resulting theoretical number of conformations is lower than the threshold. The procedure terminates when this is the case or if the protocol runs out of steps. The setting reduction protocol is described more fully in the Supporting Information. Incremental construction and traversal order. At this point, the molecule is effectively a set of rigid or semi-rigid fragments separated by rotatable bonds, each of which has various possible settings. A semi-rigid fragment is a flexible ring system for which more than one template is available. The conformation sampling is done by incremental construction of the molecule from its rigid and semi-rigid fragments. One of the fragments is chosen as a base (starting) fragment. The remaining fragments are added in a pre-determined order, the traversal order. These fragment

ACS Paragon Plus Environment

12

Page 13 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

order choices are deduced as follows. The fragment containing the atom which has the longest shortest-bond path to all other atoms in the molecule is used as the base (starting) fragment. The reason is that the sampling strategy employed for the conformer generation spends a fair amount of time sampling the leaves of the search tree and may terminate prematurely if a maximum number of conformers is reached. Hence, it is beneficial if centrally-located torsions, which potentially have a big impact on the overall shape of the molecule, are located as far down the search tree as possible. This ensures that these important torsions are most likely to be searched thoroughly. Clearly, there is no free choice with respect to where exactly the torsions will be in the search tree. As the molecule is constructed incrementally, it is required that certain fragments must already be in place to allow the next one to be attached.

The conformation generation problem can now be formulated as a discrete search problem. For each degree of freedom , molecule with

,⋯,

(i.e. a variable rotatable bond or a flexible ring system) in a

degrees of freedom, there is a set of allowed discrete settings

,

,

,

,⋯,

(allowable torsion-angle values or ring templates) with associated empirical probabilities Each set of settings is ordered according to decreasing probability, i.e. 1, ⋯ , probability

,



,

, ,

.

, ∈

1 . Assuming independence of the empirical probabilities, the overall empirical for a conformer can be calculated by ,

,⋯,

(3)

ACS Paragon Plus Environment

13

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

where

is the setting chosen for the ith degree of freedom and

Page 14 of 48

is its probability. To avoid

numerical inaccuracies when accumulating probabilities of molecules with many degrees of freedom, the natural logarithm of the empirical probability is used instead, i.e. ln

,

,⋯,

If, for the ith degree of freedom, the first (i.e. highest probability) setting is termed last (lowest probability) is termed

,

(4)

ln

,

and the

, a theoretical minimum and maximum conformer probability

can be derived: ln

ln

ln

ln

(5) ,

(6) ,

Conformer sampling. The conformational sampling step constructs conformations according to the traversal order and the set of settings determined above. Starting with the base fragment, each newly-added rigid or semi-rigid fragment is attached to the growing conformation using one of the settings available to it; this will be a torsion angle of a rotatable bond or a template of a flexible ring. Checks will be made to determine whether the partially-built conformer is unacceptable because it triggers one of the termination criteria described below. If so, the current search branch is terminated. Otherwise, the next fragment is attached. This procedure is carried

ACS Paragon Plus Environment

14

Page 15 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

out until either a full conformation of the molecule has been constructed or a termination criterion has been met. When a full conformation is generated, the settings that have been used in it are inserted into a tree data structure, which stores the empirical probability of the conformer in the leaf. The leaf is back-linked to the setting of the previous degree of freedom sampled, which in turn is back-linked to its predecessor, etc. While the depth-first search (DFS) strategy employed is in theory exhaustive, several termination criteria are used to limit the number of conformers sampled viz. (i) close contacts, (ii) number of unusual torsions, (iii) conformer probability and (iv) the number of conformations in the conformer tree data structure. Checking for close contacts in conformers is an essential part of conformer generation methods and at the same time one of the most time-consuming. In this work, the clash function described in equation (1), which is a purely repulsive potential with a finite cut-off, is used to perform closecontact checks between atoms. Only contacts between atoms which are separated by at least four bonds are checked. As conformations are sampled in a discrete space - a limited set of torsion angle settings or a limited number of ring templates - close contacts may be generated which might later be relieved by a small rotation around a rotatable bond. Therefore, adjustments have been made to the original values stated for

in equation (2) (to 0.3, and 1.1 for atoms pairs that can

form hydrogen bonds) and the vdW radius of hydrogen atoms used to derive c in equation (1) (to 0.7Å) to produce a softer clash term for use during discrete sampling. In the incremental construction process, a contact test is invoked once a new fragment has been added to the molecule. Instead of testing the atoms of the newly added fragment against all atoms of the previous fragments, a bounding sphere data structure is used for efficient clash checking. For each rigid fragment, a bounding sphere containing all atoms of the fragment is calculated; for flexible ring

ACS Paragon Plus Environment

15

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 48

systems, one bounding sphere per ring template is used. The bounding sphere of a newly added rigid fragment is tested against the bounding sphere of each of the previous fragments. If the bounding spheres do not intersect, the atoms inside the bounding spheres cannot intersect. If the bounding spheres do intersect, the atoms of the newly added fragment are tested against the bounding sphere of the respective previous fragment. Only when an atom intersects the bounding sphere of the respective previous fragment, is it tested against all atoms of this previous fragment. Since a depth-first incremental construction is carried out and only the coordinates of the newly added fragment are varied, no contacts between pairs of previously added fragments have to be rechecked. If the sum of close contact contributions for a partially constructed molecule accumulates to a value higher than clash

the current search branch is terminated.

The second termination criterion is based on the number of unusual rotamers in the conformer. Each time a fragment with a rotatable bond is added, the probability of the associated rotamer is checked to see if it is less than a threshold,

. If this is the case, the count of unusual

rotamers for the current conformer is increased by one. If this count exceeds a threshold, , the search along the current branch is terminated. Furthermore, no further settings for the current rotatable bond need to be sampled, as the rotamer-angle settings are ordered by is set to 2, meaning conformations with more

decreasing probability. By default, than 2 unusual rotamers are disallowed.

The third termination criterion uses the overall conformer probability to decide when to terminate. A probability threshold conformers with a

is chosen. Search-tree branches which will result in

lower than

are skipped.

is chosen between

and

to produce a conformer tree which is estimated to contain as many high probability conformers as possible below a maximum number of

conformers (see below). The

ACS Paragon Plus Environment

16

Page 17 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

aim here is to search a large conformer space, but without the risk, due to the DFS search strategy, of missing branches which use low probability choices early on. The method used to choose the threshold is described in the Supporting Information. Finally, to avoid excessive memory consumption, the number of conformers stored in the conformer tree data structure is limited to a maximum

conformers. If this limit is

reached, the conformer generation process is terminated. Clustering. So far, conformers have been generated with respect to probability and clash criteria but not conformational diversity. Depending on the setting of

, the number of

conformers generated can be in the order of thousands to millions for large molecules. To create a representative set of high probability and at the same time diverse conformers, a clustering and minimisation step is carried out. First, the list of conformers stored in the conformer tree is sorted according to decreasing probability. Then, the first conformer in the sorted list, i.e. the one with the highest probability, is chosen and minimised as the first member of a list of clustered conformers. The second and subsequent conformers are only minimised and added if they are dissimilar to all conformers already accepted. The procedure continues until either all conformers in the sorted list have been processed or a user-specified maximum number of conformers have been added to the list of clustered conformers. This process requires many conformer dissimilarities to be estimated, which usually makes it the rate-limiting step of the entire algorithm. In consequence, it is necessary to estimate conformer dissimilarity quickly. This is primarily achieved using a fast torsion-based dissimilarity coefficient which is described in the Supporting Information. Clustering minimisation. As described above, a relatively forgiving clash criterion is used at the discrete sampling stage, to avoid the rejection of potentially interesting conformations whose

ACS Paragon Plus Environment

17

Journal of Chemical Information and Modeling

torsion angles and ring-system geometries are high probability, but which have some close contacts. Each conformer considered for the list of clustered conformers is therefore minimised in torsional space with respect to the less forgiving clash term used during the initial minimisation of the input 3D molecule. In addition to this clash term, restraints are applied to ensure the torsion angles do not drift too far from the values they were assigned during the discrete sampling. The functional form of the restraint for torsion i is:

(7)

Θ

where Θ is the actual torsion angle of torsion , mol-1,

is a force constant set at 10.0 kcal

is a parameter, set at 2.0, influencing the shape of the potential (Figure 2) and Θ is the

torsion angle sampled in the discrete sampling stage. 80 70

change penalty

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 48

60 50 40 30 20 10 0

‐180

‐120

‐60

0

60

120

180

torsion change angle Figure 2. The torsion restraint function.

ACS Paragon Plus Environment

18

Page 19 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The minimisation is performed with a memory-limited BFGS algorithm.23,24 The minimised conformer is then added to the list of clustered conformers provided it is still sufficiently dissimilar from the conformers already in this list. Dissimilarity is measured in the same way as in the main clustering step (see above and “Dissimilarity coefficients used in clustering” in the Supporting Information) except that the threshold values are halved.

Evaluation The CSD Conformer Generator was evaluated against a recently published test set of structures derived from the PDB (the Platinum set).18 This dataset consists of 4548 structures in its primary subset and 2859 structures in its diverse subset. We elected to use the diverse subset in the evaluation. It has been filtered to attempt to remove errors in molecular geometry due to low resolution in the original structural data or misinterpretation of the observed electron density. Each test subset is provided as a single SD file of observed geometries. To evaluate a conformer generator, one needs to consider critical points within the generation workflow and evaluation. Typically, an end user will start with a structural representation that is solely described by chemical connectivity and absolute stereochemistry. An example of such a description, ubiquitous in cheminformatics, is the SMILES code.32 A typical conformer generation workflow will start with a SMILES representation, generate an initial set of 3D coordinates (the 1D → 3D step) and then use these for generation of a conformational ensemble (3D → 3D Ensemble). In this publication, we have only evaluated generation of 3D ensembles from a starting 3D structure, to allow direct comparison with previously published results; however, we note that the 1D → 3D step is an integral part of the conformer generator workflow.

ACS Paragon Plus Environment

19

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 48

The 1D → 3D step can influence the results that are generated in the 3D → 3D Ensemble step, particularly if the 3D → 3D Ensemble step only considers flexible torsions and ring geometries, as variations in valence angles can propagate into larger variations in the atomic root mean square deviation (ARMSD). Such variations in valence angles might arise during the initial minimisation step, to relieve clash in the input geometry. In principle this risk could be reduced by searching for a low-clash conformer before the initial minimisation; however, the CSD Conformer Generator does not implement this extra step. In practice we have observed only a small influence in starting from either generated or observed 3D geometry (see “More on the effect of starting from the observed conformation” in the Supporting Information). To allow for direct and fair comparison, we obtained from the authors of the Platinum set the initial 3D inputs that they used for their evaluation of conformer generators. These were used as initial 3D inputs for the CSD Conformer Generator. In evaluation, too, the ARMSD metric itself can be variable, depending on the software package used to measure it. In our comparison, we used an algorithm that is analogous to that used in the Platinum evaluation, which accounts for all possible permutations of matching atoms in its assessment of the minimum ARMSD (see the Supporting Information for detail). The CSD Conformer Generator was run via the CSD Python API.33 Except for the number of conformers generated, all default settings were used. Three evaluation runs were carried out: two with the maximum number of conformers set to 50 and 250 respectively, as this allows for direct comparison with two conformer generator evaluations published by the authors of the Platinum data set,16,18 and one to generate up to 10,000 conformations where possible. Results for the 50 and 250 conformer sets are presented here, and the equivalent figures for the 10,000 conformer set are available in the Supporting Information.

ACS Paragon Plus Environment

20

Page 21 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Results To assist the reader in comparing the following results with other conformer generators, extracts from the published results of benchmarking commercial conformer generators,18 are reproduced in the Supporting Information. Ensemble sizes. A plot of the mean ensemble size versus number of rotatable bonds is shown in Figure 3. When the maximum permitted number of conformers is 50 (250), this many are nearly always returned when the rotatable bond count is above 5 (7). The picture when 10,000 conformers are allowed is less clear (see Figure S4 in the Supporting Information). Some structures with higher numbers of rotatable bonds generate ensembles smaller than 10,000 conformers. One possible reason is that rather high dissimilarity thresholds are commonly used when selecting diverse conformers for large molecules with many rotatable bonds (see Table S4 and “Dissimilarity coefficients used in clustering” in the Supporting Information). This may reduce the number of conformers generated to below 10,000. Visual inspection of several examples indicate that there are many other reasons why a molecule may have an ensemble of less than 10,000 conformers including: 1. A lack of flexible rings. 2. The presence of relatively rigid rotatable bonds in esters and amides, which reduces the sampling space required for a molecule. 3. Steric clashes between closely associated fragments. Clashes allow elimination of branches in tree construction, effectively reducing the search space. 4. Topological symmetry within a molecule. Topological symmetry reduces the underlying size of the tree that requires sampling in our algorithm.

ACS Paragon Plus Environment

21

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 48

However, we also note that there are many molecules that contain some of these features and still generate a maximum ensemble size of 10,000 conformers.

Figure 3. The average ensemble size generated by the CSD Conformer Generator for structures from the Platinum data set with specified numbers of rotatable bonds when generating a maximum of 50 and 250 conformations per molecule. Overall success rates. Success rates for the program were analysed for the Platinum diverse test set. The percentage of test-set entries, for which (a) the best and (b) the top-ranked of the conformers generated satisfies given ARMSD thresholds, is reported in Table 1 from runs with the maximum number of conformers set to 50 and 250. ‘Best’ conformer here is defined by having the lowest ARMSD. The ‘top-ranked’ conformer here is defined as the first in the list of generated conformers, which has been sorted firstly by probability, secondly by clash score, thirdly by number of unusual torsions and fourthly by generation order. Table 2 gives the mean and median ARMSD of the best conformer. These may be compared with the equivalent figures reported in the recent evaluation of several commercial conformer generators18 (see Tables 2 and 3 therein,

ACS Paragon Plus Environment

22

Page 23 of 48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

extracts of which are reproduced in the Supporting Information) and show that the CSD Conformer Generator is performing extremely well. Table 1. Percentage of Platinum data set entries for which (a) the best and (b) the top-ranked of the conformers generated by the CSD Conformer Generator satisfied given ARMSD thresholds. Results are shown for runs in which the maximum ensemble sizes were 50 and 250. Maximum size

ensemble 50

250

Best ARMSD Threshold (Å) Conformer %

Top-ranked Conformer %

Best Conformer %

Top-ranked Conformer %

≤ 0.5

57

18

66

18

≤ 0.75

74

28

84

28

≤ 1.0

84

39

92

39

≤ 1.25

90

50

96

50

≤ 1.5

94

59

98

59

≤ 1.75

96

68

99

68

≤ 2.0

98

76

100

76

> 2.0

2

24

0

24

Table 2. Arithmetic Mean and Median ARMSD values for the 50 and 250 conformer sets generated by the CSD Conformer Generator on the Platinum data set. 50 conformers requested

250 conformers requested

Mean Median Std. Dev. Mean Median Std. Dev. 0.59

0.42

0.48

0.48

0.38

0.34

ACS Paragon Plus Environment

23

Journal of Chemical Information and Modeling

Ranking Analysis. The relative ranking of a conformation is a useful metric in indicating the degree of sampling necessary to increase the chance of generating an observed conformation.

100%

Percentage of Test Case Entries

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 48

75%

ARMSD