ReFlex3D: Refined Flexible Alignment of Molecules Using Shape and

Mar 30, 2018 - As such it can readily be implemented in existing workflows ranging from the selection of compounds from a virtual screening campaign t...
3 downloads 3 Views 779KB Size
Subscriber access provided by UNIV OF CAMBRIDGE

Chemical Information

ReFlex3D: Refined Flexible Alignment of Molecules using Shape and Electrostatics Thomas Christian Schmidt, David A Cosgrove, and Jonas Boström J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00618 • Publication Date (Web): 30 Mar 2018 Downloaded from http://pubs.acs.org on April 1, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

ReFlex3D: Refined Flexible Alignment of Molecules using Shape and Electrostatics Thomas C. Schmidt,*,† David A. Cosgrove,‡ Jonas Boström† †

Cardiovascular and Metabolic Diseases, Innovative Medicines and Early Development, AstraZeneca, Pepparedsleden 1, SE 43183 Mölndal, Sweden



Cozchemix Limited, 37 Coniston Way, Macclesfield, Cheshire, SK11 7XR, United Kingdom

*

corresponding author, email: [email protected]

Abstract We present an algorithm, ReFlex3D, for the refinement of flexible molecular alignments based on their three dimensional shape and electrostatic properties. The algorithm is designed to be used with fast conformer generators to refine an initial overlay between two molecules and thus to obtain improved overlaps as judged by an increase in calculated similarity values. ReFlex3D is open-source and built as a python package working in combination with the OEChem Toolkit. As such it can readily be implemented in existing workflows ranging from the selection of compounds from a virtual screening campaign to the construction of similarity based prediction models to estimate binding affinities. We evaluate ReFlex3D against the AstraZeneca Validation Test Set and illustrate its potential within a predictive model compared to an established method (Posit).

Introduction The alignment of molecular structures in three dimensional space is one of the most important yet also most challenging tasks encountered in molecular modelling.1 A perfect alignment is an essential prerequisite for most investigations based on three dimensional properties as any small deviation from a perfect overlay is instantly reflected by these quantities. One of the fields in which molecular alignments have experienced the widest spread is the discovery of new drugs via structure-based and ligand-based computer aided drug design.2,3 While structure-based drug design provides a means of tailoring a compound so as to most perfectly complement the binding site of a target protein, it has two major disadvantages. The effort needed to take into account the molecular environment of the receptor and the interactions with the compound under investigation is considerable (e.g. the preparation of the target structure, sampling of poses). Furthermore, the geometrical structure of the receptor must be known in the first place, a condition that is frequently not met within a drug discovery project.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

A valuable alternative is given by ligand-based approaches which exploit the fact that similar activities against a target are usually rooted in similarities between the three dimensional structures of bioactive compounds. The necessary features can be condensed into a pharmacophore model to which compounds under investigation can be compared or one can directly assess the similarity with respect to known actives by means of molecular descriptors. In both cases, the quality of the comparison depends critically on the alignment with the template.2 Unsurprisingly, the interest in robust and fast alignment molecular algorithms has led to the development of numerous approaches (e.g. SPAt4, SEAL5, FLASHFLOOD6, FLEXS7), which can be subdivided into three major classes. The earliest and most popular algorithms make use of the root mean square distance (RMSD) of corresponding atom pairs between two structures, e.g. the maximum common substructure,8,9 a value that can be used both for the alignment and as a measure of similarity. Despite being widely used as a standard for measuring similarities or dissimilarities it does have a major downside. To compute the RMSD value, one requires corresponding atom pairs derived by a 1:1 mapping between the two compounds under investigation.10 While this mapping can readily be performed for molecules sharing a common core, it becomes unfeasible for topologically diverse compounds and the resulting RMSD values become meaningless. To tackle these issues, methods have been developed that do not rely on a point to point correspondence between atoms of the two compounds but which try to map common features of the three dimensional structures, e.g. surface characteristics or pharmacophore features (SQ11). A completely different approach uses the three dimensional extent of the molecules as a whole and attempts to maximize the volumetric overlap between the two (ROCS12). [e],[f] Being originally developed based on the idea of finding the maximum overlap in the electron density of molecules, this approach can readily be extended by using other three dimensional functions such as the electrostatic potential (EON).13 As well as the choice of which of these approaches to follow for aligning the molecules, one also faces the question of how, if at all, to treat the conformational flexibility inherent in molecular systems. This question is of interest as even for moderately size molecules, the conformational flexibility of the compounds constitutes a major part of the degrees of freedom to be considered. It is well known that the geometric flexibilities also have a strong input on the potential energies of a molecule and can thus influence its binding properties.16 Usually, one is interested in low-energy conformers as each 1.4 kcal/mol increase in energy is reflected in a tenfold weaker binding. We will however first concentrate on the geometrical aspect of conformational flexibility. A widely used approach to tackle the problem of conformational freedom is to generate ensembles of conformers prior to the actual alignment process which in turn only performs rigid body transformations to optimize the overlap with the template. This approach is computationally very efficient and widely used in combination with conformer generators such as CORINA14 or OMEGA15. While using precomputed conformers leads to high efficiency, the quality of the results crucially depends on that of the generated ensemble of conformers. This is only a minor concern in virtual screening if geometries for the database molecules as well as for the align molecule are generated on a comparable basis.17 However, it becomes crucial if the bioactive conformation of an align molecule is known, e.g. from X-ray crystallography. A perfect overlap can only be found if the exact geometry of the align molecule is also present in the ensemble. As this is rarely the case, the best possible similarity to the reference is almost always lower than that of a perfect fit.18 Furthermore, this deviation might be considerably different for different molecules, potentially biasing a rank ordering based on their similarities.

ACS Paragon Plus Environment

Page 2 of 29

Page 3 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Flexible alignments on the other hand, taking the conformational freedom of molecules into account, are capable of minimizing this issue, albeit at the cost of drastically increased computational demands. To cut down on these requirements, some algorithms reduce the flexibility accounted for to the possible torsions between rigid fragments of the compounds. The final alignment is then constructed iteratively starting from an initial fragment and adding the others sequentially by sampling their positions and choosing the one which maximizes the similarity with respect to the reference.[f],19 However, the quality of these approaches depends heavily on the choice of how the first fragment is placed. Furthermore, most algorithms follow a greedy strategy, always following the current best solution, which is not guaranteed to reach the best global solution. Hence a common solution is to perform several initial alignments from different start points and select the optimal solution at the end. For the alignment of a molecule (B) onto a reference molecule (template, A), our approach comprises two steps. In the first step, an ensemble of conformers is generated for molecule B which are than superimposed onto the geometry of molecule A by rigid-body transformations. For this step we utilize the OEShape Toolkit provided within the OEChem Toolkit20 due to its excellent performance. In a complete solution, one would then perform a second step, a refinement allowing B to change its conformation, for each of the initial overlays. However, to reduce the computational demand only a subset (of conformations) is passed to the second stage, by selecting a smaller subset of geometries based on and ordered by their scores for the initial overlay.

In the second step, the initial overlays are refined to the optimal alignments. To account for the molecular flexibility, this refinement is performed by optimizing the coordinates of each of the aligned molecule’s atoms. The scoring function is a combination of the relative potential energy of B and its similarity with respect to A measured via the overlap in their three dimensional shapes, color or electrostatic potentials represented by atomic charges. The reference geometry of A is kept fixed at all times. The problem of finding a global extremum is a frequently encountered problem in optimization problems and there are many possible approaches. We are not attempting to compete with those; instead we concentrate on improving the local optimization of a preliminary alignment of two compounds. By using global quantities like shape overlap as the optimized score, we avoid any issues accompanying the required mapping of atoms between the two molecules of interest which also makes the method suitable for topologically highly diverse molecules. In contrast to other shape-based algorithms, we include color and electrostatics directly as an integral part of the flexible alignment. We evaluate the advantage of using atomic point charges to represent the electrostatics of molecules compared to the use of the efficient approximation of the electrostatic properties via color atoms as implemented in the OEChem Toolkit.20 During the optimization, we aim to balance the ratio between the energetic (relative energies) and similarity contributions (shape and electrostatics), such that the method is equally well suited for smaller or larger molecular systems. We therefore employ normalized similarity measures, e.g. the Tanimoto or Tversky indices, which can be parameterized so as to balance a physically meaningful energy range for any molecule under investigation. Finally, by utilizing the Tversky index within the scoring function, the algorithm is designed not to be limited to alignments aimed at finding the perfect match between two molecules as a whole, but is also capable of retrieving optimized alignments if the molecule under investigation (B) comprises a super- or substructure of the template molecule (A).

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The quality of the algorithm is assessed using the AstraZeneca Validation test set consisting of 121 diverse targets with 1464 ligands in total.21,22 Firstly, the algorithm is tested in a series of self-alignment calculations to evaluate how accurately the method is capable of reproducing the experimental X-ray structures used as reference geometries if the exact shape of the investigated compounds is known. Secondly, cross-alignment calculations are used to evaluate how far our ligand-based approach is capable of regenerating binding poses of small molecules without explicit knowledge of the target geometry but only of analog molecules. Finally, the benefit of using the flexible alignment in comparison to a rigidbody alignment is investigated by comparing a previous study with alignments derived using ReFlex3D and correlating the obtained similarities with experimental binding affinities.

Methods Measures of Similarity The degree of similarity between two molecular geometries can be quantified by any of the three measures listed below: 1. Shape[33]: This measure uses the spatial overlap between the two molecules. The shape of a molecule is represented by a combination of atom-centered Gaussian functions. 2. Color: This measure approximates chemical features (formal charges, H-bond donor and acceptor positions, rings,…) by artificial color atoms. Each color atom is assigned a Gaussian function and the overlap is calculated similarly to that of shape. 3. Electrostatics: The electrostatic properties of the two molecules under comparison are represented by atomic point charges which are assigned according to the MMFF94 scheme or computed via AM1 calculations. The overlap of electrostatic properties is again based on corresponding Gaussian-functions representing the charge distribution.

Calculation of RMSD values RMSD values are a useful measure to describe how well one molecular geometry resembles a second one. However, it requires a 1:1 mapping of atoms between the two structures. Wherever reported within this study, RMSD values are computed as root mean square deviations in the atomic positions of a molecular geometry with respect to the crystal structure geometry of the same chemical compound. This also holds true, even if the alignment was performed using a different chemical compound as template.

Shape- and Color-Based Flexible Alignment of Molecules The alignment method introduced in this study, ReFlex3D, was inspired by the Shapefit algorithm within the Posit program.23,24 The main idea behind Shapefit is to optimize the volume overlap between a molecule B and a template compound A while accounting for the flexibility of the align molecule by freely optimizing the coordinates of all atoms. The objective function for the optimization is composed of two terms, one describing the overlap to be maximized ( Eshape ) and the other one the conformational energy ( Erel ) of the align molecule (B), that assures chemically reasonable geometries. Balancing these two contributions is achieved via the scaling parameter λ, which is adjusted for each individual, pairwise alignment such that the conformational energy of B with respect to the nearest local minimum does not exceed 10kcal/mol. The optimization function f shapefit thus reads:

ACS Paragon Plus Environment

Page 4 of 29

Page 5 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

f shapefit = Eshape+ λ⋅Erel While the Shapefit algorithm allows for a flexible alignment of chemically diverse molecules, its implementation in Posit exhibits several drawbacks that our method attempts to improve on. Firstly, a mere technical inconvenience, Posit requires the definition of a target receptor. As this information is frequently not available, ReFlex3D is designed to work without a receptor model solely on a ligand-only basis. Secondly, the overlap term in Shapefit23 comprises an energy term and the overlay between the shapes of the two molecules under investigation. While the Shapefit potential is reported23 to only include those two terms it’s current implementation in Posit also contains a term representing electrostatic similarities. ReFlex3D extends this idea by using a combined similarity term ( T combo ) composed of the shape overlap ( T shape ) and the overlap in chemical features ( Tcol or ) represented by so called color atoms.25 As an alternative to the use of Tcol or ReFlex3D also provides the possibility of utilizing the overlap between their electrostatic potentials ( T elstat ) arising from the respective atomic charges.26 The contributions arising from these similarities and the conformational energy are balanced by an energy dependent scaling factor: (δ⋅Erel )

f= E pot⋅cscale⋅e

⋅[T shape +T color +T estat ]

The scaling factor represents the third major difference between ReFlex3D and Shapefit. While the ratio between the conformational energy of B and the similarity term is parameterized individually for each pairwise comparison in Shapefit, ReFlex3D uses a universally parameterized energy dependent scaling. In our method, the alignment is dominated by the similarity terms close to the local energy minimum whereas their influence is dampened with an increase in potential energy. It is important to note that Erel represents the potential energy with respect to the nearest local minimum ( E0 ) and Epot E = E0 + Erel E represents the absolute energy. Thus pot . rel is bound to [0,+∞ ] , so the contributions due to similarity thus are bound to the interval [0, cscal e] . Using a series of test compounds,27 we determined that a refinement from Omega conformer geometries by introducing flexibility into the alignment resulted on average in an improvement of 3% in T combo . Hence the parameters cscale and δ were chosen such that a change of 5% in similarity will be balanced out by a relative energy increase

mol kcal of 10 kcal/mol. The exact values of δ= − 0.01 kcal and cscal e= 250 mol provide a reasonably smooth damping of the similarity term.

were then selected, as they

Another improvement in comparison to the Shapefit algorithm concerns the use of similarity values ( T shape …). Instead of the Tanimoto values used within Shapefit, ReFlex3D uses the corresponding Tversky, T (α , β )shape , values, which allow for additional control of the similarity measure. While there are no restrictions for the values of α and β per se, commonly values are chosen from the interval [0,1] . Setting both Tversky parameters to unity ( α= β= 1 ) reproduces the Tanimoto coefficient. By changing these parameters, the alignment procedure can be adjusted for molecules representing either substructures ( α< 1 ) or superstructures ( β< 1 ) of the template compound. Comparing 2-phenylhexane (A) with benzene (B) for example, the Tanimoto value will only amount to about 0.5, even if the benzene rings overlap perfectly. Using a parameterization for substructures however ( α= 0 , β= 1 ), the Tversky index will still amount to 1.0.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Both parameters, α and β , can readily be adjusted for the task in hand. As the molecules under investigation usually do not differ significantly in size within each set, we kept both parameters fixed at their standard value of 1, thus effectively using Tanimoto indices throughout.

Workflow The workflow of ReFlex3D can be depicted as shown below: 1. Generate an exhaustive ensemble of conformers for the molecule to be aligned (B) by using the Omega conformer generator; superimpose each conformer with the template molecule (A) via shape and color using the OEShape Toolkit 2. Reduce the number of conformers by selecting up to ten conformers according to overlap in shape and color with A 3. Use the new scoring function of ReFlex3D to flexibly refine the initial overlays of each of the conformers of B by optimization of the overlap with respect to the template molecule A; the scoring function can comprise a combination of shape and chemical overlap or a combination of the overlap in shape and electrostatic properties represented by point charges (MMFF94 or AM1)

Implementation and Usage The flexible alignment was implemented in form of a python script package extending the capabilities provided by the OEChem Toolkit.20 The optimization of the objective function is carried out making use of numerical minimization routines provided within the SciPy package.28 The program starts by reading the template geometry and the initial overlay(s) from the respective input files. For each align geometry, an object is created containing the actual geometry and that of the template compound. Also, the nearest local minimum of the molecule B and its corresponding energy E0 are determined. When using electrostatics as the similarity measure, atomic charges are assigned (MMFF9429 or AM130,31 charges). Using the atomic Cartesian coordinates as input, the created object provides a function to compute the value of f . While internal coordinates would be beneficial for energy calculations, overlap terms are calculated in Cartesian space. Therefore we were consistently using Cartesian coordinates throughout. Using the function minimize32 from the SciPy package, f can thus be minimized by optimizing the atomic positions. The final set of optimized coordinates is returned as the optimal alignment. ReFlex3D is designed as a local optimizer and can thus only refine the provided overlay to the nearest local optimum. Hence, to find the global minimum an initial overlay sufficiently close to it is required. For the current study, the input structures are derived from a conformational ensemble created via the Omega conformer generator within the OEChem Toolkit. To ensure that the ensemble covers conformational space adequately, Omega was parameterized to perform an exhaustive search: the energy window was increased to 25kcal/mol (default 10kcal/mol) and the maximum number of returned conformers was set to 100,000 which no individual molecule was expected to exceed. Due to the immense size of many of the ensembles so generated, considering each single conformer would clearly exceed currently available computational capabilities. Therefore, a subset of conformers needed to be selected. The selection was performed based on the shape and color overlap between the conformers

ACS Paragon Plus Environment

Page 6 of 29

Page 7 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

and the reference compound. Up to ten conformers are returned as long as their T combo values are at least 90% of the highest value within the ensemble.

The AstraZeneca Validation Test Set The basis of the current investigations is the AstraZeneca Validation Test set introduced by Giangreco et al21 which is freely available under https://www.ccdc.cam.ac.uk/support-and-resources/downloads/#. The set contains a total of 1464 compounds addressing 119 targets (two with alternative binding sites) and can thus be considered sufficiently broad to allow the estimation of a method’s general applicability in aligning molecular structures. While the overlays of compounds addressing the same target have been produced by taking their protein environments and especially the binding site geometries into account, the geometries of the compounds within the dataset provide an excellent means for validating ligandonly alignment methods. In a second paper, Giangreco et al.22 investigated the AstraZeneca Validation Test Set for its suitability for testing the performance of pharmacophore based alignment programs. Within this study, they performed a classification of the 121 target sets based on how well the crystal structure geometries could be recreated by pharmacophore-based alignment programs. Their classification is based on a combination (Borda tally) of the shape Tanimoto, the color Tanimoto and the Tanimoto coefficient of the compounds fingerprints and results in four categories: easy (22 sets), moderate (73 sets), hard (18 sets) and unfeasible (8 sets). For the later analysis, we are going to use the same classification to compare our results against those reported earlier. In this context it needs to be noted that the set of 2-C-methyl-Derithrol 2,4-cyclodiphosphat synthase (Q3JRA0) exhibits a Borda tally of exactly 314, thus qualifying for the unfeasible rather than the hard set (according to the scheme used by Giangreco et al.22). For consistency and because the distance to the nearest target in the hard set (312 vs 314) is smaller to that in the unfeasible set (314 vs 324), we keep considering Q3JRA0 as being “hard” in line with the original classification of Giangreco et al.

Geometry Optimizations by Energy Minimizations using MMFF94 The current implementation of ReFlex3D uses the relative energy Erel of a molecule with respect to the nearest local minimum to scale the similarity contributions during the refinement. Whereas the initial energies do not have a physical meaning on their own, using the relative energies during the alignment assures chemically reasonable optimized geometries independent from the exact input geometries. A first series of calculations was therefore carried out on the curated geometries of the AstraZeneca Validation Test Set which were optimized so as to minimize their potential energies with respect to the Merck Molecular Force Field (MMFF94)29 as implemented in the OEChem Toolkit. The energy differences obtained provide an estimation of the likely ability of ReFlex3D to reproduce the crystal structure geometries. In the case of excessive relative energy values, the similarity contribution to ReFlex3D’s scoring function will be almost nullified and the resulting geometries will not resemble the crystal structures, but MMFF94 minimum energy geometries. Low relative energies however allow the maximization of the similarity between molecules B and A and will result in optimized alignments close to the crystal structures.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Flexible Self-Alignment starting from the Crystal Structure Geometries ReFlex3D balances the overlap between a align compound and its reference geometry against its potential energy with respect to the nearest local minimum. Due to the latter, a perfect match with the reference is very unlikely. However, an estimate of the best case performance of the method can readily be obtained by applying the flexible alignment directly on the provided crystal structure geometries. While the initial overlap with the template in this case is already perfect, the energy minimization will introduce geometry deviations until the contributions between similarity and relative energy equal out. The structural deviations are compared to those obtained by the unconstrained energy minimizations mentioned before and are assessed by comparing the differences in relative energies as well as RMSD values. To compare between the representation of the molecules’ electrostatics via color atoms and atomic point charges, series of self-alignments were also carried out using the charge models implemented in the OEChem Toolkit (MMFF94 charges,29 conformational dependent and independent AM1 charges using the bond charge correction30,31).

Initial Conformers generated using Omega As the flexible alignment represents only a local optimization, its quality depends highly on the initial geometries provided. Therefore, in order for it to be successful, conformers need to be provided that resemble structures close to the optimal overlay geometry, in this case the X-ray crystal structure. Conformer ensembles used within our investigations are generated for each of the compounds of the dataset using the Omega conformer generator within the OEChem Toolkit. To ensure that the ensembles are sampled as extensively as possible, the allowed energy window was set to 25kcal/mol (default: 10kcal/mol) and the maximum number of conformers to be returned to 100,000 (default: 200). The conformers within each ensemble were successively aligned to their corresponding crystal structure geometry by minimizing the RMSD values, also within the OEChem Toolkit, and the one yielding the lowest deviation was returned. The quality of the generated conformers was then assessed by their RMSD values as well as their shape and color overlap with the reference compound.

Flexible Self-Alignment starting from the Best Omega Conformer To assess whether the flexible alignment can be shown to improve on the conformer ensembles generated using the Omega conformer generator in terms of increased similarities and refined overlays with the template geometries, the best conformers of each Omega conformer ensemble have been refined using the flexible alignment algorithm. These calculations were conducted in two series. The first series is intended to compare the best Omega conformer with respect to its RMSD value with the refined structure obtained by the flexible alignment. While selecting the Omega conformer giving the lowest RMSD with respect to a reference compound allows the assessment of the best performance of the conformer generator itself, refining these geometries with the new algorithm provides an unbiased comparison between the two approaches and gives an estimate of the improvement due to flexible treatment of the molecules. While using the RMSD value as a measure for selecting the best conformer, it also results in the best conformer to be optimally aligned with the crystal structure thus biasing the flexible alignment towards the optimal solution. To remove this bias, a second series of alignments was

ACS Paragon Plus Environment

Page 8 of 29

Page 9 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

performed using the same selected Omega conformer geometries as before, but discarding the overlay by translating the molecules away from the compound and performing a random rigid-body rotation. The preliminary alignment between molecule B and template molecule was then accomplished by a rigid-body alignment using T combo and using the alignment yielding the highest value. The flexible alignment was then started from these geometries and the solutions obtained were compared to those of pure Omega conformers by means of RMSD values, shape and color overlap as well as the relative energies of the final conformers.

Flexible Self-Alignment starting from an Omega Conformer Ensemble To assess the suitability of the novel alignment algorithm in cases where only the topological structure of the align compounds is known, initial overlays were derived from ensemble conformers generated with the Omega tool. Therefore an exhaustive ensemble of conformers was generated for each align molecule. As the number of generated conformers can easily become very large, it is currently not feasible to use all conformers for the actual alignment procedure. A selection was made such that the number of conformers is greatly reduced but a suitable starting geometry for the alignment close to the crystal structure geometry is still present in the remaining set of conformers. As the flexible alignment is intended to be used to estimate the best possible alignment of two chemically different compounds, selecting suitable conformers by means of RMSD values is generally not feasible. Therefore, and in correspondence with the scoring function of the flexible alignment algorithm, the most suitable conformers are selected by their T combo (shape and color Tanimoto) values with the reference compound. In the present study we took advantage of the OEShape Toolkit and its functionality to rapidly optimize the shape overlay between two molecules and to align the align molecue to a template. The final set of conformers was then built by selecting at most 10 conformers by decreasing T combo values as long as these values exceeded 90% of the T combo value of the topmost ranked conformer. These sets of conformers were then used as input to the flexible alignment algorithm and the resulting overlays were compared with those obtained from the previous investigation using Omega conformer ensembles.

Flexible Cross-Alignment Similarly to the aforementioned self-alignments using the provided crystal structures and optimal Omega conformers as starting geometries, pairwise cross-alignments were performed within each of the 121 subsets for each protein target. While the self-alignments are expected to work similarly well across all 1461 compounds within the complete dataset, the cross-alignments are expected to exhibit a large dependence on the presence and conservation of a common pharmacophore across the compounds within one subset. Hence the results of the cross-alignment are related to the classification given by Giangreco et al. who classified the 121 sets into easy, moderate, hard and unfeasible with respect to pharmacophore based alignments.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

Prediction of Binding Affinities The advantage of ReFlex3D and its refined alignment of molecular geometries is evaluated by using the optimized geometries to predict binding affinities of a series of compounds against their target protein. The results are compared to alignments obtained by Posit published earlier. The quality of the predictions is assessed by the correlation coefficient between the T combo values measured with respect to a compound with high affinity and the experimentally determined binding affinities.

Results Preliminary Analysis of the Dataset – Geometry Optimizations The geometrical deviation between the crystal structures and the geometries optimized by energy minimization using MMFF94 was found to be relatively low with a mean RMSD value for heavy atoms of 0.74 Å. Some crystal geometries have been found to almost coincide with the corresponding force field minima with RMSD values as low as 0.03 Å while in some cases significant deviations with RMSD values up to 3.47 Å have been observed. Inspecting the distribution of RMSD values (see Figure 1) however reveals that they are clustered around their median value of 0.57 Å with more than 75% of all structures exhibiting RMSD values not exceeding 1.0 Å. For the vast majority (more than 95%) of the structures, RMSD values were found to be lower than 2.0 Å. 200 180 160 140

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 29

120 100 80 60 40 20 0

0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

2

RMSD [Å]

Figure 1: Histogram of the computed RMSD values between the energetically (MMFF94) optimized geometries and the corresponding X-ray crystal structures.

Flexible Self-Alignment Starting from the Crystal Structure Geometries The exact geometries from the crystal structure data were used as input for ReFlex3D to evaluate how closely the refined geometries resemble the template structures. For each optimized compound the

ACS Paragon Plus Environment

Page 11 of 29

relative energy with respect to the nearest local minimum was obtained by full optimization of the X-ray crystal structure geometries using the MMFF94 forcefield. The average of the conformational energy penalty was determined to be 7 kcal/mol and more than 80% of all refined structures do not exceed the 10kcal/mol threshold. An inspection of the distribution of the relative energies (see Figure 3) reveals that less than 3% of all calculations resulted in energies above 25kcal/mol. These larger energies are found among charged compounds, mainly zwitterionic ones, exhibiting at least two highly polarized oppositely charged regions on their molecular surface. Hence the geometries of these molecules tend to collapse during unconstrained energy optimizations due to the large attractive electrostatic forces between these surface patches. The constraints due to similarity with respect to their crystal structure reference necessarily prevents a complete collapse and results in the observed, highly elevated conformational energies. Attempts to dampen the effects of charged groups on the molecular geometries by using solvent models were unfortunately not successful. Using the Poisson Boltzmann solvation model within the OEChem Toolkit led to non-smooth potentials such that the optimizer was not able to find the nearest minimum. Being constrained to the crystal structure geometries, the flexible alignment results in geometries much closer to the reference structures than the previous, unconstrained optimization. Thus the average RMSD value (compare Figure 2) for the heavy atoms reduces to 0.13 Å, significantly lower than that previously computed (0.74 Å). This also holds true for the largest RMSD value which was reduced almost by a factor of two from 3.47 Å to 1.72 Å. 800 700 600 500

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

400 300 200 100 0

0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

2

RMSD [Å]

Figure 2: Distribution of RMSD values measured between the optimized geometries obtained with ReFlex3D and the reference structures from the AstraZeneca Validation Test Set.

ACS Paragon Plus Environment

900

100

800

90

700

80 70

600

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 29

60 500 50 400 40 300

30

200

20

100

10

0

0

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

cumulat ive distribut ion [%]

Journal of Chemical Information and Modeling

0

relat ive energy [kcal/mol]

Figure 3: Distribution of the relative energies of optimized geometries obtained by using ReFlex3D and the crystal structures as input. As can be seen from the cumulative numbers, the majority of structures exhibits less than 10kcal/mol relative energy with respect to the nearest local minimum.

Evaluation of Electrostatics Instead of Color Atoms The overlays obtained with ReFlex3D using

T combo as the similarity score were compared to those

computed using T estat instead of Tcol or . The first charge model we investigated was MMFF94 charges. As these charges do not depend on the three dimensional geometries of the molecules, these charges could be assigned at the beginning of each pairwise alignment leading to only a moderate increase in computational time compared to using Tcol or . Using the MMFF94 charges did not improve the alignments (see Figure 4). While the electrostatic similarity between the optimized geometries and the X-ray structure reference geometries could be slightly improved (0.08 on average), T combo decreased by 0.04 on average and the mean RMSD increased by 0.02Å compared to the shape and color based approach. Exchanging the MMFF94 charge model for AM1 charges assigned at the beginning of each optimization showed similar results with improvements on the electrostatic similarities (+0.08) but slightly worse results with respect to T combo (-0.05) and RMSD values (+0.03Å, see Figure 4). Finally, we conducted another series of calculations employing AM1 charges, but this time assigning them according to the exact geometry within each step of the optimization. Although this comes at a highly increased cost with respect to the computational effort, in contrast to the other two variants using atomic charges this approach leads to improved self-alignments (see Figure 4) compared to the shape and color based approach. The increase in T estat is on average again 0.08, but there is also an increase in the T combo values of 0.03, leading to RMSD values on average 0.07 Å lower than those of the

T combo -based alignment.

ACS Paragon Plus Environment

Page 13 of 29

400 350 300 250

frequency

200 150 100 50

0 0. 02 0. 04 0. 06 0. 08 0. 1 0. 12 0. 14 0. 16 0. 18 0. 2 0. 22

0 -0 .2 -.1 8 -.1 6 -.1 4 -.1 2 -0 .1 -0 .0 8 -0 .0 6 -0 .0 4 -0 .0 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

dRMSD [Å] MMFF94

AM1 BCC stat ci

AM1 BCC conf dependent

Figure 4: Distribution of changes in the RMSD values (dRMSD) compared to the T combo based approach. RMSD values are computed between the optimized alignments from ReFlex3D and the corresponding crystal structure geometries and the difference between the charge models and the

T combo

-based approach is reported. While MMFF94 and conformation

independent AM1 charges lead to worse results compared to the charges provides an improvement of 0.07 Å on average.

T combo

-based approach, using conformation dependent AM1

To give an estimate of the significance of the differences using atomic charges compared to the simpler color atom approach, the effect size was calculated using Cohen’s d.34 For the MMFF94 and the AM1BCC charges computed on the initial conformers, the corresponding d-values are relatively low (0.13 and 0.16 respectively) indicating only a small effect between either of the approaches. The much larger d-value of 0.61 computed for the comparison of using T combo against conformationally dependent AM1BCC charges however shows that the observed reduction in RMSD values represents a significant improvement (medium to large effect). While the use of conformation dependent AM1 charges is shown to be beneficial compared to the simpler T combo approach, we still employ the latter for the further investigations, as its cost benefit ratio currently still outweighs the improvements in the alignments.

Initial Conformers Generated Using Omega The Omega conformer generator was used to exhaustively generate conformer ensembles for each compound in the dataset. The ensembles were verified to be complete with respect to the number of conformers as the number of geometries in the largest ensemble (30,693 for ligand 88V, 4-{2-[({3-[(2methyl-1,3-benzothiazol-6-yl)amino]-3-oxopropyl}carbamoyl)amino]ethyl}benzyl)propanedioic acid, pdbcode 4ajn) did not exceed the specified limit of 100,000 structures. A geometric comparison after aligning the generated conformers with the crystal structure by minimizing the RMSD value between

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

them and selecting the overlay yielding the lowest RMSD value, resulted in a close resemblance of the reference structures, with a mean RMSD value of 0.44 Å and no individual value exceeding 2.0 Å, in good agreement with previous investigations (see Figure 5).This indicates that the conformers generated by Omega provide at least one conformer that is reasonably close to the corresponding crystal structure and that could be used as initial structure for the flexible alignment procedure.

400 350 300 250

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 29

200 150 100 50 0

0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

2

RMSD [Å]

Figure 5: Distribution of the RMSD values for the top ranked Omega conformers of each molecule in the dataset. The RMSD values represent the deviation of the Omega conformers aligned with the corresponding X-ray crystal structure as template.

Flexible Self-Alignment Starting from the Best Omega Conformer Having verified the suitability of the Omega conformer ensembles, each of the top ranked molecules with respect to the crystal structure (via RMSD) was further refined with the new flexible alignment method. A comparison between the refined structures and the initial Omega conformers was carried out using the change in conformational energy, the change in shape and overlap scores as well as the change in RMSD values with respect to the corresponding crystal structure. Due to its design to optimize the overlap between two compounds, the similarity measures T shape ,

Tcol or and T combo generally increase. While this is smallest for T shape with an almost negligible average increase of +0.03, the improvement in by Tcol or +0.16 and thus T combo of +0.19 is noteworthy. While the absolute improvement of individual similarity measures shows that the flexibility of the molecules allows them to resemble the reference geometry more closely, the overall benefit of the new

ACS Paragon Plus Environment

Page 15 of 29

algorithm can be seen in the distribution of similarity measures between the Omega conformers and the geometries refined by ReFlex3D (see Figure 6 and Figure 7). While the Tcol or values computed for the Omega conformer ensembles are widely spread down to 0.5 (see Figure 6 and Figure 7), those measures obtained for the refined geometries are concentrated in a range between 1.0 and 0.9 leading to less than 10% of the molecules having a self-alignment error of more than 0.2 from the expected perfect fit ( Tcol or = 1.0 ). 600 500 400

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

300 200 100 0

1 0.98 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

Tcolor Omega

Figure 6: Comparison of the distributions of obtained with ReFlex3D.

Tcol or

ReFlex3D

between the best Omega conformers and the optimized geometries

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

900 800 700 600

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 29

500 400 300 200 100 0

2

1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.1

1

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

0

Tcombo Omega

Figure 7: Comparison of the distributions of with ReFlex3D.

Tcombo

ReFlex3D

between the best Omega conformers and the refined geometries obtained

As well as the similarity measures directly included in the optimization function ( T shape and Tcol or ), the RMSD values between the optimized geometries and their corresponding crystal structure reference were also found to improve considerably (see Figure 8). On average, the RMSD was computed to 0.44 Å within the Omega conformers and only 0.25 Å after the flexible refinement, an average improvement of 0.19 Å.

ACS Paragon Plus Environment

Page 17 of 29

500 450 400 350

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

300 250 200 150 100 50 0

-1.1 -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

RMSD [Å]

Figure 8: Distribution of the change in RMSD measured between Omega conformers and geometries obtained with ReFlex3D, each with respect to the crystallographic reference geometries.

The refinement of top ranked Omega conformers selected by minimal RMSD values leads to a bias when being refined with ReFlex3D and the RMSD re-computed. To remove this bias, a second selection of conformers from the Omega conformers was used, namely selecting the top ranked conformer by shape ( T shape ). There is again an average improvement in the overlap measures T shape (+0.04), Tcol or (+0.10) and

T combo (+0.15), although this improvement is less than in the previous series. This is mainly due to the initial alignment already having been performed based on an optimization of these quantities thus leaving less potential for further optimization. RMSD values of the refined geometries were determined to be within the range 0.03 to 1.98 Å with an average value of 0.26 Å, comparable to the best case scenario described above. The improvement in RMSD due to the flexibility amounts to 0.17 Å, slightly less than before.

Flexible Self-Alignment Starting from Selected Omega Conformers The bioactive conformation of a compound is usually not known and thus cannot be used to select the optimal conformer from a conformer ensemble by means of measuring an RMSD value. However, simply refining all conformers is currently not feasible due to the immense computational demand. Therefore, a small set of initial geometries has to be selected from that ensemble using a different measure. In our case, these geometries were selected by their Tcol or value, which can be calculated for any arbitrary pair of compounds. However, selecting the conformers with largest overlap in shape and color does not

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

necessarily retrieve the conformer leading to the lowest RMSD measure when comparing to the experimentally (X-ray) determined bioactive conformation. Therefore, the measured RMSD values for the cross-alignment were found to be much larger than those for the self-alignment. Within the set of 1461 compounds, the average RMSD was determined to be 0.45 Å with a maximum deviation of 3.43 Å. Closer inspection (see Figure 9) revealed that the latter was caused by the compound C24 (3-({2-[(4carbamimidoyl-phenylamino)-methyl]-3-methyl-3H-benzoimidazole-5-carbonyl}-pyridin-2-yl-amino)propionic acid ethyl ester), pdb-code: 1kts, alpha-thrombin), as the RMSD values for the conformers that give the highest T combo results were found to be more than 1.0 Å higher than the lowest RMSD values found among this ensemble. The high RMSD values of the best T combo -based overlay originate from the misalignment of the terminal pyridine and ethyl-ester moeities, which cannot be corrected for by the local optimization of the refinement step. Neglecting the results of the C24 (pdb-code: 1kts, alphathrombin) compound, the maximum RMSD reduces considerably to 1.96 Å.

Figure 9: An illustrative example showing the difference in the selection of molecules using RMSD or T combo . While the top ranked overlay based on the RMSD with respect to to the crystal structure geometry of C24 (pdb-code: 1kts, alpha-thrombin, drawn in green) shows a generally good overlap with relatively high local deviations (left), the T combo -based overlay aligns the two ring scaffolds much better at the expense of confusing the terminal substituents (right).

Flexible Cross-Alignment The investigations described up to now were aimed at clarifying the best possible performance of ReFlex3D in reproducing bioactive conformations determined via X-ray crystallography. To achieve best case performance, however, we focused on refinements of known geometries, although a perfect reference geometry is usually not available in general. We also carried out another set of calculations where we refined the geometry of one molecule (B) using the crystal structure geometry of another compound (A) from the same test set as a reference. This may be termed cross-alignment. As shown above, the result of this (cross-)refinement will largely depend on the resemblance between the initial geometry of B and the reference geometry of A. The geometry of A also needs to be similar to the bioactive conformation within the crystal structure of B in terms of shape and electrostatic properties, as the refinement is based on a maximization of their mutual overlap. In order to focus on the effect of the mutual overlap on the refinement, a first series of

ACS Paragon Plus Environment

Page 18 of 29

Page 19 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

cross-alignments was performed using the exact crystal structures of B as the initial geometry. This avoids potential issues starting from unsuitable geometries which can not be refined to the bioactive conformation of B. A second investigation was performed using an ensemble of conformers generated via Omega for molecule B. Thus the exact bioactive conformation of B is no longer required to start the refinement. Depending on the flexibility of the compound investigated, conformational ensembles can become very large. To reduce the computational demands to a manageable level, we therefore generate extensive ensembles for each compound, and reduce the number of conformers to a maximum of ten conformers for each molecule as judged by similarity. The selection is performed based on a rigid body overlay between the Omega geometries and the template molecule using the T combo as measure. Only the highest score and conformers that yield at least 90% of its T combo value are used within the following refinement.

Using the crystal structure of each molecule B as initial geometry for the refinement via ReFlex3D was used to determine the best-case performance in reproducing the bioactive conformation of B. Hence it is not too surprising that the cross alignments using the crystal structures as starting geometries result in relatively low RMSD values with an average of only 1.04Å across all molecules within the dataset. However, not all compounds could be refined to geometries closely resembling their crystal structures. This occurs in cases such as the lactotransferin set (P24627) in which some of the compounds exhibit virtually no mutual overlap when comparing their exact crystal structure geometries (see Figure 10 left). Hence, any shape-based method necessarily fails in a cross-alignment between such compounds. However, for compound sets that share a common scaffold with significant overlap in shape and color, e.g. the receptor-like tyrosine-protein phosphatase gamma (P23470, see Figure 10 right), a crossalignment can indeed be expected to yield crystal-structure like geometries. In the case of P23470, the flexible alignment yields very accurate overlays with RMSD values (with respect to the crystal structures) not exceeding 0.80Å for any single overlay. The average within this set is as low as 0.39Å and the best cross-alignment reproduces the crystal structure almost exactly with an RMSD of 0.07Å.

Figure 10

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 10: Overlay of selected molecule within the lactotransferin (P24627) set showing only little spatial overlap(left). In comparison the overlay of the receptor-like tyrosine-protein phosphatase gamma set (P23470) showing a close resemblance between all molecules. While the latter is suitable for a ligand-based alignment, such an approach is likely to fail for the former set.

As the bioactive conformation of B is usually unknown, we performed a second series of refinements starting from Omega conformers. The selection criteria mentioned above resulted in a total of 180,581 individual conformers (more than 120 conformers per compound B), all of which have been refined using ReFlex3D. Looking at the average RMSD value obtained for all 180,581 conformers after refinement with respect to their respective crystal structure geometries, a value of 3.48Å, the refinement with ReFlex3D does not look promising. However, one has to take into account that also the initial Omega conformers, though being selected to exhibit maximal overlap with the template molecule, exhibit an average RMSD of 3.49Å. This result confirms the observations made for the self alignment, that not all conformers of a molecule can equally well be refined to an optimal geometry. The effect of using different template molecules A for the refinement of B can be seen in Figure 11. While the average RMSD values for all 1464 molecules are widely spread, the corresponding minimal RMSD values, obtained from all crossalignments within the corresponding set, are found to be very low. This indicates that it is absolutely crucial for any local refinement, and thus also for ReFlex3D, that the initial conformer geometry already resembles the desired crystal structure geometry reasonably well.

If only the best overlay between the template molecule A and all the (refined) conformers of molecule B are considered, effectively choosing the best conformer within the ensemble, the RMSD values decrease significantly (Omega: 2.80Å, ReFlex3D: 2.79Å). Not unexpectedly, the individual values within the four classes (easy, moderate, hard and unfeasible) are quite diverse. Within the unfeasible class, there is hardly any mutual overlap between the compounds such that a shape based cross-alignment will not be able to reproduce crystal-structure like geometries. This is reflected in the high average RMSD value of Reflex3D (4.88Å) as well as in that of the initial Omega conformers (4.87Å). Compounds within the sets in the hard and moderate classes exhibit a larger mutual overlap and thus allow for a cross-alignment. However, as their average RMSD values are measured to be 2.74Å /3.05Å (Omega) and 2.73 Å/3.03Å (ReFlex3D) there is only very limited benefit from refining the initial Omega conformers within these sets. Finally within the easy sets, the larger overlap of a common scaffold within each set allows for very low RMSD values averaging to 1.41Å (Omega) and 1.37Å (ReFlex3D). It has to be noted that, although the average improvement using ReFlex3D compared to a rigid body overlay of Omega conformers does only lead to a marginal improvement of 0.04Å across all compounds, it translates into more than 60% of conformer geometries being improved over their initial Omega conformers. While the performance within the unfeasible class is worse (under 50%), ReFlex3D is capable of improving almost 65% within the easy and hard sets. Within the moderate sets, still more than half the structures get closer to their crystal structure geometries. We furthermore like to stress that these results were obtained by using only shape and color as components for the similarity contribution during refinement. As shown above, using conformation

ACS Paragon Plus Environment

Page 20 of 29

Page 21 of 29

T dependent AM1 charges instead of color yields a further improvement in the predictability of crystal structure geometries. Although this approach could not be tested for the cross-alignments due to limited computational capabilities, it is expected to improve the average RMSD by at least 0.1Å taking the improvement of using conformation dependent AM1 charges as shown above into account.

T combo T combo Figure 11 500 450 400 350

frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

300 250 200 150 100 50 0

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5

RMSD [Å] RMSD (min)

RMSD (mean)

RMSD (max)

Figure 11: Distribution of minimum, average and maximum RMSD values computed for all molecules using every other molecule within each set as template for an alignment using ReFlex3D. The optimal geometries after refinement exhibit RMSD values around 1.0 Å with only few exceeding 1.5 Å.

AlignScores In addition to RMSD measurements of individual compounds, we also used the AlignScore from the publication of Giangreco et. al to investigate the flexible alignment. The AlignScore provides a measure of how well a series of molecules can be fitted onto their template geometries based on the RMSD value computed for the heavy atoms of A and B. The threshold for the RMSD value is set to 2.0 Å and the

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

AlignScore is considered to indicate success if at least half the molecules within the set can be fitted with RMSD values below that threshold. As the flexible alignment within this study is a deterministic process, we only obtained one single best alignment for each pair of compounds. However, as each molecule within a set can be used as the template molecule for the alignment, the AlignScore can be computed as many times as there are molecules within the set. Therefore, we do not report individual AlignScores but their average values and extrema for each set. Consistent with the classification of Giangreco et. al.22, the AlignScore for the easy sets and initial positions taken from the crystal structures results in an overall success. There is only a single set, dihydrofolate reductase (P16184), for which not every template compound leads to a successful alignment. Also the average AlignScore for the moderate and hard sets indicate that a successful alignment is possible within each of these sets. However, the choice of a suitable template molecule is more pronounced compared to the easy sets. There are six out of the moderate and four out of the hard sets for which not every molecule represents a suitable reference geometry for the alignment. Due to the initial positions taken from the X-ray crystal structures, even some of the alignments within the unfeasible sets yield AlignScores in excess of 50%. However, only a single set (acetylcholinesterase, P04058) yields AlignScores consistently higher than 50%. While unexpected for a set classified unfeasible, these results are in line with those reported by Giangreco et. al. Within their studies, P04058 consistently lead to AlignScores of 43%, only slightly below the success criterion of 50% and resulting in a higher average value than other sets, e.g. proto-oncogene tyrosine-protein kinase Src (P00523, 41.8%) classified only as moderate. While the molecules within each of the easy sets comprise a relatively large common core and similar orientations of the molecules within their targets binding pockets, which greatly facilitates a ligand-based alignment between chemically diverse compounds, they get more diverse towards the moderate and hard sets. Unsurprisingly, the lack of similarity between the ligands within the unfeasible sets (e.g. P24627) renders a ligand-based alignment useless. When Omega conformers and corresponding overlays are used as initial geometries rather than the optimal crystal structures, the AlignScores are found to decrease as expected. Only two of the easy sets yield perfect scores independent of the template molecule chosen and four of them fail completely. The rate of failures increases through the other three classes. However, within the moderate sets, successful alignments can be obtained for at least half of the sets by selecting a suitable template molecule. Within the hard and unfeasible sets, there is at least one reference molecule, for which the AlignScore results in values below 50%, but 10 out of 18 hard sets and even one of the unfeasible sets yield success if the right template molecule is used. Overall, the AlignScores showed that ReFlex3D is capable of reproducing the bioactive conformations within most of the 121 set reasonably well.

Prediction of Binding Affinities The “end goal” of a previous study was to utilize molecular alignments for the construction of a predictive model to estimate binding affinities.27 The T combo values from overlays generated using Posit and measured with respect to 5-(4-Piperidyl)isoxazol-3-ol (4-PIOL) and the experimental binding

ACS Paragon Plus Environment

Page 22 of 29

Page 23 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

affinities (from NMR binding experiments) were calculated and the correlation was determined to yield an R = 0.58 . Because 4-PIOL was not the strongest binder, the calculations of T combo were repeated using the best binder, the thio-derivative of 4-PIOL as reference. However, the correlation in this case 2 dropped to R = 0.53 . 2

Using ReFlex3D to obtain refined alignments of the 16 compounds with 4-PIOL and its thio-derivative led to increased T combo values for all molecules. As well as the similarity measures increasing, the correlation of these similarities with the experimental binding affinities also improved. Using 4-PIOL as 2 reference, the correlation was determined to yield R = 0.65 (+0.07). Using the thio-compound, the 2

strongest binder within the series as reference, an even higher value of R = 0.74 (+0.21) could be obtained.

Discussion We have presented a method to align molecules flexibly with respect to a putative bioactive conformation of a reference compound. The alignment is guided by the maximization of the overlap between two molecules A and B with respect to their three dimensional shape, their chemical features and electrostatic potentials as well as the conformational energy of the align molecule. The method represents a local optimization method and was investigated in combination with several approaches to generate starting points for the optimization. Calculations starting from the optimally aligned structures showed that the parameterization of the algorithm provides a reasonable balance between the similarity with respect to the reference geometry and the conformational energy of the molecule B. The optimized alignments were found to be reasonably close to the crystallographic reference structures with (generally) RMSD values well below 0.5 Å. In a comparison with geometries generated by a conformer generator (Omega), the flexible alignment was found to improve on the measured RMSD values and the three dimensional similarities ( T shape , Tcol or , T combo , and T estat ). The latter findings in particular point out how important an accurate representation of the geometry is for assessing molecular similarities. In contrast to virtual screening approaches that frequently use the same conformer generator to obtain geometries for template and align molecules, the importance of an accurate representation of the bioactive conformation is shown by an increase in the similarity measures of 0.15 when using the flexible alignment. Hence, the refined geometries provide a much closer resemblance of the target structures compared to those obtained from conformer generators. A comparison of the measured RMSD values after the self-refinement using different sets of initial structures pointed out the importance of suitable starting geometries. Starting from the observed crystal structure geometries showed only minimal deviations between the optimized geometries and those used as a reference. Considering the uncertainties in the atomic positions resulting from limited experimental accuracy and those introduced by the later refinement, the alignments can be considered as nearly perfect. Without prior knowledge of the correct three dimensional structure of the align compound, the generation of starting structures proved to be more difficult. Conformer ensembles created with the Omega program were shown to contain suitable geometries in most cases. However, selection of the best overlay was challenging. A good compromise between computational demand and accuracy of the results was found by selecting the best conformers according the T combo and

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

performing the refinement on a small subset of the ensemble. An important benefit of using T combo for the selection is that this measure is independent of the similarity in the chemical connectivity between template and align molecule and thus, in contrast to RMSD values, universally applicable. The cross refinement lead to results in line with the classification of test sets provided by Giangreco et. al. Similar to other pharmacophore based alignment methods, the quality of the results was found to depend on a conserved common core and matching chemical functionalities across the compounds within the set. An advantage was found in the design of the scoring function, being solely based on

T combo and the conformational energy, making the algorithm insensitive to the exact chemical structure of the compounds under investigation. Within the current implementation, ReFlex3D provides the possibility to approximate electrostatic properties of the molecules under investigation either by color atoms or by atomic point charges according to one of the charge schemes available from the OEChem Toolkit. While using simple MMFF94 charges and the T estat instead of Tcol or worsens the overlap compared to a shape and color based approach, conformationally dependent AM1 charges (with bond charge correction) have been shown to improve the alignments both in terms of individual similarity measures (shape, color and estat) as well as reduced RMSD values with respect to the reference geometries. The required computational effort may currently be seen as prohibitively large. However, with the advent of new technologies like the cloud based solution Orion,35 the improved quality of the results from ReFlex3D might soon justify the additional computational demands. The benefits of refined alignments have been shown in a comparison with a previously published study. A predictive model based on similarities computed for the overlays from ReFlex3D was shown to be more accurate compared to the previously employed tool, Posit. The improved correlations between the determined similarities and the experimental binding data point out the importance of using the best available template structure, the thio compound, treating the molecule in a flexible way and extending the shape based similarity by a similarity measure for the electrostatic properties of the molecules.

Conclusion A flexible alignment algorithm was introduced which uses the overlap in shape and color to align align molecules completely flexibly with a given reference structure. To maintain physically reasonable geometries, with respect to bond distances and angles, the optimization function also includes energy terms from the MMFF94 forcefield that constrain the alignment to geometries of low energy conformations. There are three advantages of the current approach: i) the independence of a receptor model ii) the possibility of aligning molecules based on structural similarity (shape/color) even when they lack a 1:1 mapping of atoms/substructures iii) a completely flexible treatment of the molecule governed by computationally inexpensive force field terms as compared to methods using pre-calculated minimum conformations. The advantage of a flexible treatment of molecules was demonstrated by the comparison against conformer ensembles generated with the Omega conformer generator and minimum energy geometries determined using the MMFF94 forcefield. The increase in similarity measures and the decrease in RMSD values with respect to the reference geometry show that a flexible treatment of molecules can drastically

ACS Paragon Plus Environment

Page 24 of 29

Page 25 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

improve the quality of measuring similarities between molecules. The 15% increase in T combo for identical molecules shows the more accurate identification of compounds while the increase in the spread of T combo values within a set allows for a finer distinction between compounds. ReFlex3D is a local optimizer and thus depends on the initial geometries provided. Using exhaustive conformer ensembles obtained from the Omega conformer generator results in excessively large computational demands while using only the best initial geometry did not always lead to the optimal alignment. A reasonable compromise could be achieved by using a reduced ensemble of up to ten geometries selected via the overlap between the template and the aligned compound. When using similarity measures to predict the binding affinity of anti-fibrinolytic compounds, the optimized alignments obtained with ReFlex3D correlated much more strongly with the experimental binding affinities than the previously investigated alignments based on precomputed conformer ensembles. While approximating the electrostatic similarity by the use of color atoms already showed greatly improved alignments, using conformationally dependent AM1 charges gave another improvement of 0.05 Å in RMSD, albeit at a significantly higher computational demand. However, using massively parallelized architectures or cloud based solutions such as Orion, the benefit of increased accuracy in the alignment obtained with ReFlex3D might soon justify the computational effort. ReFlex3D was developed as a python script within the OEChem Toolkit and the code is available for download. It requires only a minimum of parameters which are universal and independent of the molecules under investigation. Users are encouraged to implement ReFlex3D as part of their workflows to benefit from its improved alignment capabilities.

Additional Information The code for the flexible alignment as well as the script for selecting initial overlays from an exhaustive conformer ensemble is available at https://github.com/OpenEye-Contrib.

Author Information Corresponding Author E-mail: [email protected] ORCID 0000-0001-5728-0160 Author Contributions The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. Notes The authors declare no competing financial interest.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ACS Paragon Plus Environment

Page 26 of 29

Page 27 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

References (1) (2)

(3) (4) (5)

(6)

(7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24)

Cole, J.; Davis, E.; Jones, G.; Sage, C. R. Molecular Docking—A Solved Problem? In Reference Module in Chemistry, Molecular Sciences and Chemical Engineering; Elsevier, 2017. Cappel, D.; Dixon, S. L.; Sherman, W.; Duan, J. Exploring Conformational Search Protocols for Ligand-Based Virtual Screening and 3-D QSAR Modeling. J. Comput. Aided Mol. Des. 2015, 29 (2), 165–182. Lemmen, C.; Zimmermann, M.; Lengauer, T. Multiple Molecular Superpositioning as an Effective Tool for Virtual Database Screening. Perspect. Drug Discov. Des. 2000, 20 (1), 43–62. Cosgrove, D. A.; Bayada, D. M.; Johnson, A. P. A Novel Method of Aligning Molecules by Local Surface Shape Similarity. J. Comput. Aided Mol. Des. 2000, 14 (6), 573–591. Kearsley, S. K.; Smith, G. M. An Alternative Method for the Alignment of Molecular Structures: Maximizing Electrostatic and Steric Overlap. Tetrahedron Comput. Methodol. 1990, 3 (6, Part C), 615–633. Pitman, M. C.; Huber, W. K.; Horn, H.; Krämer, A.; Rice, J. E.; Swope, W. C. FLASHFLOOD: A 3D FieldBased Similarity Search and Alignment Method for Flexible Molecules. J. Comput. Aided Mol. Des. 2001, 15 (7), 587–612. Lemmen, C.; Lengauer, T.; Klebe, G. FlexS:  A Method for Fast Flexible Ligand Superposition. J. Med. Chem. 1998, 41 (23), 4502–4520. Wang, T.; Zhou, J. EMCSS:  A New Method for Maximal Common Substructure Search. J. Chem. Inf. Comput. Sci. 1997, 37 (5), 828–834. Allen, W. J.; Rizzo, R. C. Implementation of the Hungarian Algorithm to Account for Ligand Symmetry and Similarity in Structure-Based Design. J. Chem. Inf. Model. 2014, 54 (2), 518–529. Munkres, J. Algorithms for the Assignment and Transportation Problems. J. Soc. Ind. Appl. Math. 1957, 5 (1), 32–38. Miller, M.D.; Sheridan, R.P.; Kearsley, S.K., SQ: A Program for Rapidly Producing Pharmacophorically Relevent Molecular Superpositions. J. Med. Chem. 1999, 42, 1505-1514. ROCS; OpenEye Scientific Software: Santa Fe, NM. EON; OpenEye Scientific Software: Santa Fe, NM. CORINA; Molecular Networks GmbH: Germany and Altamira, LLC: USA. Omega; Openeye Scientific Software, Inc.: Santa Fe, NM, 2004. Boström, J.; Norrby, P.-O.; Liljefors, T. Conformational Energy Penalties of Protein-Bound Ligands. J. Comput. Aided Mol. Des. 2016, 12 (4), 383–383. Hawkins, P. C. D.; Nicholls, A. Conformer Generation with OMEGA: Learning from the Data Set and the Analysis of Failures. J. Chem. Inf. Model. 2012, 52 (11), 2919–2936. Boström, J.; Greenwood, J. R.; Gottfries, J. Assessing the Performance of OMEGA with Respect to Retrieving Bioactive Conformations. J. Mol. Graph. Model. 2003, 21 (5), 449–462. Kramer, B.; Rarey, M.; Lengauer, T. Evaluation of the FLEXX Incremental Construction Algorithm for Protein–ligand Docking. Proteins Struct. Funct. Bioinforma. 1999, 37 (2), 228–241. OEChem Toolkit; Openeye Scientific Software, Inc.: Santa Fe, NM. Giangreco, I.; Cosgrove, D. A.; Packer, M. J. An Extensive and Diverse Set of Molecular Overlays for the Validation of Pharmacophore Programs. J. Chem. Inf. Model. 2013, 53 (4), 852–866. Giangreco, I.; Olsson, T. S. G.; Cole, J. C.; Packer, M. J. Assessment of a Cambridge Structural Database-Driven Overlay Program. J. Chem. Inf. Model. 2014, 54 (11), 3091–3098. POSIT; Openeye Scientific Software, Inc.: Santa Fe, NM. Kelley, B. P.; Brown, S. P.; Warren, G. L.; Muchmore, S. W. POSIT: Flexible Shape-Guided Docking For Pose Prediction. J. Chem. Inf. Model. 2015,55 (8), 1771–1780.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(25) Color Atoms https://docs.eyesopen.com/toolkits/python/shapetk/shape_examples.html#coloroverlap (accessed Aug 29, 2017). (26)

The Way of ZAP https://docs.eyesopen.com/toolkits/python/zaptk/thewayofzap.html (accessed

Aug 29, 2017). (27) Schmidt, T. C.; Eriksson, P.-O.; Gustafsson, D.; Cosgrove, D.; Frølund, B.; Boström, J. Discovery and Evaluation of Anti-Fibrinolytic Plasmin Inhibitors Derived from 5-(4-Piperidyl)isoxazol-3-Ol (4-PIOL). J. Chem. Inf. Model. 2017, 57 (7), 1703–1714. (28)

Jones, E.; Oliphant, E.; Peterson, P. SciPy: Open Source Scientific Tools for Python; 2001.

(29) Halgren, T. A. Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. J. Comput. Chem. 1996, 17 (5-6), 490–519. (30) Jakalian, A.; Bush, B. L.; Jack, D. B.; Bayly, C. I. Fast, Efficient Generation of High-Quality Atomic Charges. AM1-BCC Model: I. Method. J. Comput. Chem. 2000, 21 (2), 132–146. (31) Jakalian, A.; Jack, D. B.; Bayly, C. I. Fast, Efficient Generation of High-Quality Atomic Charges. AM1BCC Model: II. Parameterization and Validation. J. Comput. Chem. 2002, 23 (16), 1623–1641. (32) SciPy minimize https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html. (33) A. Grant, A.; Pickup,B. T. .A Gaussian Description of Molecular Shape. J. Phys. Chem., 1995, 99 (11), 3503–3510. (34) Cohen’s d: https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d. (35) ORION; Openeye Scientific Software, Inc.: Santa Fe, NM.

ACS Paragon Plus Environment

Page 28 of 29

Page 29 of 29 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

For Table of Contents Only

ACS Paragon Plus Environment