Article
xMaP – An interpretable alignment-free 4D-QSAR technique based on molecular surface properties and conformer ensembles
Jan Dreher, Josef Scheiber, Nikolaus Stiefl, and Knut Baumann
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00419 • Publication Date (Web): 27 Nov 2017
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Journal of Chemical Information and Modeling
xMaP – An Interpretable Alignment-free 4D-QSAR Technique Based on Molecular Surface Properties and Conformer Ensembles
Jan Dreher‡, Josef Scheiber#, Nikolaus Stiefl+ and Knut Baumann*
Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, D 38106 Braunschweig, Germany
‡ present address: Bayer AG, Wuppertal, Germany
+ present address: Novartis Institutes for BioMedical Research, Basle, Switzerland
# present address: BioVariance GmbH, Waldsassen, Germany
* Corresponding author: Knut Baumann Tel ++49 531 391-2750 Email:
[email protected] ACS Paragon Plus Environment
ABSTRACT
A novel alignment-free molecular descriptor called xMaP (flexible MaP descriptor) is introduced. The descriptor extends the previously published translationally and rotationally invariant 3D-descriptor MaP (Mapping Property distributions onto the molecular surface) to the 4th dimension. Unlike MaP, xMaP is independent of the chosen starting conformation of the encoded molecules and is therefore entirely alignment-free. This is achieved by using ensembles of conformers, which are generated by conformational searches. This step of the procedure is similar to Hopfinger’s 4D-QSAR. A five-step procedure is used to compute the xMaP descriptor. First, a conformational search is carried out for each molecule. Second, for each of the conformers an approximation to the molecular surface with equally distributed surface points is computed. Third, molecular properties are projected onto this surface. Fourth, areas of identical properties are clustered into so-called patches. Fifth, the spatial distribution of the patches is converted into an alignment-free descriptor that is based on the entire conformer ensemble. The resulting descriptor can be interpreted by superimposing the most important descriptor variables and the molecules of the dataset. The most important descriptor variables are identified with chemometric regression tools. The novel descriptor was applied to several benchmark datasets and was compared to other descriptors and QSAR techniques comprising a binary fingerprint, a topological pharmacophore descriptor (Cats2D), and the field-based 3D-QSAR technique GRID/PLS, which is alignment-dependent. The use of conformer ensembles renders xMaP very robust. It turns out that xMaP performs very well on (almost) all datasets and that the statistical results are comparable to GRID/PLS. In addition, xMaP can be used to efficiently visualize the derived quantitative structure-activity relationships.
KEYWORDS 4D-QSAR, alignment-free, validation
INTRODUCTION
Structure-activity correlation techniques need to represent molecules numerically. Over the past decades, a considerable number of molecular descriptors have been developed. Programs such as Dragon1,2 or VCCLab3 compute up to 5000 descriptors per molecule. These descriptors can be distinguished by the way they represent molecules. One-dimensional (1D) descriptors include bulk parameters as well as physicochemical properties (e.g. log P, molecular volume). Two-dimensional (2D) techniques use information from the molecular graph. Three-dimensional (3D) descriptors are based on molecular geometry (i.e. the Cartesian coordinates of the molecule’s atoms) and can roughly be divided into two groups: grid-based techniques such as CoMFA4, CoMSIA5, GRID6 or Continuous Molecular Fields7, which transforms grids into continuous functions, and distance-based techniques such as MaP8,9, the Fuzzy Pharmacophores10, and GRIND11. The two types use different frames of reference. In the grid-based techniques, changes of the molecule’s position in space cause changes in the descriptor (external frame of reference). The position of the molecule in space is irrelevant if distances between molecular features are used to characterize it (internal frame of reference), since distances between two or more molecular features of a single conformer (i.e. a rigid object) do not change when the object is moved in space (translational and rotational invariance). If an external frame of reference is used, the molecules need to be aligned in space to derive meaningful descriptors. This step is not necessary for distance-based techniques, which thus avoid potential bias caused by the chosen alignment rule. The downside of this approach is a more difficult visualization and interpretation of the resulting model. However, for several newer distance-based approaches this problem has been solved8–11. In 3D-QSAR, only a single conformer per molecule is analyzed.
Hence, an appropriate conformer for each molecule in the dataset needs to be chosen. Since most biologically active molecules are more or less flexible, this selection step represents another potential source of bias. Moreover, with a single conformer 3D-QSAR methods cannot represent molecular flexibility. This was realized early on by Hopfinger and coworkers, who introduced the first 4D-QSAR technique, which encodes molecular
geometry plus molecular flexibility12. Numerous publications demonstrate the power of this particular 4D-QSAR technique13–17. Briefly, Hopfinger’s 4D-QSAR uses a grid-based method (external frame of reference) and requires an alignment of the molecules prior to analysis. It uses large, homogeneous conformer ensembles generated by molecular dynamics simulations. Because it is a grid-based method, interpretation of the 4D-QSAR models is straightforward. Similar to Hopfinger’s 4D-QSAR, the newly introduced xMaP approach is also based on conformer ensembles to reflect molecular flexibility. As opposed to Hopfinger’s 4D-QSAR, it uses an internal frame of reference and thus needs no alignment. The conformers are generated by conformational searching and cover a larger conformational space. Overall, xMaP is quite different in nature, which results in different strengths (no alignment). Dobler and Vedani extended receptor-surface models18 to encode molecular flexibility19,20. This technique has also been extended to the 5th and 6th dimensions, representing target flexibility and different solvation states21,22. It matured over the years and is broadly applicable. However, as a receptor-surface model, the technique is also alignment-dependent. Another alignment-dependent multiconformational method related to the receptor-surface models was published recently23–25. Similar to Hopfinger’s 4D-QSAR, homogeneous conformer ensembles are employed here. Self-organizing maps (SOM) have also been used successfully for 4D-QSAR26–29. Here, the coordinates and the partial charges of the conformational ensemble of each molecule are used to construct a 2D-SOM, which is then used as input for PLS regression with variable selection. Again, this method requires the alignment of the molecules since the SOM construction uses an external frame of reference. LQTA-QSAR in essence extended CoMFA to the 4th dimension and is also alignment-dependent30,31.
It uses molecular dynamics simulations to generate the conformational ensemble and a grid-based approach with different probes to generate CoMFA-like descriptors. An enormous number of descriptors (several tens of thousands) is generated, so that variable filtering and variable selection become very important32. An alignment-independent approach that uses multiple conformers was developed by Kuz’min and colleagues33–35. Here, a set of different simplexes (predefined tetratomic structural fragments with fixed composition, chirality and symmetry) is used to encode the conformers. This approach to 4D-QSAR
does not take long-range distances into account because simplexes consist of one sphere of neighboring atoms. Moreover, it results in a huge number of variables. Another 4D-QSAR method extends the Electron-Conformational (EC) method36, which is based on quantum chemistry, to conformational ensembles37 and is also alignment-independent. Since the size of the so-called electron-conformational matrices of conjunction (ECMC) depends on the number of atoms, the method first performs a pharmacophore elucidation to identify identically composed fragments of constant size (so-called electron conformational sub-matrices of activity) to enable QSAR modelling. The method is thus restricted to data sets of closely related analogs with a common scaffold. A recent approach also successfully uses molecular dynamics to characterize the conformational space38. It uses alignment-independent WHIM descriptors to encode the molecules39. While these descriptors holistically characterize each conformer, they cannot easily be back-projected onto the molecule.

The article is organized as follows. In the next section, the method is described in detail and compared further to other approaches. In the results part of the study, the novel molecular structure descriptor is benchmarked against alignment-independent 2D-QSAR methods and alignment-dependent 3D-QSAR methods using twelve well-established datasets, eight of which originate from Sutherland et al.40. 2D descriptors are included as simple approaches to check whether the more sophisticated methods offer a real benefit in terms of predictivity. In addition to the benchmark study, xMaP is studied in detail for acetylcholinesterase inhibitors.
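The distinction between an external and an internal frame of reference drawn in the Introduction can be illustrated with a small numerical sketch. The Python snippet below is not part of xMaP; the feature coordinates, the function names, and the chosen rigid motion are all hypothetical and serve only to show that Cartesian coordinates change when a rigid conformer is moved, whereas the set of inter-feature distances does not.

```python
import math

def pairwise_distances(points):
    """Sorted list of all inter-point distances (internal frame of reference)."""
    return sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

def rotate_z_and_translate(points, angle, shift):
    """Rigid motion: rotation about the z-axis followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

# Hypothetical feature positions (in Å) of one rigid conformer; not from the paper.
conformer = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.5), (-1.0, 3.0, 1.0)]
moved = rotate_z_and_translate(conformer, angle=0.7, shift=(4.0, -2.0, 9.0))

# Coordinates (external frame) change under the rigid motion ...
coords_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(conformer, moved)
    for a, b in zip(p, q)
)
# ... while the inter-feature distances (internal frame) are preserved.
distances_equal = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(conformer), pairwise_distances(moved))
)
print(coords_changed, distances_equal)
```

A descriptor built only from such distances therefore needs no alignment for a single rigid conformer. Different conformers of the same molecule still yield different distance sets, which is the situation that conformer ensembles are meant to address.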
MATERIALS AND METHODS
xMaP - overview. The initial step of the newly developed technique is the computation of conformer ensembles for each molecule in the dataset by a conformational search procedure. If the 3D structures within one ensemble form more than one cluster, only the largest one is kept (this will be referred to as harmonization of a conformer ensemble). That is, only a single binding mode is modeled, which is chosen to be the energetically most likely one. Afterwards, a discretized molecular surface is computed for each of the remaining conformers. Each of the surface points is assigned a hydrophobicity
property (strongly hydrophilic, weakly hydrophilic, strongly hydrophobic, weakly hydrophobic). In addition, surface points can be assigned an H-bonding property (H-bond donor, H-bond acceptor) if applicable. Next, the large number of surface points is reduced to surface patches by clustering areas of identical properties. These surface patches are represented by their centers of mass and surface size. They are then used to generate potential 2-point-pharmacophores. These are characterized by a particular patch property combination and the distance between the centers of mass of the two patches. The resolution on the distance axis is 1 Å. Put differently, each vector entry stores the occurrence of a particular 2-point-pharmacophore for a range spanning 1 Å (e.g. 7.0Å≤x