Chapter 11


Generative Topographic Mapping Approach to Chemical Space Analysis

Héléna A. Gaspar,1 Pavel Sidorov,1,3 Dragos Horvath,1 Igor I. Baskin,2,3 Gilles Marcou,1 and Alexandre Varnek1,3,*

1Laboratoire de Chemoinformatique, UMR 7140, Université de Strasbourg, 1 rue Blaise Pascal, Strasbourg 67000, France
2Faculty of Physics, M.V. Lomonosov Moscow State University, Leninskie Gory, Moscow 119991, Russia
3Laboratory of Chemoinformatics, Butlerov Institute of Chemistry, Kazan Federal University, Kazan, Russia
*E-mail: [email protected]

This chapter describes Generative Topographic Mapping (GTM), a dimensionality reduction method that can be used for data visualization, clustering, and modeling. GTM is a probabilistic extension of Kohonen maps. Its probabilistic nature can be exploited to build regression or classification models, to define their applicability domain, to predict activity profiles of compounds, to compare large datasets, to screen for compounds of interest, and even to identify new molecules possessing desirable properties. Thus, GTM can be seen as a multi-purpose Swiss Army knife, each of its blades able to shape an answer to a specific chemoinformatics question, based on a unique map.

1. Generative Topographic Mapping: Principles

Generative Topographic Mapping (GTM), introduced by Bishop et al. (1, 2) in the 1990s, performs a dimensionality reduction (3–5) from the initial D-dimensional data space to a 2-dimensional latent space (the "GTM" map) by embedding a 2-dimensional non-linear manifold into the data space (Figure 1). GTMs may be conceptually assimilated to fuzzy logic-driven Kohonen (6, 7) Self-Organizing Maps (SOMs).


Figure 1. Each node xk in the latent space (red point on the grid) is mapped to the corresponding manifold point yk in the initial data space by the non-linear mapping function y(x;W).

The GTM algorithm involves the following steps. A regular square grid of K nodes covering the 2-dimensional latent space is generated, where K = l² for some small integer l, the grid width expressed in number of nodes. A node k is defined by its integer 2D coordinates xk = (lx, ly), with index k = lx × l + ly, where lx, ly = 0 ... l − 1, so that k = 0 ... K − 1. Each node is mapped to a manifold point yk embedded in the D-dimensional data space, xk → yk, using the non-linear mapping function y(x;W) that maps points from the two-dimensional latent space into the D-dimensional data space:

$$Y = \Phi W^{\top}$$

where Y is the K × D matrix of manifold points yk, W is the D × M parameter matrix, and Φ is the K × M radial basis function matrix with M RBF centers μm:

$$\Phi_{km} = \exp\left(-\frac{\|x_k - \mu_m\|^2}{2\sigma^2}\right)$$

σ² is a parameter that can be initialized as the average squared Euclidean distance between two RBF centers, multiplied by a tunable factor w. The parameter matrix W can be initialized so as to minimize the sum-of-squares error between the mapped latent points and their equivalent positions under a linear Principal Component Analysis (8) (PCA) mapping hypothesis, which is the default. The points yk on the manifold are the centers of normal probability distributions (NPDs) of t:

$$p(t \mid x_k, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\,\|y_k - t\|^2\right)$$

where tn is a data instance and β the common inverse variance of these distributions. The ensemble of N data instances (in cheminformatics, N molecules) spans the relevant zone of the problem space to be mapped. Molecules are represented by their molecular descriptor vectors tn, n = 1 ... N, which define a "frame" within which the map is positioned, and will therefore be termed "frame set" in this chapter. As the manifold grid points are positioned in the neighborhood of frame set points, by optimizing the below-given log likelihood function, mapping


is bound to the initial-space zone covered by the frame set, which will be folded into the delimited square grid of K nodes. Although the manifold may describe an infinite hypersurface in the initial space, its extrapolation far away from the frame set-covered zone makes little practical sense. "Exotic" compounds outside the frame zone will be rendered within the bounds of the square grid, but such a mapping relies on extremely low probability densities, and these compounds will be recognized as being outside the applicability domain of the map. Therefore, the proper choice of frame compounds is a key prerequisite in GTM design; these compounds may, but need not, coincide with the actual compound collections targeted by the GTM-based study. An optimal GTM corresponds to the highest log likelihood, taken over all frame instances n, optimized by expectation-maximization (EM):

$$\mathcal{L}(W, \beta) = \sum_{n=1}^{N} \ln \left[ \frac{1}{K} \sum_{k=1}^{K} p(t_n \mid x_k, W, \beta) \right]$$

In the expectation (E) step of the EM algorithm, the responsibility, or posterior probability, that a point tn in the data space is generated from the kth node is computed based on the current β and W using Bayes' theorem:

$$R_{kn} = p(x_k \mid t_n) = \frac{p(t_n \mid x_k, W, \beta)}{\sum_{k'=1}^{K} p(t_n \mid x_{k'}, W, \beta)}$$

In the maximization (M) step, β and W are recomputed using the current responsibilities:

$$\left(\Phi^{\top} G \Phi + \lambda I\right) W^{\top} = \Phi^{\top} R T$$

$$\frac{1}{\beta} = \frac{1}{ND} \sum_{n=1}^{N} \sum_{k=1}^{K} R_{kn}\, \|y_k - t_n\|^2$$

where I is the M × M identity matrix, G is a K × K diagonal matrix with elements Gkk = ∑n Rkn, R is the K × N matrix of responsibilities, and T is the N × D matrix of points t in the initial data space, N being the number of points in the training set. E- and M-steps of the algorithm alternate until convergence. Four parameters are tunable by the user: the number of nodes K, the number of RBFs M, the RBF width multiplication factor w, and the weight regularization coefficient λ; the last two parameters can be used to avoid overfitting and to tune the stiffness of the manifold.
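To make the E- and M-step equations concrete, here is a minimal NumPy sketch of the training loop, under stated assumptions: W is stored as an M × D array (the transpose of the D × M convention above, so that Y = ΦW), initialization is random rather than PCA-based for brevity, and all names are ours rather than any published implementation's.

```python
import numpy as np

def gtm_fit(T, l=10, m=4, w=1.0, lam=0.1, n_iter=50, seed=0):
    """Fit a GTM to the data matrix T of shape (N, D) by EM."""
    N, D = T.shape
    K, M = l * l, m * m
    # Square grid of K latent nodes and M RBF centers, both spanning [-1, 1]^2
    g = np.linspace(-1.0, 1.0, l)
    X = np.stack(np.meshgrid(g, g), axis=-1).reshape(K, 2)
    c = np.linspace(-1.0, 1.0, m)
    mu = np.stack(np.meshgrid(c, c), axis=-1).reshape(M, 2)
    # sigma^2: average squared distance between RBF centers, times the factor w
    d2c = ((mu[:, None] - mu[None, :]) ** 2).sum(-1)
    sigma2 = w * d2c[d2c > 0].mean()
    Phi = np.exp(-((X[:, None] - mu[None, :]) ** 2).sum(-1) / (2.0 * sigma2))  # (K, M)
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(M, D))   # random init; the chapter uses PCA
    beta = 1.0 / T.var()
    for _ in range(n_iter):
        Y = Phi @ W                                       # (K, D) manifold points
        d2 = ((Y[:, None] - T[None, :]) ** 2).sum(-1)     # (K, N) squared distances
        # E-step: responsibilities via Bayes' theorem, in log space for stability
        logp = -0.5 * beta * d2
        R = np.exp(logp - logp.max(axis=0))
        R /= R.sum(axis=0)
        # M-step: solve (Phi^T G Phi + lam I) W = Phi^T R T, then update beta
        G = R.sum(axis=1)                                 # diagonal of G
        W = np.linalg.solve(Phi.T @ (G[:, None] * Phi) + lam * np.eye(M),
                            Phi.T @ (R @ T))
        d2 = (((Phi @ W)[:, None] - T[None, :]) ** 2).sum(-1)
        beta = N * D / (R * d2).sum()
    return X, Phi, W, beta, R
```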

The responsibilities Rkn are used to compute the mean (real-valued) position of a molecule on the map, s(tn), by averaging over all nodes with responsibilities as weighting factors:

$$s(t_n) = \sum_{k=1}^{K} R_{kn}\, x_k$$


Each point on the GTM thus corresponds to the averaged position of one molecule. This step completes the mapping by reducing the responsibility vector to a plain set of 2D coordinates, defining the position of the projection point of the initial D-dimensional vector on the map plane.
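Continuing the sketch above (our naming), all N mean positions follow from one matrix product, since each column of the responsibility matrix R sums to one:

```python
S = R.T @ X    # (N, 2); row n is s(t_n) = sum_k R_kn * x_k
```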


1.1. Incremental GTM

The incremental version of the GTM algorithm (iGTM) can be used to build maps based on arbitrarily large frame sets, up to millions of compounds (9). Instead of updating the model with the entire data matrix, iGTM divides the data into blocks and updates the model block by block until convergence of the log likelihood function. The published technical details (9) of the algorithm will not be repeated here; although practically important in terms of enabling the otherwise unfeasible processing of large frame sets, iGTM does not produce any special category of maps. GTMs are indistinguishable with respect to the algorithm (iGTM or classical) used to build them, as far as this could be proven with data sets small enough to be processed by the classical approach. Note that iGTM is indispensable for building a map with a very large frame set, but is not required to project large compound sets onto an already generated GTM manifold. While map building requires simultaneous access to all frame compound descriptors (thus causing memory problems), a posteriori compound projection onto an existing manifold is an independent operation involving only the concerned compound, and may therefore be easily broken up into individual jobs to be dispatched on computer clusters/grids.
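The following sketch is not the published iGTM implementation, merely an illustration of why block-wise updating works: the M-step statistics ΦᵀGΦ and ΦᵀRT are plain sums over molecules, so the data matrix can be streamed in chunks that fit in memory (names and the β shortcut below are our assumptions).

```python
import numpy as np

def igtm_m_step(T, Phi, W, beta, lam, block_size=10000):
    """One M-step with streamed E-steps: accumulate Phi^T G Phi and Phi^T R T
    block by block instead of holding the full (K, N) responsibility matrix."""
    M, D, N = Phi.shape[1], W.shape[1], T.shape[0]
    A = np.zeros((M, M))      # accumulates Phi^T G Phi
    B = np.zeros((M, D))      # accumulates Phi^T R T
    ssq = 0.0
    Y = Phi @ W
    for start in range(0, N, block_size):
        T_b = T[start:start + block_size]
        d2 = ((Y[:, None] - T_b[None, :]) ** 2).sum(-1)   # (K, n_block)
        logp = -0.5 * beta * d2
        R_b = np.exp(logp - logp.max(axis=0))
        R_b /= R_b.sum(axis=0)                            # E-step for this block
        A += Phi.T @ (R_b.sum(axis=1)[:, None] * Phi)
        B += Phi.T @ (R_b @ T_b)
        ssq += (R_b * d2).sum()
    W_new = np.linalg.solve(A + lam * np.eye(M), B)
    beta_new = N * D / ssq    # shortcut: distances to the pre-update manifold
    return W_new, beta_new
```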

2. The Two Representation Levels of a GTM Model

The main advantage of GTM over other dimensionality reduction/mapping techniques is its two-stage approach to dimensionality reduction:

• first, a mapping from the original, arbitrarily large D-dimensional descriptor space to the "standardized", common K-dimensional responsibility vector space (responsibility level);
• second, a reduction from the responsibilities to 2D positions on the map (2D level).

The 2D level is, however, not the most interesting one. Indeed, the second step is essential for completing the 2D mapping procedure, rendering the final projections and producing the expected 2D scatterplots. However, the reader will understand that, unlike in geography, where dimensionality reduction from three-dimensional space into 2D causes an important but nevertheless acceptable loss of information, a 2D map of very high-dimensional spaces will be inherently imprecise, irrespective of the linear or non-linear mapping strategy, to the point of not being of great use. Molecular structures cannot be robustly characterized by two real numbers only, no matter how much sophistication and effort is invested into defining those numbers.


The full advantages of GTM are apparent at the responsibility level, at (usually) intermediate K dimensionality (2 ≪ K < D). "Usually", because, first of all, this is by no means a technical constraint: it is perfectly possible (10) to have K ≥ D. Furthermore, it is well known that the actual number D of elements of the molecular descriptor vector represents the effective dimensionality of the descriptor space only if the vector elements are completely uncorrelated, which is rarely the case in chemoinformatics, unless measures in this sense are explicitly taken. K is a user-tunable parameter, which should be chosen so as to avoid massive loss of chemically relevant information, all while "flattening out" noise due to less relevant descriptor components. The responsibility vector has the property of being bound to a square grid, a common reference system that may be visually rendered in spite of its still high dimensionality K. A molecule characterized by its rn vector can be visualized by the pattern of grid nodes that it "highlights", i.e., those with respect to which its responsibility values are significant. Graphically, highlighted nodes may be rendered as bubbles, letting a molecule appear as a readily recognizable pattern of bubbles of varying sizes or color intensity (see Figure 2) on the 2D lattice.

Figure 2. Intuitive rendering of a molecule as the bubble pattern of its responsibility vector on a GTM. The crosshair at the barycenter of the bubble pattern defines the point position of the molecule at the 2D representation level.

Similar molecules have similar bubble patterns, and the human eye may easily grasp the similarity of bubble patterns. Further reduction of the molecule object to a single point of 2D coordinates (the crosshair in Figure 2), which is precisely the barycenter of the bubble pattern, may represent a drastic loss of information unless one single node accounts for the entire density distribution. Otherwise, bubble patterns may happen to share the same barycenter while not resembling each other at all in terms of shape. Hence, the intuitive bubble pattern representation, typical of the responsibility representation level, is a key advantage of the GTM approach over mapping algorithms that only address the information-depleted 2D representation level.
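A bubble pattern of the kind shown in Figure 2 takes only a few lines to render; the sketch below (our naming; matplotlib assumed) scales bubble areas by responsibility and marks the barycenter with a crosshair.

```python
import matplotlib.pyplot as plt

def bubble_plot(X, r, ax=None):
    """Render one molecule's responsibility vector r (length K) as bubbles
    on the node grid X (K, 2), with its barycenter crosshair."""
    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], s=2000.0 * r, alpha=0.5)  # area ~ responsibility
    sx, sy = r @ X                                         # barycenter = 2D position
    ax.plot(sx, sy, marker="+", markersize=15, color="black")
    ax.set_xticks([]); ax.set_yticks([])
    return ax
```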


The key difference between Kohonen maps and GTM resides precisely in the latter's fuzzy compound-to-node assignment protocol. This may seem like a minor enhancement, but it is actually a key factor. Its direct consequence is that, at the same grid size K, the volume of chemical information that can be monitored by a GTM is incomparably larger. In Kohonen maps, a compound is unambiguously assigned to a node and, eventually, compounds within a node are indistinguishable. True, the quantization error parameter describes how well the molecule fits in its node: some are closer to the node center (its "code vector") than others. But quantization errors still may not be used to discriminate between molecules within the same node: two molecules that are both remote from the code vector may or (more likely) may not be similar to each other. The only way to escape this compression artifact is to allow for grid sizes sufficiently large to avoid forcing unrelated compounds into the same node. A Kohonen grid of K nodes may at best distinguish between K different core structural motifs (in the broad, context-specific meaning of this term: chemotype, scaffold, pharmacophore, etc.). But the actual, appropriate number of relevant distinct core motifs in any given problem is a priori unknown, and potentially huge if one were to consider mapping the entire relevant drug space. However, in the GTM formalism, molecules are not necessarily "swallowed" by a single node, and the total number of distinct structural motifs that can be distinguished is defined, intuitively, by the number of distinct bubble plots that may be drawn with the help of a K-sized grid, or, technically, by the phase space volume spanned by the K-dimensional responsibility vectors. Kohonen maps operate only with pure states; GTMs, by contrast, operate with mixed states, and the latter come in virtually infinite numbers (not all of them corresponding to real compounds or common core motifs). Furthermore, as will be shown later on, the responsibility representation level is extremely useful for representing compound collections as cumulated responsibility vectors of their constituent compounds, i.e., cumulative compound bubble plots. Beyond the intuitiveness of bubble plots, responsibility vectors are full-blown, competent molecular descriptors in their own right; they can be used to quantitatively estimate the degree of similarity of two bubble plots, representing two molecules or two arbitrarily large compound libraries. Also, they form the foundation of predictive modeling using GTMs: based on responsibilities, measured properties of training set molecules may be extrapolated to the nearest (top-responsibility) nodes of the lattice, using various mathematical formalisms. Then, any external molecule positioned on the map may, on the basis of its responsibility values, be assigned predicted property values, by reverse extrapolation from the previously established node-bound property values of its nearest nodes. This applies to both continuous and categorical properties, leading to GTM-driven regression and classification models, including GTM-specific applicability domain control tools, as dedicated subchapters will show. Since quantitative modeling and quantitative validation of these models are possible at the responsibility representation level, this quantitative validation may serve as an objective criterion of mapping quality. Pure visualization tools lack such objective estimators of map quality.
GTM 2D projections may, indeed, fail to return the intuitive, readable 2D plots matching the unrealistically high expectations of users. However, even an inherently imperfect


2D representation of a chemical compound collection may be deemed the best (least imperfect) possible representation if one can show that the underlying responsibility level hosts a strongly predictive model of the molecular properties under study. The responsibility level may thus indirectly serve as an "alibi" for the information-depleted 2D level. It may help to answer the question of maximizing the generality of chemical space maps, in the quest for "universal" approaches able to cover highly diverse chemical space zones, all while supporting prediction of a maximum of diverse and unrelated chemical properties.


3. GTM Applications: The Outline

The rest of this work will cover various applications of the GTM methodology, introducing the underlying algorithms and reporting published examples. Since, as mentioned above, quantitative QSAR/QSPR (11) modeling (Quantitative Structure-Activity/Property Relationships) supported by GTM plays a paramount role in validating and assessing map quality, the next three subchapters (4–6) will be dedicated to it. The first two cover classification and regression models, respectively. Subchapter 6 deals with novel ways to delimit the applicability domain of QSAR models, which naturally emerge from the chemical space coverage/density considerations supported by GTM analysis. GTM parameter tuning in order to build "universal" maps of maximal generality, with a robust competence in predicting a large panel of molecular properties/activities, will be the subject of subchapter 7. The predictive power of GTM-driven models, in various contexts, will be briefly analyzed in subchapter 8. The use of GTMs for Big Data analysis (subchapter 9) will revisit the concept of cumulated responsibilities and introduce property-colored synthetic vectors that may serve as compact compound library descriptors at the responsibility representation level. An actual example of commercial compound library comparison will be discussed. If compound libraries are sets of binders to a given target, then the library descriptor explicitly represents the chemical space associated with that target, and therefore implicitly characterizes the target itself. This opens possible applications of GTM in computational chemogenomics. Finally, an original "Stargate" GTM approach, providing a direct link between any two vector spaces, e.g., a molecular descriptor space and a property profile space, will be introduced in subchapter 10. The two distinct spaces are coupled by "pinning down" the flexible manifolds of each space onto a common K-dimensional latent space grid. This approach gives direct access to polypharmacological predictions on the basis of molecular descriptors and, inversely, highlights molecular descriptor vectors that likely encode compounds complying with a desired polypharmacological profile.

4. GTM-Based Classification

4.1. Latent-Space Classification

Given a training set of m molecules assigned, on the basis of experimental input, to different and non-overlapping categories ci (typically, actives ∈ c1,


inactives ∈ c2), the responsibility vector of each molecule can be used to transfer class information onto its associated nodes (12, 13). Intuitively, if the class assignment is visualized as a color, then each molecule will "transfer" some of its color to the nodes, proportionally to its responsibilities. Transferred colors accumulate in the nodes, eventually defining nodes where one specific color dominates over the others. Note, again, that the training molecules m are not necessarily related to the frame set compounds n that served to generate the manifold equations: frame and training sets may be identical, partially overlapping, or even fully disjoint. The latter should be understood in terms of the individual compounds contained, not in terms of the covered chemical space zones: if the two sets covered disjoint chemical space zones, the GTM algorithm would still work, technically speaking, but color transfer would make little sense, for it would rely on far-fetched extrapolations. Mathematically, the (normalized) amount of color on each node represents the probability of association of node k to class ci:

$$P(c_i \mid x_k) = \frac{P(x_k \mid c_i)\, P(c_i)}{\sum_{j} P(x_k \mid c_j)\, P(c_j)}$$

where P(xk|ci) is computed as follows:

$$P(x_k \mid c_i) = \frac{1}{N_i} \sum_{n_i \in c_i} R_{k n_i}$$

where Rkni is the responsibility of node k for a molecule ni belonging to class ci, ni enumerates the training set compounds belonging to class ci, Ni is the number of training set compounds belonging to class ci, and P(ci) = Ni/N represents the prior probability of class i, i.e., the fraction of class members within the training set. If P(c1|xk) > P(c2|xk), node k will be formally assigned to class 1, and visually rendered in the associated color (Figure 3), with an intensity modulated by P(xk|ci). This allows checking whether the local dominance of class 1 corresponds, indeed, to a significant local accumulation of members of that class, or whether the prevalence is the result of unreliable extrapolations of distribution tails to nodes far off the actual regions of interest. Such considerations can be quantitatively exploited to outline GTM model applicability domains, as will be shown in a subsequent subchapter. Now, "colored" nodes represent a repository of the knowledge extracted from the training set compounds, and can subsequently be used for predictions, by transferring the acquired "color" back to query compounds q to be classified. As a first step, a query compound q defined by its descriptor vector tq will be located on the GTM, i.e., associated to responsibilities {Rkq} and optionally mapped to its 2D residence point s. From here, there are essentially two ways to assign it to a predicted class, based on the two formal GTM representation levels outlined in §2.
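As a minimal illustration of this node-coloring step (our naming; NumPy conventions of the earlier training sketch, with R of shape (K, N)), both P(xk|ci) and P(ci|xk) follow directly from the two equations above:

```python
import numpy as np

def color_nodes(R, labels, n_classes):
    """Return P(x_k|c_i) and P(c_i|x_k), each of shape (K, n_classes),
    from responsibilities R (K, N) and integer class labels (N,)."""
    K = R.shape[0]
    P_x_c = np.zeros((K, n_classes))
    prior = np.zeros(n_classes)
    for c in range(n_classes):
        members = labels == c
        P_x_c[:, c] = R[:, members].mean(axis=1)  # (1/N_i) * sum of responsibilities
        prior[c] = members.mean()                 # fraction of class members
    joint = P_x_c * prior                         # numerator of Bayes' theorem
    return P_x_c, joint / joint.sum(axis=1, keepdims=True)
```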



Figure 3. GTM latent space classification model with applicability domain (AD) (12, 14), for DUD (15) AChE inhibitors (red) and decoys (blue). Lighter regions have a lower probability of association to the winning class P(xk|cbest) and may therefore be discarded from the applicability domain of the model. The points on the map represent individual compounds colored by class.

At the responsibility level, the so-called global method consists in predicting the class likelihood values P(ci|tq) of q, as follows:

$$P(c_i \mid t_q) = \sum_{k=1}^{K} R_{kq}\, P(c_i \mid x_k)$$

The local method, based on the 2D representation, only uses the conditional class probability of the node closest to the molecule in 2D:

$$P(c_i \mid t_q) \approx P(c_i \mid x_{nearest})$$
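Both assignment routes reduce to one line each once the colored nodes are available; a sketch under the same naming assumptions as above:

```python
def predict_class_global(R_q, P_c_x):
    """Responsibility-level prediction: P(c_i|t_q) = sum_k R_kq * P(c_i|x_k)."""
    return R_q @ P_c_x                        # (n_classes,)

def predict_class_local(R_q, X, P_c_x):
    """2D-level prediction: class probabilities of the node nearest to s_q."""
    s_q = R_q @ X                             # mean 2D position of the query
    nearest = ((X - s_q) ** 2).sum(axis=1).argmin()
    return P_c_x[nearest]
```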

The local method is expected to be less reliable, since it is prone to significant information loss when compressing the responsibility vector down to a pair of real numbers, whilst the molecular bubble plot may suggest that several different nodes are responsible for the classification of the compound. However, if the local method provides good results, it will allow for direct reading of molecular properties from (latitude, longitude) specifications, as intuitively desired from a mapping protocol. In order to translate P(ci|tq) into a clear-cut answer to the question "to what class does q belong?", it is sufficient to consider the largest of these values as the "winning" class, although the confidence in the prediction should be downgraded if the winning class won by a narrow margin only (16).

4.2. Initial-Space Classification

Interestingly, classification problems also support a different approach, herein termed initial-space classification, relying on the independent fitting of class-specific manifolds, treating the members of each class as independent "training" sets. Each of the fitted manifolds will specifically cover the chemical space zones inhabited by the corresponding class members, while building a separate probability density


function p(t|ci) in the initial data space for each class ci. Then, a new molecule q must be projected onto all the different class-specific manifolds. The better it fits a given manifold, the larger the likelihood of its membership in the associated class ci. The associated probability P(ci|tq) can be obtained by applying Bayes' theorem:

$$P(c_i \mid t_q) = \frac{p(t_q \mid c_i)\, P(c_i)}{\sum_{j} p(t_q \mid c_j)\, P(c_j)}$$


where p(t|ci) is the approximation of the conditional probability density function in the data space, i.e., the GTM mixture density of the manifold fitted for class ci:

$$p(t \mid c_i) = \frac{1}{K} \sum_{k=1}^{K} \left(\frac{\beta_i}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta_i}{2}\, \big\|y_k^{(i)} - t\big\|^2\right)$$

This methodology was applied to the classification of actives and decoys from the DUD (Directory of Useful Decoys). In a certain sense, initial-space classification reduces a multi-class classification problem to multiple single-class probability-approximation tasks. This approach has some drawbacks compared to the afore-mentioned responsibility-level, one-manifold-driven classification approaches (latent-space classification), because (a) it clearly requires more effort, involving multiple manifold fits, and (b) it does not support the straightforward visual interpretations seen in Figure 3. However, it testifies to the extreme versatility of the GTM tool.
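A sketch of this scheme, assuming one (Phi, W, beta) triple fitted per class with the earlier training function (names ours; the log-sum-exp trick is for numerical stability only):

```python
import numpy as np

def log_density(t, Phi, W, beta):
    """log p(t|c_i): GTM mixture density of one class-specific manifold at t."""
    Y = Phi @ W
    D = Y.shape[1]
    logp_k = (0.5 * D * np.log(beta / (2.0 * np.pi))
              - 0.5 * beta * ((Y - t) ** 2).sum(axis=1))
    m = logp_k.max()
    return m + np.log(np.exp(logp_k - m).mean())   # log of (1/K) * sum_k p(t|x_k)

def classify_initial_space(t, manifolds, priors):
    """Bayes over class-conditional densities; manifolds is a list of
    (Phi, W, beta) tuples, one per class."""
    logp = np.array([log_density(t, *mf) for mf in manifolds]) + np.log(priors)
    p = np.exp(logp - logp.max())
    return p / p.sum()
```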

5. GTM-Based Regression

Regression (17–19) consists in learning a hidden relationship between explanatory variables x and dependent variables y from several observations. In QSAR, the dependent variables are activities to be predicted, such as the affinity for a given target or the solubility of molecular compounds. Two types of GTM-based regression models can be built: supervised and unsupervised. Supervised approaches use activity information as well as descriptors to train the models, whereas unsupervised methods only use descriptors and are "blind" to the activities during the training process. Supervised GTM regression models include the new Stargate GTM method (10), which is a GTM generalized to multiple spaces. By fitting two manifolds in two separate spaces, one for descriptors and one for properties, the Stargate GTM model integrates property information and becomes supervised. This method can be used to predict one property or entire property profiles, and a special subchapter below is dedicated to it.

5.1. Unsupervised Regression: Key Issues

First, note that the issues discussed here are of equal concern for classification models, which are nothing but "regressions" in terms of categorical variables.


Intriguingly, “unsupervised regression” sounds like a paradox, insofar as “regression” represents the forcibly supervised extraction of reproducible, common trends observed to occur within a training set, and their extrapolation to external instances. This apparently confusing terminology was chosen in order to emphasize the unsupervised nature of the first step in GTM regression model building, which is a two-step undertaking:


(a) an unsupervised construction of a manifold, in complete ignorance of property data, putatively going as far as using a distinct frame set of compounds for which the property in question is not even known, and

(b) the supervised transfer of property values from training set compounds onto the manifold grid points.

The implicitly supervised learning in GTM regression does not include manifold fitting, but begins only once the manifold is given. It consists in coloring, i.e., transferring the property-related information from the training set (not frame set) compounds to the manifold grid, very much like class "colors" were transferred in the foregoing subchapter. Note that this two-step approach to regression may also require the user to revisit the concept of model cross-validation, which may be less straightforward than in a classical context, where a subset of compounds representing some fraction 1/X of the training set (X being typically an integer between 3 and 10) is iteratively kept aside, and a model is fitted on the remaining (X−1)/X compounds in order to predict the property of the kept-aside subset. This scheme may now be complicated by the fact that the frame compounds entering the unsupervised stage may or may not coincide with training compounds, whilst it is not a priori clear whether frame set membership of a training compound will implicitly mean a more accurate modeling of its property. Observations in our hands so far show that this is not the case, but there is, to our knowledge, no exhaustive study of the impact of potential cross-validation scheme setups on GTM regression model robustness. In the following discussion, we will explore the potential cross-validation scenarios that may be considered, in light of the experience accumulated so far in this rather new domain. If frame and training sets coincide, then the leave-1/X-out scheme could in principle be applied to both stages, i.e., letting cross-validation refit each time a novel manifold based on the (X−1)/X kept-in compounds, color it with property data from the same compounds, then map the kept-aside 1/X and predict their properties. However, the final predicted values will stem from different manifolds: the herein obtained cross-validation results are not visually interpretable within a common frame. It is also debatable whether the resulting cross-validation statistics may be interpreted as characteristics of "the" model, since they originate from a heterogeneous collection of values stemming from potentially quite different models, based on different manifolds. Furthermore, this approach is clearly more time-consuming than building the manifold outside of the cross-validation loop, using all compounds and herewith obtaining the final responsibility vectors for all. In this simplified scenario, cross-validation will concern only the manifold coloring by the kept-in (X−1)/X, followed by the prediction of the kept-aside 1/X, etc. The resulting


cross-validation parameters characterize the model based on a common manifold, but subjected to variations in coloring due to the particular kept-in and kept-aside compound sets. If frame and training sets are distinct, cross-validation of the frame set actually means selection of the smallest compound collection able to properly cover the chemical space required to explain the variance of the property of interest amongst the given training compounds. Given a frame set, a manifold is built; if cross-validated coloring/prediction with respect to the training compounds yields a robust model, this means that the manifold was a good choice, and therefore, implicitly, that the frame set was a good choice. Of course, whether or not compounds encountered during other external predictions will also fit the manifold closely enough is an open question. If not, they can be recognized as such, and discarded from prediction on the basis of applicability domain considerations (see §6 below). Furthermore, since the frame set was seen to provide proper coverage of training set space (as testified by its successful cross-validation), external compounds beyond frame set coverage would also be very different from training set compounds. This means that a sheer increase of frame set size and diversity would not help unless backed up by an increase of training set size and diversity. In other words, like any other structure-activity model, the final quality and applicability of GTM-based regression will be controlled by training data quality.

5.2. Specific Implementations of GTM Regression Models

We designed several regression methodologies for the transfer of property "color" from training compounds onto the manifold grid. Instead of probabilities of assignment to a given activity class, each node k will be assigned some weighted mean property value a̅k, based on the properties of the training molecules that are most strongly connected to the node. Again, the so-called "global" method is based on the entire responsibility vector of a test molecule q (responsibility level), while "local" methods take into account only the 2D-level neighborhood of q.

1. The global prediction approach is based on the K-dimensional activity landscape vector a̅ computed as follows:

$$\bar{a}_k = \frac{\sum_{n} R_{kn}\, a_n}{\sum_{n} R_{kn}}$$

where an are the training set activities and Rkn is the responsibility of the kth node for the nth molecule. An example of an activity landscape representing the stability constant of metal-ligand complexes is given in Figure 4. The prediction of the property of an external compound q uses its responsibility vector Rkq at each node xk as a weight for the activity landscape value a̅k, so that the predicted activity âq for the query is a weighted average of landscape values:

$$\hat{a}_q = \sum_{k=1}^{K} R_{kq}\, \bar{a}_k$$

2. The nearest nodes approach operates at 2D representation level, and consists in selecting the V nearest nodes xv (nodes minimizing ||xv − sq||) on the


2D grid, according to Euclidean distances from the projection point sq of the query compound. Euclidean distances are used to build the GTM probability density functions (PDFs) during training and should therefore be a natural choice for this task. Only the V activity landscape values a̅v of the nodes surrounding the molecule projection are considered for prediction, and the predicted activity is their average value:

$$\hat{a}_q = \frac{1}{V} \sum_{v=1}^{V} \bar{a}_v$$

Both the global and nearest nodes approaches, operating at responsibility and 2D representation level respectively, are fundamentally Nearest Neighbor (NN) methods (20, 21). In the global method, all the reference instances of known property (the K nodes) are considered, and the extrapolation proceeds by weighing reference properties a̅k by the similarity level between nodes and query compounds, for that is exactly the empirical interpretation of responsibility values. Nodes, in this approach, act like artificially constructed reference compounds: hybrid constructs covering the frame set zones, onto which the property-specific knowledge was transferred from the training set molecules. What benefit does this bring, with respect to plain, direct comparison/extrapolation between the initial training set compounds and the query molecule? Beyond the benefit of relevant, non-linear dimensionality reduction, "condensing" the property information of a putatively very large training set of tens of thousands of compounds onto a limited series of representative nodes may represent an impressive computational resource economy with respect to a classical neighborhood model. While classical k-NN requires the query compound to be compared to all the N training molecules in D-dimensional space (estimation of N similarity scores in D dimensions), in GTM regression the required K similarity scores are implicitly given by the responsibility vectors.

3. However, the classical nearest neighbors method (k-NN) (22) can also operate at both the responsibility and 2D representation levels of a GTM model. These straightforward approaches were indeed implemented as alternative GTM regression engines. Nearest neighbors can be found by computing the Euclidean distances ||sm − sq|| between the mean positions of training and query compounds on the 2D map, or by estimating the correlation level between their responsibility vectors. The predicted activity of a new tested compound q is the average of the experimental activities av of the V nearest training set compounds. As always in neighborhood-based extrapolation of properties based on the similarity principle, the peculiar choice of the empirical hypotheses used to embody this principle (similarity metric, similarity cutoff, choice of the number k of nearest neighbors in k-NN, etc.) may impact the final quality of results in ways that are difficult to predict a priori. Therefore, the "correct" best way to perform GTM-based regression may depend on the actual problem under investigation, and plain 2D-level approaches based on the V = 5 nearest neighbors in terms of s projections were seen to outperform (22) responsibility-level approaches, which overestimated the influence of remote training set examples. By default, we recommend several of these approaches to be compared, in cross-validation and external prediction modes, in order to highlight the one best suited in a given



context. In particular, since it is difficult to make a reasonable guess of the best-suited regression strategy, the choice of the latter has been considered as an additional degree of freedom in an evolutionary approach aimed at optimizing map performance in general (see §7). All possible choices allow for robust regression models to emerge, but the experience gained so far seems to hint that the nearest-nodes approach (with the initially stipulated V = 5; the number of nodes to use was not subject to further refinement) is the most likely winner, in published and yet unpublished studies alike.
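The landscape coloring and both prediction routes again fit in a few lines; a sketch under the earlier naming assumptions, with the chapter's default V = 5 for the nearest-nodes variant:

```python
import numpy as np

def activity_landscape(R, a):
    """Node-level landscape: responsibility-weighted mean of activities a (N,)."""
    return (R @ a) / R.sum(axis=1)              # (K,) vector of a_bar_k values

def predict_activity(R_q, X, a_land, method="nearest", V=5):
    if method == "global":                      # responsibility-weighted average
        return R_q @ a_land
    s_q = R_q @ X                               # 2D projection of the query
    nearest = np.argsort(((X - s_q) ** 2).sum(axis=1))[:V]
    return a_land[nearest].mean()               # mean over the V nearest nodes
```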

Figure 4. Visualization of a GTM regression model; the response variable is the stability constant of LuIII-ligand complexes. Some examples of ligands are represented, forming stable complexes (in the hills of the activity landscape), or less stable complexes (in the plains).

6. GTM-Based Applicability Domain Definitions

An applicability domain (AD) (14, 23) delimits the chemical space zone within which the extrapolation of knowledge from training compounds to other molecules is considered trustworthy. It depends on the instances used to train the model, as well as on the features used to model the space. Basically, there are two complementary landmarks defining applicability domains:

• Structure-dependent criteria: structural closeness (similarity) to training set molecules. This approach prompts the user to define some chemical space zone expanding the "habitat" of training set molecules by some typical similarity radius, by a "box" encompassing all training compounds, etc.





• Property-dependent criteria: internal coherence of the predicted values. This is applicable in situations when predicted properties stem from an averaging operation, which is the case for GTM predictions, computed by averaging contributions from nearest neighbors/nodes. The mean value returned as the final prediction is considered trustworthy only if the individual contributions to it fall within a narrow range. Predicting a mid-range property value (or a marginal prevalence of one class) by averaging over a heterogeneous neighborhood of both low-activity and high-activity nodes should be met with skepticism, while predicting a moderate activity by averaging over roughly equal, moderate contributions makes perfect sense.

6.1. Structure-Dependent Applicability Domains

The GTM methodology provides an elegant way to monitor the relevant chemical space "habitat" of training compounds and its relevant neighborhood, i.e., zones not directly populated by training compounds, but still featuring significant residual probability densities. This basically happens in two successive steps:

1. First, check whether the projected molecule is close enough to the manifold. This sine qua non condition requires explicit verification, since, as already mentioned, GTM will forcibly fold any point of the infinite chemical space into the square 2D grid of points. This is the likelihood filter: with GTM, the log likelihood of each molecule can be computed and used to determine whether a new molecule must be discarded, by simply applying a cut-off value (22). A related approach has been used by Bullen et al. for scatterometer data to discard outliers (25).

2. Further checking will concern where exactly on the manifold the new molecule will be located: are the matching areas densely populated by training compounds? Thus, a straightforward structure-based AD criterion in GTM models would be a cumulative responsibility cutoff value, below which nodes would count as "uninhabited" by training compounds and be discarded as unreliable sources.

This might discard entire regions of the 2D map from the applicability domain. There are several ways to determine whether a molecule should be considered as resident in such a discarded region: find the nearest node for each molecule and establish whether it was discarded or not, or draw a delimited area around each node (Voronoi region, confidence ellipse, etc.) and capture molecules within that area. This method is easily applicable to classification or regression models, and can easily be visualized on a 2D map in both cases: for classification models, by

removing nodes outside the AD or using a transparency-based representation; for regression models, by shading regions of the activity landscape outside the AD.
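The two-step filter sketched above can be expressed compactly; in the sketch below (our naming), the cutoff values are placeholders the user must calibrate, and log_density is the helper from the initial-space classification sketch:

```python
import numpy as np

def populated_nodes(R_train, density_cutoff=0.1):
    """Cumulative responsibility filter: nodes 'inhabited' by the training set."""
    return R_train.sum(axis=1) >= density_cutoff        # (K,) boolean mask

def compound_in_ad(t, R_q, X, manifold, node_mask, llh_cutoff=-50.0):
    """Structure-dependent AD check: likelihood filter, then node occupancy."""
    if log_density(t, *manifold) < llh_cutoff:          # step 1: close to manifold?
        return False
    s_q = R_q @ X                                       # step 2: nearest node
    nearest = ((X - s_q) ** 2).sum(axis=1).argmin()
    return bool(node_mask[nearest])
```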


6.2. Activity-Dependent Applicability Domains

Three AD definitions based on prediction coherence criteria will be introduced in the context of classification models. The (easy) exercise of adapting these considerations to regression approaches is left to the reader. The class-dependent density and class prevalence factor (CPF) methods discard nodes from the AD; in other words, some regions of the map will be considered a "no man's land" because they are not clearly dominated by either of the classes. Eventually, compounds associated with the excluded nodes will be considered unpredictable (recall the above discussion of the structure-dependent AD case). Alternatively, the class likelihood factor (CLF) directly defines whether a test compound is predictable.

1. Class-dependent density: the classification map is obtained by coloring each node of the map by the winning class cbest, which has the highest conditional node probability P(xk|cbest), matching the cumulated responsibilities from training set representatives of class cbest. GTM nodes will be considered inside the AD if the winning-class density exceeds a user-defined threshold θ:

$$P(x_k \mid c_{best}) \geq \theta$$

Therefore, this AD definition is very similar to the density-based AD, except that it only takes into account the density of training compounds belonging to the winning class. The problem with this approach is that it does not take into account information from the other classes, and does not highlight the safety margin by which the winning class prevailed.

2. Class prevalence factor: the class prevalence factor, or CPF, is a lower threshold value for the prevalence rate of the winning class probability P(xk|cbest) over the probability of the second-best class assignment option:

$$\frac{P(x_k \mid c_{best})}{P(x_k \mid c_{second})} \geq CPF$$

By definition, this ratio always exceeds or equals 1; CPF = 1 means that the top two class assignments are clearly indistinguishable. To ensure that cbest is surely the "best", and did not outperform the other classes ci due to some artifact, CPF > 1 should be required. However, the higher the threshold, the fewer the nodes that escape being discarded as "no man's land".

This method was found to be at least as efficient as the bounding box method (12) and allows for a straightforward visual interpretation on the map. See, for example, Figure 5, where discarded nodes are simply "bleached" out of the classification map. CPF usually removes regions at the frontier between two classes, which is the main place where models "hesitate".

3. Class likelihood factor: the class likelihood factor method is based on the class entropy, a value defined for each compound, measuring the disorder of the class probabilities P(ci|tq) attributed to each query compound tq: if

the probability of the winning class is far greater than the probability of any other class, the entropy will be low, since the probability is concentrated in one place; however, if the probabilities are almost the same for all classes, the disorder will be great and the entropy high. The entropy for a query compound tq is defined as follows:

$$H(t_q) = -\sum_{i=1}^{N_c} P(c_i \mid t_q)\, \log P(c_i \mid t_q)$$


Then, the condition for the query compound tq to lie within the applicability domain is the following:

$$H(t_q) \leq CLF \times H_{max}$$

where Hmax = log Nc is the highest class entropy, which occurs when every probability P(ci|tq) is equal to 1/Nc, Nc being the number of classes, i.e., when all conditional class probabilities are equal and no class wins. It should be noted that, in this case, we do not discard nodes from the applicability domain but establish that some queries cannot be correctly predicted. Therefore, this AD definition cannot be directly visualized like the CPF method.
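Both coherence criteria are straightforward to compute; in this sketch (our naming, thresholds illustrative), the CPF test masks nodes while the CLF test accepts or rejects individual queries:

```python
import numpy as np

def cpf_node_mask(P_x_c, cpf=3.0):
    """Keep nodes where the winning-class density exceeds the runner-up
    by at least the factor cpf (cf. Figure 5, which uses a factor of 3)."""
    top2 = np.sort(P_x_c, axis=1)[:, -2:]               # (runner-up, best) per node
    return top2[:, 1] / np.maximum(top2[:, 0], 1e-12) >= cpf

def clf_in_ad(P_c_t, fraction=0.5):
    """Accept a query if its class entropy stays below a fraction of H_max."""
    p = np.maximum(P_c_t, 1e-12)
    H = -(p * np.log(p)).sum()
    return H <= fraction * np.log(len(P_c_t))           # H_max = log(Nc)
```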

Figure 5. BCRP (breast cancer resistance protein) (24) inhibition classification map, where inhibitors are more likely to be found in red regions and non-inhibitors in blue regions; nodes are represented by large circles and tested compounds by points. Nodes outside the CPF applicability domain (that is, nodes for which the likelihood of the winning class was not at least thrice as high as any other) were removed from the map on the right-hand side. A significant increase in balanced accuracy (BA) can be achieved at moderate loss of compound set coverage.

A thorough benchmarking of all the above-mentioned putative AD delimiter criteria has not yet been undertaken – therefore, at this point, we cannot make any detailed statement on the relative pertinence of these choices or any specific recommendation on the AD threshold values required by each criterion.


7. GTM Parameter Tuning and the Quest for "Universal" Maps of Drug-Like Chemical Space

The previous analysis of GTM classification and regression models highlighted the fundamental distinction between the actual unsupervised map (manifold) construction, based on a frame set, and the subsequent (supervised) learning or "coloring" of this map, based on a potentially different training set. Some options/parameters only concern the unsupervised manifold fitting step, and include the four GTM setup parameters: the grid size K, the number of RBFs M, the RBF width factor w, and the weight regularization coefficient λ, in addition to the frame set choice, which can be formally regarded as an additional degree of freedom. Picking the actual learning/coloring strategy out of the various options outlined in §4 and §5 is a degree of freedom not affecting manifold construction. Finally, one meta-parameter of paramount importance affects both the manifold construction and the learning process: the choice of the initial descriptor space, the primary conveyor of numerically encoded structural information. All these parameters have an impact on the quality of the final predictive model supported by the manifold. Actually, model quality is the only objective criterion by which to validate the quality of the proposed manifold. In the absence of such quantitative validation, chemical space mapping is trapped at the border between science and art. Without a cross-validated coefficient to rely upon, the "beauty" of a map is often the only criterion to decide whether the chosen grid size "fits" the problem, etc. Coupling visualization with prediction is therefore a key asset of the GTM methodology. Interestingly, the (unsupervised, perhaps even external-frame-compound-based) manifold is not tailor-made to specifically serve as the support of a dedicated model for the given training set of the targeted property. If a manifold may serve as support for a model, all while ignoring any model-related information at the buildup stage, an interesting question arises: is there any manifold that may successfully serve as support not for one, but for many distinct and diverse structure-property models, classification and regression alike? If yes, such a manifold, able to simultaneously support many models of completely unrelated and biologically relevant properties, may claim to be a "universal" map of (drug-like) chemical space (26). The objective quality criterion of such a map would be some mean of cross-validated predictive power scores over the panel of the various property models it supports.

7.1. Targeted Selection Panel

The quest for a universal map implies, first, the definition of the properties to enter the training panel, and of the associated selection sets. It is understood that, for


each property, a large (>100), coherent training set of compounds with reported property values must be available. However, training sets do not need to overlap; no complete matrix reporting all properties for all compounds is required. Since these training sets are used to select GTM models with optimal mean predictive power, they will be termed selection sets. For technical reasons, it is mandatory to decide upfront whether the selection properties shall all be continuous or categorical variables: the mean predictive power cannot be computed by mixing correlation coefficients from regression models and balanced accuracies from classification models. Note that there is no limitation to the scope to be covered by selection properties: we may formally search for the optimal GTM setup to model one property (a classical single-endpoint model), as well as build a domain-specific map (concerned with an ensemble of related properties: say, affinities for receptors of the same class), or go for a "universal" map, selected for its competence with respect to a maximum of completely unrelated properties.

7.2. The Parameter Space

Next, the problem search space needs to be delimited, by defining:

(a) a list of options for possible frame set choices. As already mentioned, these may totally, partially, or not at all coincide with selection set compounds. Frame sets may be chosen so as to cover various orders of magnitude in terms of size; this allows the user to obtain an idea of the minimal frame set supporting the build-up of useful maps. Algorithmically, the tool is obviously aware of the current frame set size, so it may automatically switch to incremental iGTM mode with larger frame sets.

(b) a series of considered options for the molecular descriptors. This implies providing, for both frame set and selection set compounds, the associated files encoding (after curation/standardization, etc.) each compound as the corresponding descriptor vector. We recommend descriptor vectors to be normalized, by detecting, for each descriptor element, the minimal and maximal values over frame set compounds, and using these for linear rescaling (minimum to 0.0, maximum to 1.0); a minimal sketch of this rescaling is given after this list. Note that these reference minima and maxima for each descriptor element are tied to frame set compounds, and should be applied as such to training set descriptors (and later to external prediction compounds), not locally reevaluated for each set. Also, keep in mind that open-ended descriptor vectors, such as those produced by open-ended fragment detection (where detection of a novel fragment appends an additional element to the dynamically growing descriptor vector), must follow a common vector element numbering strategy and ignore any terms never encountered in frame set compounds.

(c) specifications for the allowed ranges of the four manifold parameters (K, M, w, λ) and for the precision required to scan the specified ranges.

(d) last but not least, the method should be allowed to pick which one of the implemented (regression or classification, as defined by the upfront


choice of the selection problem) manifold coloring/prediction tool should be used.
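The frame-set-anchored rescaling of item (b) reduces to a few lines; this sketch (our naming) returns a closure so that exactly the same frame set minima and maxima are reused for training and external prediction compounds:

```python
import numpy as np

def frame_minmax_scaler(T_frame):
    """Build a rescaling function anchored on the frame set descriptors."""
    lo, hi = T_frame.min(axis=0), T_frame.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant descriptors
    return lambda T: (T - lo) / span         # apply verbatim to any later set

# Usage: scale = frame_minmax_scaler(T_frame); T_train_scaled = scale(T_train)
```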


7.3. Map Tuning: The Evolutionary Approach

The above choices can be synthetically represented as a "chromosome", with loci dedicated to each mentioned degree of freedom. Some loci represent categorical variables, denominating the choice of frame set, descriptor type, or prediction method; some are integers (grid size, RBF number), and others are real numbers. Evolutionary computing readily supports browsing such heterogeneous search spaces, which makes it a method of choice for the quest for optimally tuned GTM models. The chromosome (the "genotype") unambiguously encodes the "recipe" to build a GTM model (the associated "phenotype"). This phenotype is characterized by its ability to "survive" in the competitive environment of a fixed-size chromosome population (under steady evolution through cross-over and mutation events involving current members), i.e., by its "fitness" score. The nature of this fitness score has already been hinted at: some mean of cross-validated predictive power scores over the selection sets. This might be refined by introducing a penalty related to the spread (standard deviation) of the individual scores per set: at equal mean predictive power, a map faring roughly equally well for each selection model is to be preferred to a map doing very well on a few models but failing for others.

Figure 6. From chromosome to fitness score: encoding, building and assessing the quality of a multiproperty-competent GTM model.

Figure 6 illustrates the detailed process of estimating the fitness score for a multiproperty-competent GTM model operating in regression mode, employing repeated, randomized leave-1/3-out cross-validation for a robust assessment of the individual quality criterion Q² of each selection set; a schematic rendition of this evaluation is sketched below.
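A minimal sketch, assuming a hypothetical `train_gtm_regressor` stand-in for the actual GTM engine; only the cross-validation bookkeeping and the spread-penalized fitness are shown:

```python
import numpy as np

def q2_leave_third_out(X, y, train_gtm_regressor, n_repeats=3, seed=0):
    """Repeated, randomized leave-1/3-out cross-validated Q2 for one selection set."""
    rng = np.random.default_rng(seed)
    press, ss_tot = 0.0, 0.0
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        for fold in np.array_split(idx, 3):        # each third left out in turn
            mask = np.ones(len(y), dtype=bool)
            mask[fold] = False
            model = train_gtm_regressor(X[mask], y[mask])  # hypothetical trainer
            y_hat = model.predict(X[fold])
            press += np.sum((y[fold] - y_hat) ** 2)
            ss_tot += np.sum((y[fold] - y[mask].mean()) ** 2)
    return 1.0 - press / ss_tot

def fitness(q2_per_selection_set, penalty=0.5):
    """Mean Q2 over selection sets, penalized by their spread: a map doing roughly
    equally well everywhere beats one excelling on a few sets but failing on others.
    The penalty weight is an illustrative assumption."""
    q2 = np.asarray(q2_per_selection_set)
    return q2.mean() - penalty * q2.std()
```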


7.4. Towards a Universal Map

Recently, the above strategy was applied to a selection panel of 144 ChEMBL (27) data sets of ligands associated to 144 biologically diverse targets, with measured pKi values. Frame set choices were constructed by alternatively regrouping subsets of these selection sets and external molecules (both reported bioactives and plain commercial compounds from public databases). Descriptor choices covered 38 diverse ISIDA (28, 29) fragmentation schemes, spanning all possible fragmentation topology and atom coloring (30) options. The GTM-specific adaptation of a previously developed (31) libSVM tuning tool (a distributed, asynchronous genetic algorithm) was employed to search, on a cluster, for the "fittest" maps according to the paradigm illustrated in Figure 6. Five "top" map candidates, each based on a different molecular descriptor space, were subjected to in-depth external testing. A maximum of exploitable information, from ChEMBL and other sources (categorical active/inactive classification with respect to >400 human targets, antiparasitic/antiviral classification data, etc.), was used to challenge their ability to serve as support for novel predictive models. Although these external challenges were completely unrelated to the selection stage, in terms of the properties concerned, the training molecules involved and the model type (classification instead of regression), the maps fared remarkably well: cross-validated balanced accuracy scores above 0.7 were reached in roughly 80% of the almost 500 different external modeling challenges (results achieved without any applicability domain monitoring).

8. Predictive Power of GTM-Based Models: A Critical Assessment

In the methodological studies published so far, describing and benchmarking GTM-driven classification and regression models, the same conclusion emerged: GTM-based predictive models are only slightly less successful than state-of-the-art non-linear machine learning models built with Random Forest (RF) or Support Vector Machines (SVM). This minor loss of predictive power appeared, however, as a perfectly acceptable price to pay in exchange for the visualization abilities, which render GTM models intuitive and hence much more appealing than "black box" non-linear equations. It is widely believed that there must be some trade-off between the interpretability and the accuracy of models, and this is certainly true even within the family of GTM-based methodologies: the 2D representation level is the "intuitive" one, whilst the responsibility level is the "accurate" one. True, the above-mentioned primary studies did rely on limited GTM setup optimization; some parameters were occasionally tuned, while others, notably the grid size K, were kept at default values. Occasionally, the baseline machine learning method used for comparison also featured tunable parameters, which were not systematically subjected to optimization either. In order to quantitatively evaluate the cost, in terms of predictive power loss, of "intuitive" GTM models over "black box" SVM, single end-point regression models were optimized with both GTM regression and libSVM ε-regression engines, using method-specific adaptations of the above-described evolutionary procedure. The same pool of molecular descriptor choices was offered to both methods. The GTM frame set choice option was disabled: the training set compounds were forced to act as frame set. In this way, an equivalent tuning effort of method-specific control parameters could be guaranteed. The comparative model building concerned ChEMBL ligand sets associated to 47 different targets, with known pKi values. Comparison of GTM and SVM regression cross-validated Q² values (Figure 7) established the mean accuracy loss of GTM models at 0.05, which is, in QSAR modeling, marginally relevant. The one notable exception for which GTM-driven regression failed to approach SVM performance was the data set associated to the 5-HT1a receptor.

Figure 7. Cross-validated determination coefficients (Q²) of GTM vs. libSVM models produced under similar evolutionary optimization conditions, for 47 independent training sets of ligands associated to different targets. The thick blue line represents the diagonal, while the thin black trend line, with a fixed slope of one, features as intercept (−0.049) the mean predictive power loss of GTM over SVM models.
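Illustrative arithmetic only (the Q² values below are synthetic, not the published data): with the slope fixed to one, the least-squares intercept of the Q²(GTM) vs. Q²(SVM) trend line reduces to the mean difference between the two score series.

```python
import numpy as np

q2_svm = np.array([0.72, 0.65, 0.80, 0.58])   # hypothetical SVM scores
q2_gtm = np.array([0.68, 0.61, 0.74, 0.55])   # hypothetical GTM scores
intercept = np.mean(q2_gtm - q2_svm)           # mean predictive power loss
print(f"mean GTM-over-SVM loss: {intercept:.3f}")
```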

9. GTM-Based Big Data Analysis

Let us recall, at this point, the three main tools by means of which a compound library can be analyzed after projection on a GTM. These are (Figure 8): class maps, property landscapes and density maps.


Figure 8. Each node xk in the latent space can be colored according to the most probable class (classification map), an average property (property landscape), or cumulated responsibilities (density map). Here, the property landscape and density map values were interpolated using kriging (32).

Density maps can be directly generated for any compound library, by cumulating individual responsibility vectors (bubble plots in Figure 2), and then using color intensity or a color spectrum as a marker of cumulated responsibility. Class maps and property landscapes imply some previous learning to associate specific classes/property values to map coordinates. However, the learned property does not need to be an experimentally measured value, as in the case of GTM model training, but may be a plain, calculable molecular property: calculated mass, hydrophobicity, shape descriptors, chiral atom counts, etc. Hence, they can be generated for any arbitrary, commercial or even virtual compound collection. The key paradigm in GTM-driven library analysis is to compare any two libraries in terms of their characteristic projection patterns, which are K-dimensional vectors. Therefore, comparing two compound libraries amounts to the extremely fast comparison of two (or several, if several viewpoints based on different landscape coloring schemes are desired) K-dimensional vectors. This is clearly much easier and more intuitive than the pairwise comparison of members of one library to compounds from the other. The entropy and similarity of subsets can also be computed, providing a practical way to compare chemical libraries. The entropy S can be computed using the following formula:

$$S = -\sum_{k=1}^{K} p_k \ln p_k$$

where $p_k$ is the mean (normalized cumulated) responsibility value at node $\mathbf{x}_k$:

$$p_k = \frac{1}{N} \sum_{n=1}^{N} R_{nk}$$

Libraries more uniformly distributed in the chemical space have higher entropy, and libraries restricted to a specific area have lower entropy (Figure 9).
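A minimal sketch of these quantities, assuming `R` is the N × K responsibility matrix of a library's compounds on a given map (each row summing to one); the cosine similarity used for library comparison is one reasonable choice, not necessarily the published one:

```python
import numpy as np

def density_vector(R: np.ndarray) -> np.ndarray:
    """Mean (normalized cumulated) responsibility per node: the library's
    K-dimensional projection pattern p_k."""
    return R.mean(axis=0)

def shannon_entropy(p: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy of the density vector; higher values = more uniform coverage."""
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + eps)))

def library_similarity(R_a: np.ndarray, R_b: np.ndarray) -> float:
    """Comparing two libraries reduces to comparing two K-dimensional vectors,
    here via cosine similarity."""
    pa, pb = density_vector(R_a), density_vector(R_b)
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))
```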


Figure 9. Density maps (log scale indicating the density of molecules) and entropy values of two chemical libraries, EMC and Focus Synthesis.

Density plots as in Figure 9 provide a synthetic description of the chemical space covered by a chemical library. This is conceptually similar to other coverage monitoring methods, based for example on Kohonen maps (33), but more intuitive: unlike in Kohonen maps, where chemical space coverage is basically given by the number of nodes containing molecules, the fuzzy, probabilistic interpretation of the responsibility vector is more readily associated with the idea of space coverage. Eventually, the K-dimensional vectors (cumulated responsibilities or any property-colored landscape vectors) characterizing the various compound libraries under study may be used as a starting point for building a generative topographic map, just like any other regular molecular descriptor vector. Such a μGTM (9), or meta-GTM, thus introduces a second level of abstraction: points on the 2D μGTM map do not represent molecules, but whole libraries. This method is useful when multiple data subsets or libraries must be compared.
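A sketch of the meta-GTM data flow under the assumptions above: each library's K-dimensional density vector plays the role of a "molecular descriptor" of the whole library. A real μGTM would fit a second GTM on these vectors; since no GTM implementation ships with scikit-learn, PCA stands in below purely to show the data flow, and the Dirichlet draws mimic responsibility matrices.

```python
import numpy as np
from sklearn.decomposition import PCA

# Three fake libraries of 500/800/300 compounds on a K = 625 node map;
# each Dirichlet row sums to one, like a responsibility vector.
all_libraries = [np.random.dirichlet(np.ones(625), size=n) for n in (500, 800, 300)]
library_vectors = np.stack([R.mean(axis=0) for R in all_libraries])   # (3, 625)

# Second-level 2D embedding: one point per LIBRARY, not per molecule.
meta_coords = PCA(n_components=2).fit_transform(library_vectors)
```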

10. Applications in Chemogenomics: Mapping of Target-Specific Signatures

Of course, GTMs are not tools specific to representing small organic molecules: macromolecules, too, might be mapped as bubble plots onto dedicated macromolecule-specific maps, after encoding by specific descriptors. However, biological target molecules may also be "projected" onto the same small-molecule GTMs used to represent their ligands, by using the associated ligand sets as implicit target "descriptors". As shown above, a compound collection may be rendered as a density plot (cumulated, or mean responsibility vector), just like any single molecule (individual responsibility vector). Consider, thus, the collection of all so-far known binders to a biological target, and its density pattern on a GTM. If this collection is exhaustive, then its density pattern represents an outline of the chemical space "area" populated by the given target's inhibitors. As such, it may be interpreted as a descriptor of the specific chemical subspace that can be associated to the given target. In a certain sense, this highlighted chemical subspace is a functional image of the target per se, for it characterizes the binding "habits" of the target. These specific signatures (mean responsibility vectors) can be constructed for a single target, starting from the collection of its validated ligands, or for a target family, at various hierarchical levels (from the collection of ligands binding to at least one member of the target family). The GTM approach offers a unified perspective, going from ligands to targets and then to target families, as can be seen from the two hierarchical series of signatures shown in Figure 10 below. The specific chemical subspace associated to the FYN tyrosine kinase is part of the greater subspace associated to the tyrosine kinase family, itself a subspace of the kinase drug-space. For G-protein-coupled receptors (GPCRs), four hierarchical levels were highlighted: dopamine D1, dopamine family, monoamine (rhodopsin-like) GPCRs, and all GPCRs. It can be seen that the dopamine family is already quite representative of the rather homogeneous monoamine GPCR binders, whilst the monoamine GPCR subspace is a limited subdomain of the entire cluster of GPCR ligand "galaxies". Furthermore, since both plot series were realized on a common "universal" map manifold, kinase and GPCR spaces can be directly compared, and are seen to correspond to rather distinct patterns, as expected.

Figure 10. Density plots of associated ligands represent target- or target-family-specific signatures on a GTM. Supported by a "universal" map, built according to the protocol outlined in §7, the signatures of the different targets (the underlying responsibility vectors) can be quantitatively compared.

The concept of describing targets from the perspective of their ligands is not new (34–36). It has the merit of focusing directly on target function, rather than on sequence or shape considerations, which determine function but are not straightforwardly correlated with it. Estimating the functional relatedness of targets is a critical task in chemogenomics, where it is required to infer whether structure-activity information transfer (37) from one target to another may be useful in multi-task learning. The idea is not flawless: if the ligands known to date are not a representative sample of all possible ligands of the target, then its ligand-driven description will inevitably be biased. It is also of little utility for target deorphanization, where no ligands are known at all. However, whenever ligand-driven target analysis is an option, GTM is a tool of choice for the task. Beyond the very intuitive illustration of specific space coverage, the functional similarity of targets can be estimated quantitatively, based on the responsibility vectors of their ligand collections, which may be regarded as responsibility vectors "of the target". There is no need to rely on common compounds tested on both targets one wishes to compare, nor to perform tedious ligand-to-ligand similarity scoring.
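A hedged sketch of such a quantitative comparison; `ligand_responsibilities[t]` is assumed to hold the N_t × K responsibility matrix of target t's known ligands on a common ("universal") map, and the cosine metric is one plausible choice:

```python
import numpy as np

def target_signature(R_ligands: np.ndarray) -> np.ndarray:
    """Mean responsibility vector of a target's ligand collection:
    the ligand-based 'descriptor' of the target itself."""
    return R_ligands.mean(axis=0)

def target_similarity_matrix(ligand_responsibilities: dict) -> np.ndarray:
    """Pairwise cosine similarities between targets, requiring neither shared
    tested compounds nor any ligand-to-ligand similarity scoring."""
    names = sorted(ligand_responsibilities)
    sigs = np.stack([target_signature(ligand_responsibilities[t]) for t in names])
    sigs /= np.linalg.norm(sigs, axis=1, keepdims=True)
    return sigs @ sigs.T  # (n_targets, n_targets) similarity matrix
```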


11. Stargate GTM: Connecting Descriptor and Property Spaces

Unlike its conventional analogue, Stargate GTM (S-GTM) (10) defines a common latent space acting as a "gate" between two distinct initial spaces (for example, a space of structural descriptors and the space of experimental activity profiles). Two distinct manifolds are built, one for each initial space, but both are "anchored" to the common latent 2D grid, so that one grid node in the 2D latent space is associated with the centers of the corresponding normal probability distributions on both manifolds. Formally, a distinct probability for each of the two spaces (S = 1: an A-dimensional descriptor space, S = 2: a B-dimensional property space) will be based upon its corresponding manifold Y_S, where the dimensionality of vectors, parameter matrices W_S and inverse variance parameters β_S are specific to each of the spaces, as though two independent GTMs were considered; each space is mapped onto a latent space grid of the same size K, with the space-specific responsibility vectors:

$$R^{S}_{nk} = \frac{p(\mathbf{t}^{S}_n \mid \mathbf{x}_k; W_S, \beta_S)}{\sum_{k'=1}^{K} p(\mathbf{t}^{S}_n \mid \mathbf{x}_{k'}; W_S, \beta_S)}$$

So far, the two manifolds are independent, having nothing in common except for the coincidence of the size K of their latent space grids. The two manifolds are coupled by introducing the combined responsibility vector:

$$R_{nk} = \frac{\left[p(\mathbf{t}^{1}_n \mid \mathbf{x}_k)\right]^{w_1}\left[p(\mathbf{t}^{2}_n \mid \mathbf{x}_k)\right]^{w_2}}{\sum_{k'=1}^{K}\left[p(\mathbf{t}^{1}_n \mid \mathbf{x}_{k'})\right]^{w_1}\left[p(\mathbf{t}^{2}_n \mid \mathbf{x}_{k'})\right]^{w_2}}$$

where w1 and w2 are weight factors attributed to each space, ranging from 0 to 1, with w2 = 1 − w1. These combined products of probabilities are used in log likelihood maximization.

S-GTM "travels" between the two spaces. Let us denote the origin space by o and the destination space by d; "jumps" in both directions are possible, according to the following protocol:

• From the starting point $\mathbf{t}^{o}_n$ in space o, its latent space projection, i.e., the responsibility vector $R^{o}_{nk}$ and the associated mean latent position $\bar{\mathbf{x}}_n = \sum_k R^{o}_{nk}\,\mathbf{x}_k$, is calculated on the basis of the o manifold.

• The reverse mapping from the latent space into the destination space d is achieved by means of radial basis functions:

$$\phi_m(\mathbf{x}) = \exp\left(-\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_m \rVert^2}{2\sigma^2}\right)$$

where μm is the center of the mth RBF of the d manifold, and σ is a user-tunable parameter. These functions are used in conjunction with the parameter matrix Wd to map the 2-dimensional data point into destination space d:

$$\hat{\mathbf{t}}^{d}_n = \mathbf{y}(\bar{\mathbf{x}}_n; W_d) = W_d\,\boldsymbol{\phi}(\bar{\mathbf{x}}_n)$$

The above procedure makes sense if the latent space distribution defined by $R^{o}_{nk}$ is unimodal, so that falling back to the 2D-level representation does not represent a penalizing oversimplification of reality. Otherwise, it will be necessary to reverse-map not only the mean latent position, but the entire latent space area associated to high probabilities. If destination d is the property profile space, then "jumping" means predicting the property profile for compound n. Quite often, latent space representations of molecules are unimodal, and the simple "jumping" procedure can be applied directly. The elements of the destination vector are nothing but the various properties associated to each axis of property space d. This coupling of descriptor and activity spaces enables simultaneous access to both descriptor and property information at the stage of manifold fitting, and may hence be considered a mechanism of supervised learning. However, note that this peculiar form of supervision is much less explicit than in classical regression model fitting: the herein optimized objective (the log likelihood) is not the same as the criterion used to estimate predictive model quality, the property RMSE. It is not granted that maximal log likelihood will automatically translate into minimal RMSE; however, even a limited amount of supervision should, in principle, improve predictive performance. So far, benchmarks of this still novel approach show that this is indeed the case. Conversely, jumping from property space into descriptor space can be considered as the first step of "inverse QSAR" (38, 39). Such a jump would not lead back to chemical structures, but to descriptor values that may correspond to compounds likely to display the desired property profile. Unsurprisingly, the latent space projection of an activity profile point is often multimodal. Therefore, the relevant matching latent space areas must be mapped onto corresponding descriptor space zones, harboring several structure candidates.
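A hedged numerical sketch of the descriptor-to-property "jump" under the assumptions above; all inputs (responsibilities, grid coordinates, RBF centers, W_d, σ) are presumed to come from an already fitted S-GTM and are not computed here:

```python
import numpy as np

def rbf_features(x: np.ndarray, mu_d: np.ndarray, sigma: float) -> np.ndarray:
    """phi_m(x) = exp(-||x - mu_m||^2 / (2 sigma^2)) on the destination manifold;
    mu_d has shape (M, 2), x has shape (2,)."""
    d2 = np.sum((mu_d - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def jump(R_o: np.ndarray, grid: np.ndarray, mu_d: np.ndarray,
         W_d: np.ndarray, sigma: float) -> np.ndarray:
    """From origin-space responsibilities R_o (length K) to a destination-space
    point: mean latent position, then RBF mapping with W_d (shape (B, M)).
    Only meaningful when the latent distribution R_o is unimodal."""
    x_bar = R_o @ grid                    # (K,) @ (K, 2) -> mean 2D position
    phi = rbf_features(x_bar, mu_d, sigma)
    return W_d @ phi                      # (B, M) @ (M,) -> predicted profile
```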

12. Conclusions

Does the already impressive panoply of chemoinformatics software tools really need to adopt Generative Topographic Mapping? Actually, every task executed by GTM could also be performed by already established chemoinformatics tools. Chemical space visualization was already a well-developed branch of chemoinformatics, including a large diversity of non-linear multidimensional scaling approaches alongside classical Principal Component Analysis. Library comparison techniques existed before, as did SOM-based library analysis tools. This is not a surprise: GTMs, as fuzzy logic-based Kohonen maps, inherit the application domains of SOMs. GTM-driven regression and classification techniques exhibit performances close to those of state-of-the-art methods such as Random Forests or SVM. They offer a wide range of tools for monitoring QSAR applicability domains, but are not indispensable for defining these. GTMs may be used for polypharmacological activity profile predictions, but this might also be achieved with a battery of independent (or coupled, multi-task learning) QSAR models. What is special about GTMs is their largely federative character: their competence in all these different domains, only marginally below the best tools dedicated to a single task. In particular, the option to grow "universal" maps able to support predictive models for hundreds of completely unrelated chemical and biological properties on the basis of a single common mapping protocol is an excellent illustration of their extreme versatility. Being prediction tools and visualization tools at the same time, GTMs can fall back on objective prediction quality scores to validate the choice of parameters. By contrast, the tuning of pure visualization approaches escapes any rigorous criteria: the choice of parameters producing the "nicest" plot is hardly defensible on objective grounds. Therefore, GTMs are not only a gateway between different chemical spaces, but also a gateway between predictive modeling and data visualization, two rather distinct domains, as underlined by the popular adage in chemoinformatics according to which a model is either accurate or interpretable, but never both. So far, GTM seems the best candidate to challenge this depressing statement.

References

1. Bishop, C. M.; Svensén, M.; Williams, C. K. I. GTM: The Generative Topographic Mapping. Neural Comput. 1998, 10, 215–234.
2. Bishop, C. M.; Svensén, M.; Williams, C. K. I. Developments of the Generative Topographic Mapping. Neurocomputing 1998, 21, 203–224.
3. Sander, T.; Freyss, J.; von Korff, M.; Rufener, C. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J. Chem. Inf. Model. 2015, 55, 460–473.
4. Agrafiotis, D. K. Stochastic Proximity Embedding. J. Comput. Chem. 2003, 24, 1215–1221.
5. Agrafiotis, D. K.; Rassokhin, D. N.; Lobanov, V. S. Multidimensional Scaling and Visualization of Large Molecular Similarity Tables. J. Comput. Chem. 2001, 22, 488–500.
6. Kohonen, T. Self-Organizing Maps; Springer Series in Information Sciences; Springer: Berlin, Heidelberg, 2001; Vol. 30.
7. Kohonen, T. Self-Organization and Associative Memory, 3rd ed.; Springer-Verlag, Inc.: New York, NY, U.S.A., 1989.
8. Dunteman, G. H. Principal Components Analysis; SAGE: Newbury Park, CA, 1989.
9. Gaspar, H. A.; Baskin, I. I.; Marcou, G.; Horvath, D.; Varnek, A. Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge. J. Chem. Inf. Model. 2015, 55, 84–94.
10. Gaspar, H. A.; Baskin, I. I.; Marcou, G.; Horvath, D.; Varnek, A. Stargate GTM: Bridging Descriptor and Activity Spaces. J. Chem. Inf. Model. 2015, 55, 2403–2410.
11. Hoekman, D. Exploring QSAR Fundamentals and Applications in Chemistry and Biology, Volume 1. Hydrophobic, Electronic and Steric Constants, Volume 2. J. Am. Chem. Soc. 1995, 117, 9782; J. Am. Chem. Soc. 1996, 118, 10678.
12. Gaspar, H. A.; Marcou, G.; Horvath, D.; Arault, A.; Lozano, S.; Vayer, P.; Varnek, A. Generative Topographic Mapping-Based Classification Models and Their Applicability Domain: Application to the Biopharmaceutics Drug Disposition Classification System (BDDCS). J. Chem. Inf. Model. 2013, 53, 3318–3325.
13. Kireeva, N.; Baskin, I. I.; Gaspar, H. A.; Horvath, D.; Marcou, G.; Varnek, A. Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison. Mol. Inform. 2012, 31, 301–312.
14. Dragos, H.; Gilles, M.; Alexandre, V. Predicting the Predictability: A Unified Approach to the Applicability Domain Problem of QSAR Models. J. Chem. Inf. Model. 2009, 49, 1762–1776.
15. Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801.
16. Sushko, I.; Novotarskyi, S.; Körner, R.; Pandey, A. K.; Cherkasov, A.; Li, J.; Gramatica, P.; Hansen, K.; Schroeter, T.; Müller, K.-R.; et al. Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set. J. Chem. Inf. Model. 2010, 50, 2094–2111.
17. Smola, A. J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199–222.
18. Agostinelli, C. Robust Stepwise Regression. J. Appl. Stat. 2002, 29, 825–840.
19. Whitley, D. C.; Ford, M. G.; Livingstone, D. J. Unsupervised Forward Selection: A Method for Eliminating Redundant Variables. J. Chem. Inf. Comput. Sci. 2000, 40, 1160–1168.
20. Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative Structure-Property Relationship Approach Based on the K-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194.
21. Cedeño, W.; Agrafiotis, D. K. Using Particle Swarms for the Development of QSAR Models Based on K-Nearest Neighbor and Kernel Regression. J. Comput.-Aided Mol. Des. 2003, 17, 255–263.
22. Gaspar, H. A.; Baskin, I. I.; Marcou, G.; Horvath, D.; Varnek, A. GTM-Based QSAR Models and Their Applicability Domains. Mol. Inform. 2015, 34, 348–356.
23. Tetko, I. V.; Bruneau, P.; Mewes, H.-W.; Rohrer, D. C.; Poda, G. I. Can We Estimate the Accuracy of ADME-Tox Predictions? Drug Discovery Today 2006, 11, 700–707.
24. Sedykh, A.; Fourches, D.; Duan, J.; Hucke, O.; Garneau, M.; Zhu, H.; Bonneau, P.; Tropsha, A. Human Intestinal Transporter Database: QSAR Modeling and Virtual Profiling of Drug Uptake, Efflux and Interactions. Pharm. Res. 2012, 30, 996–1007.
25. Bullen, R. J.; Cornford, D.; Nabney, I. T. Outlier Detection in Scatterometer Data: Neural Network Approaches. Neural Netw. 2003, 16, 419–426.
26. Sidorov, P.; Gaspar, H.; Marcou, G.; Varnek, A.; Horvath, D. Mappability of Drug-like Space: Towards a Polypharmacologically Competent Map of Drug-Relevant Compounds. J. Comput.-Aided Mol. Des. 2015, 29, 1087–1108.
27. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100–D1107.
28. Varnek, A.; Fourches, D.; Horvath, D.; Klimchuk, O.; Gaudin, C.; Vayer, P.; Solov'ev, V.; Hoonakker, F.; Tetko, I.; Marcou, G. ISIDA - Platform for Virtual Screening Based on Fragment and Pharmacophoric Descriptors. Curr. Comput.-Aided Drug Des. 2008, 4, 191–198.
29. Varnek, A.; Fourches, D.; Solov'ev, V.; Klimchuk, O.; Ouadi, A.; Billard, I. Successful "In Silico" Design of New Efficient Uranyl Binders. Solvent Extr. Ion Exch. 2007, 25, 433–462.
30. Ruggiu, F.; Marcou, G.; Varnek, A.; Horvath, D. ISIDA Property-Labelled Fragment Descriptors. Mol. Inform. 2010, 29, 855–868.
31. Horvath, D.; Brown, J. B.; Marcou, G.; Varnek, A. An Evolutionary Optimizer of Libsvm Models. Challenges 2014, 5, 450–472.
32. Oliver, M. A.; Webster, R. A Tutorial Guide to Geostatistics: Computing and Modelling Variograms and Kriging. CATENA 2014, 113, 56–69.
33. Horvath, D.; Lisurek, M.; Rupp, B.; Kühne, R.; Specker, E.; von Kries, J.; Rognan, D.; Andersson, C. D.; Almqvist, F.; Elofsson, M.; et al. Design of a General-Purpose European Compound Screening Library for EU-OPENSCREEN. ChemMedChem 2014, 9, 2309–2326.
34. Bieler, M.; Heilker, R.; Köppen, H.; Schneider, G. Assay Related Target Similarity (ARTS) - Chemogenomics Approach for Quantitative Comparison of Biological Targets. J. Chem. Inf. Model. 2011, 51, 1897–1905.
35. Lin, H.; Sassano, M. F.; Roth, B. L.; Shoichet, B. K. A Pharmacological Organization of G Protein-Coupled Receptors. Nat. Methods 2013, 10, 140–146.
36. Keiser, M. J.; Roth, B. L.; Armbruster, B. N.; Ernsberger, P.; Irwin, J. J.; Shoichet, B. K. Relating Protein Pharmacology by Ligand Chemistry. Nat. Biotechnol. 2007, 25, 197–206.
37. Brown, J. B.; Okuno, Y.; Marcou, G.; Varnek, A.; Horvath, D. Computational Chemogenomics: Is It More than Inductive Transfer? J. Comput.-Aided Mol. Des. 2014, 28, 597–618.
38. Wong, W. W.; Burkowski, F. J. A Constructive Approach for Discovering New Drug Leads: Using a Kernel Methodology for the Inverse-QSAR Problem. J. Cheminformatics 2009, 1, 1–27.
39. Skvortsova, M. I.; Baskin, I. I.; Slovokhotova, O. L.; Palyulin, V. A.; Zefirov, N. S. Inverse Problem in QSAR/QSPR Studies for the Case of Topological Indexes Characterizing Molecular Shape (Kier Indices). J. Chem. Inf. Comput. Sci. 1993, 33, 630–634.