Chapter 12

Visualization of a Multidimensional Descriptor Space

Héléna A. Gaspar,1 Igor I. Baskin,2,3 and Alexandre Varnek1,*

1Laboratoire de Chemoinformatique, UMR 7140, Université de Strasbourg, 1 rue Blaise Pascal, Strasbourg 67000, France
2Faculty of Physics, M.V. Lomonosov Moscow State University, Leninskie Gory, Moscow 119991, Russia
3Laboratory of Chemoinformatics, Butlerov Institute of Chemistry, Kazan Federal University, Kazan 420008, Russia
*E-mail: [email protected]

In this chapter, we review some concepts and techniques used to visualize chemical compounds represented as objects in a multidimensional descriptor space. Several modern dimensionality reduction techniques are compared with respect to their ability to visualize the data in 2D space, using as an example a dataset of acetylcholinesterase inhibitors and their decoys.

1. Chemical Space as a Vector Descriptor Space

The main representations of molecular structures in chemoinformatics include graphs and molecular descriptors. The latter are structural or physico-chemical properties either measured experimentally or directly deduced from 2D or 3D structures. Thus, a molecule is represented by a vector in a D-dimensional vector space, where each dimension corresponds to a specific descriptor. Alternatively, distances (or dissimilarities) between objects can be used, e.g., kernels. The resulting feature space is also a vector space in which molecules occupy an N-dimensional subspace, where N is the number of molecules and, hence, the number of distances between one molecule and all others (including itself).



1.1. Apparent and Intrinsic Dimensionality

The dimensionality of descriptor space is formally defined as the number D of descriptors. The question then arises: is this dimensionality the “real” one for a given dataset? If some descriptors convey redundant information, this dimensionality is only apparent, and the space could be characterized by its intrinsic dimensionality IDim, which corresponds to the minimal number of features needed to describe the data set. The intrinsic dimensionality can be assessed using diverse methods (1), divided into local and global approaches. For instance, in the global Principal Component Analysis (PCA) approach, IDim is equal to the number of non-null eigenvalues of the entire data covariance matrix; in the local PCA approach, or Fukunaga-Olsen’s algorithm (2), a Voronoi tessellation is performed, followed by calculation of the number of normalized eigenvalues of the covariance matrix in each Voronoi region. Some other methods to assess IDim are described in the review by Camastra (1). Knowing the intrinsic dimensionality might be useful in order to assess the risk of information loss in dimensionality reduction. Indeed, some dimensionality reduction methods are based on the assumption that the data lie in the proximity of a linear or non-linear manifold of low dimensionality (D′ = 2 or 3). The information loss can be significant if IDim >> D′. Although the intrinsic dimensionality is an important concept that should be borne in mind, we will mainly refer to the apparent dimensionality D.
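For illustration, the global PCA estimate of IDim can be sketched in a few lines of Python with scikit-learn. The random matrix below is only a placeholder for a real N × D descriptor table, and the 95% explained-variance threshold is an arbitrary cut-off chosen for the example, not a value prescribed in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for an N x D descriptor matrix (e.g. ISIDA fragment counts).
rng = np.random.default_rng(0)
T = rng.random((200, 140))

pca = PCA().fit(T)
# Global PCA estimate of IDim: number of components needed to explain,
# say, 95% of the total variance (the threshold is a user choice).
cum_var = np.cumsum(pca.explained_variance_ratio_)
idim = int(np.searchsorted(cum_var, 0.95) + 1)
print(f"apparent D = {T.shape[1]}, estimated IDim = {idim}")
```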

1.1.1. The Curse of Dimensionality

The “Curse of Dimensionality” concept introduced by Bellman (3) encompasses the problems of data sampling in a high-dimensional space. Indeed, if n points are necessary to sample a one-dimensional space, then n^D points will be necessary to sample a D-dimensional space. For large D, it is rarely possible to gather a sufficient number of data points to train a model, which limits its predictive ability. According to Hughes’s observations (4), the predictive performance of a model trained on a small data set decreases when the dimensionality increases beyond its optimal value; on the other hand, the accuracy increases with the number of data points. A possible representation of a high-dimensional space is a hypercube. In three-dimensional space, most data points, if distributed uniformly, lie near the center of the cube (5); however, seemingly opposed to common sense, data points in a hypercube in high-dimensional space are mainly located in its corners. Why? This can be explained by plotting the ratio of the volume of the unit hypersphere with radius r = 1 to the volume of a hypercube with side length 2r, as shown in Figure 1. In two-dimensional space, 78% of the “hypercube” volume is within the hypersphere; 52% in three-dimensional space, and only 4% for D = 7, which means that 96% of the volume is occupied by the corners of the hypercube. The more the dimensionality increases, the more data points will be found in the corners. This in turn may become a serious problem for classification models, because the distance between corners is quite large, which makes the formation of compact clusters difficult.




Figure 1. Ratio of the volume of the unit hypersphere to the unit hypercube as a function of dimensionality.
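The percentages quoted above follow directly from the closed-form volumes of the D-ball and the hypercube; the short Python sketch below reproduces them (the D = 7 value, about 3.7%, is rounded to 4% in the text).

```python
import numpy as np
from scipy.special import gamma

def sphere_to_cube_ratio(D, r=1.0):
    """Volume of the D-ball of radius r divided by the volume of the cube of side 2r."""
    v_sphere = np.pi ** (D / 2) / gamma(D / 2 + 1) * r ** D
    v_cube = (2 * r) ** D
    return v_sphere / v_cube

for D in (2, 3, 7):
    print(D, round(sphere_to_cube_ratio(D), 3))   # ~0.785, 0.524, 0.037
```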

The neighborhood behavior may also be affected by the Curse of Dimensionality, which raises a serious problem for the performance of k-NN and other distance-based methods. The ratio of the difference between the maximal (dist_max) and minimal (dist_min) distances to the minimal distance from any reference point to a set of random points inside the hypercube tends to zero with the increase of dimensionality:

lim_{D→∞} (dist_max − dist_min) / dist_min = 0

This means that the concepts of neighborhood behavior and similarity may become irrelevant in spaces with a sufficiently high dimensionality. There are two common solutions to fight the Curse of Dimensionality: feature selection and dimensionality reduction. The feature selection procedure (6, 7) aims to decrease the noise caused by irrelevant and redundant descriptors, to yield better model generalization and to reduce overfitting. For example, a feature selection step is essential for Genome-Wide Association Studies (GWAS) and is applied to select the most important single nucleotide polymorphisms associated with specific traits (8–10). Feature selection can also be used as a preprocessing step for dimensionality reduction methods. This procedure can be supervised (e.g., selection of the descriptors which correlate the most with a class or activity to be predicted) or unsupervised (e.g., removal of descriptors highly correlated with other descriptors). Dimensionality reduction methods, on the other hand, do not merely remove irrelevant descriptors; they use the information contained in all descriptors to find a smaller number of new, transformed features. If the number of these new features is limited to 2 or 3, the data can be visualized in 2D or 3D spaces. Some modern dimensionality reduction techniques are considered below.
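As an illustration of the unsupervised variant, the sketch below drops one descriptor from every highly correlated pair; the 0.95 correlation threshold and the random stand-in data are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor table: rows = molecules, columns = descriptors.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 140)),
                 columns=[f"frag_{i}" for i in range(140)])

# Unsupervised filter: drop one descriptor from every pair whose absolute
# Pearson correlation exceeds 0.95 (the threshold is a user choice).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
```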



2. Dimensionality Reduction Approaches

There are two main types of dimensionality reduction techniques: linear and non-linear. Linear methods assume that the data points are located near a low-dimensional linear manifold, whereas non-linear methods are used to investigate more complex data structures lying near a low-dimensional non-linear manifold. We will discuss the main linear and non-linear methods with the help of one dataset consisting of 100 inhibitors (“red” class) and 100 decoys (“blue” class) of the enzyme acetylcholinesterase (AChE), randomly selected from the Directory of Useful Decoys (DUD) database (11). The data are presented in the space of 140 ISIDA fragment descriptors: atoms and bonds, length 2-8, with at least 10% defined values for each descriptor (12). Generative Topographic Mapping (GTM) calculations were performed with the in-house ISIDA/GTM program, whereas for the other methods some R and Python packages were used.

2.1. Linear Dimensionality Reduction

The goal of dimensionality reduction methods is to project N data points of dimensionality D (data matrix T) into a space of lower dimensionality L (data matrix X):

T (N × D) → X (N × L), with L < D

Linear dimensionality reduction achieves this goal by applying a linear transformation U to the N data vectors in D-dimensional space:

X = T U

where U is optimized with respect to an objective function; for example, the reconstruction error in PCA (see below). In this section, we will briefly explore several linear dimensionality reduction methods: PCA, Canonical Correlation Analysis (CCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), linear MultiDimensional Scaling (MDS), and Exploratory Factor Analysis (EFA).

2.1.1. Principal Component Analysis

PCA (13) is one of the most famous linear dimensionality reduction methods. In PCA, the linear transformation is performed by a matrix U whose columns are the L eigenvectors of the data covariance matrix with the L highest eigenvalues. The eigenvectors identify the directions of the new variables (principal components), and the eigenvalues the variance of the data along these directions. The goal is to maximize the variance of the projected data, or, equivalently, to minimize the reconstruction error:

E = || T − X U^T ||²

where T is the original data and X U^T is the data reconstructed with the L eigenvectors.

A PCA can be performed by centering the data and applying a singular value decomposition to calculate the eigenvectors and eigenvalues of the data covariance matrix. There is no supplementary parameter to tune.
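A minimal PCA projection and reconstruction-error calculation with scikit-learn might look as follows; the random matrix stands in for the ISIDA descriptor table used in this chapter.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the N x 140 ISIDA descriptor matrix.
T = np.random.default_rng(0).random((200, 140))

pca = PCA(n_components=2)         # centering is done internally
X = pca.fit_transform(T)          # N x 2 projection for the 2D map
T_rec = pca.inverse_transform(X)  # reconstruction with the 2 leading eigenvectors
err = np.linalg.norm(T - T_rec) ** 2
```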

2.1.2. Multidimensional Scaling


The goal of metric MDS (14, 15) is to obtain a projection that preserves as much as possible the distances between the data points in the initial space. To achieve this goal, the algorithm minimizes a “stress” function calculated as a squared error between the distances in the initial space {d(i,j)} and in the lower-dimensional space {d′(i,j)}, measured between the ith and jth objects:

Stress = Σ_{i<j} (d(i,j) − d′(i,j))²

Metric MDS preserves the initial space distances, whereas non-metric MDS only preserves the rank order of distances. MDS with Euclidean distances leads to results similar to PCA (or KPCA with a linear kernel). On the other hand, MDS can use any distance measure, as demonstrated in Figure 2. There is no supplementary parameter to tune for MDS; however, results may vary depending on the dissimilarity type used to construct the distance matrix. The basic principle of MDS is one of the cornerstones of dimensionality reduction, and has been used as a foundation for several dimensionality reduction methods, such as Sammon mapping or Stochastic Proximity Embedding (SPE) (16).

Figure 2. MDS plots for the AChE dataset corresponding to different dissimilarity measures: (a) Euclidean, (b) Manhattan, (c) Jaccard. Inhibitors and decoys of acetylcholinesterase are represented by red and blue points, respectively. The initial dimensionality of the ISIDA descriptor space is D = 140.
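The maps in Figure 2 were built with R; an equivalent sketch with scikit-learn and SciPy, using a precomputed dissimilarity matrix for each measure, could look like this (binary descriptor values are assumed for the Jaccard case; the random matrix is a placeholder).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

T = np.random.default_rng(0).random((200, 140))   # stand-in descriptor matrix

maps = {}
for metric in ("euclidean", "cityblock", "jaccard"):   # Manhattan = cityblock
    data = T > 0 if metric == "jaccard" else T          # Jaccard needs binary input
    D = squareform(pdist(data, metric=metric))
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    maps[metric] = mds.fit_transform(D)                 # N x 2 coordinates per measure
```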

2.1.3. Canonical Correlation Analysis

The goal of Canonical Correlation Analysis (CCA) (17) is to maximize the correlation between two linear combinations of variables. Let our data be described in two spaces, so that we have data matrices T_A in Space 1 and T_B in Space 2. Then, two transformations are performed:

X_A = T_A U_A and X_B = T_B U_B


where U_A and U_B are optimized by maximizing the correlation between the two projections:

ρ = corr(X_A, X_B) = corr(T_A U_A, T_B U_B) → max

X_A and X_B are called the canonical variates. This algorithm is useful when the objects are described in two different spaces, e.g., molecules described in two descriptor spaces, or molecules described in an experimental activity space and a descriptor space. The visualization of individual objects can be made by plotting the coordinates of the first L canonical variates; for scatterplots, if L = 2, the axes are defined by the first two canonical variates of one of the spaces. This method does not require tuning any other parameter. We provide an example of CCA dimensionality reduction for the acetylcholinesterase (AChE) data set using two spaces (Figure 3, generated with scikit-learn (18)): the space of ISIDA descriptors (atoms and bonds, length 2-8) and the space of 186 MOE 2D descriptors.

Figure 3. CCA map for the AChE dataset optimized using both ISIDA and MOE descriptor spaces and visualized using the first two canonical variates.
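A sketch of the same two-space CCA analysis with scikit-learn (which the chapter also uses) is shown below; the two random matrices are placeholders for the ISIDA and MOE descriptor tables.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T_isida = rng.random((200, 140))   # stand-in for the ISIDA fragment descriptors
T_moe = rng.random((200, 186))     # stand-in for the MOE 2D descriptors

cca = CCA(n_components=2)
X_a, X_b = cca.fit_transform(T_isida, T_moe)   # canonical variates of each space
# Scatter either X_a or X_b to obtain a 2D map such as Figure 3.
```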

2.1.4. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) (19, 20) can be used for classification tasks with labeled data. LDA maximizes the ratio of interclass variance to intraclass variance, and projects the data into an L-dimensional space with L = NC − 1, where NC is the number of classes. An example of LDA for a dataset with two classes (inhibitors and decoys of AChE) is given in Figure 4 (generated with scikit-learn (18)); the model can be represented as a bar plot since, for two classes, L = 2 − 1 = 1. LDA can also be used as a supervised linear classifier, and has no supplementary parameter to tune.


Figure 4. One-dimensional LDA map for the AChE dataset described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
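A minimal LDA projection for a two-class problem with scikit-learn (L = 1) might be written as follows; labels and descriptors are simulated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
T = rng.random((200, 140))                         # stand-in descriptor matrix
y = np.array([1] * 100 + [0] * 100)                # 1 = inhibitor, 0 = decoy

lda = LinearDiscriminantAnalysis(n_components=1)   # L = NC - 1 = 1 for two classes
x = lda.fit_transform(T, y).ravel()                # one coordinate per molecule
```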

2.1.5. Independent Component Analysis

Independent Component Analysis (ICA) (21) is mainly used in blind signal separation but can also be employed for dimensionality reduction. A typical case of blind signal separation is the “cocktail party problem”, where individual voices engaged in simultaneous conversations have to be differentiated, i.e., the source signals have to be retrieved. ICA is dependent on a random initialization but otherwise has no parameter to tune. An example of visualization is given in Figure 5 (made with scikit-learn (18)).

Figure 5. ICA map for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys), where the latent variables are estimated by maximizing the independence of non-Gaussian signal components.
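A corresponding FastICA sketch with scikit-learn is given below; fixing random_state pins down the random initialization mentioned above, and the random matrix is a placeholder for the descriptor table.

```python
import numpy as np
from sklearn.decomposition import FastICA

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
ica = FastICA(n_components=2, random_state=0)      # random_state fixes the initialization
X = ica.fit_transform(T)                           # 2 independent components per molecule
```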

2.1.6. Exploratory Factor Analysis

Exploratory Factor Analysis (EFA) is used to build linear generative latent variable models by means of a linear mapping with Gaussian noise (22), so that, for a data matrix T:

T = X U^T + M + E

where X are the latent variables or factors, U is the factor loadings matrix, E is Gaussian noise and M is a bias matrix. The square of each factor loading is the fraction of the variance of a variable explained by a factor; the factor loadings can be fitted by maximum likelihood. EFA usually provides results close to PCA (cf. Figure 6 for an example), but its philosophy is quite different: in PCA, the principal components are linear combinations of the initial variables (descriptive model), whereas in FA, the initial variables are modeled as linear combinations of the factors (generative model). In PCA, all data variance is analyzed, whereas in EFA only the common variance is accounted for. EFA is to be distinguished from confirmatory factor analysis (CFA), which aims to confirm a hypothesized structure instead of finding the latent structure. Another family of dimensionality reduction methods related to factor analysis is Non-Negative Matrix Factorization (NNMF) (23), which applies non-negativity constraints to non-negative input matrices.
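scikit-learn's FactorAnalysis provides a maximum-likelihood EFA; a minimal sketch on placeholder data follows.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
fa = FactorAnalysis(n_components=2, random_state=0)
X = fa.fit_transform(T)        # factor scores used as 2D coordinates
loadings = fa.components_.T    # D x 2 factor loadings
```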

Figure 6. EFA map for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).

2.2. Non-Linear Dimensionality Reduction

In this section, we will review some non-linear dimensionality reduction techniques listed in Table 1, including Kernel Principal Component Analysis (KPCA), Sammon mapping, Isomap, Locally Linear Embedding (LLE), Laplacian Eigenmaps, t-SNE, autoencoders, Self-Organizing Maps (SOMs) and GTM. Sammon mapping is an MDS variant; Isomap, LLE and Laplacian Eigenmaps are based on the construction of a k-NN graph and are closely related to KPCA; t-SNE is a distance-preserving, probabilistic method; autoencoders, SOM and GTM are based on neural networks. For all these dimensionality reduction techniques, the underlying assumption is that the data lie on a low-dimensional non-linear manifold embedded in the D-dimensional descriptor space. Most of them rely on preserving the neighborhood of points, and aim at unrolling the manifold to obtain a lower-dimensional visualization.

2.2.1. Kernel Principal Component Analysis

Kernel Principal Component Analysis (KPCA) (24) is very similar to PCA; however, instead of working with a data matrix T of N instances and D dimensions, KPCA operations are performed on an N × N kernel matrix K, which should be previously centered. The kernel matrix can be considered as a similarity matrix which is positive semi-definite and symmetric. The kernel approach allows us to work in an implicit feature space via the kernel trick: data points t are mapped into a feature space of higher dimensionality with a function φ(t).


This function, however, is never computed; instead, the inner product between objects in feature space is reduced to a kernel function in input space:

k(t_i, t_j) = ⟨φ(t_i), φ(t_j)⟩

K characterizes the objects in the possibly infinite-dimensional feature space. The reason for using the kernel trick is that it is easier to separate objects linearly in a space of higher dimensionality; therefore, this approach is particularly interesting for non-linear clustering (e.g., with k-means). Examples of KPCA maps obtained with the R package kernlab (25) are given in Figure 7 for the AChE data set described by ISIDA fragment descriptors; RBF and polynomial kernels were used, and the γ parameter regulating the width of the RBF kernel was set to 0.2. It should be noted that a PCA model can be seen as a KPCA model with a linear kernel (24) computed from centered data. Depending on the kernel, different parameters should be tuned. For instance, for the RBF kernel, it is the γ parameter:

k(t_i, t_j) = exp(−γ ||t_i − t_j||²)

For the polynomial kernel, the user should tune the degree d, a constant c and potentially the slope a:

k(t_i, t_j) = (a ⟨t_i, t_j⟩ + c)^d

Table 1. Some non-linear dimensionality reduction techniques.

Name                   Key Concept
KPCA                   Performs an eigendecomposition of a kernel
Sammon                 Preserves initial distances, favors small distances
Isomap                 Preserves geodesic distances
LLE                    Performs linear local approximations
Laplacian Eigenmaps    Performs an eigendecomposition of the graph Laplacian
t-SNE                  Minimizes divergence between initial and latent distributions
Autoencoder            Learns to reconstruct the data
SOM                    Topology-preserving neural network
GTM                    Fits iteratively a manifold to the data, probabilistic SOM



Figure 7. (a) Conventional PCA, compared to kernel PCA maps with (b) an RBF kernel and (c) a polynomial kernel for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
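The maps in Figure 7 were produced with the R package kernlab; an analogous sketch with scikit-learn's KernelPCA is given below. The RBF γ = 0.2 matches the text, whereas the polynomial degree and constant are illustrative guesses rather than the values used for Figure 7c.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix

kpca_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=0.2)        # gamma as in the text
kpca_poly = KernelPCA(n_components=2, kernel="poly", degree=3, coef0=1.0)  # assumed settings
X_rbf = kpca_rbf.fit_transform(T)
X_poly = kpca_poly.fit_transform(T)
```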

2.2.2. Sammon Mapping

Sammon mapping is a non-linear variant of MDS; it is the same algorithm, except that the stress function to be minimized is normalized by the initial space distances (26):

E = (1 / Σ_{i<j} d(i,j)) Σ_{i<j} [d(i,j) − d′(i,j)]² / d(i,j)

where {d(i,j)} are the distances in the initial space and {d′(i,j)} the distances in the reduced space, measured between the ith and jth objects. By dividing by the initial space distances, the optimization favors small distances (larger contribution to the stress function) over large distances; since the non-linearity of the data is better approximated at small distances, the method integrates some non-linear information in this way. Three examples using different dissimilarity measures are shown in Figure 8, made with the R package MASS (27). Like MDS, Sammon mapping does not require any parameter optimization, but results may vary depending on the chosen dissimilarity type.

Figure 8. Sammon maps with different dissimilarities: (a) Euclidean, (b) Manhattan, (c) Jaccard for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
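Sammon mapping is not part of scikit-learn; a minimal (and deliberately naive) sketch that minimizes the stress above with a general-purpose optimizer is shown below. It assumes Euclidean initial-space distances and random placeholder data, and is not the MASS::sammon implementation used for Figure 8.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.optimize import minimize

def sammon_stress(x_flat, d_high, n):
    """Sammon stress of a flattened N x 2 configuration vs. initial-space distances."""
    d_low = pdist(x_flat.reshape(n, 2))                      # distances on the 2D map
    return np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high)

rng = np.random.default_rng(0)
T = rng.random((100, 140))                                   # stand-in descriptor matrix
d_high = np.maximum(pdist(T, metric="euclidean"), 1e-12)     # guard against zero distances
n = T.shape[0]

x0 = rng.normal(scale=1e-2, size=2 * n)                      # random 2D starting layout
res = minimize(sammon_stress, x0, args=(d_high, n),
               method="L-BFGS-B", options={"maxiter": 300})
X_2d = res.x.reshape(n, 2)
```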

2.2.3. Isomap

Isomap (28, 29) is an MDS-derived method which uses geodesic distances computed from a neighborhood graph, and learns the global geometry of a dataset by using local metric information. Isomap can be seen as a special case of KPCA with geodesic distances as the kernel. The Isomap algorithm consists of three steps:

1. Draw a graph where each data point is connected to its nearest neighbors (k-NN graph); the weight of the edge between neighboring vertices (objects) i and j is equal to the initial space distance d(i,j).
2. Compute the shortest path or “geodesic” distances (matrix G) between all points, minimizing the sum of the edge weights d(i,j) along the graph, with, e.g., Dijkstra’s algorithm for undirected graphs (30).
3. Perform a classical MDS preserving the geodesic distances G.

The Isomap calculations yield the L highest eigenvalues of the geodesic distance matrix G. The Isomap model depends on the number of neighbors k chosen to build the k-NN graph. Three examples using different dissimilarity measures are shown in Figure 9, with 30 neighbors (made with the vegan R package (31)).

Figure 9. Isomaps with 30 neighbors and different dissimilarity measures: (a) Euclidean, (b) Manhattan, (c) Jaccard for the AChE dataset described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
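An equivalent Isomap sketch with scikit-learn, using a Euclidean k-NN graph and the 30 neighbors quoted for Figure 9, could be:

```python
import numpy as np
from sklearn.manifold import Isomap

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
iso = Isomap(n_neighbors=30, n_components=2)       # 30 neighbors, as in the text
X = iso.fit_transform(T)                           # Euclidean distances by default
```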

2.2.4. Locally Linear Embedding

Locally Linear Embedding (LLE) (32) computes an embedding preserving the neighborhood of data points. LLE assumes that data points in the initial space and their neighbors lie close to a locally linear patch of the manifold. The dimensionality reduction is performed in 3 steps:

1. The k nearest neighbors {t_k} of each point t_i are computed (k-NN graph).
2. The weights {V_ik} best describing the data points {t_i} as a combination of their neighbors {t_k} are found by minimizing the following cost function:

E(V) = Σ_i || t_i − Σ_k V_ik t_k ||²

3. Projection into the L-dimensional space is performed by finding the low-dimensional coordinates of the data points {x_i} (with neighbors {x_k}) minimizing the following cost function:

Φ(X) = Σ_i || x_i − Σ_k V_ik x_k ||²

Therefore, the data points can be reconstructed from a combination of their neighbors using the same weights in the initial and the low-dimensional space. Like Isomap, LLE can be seen as a variant of KPCA. The parameter to tune is the number of nearest neighbors k used to construct the k-NN graph. An example of the method with k = 30, calculated with scikit-learn (18), is given in Figure 10.

Figure 10. LLE map for the AChE data set, described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
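A minimal LLE sketch with scikit-learn, using the same k = 30 and placeholder data:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
lle = LocallyLinearEmbedding(n_neighbors=30, n_components=2)
X = lle.fit_transform(T)                           # 2D coordinates for the map
```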

2.2.5. Laplacian Eigenmaps

Similarly to LLE and Isomap, the Laplacian Eigenmaps technique (33) is based on a neighborhood graph (k-NN graph). It builds a weighted adjacency matrix with a heat kernel or binary weights, followed by computation of the embedding from the normalized graph Laplacian. The algorithm includes 3 steps:

1. The k nearest neighbors {t_k} of each point t_i are computed (k-NN graph).
2. The weights V are computed using a heat kernel:

V_ij = exp(−||t_i − t_j||² / γ)

where γ is a tunable parameter. The weights {V_ij} can also be set equal to 1 if t_i and t_j are connected in the k-NN graph and equal to 0 otherwise.

3. The projection of points into the lower-dimensional space is performed by solving the eigenvalue problem for the graph Laplacian:

L U = λ D U

where D_ii = Σ_j V_ij is the degree of vertex i, L is the Laplacian matrix L = D − V, λ the eigenvalues and U the eigenvectors used for the embedding in the space of lower dimensionality.


Laplacian Eigenmaps, like LLE and Isomap, can therefore be interpreted as a variant of KPCA, with a tunable parameter k. If the heat kernel is chosen, the parameter γ should also be tuned. An example of dimensionality reduction with Laplacian Eigenmaps as implemented in scikit-learn (18) is given in Figure 11.

Figure 11. Laplacian Eigenmaps for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
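In scikit-learn, Laplacian Eigenmaps are available as SpectralEmbedding; a minimal sketch with a binary k-NN affinity (k = 30 is an assumed value, not one stated for Figure 11) follows.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
# k-NN graph with binary (connectivity) weights; an RBF affinity would play
# the role of the heat kernel, with gamma as the tunable width.
le = SpectralEmbedding(n_components=2, affinity="nearest_neighbors", n_neighbors=30)
X = le.fit_transform(T)
```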

2.2.6. t-Distributed Stochastic Neighbor Embedding

t-Distributed Stochastic Neighbor Embedding (t-SNE) (34), a variant of Stochastic Neighbor Embedding (SNE), is an efficient method for data sets with more than one embedded manifold. It focuses on local data structures, and is a good choice for separating clusters. As in any MDS-related method, the goal of SNE and t-SNE is to preserve dissimilarities in the original data. Conventional SNE measures Gaussian joint probabilities in the original space, so that the probability p_ij is proportional to the similarity between points t_i and t_j. Gaussian “induced” probabilities q_ij are also measured for each pair of points x_i and x_j in the low-dimensional space. In other words, the similarities between points in the initial space and in the latent space are translated into probabilities. The positions of points in the latent space are updated by minimizing the Kullback-Leibler divergence (35) between the joint probabilities in the input and latent spaces using gradient descent; the KL divergence can be related in this case to the MDS error function. In t-SNE, the probabilities q_ij between each pair of points in the low-dimensional space are computed using a Student t-distribution instead of a Gaussian, so that points are not gathered at the same place (crowding effect): the Student t-distribution has heavy tails that allow moderate distances in the initial space to be represented by larger distances on the 2-dimensional map.



t-SNE also uses a symmetrized version of the cost function. SNE and t-SNE possess several common parameters: the perplexity of the Gaussian distributions in the initial high-dimensional space, the learning rate, and optionally the early exaggeration, a multiplication factor applied to the probabilities in the initial space at early stages of the optimization process. The perplexity can be related to a measure of the number of neighbors (cf. Isomap); the learning rate used for gradient descent has a great impact on the shape of the map, and the early exaggeration improves the separation of clusters. We show a t-SNE example in Figure 12, computed with the scikit-learn (18) implementation, with learning rate = 100, perplexity = 30, and early exaggeration = 6.

Figure 12. t-SNE representation of the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).
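The same settings can be passed to scikit-learn's TSNE; the sketch below uses the learning rate, perplexity and early exaggeration quoted above on placeholder data.

```python
import numpy as np
from sklearn.manifold import TSNE

T = np.random.default_rng(0).random((200, 140))    # stand-in descriptor matrix
tsne = TSNE(n_components=2, perplexity=30, learning_rate=100,
            early_exaggeration=6, random_state=0)  # values quoted in the text
X = tsne.fit_transform(T)
```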

2.2.7. Autoencoders

Autoencoders (36) are symmetric multilayer neural networks with a small central layer, trained to encode and reconstruct the initial dataset (Figure 13). In other words, the autoencoder learns an approximation of the identity function. Owing to the small central layer, the network automatically finds a small number of hidden features in the dataset. The autoencoder uses backpropagation to adjust its weights and optimize its cost function so that its output becomes as similar as possible to its input (the initial data).

Figure 13. Autoencoders are symmetric multilayer neural networks that learn to reconstruct the data; in this diagram, 2 central hidden units are used to reconstruct the 3 initial variables.


The autoencoder is an attractive method, very simple to understand, but with several parameters to tune. In Figure 14, we show an example of dimensionality reduction performed by the simplest type of network (implemented in the autoencoder R package (37)) with only 3 layers (input, hidden, output), and 2 hidden units corresponding to the number of “latent” dimensions.

Figure 14. Autoencoder dimensionality reduction for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).

This network, with 3 layers, 2 hidden units and logistic activation functions, depends on four other parameters: the random weight initialization (here, random values drawn from a normal distribution with μ = 0 and σ² = 0.001), a weight decay or regularization coefficient λ (here set to 2 × 10−4), and, since we used a sparse autoencoder, two parameters controlling sparsity: the weight of the sparsity penalty term in the autoencoder objective (set to 6) and the sparsity parameter (set to 0.5).
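Figure 14 was produced with the R autoencoder package; a rough Python analogue can be sketched with scikit-learn's MLPRegressor trained to reproduce its input. Note the differences: alpha here is a plain L2 penalty rather than the exact weight-decay/sparsity scheme described above, and no sparsity term is included.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder descriptor matrix, assumed scaled to [0, 1].
rng = np.random.default_rng(0)
T = rng.random((200, 140))

# 3-layer autoencoder: D inputs -> 2 hidden units -> D outputs.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="logistic",
                  alpha=2e-4, max_iter=5000, random_state=0)
ae.fit(T, T)   # learn to reconstruct the input

# 2D embedding = activations of the central hidden layer.
hidden = 1.0 / (1.0 + np.exp(-(T @ ae.coefs_[0] + ae.intercepts_[0])))
```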

2.2.8. Self-Organizing Maps

Kohonen maps or Self-Organizing Maps (SOMs) (38) are neural networks inspired by the structure of biological neurons. The particularity of SOM in comparison with other neural networks is the conservation of the initial space topology: neighboring instances in the initial space are supposed to be neighbors on the 2D map. The algorithm begins by arranging artificial “nodes” on the 2D map, usually on a regular grid; to each node is associated a weight vector of dimensionality D, the dimensionality of the initial space. Distances can then be computed between the nodes’ weight vectors and the instances in the initial space; the node closest to each data instance is retrieved, and its weights, as well as those of its neighbors, are updated to move closer to the data instance. The final position of a data point on the 2D map is the position of its best matching node. As shown in Figure 15, which presents a SOM computed with the kohonen (39) package, the representation is usually that of a regular grid.



Different ways have been designed to better visualize the proportion of data points mapped at each node; here, we represent higher populations by a higher color intensity and color each node according to its major class. SOM has 4 main parameters: the learning rate, the neighborhood size, the strategy used to decrease the learning rate and neighborhood size as training progresses, and the resolution (the number of nodes, set to 10 × 10 in Figure 15). The SOM can be randomly initialized (as in Figure 15), or the weights can be chosen from the principal components (PCA initialization).

Figure 15. SOM example representing the density of inhibitors (red) and decoys (blue) of AChE by the node size.
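Figure 15 was computed with the kohonen R package; the minimal NumPy sketch below implements the same idea (random initialization, decaying learning rate and neighborhood) for a 10 × 10 grid. All numeric settings are illustrative, not those used for the figure.

```python
import numpy as np

def train_som(T, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal rectangular-grid SOM; returns node weights and node grid coordinates."""
    rng = np.random.default_rng(seed)
    nx, ny = grid
    nodes = np.array([(i, j) for i in range(nx) for j in range(ny)], dtype=float)
    W = rng.random((nx * ny, T.shape[1]))              # random weight initialization
    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)                 # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)           # shrinking neighborhood
        x = T[rng.integers(len(T))]                    # pick a random data instance
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))    # best matching unit
        h = np.exp(-((nodes - nodes[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                 # pull BMU and its neighbors toward x
    return W, nodes

T = np.random.default_rng(1).random((200, 140))        # stand-in descriptor matrix
W, nodes = train_som(T)
bmus = np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in T])
xy = nodes[bmus]                                       # 2D grid position of each molecule
```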

2.2.9. Generative Topographic Mapping

GTM (40, 41) was introduced by Bishop et al. in the 1990s. It provides a probabilistic framework for Kohonen maps and guarantees convergence, which can be monitored with an objective function. GTM is a manifold-based non-linear dimensionality reduction method that preserves the topological ordering: points that are close on the manifold in the low-dimensional space are also close in the high-dimensional space. The method is “generative”, i.e., it is assumed that the D dimensions are generated by a smaller number of latent or hidden variables. GTM has 4 user-defined parameters: the number of grid nodes (map resolution), the number of Radial Basis Function (RBF) centers, the regularization coefficient, and a width factor for the RBFs. The probabilistic framework of GTM can be used for regression and classification models and to create different types of data visualization. There are many variants of GTM. For small datasets, the conventional algorithm is sufficient; for large amounts of data, however, the incremental version (iGTM), also described by Bishop (40), is a valuable solution. For data in a similarity space, a kernel algorithm was introduced by Olier et al. (42); an LTM algorithm was introduced by Kabán and Girolami to deal with binary or count data (43), and was applied by Owen et al. to the visualization of molecular fingerprints (44).



Table 2. Performance (Balanced Accuracy, BA) of dimensionality reduction techniques separating actives and decoys from the AChE data set in 2D latent space.

Method                      Figure        BA
MDS (Euclidean), PCA        Figure 2a     0.79
MDS (Manhattan)             Figure 2b     0.83
MDS (Jaccard)               Figure 2c     0.77
CCA                         Figure 3      0.87
ICA                         Figure 5      0.87
EFA                         Figure 6      0.79
KPCA (RBF)                  Figure 7b     0.90
KPCA (Polynomial)           Figure 7c     0.84
Sammon (Euclidean)          Figure 8a     0.76
Sammon (Manhattan)          Figure 8b     0.66
Sammon (Jaccard)            Figure 8c     0.60
Isomap (Euclidean)          Figure 9a     0.79
Isomap (Manhattan)          Figure 9b     0.78
Isomap (Jaccard)            Figure 9c     0.76
Locally Linear Embedding    Figure 10     0.79
Laplacian Eigenmaps         Figure 11     0.83
t-SNE                       Figure 12     0.91
Autoencoder                 Figure 14     0.80
SOM                         Figure 15     0.92
GTM                         Figure 16a    0.90
KGTM (RBF)                  Figure 16b    0.89
KGTM (Polynomial)           Figure 16c    0.87

2.3. Class Separation in the Latent Space

The ability of a particular method to separate classes in the 2D latent space can be assessed in the framework of a k-NN approach, where a “predicted” class is attributed to a given object by considering its k nearest neighbors on the map. The accuracy of the predictions can be measured by the balanced accuracy (BA), varying between 0 and 1; the model is inefficient for BA < 0.5, whereas perfect classification corresponds to BA = 1. Table 2 assembles the BA values calculated for the maps of the AChE dataset presented in Figures 2, 3, 5-16. Most of these methods, especially SOM, GTM, kernel PCA with an RBF kernel and t-SNE, display a good separation of actives and decoys.
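A possible implementation of this check with scikit-learn is sketched below; the chapter does not state the value of k or the validation protocol, so k = 5 and 5-fold cross-validation are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import balanced_accuracy_score

# X_2d: N x 2 latent coordinates from any of the maps above; y: class labels.
rng = np.random.default_rng(0)
X_2d = rng.random((200, 2))                      # stand-in for a 2D map
y = np.array([1] * 100 + [0] * 100)              # 1 = inhibitor, 0 = decoy

knn = KNeighborsClassifier(n_neighbors=5)        # k is a user choice
y_pred = cross_val_predict(knn, X_2d, y, cv=5)
ba = balanced_accuracy_score(y, y_pred)
```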


Figure 16. (a) Conventional GTM, compared to kernel GTM maps with (b) an RBF kernel and (c) a polynomial kernel, for the AChE data set described by ISIDA fragment descriptors (red for inhibitors, blue for decoys).

Table 3. InfoVis techniques.

Display Type                  Some examples
Standard                      scatterplots, bar plots
Tables                        Bertin matrices, heatmaps
Parallel coordinates family   parallel coordinates, Andrews curves
Iconographic                  Chernoff faces, star glyphs
Pixel-based                   VisDB
Stacked displays              tree maps, dimensional stacking

3. InfoVis Techniques for High-Dimensional Data

Information Visualization (InfoVis) techniques provide visual and/or interactive representations of data, and rely on the interpretation of users. The principle is to suggest new “points of view” on a dataset. Some InfoVis techniques are listed in Table 3. Here we mostly focus on table representations, the “parallel coordinates” family and iconographic displays, which are easily applicable to any data set and are available in R or Python packages.

3.1. Standard Approaches

The simplest visualization approach is a 2D (or 3D) scatterplot, where features (descriptors) are visualized two by two (or three by three). Sometimes this simple but popular method of analysis can detect simple patterns not visible with a complex machine learning model. An example of a scatterplot matrix for the AChE data set described by 3 MOE descriptors is given in Figure 17.
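A scatterplot matrix such as Figure 17 can be drawn with pandas; the frame below contains simulated values standing in for the three MOE descriptors.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical frame with 3 MOE-like descriptors plus a class column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"HBA": rng.integers(0, 10, 200),
                   "HBD": rng.integers(0, 6, 200),
                   "rigid_bonds": rng.integers(0, 20, 200),
                   "cls": ["inhibitor"] * 100 + ["decoy"] * 100})

colors = df["cls"].map({"inhibitor": "red", "decoy": "blue"})
pd.plotting.scatter_matrix(df[["HBA", "HBD", "rigid_bonds"]], c=colors, figsize=(6, 6))
plt.show()
```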


Figure 17. Scatterplot matrices for the AChE dataset described by 3 MOE descriptors: number of H bond acceptors O and N (HBA), number of H bond donors OH and NH (HBD), and number of rigid bonds.

3.2. Bertin’s Permutation Matrices

In Bertin permutation matrices (45), the data is represented by a table in which the rows are features (descriptors) and the columns instances. Feature values are represented by bar graphs. The rows and columns are re-ordered in such a way that similar features are placed next to each other. Heatmaps and heightmaps are variants of Bertin matrices, where descriptor values are represented by colors mapped on a 2D table or by heights in a 3D representation, respectively. In the Bertin matrix for the AChE data set (Figure 18), descriptors are re-arranged according to their inter-correlations.

Figure 18. Bertin matrix generated with the bertin R package (46). The values of five structural descriptors (rows) are given for five AChE inhibitors (columns in red square) and five decoys (blue square); the descriptor values above the mean are highlighted in black. The symbol “*” in structural fragments represents an aromatic bond; correlated descriptors are grouped together.

3.3. Parallel Coordinates Family

Parallel coordinates (47, 48) are certainly the oldest and most popular visualization technique for datasets characterized by a small number of descriptors. Each line represents a data instance; the y-axis gives the descriptor value and the x-axis the descriptor name (Figure 19).


Variants of parallel coordinates include circular parallel coordinates and RadViz (49), which considers springs attached to equally spaced “data anchors” on one end and to data points on the other, where the spring constant for each point and dimension is equal to the point coordinate. These methods share the disadvantage of being highly dependent on the ordering of the dimensions, along the x-axis for parallel coordinates or around the circle for radial methods.

Figure 19. Parallel coordinates for acetylcholinesterase inhibitors (red) and decoys (blue) in the AChE data set described by 6 MOE descriptors: number of H bond acceptors O and N (HBA), number of H bond donors OH and NH (HBD), number of rigid and rotatable bonds, number of rings, and logP (octanol/water partition coefficient). The plots were generated with the R package MASS (27).

Andrews curves (50) are an extension of parallel coordinates and provide a unique representation which does not depend on the order of the descriptors. Instances (in our case, molecules) are represented by smooth curves on a 2-dimensional plot. Each data point x defines a Fourier series whose coefficients are the descriptor values x1, x2, x3, ..., using the function f(t):

f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ...

f(t) can then be plotted from t = −π to t = π. This method is especially useful for outlier detection, and is also used for clustering; an example featuring acetylcholinesterase inhibitors and decoys is given in Figure 20, for the same variables that were used for the parallel coordinates in Figure 19.
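Both parallel coordinates and Andrews curves are available in pandas (the chapter used the MASS and andrews R packages); a sketch on simulated values of the six descriptors follows.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical frame with the 6 MOE-like descriptors used in Figures 19-20.
rng = np.random.default_rng(0)
df = pd.DataFrame({"HBA": rng.integers(0, 10, 200),
                   "HBD": rng.integers(0, 6, 200),
                   "rigid": rng.integers(0, 20, 200),
                   "rotatable": rng.integers(0, 15, 200),
                   "rings": rng.integers(0, 6, 200),
                   "logP": rng.normal(3, 1.5, 200),
                   "cls": ["inhibitor"] * 100 + ["decoy"] * 100})

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
pd.plotting.parallel_coordinates(df, "cls", color=("red", "blue"), ax=axes[0])
pd.plotting.andrews_curves(df, "cls", color=("red", "blue"), ax=axes[1])
plt.show()
```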

3.4. Iconographic Displays

In iconographic techniques (52), each descriptor is encoded by a “glyph”. These techniques are highly biased and often difficult to interpret. Chernoff faces are certainly the most illustrative example of iconographic displays. In the example for the AChE dataset characterized by 140 ISIDA descriptors (Figure 21), the 2D coordinates of the molecules were generated by GTM. Some of the descriptors are encoded by facial features such as the nose, the eyes, and so on; for example, the color of the faces represents the number of CC and CCN fragments.


An analogous representation is the star plot, where descriptor values are represented as spokes radiating from the center of a “star” glyph.

Figure 20. Andrews curves, where each curve represents one inhibitor (red) or decoy (blue) of acetylcholinesterase in the AChE data set generated with the andrews R package (51).

Figure 21. Example of application of Chernoff faces to the AChE data set (one molecule = one face). The 2D coordinates of the molecules were generated by GTM, whereas the face features correspond to ISIDA fragment descriptors. The symbol “*” in structural fragments represents an aromatic bond. The R package aplpack (53) was used to generate the plot.

3.5. Other Techniques


Some other visualization techniques include hierarchical and pixel-based methods. Hierarchical techniques are mostly used to deal with hierarchical data; Treemap (54) and Dimensional Stacking (55) are examples, and hierarchical clustering could also be included in this category. Treemap encodes the data with nested rectangles; Dimensional Stacking was originally used for binary data and divides a grid into embedded rectangles representing the individual dimensions. Pixel-based methods such as VisDB (56) encode data values by pixels: each data instance is represented by a window where each descriptor is mapped to a specific pixel colored by the descriptor value.

4. Conclusion

Visualization of chemical data described by a large number of chemical descriptors represents a real challenge. Indeed, the loss of information is a permanent risk, and the goal is to produce an informative and interpretable image of a dataset immensely rich in data points (molecules) and features (descriptors). There is no unique “best” solution for this problem. Even such a relatively simple object as an apple can be observed from different perspectives: by a bird flying in the sky, by the human being holding the apple in his or her hands, by the tiny insect crawling on its stalk, and so on. The bird sees only a blurry point, the human being sees the “whole” apple (or so he or she thinks), and the insect has no idea that the stalk it walks on belongs to an apple. All have different perspectives, and each of these visualizations is a complementary piece of the “apple” puzzle. Dimensionality reduction methods use mathematical techniques to transform the data and make them visible in a 2D space. Information Visualization (InfoVis) techniques use simple graphical tricks to represent the data in different ways, using colors, shapes, and other tools belonging to the semiology of graphics. Dimensionality reduction and information visualization must act in harmony to produce informative and useful graphical representations of chemical datasets; different combinations of these methods provide complementary views on the “chemical space puzzle”.

References

1. Camastra, F. Data Dimensionality Estimation Methods: A Survey. Pattern Recognit. 2003, 36, 2945–2954.
2. Fukunaga, K.; Olsen, D. R. An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE Trans. Comput. 1971, C-20, 176–183.
3. Bellman, R. E. Dynamic Programming; Courier Corporation, 1957.
4. Hughes, G. On the Mean Accuracy of Statistical Pattern Recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
5. Zaki, M. Data Mining and Analysis: Fundamental Concepts and Algorithms; Cambridge University Press: New York, NY, 2014.


6. Shahlaei, M. Descriptor Selection Methods in Quantitative Structure–Activity Relationship Studies: A Review Study. Chem. Rev. 2013, 113, 8093–8103.
7. Liu, H.; Motoda, H. Feature Selection for Knowledge Discovery and Data Mining; Springer US: Boston, MA, 1998.
8. Bermingham, M. L.; Pong-Wong, R.; Spiliopoulou, A.; Hayward, C.; Rudan, I.; Campbell, H.; Wright, A. F.; Wilson, J. F.; Agakov, F.; Navarro, P.; Haley, C. S. Application of High-Dimensional Feature Selection: Evaluation for Genomic Prediction in Man. Sci. Rep. 2015, 5, 10312.
9. Evans, D. M.; Visscher, P. M.; Wray, N. R. Harnessing the Information Contained within Genome-Wide Association Studies to Improve Individual Prediction of Complex Disease Risk. Hum. Mol. Genet. 2009, 18, 3525–3531.
10. Kooperberg, C.; LeBlanc, M.; Obenchain, V. Risk Prediction Using Genome-Wide Association Studies. Genet. Epidemiol. 2010, 34, 643–652.
11. Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801.
12. Varnek, A.; Fourches, D.; Hoonakker, F.; Solov’ev, V. P. Substructural Fragments: A Universal Language to Encode Reactions, Molecular and Supramolecular Structures. J. Comput. Aided Mol. Des. 2005, 19, 693–703.
13. Jolliffe, I. Principal Component Analysis. In Encyclopedia of Statistics in Behavioral Science; John Wiley & Sons, Ltd.: Chichester, 2005.
14. Torgerson, W. S. Theory and Methods of Scaling; Wiley: New York, NY, 1958.
15. Householder, A. S.; Young, G. Matrix Approximation and Latent Roots. Am. Math. Mon. 1938, 45, 165–171.
16. Agrafiotis, D. K. Stochastic Proximity Embedding. J. Comput. Chem. 2003, 24, 1215–1221.
17. Hotelling, H. Relations Between Two Sets of Variates. Biometrika 1936, 28, 321–377.
18. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
19. Fisher, R. A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188.
20. McLachlan, G. J. Discriminant Analysis and Statistical Pattern Recognition; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, 2004.
21. Comon, P. Independent Component Analysis, A New Concept? Signal Process. 1994, 36, 287–314.
22. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, 2012.
23. Lee, D. D.; Seung, H. S. Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 1999, 401, 788–791.


24. Schölkopf, B.; Smola, A.; Müller, K.-R. Kernel Principal Component Analysis. In Advances in Kernel Methods - Support Vector Learning; MIT Press: Cambridge, MA, 1999; pp 327–352.
25. Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. Kernlab – An S4 Package for Kernel Methods in R. J. Stat. Softw. 2004, 11, 1–20.
26. Sammon, J. W. A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Comput. 1969, 18, 401–409.
27. Venables, W. N.; Ripley, B. D. Modern Applied Statistics with S, 4th ed.; Springer: New York, 2002.
28. De’ath, G. Extended Dissimilarity: A Method of Robust Estimation of Ecological Distances from High Beta Diversity Data. Plant Ecol. 1999, 144, 191–199.
29. Tenenbaum, J. B.; de Silva, V.; Langford, J. C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323.
30. Dijkstra, E. W. A Note on Two Problems in Connexion with Graphs. Numer. Math. 1959, 1, 269–271.
31. Oksanen, J.; Blanchet, F. G.; Kindt, R.; Legendre, P.; Minchin, P. R.; O’Hara, R. B.; Simpson, G. L.; Solymos, P.; Stevens, M. H. H.; Wagner, H. Vegan: Community Ecology Package, R package version 2.3-0; 2015.
32. Roweis, S. T.; Saul, L. K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326.
33. Belkin, M.; Niyogi, P. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput. 2002, 15, 1373–1396.
34. Van der Maaten, L.; Hinton, G. E. Visualizing High-Dimensional Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
35. Kullback, S.; Leibler, R. A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
36. Bengio, Y. Learning Deep Architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127.
37. Dubossarsky, E.; Tyshetskiy, Y. Autoencoder: An Implementation of Sparse Autoencoder for Automatic Learning of Representative Features from Unlabeled Data, R package version 1.0; 2014.
38. Kohonen, T. Self-Organizing Maps; Springer: Berlin, Heidelberg, 2001.
39. Wehrens, R.; Buydens, L. M. C. Self- and Super-Organising Maps in R: The Kohonen Package. J. Stat. Softw. 2007, 21, 1–19.
40. Bishop, C. M.; Svensén, M.; Williams, C. K. I. Developments of the Generative Topographic Mapping. Neurocomputing 1998, 21, 203–224.
41. Bishop, C. M.; Williams, C. K. I. GTM: A Principled Alternative to the Self-Organizing Map. In International Conference on Artificial Neural Networks, ICANN’96; Springer, 1996; pp 165–170.
42. Olier, I.; Vellido, A.; Giraldo, J. Kernel Generative Topographic Mapping. In ESANN 2010 Proceedings, European Symposium on Artificial Neural Networks - Computational Intelligence and Machine Learning; 2010; pp 481–486.


43. Kabán, A.; Girolami, M. A Combined Latent Class and Trait Model for the Analysis and Visualization of Discrete Data. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 859–872.
44. Owen, J. R.; Nabney, I. T.; Medina-Franco, J. L.; López-Vallejo, F. Visualization of Molecular Fingerprints. J. Chem. Inf. Model. 2011, 51, 1552–1563.
45. Bertin, J. La Graphique et le Traitement Graphique de l’Information; Nouvelle Bibliothèque Scientifique; Flammarion: Paris, 1977.
46. Sawitzki, G. Bertin: An R Implementation, R package version 0.1-94; 2014.
47. Hewes, F. W.; Gannet, H. Statistical Atlas of the United States; C. Scribner’s Sons: New York, NY, 1883; plate 151.
48. Inselberg, A. Parallel Coordinates; Springer: New York, NY, 2009.
49. Hoffman, P.; Grinstein, G. G.; Marx, K. A.; Grosse, I.; Stanley, E. DNA Visual and Analytic Data Mining. In IEEE Visualization 1997 Proceedings; IEEE, 1997; pp 437–442.
50. Andrews, D. F. Plots of High-Dimensional Data. Biometrics 1972, 28 (1), 125–136.
51. Myslivec, J. Andrews: Andrews Curves, R package version 1.0; 2012.
52. Pickett, R. M.; Grinstein, G. G. Iconographic Displays for Visualizing Multidimensional Data. In Proceedings of the 1988 IEEE Conference on Systems, Man, and Cybernetics; 1988; Vol. 1, pp 514–519.
53. Wolf, H. P.; Bielefeld, U. Aplpack: Another Plot PACKage: Stem.leaf, Bagplot, Faces, spin3R, Plotsummary, Plothulls, and Some Slider Functions, R package version 1.3.0; 2014.
54. Shneiderman, B. Tree Visualization with Tree-Maps: 2-D Space-Filling Approach. ACM Trans. Graph. 1992, 11, 92–99.
55. LeBlanc, J.; Ward, M. O.; Wittels, N. Exploring N-Dimensional Databases. In Proceedings of the 1st Conference on Visualization ’90; IEEE, 1990; pp 230–237.
56. Keim, D. A.; Kriegel, H.-P. VisDB: Database Exploration Using Multidimensional Visualization. IEEE Comput. Graph. Appl. 1994, 14, 40–49.
