
Mining Discriminative Patterns from Graph Data with Multiple Labels and Its Application to Quantitative Structure−Activity Relationship (QSAR) Models

Zheng Shao,† Yuya Hirayama,† Yoshihiro Yamanishi,‡,§ and Hiroto Saigo*,†

†Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka 820-8502, Japan
‡Division of System Cohort, Multi-Scale Research Center for Medical Science, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan
§Institute for Advanced Study, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan

ABSTRACT: Graph data are becoming increasingly common in machine learning and data mining, and their applications extend to bioinformatics and cheminformatics. Accordingly, graph mining, a family of methods for extracting patterns from graph data, has recently been studied and developed rapidly. Since the number of patterns in graph data is huge, a central issue is how to efficiently collect informative patterns suitable for subsequent tasks such as classification or regression. In this paper, we consider mining discriminative subgraphs from graph data with multiple labels. The resulting task has important applications in cheminformatics, such as finding common functional groups that trigger multiple drug side effects, or identifying ligand functional groups that hit multiple targets. In computational experiments, we first verify the effectiveness of the proposed approach on synthetic data, and then apply it to a drug adverse effect prediction problem. On the latter dataset, we compared the proposed method with L1-norm logistic regression in combination with PubChem/Open Babel fingerprints; the proposed method showed superior performance with a much smaller number of subgraph patterns. Software is available from https://github.com/axot/GLP.



INTRODUCTION

Classical data mining approaches assume that data are represented in vectorial form, i.e., each training example has a fixed-length feature vector. This representation has difficulty handling structured objects such as sets, sequences, trees, and graphs. Indeed, much real-world data can be represented as graphs rather than vectors. In order to compare graphs, the maximum common subgraph1 and graph kernels2 have been proposed. However, graph kernels have a problem with interpretability, because of their implicit representation of features, and the maximum common subgraph approach has a problem achieving satisfactory prediction performance, because of its small number of features. Therefore, recent approaches focus on the use of a set of substructures rather than one representative subgraph pattern. Substructure-based approaches are roughly classified into two categories: unsupervised and supervised. Unsupervised approaches enumerate all the substructures that satisfy some criteria, such as a maximum substructure size or a minimum occurrence frequency (also known as minimum support). After enumeration of the substructures, the target labels attached to the graphs are used to build a prediction model. However, this type of approach has a major shortcoming: if the number of enumerated substructures is too large, then one needs to preselect a set of substructures to be used in the subsequent machine learning step. For selection criteria, one can employ maximum substructure size, minimum support, and so forth. However, the optimal set of patterns generally depends on the subsequent machine learning task and cannot be selected beforehand.4 Interested readers can find closely related discussion in the literature.5 In order to overcome this difficulty, supervised enumeration algorithms have been proposed, which directly use target labels as the information source for filtering out uninformative patterns. Elementary ones are based on relative frequency,6 information gain,7 or correlation.8 Another approach is based on an iterative learning algorithm: boosting.9 Boosting collects one hypothesis per iteration and builds a prediction model in an iterative fashion. In gBoost,4 the hypothesis search step of boosting is replaced by a substructure mining algorithm, and a set of subgraphs is collected as features for classification or regression. In this paper, we consider building a multitarget regression model while simultaneously collecting discriminative subgraphs.


Figure 1. Schematic illustration of a graph mining problem with multiple labels, where each example is a graph and the goal is to search for discriminative subgraphs that can explain a series of target response labels. In chemoinformatics applications,3 the graphs and target response vectors correspond to drugs and the number of occurrences of each side effect, respectively.

A schematic figure of this type of problem is shown in Figure 1, in which chemical compounds are displayed together with multiple target labels. In this specific example, the chemical compounds and target labels correspond to drugs and the number of occurrences of each side effect, respectively, and the goal is to extract functional subgroups that trigger similar side effects. If we regard a chemical compound as a mathematical graph, where atoms and bonds correspond to vertices and edges, respectively, then the problem turns into extracting subgraphs that explain the variance in the frequency of side-effect occurrences. This problem has already been studied by Kong and Yu,10 who enumerate subgraphs explaining multiple labels using the Hilbert−Schmidt Independence Criterion and subsequently train multilabel classifiers. From the perspective of feature selection, Kong and Yu's work is a filter approach that separates the feature selection step from the learning step, while what we propose in this paper is a wrapper approach that integrates the two.5 Generally, the essential features depend on the specific learning problem and cannot be determined in advance. A wrapper approach is better than a filter approach at selecting exactly the features needed for a specific learning problem, but it can be slower when the feature selector must be called repeatedly on the same dataset. From the machine learning point of view, the problem that we address in this paper is close to multitarget/multitask learning, in which the design matrix is already given and task similarities are learned by regressing the design matrix onto the target response matrix. In our problem setting, the design matrix is not given, but is generated in an iterative fashion by using the target response matrix, such that the current design matrix explains as much of the covariance between the design matrix and the target response matrix as possible. To this end, our approach does not employ task similarity, but instead focuses on explaining the information in the target response matrix with a relatively small number of columns in the generated design matrix. In the cheminformatics literature, the incorporation of multiple labels into a quantitative structure−activity relationship (QSAR) model is known as multitarget QSAR modeling, and it has been applied intensively to diverse areas such as antimicrobial profiles,11,12 anticancer activity,13 neurodegenerative disorders,13,14 and pesticide research.15,16

In most of these approaches, the information in chemical compounds is extracted using a chemical fingerprint, and linear discriminant analysis or artificial neural networks (ANNs) are trained subsequently. The major difference in our approach is that we do not rely on a predefined chemical fingerprint, such as the PubChem fingerprint; instead, the features are generated dynamically, based on the information in the target response matrix. Since our objective is the selection of a small number of subgraph features that explain variance in the data, our approach is close to the power method,17 a popular iterative solver for the eigendecomposition of a design matrix. In our case, in addition to the design matrix X, the target response matrix Y is involved. Therefore, we borrow the idea of partial least-squares (PLS),18 which iteratively decomposes the cross-product matrix X′Y. For building a PLS model from graph data, we use substructures of chemical compounds as features. More precisely, we use a binary vector that represents the presence or absence of substructures in a chemical compound (see Figure 2).

Figure 2. Feature space representation based on subgraph patterns. The feature vector consists of binary pattern indicators.
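To make the representation in Figure 2 concrete, the following minimal sketch builds such a binary indicator matrix for a few molecules, assuming RDKit is available for substructure matching. The SMILES strings and SMARTS patterns are hypothetical placeholders: the patterns used in this work are mined from the data by the proposed algorithm rather than predefined.

```python
# Sketch of the binary pattern-indicator representation in Figure 2.
# Assumes RDKit is available; the molecules and patterns below are
# hypothetical placeholders, not patterns mined by the paper's method.
import numpy as np
from rdkit import Chem

molecules = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1O", "CC(=O)O"]]
patterns  = [Chem.MolFromSmarts(p) for p in ["C-O", "c1ccccc1", "C=O"]]

# X[i, j] = 1 if pattern j occurs in molecule i, 0 otherwise
# (the Methods section later encodes absence as -1 in its design matrix).
X = np.array([[1 if mol.HasSubstructMatch(pat) else 0 for pat in patterns]
              for mol in molecules])
print(X)
```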

One pitfall of this approach is that feature vectors built from subgraph patterns are often correlated with each other, since a subgraph pattern CCC naturally includes the subgraph patterns CC and C. PLS is useful in such situations, since it introduces orthogonality constraints that relax the collinearity among subgraph patterns. To compute the PLS model, one does not need to explicitly solve the full eigendecomposition, but can instead use the NIPALS algorithm,18 which performs the eigendecomposition in an iterative fashion. The NIPALS procedure is very efficient, since it involves only elementary matrix operations. However, NIPALS requires a deflation operation on the design matrix, which changes the structure of the design matrix. Therefore, applying NIPALS to graph data is only possible if all the subgraph patterns are mined first and the resulting design matrix is loaded into main memory. In this paper, we first present a variant of PLS called Y-deflation PLS, which does not need to perform deflation of the design matrix X; then, we elaborate on how to apply Y-deflation PLS to graph data with multiple labels. Existing approaches such as gBoost4 or gPLS19 cannot use the response matrix directly for substructure mining; if they are forced to operate on this type of data, the response matrix must first be decomposed into a set of response vectors, and informative substructures must be collected independently for each response vector. Consequently, a naive execution on an n × q response matrix takes approximately q times longer. In contrast, the proposed algorithm, multiple response graph PLS (denoted mgPLS below), is much more efficient, since it collects subgraph patterns bearing high covariance with multiple target responses. The contribution of this paper is two-fold: (i) we propose an iterative mining algorithm that can incorporate multiple response labels attached to each graph, and (ii) we experimentally validate its efficiency, compared to a naive counterpart that runs regression independently on each response vector. This paper is organized as follows. In the "Background" section, we review the NIPALS algorithm for PLS and discuss the problem caused by deflation. In the "Methods" section, we first present Y-deflation PLS, which avoids deflation of the design matrix, and then propose the multiple response graph PLS (mgPLS) algorithm, by incorporating sparsity into Y-deflation PLS and employing an efficient graph enumeration algorithm as a subroutine. In the "Experiments" section, we first compare the efficiency of mgPLS with that of existing graph mining approaches: graph PLS and its variants. We then evaluate the effectiveness of mgPLS in experiments predicting 1385 side effects for 888 drugs. For comparison, we prepared PubChem/Open Babel fingerprints and applied L1-norm logistic regression and naive PLS to them. Compared to those fingerprint-based approaches, mgPLS exhibited superior performance with fewer subgraph patterns. The "Conclusions" section concludes this paper.

BACKGROUND

In this section, we review partial least-squares (PLS) regression and the NIPALS algorithm for computing the PLS prediction model.

Partial Least Squares Regression. Below, we show that PLS is a linear regression method in a lower-dimensional projected space. Let us assume n training examples (x1, y1), ..., (xn, yn), where xi ∈ ℝ^p and yi ∈ ℝ^q. The resulting design matrix is given as X ∈ ℝ^(n×p), and the resulting response matrix is given as Y ∈ ℝ^(n×q); each column of matrix Y is assumed to be centered. Our prediction model is a linear combination of features:

y_{i,j} = β_{j,1} x_{i,1} + β_{j,2} x_{i,2} + ... + β_{j,p} x_{i,p}    (1)

However, when managing graph data, the dimensionality p can be huge, so we first consider performing an eigendecomposition of the cross-product matrix V = Y′X. Let w be a projection vector that maximizes the covariance in V; then, our objective function can be written as

max_w cov²(V, w) = max_w w′V′Vw = max_w w′X′YY′Xw    (2)

The objective value is unbounded above; therefore, a unit-length assumption on w is typically introduced, such that ∥w∥ = 1. The constrained objective function then becomes

max_w w′X′YY′Xw    subject to ∥w∥ = 1

By introducing a Lagrange multiplier λ, we can incorporate the constraint into the objective function as

max_w w′X′YY′Xw − λw′w

Let the objective function be L. Taking the derivative of L with respect to w and setting it to zero gives

∂L/∂w = 2X′YY′Xw − 2λw = 0

The solution is obtained by solving the following eigenproblem:

X′YY′Xw = λw

Now we consider performing linear regression in the space spanned by the eigenvectors. Recall that the ordinary least-squares (OLS) problem seeks a vector b that solves

min_b ∥Xb − y∥²

Then, the prediction using b_OLS is obtained as

ŷ_OLS = Xb_OLS = X(X^T X)^(−1) X^T y

In PLS, prediction is performed in a lower-dimensional projected space spanned by w, so we replace X by its projected version Xw to obtain the PLS prediction as

ŷ_PLS = (Xw)b_PLS = (Xw)((Xw)^T (Xw))^(−1) (Xw)^T y

If the top k dimensions are extracted instead of only the first, we obtain a more general regression problem, where the w vectors are stacked to construct a matrix W = {w1, ..., wk}; similarly, the regression coefficients are stacked as B = {b1, ..., bk} and the response vectors as Y = {y1, ..., yq}, which altogether changes the problem from single-response to multiple-response regression:

min_W ∥XW − Y∥²_F

where ∥·∥_F is the Frobenius matrix norm. The predictions of multiple-response PLS are obtained as

Ŷ_PLS = (XW)B_PLS = (XW)((XW)^T (XW))^(−1) (XW)^T Y = (XW)(W^T X^T X W)^(−1) W^T X^T Y

Typically, the motivation for using PLS is dimensionality reduction, so the number of dimensions k is chosen to be much smaller than p; k is typically chosen by cross-validation or by an information criterion such as AIC or BIC.20 In practice, only a small number of dimensions is desired, so a full eigendecomposition of the cross-product matrix Y′X is never performed; an efficient iterative procedure such as NIPALS is used instead.
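As a concrete illustration of the eigenproblem above, the following minimal numpy sketch computes the first PLS direction w as the dominant eigenvector of X′YY′X and then regresses Y on the projected score Xw. The random data are placeholders for illustration only, not the paper's experiments.

```python
# Minimal sketch (random placeholder data): first PLS direction as the
# top eigenvector of X'YY'X, followed by regression on the score Xw.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 30, 5
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
Y = Y - Y.mean(axis=0)              # columns of Y are assumed centered

M = X.T @ Y @ Y.T @ X               # p x p matrix X'YY'X
eigvals, eigvecs = np.linalg.eigh(M)
w = eigvecs[:, -1]                  # dominant eigenvector, ||w|| = 1

t = X @ w                           # projected score Xw
B = np.linalg.lstsq(t[:, None], Y, rcond=None)[0]   # regress Y on Xw
Y_hat = t[:, None] @ B              # PLS prediction with one component
print(Y_hat.shape)                  # (100, 5)
```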


NIPALS Algorithm. The NIPALS algorithm is a popular method for efficiently computing the projection W. The entire procedure is presented in Algorithm 1. NIPALS belongs to the family of Krylov subspace-based methods, and it is computable using only elementary matrix-vector operations. This can be confirmed by the fact that the regression coefficient B is restricted to the Krylov subspace:21,22

K = {X^T y, (XX^T)X^T y, (XX^T)² X^T y, ..., (XX^T)^(i−1) X^T y}

The ith Krylov subspace is computed recursively from the (i−1)th Krylov subspace by a single matrix-vector multiplication; therefore, we do not need a full eigendecomposition of the matrix. The family of Krylov subspace methods includes the conjugate gradient method, the power method, Lanczos, and Arnoldi.17 Note that, in line 7 of Algorithm 1, the design matrix X must be updated. This update is called deflation, since it subtracts a rank-1 matrix from the original matrix. In our case, the design matrix X is a sparse binary matrix representing the presence or absence of substructures. Since we exploit this sparsity for efficiently mining subgraphs, destruction of the sparse structure by deflation is critical. In the next section, we present a variant of the PLS algorithm that avoids deflation of the design matrix.

METHODS

Y-Deflation PLS. Before introducing the new PLS algorithm, we review how the weight vector is obtained in Algorithm 1, since it is key to understanding the main contribution of this paper. The first weight vector is obtained as

W(:,1) = X1^T y    (3)

where we denote the original design matrix as X1 for clarity. In the second iteration, the second weight vector W(:,2) = X2^T y is obtained by plugging in the updating operation for X2:

W(:,2) = X2^T y = (X1^T − X1^T T(:,1)T(:,1)^T) y = X1^T (I − T(:,1)T(:,1)^T) y

This presents two alternative ways to compute the second weight vector: either compute X1^T (I − T(:,1)T(:,1)^T) first and then multiply by y from the right, or compute (I − T(:,1)T(:,1)^T) y first and then multiply by X1^T from the left. Since the order of the matrices and vectors is preserved, either way gives the same solution. The former is exactly NIPALS. The latter is beneficial in the sense that it avoids deflation of the matrix X, but it requires deflation of y; therefore, we call it Y-deflation PLS. We present the entire algorithm in Algorithm 2. The main difference from NIPALS resides at line 9 of Algorithm 2. In addition, we perform the Gram-Schmidt process at lines 8 and 9 to ensure orthogonality in T, since the orthogonality of the latent components is less stable than in NIPALS.

Application of Y-Deflation PLS to Graph Data. In this subsection, we explain how to apply Y-deflation PLS to graph data. Let us first define a labeled connected graph and its subgraph.

Definition 1 (Labeled Connected Graph): A labeled graph is represented as a quadruple G = (V, E, L, l), where V is a set of vertices, E ⊆ V × V is a set of edges, L is a set of labels, and l: V ∪ E → L is a mapping function that assigns labels to the vertices and edges. A labeled connected graph is a labeled graph such that there is a path between any pair of vertices.

Definition 2 (Subgraph): Let G′ = (V′, E′, L′, l′) and G = (V, E, L, l) be labeled connected graphs. G′ is a subgraph of G (G′ ⊆ G) if the following conditions are satisfied: (1) V′ ⊆ V; (2) E′ ⊆ E; (3) L′ ⊆ L; (4) ∀v′ ∈ V′, l(v′) = l′(v′); and (5) ∀e′ ∈ E′, l(e′) = l′(e′). If G′ is a subgraph of G, then G is a supergraph of G′.

Our training data are represented as (G1, y1), ..., (Gn, yn), where Gi is the ith graph and yi is its attached response vector. Let pj denote a subgraph pattern, and let |P| be the number of all patterns obtained by an exhaustive search, i.e., all patterns that appear at least once in the entire dataset. The entire design matrix can then be represented by the presence or absence of subgraph patterns:

X_ij = +1 if pj ⊆ Gi, and −1 otherwise

However, computing the entire design matrix X is only possible if the number of subgraph patterns is small enough. Therefore, we make a sparsity assumption: only a small number of subgraph patterns are necessary for building a prediction model. By closely examining the prediction model,

Y = XB = XW(W^T X^T X W)^(−1) W^T X^T Y

one can notice that X is always coupled with W. Therefore, by assuming sparsity in W, sparsity in X is enforced as well. We can introduce such sparsity into W in line 5 of Algorithm 2, either after computing X^T y or during the computation. The level of sparsity can be controlled by soft thresholding or hard thresholding.23 From a graph mining point of view, this operation corresponds to mining the subgraph patterns with the top-k scores, where the score of the jth subgraph pattern pj is defined as

s(pj) = | Σ_{i=1}^{n} X_ij y_i |
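The following minimal numpy sketch illustrates this combination of Y-deflation and hard thresholding on a dense 0/1 pattern-indicator matrix. It is a simplified stand-in for Algorithm 2 with top-k weight selection: here the full indicator matrix is assumed to be materialized, which mgPLS avoids by mining the top-scoring pattern on the fly, and the data are random placeholders.

```python
# Sketch (assumed dense setting): Y-deflation PLS with hard-thresholded
# weight vectors. X is a precomputed pattern-indicator matrix; in mgPLS
# the top-scoring column is instead found by subgraph mining.
import numpy as np

def y_deflation_pls(X, y, n_components=3, top_k=1):
    n, p = X.shape
    r = y - y.mean()                       # deflated (residual) response
    T, W = [], []
    for _ in range(n_components):
        scores = X.T @ r                   # s(p_j) up to sign: sum_i X_ij r_i
        w = np.zeros(p)
        keep = np.argsort(np.abs(scores))[-top_k:]   # hard thresholding
        w[keep] = scores[keep]
        t = X @ w
        for t_prev in T:                   # Gram-Schmidt against previous T
            t = t - (t_prev @ t) * t_prev
        t = t / np.linalg.norm(t)
        r = r - t * (t @ r)                # deflate the response, not X
        T.append(t)
        W.append(w)
    return np.column_stack(T), np.column_stack(W)

rng = np.random.default_rng(1)
X = (rng.random((50, 200)) < 0.1).astype(float)   # hypothetical indicators
y = X[:, 3] + X[:, 17] + 0.05 * rng.standard_normal(50)
T, W = y_deflation_pls(X, y)
print(np.nonzero(np.abs(W).sum(axis=1))[0])       # indices of selected patterns
```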


In the experiments, we simply use the hard-thresholding approach and set k = 1, which is equivalent to searching for the single subgraph with the highest score. For an efficient search, we adopt the DFS code tree of gSpan24 as the canonical search space (Figure 3). In the DFS code tree, we start exploring from the root node containing the simplest structure and extend edges to child nodes containing more complex structures. (For details about the minimum DFS code, please refer to ref 25.) For the sake of efficiency, we use the pruning theorem of ref 26 to prune unnecessary parts of the search space. Suppose that we have reached a node corresponding to subgraph pattern pj and that the best score achieved so far is s*. If every superpattern p′ of pj is guaranteed not to have a score larger than s*, then we can safely prune the downstream nodes without losing the optimal subgraph pattern. Such a pruning condition is described as follows.

Figure 3. Tree-shaped pattern search space. Each node contains a subgraph, and its child node contains a subgraph with one more edge.

Theorem 1 (from ref 26): For any pattern p′ such that pj ⊆ p′, if max{s+(pj), s−(pj)} < s*, then we can prune the supernodes of pj, where

s+(pj) = 2 Σ_{i | sgn(ŷi) = +1, x_ij = 1} |ŷi| − Σ_{i=1}^{n} ŷi

s−(pj) = 2 Σ_{i | sgn(ŷi) = −1, x_ij = 1} |ŷi| + Σ_{i=1}^{n} ŷi
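As an illustration of this bound, the following sketch computes s+, s−, and the pruning test for one column of a pattern-indicator matrix. The variable names and inputs are hypothetical; ŷ stands for the (deflated) response vector used during mining.

```python
# Sketch of the pruning bound in Theorem 1 (hypothetical inputs).
# x_col[i] = 1 if pattern p_j occurs in graph i, -1 otherwise;
# y_hat is the (deflated) response vector used during the search.
import numpy as np

def prune_bounds(x_col, y_hat):
    pos = (np.sign(y_hat) == 1) & (x_col == 1)
    neg = (np.sign(y_hat) == -1) & (x_col == 1)
    s_plus  = 2 * np.abs(y_hat[pos]).sum() - y_hat.sum()
    s_minus = 2 * np.abs(y_hat[neg]).sum() + y_hat.sum()
    return s_plus, s_minus

def can_prune(x_col, y_hat, s_best):
    # No superpattern of p_j can beat s_best if both bounds fall below it.
    return max(prune_bounds(x_col, y_hat)) < s_best

x_col = np.array([1, -1, 1, 1, -1])
y_hat = np.array([0.5, -0.2, 0.1, -0.4, 0.3])
print(can_prune(x_col, y_hat, s_best=1.2))   # True: subtree can be pruned
```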

We present the multiple response graph PLS algorithm in Algorithm 3. This algorithm calls the subgraph search algorithm (Algorithm 4) as a subroutine and adds one subgraph pattern to the solution set per iteration. In each iteration, the subgraph pattern search is driven to search in a direction different from the previously searched directions, due to the orthogonality constraints in PLS.

In order to make this paper self-contained, we present our subgraph pattern search algorithm in Algorithm 4. It is a recursive algorithm that starts from a simple pattern, extends the search space based on the right-most extension, and prunes search nodes using the minimum DFS code check and the pruning condition in Theorem 1.

EXPERIMENTS

In this section, we compare the proposed algorithm, multiple response graph PLS (mgPLS for short), with gPLS19 using synthetic and real-world data. Note that gPLS does not support multiple response vectors as input; therefore, for comparison, we run gPLS on every response vector y extracted from the response matrix Y. In addition, we prepared a naive alternative to mgPLS, called average gPLS (agPLS), that uses the column average vector

ȳ = (1/q) Σ_{j=1}^{q} Y(:, j)

as its input and outputs a single coefficient vector. The resulting predictions of agPLS are the same for all responses. This approach is equivalent to gPLS if the number of response vectors is one. gPLS was obtained from the authors' web site.27 In gPLS, the number of subgraphs to extract per iteration is set to 1, and the minimum support and maximum pattern size parameters were set to 2 and 10, respectively. The number of principal components (k) to extract is determined on the validation set during 4-fold cross-validation. We set the parameters of mgPLS to be the same as those of gPLS.

Synthetic Data. We first evaluate mgPLS on synthetic graph data. Synthetic graph data are generated using graphgen28 with the following default parameters, unless stated otherwise: (i) number of graphs, 10 000; (ii) mean number of edges, 20; (iii) mean number of nodes, 20; (iv) number of edge labels, 20; (v) number of node labels, 20; and (vi) edge density, 0.15. However, graphgen does not generate response vectors; therefore, we prepare them in the following way. After generating the random graphs, we ran the frequent subgraph mining algorithm gSpan29 to enumerate frequent subgraphs appearing in at least 2% of the database, with a maximum size of 10. We then built each response vector as the sum of five subgraph occurrence indicators:

y = Σ_{j=1}^{5} x_j + ε    (4)

where each x_j is a binary vector consisting of 0s and 1s, and ε is Gaussian noise N(0, σ). Since the objective of PLS is to maximize the covariance in its objective function (eq 2), we evaluate the various PLS regression methods in terms of covariance. Maximizing covariance in the projected space is slightly different from minimizing the squared error in the original space, but the two are closely related; please refer to ref 30 for more details.
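A minimal sketch of this response construction (eq 4) follows; the random indicator matrix stands in for the gSpan-mined subgraph occurrences, and all sizes are placeholder values.

```python
# Sketch of eq 4: each response is the sum of five pattern-occurrence
# indicator vectors plus Gaussian noise. The indicator matrix here is
# random; in the paper it comes from gSpan-mined frequent subgraphs.
import numpy as np

rng = np.random.default_rng(42)
n_graphs, n_patterns, n_responses, sigma = 1000, 500, 10, 0.05

indicators = (rng.random((n_graphs, n_patterns)) < 0.1).astype(float)

Y = np.empty((n_graphs, n_responses))
for j in range(n_responses):
    chosen = rng.choice(n_patterns, size=5, replace=False)
    Y[:, j] = indicators[:, chosen].sum(axis=1) \
              + rng.normal(0.0, sigma, size=n_graphs)
print(Y.shape)  # (1000, 10)
```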

Performance and Scalability. We first evaluate the performance of mgPLS with respect to an increasing number of transaction graphs. Figure 4 (left) shows the cross-validation performance of the three methods (mgPLS, gPLS, and agPLS) in terms of covariance. We can observe that mgPLS performs best, and that the performance of agPLS approaches that of mgPLS as the number of transaction graphs increases, due to the decreasing variance of the average response vector ȳ. Also note the scalability of the iterative mining approaches mgPLS, agPLS, and gPLS: they could be run on datasets with up to 10 000 transactions in this experiment, with training times of 759, 643, and 299 s, respectively (see Figure 5). For reference, we ran gSpan on the same dataset with the same parameters. When the transaction size is 100, it successfully enumerated all 133 438 subgraphs in 394 s; however, when the transaction size is increased to 500, it did not finish within 24 h. This suggests that applying unsupervised approaches at this scale is infeasible. We can also observe from Figure 5 that mgPLS is more time-consuming than gPLS. This is due to the Gram-Schmidt process, which scales with the square of the number of subgraphs.

Figure 4. Comparison of three PLS methods (mgPLS, agPLS, and gPLS), in terms of covariance in synthetic data with various database sizes (left) and various numbers of response vectors (right).

Figure 5. Comparison of three PLS methods (mgPLS, agPLS, and gPLS), in terms of computation time (seconds) in synthetic data with various database sizes.

Effect of the Number of Response Vectors. Next, we investigate the effect of increasing the number of response vectors. We generated {1, 5, 10, 15, 20} response vectors and measured the cross-validation performance in terms of covariance. In Figure 4 (right), we can observe that the covariance of agPLS decreases as we increase the number of response vectors, which is not surprising, since the prediction of agPLS is based on only one coefficient vector, while the predictions of mgPLS are based on q coefficient vectors. The covariance of gPLS also decreases, because it searches for subgraph patterns that explain only one response vector, and its average performance degrades as the number of response vectors increases.

Effect of Noise. Here, we investigate the effect of noise on regression performance. Recall that our response vector has a Gaussian noise term N(0, σ) in eq 4. We set σ to {0, 0.025, 0.05, 0.075, 0.10} and measured the cross-validation performance. In Figure 6, it can be observed that the covariance of all three methods degrades as we increase the noise strength, but the covariance of mgPLS remained the best, independent of the noise strength. This is due to the nature of mgPLS, which extracts major latent components; this works similarly to the way principal components remove noise in principal component analysis (PCA).

Figure 6. Effect of noise in the response matrix on prediction performance in synthetic data.

Real-World Data. We further investigate the effectiveness of mgPLS on real-world data. We used the data compiled by Pauwels et al.3,31 In this dataset, chemical compounds (drugs) are listed with attribute vectors representing the presence (1) or absence (0) of side effects, and the goal is to extract functional subgroups that trigger similar side effects. The entire dataset consists of 888 graphs and 1385 response vectors. The statistics of these data are shown in Table 1. Notice that the chemical compounds are larger than those used in the simulation experiments.

Table 1. Side-Effect Data Statistics

parameter                    value
number of compounds          888
number of atom labels^a      26
number of bond labels^b      6
average number of atoms      49.4 ± 30.0
average number of bonds      51.0 ± 31.0
molecular weight             377 ± 223
log P                        1.93 ± 2.70
PSA                          100 ± 91.0
H-bond acceptor              5.78 ± 4.70
H-bond donor                 2.36 ± 2.96

^a The 26 atom types are Ag, Al, Au, As, Br, C, Cl, F, Fe, Gd, Ga, H, I, K, La, Li, Mg, Mn, N, Na, O, P, Pt, S, Se. ^b The 6 bond types are (1) nonspecific stereo double bond, (2) 3-D hydrogen bond, (3) nonspecific stereo single bond, (4) solid wedge stereo bond, (5) dashed wedge stereo bond, and (6) aromatic bond.

We evaluate the effectiveness of mgPLS by comparing it to a traditional chemoinformatics approach: fingerprint + regression. A fingerprint is a binary vector that represents the presence or absence of predefined subgraph patterns. We employ two types of widely used fingerprints. The first is the 881-bit PubChem fingerprint offered by the National Institutes of Health (NIH); a full list of its structural keys is available from the PubChem website.32 The second fingerprint is generated by the widely used free software Open Babel (http://openbabel.sourceforge.net/), which offers four types of fingerprints: MACCS, FP2, FP3, and FP4. Among them, FP2 is often used for the comparison of small compounds and consists of a set of linear fragment patterns of up to seven atoms; FP4 consists of a set of SMARTS patterns corresponding to functional groups. The combination of FP2 and FP4 was reported to achieve good performance in the work of Ahmed et al.33 and has therefore been employed in our experiments, resulting in a total of 1328 fingerprint bits. We did not use MACCS, since, out of all 960 patterns, only 166 were available in Open Babel. As regression methods, naive PLS and L1-norm logistic regression23 are employed, and each method is evaluated by 10-fold cross-validation. As a benchmark, we also ran gSpan to collect frequent subgraphs and then supplied them to PLS and L1-norm logistic regression. In Figure 7, we show the performance of each method in terms of AUC, the area under the receiver operating characteristic (ROC) curve obtained by taking the false positive rate as the x-axis and the true positive rate as the y-axis.34 Each AUC value is a real number between 0 and 1, and a higher AUC value indicates better classifier performance. In the figure, the y-axis corresponds to AUC and the x-axis corresponds to the number of features used.

Figure 7. Comparison of mgPLS to standard chemical fingerprints. The numbers of features used for mgPLS, "PubChem+PLS", "PubChem+L1LR", "FP2&4+PLS", and "FP2&4+L1LR" are 30, 1328, 881, 234, and 166, respectively. For mgPLS and L1LR, we counted the number of subgraph features as those appearing at least once during 10-fold cross-validation. We chose this way of counting because it amounts to the maximum number of features used and compares fairly to fixed-length fingerprints.

We can observe that mgPLS performed best in terms of AUC, followed by PLS using the FP2 and FP4 fingerprints and PLS using the PubChem fingerprint. Although naive PLS with the FP2 and FP4 fingerprints performed almost as well as mgPLS, naive PLS does not perform feature selection, so all 1328 fingerprint bits had nonzero regression coefficients. In contrast, mgPLS performed feature selection and retained only 30 subgraph features. This property is advantageous for interpreting the reason for similar side effects caused by common subgraph patterns, which could be missing in the PubChem or Open Babel fingerprints. L1-norm logistic regression also performs feature selection, and the number of features it selected was similar to that of mgPLS; however, its performance in terms of AUC was worse than that of mgPLS. In this case, it is likely that mgPLS successfully found key subgraph patterns. We also observe low AUC scores for the benchmark method based on frequent subgraphs. Although the AUC for the benchmark increases as the number of features increases, the increase was very mild. This suggests the importance of incorporating target label information into subgraph mining for efficient feature selection. Table 2 displays a portion of the substructures obtained by mgPLS, with the corresponding regression coefficients. It is worth noting that we found a high correlation (0.737) between the coefficient vectors of "agranulocytosis" and "eosinophilia", two side effects that both concern the blood. In contrast, we found a very low correlation (0.00485) between the coefficient vectors of "erythema" and "hypersensitivity". Whereas "erythema" occurs in the lower layer of the skin, "hypersensitivity" occurs in the immune system, so it is reasonable that the regression coefficients for the two side-effect prediction problems are completely different from each other.
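For the correlation figures quoted above, a sketch of the computation on a hypothetical coefficient matrix (patterns × side effects) follows; the matrix, column indices, and values are placeholders, not the coefficients produced in this work.

```python
# Sketch: Pearson correlation between the regression-coefficient vectors
# of two side effects. B and the index mapping are hypothetical stand-ins
# for the coefficients produced by mgPLS.
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((30, 1385))          # 30 patterns x 1385 side effects
side_effect_index = {"agranulocytosis": 10, "eosinophilia": 42}

a = B[:, side_effect_index["agranulocytosis"]]
b = B[:, side_effect_index["eosinophilia"]]
corr = np.corrcoef(a, b)[0, 1]
print(round(corr, 3))
```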



Table 2. Some of the Extracted Substructures from the Drug Side-Effect Data^a

^a The leftmost column shows the extracted substructures, and the remaining columns show the regression coefficients for each side-effect prediction problem.

DISCUSSION

Connection to Canonical Correlation Analysis (CCA) and Boosting. When a relationship between two sets of variables is of concern, a natural choice is canonical correlation analysis (CCA),35 the objective of which is to maximize the correlation between the design matrix (X) and the response matrix (Y). However, it requires us to normalize both X and Y. In our case, Y is available and normalizable, but we cannot normalize X and Y simultaneously. One can also view the proposed PLS-based approach from the viewpoint of boosting. Both methods add a variable (or hypothesis) one by one when building a prediction model: boosting adds the variable that maximizes the gain, while PLS adds the variable that maximizes covariance. See ref 4 for more details about the connection between boosting and PLS.



CONCLUSION

This paper introduced a multiple response PLS algorithm for graph mining. To the best of the authors' knowledge, this is the first time that the multiple response problem has been considered in combination with tightly coupled substructure mining. The solution proposed in this paper is simple but efficient, since it uses the eigenstructure of the given response matrix. In computational experiments using synthetic data, the proposed method exhibited performance superior to existing methods and robustness to noise. In experiments using real-world data, the proposed method showed performance superior to standard chemoinformatics approaches based on fingerprints. We plan to apply the proposed algorithm to identify ligand functional groups that hit multiple targets in the near future.

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work is supported by the Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research (JSPS KAKENHI) (Grant Nos. 23700338, 25700029, and 15K14980), the Program to Disseminate Tenure Tracking System, MEXT, Japan, and Kyushu University Interdisciplinary Programs in Education and Projects in Research Development.

REFERENCES

(1) Raymond, J. W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput.-Aided Mol. Des. 2002, 16, 521−533. (2) Kashima, H.; Tsuda, K.; Inokuchi, A. Marginalized Kernels between Labeled Graphs. In Proceedings of the 21st International Conference on Machine Learning, Washington, DC, USA, Aug. 21−24, 2003; Fawcett, T., Mishra, N., Eds.; AAAI Press: Palo Alto, CA, USA, 2003. (3) Pauwels, E.; Stoven, V.; Yamanishi, Y. Predicting Drug Side-Effect Profiles: A Chemical Fragment-Based Approach. BMC Bioinf. 2011, 12, 169. (4) Saigo, H.; Nowozin, S.; Kadowaki, T.; Kudo, T.; Tsuda, K. gBoost: A Mathematical Programming Approach to Graph Classification and Regression. Machine Learning 2009, 75, 69−89.


(5) Kohavi, R.; John, G. H. Wrappers for Feature Subset Selection. Artif. Intell. 1997, 97, 273−324. (6) Jin, N.; Young, C.; Wang, W. Graph Classification Based on Pattern Co-occurrence. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, Hong Kong, Nov. 2−6, 2009; Cheung, D. W.-L., Song, I.-Y., Chu, W. W., Hu, X., Lin, J. J., Eds.; Association for Computing Machinery (ACM): New York, 2009. (7) Cheng, H.; Yan, X.; Han, J. Direct Discriminative Mining for Effective Classification. In 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, Apr. 7−12, 2008; Yu, P. S., Ed.; IEEE: Piscataway, NJ, 2008; p 169. DOI: 10.1109/ICDE.2008.4497425 (8) Bringmann, B.; Zimmermann, A.; Raedt, L. D.; Nijssen, S. Don’t Be Afraid of Simpler Patterns. In 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, Sept. 18−22, 2006; Fürnkranz, J., Scheffer, T., Spiliopoulou, M., Eds.; Springer: Berlin, 2006. (9) Freund, Y.; Schapire, R. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55 (1), 119−139. (10) Kong, X.; Yu, P. S. Multi-Label Feature Selection for Graph Classification. In Proceedings of the International Conference on Data Mining, Sydney, Australia, Dec. 14−17, 2010; Webb, G. I., Liu, B., Zhang, C., Gunopulos, D., Wu, X., Eds.; IEEE Computer Society: Washington, DC, 2010. (11) Garcia, I.; Fall, Y.; Gomez, G.; Gonzalez-Diaz, H. First Computational Chemistry Multi-Target Model for Anti-Alzheimer, Anti-Parasitic, Anti-Fungi, and Anti-Bacterial Activity of GSK-3 Inhibitors In Vitro, In Vivo, and in Different Cellular Lines. Mol. Diversity 2011, 15, 561−567. (12) Speck-Planche, A.; Cordeiro, M. N. D. S. Simultaneous Modeling of Antimycobacterial Activities and ADMET Profiles: A Chemoinformatic Approach to Medicinal Chemistry. Curr. Top. Med. Chem. 2013, 13, 1656−1665. (13) Speck-Planche, A.; Kleandrova, V. V.; Luan, F.; Cordeiro, M. N. D. S. Rational Drug Design for Anti-Cancer Chemotherapy: Multi-Target QSAR Models for the In Silico Discovery of Anti-Colorectal Cancer Agents. Bioorg. Med. Chem. 2012, 20, 4848−4855. (14) Alonso, N.; Caamano, O.; Romero-Duran, F. J.; Luan, F.; Cordeiro, M. N. D. S.; Yanez, M.; Gonzalez-Diaz, H.; Garcia-Mera, X. Model for High-Throughput Screening of Multi-Target Drugs in Chemical Neurosciences; Synthesis, Assay and Theoretic Study of Rasagiline Carbamates. ACS Chem. Neurosci. 2013, 4, 1393−1403. (15) Speck-Planche, A.; Kleandrova, V. V.; Scotti, M. T. Fragment-Based Approach for the In Silico Discovery of Multi-Target Insecticides. Chemom. Intell. Lab. Syst. 2012, 111, 39−45. (16) Speck-Planche, A.; Kleandrova, V. V.; Luan, F.; Cordeiro, M. N. D. S. Multi-Target Inhibitors for Proteins Associated with Alzheimer: In Silico Discovery Using Fragment-Based Descriptors. Curr. Alzheimer Res. 2013, 10, 117−124. (17) Golub, G. H.; Loan, C. F. V. Matrix Computations; Johns Hopkins University Press: Baltimore, MD, USA, 1996. (18) Wold, H. Path Models with Latent Variables: The NIPALS Approach. In Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building; Academic Press: New York, 1975; pp 307−357. DOI: 10.1016/B978-0-12-103950-9.50017-4 (19) Saigo, H.; Krämer, N.; Tsuda, K. Partial Least Squares Regression for Graph Mining.
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, Aug. 24−27, 2008, Li, Y., Liu, B., Sarawagi, S., Eds.; Association for Computing Machinery (ACM): New York, 2008. (20) Krämer, N.; Braun, M. Kernelizing Partial Least Squares, Degrees of Freedom, and Efficient Model Selection. In Proceedings of the 24th International Conference on Machine Learning, Corvalis, OR, USA, June 20−24, 2007; Gharamani, Z., Eds.; Association for Computing Machinery (ACM): New York, 2007. (21) Lanczos, C. An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators. J. Res. Natl. Bur. Stand 1950, 45, 255−282.

(22) Eldén, L. Partial Least Squares vs. Lanczos Bidiagonalization I: Analysis of a Projection Method for Multiple Regression. Comput. Stat. Data Anal. 2004, 46, 11−31. (23) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, 2009. (24) Yan, X.; Han, J. gSpan: Graph-based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2003), Dec. 9−12, 2002; pp 721−724. (25) Yan, X.; Han, J. gSpan: Graph-based Substructure Pattern Mining; Technical Report No. UIUCDCS-R-2002- 2296, Department of Computer Science, University of Illinois at Urbana−Champaign: Urbana, IL, USA, 2002. (26) Kudo, T.; Maeda, E.; Matsumoto, Y. An Application of Boosting to Graph Classification. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, Dec. 13−18, 2004, Saul, L. K., Weiss, Y., Bottou, L., Eds.; 2004. (27) Saigo, H. Hiroto Saigo’s web site, http://www.bio.kyutech.ac.jp/ %7Esaigo/ (accessed Sept. 18, 2015). (28) Cheng, J. James Cheng’s web site, http://www.cse.cuhk.edu.hk/ ~jcheng/ (accessed Sept. 18, 2015). (29) Yan, X. Xifeng Yan’s web site, http://www.cs.ucsb.edu/~xyan/ (accessed Sept. 18, 2015). (30) Borga, M.; Landelius, T.; Knutsson, H. A Unified Approach to PCA, PLS, MLR and CCA; Technical Report No. LiTH-ISY-R-1992; Linköping University: Linköping, Sweden, 1997. Linkoeping University's web site, http://www2.cvl.isy.liu.se/ScOut/TechRep/ PaperInfo/blk97b.html. (31) Yamanishi, Y. Yoshihiro Yamanishi’s web site, http://cbio.ensmp. fr/~yyamanishi/ (accessed Sept. 18, 2015). (32) Pubchem, http://pubchem.ncbi.nlm.nih.gov/. (33) Ahmed, J.; Worth, C. L.; Thaben, P.; Matzig, C.; Blasse, C.; Dunkel, M.; Preissner, R. Fragment Store-A Comprehensive Database of Fragments Linking Metabolites, Toxic Molecules and Drugs. Nucleic Acids Res. 2011, 39, D1049−D1054. (34) Gribskov, M.; Robinson, N. Use of Receiver Operating Characteristic (ROC) Analysis To Evaluate Sequence Matching. Comput. Chem. 1996, 20, 25−33. (35) Hotelling, H. Relation between Two Sets of Variates. Biometrika 1936, 28, 321−377.
