Feasibility of Active Machine Learning for Multiclass Compound

Article pubs.acs.org/jcim

Feasibility of Active Machine Learning for Multiclass Compound Classification

Tobias Lang,†,‡ Florian Flachsenberg,† Ulrike von Luxburg,§ and Matthias Rarey*,†

† Center for Bioinformatics and ‡ Department of Computer Science, University of Hamburg, 20146 Hamburg, Germany
§ Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany

ABSTRACT: A common task in the hit-to-lead process is classifying sets of compounds into multiple, usually structural classes, which build the groundwork for subsequent SAR studies. Machine learning techniques can be used to automate this process by learning classification models from training compounds of each class. Gathering class information for compounds can be cost-intensive as the required data needs to be provided by human experts or experiments. This paper studies whether active machine learning can be used to reduce the required number of training compounds. Active learning is a machine learning method which processes class label data in an iterative fashion. It has gained much attention in a broad range of application areas. In this paper, an active learning method for multiclass compound classification is proposed. This method selects informative training compounds so as to optimally support the learning progress. The combination with human feedback leads to a semiautomated interactive multiclass classification procedure. This method was investigated empirically on 15 compound classification tasks containing 86−2870 compounds in 3−38 classes. The empirical results show that active learning can solve these classification tasks using 10−80% of the data which would be necessary for standard learning techniques.



INTRODUCTION

In many situations in drug discovery, medicinal chemists need to classify a given molecule collection into a large number of classes. In lead identification, thousands of interesting compounds may result from high-throughput screening. Medicinal chemists group these molecules into different classes based on structural features, according to prototypical anticipated modes of action. The target classification often depends on the application context and the chemist’s preferences. Different classifications for the same compound collection are therefore possible and, in our experience, not unusual. Classifying hundreds or even thousands of compounds manually can be a long, tedious, and error-prone process. In contrast, machine learning methods classify compounds in an automated way. Unsupervised clustering methods are of limited help in this scenario, since they cannot handle the user-specific preferences. Instead, supervised classification models can be learned from training compounds of each class. However, the gathering of training examples is often time- or labor-intensive, for instance, when they need to be provided by a human expert. The field of active machine learning1−3 studies how training examples can be selected in such a way that the total number of required training examples for learning is minimized. Active learning has been successfully applied in many domains, ranging from natural language processing4 through image classification5 and robotics6,7 to bioinformatics.8,9 However, almost all work in active learning for classification has focused on binary classification problems. Only a few studies have investigated active learning for multiclass classification,10,11 mostly in the context of vision and image understanding.12,13

While active learning has been identified as a useful tool to enable sequential screening,14,15 its potential in the context of multiclass classification in cheminformatics has not been studied so far. Warmuth et al.16 presented the first study in the direction of virtual screening: their active learning method could find active thrombin ligands faster than random selection baselines. Their method employed the uncertainty-based sampling active learning strategy in combination with support-vector machines.17 Successor studies investigated alternative active learning methods for the virtual screening problem, including active learning with random forest models for virtual screening of G-protein-coupled receptors (GPCR)18 and Abl kinase inhibitors19 as well as Gaussian processes in the context of cancer cell growth inhibition.20 Other studies investigated novelty in active compound selection,21 actively choosing multiple compounds at a time,22 and actively screening for multiple target responses simultaneously.23,24 In virtual screening, the goal is to predict the bioactivity of compounds for specific targets. This is modeled either as a regression problem when the bioactivity level is directly predicted or as a binary classification problem to distinguish active from inactive compounds. This paper studies whether active learning is also suitable for classifying compounds into multiple classes. The main application scenario we envision is the classification of hits from high-throughput primary screens into multiple potential SAR series. Discussions with scientists from the pharmaceutical industry revealed that this process is mostly done with substantial manual intervention. Further-

Received: May 29, 2015


DOI: 10.1021/acs.jcim.5b00332 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX


Journal of Chemical Information and Modeling

the distance function used to compare the similarity of compounds. The distance function needs to be specified by an external expert prior to learning. This is a difficult task as the choice is usually strongly context-dependent. Active learning methods are supervised machine learning methods that select the training data D in an automated way.3 In pool-based active learning, the training data is chosen from a given pool of unlabeled examples with unknown target classes. The chosen examples are manually labeled and provided to the learning algorithm. In the context of classifying compounds, this corresponds to choosing compounds from the given collection X and getting feedback about the true classes. Active learning methods select the examples that are most helpful for learning an accurate classifier for the remaining data. Theoretical results from statistical learning theory show that active learning may reduce the number of training examples required to achieve the same accuracy as standard supervised machine learning by an exponential factor.26 There are different ways to formalize what makes an example good for learning. Uncertainty sampling1,17 is the most popular approach: given certainty estimates for the predictions of the classification model, the examples with the currently lowest certainty are chosen. Other strategies include query-by-committee,27 optimal experimental design,28 and querying representative examples.29 Active learning research has mostly focused on regression and binary classification problems. A popular approach to multiclass classification is to reduce the problem to a collection of binary classification problems in which either two single classes are compared (one-vs-one) or a single class is compared against all others (one-vs-all). For such multiclass approaches, one can derive active learning methods for multiclass classification based on active learning strategies for the binary classification subtasks.30

Interactive Multiclass Classification Method.
This work proposes an interactive multiclass classification framework based on active learning for classifying collections of compounds into multiple classes. In this framework, the medicinal chemist interacts with an active machine learning algorithm to produce a multiclass classification for a given compound collection. Whether the class information results from experiments or the intuition of the chemist has no relevance for the computational approach. In this way, active learning facilitates a favorable trade-off between manual work and the accuracy of the resulting classification. In the beginning, a compound collection is given whose compounds all have unknown target classes and are thus unlabeled. To initialize the training set D (the set of labeled compounds with known target classes), the medicinal chemist assigns target classes for a small number of randomly selected compounds. (Alternatively, in situations where some labeled compounds are available already, one could use them to initialize D.) After D has been initialized, an active learning algorithm starts to interact with the medicinal chemist in an iterative process and thereby extends the training set with more labeled compounds. In each iteration, the active learning algorithm chooses a compound xi from the yet unlabeled compounds. This is the key step for the efficiency of the complete method: the active learning algorithm selects compounds which are supposed to improve the prediction model most. The medicinal chemist provides the correct class label yi for this compound xi. The example (xi, yi) is added to the training data D. The active learner learns a classification model from D with the additional training compound. The resulting model is used to predict the classes of all remaining

more, active learning for virtual screening typically focuses on finding as many actives as fast as possible. In contrast, the goal in this work is not to select typical compounds of a class, but to optimize the overall classification accuracy. In this paper, an interactive process based on active learning is proposed to learn from the feedback of a chemist or experiments how to classify a molecule collection into multiple classes. The active learning component chooses iteratively for which compounds to ask for feedback. These compounds serve as training examples for a machine learning algorithm to estimate classes for all remaining compounds in the collection. By focusing on informative molecules, feedback is required only for a small number of compounds. The exact number of class labels to provide depends not only on the complexity of the learning task, but also on the desired degree of accuracy for the resulting model. In most applications, accuracy levels of 80− 90% are satisfactory, resulting in a requirement of providing just 4−22% of the class labels. In order to fully exploit this approach, a simple heuristic empirically estimating the accuracy of the current model is also investigated in this paper.



METHODS

Active Learning for Multiclass Classification. In the context of cheminformatics, the problem of multiclass classification of a collection $X = \{x_i\}_{i=1}^{n}$ of $n$ compounds is to assign them to different classes $\{1, \ldots, l\}$. Each compound $x_i \in X$ is assumed to belong to exactly one class. The label $y_i \in \{1, \ldots, l\}$ identifies this true class. The definition of classes is application-specific. Examples are the activity of a compound against one of $l$ targets or membership in a structurally defined class of actives in an SAR analysis. The compounds $x_i \in X$ are represented in a feature space $\mathcal{F}$, for example $\mathcal{F} = \mathbb{R}^d$. The active learning method proposed in this paper does not depend on specific choices of molecular descriptors but can deal with arbitrary feature spaces, as long as the underlying machine learning model can process the feature types. For instance, $\mathcal{F}$ can include global properties of molecules like molecular weight as well as structural and topological descriptors like extended connectivity fingerprints (ECFP).25 It is not required to know in advance which features in $\mathcal{F}$ are relevant to distinguish classes. The method proposed in this work selects the relevant features during its learning process.

Machine learning methods learn classification functions $f: X \to \{1, \ldots, l\}$ which map compounds into classes. Supervised machine learning methods estimate $f$ from a training set $D = \{(x_i, y_i)\}_{i=1}^{m}$ of compounds $x_i$ with known classes $y_i$. Typically, the training set contains only a small subset of all compounds ($m \ll n$). Here, the training set is created from the feedback of a medicinal chemist, but it can also be gathered from experiments or other external sources. Supervised learning methods adapt to context: for two different training sets $D$ and $D'$, a supervised machine learning method may produce different classification functions. The choice of the training data $D$ thus has a strong influence on the accuracy of the learned classification function. If $D$ contains mostly redundant or unrepresentative compounds, the resulting classification function may not be accurate for compounds not in the training set.

In principle, unsupervised clustering methods can also be used to classify compounds into different groups without the help of any training examples. These methods exploit the structure of the compounds in the feature space $\mathcal{F}$. Their result depends on


Table 1. Overview of the Data Sets Used in the Experiments

name          size   classification    no. of classes   class sizes

Structural
CP19A         1960   Zinc clustering    3               759, 669, 532
                     PF clustering     10               372, 312, 283, ..., 9, 8, 4
DRD4          2339   Zinc clustering    4               1041, 762, 281, 255
                     PF clustering     10               452, 438, 312, ..., 135, 105, 74
FDFT           836   Zinc clustering    4               665, 100, 56, 15
                     PF clustering     10               164, 107, ..., 56, 44, 34
KCNH2         2870   Zinc clustering    5               1703, 792, 141, 128, 106
                     PF clustering     10               650, 547, ..., 98, 80, 52

Biological
Kinases        382   target proteins    4               138, 89, 79, 76
GPCR           499   target proteins    3               225, 184, 90
RTK-50        1383   publications       7               824, 200, 101, 81, 65, 60, 52
RTK-25        2316   publications      38               911, 204, 56, 47, 43, ..., 25

Chemotypes
DUD(P38)       115   reduced graphs     6               61, 19, 11, 10, 9, 5
DUD(PDGFRB)    109   reduced graphs    10               25, 17, 17, 13, 8, 7, 7, 5, 5, 5
DUD(ACHE)       86   reduced graphs     7               21, 20, 14, 12, 7, 7, 5

Uncertainty sampling17 is used as the active learning strategy. Given the current multiclass classification model $m$ with class-specific rescaled decision function values $\{\tilde{f}_k\}_k$, the confidence $c(x)$ for the prediction of compound $x$ is estimated according to the following heuristic:30,32 let $k^* = \arg\max_k \tilde{f}_k(x)$ and $l^* = \arg\max_{l \neq k^*} \tilde{f}_l(x)$ denote the binary classifiers with the largest and second largest rescaled decision function values. These are the two classifiers for which $x$ has the largest positive distances (or the smallest negative distances) to their hyperplanes. The confidence is calculated as the difference between both, $c(x) = \tilde{f}_{k^*}(x) - \tilde{f}_{l^*}(x)$. The active learner selects the unlabeled compound $x^* = \arg\min_x c(x)$. This is the compound for which the two classifiers with the most positive predictions have the smallest difference in hyperplane distances. In practice, the compounds with the most ambiguous predictions are often helpful in improving model learning as they lie on the classification boundary of the current model. This selection strategy may also select outliers which are hard to predict by any individual model. On the positive side, it makes sense to query such outliers if one wants to achieve perfect classification accuracy on a given data set; on the negative side, selecting outliers might lead to overfitting, so it is important to regularize the individual models sufficiently. Outliers can be of particular interest if they are singletons. While singletons are not helpful in building predictive models, recognizing them during the classification of a compound collection by means of active learning may be desired.

Empirical results for the concrete instantiation of the interactive multiclass classification framework are described in the next section. The training set of labeled compounds is initialized with ten randomly chosen compounds for which labels are provided. Over the course of the interactive learning process, more labeled compounds are added to this training set based on the active learning strategy for compound selection. The cost parameter $C$ of the SVM is chosen with 5-fold cross-validation on the initial random selection of compounds from values in $\{10^i \mid i = -4, -3, \ldots, 3, 4\}$. All binary classifiers use the same $C$. After an additional 100 iterations of active learning, $C$ is rechosen in a second cross-validation run.
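The uncertainty-based selection rule can be illustrated with a short Python sketch. It is not the implementation used in this study: the function names `confidence` and `select_query` and the dictionary layout of the pool are assumptions made for this example, and the rescaled decision values are assumed to have been computed beforehand for every unlabeled compound.

```python
def confidence(decision_values):
    """Confidence c(x): largest minus second largest rescaled decision value."""
    largest, second = sorted(decision_values, reverse=True)[:2]
    return largest - second


def select_query(pool_values):
    """Return the id of the unlabeled compound with the smallest confidence.

    pool_values maps a compound id to its list of rescaled decision values,
    one value per class (hypothetical data layout).
    """
    return min(pool_values, key=lambda cid: confidence(pool_values[cid]))


if __name__ == "__main__":
    pool = {
        "a": [1.2, -0.5, 0.1],    # clear winner, large margin
        "b": [0.3, 0.25, -1.0],   # two classes almost tied
        "c": [-0.2, -0.9, -0.4],  # all values negative, moderate margin
    }
    print(select_query(pool))
```

In this toy pool, compound "b" has the smallest difference between its two largest decision values and would therefore be presented to the chemist next.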

unlabeled compounds. This results in an estimated multiclass classification of the complete compound collection. Then, the next iteration starts, and the process is repeated based on the updated model. Finally, the interactive procedure is stopped when the estimated classification is satisfactory. This can be judged by the medicinal chemist by visually inspecting the current classification or by machine learning heuristics based on validation sets and model inspection.30

The interactive multiclass classification framework can be instantiated with different choices for the molecular descriptors, the underlying machine learning model, and the active learning strategy. As a concrete instance of this framework, an interactive multiclass classification model is proposed using ECFP6 fingerprints, linear support-vector machines (SVMs), and uncertainty sampling. This model accounts for multiclass classification with the one-vs-all strategy, which is known for its empirical effectiveness despite its simplicity.31 For each class $k$, it learns a separate binary classification model $m_k$. The compounds $x_i \in D$ in the training set with label $y_i = k$ constitute the positive examples for learning $m_k$; all other compounds in $D$ form the negative examples. $m_k$ maintains a decision function $f_k: X \to \mathbb{R}$ which maps molecules $x \in X$ to real values. If $f_k(x) > 0$, $m_k$ assigns $x$ to class $k$; otherwise ($f_k(x) \leq 0$), $x$ is classified as not belonging to class $k$. The multiclass classification model $m$ maintains an ensemble of class-specific binary classification models $\{m_k\}_k$. To make predictions, $m$ compares the outputs of these individual models. This requires the outputs to be on a consistent scale. In the proposed model using linear SVMs, the individual models $m_k$ correspond to hyperplanes which are all constructed in the same space from the same training points (but the training points are split into different positive/negative partitions). The individual decision function values $f_k(x)$ are rescaled to values $\tilde{f}_k(x)$, such that $\tilde{f}_k(x)$ denotes the distance of point $x$ to the respective hyperplane of $m_k$. As all hyperplanes are constructed in the same space, the different distances $\{\tilde{f}_k\}_k$ can be compared. For a given molecule $x$, $m$ predicts the class $k^* = \arg\max_k \tilde{f}_k(x)$; this is the class for which $x$ has the largest positive distance to its hyperplane (or the smallest negative distance if $\tilde{f}_k(x) < 0$ for all $k$). If several classes attain the maximum value of $\tilde{f}_k(x)$, then one of these classes is chosen randomly.
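The one-vs-all prediction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: each binary model is assumed to be a linear hyperplane given as a pair (w_k, b_k), the rescaling is assumed to be the division of the raw decision value by the Euclidean norm of w_k (the usual signed point-to-hyperplane distance), and the function names are hypothetical.

```python
import math


def rescaled_decision_values(x, hyperplanes):
    """Return the signed distance of x to every class hyperplane.

    hyperplanes is a list of (w_k, b_k) pairs, one per class (hypothetical
    layout). Dividing f_k(x) = w_k . x + b_k by ||w_k|| is an assumption
    about how the rescaling is carried out.
    """
    distances = []
    for weights, bias in hyperplanes:
        raw = sum(w * xi for w, xi in zip(weights, x)) + bias  # f_k(x)
        norm = math.sqrt(sum(w * w for w in weights))          # ||w_k||
        distances.append(raw / norm)
    return distances


def predict_class(x, hyperplanes):
    """Index of the class with the largest rescaled decision value.

    Ties are resolved in favor of the lowest index here, whereas the text
    describes a random choice among tied classes.
    """
    distances = rescaled_decision_values(x, hyperplanes)
    return max(range(len(distances)), key=lambda k: distances[k])


if __name__ == "__main__":
    models = [([3.0, 4.0], 0.0), ([0.0, 1.0], -1.0)]
    print(rescaled_decision_values([1.0, 1.0], models))
    print(predict_class([-1.0, 3.0], models))
```

Because every value is a distance in the same feature space, the argmax remains meaningful even when all binary models reject the compound, which corresponds to the "smallest negative distance" case in the text.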


Figure 1. Reference results for unsupervised and fully supervised learning. Classification accuracies are shown for comparing true and predicted classes in the test sets. Euclidean and Tanimoto denote unsupervised k-medoids clustering with the respective distance functions and were calculated on all data (including both, training and test sets), but evaluated only on the test data. SVM is a linear support-vector machine which was trained on the training sets only. Black bars denote standard deviations.



EVALUATION Experiments in 15 multiclass classification scenarios were performed. The ground truth classes in these scenarios were created according to different criteria. The goal was to predict these ground-truth classifications from a second independent criterion using machine learning techniques. Extended connectivity fingerprints (ECFP6)25 were chosen as this second descriptor in all experiments. The ECFP6 was calculated using an in-house implementation. Experiment 1 studied whether active learning with a small number of training compounds can lead to a similar performance on an independent test set as supervised learning with all training compounds. Experiment 2 investigated how active learning can accelerate learning a near-optimal classification of a given compound collection. Experiment 3 studied a simple stopping criterion for the interactive classification procedure. Data Sets. Eleven compound collections were used to create 15 different multiclass classification scenarios (Table 1). Different strategies were used to determine ground truth classes in these scenarios. These strategies simulate different application scenarios in which medicinal chemists classify compound collections into multiple classes. The ground truth classes are unknown to the classification algorithms. They need to be learned from example compounds based on ECFP6 descriptors; note that these ECFP6 descriptors were not used in creating the ground truth classes. The first four compound collections served to investigate structurally motivated classifications. Two independent ground truth classifications, Zinc and PF, were studied for each compound collection, based on clustering with different structural molecular descriptors. This illustrates that for a given compound collection different target classifications can be learned. The four compound collections were taken from the ZINC database33 (accessed on September 19, 2014). 
Each collection contains active compounds (binding affinity ≤ 10 μM) for related target proteins across different species (human, rat, mouse, guinea pig). The collection CP19A contains actives for cytochrome P450 19A1 (UniProt P11511 and P22443), DRD4 for dopamine receptor D4 (P21917, P51436, P30729), FDFT for squalene synthetase (P37268, Q02769), and KCNH2 for hERG (Q12809, O08703). The molecules were preprocessed using NAOMI.34

For each molecule a default protonation and tautomeric state was generated35 and duplicate molecules were removed. The Zinc classification is provided by the ZINC database and based on path-based fingerprints representing chemical structure. The classes were created with the sphere-exclusion clustering algorithm employing the Tanimoto similarity measure and a minimal distance of cluster centers of 0.85, using the software JKlustor (ChemAxon)36 in version 5.8.2. For the PF classification, compounds were represented by 2D pharmacophores (calculated with the program GenerateMD, Version 14.7.7.0, ChemAxon). The classes were determined by the k-medoids algorithm using a Euclidean distance function. The number of classes was set to k = 10. Other settings for k led to similar results and are not reported here. k-medoids was run ten times with different random initializations. The clustering with the minimal average distance of compounds to cluster centers was selected to create classes. In addition, four compound collections with biologically motivated target classification were used in the experiments. The collections Kinases and GPCR were created from compounds from the ChEMBL database37 (Version 19) which are selective for target proteins of a functional class. The collections contain only compounds which are active for exactly one of the considered target proteins and inactive for all others. The target proteins determine the classes in each collection. A molecule was considered active if it had a Ki/IC50 value below 1 μM and inactive otherwise. The molecules were preprocessed as described above. The data-set Kinases contains active molecules for the protein kinases CSF-1R (Uniprot P07333), PIM1 (P11309), MAPK9 (P45984), and ROCK1 (Q13464). GPCR contains molecules active on the proteins Adenosin-R.A2a (P29274), Adenosin-R.A2b (P29275), and Adenosin-R.A3 (P33765) of the functional class of GPCR A (nucleotide-similar). 
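For readers unfamiliar with the Tanimoto similarity measure mentioned above for the Zinc classification, the following Python sketch shows the standard definition on binary fingerprints. It assumes that a fingerprint is given as a collection of on-bit indices; the function names are hypothetical, the value returned for two empty fingerprints is a convention chosen for this example, and the sketch does not reproduce the JKlustor implementation.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as on-bit indices."""
    bits_a, bits_b = set(fp_a), set(fp_b)
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def tanimoto_distance(fp_a, fp_b):
    """Distance derived from the similarity as 1 - similarity."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


if __name__ == "__main__":
    # Two shared bits out of four distinct bits in total.
    print(tanimoto_similarity({1, 2, 3}, {2, 3, 4}))
```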
The collections RTK-50 and RTK-25 were constructed from single targets in the ChEMBL database (Version 19). They contain compounds with published Ki/IC50 values for the receptor tyrosine kinases P00533 (CHEMBL203; epidermal growth factor receptor erbB1) and P35968 (CHEMBL279; vascular endothelial growth factor receptor 2), respectively. The classes in each collection correspond to the publications in which compounds appeared. Compounds with multiple publications were removed. In RTK-50, only classes with at


Figure 2. Experiment 1: reduction of training data by active learning. Fractions of the training sets which are required to achieve more than 99% of the test-set classification accuracy of the models learned on the complete training sets. Random chooses the training instances randomly. Active chooses the training instances with active learning.

estimated instead, potentially leading to worse results.) Euclidean and Tanimoto distance functions were investigated. The unsupervised clustering methods were used to cluster all of the data (both training and test sets) at once. For evaluation, only the predicted clusters of the test compounds were taken into account. Unsupervised clustering does not assign classes to clusters. To be able to compute classification accuracies nonetheless, clusters were mapped to classes such that the highest classification accuracy was achieved. For supervised learning, a multiclass SVM with a linear kernel was trained on the training data to learn a classification model and predict classes of the test data. Overall, supervised learning achieves good prediction performance. Unsupervised clustering achieves significantly worse results. This indicates that the true target classes are not trivially contained in the data and cannot be reliably recovered without supervision.

Experiment 1: Reduction of Training Data by Active Learning. Experiment 1 investigates whether accurate supervised learning is possible using a small fraction of actively selected compounds from the training set instead of using the complete training set. The SVM trained on the complete training sets (see above) is used as a baseline. Its performance is compared to the performance of two methods that select subsets of the training compounds for training an SVM: first, a random baseline which chooses training compounds randomly; second, the active learning method. Both methods choose compounds one by one, update their models accordingly, and are evaluated after each update on the independent test sets. A method is evaluated by the number of training compounds required to achieve, for the first time, more than 99% of the classification performance of the baseline model trained on all training data. The results in Figure 2 show the corresponding fractions of the training sets in percent.
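The evaluation criterion of Experiment 1 can be written down as a short Python sketch. It is an illustration only: the function names are hypothetical, and it is assumed that the first entry of `accuracies` is the test accuracy of the model trained on the initial labeled compounds and that each further entry corresponds to one additional selected compound.

```python
def compounds_needed(accuracies, baseline_accuracy, n_initial=10, ratio=0.99):
    """Number of training compounds at which the test accuracy first exceeds
    ratio * baseline_accuracy, or None if this never happens.

    accuracies[0] is assumed to belong to the model trained on the n_initial
    initial compounds; accuracies[i] to the model after i further compounds.
    """
    threshold = ratio * baseline_accuracy
    for step, accuracy in enumerate(accuracies):
        if accuracy > threshold:
            return n_initial + step
    return None


def fraction_needed(accuracies, baseline_accuracy, n_train,
                    n_initial=10, ratio=0.99):
    """Required fraction of a training set of size n_train."""
    needed = compounds_needed(accuracies, baseline_accuracy, n_initial, ratio)
    return None if needed is None else needed / n_train


if __name__ == "__main__":
    curve = [0.50, 0.80, 0.90, 0.95]  # toy learning curve, not measured data
    print(compounds_needed(curve, baseline_accuracy=0.90))
    print(fraction_needed(curve, baseline_accuracy=0.90, n_train=100))
```

Applying the same function to the learning curves of the random and the active selection strategy gives the two fractions that are compared for each data set.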
Both random and active methods require less than 100% of the training data, suggesting that the training sets contain compounds with redundant information. Averaged over the repetitions, active learning requires less than two-thirds of the training data on every data set, and often considerably less. Its required fraction is always substantially smaller than the required fraction for learning with random selection.

Experiment 2: Near-Optimal Classification of Given Collection. Achieving an almost perfect classification of a given compound collection is mandatory in many drug discovery processes. The chemist will therefore carry out as many work steps as are required for this purpose.

least 50 compounds were considered; in RTK-25, only classes with at least 25 compounds. Finally, three expert data sets based on chemotypes served to investigate a real-life application of the proposed method. Compound collections in the DUD database38 were used. Each collection contains active and inactive compounds for a given target. The clusterings performed by Andrew Good39 based on reduced graphs were used as target classifications. Three targets with many compounds and multiple large classes were chosen. For each data set, classes with less than five compounds were removed. This procedure resulted in the data sets DUD(P38), DUD(PDGFRB), and DUD(ACHE). Experiments. Each experiment was run ten times with different random seeds. The random seeds determine the initial training sets for active learning, the selected compounds for the random selection baseline strategy, and initializations of unsupervised clustering algorithms. The algorithms were evaluated by their classification accuracy (CA). Each data set was split randomly into a training (75%) and a test set (25%) for Experiment 1. As a baseline reference to the state-of-the-art, unsupervised clustering and fully supervised learning were applied to the investigated data sets. Both techniques can be viewed as being at the extremes of interaction with a human user: unsupervised learning requires 0% of the training data to be labeled and thus may operate without human workload. However, its performance may be unsatisfactory. In particular, unsupervised clustering cannot account for different classifications of the same data set. On the other hand, fully supervised learning often results in accurate classification models, but requires 100% of the training data to be labeled. 
The active learning method proposed in this paper aims at a trade-off between these two extremes: it may require only a small fraction of the training data to be labeled and thus places less workload on the human user, while still yielding the same performance as fully supervised learning. Reference results for unsupervised clustering and fully supervised learning are shown in Figure 1. The baseline of simply predicting the largest class, which in practice is unknown, would yield classification accuracies of significantly less than 0.5 in most scenarios (not shown in the graph). For unsupervised clustering, each data set was clustered with the k-medoids algorithm based on the ECFP6 descriptors. The value of k was set to the correct number of target classes. (In practice, k is not known and has to be


Figure 3. Experiment 2: near-optimal classification of given collection. Fractions of the complete data sets which are required to learn models with 99% classification accuracy on the remaining compounds not used for training. Random chooses the training instances randomly. Active chooses the training instances with active learning.

Figure 4. Experiment 2: near-optimal classification of given collection, learning curves. The curves show how the classification performance evolves with the number of training compounds by random or active selection of compounds (after initialization with 10 randomly chosen training compounds). The performance is measured as classification accuracy (CA) on the remaining data which have not yet been chosen for learning. The colored bands indicate the standard deviations.

Experiment 2 investigates to what extent active learning can accelerate finding a near-optimal classification. The goal is to achieve near-optimal classification on the complete data set (and not only on a dedicated test set). Therefore, in contrast to Experiment 1, the data sets are not split into training and test partitions here. Instead, the investigated methods select training compounds one by one from the complete data sets. The classification accuracy of the resulting updated models is measured on all remaining unlabeled compounds not yet used for learning. This evaluation procedure rewards recognizing and choosing outlier compounds for labeling: if the goal is a perfect classification of the given data set, it makes sense to request labels for them as they are hard to predict by a classification algorithm. Active learning and learning with random selection are compared. A method is evaluated by the number of training compounds required to achieve, for the first time, a classification accuracy of at least 99% on the remaining compounds. This can be interpreted as having solved the multiclass classification task. Figure 3 shows the results. Active learning always requires fewer compounds than learning with random selection, often decreasing the fraction by more than half. Figure 4 presents typical learning curves for exemplary data sets. The curves indicate how the accuracies on the remaining data evolve for active and random learning with the number of chosen training points. After some iterations, the accuracies of active learning are always better. In one case, random selection performs better

for some initial steps before it is outperformed by active learning: the purely random initial exploration of random selection is beneficial in this scenario. This shows the potential for developing more sophisticated active learning techniques than the uncertainty-based selection proposed in this paper. Table 2 shows the trade-off between multiclass classification accuracy and human workload in terms of the number of feedback questions about class labels. For learning with active and random selection, the number of labeled compounds required to achieve different accuracy levels on the remaining unlabeled compounds is presented. While full manual labeling leads to optimal classifications, it has a high human workload. At the other extreme, unsupervised clustering requires no human work at all, but leads to poor classification accuracies. Interactive multiclass classification with active learning has the best trade-off of classification accuracy and human workload. Experiment 3: Stopping Criterion. In practical applications, it is important to determine when the interactive classification procedure can be stopped. This is the case when the classification model is sufficiently accurate or, if the focus is on learning a classification model and not on perfectly classifying the given compound collection, when learning with the remaining data would not further improve the model. Different heuristics for estimating stopping points have been proposed.3,30 Experiment 3 investigates a simple stopping criterion based on a validation set, containing nv randomly chosen labeled compounds which are not used for training. In F

DOI: 10.1021/acs.jcim.5b00332 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX


ments on their minimum accuracies for stopping may help to alleviate this problem.

Table 2. Experiment 2: Near-Optimal Classification of Given Collection (Workload−Accuracy Trade-off)^a

                          DRD4 Zinc (n = 2339)     CP19A PF10 (n = 1960)    GPCR (n = 499)
method                    no. of labels    CA      no. of labels    CA      no. of labels    CA
guessing largest class    0                0.45    0                0.19    0                0.45
unsupervised              0                0.43    0                0.45    0                0.60
manual                    2339             1.00    1960             1.00    499              1.00
supervised random         72               0.70    80               0.70    12               0.70
                          284              0.80    169              0.80    25               0.80
                          1796             0.90    470              0.90    65               0.90
                          2314             0.95    1044             0.95    192              0.95
                          2328             0.99    1912             0.99    478              0.99
supervised active         67               0.70    140              0.70    13               0.70
                          184              0.80    223              0.80    17               0.80
                          520              0.90    336              0.90    29               0.90
                          781              0.95    445              0.95    45               0.95
                          1332             0.99    638              0.99    104              0.99



DISCUSSION

Experiments were performed in a variety of structurally and biologically motivated as well as real-life compound multiclass classification settings. The results show that standard unsupervised clustering is unlikely to find clusters that are consistent with the true classes. This demonstrates that in the studied data sets, the correct multiclass classifications were not trivially contained in the structural molecular description (ECFP6). For the real-life DUD data sets, the results of unsupervised learning were slightly better: here, the differences between target classes are prominent and are reflected to some degree in different molecular features. For the other data sets, the results of unsupervised clustering were most often poor. As discussed above, the results of unsupervised clustering depend strongly on the choice of molecular description and distance function. While standard choices for both were used in the evaluation (k-medoids with ECFP6 and Euclidean/Tanimoto similarity functions), better results might have been achieved with different choices. However, unsupervised learning algorithms cannot determine which choices are best in a given context. Even for a human expert, this choice is usually difficult or even impossible to make.

Unsurprisingly, supervised learning methods achieve better results than unsupervised clustering. The results further indicate that these improvements can be substantial. In the investigated scenarios, supervised learning led to accurate, and often very accurate, classification models. This implies that the structural molecular descriptor used (ECFP6) does contain the information required to distinguish the true classes (although it was not used to create these classes). This is a necessary precondition for active learning techniques to perform well, as active learning is based on supervised learning. Supervised learning requires a training set of compounds together with class labels. The experimental results show that a small number of actively chosen training compounds is sufficient to learn models that are as accurate as when learning on all training data (Experiment 1). Comparisons to randomly choosing compounds indicate that it is the active selection strategy which is responsible for the large decreases in required compound numbers. The results show that with active learning a given compound collection can often be classified near

^a The workload is measured by the required number of labeled compounds (no. of labels). CA is the classification accuracy on the data set. Guessing largest class assigns all compounds to the largest class. Unsupervised is k-medoids clustering with Euclidean distance. Manual is fully human multiclass classification. Supervised random is supervised learning with random selection of training compounds. Supervised active is the proposed method using active selection.
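To make the workload comparison in Table 2 concrete, the following short sketch computes, for the 99% accuracy level, the fraction of each collection that must be labeled under random and under active selection. The numbers are copied from Table 2; the dictionary layout and the function name `workload` are illustrative choices, not part of the paper.

```python
# Labels needed to reach CA >= 0.99 and collection sizes, copied from Table 2.
TABLE2_AT_99 = {
    "DRD4 Zinc": {"n": 2339, "random": 2328, "active": 1332},
    "CP19A PF10": {"n": 1960, "random": 1912, "active": 638},
    "GPCR": {"n": 499, "random": 478, "active": 104},
}


def workload(row):
    """Fraction of the collection labeled per strategy, and labels saved by active selection."""
    return {
        "random_fraction": row["random"] / row["n"],
        "active_fraction": row["active"] / row["n"],
        "saving_vs_random": 1.0 - row["active"] / row["random"],
    }


if __name__ == "__main__":
    for name, row in TABLE2_AT_99.items():
        w = workload(row)
        print(f"{name}: active {w['active_fraction']:.1%}, "
              f"random {w['random_fraction']:.1%}, saving {w['saving_vs_random']:.1%}")
```

For these three collections, active selection reaches 99% accuracy after labeling roughly 57%, 33%, and 21% of the compounds, whereas random selection needs more than 95% in each case.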

each round, the performance of the current classifier is evaluated on this validation set. If the performance is satisfactory, the interactive classification procedure can be stopped. The active learning component may still choose compounds from the validation set. In this case, the compound is replaced by a randomly chosen compound from the yet unlabeled compounds. The replacing compound is then labeled by the chemist. Overall, using this stopping criterion requires additional nv labelings of the chemist. Figure 5 compares on three exemplary data sets how the accuracies on the validation set (nv = 50) evolve in comparison to the accuracies on all remaining compounds not used for training. Overall the results show that the validation-set accuracies are helpful to estimate the overall classification accuracy. Unsurprisingly, the validation-set accuracies often exhibit a larger variance due to the random subsampling. Thus, they may overestimate the total accuracies. Stronger require-

Figure 5. Experiment 3: stopping criterion. The curves show how the classification performance evolves with the number of training compounds when testing on all compounds not yet used for learning (test all) and when testing on 50 compounds in a validation set (test 50) (after initialization with 10 randomly chosen training compounds). The colored bands indicate the standard deviations.
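The bookkeeping of the validation-set stopping criterion in Experiment 3 can be sketched as follows. This is a minimal illustration under stated assumptions: compounds are represented by integer indices, the function names and the 0.95 threshold are hypothetical, and the replacement rule reflects one reading of the description (a validation compound requested by the active learner moves to the training set, and a random unlabeled compound takes its place in the validation set).

```python
import random


def validation_accuracy(predict, validation, labels):
    """Fraction of validation compounds whose predicted class matches the chemist's label."""
    return sum(predict(i) == labels[i] for i in validation) / len(validation)


def should_stop(predict, validation, labels, min_accuracy=0.95):
    """Stop once the validation-set accuracy reaches the (hypothetical) threshold."""
    return validation_accuracy(predict, validation, labels) >= min_accuracy


def move_to_training(chosen, training, validation, unlabeled, rng):
    """Add the actively chosen compound to the training set.

    If it came from the validation set, refill that set with a random unlabeled
    compound, which costs one additional label from the chemist.
    """
    if chosen in validation:
        validation.remove(chosen)
        if unlabeled:
            replacement = rng.choice(sorted(unlabeled))
            unlabeled.remove(replacement)
            validation.add(replacement)
    else:
        unlabeled.remove(chosen)
    training.add(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    training, validation, unlabeled = {0, 1}, {2, 3, 4}, {5, 6, 7, 8}
    move_to_training(3, training, validation, unlabeled, rng)
    print(sorted(training), sorted(validation), sorted(unlabeled))
```

Keeping the validation set at a constant size means its accuracy estimate stays comparable from round to round, at the price of the extra labels noted in the text.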


training compounds. It is promising to investigate more sophisticated active learning strategies, such as combining strategies that take the data density into account with strategies that focus on refining the classification model. More sophisticated automated heuristics to decide when to stop the interactive learning process30 are also worth studying. In particular, interactive semiautomated stopping criteria appear interesting: they take the feedback of the chemist about the current multiclass classification quality into account.

A limitation of the proposed method is that it assumes class labels to be disjoint: a compound has to belong to exactly one class. Future research should investigate how the proposed method can be extended to the multilabel setting where compounds may belong to multiple classes at the same time. This may be relevant for applications like predicting off-target effects.

In the long run, it appears promising to apply active learning in the scenario of unsupervised clustering of compound collections. In active classification learning, the medicinal chemist directly provides a class label for a compound. In contrast, in active clustering,40 the feedback has to be relative in the sense that cluster membership cannot be decided by looking at a single compound, but has to consider similarities to other compounds. For instance, in active clustering approaches with constraints41 feedback is given about whether two compounds belong to the same cluster. Investigating active clustering in the context of cheminformatics appears to be a fruitful avenue for future research.

perfectly into a large number of classes with feedback for only half or less of the data (Experiment 2). These results are consistent across different types of target classification: structurally and biologically motivated classifications as well as real-life classifications based on chemotypes. This suggests that the success of active learning does not depend on the type of target classification. Rather, it depends on the concrete compound collection. For example, active learning was more successful for both FDFT-Zinc and FDFT-PF10 than for KCNH2-Zinc and KCNH2-PF10: the type of target classification (Zinc, PF10) had less effect than the compound collection (FDFT, KCNH2). Similarly, among the biologically motivated data sets as well as among the real-life data sets, active learning showed different degrees of efficiency gains. For example, the required fraction of training compounds for DUD(P38) was smaller than for DUD(PDGFRB). In Experiment 1, the greatest improvement of active learning over random selection was achieved on some structurally motivated data sets (FDFT-Zinc, CP19A-Zinc). At the same time, active learning on other structurally motivated data sets showed less benefit than on data sets based on biological and real-life classification types. In Experiment 2, the best improvements within each classification-type class were similar (CP19A-Zinc, FDFT-PF10, GPCR, RTK-50, DUD-P38). Overall, these findings suggest that as long as the target classification is learnable in a fully supervised setting, one may expect active learning to accelerate the learning process significantly. The extent of this acceleration depends more on the compound collection than on the target classification type.

As the findings in Experiment 3 indicate, simple heuristics help to determine good stopping points for the interactive procedure. Many improvements on this stopping strategy are conceivable, for example taking the global confidence of the model on the yet unlabeled training data into account.30

Overall, the experimental results show that the proposed interactive multiclass classification method can find a good trade-off between unsupervised clustering and full manual classification. In applications, the workload of the medicinal chemist using this method depends on the desired degree of accuracy: in practice, good, but not perfect, accuracy levels are often satisfactory, resulting in a strongly reduced number of required labeling steps.
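The idea of a stopping rule based on the model's global confidence on the unlabeled data, mentioned above as a conceivable improvement, might look like the following sketch. It is not a criterion evaluated in this paper: the function names, the use of the mean highest class probability as the confidence measure, and the 0.9 threshold are all assumptions made for illustration.

```python
def mean_confidence(class_probabilities):
    """Average of the highest predicted class probability over all unlabeled compounds."""
    return sum(max(p) for p in class_probabilities) / len(class_probabilities)


def confident_enough(class_probabilities, min_confidence=0.9):
    """Stop when nothing is left to label or the model is confident on what remains."""
    if not class_probabilities:
        return True
    return mean_confidence(class_probabilities) >= min_confidence


if __name__ == "__main__":
    # Hypothetical predicted class distributions for three unlabeled compounds.
    pool = [[0.95, 0.03, 0.02], [0.60, 0.30, 0.10], [0.98, 0.01, 0.01]]
    print(mean_confidence(pool), confident_enough(pool))
```

Unlike the validation-set criterion, such a rule needs no additional labels, but it relies on the model's probability estimates being reasonably calibrated.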



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The work of U.L. has been supported by the German Research Foundation (LU1718/1).



REFERENCES

(1) Lewis, D. D.; Catlett, J. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the International Conference on Machine Learning, New Brunswick, NJ, July 10−13; Cohen, W. W., Hirsh, H., Eds.; Morgan Kaufmann Publishers: San Francisco, CA, 1994; pp 148−156.
(2) Cohn, D.; Ghahramani, Z.; Jordan, M. Active Learning with Statistical Models. J. Art. Intell. Res. 1996, 4, 129−145.
(3) Settles, B. Active Learning; Synthesis Lectures on Artificial Intelligence and Machine Learning; Morgan & Claypool: San Rafael, CA, 2012.
(4) Olsson, F. A Literature Survey of Active Machine Learning in the Context of Natural Language Processing; Tech Report, Swedish Institute of Computer Science, 2009.
(5) Qi, G.; Hua, X.; Rui, Y.; Tang, J.; Zhang, H. Two-Dimensional Active Learning for Image Classification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, June 24−26; IEEE Computer Society: Piscataway, NJ, 2008.
(6) Cakmak, M.; Chao, C.; Thomaz, A. L. Designing Interactions for Robot Active Learners. IEEE T. Autonomous Mental Development 2010, 2, 108−118.
(7) Kulick, J.; Toussaint, M.; Lang, T.; Lopes, M. Active Learning for Teaching a Robot Grounded Relational Symbols. In Proceedings of the Int. Joint Conf. on Artificial Intelligence, Beijing, China, August 3−9; Rossi, F., Ed.; AAAI Press: Menlo Park, CA, 2013.



CONCLUSIONS AND OUTLOOK

This paper presented a feasibility study of active learning for multiclass classification of compounds. The proposed method is based on the active learning strategy of selecting for training those compounds about which the current multiclass classification model is least certain. Empirical results on collections with 86−2870 compounds in 3−38 classes demonstrate the effectiveness of this method. The active learning technique could solve these tasks using only 10−80% of the data which would be necessary for standard learning techniques. This indicates that active learning is beneficial not only for regression and binary classification tasks in cheminformatics, but also for multiclass classification of compound collections. In a broader sense, this work provides empirical evidence that active machine learning is suitable for multiclass classification in general. This work has focused on a straightforward choice of an active learning algorithm that was sufficient to outperform standard supervised learning in terms of the number of required


(8) Danziger, S. A.; Baronio, R.; Ho, L.; Hall, L.; Salmon, K.; Hatfield, G. W.; Kaiser, P.; Lathrop, R. H. Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning. PLoS Comput. Biol. 2009, 5, 1−12.
(9) Mohamed, T. P.; Carbonell, J. G.; Ganapathiraju, M. K. Active Learning for Human Protein-Protein Interaction Prediction. BMC Bioinf. 2010, 11, 1−9.
(10) Körner, C.; Wrobel, S. Multi-class Ensemble-Based Active Learning. In Proceedings of the European Conference on Machine Learning, Berlin, Germany, September 18−22; Fürnkranz, J., Scheffer, T., Spiliopoulou, M., Eds.; Springer: Berlin, Germany, 2006; pp 687−694.
(11) Jain, P.; Kapoor, A. Active Learning for Large Multi-Class Problems. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, Florida, June 20−25; IEEE Computer Society: Piscataway, NJ, 2009; pp 762−769.
(12) Joshi, A. J.; Porikli, F.; Papanikolopoulos, N. Multi-Class Active Learning for Image Classification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, Florida, June 20−25; IEEE Computer Society: Piscataway, NJ, 2009; pp 2372−2379.
(13) Yang, Y.; Ma, Z.; Nie, F.; Chang, X.; Hauptmann, A. Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization. Int. J. of Computer Vision 2015, 113, 113−127.
(14) Murphy, R. F. An Active Role for Machine Learning in Drug Development. Nat. Chem. Biol. 2011, 7, 327−330.
(15) Reker, D.; Schneider, G. Active-Learning Strategies in Computer-Assisted Drug Discovery. Drug Discovery Today 2015, 20, 458−465.
(16) Warmuth, M. K.; Liao, J.; Rätsch, G.; Mathieson, M.; Putta, S.; Lemmen, C. Active Learning with Support Vector Machines in the Drug Discovery Process. J. Chem. Inf. Model. 2003, 43, 667−673.
(17) Tong, S.; Koller, D. Support Vector Machine Active Learning with Applications to Text Classification. J. Mach. Learning Res. 2001, 2, 45−66.
(18) Fujiwara, Y.; Yamashita, Y.; Osoda, T.; Asogawa, M.; Fukushima, C.; Asao, M.; Shimadzu, H.; Nakao, K.; Shimizu, R. Virtual Screening System for Finding Structurally Diverse Hits by Active Learning. J. Chem. Inf. Model. 2008, 48, 930−940.
(19) Desai, B.; Dixon, K.; Farrant, E.; Feng, Q.; Gibson, K.; van Hoorn, W.; Mills, J.; Morgan, T.; Parry, D.; Ramjee, M.; Selway, C.; Tarver, G.; Whitlock, G.; Wright, A. Rapid Discovery of a Novel Series of Abl Kinase Inhibitors by Application of an Integrated Microfluidic Synthesis and Screening Platform. J. Med. Chem. 2013, 56, 3033−3047.
(20) De Grave, K.; Ramon, J.; De Raedt, L. Active Learning for High Throughput Screening. Lecture Notes in Computer Science: Discovery Science 2008, 5255, 185−196.
(21) Besnard, J.; Ruda, G. F.; Setola, V.; Abecassis, K.; Rodriguiz, R. M.; Huang, X.-P.; Norval, S.; Sassano, M. F.; Shin, A. I.; Webster, L. A.; Simeons, F. R.; Stojanovski, L.; et al. Automated Design of Ligands to Polypharmacological Profiles. Nature 2012, 492, 215−220.
(22) Garnett, R.; Gärtner, T.; Vogt, M.; Bajorath, J. Introducing the 'Active Search' Method for Iterative Virtual Screening. J. Comput.-Aided Mol. Des. 2015, 29, 305−314.
(23) Kangas, J.; Naik, A. W.; Murphy, R. F. Efficient Discovery of Responses of Proteins to Compounds using Active Learning. BMC Bioinf. 2014, 15, 143−153.
(24) Naik, A. W.; Kangas, J. D.; Langmead, C. J.; Murphy, R. F. Efficient Modeling and Active Learning Discovery of Biological Responses. PLoS One 2013, 8, 1−13.
(25) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(26) Friedman, E. Active Learning for Smooth Problems. In Proceedings of the Conference on Learning Theory, Montreal, Quebec, June 18−21, 2009.
(27) Seung, H. S.; Opper, M.; Sompolinsky, H. Query by Committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, July 27−29; ACM: New York, NY, 1992; pp 287−294.
(28) Yu, K.; Bi, J.; Tresp, V. Active Learning via Transductive Experimental Design. In Proceedings of the Int. Conf. on Machine Learning, Pittsburgh, Pennsylvania, June 25−29; Cohen, W. W., Moore, A., Eds.; ACM: New York, NY, 2006.
(29) Nguyen, H.; Smeulders, A. Active Learning Using Pre-Clustering. In Proceedings of the Int. Conf. on Machine Learning, Banff, Alberta, July 4−8; Brodley, C. E., Ed.; ACM: New York, NY, 2004; pp 623−630.
(30) Vlachos, A. A Stopping Criterion for Active Learning. Comput. Speech Lang. 2008, 22, 295−312.
(31) Rifkin, R.; Klautau, A. In Defense of One-vs-All Classification. J. Mach. Learning Res. 2004, 5, 101−141.
(32) Schapire, R. E.; Freund, Y.; Bartlett, P.; Lee, W. S. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In Proceedings of the Int. Conf. on Machine Learning, Nashville, TN, July 8−12; Fisher, D. H., Ed.; Morgan Kaufmann: Nashville, TN, 1997; pp 322−330.
(33) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(34) Urbaczek, S.; Kolodzik, A.; Fischer, J. R.; Lippert, T.; Heuser, S.; Groth, I.; Schulz-Gasch, T.; Rarey, M. NAOMI: On the Almost Trivial Task of Reading Molecules from Different File Formats. J. Chem. Inf. Model. 2011, 51, 3199−3207.
(35) Urbaczek, S.; Kolodzik, A.; Rarey, M. The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States. J. Chem. Inf. Model. 2014, 54, 756−766.
(36) ChemAxon Ltd., JKlustor. http://www.chemaxon.com/products/jklustor/ (accessed Dec 12, 2014).
(37) Bento, A.; Gaulton, A.; Hersey, A.; Bellis, L.; Chambers, J.; Davies, M.; Kruger, F.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, 1083−1090.
(38) Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789−6801.
(39) Good, A.; Oprea, T. Optimization of CAMD Techniques 3. Virtual Screening Enrichment Studies: A Help or Hindrance in Tool Selection? J. Comput.-Aided Mol. Des. 2008, 22, 169−178.
(40) Wauthier, F. L.; Jojic, N.; Jordan, M. I. Active Spectral Clustering via Iterative Uncertainty Reduction. In Proceedings of the Int. Conf. on Knowledge Discovery and Data Mining, Beijing, China, August 12−16; Yang, Q., Agarwal, D., Pei, J., Eds.; ACM: New York, NY, USA, 2012; pp 1339−1347.
(41) Basu, S.; Banerjee, A.; Mooney, R. J. Active Semi-Supervision for Pairwise Constrained Clustering. In Proceedings of the Int. Conf. on Data Mining, Lake Buena Vista, Florida, April 22−24; Berry, M. W., Dayal, U., Kamath, C., Skillicorn, D. B., Eds.; SIAM: Philadelphia, PA, 2004; pp 333−344.
