
Cite This: J. Chem. Inf. Model. 2017, 57, 2490-2504

Demystifying Multitask Deep Neural Networks for Quantitative Structure−Activity Relationships

Yuting Xu,*,† Junshui Ma,† Andy Liaw,† Robert P. Sheridan,‡ and Vladimir Svetnik†

†Biometrics Research Department, Merck & Co., Inc., Rahway, New Jersey 07065, United States
‡Modeling and Informatics Department, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States

Received: February 14, 2017. Published: September 5, 2017.





ABSTRACT: Deep neural networks (DNNs) are complex computational models that have found great success in many artificial intelligence applications, such as computer vision1,2 and natural language processing.3,4 In the past four years, DNNs have also generated promising results for quantitative structure−activity relationship (QSAR) tasks.5,6 Previous work showed that DNNs can routinely make better predictions than traditional methods, such as random forests, on a diverse collection of QSAR data sets. It was also found that multitask DNN models (those trained on and predicting multiple QSAR properties simultaneously) outperform DNNs trained separately on the individual data sets in many, but not all, tasks. To date there has been no satisfactory explanation of why the QSAR of one task embedded in a multitask DNN can borrow information from other unrelated QSAR tasks. Thus, using multitask DNNs in a way that consistently provides a predictive advantage becomes a challenge. In this work, we explored why multitask DNNs make a difference in predictive performance. Our results show that during prediction a multitask DNN does borrow "signal" from molecules with similar structures in the training sets of the other tasks. However, whether this borrowing leads to better or worse predictive performance depends on whether the activities are correlated. On the basis of this, we have developed a strategy to use multitask DNNs that incorporate prior domain knowledge to select training sets with correlated activities, and we demonstrate its effectiveness on several examples.



INTRODUCTION

Over the past 10 years, deep neural networks (DNNs) have come to dominate the machine learning field and have made dramatic improvements in many domains,1,7 such as speech recognition,8 object detection,9 genomics,10 and drug discovery.5,11 In 2012, this laboratory sponsored a data science competition via Kaggle (www.kaggle.com/c/MerckActivity) to find state-of-the-art methods for quantitative structure−activity relationship (QSAR) modeling in drug discovery. Using DNNs, the winning team achieved a 15% improvement in prediction accuracy on the given data sets over the internal baseline performance using random forests (RF).6 The success of DNNs was inspiring. RF has served as a "gold standard" among the various QSAR methods in this laboratory since its introduction into the QSAR field by Svetnik et al.12 Compared with other QSAR methods, RF has the advantages of high prediction accuracy, ease of use, and robustness toward adjustable parameters. In the past 10 years, no QSAR method other than DNNs had outperformed RF by such a margin. However, the many tunable parameters in neural networks are difficult to optimize without in-depth knowledge of, and familiarity with, DNN models. Previous work by Ma et al.5 studied the impacts of different DNN parameters on the predictive performance and recommended a single set of parameters that performs well for many diverse QSAR tasks. It also confirmed that DNNs generally achieve better predictive performance than RF.

DNNs have since become a routine QSAR method for drug discovery in the pharmaceutical industry. However, according to Ma et al.,5 there is an unsolved puzzle regarding the use of multitask DNNs. Multitask DNNs, which were called "joint DNNs" by Ma et al.,5 can simultaneously model more than one molecular activity (task). All activities share the same input and hidden layers, but each activity has its own output node. A more detailed explanation of multitask DNNs is provided in Methods. For clarity, we will refer to DNNs that model each QSAR activity separately as single-task DNNs. While neural network architectures with multiple outputs are commonly used in other machine learning applications, such as classification tasks for multiple objects,13−15 joint modeling of several tasks has not been a standard approach in QSAR. Multitask DNNs for QSAR were first introduced by the winning team in the Kaggle QSAR competition, led by George Dahl, and this was described in a follow-up interview16 as the "single most important insight into the data". However, the winning team had an explanation only at the conceptual level: the multitask architecture facilitated reuse of features learned from multiple tasks and shared statistical strength. Exactly how the information was shared across different tasks remained unclear. It was also not obvious to chemists how multitask models can


usefully borrow information from tasks that have very different natures, as was the case for the Kaggle competition. Furthermore, although multitask DNNs usually outperform single-task DNNs when averaged over many tasks, the predictive performance of multitask DNNs can be worse for specific tasks. Therefore, in order to use multitask DNNs effectively, an in-depth understanding of when and how they work becomes critically important.

In the cheminformatics literature, an analogous approach called transfer learning has been widely studied. Pan17 and Weiss18 provide comprehensive reviews of a variety of transfer learning algorithms. The goal of transfer learning is to improve a learner in one domain by transferring information from another domain. In contrast to traditional machine learning methods, this allows the training and test data to have different distributions or come from different feature spaces. When the training data of the target task are insufficient, there is a critical need for transfer learning, which builds models trained with easily obtained data from different tasks. The major challenge for transfer learning is to identify what knowledge should be transferred and how to transfer helpful information while avoiding transfer of information that reduces predictivity. The latter effect, called "negative transfer", is not well understood and requires further investigation.17−19 Multitask training of neural networks4,20 is one of the techniques for inductive transfer learning, where the target domain has labeled training data available. However, the inductive transfer learning problem focuses only on improving performance in the target task, while multitask learning usually tries to build a better model for multiple tasks simultaneously.17

Few examples of the application of multitask learning in QSAR have been reported in the transfer learning literature. Varnek et al.21 compared the performance of conventional single-task learning and two inductive transfer approaches, multitask learning and feature nets,22 implemented using associative neural network23 and partial least-squares methods. Their empirical results showed that the prediction accuracy of associative neural network models increases with the number of simultaneously modeled tasks. However, that study focused on the common transfer learning setting, where the training data are scarce and single-task modeling is unable to produce any predictive model: the experimental data sets used for comparison varied from 27 to 138 compounds, and the neural network had only one hidden layer with four neurons, which is very small compared with current deep learning models. In our QSAR problems, by contrast, there are abundant labeled training data for each task, sufficient to build single-task models.

In the early multitask learning work by Caruana,20 several hypotheses for the better performance of multitask learning over single-task learning were proposed, including data amplification, eavesdropping, attribute selection, representation bias, and overfitting prevention. That work provided a comprehensive discussion of the general multitask learning mechanism. However, because of the limits of computing technology at the time, most of its conclusions are based on small-scale shallow neural networks. With the rapid development of machine learning techniques in recent years, further investigation of state-of-the-art multitask learning methods is needed. In this work, we focused on understanding the difference in performance of multitask and single-task DNNs on QSAR tasks; it is unclear whether the conclusions in the previous literature are applicable to this problem.

Two hypotheses regarding the performance edge of multitask DNNs were proposed by Ma et al.:5
• Hypothesis 1: Training a multitask DNN has a regularization effect. Incorporating several diverse tasks into a multitask DNN produces a regularization effect that counters potential overfitting, a common problem for DNNs with a substantial number of parameters, especially for QSAR tasks with relatively small training sets. This hypothesis was also suggested by Caruana20 as the overfitting prevention hypothesis.
• Hypothesis 2: Multitask DNNs can take advantage of larger training data sets. A multitask DNN can learn better from the additional training data in multiple QSAR data sets, which is especially beneficial for tasks with smaller training sets. On the other hand, for the very large QSAR data sets in the Kaggle competition, single-task DNNs performed better since they already had a sufficient amount of training data.

Ma et al.5 did not examine either of those hypotheses. Instead, they gave a preliminary recommendation, based on heuristic observations, that smaller data sets should be combined in a multitask DNN model. Dahl et al.6 provided more empirical results on the comparison between multitask DNNs and other methods for QSAR problems, but they did not explain how and when there is an advantage to using multitask DNNs over single-task DNNs. Subsequently, additional investigations into multitask DNNs were published by other research groups.24,25 They again observed the generally higher predictive performance of multitask DNNs but did not shed more light on the reason. Therefore, investigation is needed into why a multitask DNN can outperform separate single-task DNNs and under what scenarios multitasking has an advantage.

In this effort, we focused on demystifying multitask DNNs by exploring their underlying mechanism. As a result, we found both of the hypotheses regarding multitask DNNs stated by Ma et al.5 to be incorrect. This study explains why multitask DNNs outperformed single-task DNNs on average on the Kaggle data sets. Given the reason, we propose a strategy for constructing multitask DNNs to achieve better prediction for QSAR tasks in a realistic industrial drug discovery setting. Several application examples are provided in the Results to demonstrate and validate our finding and proposed strategy. The multitask DNN code is provided at https://github.com/Merck/DeepNeuralNet-QSAR to facilitate independent validation of our findings.



METHODS

Data Sets. In this study, the Kaggle data sets were used to study the mechanism of multitask DNNs, and the CYP data sets were used to validate our findings regarding multitask DNNs. These data sets are described as follows:
• The Kaggle data sets are the 15 QSAR tasks used for the Kaggle competition. The tasks are of various sizes (i.e., 2000−50000 molecules each) for either on-target potency or off-target absorption, distribution, metabolism, and excretion (ADME) activities. They were explained in detail and provided in their complete form by Ma et al.5 In these data sets, only a small fraction of molecules have measured activities in more than one task. Also, the activities are mostly unrelated.
• In the CYP data sets, a set of 49 550 molecules have measured activities for three CYP isoenzymes: 3A4, 2C9, and 2D6. The 3A4 data set in the Kaggle data sets contains the same data except that it has only the 3A4 activities. Unlike in the Kaggle sets, the three activities in the CYP data sets were measured on the same set of molecules, and the activities are somewhat correlated since the isoenzymes are from the same cytochrome P450 family.

Table 1. Description of the Data Sets

| data set | type | description | number of molecules | number of unique AP/DP descriptors |
| --- | --- | --- | --- | --- |
| Kaggle Data Sets | | | | |
| 3A4 | ADME | CYP P450 3A4 inhibition −log(IC50/M) | 50000 | 9491 |
| CB1 | target | binding to cannabinoid receptor 1 −log(IC50/M) | 11640 | 5877 |
| DPP4 | target | inhibition of dipeptidyl peptidase 4 −log(IC50/M) | 8327 | 5203 |
| HIVINT | target | inhibition of HIV integrase in a cell based assay −log(IC50/M) | 2421 | 4306 |
| HIVPROT | target | inhibition of HIV protease −log(IC50/M) | 4311 | 6274 |
| LOGD | ADME | logD measured by HPLC | 50000 | 8921 |
| METAB | ADME | percent remaining after 30 min of microsomal incubation | 2092 | 4595 |
| NK1 | target | inhibition of neurokinin1 (substance P) receptor binding −log(IC50/M) | 13482 | 5803 |
| OX1 | target | inhibition of orexin 1 receptor −log(Ki/M) | 7135 | 4730 |
| OX2 | target | inhibition of orexin 2 receptor −log(Ki/M) | 14875 | 5790 |
| PGP | ADME | transport by p-glycoprotein log(BA/AB) | 8603 | 5135 |
| PPB | ADME | human plasma protein binding log(bound/unbound) | 11622 | 5470 |
| RAT-F | ADME | log(rat bioavailability) at 2 mg/kg | 7821 | 5698 |
| TDI | ADME | time-dependent 3A4 inhibition log[(IC50 without NADPH)/(IC50 with NADPH)] | 5559 | 5945 |
| THROMBIN | target | human thrombin inhibition −log(IC50/M) | 6924 | 5552 |
| CYP Data Sets | | | | |
| 3A4 | ADME | CYP P450 3A4 inhibition −log(IC50/M) | 49550 | 9177 |
| 2C9 | ADME | CYP P450 2C9 inhibition −log(IC50/M) | 49550 | 9177 |
| 2D6 | ADME | CYP P450 2D6 inhibition −log(IC50/M) | 49550 | 9177 |

Figure 1. Visualization of the molecular activities in the Kaggle data sets. Activities of a subset of molecules in the Kaggle data sets are shown. We randomly sampled 100 molecules from the data set for each of the 15 QSAR tasks and plotted the 15 activities of these selected molecules. Almost all of the molecules were tested for fewer than three activities. The unmeasured activities are displayed in gray. The molecular activities from different tasks have been scaled to the same range based on the mean and standard deviation for the purpose of visualization.

Table 1 provides a detailed description of all of the data sets. The Kaggle data sets were provided in the Supporting Information of the paper by Ma et al.5 The additional proprietary CYP data sets are provided in the Supporting Information of the present article with the descriptor names disguised.

Figures 1 and 2 show the activities of the Kaggle and CYP data sets. We randomly sampled 100 molecules from each Kaggle data set and stacked them together to form a set of 1500 molecules. Their activities are displayed in Figure 1, in which each row corresponds to one molecule and each column to a QSAR task. If the activity of a task for a molecule is unavailable, it is colored in gray. Since the activities of the 15 Kaggle data sets were separately generated for different tasks, only a few molecules have activities across different tasks. Thus, the activities of the Kaggle data sets exemplify a "sparse" pattern. Figure 2a shows the activities in the three CYP data sets. Since every molecule was measured in all three assays, the activities of the CYP data sets exhibit a "dense" pattern. The molecules were generally ordered according to their assay dates, and the black horizontal line separates each data set into the training set (bottom) and the test set (top). We can make the activities in the CYP training sets "sparse" by removing from them two-thirds of the molecules for each of the three activities, as shown in Figure 2b. These "partial CYP" data sets will be used subsequently to validate our findings and proposed strategy for using multitask DNNs.

Figure 2. Visualizations of the molecular activities in the CYP data sets. A comparison between (left) the dense full CYP data sets and (right) the sparse partial CYP data sets is shown. In both panels, molecules numbered >37216, which are shown above the black horizontal lines, constitute the time-split test sets. In order to make the sparse data sets (right) from the original dense CYP data sets (left), we kept only the first one-third of training set molecular activities for 2C9, the middle one-third for 2D6, and the last one-third for 3A4. The test sets were unchanged. The missing activities are displayed in gray. The molecular activities from different tasks have been scaled to the same range for the purpose of visualization.

For evaluation of QSAR methods, each data set is partitioned into two subsets: a training set and a test set, as shown in Figure 2. The training set is used to build the QSAR models, and the test set is used to evaluate the accuracy of the predictions from the models. The training and test sets were derived by "time split": in each data set, the 75% of the molecules that were assayed first were put into the training set, and the remaining 25% of the molecules assayed later were included in the test set. We have found that the predictive performance estimated by time-split validation is closer to that of true prospective prediction than is the accuracy estimated by the more common random-split validation, which tends to be optimistic.26 In a real-world drug discovery environment, QSAR models are used for prospective prediction, which means we always build models on currently available molecules and make predictions for molecules that have not yet been tested/assayed. The untested molecules may or may not be similar to the molecules in the training set. The time split better simulates such a scenario. Since the training set and test set are not randomly selected from the same pool of molecules, this presents a challenge for many machine learning methods.

Descriptor. Each molecule is represented as a list of features (i.e., "descriptors" in QSAR nomenclature). In this work, we combined two descriptor types: atom pair (AP) descriptors from Carhart et al.27 and donor−acceptor pair (DP) descriptors, called BP in Kearsley et al.28 Both types of descriptors are of the following form:

atom type i − (distance in bonds) − atom type j

For AP, the atom type includes the element, number of nonhydrogen neighbors, and number of π electrons. For DP, the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, or other).
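As a concrete toy (our illustration, not the authors' descriptor code; a production implementation would derive atom types with a chemistry toolkit such as RDKit), the sketch below counts AP-style features over a small hand-coded heavy-atom graph:

```python
from collections import deque, Counter

def bond_distances(adj, start):
    """Shortest path length (in bonds) from `start` to every other atom via BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

def atom_pair_descriptors(atom_types, adj):
    """Count Carhart-style AP features (type_i, bonds, type_j), order-independent."""
    counts = Counter()
    for i in range(len(atom_types)):
        dist = bond_distances(adj, i)
        for j in range(i + 1, len(atom_types)):
            t1, t2 = sorted([atom_types[i], atom_types[j]])
            counts[(t1, dist[j], t2)] += 1
    return counts

# Toy example: an acrolein-like fragment C=C-C=O (heavy atoms only).
# AP atom type = element . number of non-hydrogen neighbors . number of pi electrons.
atom_types = ["C.1.1", "C.2.1", "C.2.1", "O.1.1"]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for feature, n in sorted(atom_pair_descriptors(atom_types, adj).items()):
    print(feature, n)
```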

Single-Task and Multitask DNNs. A neural network model has a hierarchical network structure consisting of multiple layers. The lowest layer is the "input layer", which accepts the molecular descriptors. The "output layer" at the top produces the predicted activities. In a single-task neural network, there is a single neuron in the top layer; in a multitask neural network for N data sets, there are N neurons in the top layer. In the middle are one or more "hidden layers", which generate a very complex nonlinear transformation from the input variables to the output variable(s). A "deep" neural network has more than one hidden layer and is able to model more complicated interactions or relationships among the input variables. A basic neural network model is characterized by three key components:29
1. The interconnections between nodes. The parameters that represent the strengths of the interconnections are called "weights". The input signal for a node is a weighted sum of the outputs of the nodes in the previous layer.
2. The activation function, a nonlinear function that converts the weighted sum of the input signals into the output of each node.
3. The optimization algorithm, which adjusts the weights to best fit the activities.
The learning process for updating the weights is called "training", and it occurs in an iterative fashion. During each optimization step, the weights are adjusted to reduce the difference between the predictions and the measured activities. For regression models, the typical cost function for optimization is the mean-square error (MSE). Because of the layered structure of a neural network, the training procedure is usually called "back-propagation" of errors. There are many types of neural networks designed for different engineering tasks, such as convolutional neural networks for two-dimensional image data and recurrent neural networks for sequences of inputs. We used the fully connected feed-forward neural network trained by back-propagation,30 which has been shown by Ma et al.5 to be an appropriate model for QSAR prediction tasks. Conventional multitask DNN programs can naturally handle the case where the activities of multiple training sets are in a "dense" form, like those of the full CYP data sets in Figure 2a. For "sparse" activities, like those shown in Figures 1 and 2b, Dahl et al.6 proposed a slightly revised version of the conventional training procedure that essentially treats the unavailable activities as missing data.
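A minimal sketch of this architecture and of the missing-data trick, written in PyTorch for readability rather than the authors' gnumpy/cudamat code (the layer widths and data below are toy assumptions), masks out the squared error at unmeasured entries so that sparse activity matrices contribute no gradient there:

```python
import torch
import torch.nn as nn

class MultitaskDNN(nn.Module):
    """Shared hidden layers feeding one output neuron per QSAR task."""
    def __init__(self, n_descriptors, n_tasks, hidden=(128, 64)):
        super().__init__()
        layers, width = [], n_descriptors
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        self.shared = nn.Sequential(*layers)    # hidden layers shared by all tasks
        self.heads = nn.Linear(width, n_tasks)  # one output node per task

    def forward(self, x):
        return self.heads(self.shared(x))

def masked_mse(pred, target, mask):
    """MSE over measured activities only; missing ('sparse') entries get no gradient."""
    return ((pred - target) ** 2 * mask).sum() / mask.sum()

# Toy sparse data: 32 molecules, 100 descriptors, 3 tasks, roughly half unmeasured.
x = torch.randn(32, 100)
target = torch.randn(32, 3)
mask = (torch.rand(32, 3) > 0.5).float()      # 1 where an activity was measured

model = MultitaskDNN(n_descriptors=100, n_tasks=3)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)
opt.zero_grad()
loss = masked_mse(model(x), target, mask)
loss.backward()
opt.step()
print(float(loss))
```

A single-task DNN is simply the special case n_tasks = 1; the released code may differ from this sketch in detail.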
Assistant Data Sets and Reduced Assistant Data Sets. When we are interested in building a predictive model for a certain QSAR task, we call the task of interest the "primary data set" or "primary task". In order to benefit from the superior performance of multitask DNNs, we need to collect training data on other QSAR activities to be combined with the primary data set. Such data sets are called "assistant data sets" or "assistant tasks". In a later section, we will introduce a strategy for selecting such assistant data sets for a primary task based on the insights into multitask DNNs that we gained in this study.

When the number of molecules in the assistant tasks is so large that training the multitask DNNs becomes computationally expensive, it is desirable to select subsets of molecules from the assistant tasks to reduce the computational burden without significantly degrading the performance of the multitask DNNs. These subsets can be called "reduced assistant data sets". The reduced assistant data sets were constructed using a molecular similarity measure, here the Dice similarity metric with the AP descriptor.27,31,32 For each molecule in the test set of the primary task, first the nearest neighbor in the training set of the primary task was found, and the similarity with this nearest neighbor was recorded as a baseline; then the molecules in the training sets of the assistant tasks that were more similar to the test set molecule than the baseline were selected. Those molecules form the reduced assistant data sets. This procedure is illustrated in Figure 3.

Figure 3. Construction of reduced assistant training sets.
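This selection rule can be sketched as follows, assuming count fingerprints stored as dictionaries and toy data (the helper names are ours, not from the released code):

```python
def dice(fp1, fp2):
    """Dice similarity between count fingerprints stored as {feature: count} dicts."""
    shared = sum(min(c, fp2[k]) for k, c in fp1.items() if k in fp2)
    total = sum(fp1.values()) + sum(fp2.values())
    return 2.0 * shared / total if total else 0.0

def reduced_assistant_set(primary_test, primary_train, assistant_train):
    """Keep every assistant-training molecule that is more similar to some primary
    test molecule than that molecule's nearest neighbor in the primary training set."""
    keep = set()
    for t in primary_test:
        baseline = max(dice(t, m) for m in primary_train)   # NN similarity baseline
        keep.update(i for i, a in enumerate(assistant_train) if dice(t, a) > baseline)
    return [assistant_train[i] for i in sorted(keep)]

# Toy AP-style count fingerprints.
primary_test = [{1: 2, 2: 1}]
primary_train = [{1: 1, 3: 2}]
assistant_train = [{1: 2, 2: 1, 4: 1}, {5: 3}]
print(len(reduced_assistant_set(primary_test, primary_train, assistant_train)))  # 1
```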

Performance Metrics. To evaluate the predictive performance, we used R2, the same metric as was used in the QSAR Kaggle competition and by Ma et al.5 It is defined as the square of the Pearson correlation coefficient between the predicted and observed activities in the test set. R2 measures the degree of concordance between the predictions and the corresponding observations. It is attractive for comparing models across many data sets with different values and natures of the activities because it is unitless and ranges from 0 to 1 for all data sets. We also found that the conclusions based on R2 are generally in agreement with those based on other popular performance metrics, such as the root-mean-square error (RMSE).

Recommended DNN Parameter Settings. Most previous work on applying DNNs to QSAR problems optimized the adjustable parameters of the neural net model separately for each task, which is unrealistic for practical use. We used the set of parameters recommended by Ma et al.,5 which has been shown to work well for most QSAR tasks in both single- and multitask DNNs with time-split training and test sets. The recommended settings are as follows:
• Data preprocessing: logarithmic transformation of the inputs (i.e., y = log(x + 1)).
• The DNNs should have four hidden layers. The recommended numbers of neurons in these four hidden layers are 4000, 2000, 1000, and 1000, respectively.
• The recommended dropout rates are 0 in the input layer, 25% in the first three hidden layers, and 10% in the last hidden layer.
• The activation function is the rectified linear unit (ReLU).
• No unsupervised pretraining should be used. The network parameters should be initialized as random values.
• The parameters for the optimization procedure were fixed at their default values; that is, the learning rate was 0.05, the momentum strength was 0.9, and the weight cost strength was 0.0001.

Implementation. The multitask DNN algorithm was implemented in Python and based on the code in the winning entry of the QSAR Kaggle competition. The Python modules gnumpy33 and cudamat34 were used to implement GPU computing. The hardware platforms used in this study were two NVIDIA Tesla C2070 GPU cards and eight Tesla K80 GPU cards. The code is available at https://github.com/Merck/DeepNeuralNet-QSAR, and training times on the Kaggle data sets are provided in the Supporting Information.
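For reference, the recommended settings above can be rendered as follows (a hedged sketch in PyTorch rather than the authors' gnumpy implementation; the descriptor count 9491 and the task count 15 are taken from Table 1 purely as an example):

```python
import torch
import torch.nn as nn

def recommended_qsar_dnn(n_descriptors, n_tasks):
    """Four hidden layers (4000/2000/1000/1000), ReLU, dropout 25%/25%/25%/10%,
    no input-layer dropout, random initialization (no unsupervised pretraining)."""
    sizes = (4000, 2000, 1000, 1000)
    drops = (0.25, 0.25, 0.25, 0.10)
    layers, width = [], n_descriptors
    for h, p in zip(sizes, drops):
        layers += [nn.Linear(width, h), nn.ReLU(), nn.Dropout(p)]
        width = h
    layers.append(nn.Linear(width, n_tasks))   # one output node per task
    return nn.Sequential(*layers)

# Descriptor counts are preprocessed as y = log(x + 1); the fixed optimizer
# defaults from the text are learning rate 0.05, momentum 0.9, weight cost 1e-4.
x = torch.log(torch.randint(0, 5, (8, 9491)).float() + 1)
model = recommended_qsar_dnn(n_descriptors=9491, n_tasks=15)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)
print(model(x).shape)   # torch.Size([8, 15])
```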



RESULTS

Our results can be summarized as the two findings below:
• Finding 1 explains why multitask DNNs produce better or worse predictions than single-task DNNs in different scenarios.
• Finding 2 specifies the conditions under which multitask DNNs do not make a significant difference in predictive performance but might improve computational efficiency.

This section is arranged as follows:
• Establishing a more rigorous comparison between single-task and multitask DNNs by considering the variation in prediction accuracy with different random seeds.
• Gaining insights by focusing on two special cases in the Kaggle data sets, i.e., OX2/OX1 and DPP4/LOGD, which show large performance differences between single-task and multitask DNNs.
• Using data sets contrived from OX2/OX1 and DPP4/LOGD to illustrate Finding 1.
• Using data sets built from the Kaggle data sets to illustrate Finding 2.
• Showing the benefits of Finding 1: boosting the performance of the partial CYP data sets by using multitask DNNs.
• Showing the benefits of Finding 2: using the full CYP data sets to show that multitask DNNs can speed up the training of three data sets without degrading the predictive performance.

Figure 4. Comparison of single-task and multitask DNNs for the Kaggle data sets. There are two box plots comparing the test set R2 values from the single-task and multitask DNNs for each of the 15 Kaggle data sets and also their average: results from single-task DNNs are shown on the left in blue, and those from multitask DNNs are shown on the right in red.

Comparison between Single-Task and Multitask DNNs for Each Kaggle Data Set. First, we made a detailed comparison of the predictive performance of single-task and multitask DNNs for each Kaggle data set. Although we used the same "recommended DNN settings" from Ma et al.5 to train the DNN models, the algorithm is not deterministic and is influenced by many random factors in the training process, such as the initialization of the weights, the randomly sampled minibatches of training data, random dropout, etc. The randomness can be controlled by setting a random seed prior to the training process. In order to make a fair and rigorous comparison, we trained each DNN model multiple times (usually 20 times) with different random


seeds and made comparisons based on the distributions of the test set R2 values across the 20 runs. Figure 4 shows side-by-side box plots comparing, for each QSAR task, the test set R2 values obtained from single-task DNNs trained on each data set and from multitask DNNs trained on all 15 data sets; the last column shows the average R2 over all tasks. We performed a two-sample t test to compare the R2 values from the single-task and multitask DNNs for each data set, and the p values are provided in the Supporting Information. The difference in R2 is significant at the 0.05 level for all of the data sets except CB1 and PGP. Ma et al.5 divided the 15 data sets into two groups on the basis of training set size and concluded that for the very large data sets (i.e., 3A4 and LOGD) the single-task DNNs perform better, while for the other, smaller data sets a multitask DNN model performs better. However, Figure 4 suggests that the difference in prediction accuracy between single-task and multitask DNNs varies greatly across data sets regardless of their sizes.
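This comparison protocol can be expressed compactly; in the sketch below the 20 training runs per model are faked with noisy predictions, since only the seed-replication, R2, and t-test logic is being illustrated:

```python
import numpy as np
from scipy import stats

def r2(pred, obs):
    """Test set R2: squared Pearson correlation between predictions and observations."""
    return stats.pearsonr(pred, obs)[0] ** 2

# Stand-ins for 20 runs per model; real runs would retrain the DNN with a new seed
# each time, whereas here predictions are simulated as noisy copies of y_test.
rng = np.random.default_rng(0)
y_test = rng.normal(size=200)
single_r2 = [r2(y_test + rng.normal(scale=0.8, size=200), y_test) for _ in range(20)]
multi_r2 = [r2(y_test + rng.normal(scale=0.6, size=200), y_test) for _ in range(20)]

# Two-sample t test on the two R2 distributions, as reported for Figure 4.
t_stat, p_value = stats.ttest_ind(single_r2, multi_r2)
print(round(np.mean(single_r2), 3), round(np.mean(multi_r2), 3), p_value)
```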

While most previous work on multitask DNNs has focused on the average performance for a group of data sets,5,6,24,25 as in the last column ("mean") of Figure 4, we investigated the properties of individual data sets. Our strategy was to focus on several special cases, such as the OX2 data set, which shows the largest increase in R2 from multitask DNNs over single-task DNNs, and the DPP4 data set, which shows the largest decrease.

Studying the OX2 Task. We started with an investigation of the multitask improvements for OX2. Each of the other 14 Kaggle data sets was paired with OX2 to train a two-task DNN. We then compared the OX2 test set R2 from the 14 two-task DNNs against that of the single-task DNN for OX2 as well as that of OX2 from the multitask DNNs trained on all 15 of the Kaggle data sets. Figure 5 shows the OX2 test set R2 values from the 14 two-task DNNs as dots. It also shows the first and third quartiles of R2 from 20 single-task DNN runs as two blue horizontal lines, while those from 20 15-task DNN runs are shown by the two red lines.

Figure 5. Two-task DNNs for OX2. Each black dot represents the OX2 test set R2 for a two-task DNN trained with the OX2 training set and one of the other 14 Kaggle training sets. The dashed red lines show the first and third quartiles of the OX2 test set R2 from multitask DNNs trained with all 15 Kaggle data sets. The solid blue lines show the first and third quartiles of the OX2 test set R2 from the single-task DNN model for the OX2 task.

We found that the R2 of the OX2 test set from the OX2/OX1 two-task DNN is remarkably higher than the others and falls between the first and third quartiles of the 15-task DNN results, while most of the other two-task DNNs give R2 within the range for single-task DNNs for OX2. The second-best model, the OX2/METAB pair, shows some improvement, but it is not as outstanding as the OX2/OX1 pair. This observation suggests that for the OX2 test set, the benefits of the 15-task DNNs mainly come from the contribution of the OX1 training set. Figure 6, which displays a box plot comparison of three cases, also supports this statement.

Figure 6. Comparison between different DNN models for OX2. Left (blue): box plot of OX2 test set R2 from single-task DNN models for the OX2 task. Middle (red): box plot of OX2 test set R2 from multitask DNNs trained with all 15 Kaggle data sets. Right (yellow): box plot of OX2 test set R2 from two-task DNNs trained with the OX2 and OX1 training sets.

Further investigation of the OX2 and OX1 data sets revealed that approximately 63% of the molecules in the OX2 test set are also found in the OX1 training set because of the time-split criterion for generating the test sets and the overlapping dates of these two assays. For these overlapped molecules, the activities of the OX1 and OX2 tasks are highly correlated (r = 0.653). This observation gave us a hint that the following two conditions may lead to multitask DNNs having improved predictive performance over single-task DNNs:
• A large number of common molecules between the OX2 test set and the OX1 training set.
• Positively correlated molecular activities among the common molecules between OX1 and OX2.
However, it is rare among the Kaggle data sets for the same molecules to exist in the training set of one task (i.e., OX1) and the test set of another task (i.e., OX2), so the OX2/OX1 pair is a special case. The preliminary hypothesis needs to be generalized to be useful.

Comparison between Single-Task and Multitask DNNs for Each Kaggle Data Set with Overlapped Test Set Molecules Removed. In order to study the multitask effect in a more general situation, we excluded the test set molecules that appear in any of the other training sets and kept only the non-overlapped subset of molecules in each test set. Figure 7 is a modified version of Figure 4 that shows the comparison between single-task and multitask DNNs after removal of the overlapped test set molecules.

Figure 7. Comparison between single-task and multitask DNNs for Kaggle data sets for the non-overlapped test set molecules. For each of the 15 Kaggle data sets and their average (last column), the results from single-task DNNs are shown on the left in blue, and the results from multitask DNNs are shown on the right in red.

Comparing Figure 7 with Figure 4 reveals that for some data sets, such as 3A4, METAB, OX1, OX2, and RAT-F, the differences between the multitask and single-task results are greatly reduced or completely eliminated. However, there are still several data sets with significant improvement or degradation of the test set R2 values. Our next step was to focus on another special case: the DPP4 data set. Because the DPP4 test set does not have much overlap with any other training set, we believed that there must be a reason other than common molecules that causes the performance difference in multitask DNNs.

Studying the DPP4 Task. We applied a similar approach as before, using two-task DNNs to study the pairwise interactions between the DPP4 task and each of the other 14 Kaggle tasks. Figure 8 shows that several two-task DNNs result in worse predictions for the DPP4 test set compared with single-task DNNs.

Figure 8. Two-task DNNs for DPP4. Each black dot represents the DPP4 test set R2 for a two-task DNN trained with the DPP4 training set and one of the other 14 Kaggle training sets. The dashed red lines show the first and third quartiles of the DPP4 test set R2 from multitask DNNs trained with all 15 Kaggle data sets. The solid blue lines show the first and third quartiles of the DPP4 test set R2 from single-task DNN models for the DPP4 task.

Since the DPP4/LOGD two-task DNNs resulted in the largest decrease in performance, we focused on investigating how the LOGD data set affects the predictions for the DPP4 test set. Only 2% of the molecules in the DPP4 test set appear in the LOGD training set. Inspired by the previous results for OX2, we evaluated the molecular similarity between the DPP4 test set and the LOGD training set using the Dice similarity measure.27,31,32 For each DPP4 test set molecule, we compared its similarity to its nearest neighbor (NN) in the LOGD training set and to its nearest neighbor in the DPP4 training set. The results show that nearly half of the DPP4 test set molecules are closer (i.e., have a larger NN similarity) to the LOGD training set than to the DPP4 training set. That is, a large portion of the molecules in the DPP4 test set are structurally more similar to molecules in the LOGD training set than to those in their own training set. For the molecules in the DPP4 test set that are closer to the LOGD training set, we compared their activities with the activities of their nearest neighbors in the LOGD training set and found that these nearest-neighbor pairs have nearly uncorrelated activities. This observation suggests that the two conditions below result in degraded performance of the multitask DNNs:
• A large portion of similar molecular structures between the DPP4 test set and the LOGD training set.
• Uncorrelated molecular activities between the similar molecules in the two tasks.
It seems that similarity in molecular structure plays an important role. The LOGD training set is one of the largest, with 37 388 diverse molecules, most of which are not similar to the DPP4 test set of 2045 molecules. Our previous observation suggests that only the similar molecules influence the prediction for the DPP4 test set in multitask DNNs. We thus introduced the

concept of the "reduced assistant data set" (as described in Methods) to select a subset of the LOGD training set (approximately 10%). We then trained a two-task DNN on the DPP4 training set and the reduced LOGD training set to check whether the "reduced" subset was as effective as the original LOGD training set. This experiment can show whether the decrease in the performance of the two-task DNNs was primarily due to those similar molecules. Figure 9 shows the box plot comparison of R2 from four kinds of DNN models, each run with 20 different random seeds. The two-task DNNs of DPP4 and reduced LOGD show an even larger decrease in the R2 of the DPP4 test set, confirming our expectations.

Figure 9. Comparison between different DNN models for DPP4. Shown are box plots of DPP4 test set R2 from different DNN models (from left to right): (1) from single-task DNN models for the DPP4 task; (2) from multitask DNNs trained with the 15 Kaggle data sets; (3) from two-task DNNs trained with the DPP4 and LOGD training sets; (4) from two-task DNNs trained with the DPP4 training set and the reduced LOGD training set.

Experiments To Study the Effect of "Similar" Molecules on Multitask DNNs: Finding 1. In this section, we conducted more experiments using both original and artificial data sets to study the effect of multitask DNNs when the assistant training set contains a large number of molecules similar to those in the primary test set. Figure 10 illustrates this situation.

Figure 10. Summary of Finding 1.

Given a primary task (training and test sets) and an assistant training set that has molecules structurally similar to those in the primary test set, we can select a subset of molecules to form the reduced assistant data set, as detailed in Methods. A multitask DNN can then be trained from the primary data set and the reduced assistant data set. The molecular activities are determined by the natures of the

tasks, and they can be correlated (either positively or negatively) or uncorrelated between the primary test set and the reduced assistant training set. In order to explore all of the cases of positive, negative, and no correlation of the activities, we used simulated data sets with the same molecular structures as those in the reduced assistant data set but with activities either correlated (positively or negatively) or uncorrelated with those in the primary test set. Figure 11 shows the resulting box plots of the primary test set R2 from 20 repeated runs with different random seeds under different scenarios, where the primary task is DPP4 and the assistant task is LOGD. The first four cases were already shown in Figure 9, all of which used the real activities in the assistant data set. Since the activities of similar molecules in the DPP4 and LOGD tasks are uncorrelated, for the last two cases in Figure 11 we used two artificial data sets in which the reduced LOGD training set was given activities that are positively or negatively correlated with the DPP4 test set activities. The molecular activities of the reduced artificial LOGD training sets were random variables correlated with those of the nearest neighbors in the DPP4 test set, with a correlation coefficient of approximately 0.65 or −0.65. The box plots for the artificial data sets are shown in gray.

Figure 11. Box plots of DPP4 test set R2 from different DNN models (from left to right): (1) from single-task DNN models for the DPP4 task; (2) from multitask DNNs trained with the 15 Kaggle data sets; (3) from two-task DNNs trained with the DPP4 and LOGD training sets; (4) from two-task DNNs trained with the DPP4 training set and the reduced LOGD training set; (5) from two-task DNNs trained with the DPP4 training set and the reduced LOGD training set with simulated activities that are positively correlated with the DPP4 activities; (6) from two-task DNNs trained with the DPP4 training set and the reduced LOGD training set with simulated activities that are negatively correlated with the DPP4 activities.
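The text does not spell out the exact simulation procedure; one standard construction that achieves a prescribed correlation of approximately ±0.65 (the assumption used in the sketch below) mixes the standardized nearest-neighbor activities with independent Gaussian noise:

```python
import numpy as np

def simulate_correlated(y_nn, r, rng):
    """Random activities whose correlation with the nearest-neighbor activities y_nn
    is approximately r: mix standardized y_nn with independent Gaussian noise."""
    z = (y_nn - y_nn.mean()) / y_nn.std()
    return r * z + np.sqrt(1.0 - r ** 2) * rng.normal(size=y_nn.shape)

rng = np.random.default_rng(7)
y_nn = rng.normal(loc=6.0, scale=1.0, size=5000)   # toy -log(IC50) activities
for r in (0.65, -0.65, 0.0):
    y_sim = simulate_correlated(y_nn, r, rng)
    print(r, round(float(np.corrcoef(y_nn, y_sim)[0, 1]), 3))
```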

Another experiment, shown in Figure 12, used OX2 as the primary task and OX1 as the assistant task. The first three cases were already shown in Figure 6. Comparing the "reduced OX1" case with the "pair OX1" case, we found that the benefits for the OX2 test set coming from two-task DNNs trained on OX2 and OX1 can also be achieved by training on OX2 and a reduced assistant OX1 training set. This provides additional support for the concept that a reduced assistant data set can help to identify the major source of influence on the performance of multitask DNNs. Since the true correlation of the activities between the reduced OX1 training set and the OX2 test set is positive, we simulated molecular activities with negative and no correlation, where the molecular activities of the reduced artificial OX1 training set are either the negatives of the values for the nearest neighbors in the OX2 test set or random variables uncorrelated with the activities of the nearest neighbors in the OX2 test set, which completes the picture of all three possible scenarios.

Figure 12. Box plots of OX2 test set R2 from different DNN models (from left to right): (1) from single-task DNN models for the OX2 task; (2) from multitask DNNs trained with the 15 Kaggle data sets; (3) from two-task DNNs trained with the OX2 and OX1 training sets; (4) from two-task DNNs trained with the OX2 training set and the reduced OX1 training set with simulated activities that are uncorrelated with the OX2 activities; (5) from two-task DNNs trained with the OX2 training set and the reduced OX1 training set; (6) from two-task DNNs trained with the OX2 training set and the reduced OX1 training set with simulated activities that are negatively correlated with the OX2 activities.

Figures 11 and 12 jointly suggest that both positive and negative correlations improve the predictive performance of the primary task, while the case of no correlation degrades it. The amounts of improvement in multitask DNNs coming from positive and negative correlations are similar, which is reasonable because the neural network model can easily fit a negative correlation by flipping the signs of the weights in the output layer. We summarize these results as Finding 1: In the situation shown in Figure 10, where the primary test set has a large portion of molecules that are closer to the assistant data set than to the primary training set, multitask DNNs trained with the primary and assistant data sets will perform better or worse than single-task DNNs trained only with the primary data set. Specifically, if the molecular activities of the primary and assistant tasks are correlated either positively or negatively, the multitask DNNs will perform better than single-task DNNs trained with only the primary data set. On the other hand, if these tasks have uncorrelated activities, the multitask DNNs will show worse predictive performance than single-task DNNs.

Experiments on the Effect of Neighbors of Training Molecules on Multitask DNNs: Finding 2. According to Finding 1, if the assistant training data set has molecules that are similar to those in the primary test set, the predictive performance of a multitask DNN will be affected. Moreover, the reduced assistant data set is generally as effective as the full assistant data set, which suggests that those molecules in the assistant data set that are very different from those in the primary test set do not significantly influence the performance of multitask DNNs. In Figures 5 and 8, many data sets show a negligible difference between the results of single-task and multitask DNNs. These observations propelled us to explore the situation where multitask DNNs perform similarly to single-task DNNs, which is shown in Figure 13. The condition in Figure 13 differs from that in Figure 10 in that the primary test set is structurally closer to the primary training set than to the assistant data set.

Figure 13. Summary of Finding 2.

We created a subset of each test set containing only those molecules whose nearest neighbors, among all of the molecules in the 15 training data sets, are in the training set of their own task. We refer to this test subset as "Neighbors-of-Training" (a code sketch of this selection appears at the end of this subsection). Using the previously trained single-task and multitask DNNs (each repeated with 20 different random seeds), we recalculated R2 for these Neighbors-of-Training test sets for all of the Kaggle tasks, and the side-by-side box plot comparisons are displayed in Figure 14.

Figure 14. Comparison between single-task and multitask DNNs for Kaggle data sets for the Neighbors-of-Training test set molecules. For each data set with a significant number of Neighbors-of-Training test set molecules, results from single-task DNNs are shown on the left in blue, and results from multitask DNNs are shown on the right in red.

Comparing Figure 14 with Figures 4 and 7, we found that the Neighbors-of-Training subsets of the test sets give overall much smaller differences between single-task and multitask DNNs, with the difference becoming negligible for many tasks. This result is summarized as Finding 2: In the situation shown in Figure 13, where the molecules in the primary test set are more similar to the molecules in the primary training set than to those in the assistant data sets, multitask DNNs trained with the primary and assistant data sets show neither improved nor degraded predictive performance compared with single-task DNNs trained on only the primary task data. Although the predictive performance does not change, using multitask DNNs still has the benefits of parsimonious parametrization and higher computational efficiency when modeling multiple tasks simultaneously. That is, it is simply faster to train multiple tasks simultaneously than to train each separately. This is especially true for data sets like the CYP data sets, where the molecular activities are "dense", as shown in Figure 2a. We will demonstrate this benefit subsequently.
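A minimal sketch of the Neighbors-of-Training selection, using Dice similarity on toy count-fingerprint vectors (illustrative names and data, not the released code):

```python
import numpy as np

def dice(a, b):
    """Dice similarity between two count-fingerprint vectors."""
    return 2.0 * np.minimum(a, b).sum() / (a.sum() + b.sum())

def neighbors_of_training(test_fps, train_fps_by_task, own_task):
    """Indices of test molecules whose nearest neighbor over ALL training sets
    lies in the training set of their own task."""
    keep = []
    for i, t in enumerate(test_fps):
        sims = {task: max(dice(t, m) for m in fps)
                for task, fps in train_fps_by_task.items()}
        if max(sims, key=sims.get) == own_task:
            keep.append(i)
    return keep

# Toy example: three tasks with random count fingerprints.
rng = np.random.default_rng(1)
train = {task: rng.integers(0, 3, size=(50, 40)) for task in ("OX2", "OX1", "DPP4")}
test_ox2 = rng.integers(0, 3, size=(10, 40))
print(neighbors_of_training(test_ox2, train, own_task="OX2"))
```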

Application of Finding 1: Boosting the Prediction of the Partial CYP Test Sets. We use the partial CYP data sets shown in Figure 2b to illustrate how to apply Finding 1. The three "partial" training data sets have similar numbers of molecules (∼12 400), and the activities of these molecules are positively correlated. The molecules in the CYP training set were generally ordered according to their assay times. We chose the first one-third of the molecules of the full training set as the 2C9 training set, the second one-third as the 2D6 training set, and the last one-third as the 3A4 training set (a code sketch of this construction follows Figure 15). Since the molecules in the test sets were assayed relatively close in time to those in the 3A4 training set, it is reasonable to assume that their molecular structures are relatively more similar to those in the 3A4 training set and relatively farther from those in the 2C9 and 2D6 training sets. According to Finding 1, multitask DNNs trained with the three data sets should have better predictive performance on the test set than individual DNNs for the 2C9 and 2D6 tasks, since the former incorporate information from molecules in the 3A4 training set, which has more similar structures and correlated activities with the test set. For each single-task or multitask DNN model, we trained the model 20 times with different random seeds; the side-by-side comparison of the test set R2 for each data set and their mean is shown in Figure 15. The improvements of the multitask DNNs for the 2C9 and 2D6 data sets are consistent with our analysis, while the R2 for the 3A4 test set does not show much difference, which can be explained by our Finding 2.

Figure 15. Application of Finding 1 to the partial (sparse) CYP data sets. For each of the three data sets and also their average, there are two box plots comparing the test set R2 from different DNNs: (left, blue) results from single-task DNNs, each trained with one partial training set; (right, red) results from multitask DNNs trained with all three partial training sets.
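The construction of the partial training sets described above (see also the Figure 2 caption) can be written as follows (a toy illustration; the row count 37216 matches the time-split boundary mentioned in the Figure 2 caption):

```python
import numpy as np

def make_partial_cyp(dense_train):
    """Sparsify a dense [n_molecules x 3] CYP training matrix (columns 2C9, 2D6, 3A4;
    rows ordered by assay date): keep the first third of activities for 2C9, the
    middle third for 2D6, and the last third for 3A4; mark the rest as missing."""
    n = dense_train.shape[0]
    cut1, cut2 = n // 3, 2 * n // 3
    partial = np.full(dense_train.shape, np.nan)
    partial[:cut1, 0] = dense_train[:cut1, 0]          # 2C9: earliest molecules
    partial[cut1:cut2, 1] = dense_train[cut1:cut2, 1]  # 2D6: middle molecules
    partial[cut2:, 2] = dense_train[cut2:, 2]          # 3A4: latest molecules
    return partial

dense = np.random.default_rng(3).normal(size=(37216, 3))   # toy dense training matrix
partial = make_partial_cyp(dense)
print(np.isnan(partial).mean(axis=0))                      # about 2/3 missing per task
```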



This result indicates that the conclusion presented by Ma et al.5 that large data sets do not benefit from multitask DNNs was inaccurate. Moreover, this experiment shows that the knowledge of Finding 1 gives us the confidence to predict when using multitask DNNs can improve the predictive performance over single-task DNNs for a certain task.

Application of Finding 2: Efficiently Modeling the Full CYP Data Sets with a Multitask DNN. For each molecule in the original full CYP data sets (both training and test sets), the activities of three QSAR tasks (i.e., 2C9, 2D6, and 3A4) were measured. That is, the three QSAR tasks share the same sets of molecules in both the test and training sets. According to our Finding 2, without more similar molecules from different QSAR tasks, the multitask DNNs will not be significantly better than single-task DNNs in prediction accuracy. In Figure 16, the R2 values of single-task DNNs are compared to those of multitask DNNs from 20 repeated runs with different random seeds for all three tasks in the CYP data set. As we expected, there was little difference in performance on the individual tasks and on average. However, if we want to model all three QSAR tasks in the CYP data set, training a multitask DNN for the three tasks simultaneously is much more efficient and parsimonious than training three separate single-task DNNs. Using the same recommended DNN parameter settings, all three tasks share all of the hidden-layer parameters in a multitask DNN, and the training time is almost the same as that for training a single-task DNN for one task. Our Finding 2 gives us the peace of mind that we can train multitask DNNs efficiently without worrying about degrading the performance.

Figure 16. Application of Finding 2 on the CYP (dense) data set. For each of the three CYP tasks and also their average, there are two box plots comparing the test set R2 from single-task and multitask DNNs: the results from single-task DNNs are shown on the left in blue, and the results from multitask DNNs are shown on the right in red.

DISCUSSION

In this work, we have made important progress in solving the puzzle regarding multitask DNNs. We have revealed the reason why multitask DNNs perform significantly better or worse for some QSAR tasks in the Kaggle data sets yet make no difference for others. Our two findings can be summarized as follows:
• Finding 1: When assistant tasks have molecules in the training set with structures similar to those in the test set of the primary task and the activities of these similar molecules are correlated (either positively or negatively), building a multitask DNN can boost the predictive performance. In contrast, if the activities of these similar molecules are uncorrelated, using multitask DNNs can degrade the predictive performance.
• Finding 2: When assistant tasks do not have molecules structurally similar to those in the primary task test set, multitask DNNs will show neither improved nor degraded predictive performance, regardless of whether the activities of the tasks are correlated.
A concise summary of the findings is also provided in Table 2.

Table 2. Summary of Findings

| | molecular structure | molecular activity | results |
| --- | --- | --- | --- |
| Finding 1 | primary test set molecules are more similar to assistant training set molecules | primary data set and assistant data set have correlated activities (positive or negative) | improved prediction R2 for the primary test set |
| | | uncorrelated biological activities | decreased prediction R2 for the primary test set |
| Finding 2 | primary test set molecules are very different from assistant training set molecules | correlated or not | no significant change of prediction for the primary test set |

An intuitive explanation of our findings is that multitask DNNs make it possible to learn informative features from molecules in some QSAR training sets and then use those features to predict the activities of similar molecules in other QSAR test sets. A DNN generally has multiple hidden layers and one output layer.

DOI: 10.1021/acs.jcim.7b00087 J. Chem. Inf. Model. 2017, 57, 2490−2504

Article

Journal of Chemical Information and Modeling Table 2. Summary of Findings molecular structure

molecular activity

Finding 1

primary test set molecules are more similar to assistant training set molecules

primary data set and assistant data set have correlated activities (positive or negative) uncorrelated biological activities

Finding 2

primary test set molecules are very different from assistant training set molecules

correlated or not

The output layer produces the predictions on the basis of the features extracted from the input data by the hidden layers. In multitask DNNs, there are multiple nodes in the output layer, each corresponding to a QSAR task. However, as explained in Methods, the outputs of these tasks utilize the same features extracted by the hidden layers with likely different or zero weights. When the activities of all these tasks are either positively or negatively correlated, the features that are learned from molecules in any training sets are informative for those structurally similar molecules in any test set. In contrast, when the activities of these tasks are uncorrelated, the features that are learned from molecules in some training sets can provide contradictory information in predictions for structurally similar molecules in some test sets. An implication of this phenomenon is that a large-scale complex DNN probably has the capacity to “memorize” the structural features of all the molecules it learned. When it predicts the activity of a molecule in a QSAR task, DNNs allow those structurally similar molecules in its “memory” to play an important role in prediction. The closer a “memorized” molecule is to the molecule of interest, the more influence it can have on the prediction result for that molecule. Thus, tasks with structurally similar compounds but uncorrelated activities in some training sets can provide contradictory information if they are trained together in a multitask DNN. Therefore, it is important for users with domain knowledge to select assisting training data sets according to the rules in Table 2 in order to achieve better prediction of a molecular activity that is of primary interest. In fact, the use of an “assistant data set” suggested by domain experts in multitask DNNs can be considered as a novel approach to incorporate prior domain knowledge in building QSAR models. We have also demonstrated that the use of “reduced” assistant data sets generally achieves similar improvements in predicting the primary test set as the full assistant data sets in many examples. We did observe a few cases in which the performance changes induced by reduced assistant data sets in multitask DNNs were not as dramatic as those by the full assistant data sets. Some evidence suggests that this may occur because the Dice similarity we used to construct the reduced assistant data sets may not fully align with the structural similarity evaluated by the trained DNNs. In real applications, if we want to build multitask DNNs targeted for a set of molecules but only a small portion of the very large assistant data sets is similar to those targeted molecules, using the reduced assistant data sets helps to speed up the training process. Our findings not only explain the source of the performance advantage of multitask DNNs over single-task DNNs but also provide insight into how to assemble the most effective multitask DNNs. This work makes multitask DNNs a practical method for QSAR tasks in drug discovery applications. However, it also leaves several areas for future investigation: (1) The impact of different molecular descriptors, such as AP and ECFP4, is worth studying. (2) Currently, one of the conditions in Finding 1 is that

This work also leaves several areas for future investigation: (1) The impact of different molecular descriptors, such as AP and ECFP4, is worth studying. (2) Currently, one of the conditions in Finding 1 is that a correlation exists between the activities of the different tasks; given the ability of DNNs to model arbitrary nonlinear relationships, we expect that this correlation condition can be relaxed to a more general form of relationship. (3) When the correlation of molecular activities among multiple QSAR tasks is not widely known, it would be useful to develop a data-driven method for finding suitable assistant data sets rather than relying solely on the prior knowledge of domain experts.

There are also a number of interesting topics not covered in this work that merit further investigation. (1) Inductive transfer learning approaches such as multitask learning and feature nets can be implemented with a variety of machine learning models, including partial least squares, random forests, neural networks, boosting, and support vector machines.21,35−39 Whereas multitask learning usually aims to benefit several tasks simultaneously, the feature nets approach focuses on improving performance on the primary task with help from the other assistant tasks. The feature nets approach requires an extra step to construct additional input features for the primary task from the assistant tasks, while multitask learning requires a model structure that supports simultaneous modeling of multiple data sets; feature nets is thus more flexible and can be implemented with essentially any machine learning algorithm (a sketch is given after this paragraph). In addition, the feature nets architecture could be more resistant to the negative transfer effect, since it provides only a small number of additional input features and leaves most of the primary task’s single-task model unchanged. Although it is usually difficult to conclude which technique is more appropriate, a comprehensive comparison of these methods applied to QSAR tasks would be very useful and would provide more choices. (2) How to further reduce the computational cost and number of parameters by using simpler neural network structures is also of interest. Recent work by Winkler and Le40 used a shallow Bayesian regularized neural network41−45 to achieve results comparable to those of deep neural nets.
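As a concrete illustration of the feature nets idea mentioned above, the sketch below trains a single-task model on an assistant data set and appends its prediction as one additional descriptor for the primary task’s model. It is a sketch of the general approach under assumptions made here (random forests as the base learner, the scikit-learn API, hypothetical function names), not code from this work or from refs 21 and 35−39.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def feature_net_fit(X_primary, y_primary, X_assistant, y_assistant):
    # Feature nets: the assistant model's prediction becomes one extra
    # input feature for the primary task's otherwise standard model.
    assistant_model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    assistant_model.fit(X_assistant, y_assistant)
    extra = assistant_model.predict(X_primary).reshape(-1, 1)
    primary_model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    primary_model.fit(np.hstack([X_primary, extra]), y_primary)
    return assistant_model, primary_model

def feature_net_predict(models, X_new):
    # Prediction reuses the assistant model to build the extra feature.
    assistant_model, primary_model = models
    extra = assistant_model.predict(X_new).reshape(-1, 1)
    return primary_model.predict(np.hstack([X_new, extra]))

Because only a single column is added and the primary model is otherwise unchanged, a poorly chosen assistant task can at worst contribute one uninformative feature, which is the intuition behind the resistance to negative transfer noted above.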



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00087.
- List of contents of the Supporting Information (PDF)
- Numerical results: tables of medians and standard deviations for the box plots, tables of p values for Figures 4, 7, 14, and 15, and computation time results (ZIP)
- Training and test data sets (ZIP)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID
Yuting Xu: 0000-0003-2091-3854
Robert P. Sheridan: 0000-0002-6549-1635

Notes

The authors declare no competing financial interest. The code for the “DeepNeuralNet-QSAR” package is available at https://github.com/Merck/DeepNeuralNet-QSAR.

ACKNOWLEDGMENTS

The authors thank George E. Dahl for sharing his original DNN codes via the QSAR Kaggle competition.

REFERENCES

(1) LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436−444.
(2) Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1−127.
(3) Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493−2537.
(4) Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning; Association for Computing Machinery: New York, 2008; pp 160−167.
(5) Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep neural nets as a method for quantitative structure−activity relationships. J. Chem. Inf. Model. 2015, 55, 263−274.
(6) Dahl, G. E.; Jaitly, N.; Salakhutdinov, R. Multi-task neural networks for QSAR predictions. 2014, arXiv:1406.1231 [stat.ML]. arXiv.org e-Print archive. http://arxiv.org/abs/1406.1231 (accessed May 31, 2017).
(7) Arel, I.; Rose, D. C.; Karnowski, T. P. Deep machine learning - a new frontier in artificial intelligence research [research frontier]. IEEE Comput. Intell. Mag. 2010, 5, 13−18.
(8) Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A. R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82−97.
(9) Ripley, B. D. Pattern Recognition and Neural Networks; Cambridge University Press: Cambridge, U.K., 2007.
(10) Wu, C. H.; McLarty, J. W. Neural Networks and Genome Informatics; Elsevier: New York, 2012; Vol. 1.
(11) Agatonovic-Kustrin, S.; Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 2000, 22, 717−727.
(12) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(13) Bridle, J. S. Neurocomputing; Springer: Berlin, 1990; pp 227−236.
(14) LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541−551.
(15) Ou, G.; Murphey, Y. L. Multi-class pattern classification using neural networks. Pattern Recognit. 2007, 40, 4−18.
(16) Dahl, G. Deep Learning How I Did It: Merck 1st place interview. http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview (accessed Dec 12, 2016).
(17) Pan, S. J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345−1359.
(18) Weiss, K.; Khoshgoftaar, T. M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 9.
(19) Rosenstein, M. T.; Marx, Z.; Kaelbling, L. P.; Dietterich, T. G. To transfer or not to transfer. Presented at the NIPS 2005 Workshop on Transfer Learning, 2005.
(20) Caruana, R. Multitask Learning. In Learning to Learn; Thrun, S., Pratt, L., Eds.; Springer: New York, 1998; pp 95−133.
(21) Varnek, A.; Gaudin, C.; Marcou, G.; Baskin, I.; Pandey, A. K.; Tetko, I. V. Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue−air partition coefficients. J. Chem. Inf. Model. 2009, 49, 133−144.
(22) Davis, I. L.; Stentz, A. Sensor Fusion for Autonomous Outdoor Navigation Using Neural Networks. In Proceedings of the 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems: Human Robot Interaction and Cooperative Robots; IEEE: New York, 1995; pp 338−343.
(23) Tetko, I. V. Associative neural network. Neural Process. Lett. 2002, 16, 187−199.
(24) Ramsundar, B.; Kearnes, S. M.; Riley, P.; Webster, D.; Konerding, D. E.; Pande, V. S. Massively multitask networks for drug discovery. 2015, arXiv:1502.02072 [stat.ML]. arXiv.org e-Print archive. http://arxiv.org/abs/1502.02072 (accessed May 31, 2017).
(25) Kearnes, S.; Goldman, B.; Pande, V. Modeling Industrial ADMET Data with Multitask Networks. 2016, arXiv:1606.08793 [stat.ML]. arXiv.org e-Print archive. http://arxiv.org/abs/1606.08793 (accessed May 31, 2017).
(26) Sheridan, R. P. Time-split cross-validation as a method for estimating the goodness of prospective prediction. J. Chem. Inf. Model. 2013, 53, 783−790.
(27) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom pairs as molecular features in structure−activity studies: definition and applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64−73.
(28) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical similarity using physiochemical property descriptors. J. Chem. Inf. Comput. Sci. 1996, 36, 118−127.
(29) Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall: Upper Saddle River, NJ, 1999.
(30) Sandberg, I. W.; Lo, J. T.; Fancourt, C. L.; Principe, J. C.; Katagiri, S.; Haykin, S. Nonlinear Dynamical Systems: Feedforward Neural Network Perspectives; Adaptive and Cognitive Dynamic Systems: Learning, Signal Processing, Communications, and Control, Vol. 21; John Wiley & Sons: New York, 2001.
(31) Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning; PMLR, 1998; pp 296−304.
(32) Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912−1928.
(33) Tieleman, T. Gnumpy: an easy way to use GPU boards in Python; UTML TR2010−002; Department of Computer Science, University of Toronto: Toronto, ON, 2010.
(34) Mnih, V. Cudamat: a CUDA-based matrix class for Python; UTML TR2009−004; Department of Computer Science, University of Toronto: Toronto, ON, 2009.
(35) Brown, J.; Okuno, Y.; Marcou, G.; Varnek, A.; Horvath, D. Computational chemogenomics: Is it more than inductive transfer? J. Comput.-Aided Mol. Des. 2014, 28, 597−618.
(36) Dai, W.; Yang, Q.; Xue, G.-R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning; PMLR, 2007; pp 193−200.
(37) Wu, P.; Dietterich, T. G. Improving SVM accuracy by training on auxiliary data sources. In Proceedings of the 21st International Conference on Machine Learning; PMLR, 2004; p 110.
(38) Shin, H.-C.; Roth, H. R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R. M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285−1298.
(39) Bossard, L.; Dantone, M.; Leistner, C.; Wengert, C.; Quack, T.; Van Gool, L. Apparel classification with style. In Computer Vision – ACCV 2012; Springer: Berlin, 2012; Part IV, pp 321−335.
(40) Winkler, D. A.; Le, T. C. Performance of deep and shallow neural networks, the Universal Approximation Theorem, activity cliffs, and QSAR. Mol. Inf. 2017, 36, 1600118.
(41) Winkler, D. A. Neural networks as robust tools in drug lead discovery and development. Mol. Biotechnol. 2004, 27, 139−167.
(42) Burden, F.; Winkler, D. Bayesian regularization of neural networks. Methods Mol. Biol. 2008, 458, 23−42.
(43) Burden, F. R.; Winkler, D. A. Robust QSAR models using Bayesian regularized neural networks. J. Med. Chem. 1999, 42, 3183−3187.
(44) Burden, F. R.; Winkler, D. A. New QSAR methods applied to structure−activity mapping and combinatorial chemistry. J. Chem. Inf. Comput. Sci. 1999, 39, 236−242.
(45) Burden, F. R.; Winkler, D. A. An Optimal Self-Pruning Neural Network and Nonlinear Descriptor Selection in QSAR. QSAR Comb. Sci. 2009, 28, 1092−1097.