Article Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
pubs.acs.org/jcim
Building of Robust and Interpretable QSAR Classification Models by Means of the Rivality Index
Irene Luque Ruiz* and Miguel Ángel Gómez-Nieto
Department of Computing and Numerical Analysis, University of Córdoba, Albert Einstein Building, Campus de Rabanales, E-14071, Córdoba, Spain
DOI: 10.1021/acs.jcim.9b00264 (https://pubs.acs.org/doi/10.1021/acs.jcim.9b00264)
Supporting Information
ABSTRACT: An unambiguous algorithm, together with a study of the applicability domain and appropriate measures of goodness of fit and robustness, are the key characteristics that a QSAR model should ideally fulfill to be considered for regulatory purposes. In this paper, we propose a new algorithm (RINH) based on the rivality index for the construction of QSAR classification models. This index predicts the activity of the data set molecules by measuring the rivality between their nearest neighbors belonging to different classes, and it provides a robust measurement of the reliability of the predictions. To demonstrate the performance of the proposed algorithm, we have selected four independent and orthogonally different benchmark data sets (balanced/unbalanced and high/low modelability) and compared the results with those obtained using 12 different machine learning algorithms. These results have been validated using 20 data sets of different balancing and sizes, corroborating that the proposed algorithm is able to generate highly accurate classification models and to provide valuable measurements of the reliability of the predictions and the applicability domain of the built models.
1. INTRODUCTION
QSARs predict biological or toxicological properties on the basis of the physicochemical and structural properties of chemicals. The mathematical basis of the relationship between the structural properties (independent variables) and the biological or toxicological properties (dependent variables) is named an algorithm, and the importance of the foundation and interpretability of QSAR algorithms has been reported by the OECD (Organization for Economic Co-operation and Development).1,2 For regulatory purposes, and in order to establish a convenient framework for evaluating and comparing different QSAR models, the OECD states that a QSAR model should satisfy the following five characteristics: (i) have a defined end point, (ii) use an unambiguous algorithm, (iii) have a defined domain of applicability, (iv) provide appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) present a mechanistic interpretation, when possible. The OECD reports state that the QSAR algorithm should be unambiguous and ought to be described thoroughly so that the user will understand exactly how the estimated end-point values were produced and how the calculations can be reproduced. Many QSAR algorithms have been proposed and described in the literature; some of them (e.g., artificial neural network algorithms) are ambiguous algorithms whose results are hard to interpret. Other algorithms are clearly unambiguous; however, for some of them, the results can be hard to reproduce, and therefore, they can be of low applicability. Therefore, in the
proposal of a QSAR model, it is advisable to test different algorithms based on different mathematical foundations.3−5
The selection of an algorithm in the building of a QSAR model is related not only to the statistical results of the model but also to the applicability domain of the resulting model. The OECD identifies, in its third requirement for a QSAR proposal, the need to establish the scope and limitations of a QSAR model (the applicability domain).1,6−8 Netzeva et al.9 have proposed a definition of the applicability domain (AD) as follows: “The applicability domain of a QSAR model is the response and chemical structure space in which the model makes predictions with a given reliability”. This definition establishes that the predictions of new chemicals by a model should only be considered reliable when the chemicals fall within the scope determined by the model; multiple methods currently exist for measuring and setting the AD.10 Recently, different approaches to the measurement of the AD have been proposed based on the range of the descriptor variable values, the similarity between the molecules of the data set, geometric methods, distance-based methods, density methods, etc.11−16 Different methods usually generate different results, with different lists of chemicals falling inside/outside the AD, which makes it advisable to compare different AD approaches and to evaluate the results from several possible strategies before assessing a new molecule using the model built.17−21
Finally, in the case of QSAR classification models (CM), the goodness-of-fit of a CM can be assessed in terms of its Cooper statistics,22 which for binary data sets are calculated by means of the confusion or contingency matrix.23 Although high values of the classification accuracy describe robust models, this measurement can be influenced by the performance of the most numerous class in unbalanced data sets. As a result, it is necessary to analyze the misclassified chemicals, mainly for unbalanced data sets, where the minimization of the error should be considered alongside the maximization of accuracy.1,2 Usually, this analysis can be carried out by evaluating the number of misclassifications when the classification model is built using the leave-one-out (LOO) method, by calculating the classification error or the posterior probability (or weights) of classification/misclassification of the chemicals, or by applying mathematical methods in the fitting process in order to minimize these errors.
In this paper, we describe a proposal for the construction of a classification QSAR model satisfying these requirements. Our proposal is based on (i) the use of a fast, simple, and unambiguous algorithm, (ii) an AD measurement able to determine the scope of applicability of the model and the molecules of the data set falling within and outside of the AD, and (iii) a measurement of the reliability of the predictions. The proposed classification algorithm is based on the rivality index24 and the values of this index for each molecule of the data set at different values of the cardinality of its neighborhood. This algorithm is called RINH (rivality of the neighborhood). In addition to its simplicity, speed, and clarity, the RINH algorithm has the advantage of generating measurements of the reliability of the predictions. These measurements are obtained for each one of the molecules of the data set, and they facilitate the analysis of the misclassification error and the applicability domain of the generated model.
In order to present our proposal, we have carried out the study described in this paper using four binary, orthogonal, and independent data sets with different numbers of molecules, modelability, and balancing. A large set of statistical classification parameters has been calculated, and the results obtained with the RINH algorithm have been compared with those obtained using 12 well-known and widely used classification algorithms. In addition, in order to validate our proposal, we have built classification models using the RINH, Support Vector Machine, and Random Forest algorithms for 20 other data sets with different sizes and balancing, and the results have been compared.
The manuscript is organized as follows: after the Introduction, in the Materials and Methods section, we revisit the definition of the rivality index. In that section, the RINH algorithm and the measurements of confidence and reliability of the predictions are also described, together with the four data sets used in this study. In the following section, we describe the calculations carried out. Using 12 different machine learning algorithms, we have built classification models for the four data sets, and the results have been compared with the classification models generated with the RINH algorithm. Moreover, in that section, a deep analysis of the results is performed, studying the misclassified molecules, the reliability of the prediction, and the applicability domain of the model built. Finally, we discuss and validate the results, describing the advantages and limitations of our proposal.
Received: March 28, 2019
2. MATERIALS AND METHODS
In this section, we first revisit the rivality index. To date, this index has been proposed and studied for binary data sets; it is a normalized distance measurement between each molecule of the data set and its first nearest neighbors belonging to each of the two classes of molecules existing in the data set.
2.1. Rivality Index. The rivality index24,25 (RI) is a measurement of the capability of a statistical algorithm to correctly predict the activity of a molecule. For any molecule i of a given data set, the rivality index is defined as follows:
$$RI_i = \frac{d_i^x - d_i^y}{d_i^x + d_i^y} \tag{1}$$
where $d_i^x$ is the distance between the molecule i and its nearest neighbor molecule belonging to its same class, and $d_i^y$ is the distance between the molecule i and its nearest neighbor molecule that belongs to any class different from the class of molecule i. RI is a normalized index which takes values between −1 and 1. Values lower than zero imply that the first nearest neighbor of molecule i is a molecule that belongs to its same class, and values of RI greater than zero mean that the first nearest neighbor of molecule i is a molecule belonging to a different class. Therefore, molecules with an $RI_i$ value close to −1 are those that will be correctly classified by a QSAR algorithm, and molecules with an $RI_i$ value close to 1 are those that will be incorrectly classified.
The implementation of eq 1 is quite simple. The Euclidean distances between all pairs of molecules of the data set are obtained, and for each molecule, the distances to the remaining molecules are sorted (this process is performed in the preprocessing stage, described in Section 2.6). When that is done, it is only necessary to find, for each molecule, the first nearest neighbor of the same class and of a different class.
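The calculation of eq 1 can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the function name `rivality_index`, the list-based inputs, and the brute-force neighbor search are assumptions made for illustration, and the descriptor vectors are assumed to be already range-scaled. The value −10^-6 used for the 0/0 case follows the convention stated in the text for $d_i^x = d_i^y = 0$.

```python
import math

def rivality_index(X, labels):
    """Rivality index (eq 1) of every molecule of a data set.

    X      : list of descriptor vectors (assumed already range-scaled)
    labels : list of class labels, one per molecule
    Assumes every class contains at least two molecules.
    """
    n = len(X)
    ri = []
    for i in range(n):
        d_same = math.inf  # d_i^x: distance to nearest neighbor of the same class
        d_diff = math.inf  # d_i^y: distance to nearest neighbor of a different class
        for j in range(n):
            if j == i:
                continue
            d = math.dist(X[i], X[j])  # Euclidean distance
            if labels[j] == labels[i]:
                d_same = min(d_same, d)
            else:
                d_diff = min(d_diff, d)
        if d_same == 0.0 and d_diff == 0.0:
            ri.append(-1e-6)  # 0/0 case, set to -10^-6 as stated in the text
        else:
            ri.append((d_same - d_diff) / (d_same + d_diff))
    return ri

# Hypothetical toy data: the last molecule (class 1) lies among class 0 molecules.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.2, 0.1]]
labels = [0, 0, 1, 1, 1]
print(rivality_index(X, labels))
```

In this toy example the last molecule is closer to a molecule of the other class than to any molecule of its own class, so its index is positive, whereas the first molecule obtains a negative index.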
In addition, the condition
$$RI_i = \frac{d_i^x - d_i^y}{d_i^x + d_i^y} \le 0$$
allows us to include those cases in which several molecules from different classes are at the same distance from a given molecule, that is, more than one first nearest neighbor could be considered in the calculation. Thus, if a molecule has molecules belonging to the same and to a different class at the same distance, this molecule is considered as correctly classified. In those cases in which $d_i^x = d_i^y = 0$, the result is $\frac{d_i^x - d_i^y}{d_i^x + d_i^y} = \frac{0}{0} = \mathrm{NaN}$, and the value of this term is set to $-10^{-6}$ in order to differentiate it from those cases in which $d_i^x = d_i^y \neq 0$, where this term takes the value of 0.
2.2. Weighted Scheme Based on the Density of the Neighborhood. We have observed that high positive values of the rivality index allow us to find those molecules with a high possibility of being incorrectly classified and that high negative values of the rivality index describe molecules with a high capability of being correctly classified by a QSAR algorithm. However, we have also observed two types of deviations between the information provided by the rivality index for the molecules of the data set and the experimental values obtained by the QSAR algorithm in the calculations carried out:24 (i) some molecules are correctly or incorrectly classified depending
on the composition of the training and test subsets, randomly partitioned for the five models built (redundancy); (ii) some molecules with a negative value of the rivality index are incorrectly classified. The first type of deviation is mainly due to the activity borders, that is, molecules with a rivality index close to zero (positive or negative). It is difficult to classify these molecules correctly; besides this, they are very dependent on the second type of deviation, that is, the composition of the randomly selected molecules of the training and test sets. In many cases, the first two neighbors of a molecule belong to the same or to different classes, but a considerable number of the following neighbors belong to a single class (equal to or different from that of the considered molecule). Thus, when some of the first N neighbors are selected in the training set, a different result is obtained than when some of these molecules are not in the training set. Thus, we defined a weighted kernel (density kernel) for the rivality index as follows:
$$RI_i = \frac{(d_i^x \times w_i^x) - (d_i^y \times w_i^y)}{(d_i^x \times w_i^x) + (d_i^y \times w_i^y)} \tag{2}$$
where $d_i^x$ is the Euclidean distance from the molecule i to its first nearest neighbor x belonging to the class of i, $d_i^y$ is the Euclidean distance from the molecule i to its first nearest neighbor y that belongs to a different class than that of i, $w_i^x$ is the weight assigned to the neighbor that belongs to the same class as i, and $w_i^y$ is the weight assigned to the neighbor belonging to a different class. These weights are calculated through the following expression:
$$w_i^x = \frac{CN_i - CN_i^x}{CN_i}, \qquad w_i^y = \frac{CN_i - CN_i^y}{CN_i}, \qquad w_i^x + w_i^y = 1 \tag{3}$$
where $CN_i$ is the cardinality of the neighborhood assigned to the molecule i, and $CN_i^x$ and $CN_i^y$ are the numbers of molecules in that neighborhood belonging to the same class as molecule i and to a different class, respectively. The value of the cardinality of the neighborhood is calculated using a threshold of neighbors (TN) describing the minimum number of neighbors of each class that must exist in the neighborhood. Thus, for a TN selected for the data set, the cardinality of the neighborhood can be different for each molecule because, for each molecule, its first TN nearest neighbors of each class can be in a different place/order or at a different distance. By considering this threshold (TN) in the calculation of the $CN_i$ value, we can partially estimate that the status of the neighborhood of each molecule of the data set is like the one existing in any partition of the data set in the cross-validation stage and, therefore, obtain a valuable measurement about the molecules within or outside of the AD.
2.3. Confidence and Reliability of the Rivality Index. Although the rivality index provides a measurement for the prediction of the activity of a molecule, and the values of this index at different thresholds of neighbors provide information about the AD, it is also necessary to measure the reliability of this information. The reliability of a QSAR model is a measurement aimed at reporting the quantity, quality, and relevance of the information available for the model to perform predictions.16,20 We can observe from eq 2 that, depending on the value set for TN, different values of the rivality index can be calculated for each molecule of the data set. Thus, for a given molecule i of the data set and a given value of TN, a different cardinality of the neighborhood is considered, and this set of neighbors may contain the same or a different number of molecules of each class. Then, we have proposed the calculation of the confidence of the values of the rivality index taking into account the cardinality of the neighborhood for each value j of TN as follows:
$$C_{RI_i}^{TN} = \frac{1}{2}\left(\frac{CN_i^x + 1}{M_x} + \frac{M_y - CN_i^y + TN}{M_y}\right) \tag{4}$$
where $C_{RI_i}^{TN}$ is the confidence value of the rivality index for the molecule i, belonging to class x, at a given value of TN, M is the number of molecules of the data set, and $M_x$ and $M_y$ are the numbers of molecules belonging to classes x and y, respectively. Analyzing eq 4, we can appreciate that as the value considered for TN increases, the confidence also increases due to the increase of the cardinality of the neighborhood. Thus, it is possible to calculate the confidence of the rivality index at different values of TN. This value of TN is named the scope of the prediction. For balanced data sets, the scope can take values between 1 and $M/2 - 1$, and for unbalanced data sets, the scope can take values between 1 and $\min(M_x, M_y) - 1$. Moreover, for a data set perfectly separated in two classes, that is, when all the molecules of a given class are nearer to any molecule of that class than to any molecule of a different class, the confidence takes a value equal to 1 for any value of TN. For a perfectly mixed data set, the confidence would take values in the interval $\left[\frac{TN + 1}{\max(M_x, M_y)}, 1\right]$.
Now, we can calculate the reliability (RE) of the prediction of the molecules of the data set for each value of TN as follows:
$$RE_i^{TN} = \frac{\sum_{j=1}^{TN} RI_i^j \times C_{RI_i}^j}{\sum_{j=1}^{TN} \left| RI_i^j \times C_{RI_i}^j \right|} \times \frac{TN}{\min(M_x, M_y) - 1} \tag{5}$$
$RE_i$ takes values in the interval [−1, 1]. High negative values of $RE_i$ describe molecules with a high reliability of being correctly classified, and high positive values of $RE_i$ describe molecules with a high reliability of being incorrectly classified. Therefore, values of $|RE_i|$ close to 1 designate high reliability in the prediction of the molecule i as correctly or incorrectly classified. Values of $RE_i$ close to zero describe molecules that are hard to assign to one or another class and, therefore, molecules that are highly dependent on the composition of the data set. We can appreciate that $RE_i$ is a robust and accurate measurement describing the applicability domain related to the molecule i. For high values of TN, $RE_i$ takes into account different compositions of molecules of the data set, because a different neighborhood is considered for each value of TN. As TN increases, the cardinality of the neighborhood also increases, and therefore, the confidence of the calculated rivality index also increases.
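The quantities of eqs 2−5 can be combined in a short Python sketch. This is an illustrative reading of the text, not the authors' code: the function name `rinh`, the returned dictionary layout, and the brute-force sorting are hypothetical. Two interpretations are assumed and should be checked against the original paper: the neighborhood for a given TN is taken as the smallest set of nearest neighbors that contains at least TN molecules of each class, and a molecule is counted as correctly classified when its weighted rivality index is ≤ 0.

```python
import math

def rinh(X, labels, scope):
    """Weighted rivality index (eq 2), confidence (eq 4), and reliability
    (eq 5) of every molecule for TN = 1..scope.

    Assumes two classes, at least two molecules per class, and
    scope <= min(Mx, My) - 1.
    """
    n = len(X)
    counts = {c: labels.count(c) for c in set(labels)}
    min_class = min(counts.values())
    results = []
    for i in range(n):
        # Remaining molecules sorted from the nearest to the furthest.
        order = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        d_same = next(d for d, j in order if labels[j] == labels[i])  # d_i^x
        d_diff = next(d for d, j in order if labels[j] != labels[i])  # d_i^y
        m_x = counts[labels[i]]  # molecules in the class of i
        m_y = n - m_x            # molecules in the other class
        ri_list, conf_list, rel_list = [], [], []
        for tn in range(1, scope + 1):
            # Assumed neighborhood: smallest one holding >= tn neighbors of each class.
            cn_x = cn_y = 0
            for d, j in order:
                if labels[j] == labels[i]:
                    cn_x += 1
                else:
                    cn_y += 1
                if cn_x >= tn and cn_y >= tn:
                    break
            cn = cn_x + cn_y
            w_x = (cn - cn_x) / cn  # eq 3
            w_y = (cn - cn_y) / cn
            den = d_same * w_x + d_diff * w_y
            ri = (d_same * w_x - d_diff * w_y) / den if den else -1e-6  # eq 2
            conf = 0.5 * ((cn_x + 1) / m_x + (m_y - cn_y + tn) / m_y)   # eq 4
            ri_list.append(ri)
            conf_list.append(conf)
            signed = sum(r * c for r, c in zip(ri_list, conf_list))
            total = sum(abs(r * c) for r, c in zip(ri_list, conf_list))
            ratio = signed / total if total else 0.0
            rel_list.append(ratio * tn / (min_class - 1))               # eq 5
        results.append({
            "ri": ri_list,
            "confidence": conf_list,
            "reliability": rel_list,
            "correct": [r <= 0 for r in ri_list],  # assumed decision rule
        })
    return results

# Hypothetical, perfectly separated toy data set (three molecules per class).
X = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
labels = [0, 0, 0, 1, 1, 1]
for molecule in rinh(X, labels, scope=2):
    print(molecule["confidence"], molecule["reliability"])
```

For this perfectly separated toy data set, the sketch yields a confidence of 1 for every TN, in agreement with the property stated after eq 4, and the reliability approaches −1 as TN reaches the maximum scope.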
Thus, considering those different neighborhoods for correctly/incorrectly predicting a molecule i and the confidence of each prediction, the reliability provides a measurement of the capacity of an algorithm to correctly/incorrectly predict a molecule independently of the data set composition.
2.4. Rivality Neighborhood Algorithm (RINH). The rivality of neighborhood algorithm (RINH) classifies the molecules of the data set according to the positive/negative value of their rivality index for the different cardinalities of the neighborhood obtained for different TN values. This algorithm receives the following as input:
• An array of size M containing the labels corresponding to the true class of the molecules of the data set.
• A matrix M × D, M being the number of molecules of the data set and D the number of variables or descriptors describing the characteristics of the molecules of the data set.
• The scope value for the predictions. This parameter can take a value between 1 and min(Mx, My) − 1, determining the maximum value of the threshold of neighbors that can be considered by the algorithm in the calculation of the rivality index for the molecules of the data set.
The algorithm calculates the rivality and the reliability values for each value of TN in the interval [1, scope], returning the rivality index value, the class predicted for the molecules of the data set for each value of TN, and the corresponding reliability of the predictions. As can be seen, RINH is a very fast algorithm providing not only the predictions but also the scores of these predictions, as well as a measurement of the applicability domain of the predictions for different values of the neighborhood of the molecules of the data set.
2.5. Data Sets Description and Representation. From the Chembench web site,26 four data sets have been selected with the following criteria: (i) the data sets should have different modelability, and (ii) both balanced and unbalanced data sets should be included. Accordingly, the 422_OCT1x (DS01) and 423_PEPT1i100d (DS02) data sets are balanced data sets having 82 molecules, with 50% of the molecules belonging to each class. The DS01 data set has a high modelability reference (0.80), while the DS02 data set has a very low modelability reference (0.62). In addition, the ACK1 (DS03) and 234_BCRP (DS04) data sets are unbalanced data sets presenting different modelability. DS03 contains 171 molecules, 1−107 (107) belonging to the class 1/positive and 108−171 (64) belonging to the class 0/negative, presenting a high modelability reference (0.81). DS04 contains 169 molecules, 1−93 (93) belonging to the class 1/positive and 94−169 (76) belonging to the class 0/negative, presenting a low modelability reference (0.69). The values of the modelability reference for the four data sets have also been gathered from the Chembench web site.26 Thus, by means of this orthogonal criterion in the selection of the data sets, we can support the results obtained by the classification models built using the RINH algorithm, demonstrating the applicability of our proposal to any other data sets with similar or different characteristics. The Chembench web site does not provide information about the end points of the four data sets or any reference paper; however, this information is not relevant in our study, which is devoted to the description of the RINH algorithm and its comparison with other machine learning algorithms. The data sets used in the calculations are represented by their descriptor matrices, also gathered from the Chembench web site.26 The type of descriptor matrix selected was CDK descriptors,27 containing 202 variables describing 1D and 2D molecular descriptors.
2.6. Experimental Methods. The descriptor matrix of each data set was analyzed, and columns (variables) having any value equal to NaN/Inf, columns with a value equal to zero for all rows, and columns with the same value for all rows were removed. After this preprocessing, the corresponding descriptor matrices contained 157, 144, 156, and 163 descriptors for the DS01, DS02, DS03, and DS04 data sets, respectively. The resulting matrices were normalized (range-scaled) by columns. To this end, the maximum and minimum values of each column were calculated, and the Max/Min criterion was used to update the values of the matrix to their normalized values in the range [0,1]. These normalized matrices were used in the calculations of the classification models and the indexes described in this paper.
For the calculation of the nearest neighbors, the Euclidean distance was considered, and distance matrices were generated for each data set. These matrices are symmetrical matrices of size M × M, M being the number of molecules of the data set, where the element (i, j) stores the Euclidean distance between the molecules i and j. This Euclidean distance was calculated using the range-scaled descriptor matrix. The diagonal of these matrices was set to Inf, and the distance matrices were sorted in ascending order by rows. Thus, for each row (molecule), the neighbors of each molecule were ordered from the nearest (lowest distance value) to the furthest (greatest distance value).
For the calculations performed in this work, Matlab R2018b28 was used. In addition, using the Statistics and Machine Learning Toolbox,29 12 different algorithms have been used: Complex Tree with a maximum number of splits equal to 20 (CT), kNN with k = 10 (KNN), support vector machine with linear (SVMLI), Gaussian (SVMGA), and cubic polynomial (SVMCU) kernels, ensemble with Gentle boost (ENSGB), Logit boost (ENSLB), subspace discriminant (ENSSD), subspace kNN (ENSKNN), and Random Forest (ENSRF) kernels, linear regression (LIR), and logistic regression (LR). In order to allow the experiments to be easily reproduced by other researchers, the algorithms were executed using the default values for the parameters established by Matlab R2018b.29 Thus, no changes were introduced to improve the results in the building of the classification models for the four data sets. The building of the classification models was performed using 5-fold cross-validation (CV), with random partitioning of the data set into the training set (80% of molecules) and the test set (20% of molecules). A high number of statistical parameters have been extracted from the results of the classification models (see Supporting Information). Some of these parameters were calculated as follows:
$$\text{sensitivity (SE)} = \frac{TP}{MP} \tag{6}$$
$$\text{specificity (SP)} = \frac{TN}{MN} \tag{7}$$
$$\text{accuracy (ACC)} = \frac{SE \cdot MP + SP \cdot MN}{M} \tag{8}$$
$$\text{correct classification rate (CCR)} = \frac{SE + SP}{2} \tag{9}$$
$$\text{geometric mean (GM)} = \sqrt{SE \cdot SP} \tag{10}$$
$$\text{Youden } (J) = SE + SP - 1 \tag{11}$$
$$\text{Matthews correlation coefficient (MCC)} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TN + FP)(TN + FN)(TP + FN)(TP + FP)}} \tag{12}$$
$$\text{Kappa} = \frac{ACC - \dfrac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{M^2}}{1 - \dfrac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{M^2}} \tag{13}$$
where M is the number of molecules of the data set, MN is the number of molecules belonging to the class 0/negative, MP is the number of molecules belonging to the class 1/positive, TN and TP are the numbers of molecules correctly predicted as corresponding to the class 0/negative and class 1/positive, respectively, and FN and FP are the numbers of molecules incorrectly predicted as corresponding to the class 0/negative and class 1/positive, respectively. In these equations, TN denotes the number of true negatives, not the threshold of neighbors of Section 2.2.
Ballabio et al.30 have studied different metrics for the assessment of the classification performance, proving the excellent behavior of the sensitivity, specificity, accuracy, and CCR in the analysis of classification results for binary and balanced data sets. In that study, the authors show the correlation between MCC and the non-error rate of the classification for binary data sets, as well as between the Youden index and the area under the ROC curve, calculated in the form:
$$J = 2 \cdot AUC - 1 \tag{14}$$
In addition, we have calculated the area under the ROC curve (auROC) using the function perfcurve() included in the Statistics and Machine Learning Toolbox of Matlab R2018b.29 This function calculates auROC using as input data the true class labels and the scores obtained by the classification algorithm in the building of the models. For most of the algorithms, the score (i, j) can be interpreted as the posterior probability that a molecule i belongs to class j. Although the scores usually take values in the interval [0,1], perfcurve() does not impose any requirements on the input score range. Thus, the rivality index values have been used as scores for the calculation of auROC when the RINH algorithm is used as the classifier.
3. RESULTS
In this section, we initially describe the classification models built using the 12 machine learning algorithms for the four data sets under study. From the results, we select the algorithms with the best behavior in order to compare the best classification models generated with those models built using the RINH algorithm.
Figure 1. Results of the classification models built for the four data sets using the 12 algorithms considered in this study.
3.1. Construction of the Classification Models Using Machine Learning Algorithms. Classification models were generated for the four data sets using the 12 algorithms considered. Detailed information on the results is included in the Supporting Information (and is also shown in Figure 1). Figure 1 shows the values of some statistical parameters obtained for the classification models built (ACC, SE, SP, MCC, and auROC). A global view of the results shows that the best behavior is observed for the SVM algorithm with different kernels and for ensemble algorithms such as Random Forest, Subspace Discriminant, and Subspace kNN. The statistical parameters MCC and Kappa (see Supporting Information) show close or equal values independently of the balancing of the data sets and the algorithm used. We can also observe the high correlation between different statistics. For
instance, in the case of the DS01 data set, the correlation between the geometric mean (GM), Kappa, and MCC shows a value of r2 = 0.999. Consequently, the analysis of any of these parameters could support the study of the behavior of the different algorithms and different data sets. Table 1 shows a summary of the results included in the Supporting Information, gathering the results of the classification models generated with the SVM with linear kernel and Random Forest algorithms. For data sets with high modelability (DS01 and DS03), the ACC reaches values close to 0.87 with auROC values of 0.92−0.94 when using the SVM and RF algorithms, also reaching high values of MCC, GM, and Kappa (see Supporting Information). For data sets with low modelability (DS02 and DS04), these values diminish to 0.65−0.73 for the ACC and 0.71−0.79 for auROC, presenting very low values of MCC and Kappa (see Supporting Information). In addition, the values of Youden show a perfect correlation with the GM.

Table 1. Results of the Classification Process Using SVMLI and ENSRF Algorithms for the Four Datasets

               DS01            DS02            DS03            DS04
statistics     SVMLI  ENSRF    SVMLI  ENSRF    SVMLI  ENSRF    SVMLI  ENSRF
true class 0   35     36       27     26       56     54       38     49
true class 1   36     32       30     27       92     95       79     75
accuracy       0.866  0.829    0.695  0.646    0.865  0.871    0.692  0.734
sensitivity    0.878  0.780    0.732  0.659    0.860  0.888    0.849  0.806
specificity    0.854  0.878    0.659  0.634    0.875  0.844    0.500  0.645
CCR            0.866  0.829    0.695  0.646    0.867  0.866    0.675  0.726
MCC            0.732  0.662    0.391  0.293    0.722  0.727    0.377  0.459
Kappa          0.732  0.659    0.390  0.293    0.719  0.727    0.360  0.456
Youden         0.732  0.659    0.390  0.293    0.735  0.732    0.349  0.451
GM             0.866  0.828    0.694  0.646    0.867  0.866    0.652  0.721
auROC          0.927  0.933    0.744  0.705    0.917  0.944    0.765  0.790

3.2. Construction of the Classification Models Using the RINH Algorithm. The complete results of the classification models built for the four data sets using the RINH algorithm can be found in the Supporting Information. In this file, all the values of the statistical parameters described from eq 6 to eq 14 are gathered.
Figure 2. Values of the accuracy, sensitivity, specificity, MCC, and auROC for the four data sets using the RINH algorithm.
Figure 2 shows the classification results using the RINH algorithm with the density kernel. We can appreciate a clear increase in the values of the statistical parameters for all the data sets compared with the results obtained with the 12 machine learning algorithms. For data sets with high modelability, the auROC reaches values very close to 1, and for data sets with low modelability (DS02 and DS04), the values of auROC also show a clear increase (greater than +0.2). This behavior is also observed for the remaining statistical parameters. Thus, the low values of MCC obtained for the DS02 and DS04 data sets with the different algorithms executed with default parameter values (see Figure 1) are improved using the RINH algorithm for most of the TN values. In Figure 2, the behavior of the different statistical parameters versus the threshold of neighbors (TN) is portrayed. In this representation, we have shown values of TN between 1 and 40; for the DS01 and DS02 data sets, TN = 40 is the maximum value of the scope that could be considered. That is,
Figure 3. Percentage and number of molecules correctly and incorrectly classified at different values of TN using the RINH algorithm for the four data sets studied.
min(M0, M1) − 1 = min(41, 41) − 1 = 40. In the case of the DS03 and DS04 data sets, the maximum scope is 63 and 75, respectively. We can observe in Figure 2 that the results obtained with the RINH algorithm are comparable to or better than those obtained with the machine learning algorithms. The behavior of balanced and unbalanced data sets, and of high and low modelable data sets, is clearly different. The DS01 data set is barely impacted by the value of TN. This balanced and highly modelable data set shows high values of the statistical parameters for any value of TN. The worst results are obtained for intermediate values of TN because, for these values, there is a high number of molecules belonging to a different class in the neighborhood of each molecule and, therefore, some molecules take values of the rivality index that are positive but close to zero. This occurs because a considerable number of molecules belonging to a different class are located at a similar intermediate distance; that is, they are similar to each other with an intermediate similarity value. However, at low values of TN, the molecules most similar to each molecule of the data set belong to the same class as that molecule, and at high values of TN, a clear global dissimilarity between the molecules of both classes is shown. These characteristics of the DS01 data set are not found in the other balanced data set, DS02. This data set is also barely affected by TN. However, the worst results are obtained for low values of TN. This is because, for some molecules of the data set, there are molecules that belong to a different class but are very close. Consequently, when the cardinality of the neighborhood of a given molecule increases, the number of molecules in the neighborhood belonging to the same class also increases, and its rivality index tends to take negative values. This behavior of DS02 and DS04 demonstrates the low modelability of these data sets.
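The maximum scope used above, min(M0, M1) − 1, can be illustrated with a minimal sketch. The function name `max_scope` is hypothetical, and the class-0 size of 64 for DS03 is an assumption inferred from the stated maximum scope of 63 (the text gives 107 molecules for class 1).

```python
def max_scope(n_class0: int, n_class1: int) -> int:
    """Largest neighborhood threshold TN for a two-class data set.

    Follows the expression quoted in the text: min(M0, M1) - 1, where
    M0 and M1 are the numbers of molecules in class 0 and class 1.
    """
    if min(n_class0, n_class1) < 2:
        raise ValueError("each class needs at least two molecules")
    return min(n_class0, n_class1) - 1


# DS01 and DS02 are balanced, with 41 molecules per class.
print(max_scope(41, 41))   # 40

# DS03: 107 molecules in class 1; 64 in class 0 is inferred, not quoted.
print(max_scope(64, 107))  # 63
```

The minority class sets the limit, which is why the unbalanced data sets cannot use neighborhoods as large as their majority class would allow.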
The results obtained for the DS03 data set are a clear response to the unbalancing of this data set. Thus, for high values of TN we can observe a marked decrease of the specificity, due to the higher number of molecules belonging to class 1 (positive) in the neighborhood of the molecules belonging to class 0 (negative). Thus, this data set shows characteristics similar to those of the DS01 data set, because a considerable number of molecules with an intermediate value of similarity belong to a different class. The unbalanced DS04 data set is also affected by the increase of TN; in this case, however, the effect is lower because the percentages of molecules belonging to the two classes are more similar than in the DS03 data set. Therefore, the number of molecules belonging to both classes is also more similar with the increase of TN. The behavior described above can be clearly observed in Figure 3. This figure shows the number and percentage of molecules of both classes correctly classified by the RINH algorithm with density kernel for the four data sets. Once again, for balanced data sets the results of the classification process are quite similar regardless of the TN value. However, for intermediate values of TN, the classification performance is more balanced between the molecules of both classes, giving similar values of sensitivity and specificity. For unbalanced data sets of low modelability, in contrast, this behavior depends on the balancing characteristics and the number of molecules of the data set. Thus, the DS03 data set has a low number of molecules belonging to class 0, and low values of TN generate better results for the molecules of class 0 than higher values of TN. In the case of the DS04 data set, the higher number of molecules belonging to class 1 means that, with higher values of TN, the classification performance for these molecules increases while the classification performance for molecules belonging to class 0 decreases.
Thus, for the DS01 data set the best performance is obtained for low and high values of TN (1, 4, 5, 37, and 40), with 72 of the 82 molecules of the data set correctly predicted. In addition, we can also observe for this data set the low impact of TN in
Figure 4. Percentage of molecules participating in activity cliff pairs at different values of TN for the four studied data sets.
the molecules belonging to class 0 as a consequence of the higher number of molecules in the data set belonging to class 1 with respect to those belonging to class 0. Figure 4 also explains the behavior of the DS04 data set described above. We can observe that the percentage of molecules belonging to class 1 that participate in activity cliffs is higher than for DS03, and that they are only slightly affected by the increase of TN, as in the DS03 data set. Hence, the values of sensitivity for DS04 will be lower than for DS03, although these values are barely impacted by the increase of TN. Regarding the balanced data sets DS01 and DS02, we can also observe in Figure 4 the greater number of activity cliffs in DS02 with respect to DS01. However, in DS02, the molecules responsible for those activity cliffs belong to both classes for almost all the TN values, with the sensitivity and specificity tending to be constant for any value of TN (see Figure 2). On the contrary, for DS01 the molecules mainly responsible belong to class 0 for high values of TN, with the sensitivity tending to decrease for high values of TN, as shown in Figure 2. Comparing balanced and unbalanced data sets, we can observe that the detection of activity cliffs by means of the rivality index allows us to estimate the results of the classification process. Thus, high performance in the classification will be obtained for DS01 and DS03, as expected. On the other hand, for the DS02 and DS04 data sets, a low-medium performance will be obtained, although better results should be obtained for the DS04 data set. In addition, for both data sets sensitivity and specificity values should be similar. Figure 4 highlights the low modelability of these data sets. At any value of TN, for many molecules of the data set, a high number of molecules belonging to a different class is always present in their neighborhood.
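As an illustration of the activity cliff notion used in this section, the following minimal sketch flags pairs of molecules from different classes that are closer to each other than either one is to its nearest same-class molecule. This is only one possible reading of the definition and is an assumption; it does not depend on TN and does not use the rivality index, so it is not the procedure behind Figure 4. The function name, the distance matrix, and the toy data are hypothetical.

```python
def activity_cliff_pairs(dist, labels):
    """Return index pairs (i, j), i < j, flagged as activity cliffs.

    dist   : square, symmetric list of lists of pairwise distances
    labels : class label (0 or 1) of each molecule
    Assumed rule: the two molecules have different classes and their
    distance is smaller than the distance from either molecule to its
    nearest neighbor of its own class.
    """
    n = len(labels)
    nearest_same = []
    for i in range(n):
        same = [dist[i][k] for k in range(n)
                if k != i and labels[k] == labels[i]]
        nearest_same.append(min(same) if same else float("inf"))

    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j] and \
                    dist[i][j] < min(nearest_same[i], nearest_same[j]):
                pairs.append((i, j))
    return pairs


# Toy example: four molecules placed on a line at positions 0, 1, 10, 20.
positions = [0, 1, 10, 20]
labels = [0, 1, 0, 1]
dist = [[abs(a - b) for b in positions] for a in positions]
print(activity_cliff_pairs(dist, labels))
```

The fraction of molecules of each class appearing in at least one such pair gives a rough, TN-independent analogue of the percentages plotted in Figure 4.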
This fact leads to low performance of the machine learning algorithms, and it corroborates the improvement in performance achieved by the RINH algorithm.

3.3. Reliability of the Predictions Using the RINH Algorithm. In the building of a classification model, the reliability of the prediction is as important as the number of
the performance, so the number of molecules correctly classified varies from 66 to 72 for the different values of TN. The low modelability of DS02 can also be observed in Figure 3. The best performance is obtained for intermediate values of TN (13, 14, 31, and 32), correctly predicting 57 of the 82 molecules of this data set. In addition, we can observe clear changes in the performance for close values of TN; that is, the performance is unstable under small changes in TN. DS03 is a data set with high modelability, like DS01, and it also shows a behavior similar to that of the DS01 data set. We can observe in Figure 3 the low influence of TN on the performance of the classification process up to intermediate values. However, for high values of TN the performance in the prediction of molecules that belong to class 0 decreases, because many molecules of class 0 are quite similar to molecules belonging to class 1. For the DS04 data set, an unbalanced data set like DS03, the best performance is obtained for low values of TN, because at high values of TN a higher number of nearest neighbors belonging to class 1 are in the neighborhood of the molecules belonging to class 0. This is due to the unbalancing of this data set and the similarity between the molecules belonging to both classes.

3.2.1. Analysis of the High and Low Modelability of the Four Data Sets. The reason for this different behavior of DS03 and DS04 with respect to TN is clearly observed in Figure 4. This figure shows the percentage of molecules of each class participating in the formation of activity cliff pairs for values of TN between 1 and the maximum scope of each data set.
Following the standard definition, we consider activity cliffs to be those pairs of molecules belonging to different classes that are nearer to each other than to any other pair of molecules that belong to the same class.31,32 As observed, for DS03 a higher number of molecules belonging to class 0 participate in activity cliffs as TN increases. Then, as described above, the specificity of the prediction decreases with the increase of TN. This is due to the increase of the number of molecules belonging to class 1 in the neighborhood of
Figure 5. Reliability values at the maximum value of scope for the four studied data sets (red bullets are the molecules incorrectly classified, and blue bullets are the molecules correctly classified).
correctly predicted molecules. The reliability of the prediction is closely related to the applicability domain of a classification model, because high reliability values indicate that the algorithm is able to perform the prediction correctly regardless of the composition of the molecules of the data set, and vice versa. Analyzing eq 5, we can observe that the values of the reliability can be obtained for different values of the scope (the value of TN considered). For high values of the scope, a higher number of predictions are performed considering a higher number of TN values and, therefore, a higher diversity of molecules in the neighborhood of each molecule of the data set. A higher diversity in the neighborhood means that different compositions of the data set are considered, and therefore, the prediction can be considered within a wider applicability domain. Figure 5 shows the reliability values considering the maximum scope for the four data sets studied. In the case of the DS01 data set, most of the molecules can be classified (true or false) with 100% reliability. This demonstrates the high modelability of this data set. Only a few molecules, belonging to both classes, show a reliability value lower than 50%; besides, as can be observed, the values of sensitivity and specificity are quite similar. On the contrary, the DS02 data set, although it is also a balanced data set with the same number of molecules as the DS01 data set, shows a high number of molecules with intermediate values of reliability in their classification. As these molecules belong to both classes, the values of sensitivity and specificity are similar (see Figure 2) but clearly lower than in the case of the DS01 data set. In the case of DS03, most of the 107 molecules belonging to class 1 can be classified with 100% reliability (molecules 1−107). However, we have found low and intermediate values of reliability for the molecules of class 0.
As a result, as shown in Figure 2, the values of the specificity will be affected by the cardinality of the neighborhood. A similar effect happens in the case of the DS04 data set. However, although DS04 has a higher number of molecules with values of reliability influenced by the scope value, this influence is low, and the reliability of the prediction (correct or incorrect) of these molecules is greater than 0.7 in most cases. In summary, values of reliability close to 100% mean that the prediction of these molecules as correctly or incorrectly classified can be extended to any data set composition. This results in a high applicability domain of the model. Values of reliability close to zero mean that the prediction of these molecules is hard and very dependent on the data set composition. This results in a low applicability domain of the model.

3.3.1. Rivality and Reliability Values Describing the Different Behavior of the Molecules. In order to explain the different behavior of the molecules of a data set, we focus, for instance, on the DS01 data set. Three types of behavior are observed for the molecules of the data set, as shown in Figure 6. Molecule 1 has a reliability equal to 1 (see Figure 5). As observed in Figure 6 (top-left), the values of the rivality index for this molecule are always greater than zero, regardless of the value of TN. As observed in Figure 6 (bottom), the first three nearest neighbors of molecule 1 belong to a different class. Indeed, seven of the first 10 nearest neighbors of this molecule belong to class 1. As a result, this molecule, belonging to class 0, will always be predicted as belonging to class 1, and therefore, this molecule will be misclassified at any data set composition. Molecule 24 has a reliability value equal to −1 (see Figure 5). As observed in Figure 6 (top-left), the values of the rivality index for this molecule are always lower than zero, regardless of the value of TN. As observed in Figure 6 (bottom), the first three nearest neighbors of molecule 24 belong to the same class (class 0) as molecule 24. Indeed, the first ten nearest neighbors of this molecule also belong to class 0.
Figure 6. Rivality index values for different values of TN (top-left) and reliability values for different values of scope (top-right) for molecules 1, 24, and 54 of the DS01 data set, and molecular structures of molecules 1, 24, and 54 and their first three nearest neighbors (bottom).
included in the neighborhood can cause molecule 54 to be predicted as belonging to one class or the other. This can be observed in Figure 6 (bottom). The first two nearest neighbors of molecule 54 belong to a different class. However, the third nearest neighbor belongs to the same class as molecule 54. Among the first ten nearest neighbors of this molecule, neighbors 1, 2, 9, and 10 belong to class 0, and neighbors 3, 4, 5, 6, 7, and 8 belong to class 1. This behavior is observed for molecule 54 along the whole interval of TN values considered. On the other hand, Figure 6 (right) shows the values of the reliability of these three molecules of the DS01 data set, considering different values of scope in the calculation of the reliability of the classification. Thus, we can observe the three different behaviors of these three molecules. In the first place, low values of scope always generate low values of reliability because, according to eq 5, low confidence values are obtained for low cardinalities of the neighborhood, and therefore, the applicability domain of the prediction is also low.
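The dependence of the prediction for molecule 54 on the neighborhood size can be illustrated with a minimal sketch that tallies the classes of its first k nearest neighbors. It uses the neighbor classes listed in the text (neighbors 1, 2, 9, and 10 in class 0; neighbors 3 to 8 in class 1) and takes molecule 54 to belong to class 1, since its first two neighbors are described as belonging to a different class. The plain majority vote is an assumption used as a simple proxy; it is not the rivality index, and the function name is hypothetical.

```python
def neighbor_votes(own_class, neighbor_classes):
    """Tally same-class and different-class neighbors for k = 1..len(list).

    Returns a list of tuples (k, same, different, vote), where vote is the
    majority class among the first k neighbors, or None when tied.
    Assumes a two-class problem with labels 0 and 1.
    """
    rows = []
    same = 0
    for k, cls in enumerate(neighbor_classes, start=1):
        if cls == own_class:
            same += 1
        different = k - same
        if same > different:
            vote = own_class
        elif different > same:
            vote = 1 - own_class
        else:
            vote = None
        rows.append((k, same, different, vote))
    return rows


# Classes of the first ten nearest neighbors of molecule 54, as listed above.
molecule_54_neighbors = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
for k, same, different, vote in neighbor_votes(1, molecule_54_neighbors):
    print(k, same, different, vote)
```

With these ten neighbors the vote favors class 0 for k up to 3, ties at k = 4, and favors class 1 from k = 5 onward, which shows in a simplified way why the prediction for this molecule changes with the cardinality of the neighborhood.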
Consequently, this molecule, belonging to class 0, will always be predicted as belonging to class 0, and therefore, this molecule will be correctly classified at any data set composition. However, molecule 54 has a reliability value equal to 0.16 (see Figure 5). As observed in Figure 6 (top-left), this molecule shows positive values of the rivality index very close to zero for low values of TN (lower than 15). Moreover, as TN increases, the values of the rivality index decrease, passing from positive to negative values, also very close to zero. In addition, for intermediate values of TN, the rivality index takes negative values close to zero, passing to positive values at high values of TN. Therefore, the reliability of correctly/incorrectly predicting this molecule is low. If the cardinality of the neighborhood is low/high, the molecule would be incorrectly predicted as belonging to class 0. As the cardinality of the neighborhood increases (TN increases), the prediction of molecule 54 as belonging to class 0 or class 1 is highly uncertain, because any new molecule
Figure 7. Representation of the number of molecules correctly classified (left Y-axis), and reliability and 1 − confidence values (right Y-axis) for the four studied data sets.
Thus, at any value of scope, molecules 1 and 24 would be misclassified and correctly classified, respectively. Thus, for both molecules, the reliability of the prediction is 100%. However, observing the behavior of molecule 54, we can see that the reliability value decreases to zero for values of scope from 15 to 35, oscillating between zero and low positive values for values of scope greater than 15. This behavior highlights the serious difficulty of correctly predicting this molecule, because the prediction will be highly dependent on the presence or absence of other molecules in the data set, and the prediction as correctly/incorrectly classified will always have a low reliability value.

3.4. Applicability Domain Analysis Using the RINH Algorithm. Combining the values of confidence, reliability, and rivality for each molecule of the data set at different values of TN, it is possible to gather valuable information on the goodness and applicability domain of the classification model and, as discussed below, to set the best values of the prediction parameters in order to obtain an accurate classification model. In Figure 7, we have represented the number of molecules correctly classified (left Y-axis) and the corresponding reliability and 1 − confidence values (right Y-axis) for the four studied data sets and different values of TN. The plots of Figure 7 give a clear visualization of the characteristics of the data sets regarding their modelability and the applicability domain of the classification models built with the RINH algorithm. First, we can observe the reliability values for the maximum value of the scope of each data set. Data sets with high modelability (DS01 and DS03) have higher values of reliability than data sets with low modelability (DS02 and DS04). These values of the reliability are a clear measurement of the data set modelability.
Thus, if we represent the reliability versus 1 minus the confidence, as Figure 7 shows, we obtain a clear representation of the minimum value of TN that should be considered in the building of the classification model and, therefore, a robust measure of the data set applicability domain for the construction of a classification model. We can clearly observe in Figure 7 that data sets with high modelability require a low value of TN. For DS01, the intersection between the reliability and the confidence is close to a TN value equal to 12, and this value is close to 22 for the DS03 data set. For these data sets, a low number of molecules (around 30% of the maximum scope) is enough to build a robust model and to generate high ratios in the classification. Thus, for the DS01 data set, considering 12 molecules of each class is enough for obtaining a reliable model (30% of the molecules of each class), reaching a maximum reliability close to 0.8. In the case of DS03, this value is around 35 molecules of each class, that is, 55% of the molecules belonging to class 0 (the class with the lower number of molecules) and 33% of the molecules belonging to class 1, reaching a maximum reliability close to 0.6. On the contrary, data sets with low modelability require high values of TN. For the balanced data set DS02, this value is close to 25 (around 60% of the molecules of each class), reaching a maximum reliability close to 0.35. Finally, for the unbalanced data set DS04, this value is close to 40 (around 55% of the molecules belonging to class 0 and 43% of the molecules belonging to class 1), reaching a maximum reliability close to 0.5. Then, for these data sets, the composition of the molecules strongly affects the reproducibility of the classification model and, therefore, the robustness and applicability domain of the model. As a result, the classification models built will have low reliability. Table 2 shows the values of the statistical parameters obtained in the building of the models using the RINH algorithm for the four data sets.
This table shows the best results for (i) a given value of TN (first column), (ii) a value of TN greater than the intersection between the confidence and reliability (second column), (iii) the best local TN for each molecule of the data set
(third column), and (iv) the global TN of each molecule of the data set (fourth column). The best local TN for each molecule of the data set is calculated by means of a localTN() function (LTN), which assigns to each molecule of the data set the value of TN that generates the lowest value of the rivality index with the highest value of the reliability among all negative values of the rivality index, if such a value exists, or otherwise the value of TN that generates the highest value of the rivality index with the highest value of the reliability among all positive values of the rivality index. Therefore, the LTN function gathers the best and most reliable prediction of each molecule of the data set as correctly classified, if possible, or as misclassified otherwise. The predictions for the global TN of each molecule of the data set are calculated by means of a globalTN() function (GTN), which consists of the independent summation of the rivality index and reliability values for negative (correct prediction) and positive (incorrect prediction) values, for TN from 1 to the maximum scope. As a result, this function provides a measurement of the probability that each molecule of the data set is correctly or incorrectly classified. As observed in Table 2, the values of the statistics of the classification model using the GTN function are always the lowest ones. However, the reliability of the prediction using the GTN function is the highest, because all possible cardinalities of the neighborhood have been considered in the correct/incorrect classification of each molecule of the data set. As observed in Table 2, for the DS01 data set it is possible to obtain the best classification performance with the higher reliability (TN = 40, reliability = 0.99).
Then, in this data set, there is a clear dissimilarity between the molecules of both classes at any data set composition, which allows the generation of robust models. In the case of the DS02 data set, for low cardinalities of the neighborhood (low TN) the predictability of molecules belonging to class 0 is also low, giving low values of the specificity. Therefore, a high number of molecules is needed to improve the prediction of the molecules belonging to this class. However, when the number of molecules in the neighborhood increases, the predictability of molecules belonging to class 1 (the sensitivity value) decreases. As a result, the modelability of this data set is low, and the model has a low reliability at any data set composition. Similar behavior can be observed for the DS04 data set, although in this case the behavior of the molecules of each class is the opposite of that in the DS02 data set, and the reliability of the models is higher than for the DS02 data set. In addition, DS03 has a behavior similar to that of DS01. The differences are a consequence of the unbalancing of the data set. Thus, although the number of correctly predicted molecules of class 0 diminishes at high values of TN (diminishing the specificity value), at intermediate values of TN the reliability of the model is high and a high applicability domain is also detected, with an intermediate percentage of molecules of each class being necessary for generating robust models. We can appreciate that, even without the LTN function, the performance of the classification process using the RINH algorithm rivals and sometimes outperforms the different machine learning algorithms shown in Table 1. In addition, by using the LTN function, the values of the statistical parameters are clearly improved.

Table 2. Statistics Parameters Obtained for the Classification Models Generated for the Four Datasets Using Different Values of TN

DS01
statistics        TN = 5    TN = 40   LTN       GTN
true class 0      38        38        39        36
true class 1      34        34        38        35
true molecules    72        72        77        71
accuracy          0.878     0.878     0.939     0.866
sensitivity       0.829     0.829     0.927     0.854
specificity       0.927     0.927     0.951     0.878
MCC               0.760     0.760     0.878     0.732
auROC             0.990     0.977     0.982     0.970
reliability       0.122     0.940     0.902     0.985

DS02
statistics        TN = 13   TN = 32   LTN       GTN
true class 0      25        29        34        24
true class 1      32        28        37        29
true molecules    57        57        71        46
accuracy          0.695     0.695     0.866     0.646
sensitivity       0.780     0.683     0.902     0.707
specificity       0.610     0.707     0.829     0.585
MCC               0.396     0.390     0.734     0.295
auROC             0.829     0.873     0.901     0.816
reliability       0.292     0.708     0.799     0.946

DS03
statistics        TN = 2    TN = 30   LTN       GTN
true class 0      43        43        56        43
true class 1      101       97        104       99
true molecules    144       140       160       142
accuracy          0.842     0.819     0.936     0.830
sensitivity       0.944     0.907     0.972     0.925
specificity       0.672     0.672     0.875     0.672
MCC               0.659     0.605     0.862     0.632
auROC             0.961     0.962     0.981     0.929
reliability       0.031     0.444     0.848     0.971

DS04
statistics        TN = 2    TN = 71   LTN       GTN
true class 0      57        38        67        72
true class 1      72        87        89        83
true molecules    129       125       156       125
accuracy          0.763     0.740     0.923     0.740
sensitivity       0.774     0.936     0.957     0.892
specificity       0.750     0.500     0.882     0.553
MCC               0.523     0.494     0.845     0.480
auROC             0.907     0.838     0.924     0.865
reliability       0.026     0.848     0.795     0.920
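The relation between the counts and the ratios reported in Table 2 can be checked with a minimal sketch. It assumes the standard definitions of accuracy, sensitivity, specificity, and MCC (eqs 6 to 14 are not reproduced in this section), takes class 1 as the positive class as stated in the text, and uses the DS01 class sizes of 41 molecules each. The function name and argument names are hypothetical.

```python
import math


def classification_stats(true_class0, true_class1, n_class0, n_class1):
    """Standard binary statistics from per-class counts of correct predictions.

    Class 1 is treated as positive and class 0 as negative. The names
    true_neg/true_pos are spelled out to avoid confusion with the
    threshold of neighborhood (TN) used in the text.
    """
    true_neg, true_pos = true_class0, true_class1
    false_pos = n_class0 - true_class0   # class 0 molecules predicted as class 1
    false_neg = n_class1 - true_class1   # class 1 molecules predicted as class 0

    accuracy = (true_pos + true_neg) / (n_class0 + n_class1)
    sensitivity = true_pos / n_class1
    specificity = true_neg / n_class0
    denom = math.sqrt(
        (true_pos + false_pos) * (true_pos + false_neg)
        * (true_neg + false_pos) * (true_neg + false_neg)
    )
    mcc = (true_pos * true_neg - false_pos * false_neg) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# DS01 row with 38 correct class 0 and 34 correct class 1 molecules.
stats = classification_stats(38, 34, 41, 41)
print({name: round(value, 3) for name, value in stats.items()})
```

Rounded to three decimals, these values reproduce the accuracy, sensitivity, specificity, and MCC listed for the DS01 rows with 38 and 34 correctly classified molecules.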
Figure 8. Results of the classification using the RINH algorithm with LTN for the four data sets studied.
Then, the best local TN for these molecules is the maximum cardinality of the neighborhood, that is, the value min(M0, M1) − 1 = 40. Molecules 57, 73, and 80 have a similar behavior: they belong to class 1 and are always predicted as false class 0. On the contrary, many molecules are always predicted correctly, regardless of the data set composition, with the maximum value of reliability (equal to −1). Consequently, all these molecules contribute a high applicability domain to the classification models. However, we can appreciate in Figure 8 some molecules correctly predicted by LTN with a very high value of reliability but at a very low TN value. These molecules, such as molecule 47, will only be predicted correctly at very low values of TN; thus, any change in the data set composition will generate an incorrect prediction of these molecules. In addition, other molecules with a very low value of reliability (see molecule 29) will also be correctly predicted only when using the LTN function at low values of TN. In those cases, the molecules show an oscillating behavior regarding their correct/incorrect prediction as TN increases, generating values of the rivality index close to zero (positive/negative). Therefore, the composition of the data set strongly affects the capacity of prediction of this type of molecule. This analysis can be extended to the remaining data sets, as can be observed in Figure 8. For the DS02 data set, a high number of molecules can be predicted correctly, but with low values of reliability. These molecules are present in a similar percentage in both classes, although a higher number of molecules belonging to class 0 will always be incorrectly predicted (reliability equal to 1). Observing the DS03 and DS04 data sets, we can appreciate the poor behavior of the molecules belonging to class 0 compared with the molecules belonging to class 1. Most molecules belonging to class 0 show a low reliability value in their prediction. Although those molecules can be correctly predicted for a given data set composition, these molecules are greatly affected by the
Thus, although the improvement is low for data sets with high modelability (DS01 and DS03), for data sets with low modelability (DS02 and DS04) the RINH algorithm is capable of obtaining very high values of the statistical parameters, being able to correctly classify most of the molecules of these data sets. However, although the use of the LTN function allows us to improve the classification results, the reliability of these results is questionable, and so is the applicability domain of the generated classification model. As is also observed in Table 2, using the GTN function the performance of the classification decreases for the four data sets. However, the reliability of these predictions is the highest. For data sets with high modelability (DS01 and DS03), the accuracy of the prediction using GTN is similar to the one obtained for the different values of TN and slightly lower than the one obtained using LTN. However, the reliability of the prediction using GTN is clearly higher, which demonstrates the high modelability of these data sets and the high reliability of the classification models built. For data sets with low modelability (DS02 and DS04) we can observe in Table 2 a clear decrease of the accuracy of the model built when using GTN. This decrease is moderate with respect to the different values of TN, but it is very significant with respect to the accuracy of the models using the LTN function. This demonstrates the difficulty of correctly/incorrectly classifying several molecules of these data sets. Figure 8 shows the results of the classification models generated for the four data sets studied using the LTN function. The analysis of this figure and the results provide very important information about the data set characteristics and about the applicability domain of the models generated with RINH or any other machine learning classification algorithm. In Figure 8, we can observe the high modelability of the DS01 data set.
Most molecules can be predicted with a high reliability. For instance, molecules 1 and 17 will always be predicted as false positive (false class 1) regardless of the data set composition. These molecules, which belong to class 0, are always predicted as class 1 at any value of TN.
Figure 9. Reliability of the classification model for the four studied data sets using the GTN function.
existence in the data set of a very low number of molecules belonging to their same class. Therefore, these molecules will be misclassified at almost every data set composition. In addition, the molecules correctly classified by LTN at high values of TN but with low/intermediate reliability values (i.e., molecules 54 and 67) are also correctly classified by GTN with high reliability (60.06 and 88.21, respectively). Therefore, the results obtained with GTN prove the high modelability of the DS01 data set, with high reliabilities of the prediction (as true or false) for all the molecules of the data set. In Figure 9, we can also appreciate the clear differences between data sets with high and low modelability. For the DS02 data set we also observe that most of the molecules are predicted by the GTN function with 100% reliability, or close to it. However, for this data set the number of misclassified molecules is higher than for DS01, in spite of both data sets having the same number of molecules. We ought to point out that the most important results might be the ones related to those molecules correctly/incorrectly classified by GTN with a low/intermediate value of reliability. If we focus on molecule 6 (or molecules 12, 14, 21, 25, 46, 47, 52, 59, 61, and 64), it is correctly classified by LTN with a positive value of reliability (see Figure 8). This positive value of the reliability is inconsistent with the consideration of molecule 6 as belonging to class 1. This inconsistency is solved by GTN, which misclassifies this molecule with 100% reliability. As a result, molecules correctly classified by LTN with positive values of reliability are misclassified by GTN, and vice versa.
Thus, although in some specific conditions of the data set composition molecules with this type of behavior could be correctly classified by some machine learning algorithms (or RINH), these molecules are outside of the AD for most of the data set compositions, and this information should be considered in the construction of the classification model and in the new predictions.
presence of other molecules in the data set. As a result, these molecules will only be correctly predicted at low cardinality values of the neighborhood or if the data set composition is more balanced. A deep analysis of this study also allows us to extract valuable information about the molecules of the data set. For instance, we can appreciate that the molecules 145 to 171 of the DS03 data set (or some molecules between the 101 to 140 in DS04 data set) will always be incorrectly/correctly predicted but with very low reliability values. These molecules have been detected as participating in a high number of activity cliffs31,32 and are, therefore, molecules that belong to class 0 but are very similar to molecules that belong to class 1. These molecules usually show values of their rivality index close to zero, and their behavior has been described in previous papers being called “activity borders”24 (for instance, see Molecule 31 in Figure 6). Figure 9 shows the results of the classification model for the four data set using the GTN function. We have centered our attention mainly in DS01 data set in order to simplify the analysis. The results described for this data set can be extended to the remaining data sets studied in this manuscript. We can observe in Figure 9 the high reliability of the classification model built for DS01 data set using the GTN function. Most molecules are classified (correctly/incorrectly) with a reliability close to 100%. Regarding the remaining molecules described above for this data set using the LTN function, the analysis carried out can be clearly observed in Figure 9. Molecules correctly/incorrectly classified by LTN with a 100% of reliability, as expected, are also correctly/incorrectly classified by GTN with a 100% of reliability. 
Molecules correctly classified by LTN at low values of TN, and therefore, very low values of reliability (molecules 29, 34, 41, 47 and 66), are predicted as false class 1 by GTN with a high reliability (78.28, 57.86, 99.75, 99.75 and 99.23, respectively). These molecules could only be correctly classified at very low values of TN; that is, the correct classification is linked to the N
As observed, the RINH algorithm offers valuable information about misclassified molecules, as required by the OECD guidelines.1,2 In addition, the RINH algorithm contributes valuable and clear information on the behavior of molecules that are hard to predict. For instance, molecule 151 in the DS02 data set and molecule 141 in the DS04 data set are correctly classified as belonging to class 0 but with very low values of reliability (−0.40 and −1.52, respectively). Molecules with this behavior show reliability values very close to zero (positive or negative) for many values of TN (see Figure 8). This is due to the low value of the rivality index (positive/negative) of these molecules at those values of TN and the oscillating behavior of these values with changes in TN. Molecules with this oscillating behavior have been described above as “activity borders”. This type of molecule is correctly/incorrectly classified depending on the cardinality of its neighborhood. Such molecules are very dependent on the molecules of both classes present in the data set and, therefore, on the number of nearest neighbors considered.

4. DISCUSSION
4.1. Comparative Analysis of the Results between Machine Learning and RINH Algorithms. In order to support the above analysis of the results obtained with the RINH algorithm, it is necessary to compare these results with the ones obtained in the construction of the classification models using the 12 machine learning algorithms considered in this paper. For clarity, we have focused our analysis on the DS01 data set, because it is one of the data sets with a lower number of molecules and a lower number of misclassified molecules. Figure 10 (top) shows the molecules misclassified by the 12 machine learning algorithms in the building of the classification models for the DS01 data set, together with the consensus results of these algorithms, which consider a molecule misclassified when more than 50% of the algorithms misclassify it. In addition, Figure 10 (top) represents the molecules misclassified by the RINH algorithm, considering values of TN equal to 5 and 40, values of TN selected by the LTN function, and the GTN function. The results for the RINH algorithm are the same as those detailed in Table 2 and in section 3.4. As can be observed in Figure 10 (top), SVM with linear kernel (SVMLI) and Random Forest (ENSRF) are among the best algorithms and generate the best predictions. Other SVM kernels, such as Gaussian and cubic, generate results similar to those of the linear kernel, although some misclassified molecules are different. Other ensemble algorithms behave similarly to Random Forest. First, we can observe, as described above, the poor behavior of some machine learning algorithms. For instance, CT, KNN, RLI, LR, etc., generate a high number of misclassified molecules. Although DS01 is a data set with high modelability, those algorithms are unable to build accurate classification models.

Placing our attention on the SVMLI and ENSRF algorithms, we can observe a first group of molecules misclassified by both algorithms and by the consensus of the 12 algorithms: molecules 17, 31, 41, 52, 57, 73, and 80.

Molecules 17, 57, 73, and 80 are also misclassified when the LTN and GTN functions are used because these molecules never take negative values of the rivality index for any value of TN. As a result, the reliability of the misclassification of these molecules is complete (100%), and these molecules will always be misclassified for any data set composition (see Figures 8 and 9). Regarding molecule 41, as shown in Figure 10 (center), it only takes negative values of the rivality index at TN = 1, that is, with very low reliability. For TN higher than 2, the rivality index takes positive values, and for values of TN close to the maximum scope, the rivality index takes values very close to zero. Therefore, this molecule is only correctly classified at a value of TN equal to 1, that is, with very low reliability, and as TN increases (high reliability) this molecule is misclassified. Revisiting Figure 8, the value of the reliability is positive and close to 0. In addition, as observed in Figure 9, this molecule is misclassified by the GTN function with a reliability close to 100%. Therefore, molecule 41 should be considered a misclassified molecule by RINH, as it is by SVMLI, ENSRF, and the consensus of the algorithms. Molecule 52 is misclassified by SVMLI, ENSRF, and the consensus of the machine learning algorithms (seven algorithms misclassify this molecule). However, this molecule is correctly classified by five machine learning algorithms (CT, SVMGA, ENSGB, ENSLB, and RLI) and also by the RINH algorithm. As can be appreciated in Figure 10 (center), this molecule takes negative values of the rivality index for all values of TN between 1 and 40, and Figures 8 and 9 show that molecule 52 is correctly classified with a reliability of 100% using the LTN and GTN functions.

Thus, although SVMLI and ENSRF have been unable to correctly classify this molecule, molecule 52 should always be correctly classified at any data set composition. Regarding molecule 31, misclassified by SVMLI and ENSRF (and by most of the algorithms), the RINH algorithm is able to correctly classify this molecule at low, intermediate, and high values of TN. The behavior of its rivality index with respect to TN has been described in Figures 6, 8, and 9, and it can be observed in Figure 10 (center). Except for very few values of TN (lower than 5), this molecule behaves as an “activity border”, having values of the rivality index close to zero. However, for most of the TN values, this molecule can be correctly classified. Therefore, the prediction has a very high value of reliability (see Figure 9). Observing Figure 10 again, we can appreciate another group of molecules misclassified by SVMLI but correctly classified by ENSRF (molecules 1, 26, 39, and 58); these results are discussed below:
• Molecule 1: Most of the machine learning algorithms also misclassify this molecule, as the RINH algorithm does. This molecule, as observed in Figure 10 (center), has positive values of the rivality index at any value of TN, so the reliability of the misclassification of this molecule is 100% (see Figures 8 and 9).
• Molecule 26: Most of the machine learning algorithms correctly classify this molecule, as the RINH algorithm does. This molecule, as observed in Figure 10 (center), has negative values of the rivality index at any value of TN, so the reliability of correctly classifying this molecule is 100% (see Figures 8 and 9).
In addition, if the SVMLI algorithm is executed 10 times, 60% of the models classify this molecule correctly, showing the low reproducibility of this algorithm and the goodness of the results of the RINH algorithm.
• Molecule 39: This molecule is correctly classified by the RINH algorithm (see Figures 8 and 9). Its behavior is similar to the one described above for molecule 31. We can see in Figure 10 (center) that for values of TN lower than 6, this molecule takes negative values of the rivality index, and when TN increases, the rivality index takes values close to zero (negative/positive). This kind of molecule has been defined as an “activity border”, that is, a molecule with a rivality index close to zero, varying from positive to negative values (or vice versa) depending on the data set composition. Therefore, researchers should consider the results obtained for this molecule in light of the data set composition.
• Molecule 58: The behavior explained above for molecule 26 can be extended to this molecule. This molecule always takes negative values of the rivality index, as observed in Figure 9 (center); therefore, this molecule can be correctly classified with a reliability of 100%.

Moreover, observing Figure 10, we can appreciate another group of molecules correctly classified by SVMLI but misclassified by ENSRF (molecules 29, 34, 42, 64, 66, 72, and 82); these results are discussed below:
• Molecule 29: The behavior explained previously for molecule 31 can be extended to this molecule (see Figure 10 bottom). Molecule 29 is an “activity border”, having values of the rivality index close to zero (negative and positive) at different values of TN. As can be observed in Figure 8, the reliability of the prediction (as true) of this molecule is very close to zero. Thus, the RINH algorithm also misclassifies this molecule with a high reliability of 78.28% (see Figure 9), as ENSRF does.
• Molecule 34: This molecule shows negative values of the rivality index for values of TN lower than 8. For greater values of TN, the rivality index clearly takes positive values until values of TN close to the maximum scope, at which these values are close to zero. As observed in Figure 8, this molecule could be correctly classified with low values of reliability. Therefore, this molecule should be misclassified for most data set compositions, as happens when the GTN function is used (see Figures 9 and 10).
• Molecule 66: This molecule only shows negative values of the rivality index for values of TN lower than 2 (see Figure 10 bottom). For values from 2 to 40, the values of the rivality index are positive; therefore, this molecule would be misclassified for almost all data set compositions with a reliability value close to 100% (see Figures 8 and 9).
• Molecules 42, 64, and 72: The explanation given above for molecules 26 and 58 can be extended to molecules 42, 64, and 72. These molecules show negative values of the rivality index for any TN value (see Figure 10 bottom), so these molecules can be correctly predicted with a reliability of 100% by the RINH algorithm (see Figures 8 and 9). In addition, molecules 42 and 72 are correctly classified by most of the algorithms, and when the ENSRF algorithm is executed using the leave-one-out (LOO) technique, it classifies these three molecules correctly.
• Molecule 82: This molecule also behaves similarly to molecule 31 (see Figure 10 bottom) and only takes positive values of the rivality index for values of TN between 6 and 16, and these values are very close to zero (“activity border”). However, at any other value of TN between 1 and 40, this molecule takes negative values of the rivality index, as shown in Figure 10 (bottom). As a result, and as observed in Figures 8 and 9, there is a very high reliability of predicting this molecule correctly (absolute values close to 1). In addition, most of the algorithms classify this molecule correctly, and when the ENSRF algorithm is executed using the leave-one-out (LOO) technique, it correctly classifies this molecule, as does the RINH algorithm.

Finally, we can observe in Figure 10 another group of molecules correctly classified by SVMLI and ENSRF but misclassified by the RINH algorithm at some values of TN: molecules 47, 53, and 67.
• Molecule 47: As observed in Figure 10 (bottom), molecule 47 takes positive values of the rivality index for most values of TN. Only for values of TN of 1, 6, 7, and 40 does this molecule take negative values of the rivality index, and those values are very close to zero. Therefore, this molecule could only be correctly classified at low values of TN, with a very low reliability that is positive and close to zero. As a result, molecule 47 should be considered a misclassified molecule, as it is by the RINH algorithm (see also Figures 8 and 9). Indeed, the ENSRF algorithm executed with the LOO technique also misclassifies molecule 47.
• Molecule 53: The results of the classification models using 5-fold cross validation of the SVMLI and ENSRF algorithms are hard to interpret, being different from those of several of the other machine learning algorithms considered. As observed in Figure 10 (bottom), this molecule shows positive values of the rivality index for most of the TN values, and only for very low and very high values of TN does the rivality index take values very close to zero (negative/positive). Therefore, as shown in Figures 8 and 9, this molecule should be misclassified with a very high reliability (close to 100%) at any data set composition. This molecule belongs to class 1. However, its first nearest neighbor belongs to class 0; among its five nearest neighbors, three molecules belong to class 0; and among its 10 nearest neighbors, six molecules belong to class 0 (37, 40, 28, 26, 9, 25), etc., and it has a greater similarity to these molecules than to the molecules of class 1 included in the corresponding neighborhood. Therefore, for any neighborhood considered we have positive values of the rivality index. In addition, when SVMLI and ENSRF are executed with the LOO technique, this molecule is also misclassified, as it is by the RINH algorithm.
• Molecule 67: The behavior of molecule 67 is a clear example of a molecule strongly affected by the data set composition. At low values of TN (lower than 4), this molecule takes positive values of the rivality index (see Figure 10 bottom). Therefore, the molecule is misclassified but with a low reliability (see Figure 10 top).
Figure 10. Molecules misclassified by the 12 machine learning and RINH algorithms for the DS01 data set and rivality index values at different values of TN for some molecules of the DS01 data set.
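The consensus row of Figure 10 (top) follows the rule stated in section 4.1: a molecule counts as misclassified by the consensus when more than 50% of the algorithms misclassify it. A minimal Python sketch of that voting rule is given below. The function name, the dictionary layout, and the toy input are hypothetical illustrations and are not the data behind Figure 10.

```python
from typing import Dict, List, Set


def consensus_misclassified(
    misclassified_by: Dict[str, Set[int]], threshold: float = 0.5
) -> List[int]:
    """Return the molecule ids misclassified by more than `threshold`
    (as a fraction) of the algorithms.

    `misclassified_by` maps an algorithm label to the set of molecule ids
    that this algorithm misclassified (hypothetical layout).
    """
    n_algorithms = len(misclassified_by)
    if n_algorithms == 0:
        raise ValueError("at least one algorithm is required")
    votes: Dict[int, int] = {}
    for molecules in misclassified_by.values():
        for mol in molecules:
            votes[mol] = votes.get(mol, 0) + 1
    # Strict inequality: exactly 50% of the votes is not enough.
    return sorted(m for m, v in votes.items() if v / n_algorithms > threshold)


# Toy input (hypothetical, NOT the results reported in the paper).
toy = {
    "ALG_A": {17, 52, 80},
    "ALG_B": {17, 80},
    "ALG_C": {17, 41},
    "ALG_D": {52, 80},
}
print(consensus_misclassified(toy))
```

Under this rule, a molecule misclassified by seven of 12 algorithms (the situation reported in the text for molecule 52) receives 7/12 of the votes, which exceeds 0.5, so it enters the consensus.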
For most of the intermediate values of TN, molecule 67 takes negative values of the rivality index, and at values close to the maximum scope the values of the rivality index tend to be close to zero. As a result, as shown in Figures 8 and 9, this molecule can be correctly predicted for most data set compositions, with a reliability value close to 90%, as generated by the GTN function, which indicates that this molecule could be correctly classified at most data set compositions, as the results of the RINH algorithm describe.

In this detailed study of the results of the classification models using the machine learning algorithms and the RINH algorithm, we have observed the capacity of the RINH algorithm to inform about the classification results of the built model. The accuracy of the
Figure 11. Matthews correlation coefficient (MCC) versus the ratio between the number of molecules and the number of descriptors obtained in the classification models generated for 20 data sets using Linear SVM (SVMLI), Random Forest (ENSRF), and RINH algorithms.
models using RINH matches and sometimes outperforms that obtained with the machine learning algorithms. Moreover, the RINH algorithm favors the analysis of the characteristics of each molecule of the data set and the AD of the resulting model. Thus, three types or groups of molecules can be detected in a data set:
• Molecules that would always be correctly classified independently of the data set composition. These molecules always show negative values of the rivality index and, therefore, would be correctly classified with 100% reliability.
• Molecules that would always be misclassified independently of the data set composition. These molecules always show positive values of the rivality index and, therefore, would be misclassified with 100% reliability, being considered as outliers in any model.
• Molecules more or less affected by the data set composition. These molecules would be correctly/incorrectly classified depending on the cardinality of the neighborhood considered and, therefore, depending on the data set composition.
For the molecules of the last group, the reliability value is an important contribution for researchers. Values of the reliability close to zero would inform about the uncertainty of the prediction (as correctly/incorrectly classified). On the contrary, absolute values greater than 0.5 would inform about the accurate prediction of these molecules.

4.2. Validation of the Results. To explain the behavior of the RINH algorithm clearly and to compare the performance of the classification models built using this algorithm with those generated with the 12 machine learning algorithms, we have had to select data sets with a low/intermediate number of molecules.
Although the selected data sets are orthogonal, having different balancing and modelability, and the results obtained could be extended to any other data sets, we have considered it necessary to demonstrate the goodness of our proposal by testing the RINH algorithm with data sets with a lower/higher number of molecules. By carrying out these calculations, we could show that the good behavior of the RINH algorithm with respect to other machine learning algorithms does not directly depend on the ratio between the number of molecules of the data set and the number of descriptors or independent variables used for the construction of the classification models.

Thus, we have also selected from the Chembench web site26 another 20 data sets of different characteristics, and we have built classification models using the Linear SVM, Ensemble Random Forest, and RINH algorithms. Details of these data sets and the results obtained are included in the Supporting Information. The data sets selected contained from 122 to 818 molecules with a modelability of reference26 from 0.67 to 0.93. After cleaning the descriptor matrices of these data sets, the number of descriptors considered varied from 110 to 162. Thus, the ratio between the number of molecules and the number of descriptors considered for each data set varied from 0.77 to 7.41.

The Supporting Information gathers the values of the statistical parameters obtained in the construction of the classification models using the Linear SVM (SVMLI), Random Forest (ENSRF), and RINH algorithms. Also, Figure 11 represents the Matthews correlation coefficient (MCC) values obtained in those models versus the ratio between the number of molecules and the number of variables contained in the corresponding descriptor matrices. As observed in the Supporting Information and Figure 11, there is no evidence of a direct correlation between this ratio (molecules/descriptors) and the better or worse behavior of the machine learning algorithms with respect to the RINH algorithm. For most of the 20 data sets, the RINH algorithm is able to obtain similar or higher values of accuracy than the SVMLI or ENSRF algorithms, independent of the value of the ratio (molecules/descriptors). In some cases, RINH with the GTN function generates lower values of MCC than SVMLI and/or ENSRF. However, as observed in the Supporting Information, the reliability of the generated models is very high (greater than 0.95). When the LTN function is used, RINH generates very accurate models with high values of the statistical parameters. In addition, for most of these models, the reliability values are also very high, independent of the ratio (molecules/descriptors) value.
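For reference, the Matthews correlation coefficient plotted in Figure 11 is computed from the four cells of a binary confusion matrix as MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). In this formula TN denotes the number of true negatives, not the neighborhood threshold TN used elsewhere in the paper. The Python sketch below applies this standard definition. The function name, the argument names, and the sample counts are illustrative assumptions and are not taken from the Supporting Information.

```python
import math


def matthews_corrcoef(true_pos: int, true_neg: int,
                      false_pos: int, false_neg: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = true_pos * true_neg - false_pos * false_neg
    denominator = math.sqrt(
        (true_pos + false_pos)
        * (true_pos + false_neg)
        * (true_neg + false_pos)
        * (true_neg + false_neg)
    )
    # Convention assumed here: return 0.0 when a marginal sum is zero
    # and the coefficient would otherwise be undefined.
    return numerator / denominator if denominator else 0.0


# Illustrative counts only (not results from the paper).
print(round(matthews_corrcoef(40, 30, 10, 20), 4))
```

The coefficient ranges from −1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect classification). Because it uses all four cells, it is often preferred over plain accuracy for unbalanced data sets such as some of those considered here.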
5. CONCLUSIONS
As the OECD guidelines describe,1,2 the interpretability of a classification model and, therefore, the interpretation of the misclassification results are as important as the robustness of the model.
In the studied data sets, we have found groups of molecules that will always be incorrectly classified. These molecules are detected by the RINH algorithm with an absolute value of reliability equal to 1. In addition, the RINH algorithm also detects those groups of molecules that will be correctly classified independently of the training set composition. Those molecules are predicted with an absolute value of the reliability equal to 1 or very close to 1. Therefore, the RINH algorithm is able to provide valuable information about the classification and misclassification of the QSAR model.

Moreover, other groups of molecules are correctly or incorrectly classified depending on the data set composition. These molecules show values of reliability close to zero for any value of TN. These molecules either (i) always have values of the rivality index close to zero, and are therefore activity borders, or (ii) have a rivality index whose value changes from positive to negative (or vice versa) depending on the value of TN. Thus, the RINH algorithm would only predict those molecules, correctly or incorrectly, with low confidence. Therefore, if the corresponding group of similar molecules is absent from the training set or is incomplete, those molecules will be misclassified. As a result, any model generated with any machine learning algorithm will be highly affected by the training set composition or the technique used.

In summary, all molecules misclassified by the RINH algorithm have also been misclassified by several of the machine learning algorithms used. The differences in the results between those algorithms are due to the mathematical basis used in the construction of the classification model. The RINH algorithm detects the molecules most commonly misclassified by those algorithms and provides a strong and robust explanation of the results. In addition, the RINH algorithm detects the correctly classified molecules and contributes the reliability of the prediction. This reliability value provides researchers with a valuable metric for assessing the accuracy of the results and the applicability domain of the QSAR model built.

Thus, taking the similarity between the molecules of the data set as the classification criterion, and the distance and density of the molecules included in the neighborhood of each molecule as the applicability domain, the RINH algorithm is able to match and sometimes even outperform the results obtained by the machine learning algorithms when these algorithms are run with the default values of the execution parameters (without tuning). It has the added advantage of providing researchers with clear and robust information about the characteristics of the molecules of the data set to be used in the construction of a classification model and about the applicability domain of the model to be used later in predictions of new molecules.
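The grouping of molecules summarized in these conclusions can be illustrated with a short sketch. It assumes that the rivality index of a molecule has already been computed at each value of TN (that computation is described earlier in the paper and is not reproduced here). It uses only the sign convention stated in the text: negative values correspond to a correct classification and positive values to a misclassification. The function name, the group labels, and the sample values are hypothetical. The sketch does not compute the reliability measure of the RINH algorithm.

```python
from typing import Sequence


def sign_pattern_group(rivality_by_tn: Sequence[float]) -> str:
    """Assign a molecule to one of three groups from the sign of its
    rivality index across the TN values considered."""
    values = list(rivality_by_tn)
    if not values:
        raise ValueError("at least one rivality index value is required")
    if all(v < 0 for v in values):
        # Negative at every TN: correct for any data set composition.
        return "always correctly classified"
    if all(v > 0 for v in values):
        # Positive at every TN: misclassified for any composition.
        return "always misclassified"
    # Mixed signs (or zeros): the outcome depends on the neighborhood.
    return "depends on the data set composition"


# Hypothetical rivality index profiles (NOT values from the paper).
examples = {
    "mol_a": [-0.60, -0.55, -0.40, -0.35],
    "mol_b": [0.50, 0.45, 0.30, 0.20],
    "mol_c": [-0.05, 0.02, 0.04, -0.01],
}
for name, profile in examples.items():
    print(name, "->", sign_pattern_group(profile))
```

In this toy input, `mol_c` oscillates around zero, which is the pattern the paper calls an “activity border”.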
AUTHOR INFORMATION
Corresponding Author
*(I.L.R.) E-mail: [email protected]. Telephone: +34-957-212082.
ORCID
Irene Luque Ruiz: 0000-0003-2996-7429
Miguel Ángel Gómez-Nieto: 0000-0002-1946-5495
Notes
The authors declare no competing financial interest.
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00264. Results of the statistic parameters obtained in the classification model built for the four data sets and the 12 machine learning algorithms, results of the statistic parameters obtained in the classification model built for the four data sets using RINH algorithm at different TN values, and results of the classification models for 20 data sets using Linear SVM, Random Forest and RINH algorithms (XLSX)

REFERENCES
(1) Organization for Economic Co-operation and Development. OECD Principles for the Validation, for Regulatory Purposes of (Quantitative) Structure-Activity Relationship Models. Available online: http://www.oecd.org/chemicalsafety/risk-assessment/37849783.pdf (accessed on January 2019).
(2) European Commission. QSAR Model Reporting Format (QMRF). Available online: https://ec.europa.eu/jrc/en/scientifictool/qsar-modelreporting-format-qmrf (accessed on January 2018).
(3) Liu, P.; Long, W. Current Mathematical Methods Used in QSAR/QSPR Studies. Int. J. Mol. Sci. 2009, 10 (5), 1978−1998.
(4) Eklund, M.; Norinder, U.; Boyer, S.; Carlsson, L. Choosing Feature Selection and Learning Algorithms in QSAR. J. Chem. Inf. Model. 2014, 54 (3), 837−843.
(5) Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47 (1), 219−227.
(6) Organization for Economic Co-operation and Development. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. OECD Environment Health and Safety Publications. http://www.oecd.org/env/guidancedocument-on-the-validation-of-quantitative-structure-activityrelationship-q-sar-models-9789264085442-en.htm (accessed on January 2019).
(7) Weaver, S.; Gleeson, M. P. The importance of the domain of applicability in QSAR modeling. J. Mol. Graphics Modell. 2008, 26, 1315−1326.
(8) Roy, K.; Ambure, P.; Aher, R. B. How important is to detect systematic error in predictions and understand statistical applicability domain of QSAR models? Chemom. Intell. Lab. Syst. 2017, 162, 44−54.
(9) Netzeva, T. I.; Worth, A.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D.; Schultz, T.; Stanton, D. W.; van de Sandt, J. J.; Tong, W.; Veith, G.; Yang, C. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA, Altern. Lab. Anim. 2005, 33, 155−173.
(10) Hanser, T.; Barber, C.; Marchaland, J. F.; Werner, S. Applicability domain: towards a more formal definition. SAR QSAR Environ. Res. 2016, 27, 865−881.
(11) Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A stepwise approach for defining the applicability domain of SAR and QSAR models. J. Chem. Inf. Model. 2005, 45, 839−849.
(12) Sushko, I.; Novotarskyi, S.; Körner, R.; Pandey, A. K.; Cherkasov, A.; Li, J.; Gramatica, P.; Hansen, K.; Schroeter, T.; Müller, K. R.; Xi, L.; Liu, H.; Yao, X.; Öberg, T.; Hormozdiari, F.; Dao, P.; Sahinalp, C.; Todeschini, R.; Polishchuk, P.; Artemenko, A.; Kuz’min, V.; Martin, T. M.; Young, D. M.; Fourches, D.; Muratov, E.; Tropsha, A.; Baskin, I.; Horvath, D.; Marcou, G.; Muller, C.; Varnek, A.; Prokopenko, V. V.; Tetko, I. V. Applicability domains for classification problems: benchmarking of distance to models for ames mutagenicity set. J. Chem. Inf. Model. 2010, 50, 2094−2111.
(13) Yun, Y. H.; Wu, D. M.; Li, G. Y.; Zhang, Q. Y.; Yang, X.; Li, Q. F.; Cao, D. S.; Xu, Q. S. A strategy on the definition of applicability domain of model based on population analysis. Chemom. Intell. Lab. Syst. 2017, 170, 77−83.
(14) Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 2015, 145, 22−29.
(15) Liu, R.; Wallqvist, A. Merging applicability domains for in silico assessment of chemical mutagenicity. J. Chem. Inf. Model. 2014, 54, 793−800.
(16) Kar, S.; Roy, K.; Leszczynski, J. Applicability Domain: A Step Toward Confident Predictions and Decidability for QSAR Modeling. In Computational Toxicology; Nicolotti, O., Ed.; Springer: NY; Chapter 6, 141−169.
(17) Sheridan, R. P. Three useful dimensions for domain applicability in QSAR models using random forest. J. Chem. Inf. Model. 2012, 52, 814−823.
(18) Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules 2012, 17, 4791−4810.
(19) Sheridan, R. P. Using random forest to model the domain applicability of another random forest model. J. Chem. Inf. Model. 2013, 53, 2837−2850.
(20) Roy, K.; Ambure, P.; Kar, S. How Precise Are Our Quantitative Structure−Activity Relationship Derived Predictions for New Query Chemicals? ACS Omega 2018, 3, 11392−11406. http://dx.doi.org/10.1021/acsomega.8b01647.
(21) Mansouri, K.; Grulke, C. M.; Judson, R. S.; Williams, A. J. OPERA models for predicting physicochemical properties and environmental fate endpoints. J. Cheminf. 2018, 10 (1), 0263.
(22) Cooper, J. A.; Saracci, R.; Cole, P. Describing the Validity of Carcinogen Screening Tests. Br. J. Cancer 1979, 39, 87−89.
(23) Frank, I. E.; Friedman, J. H. Classification: Oldtimers and Newcomers. J. Chemom. 1989, 3, 463−475.
(24) Luque Ruiz, I.; Gómez-Nieto, M. A. Study of the datasets modelability: modelability, rivality and weighted modelability indexes. J. Chem. Inf. Model. 2018, 58, 1798−1814.
(25) Luque Ruiz, I.; Gómez-Nieto, M. A. Study of the Applicability Domain of the QSAR Classification Models by Means of the Rivality and Modelability Indexes. Molecules 2018, 23, 2756.
(26) Chembench website. Carolina Exploratory Center for Cheminformatics Research (CECCR). Available online: https://chembench.mml.unc.edu/ (accessed on January 2019).
(27) The Chemistry Development Kit (CDK); Available online: https://cdk.github.io/ (accessed on January 2019).
(28) Matlab and Simulink, Version 2018Rb; The MathWorks, Inc.: Natick, MA. Available online: https://www.mathworks.com/products/matlab.html (accessed on 4 January 2019).
(29) Statistics and Machine Learning Toolbox, Version 2018Rb; The MathWorks, Inc.: Natick, MA. Available online: https://www.mathworks.com/products/statistics.html (accessed on January 2019).
(30) Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate Comparison of Classification Performance Measures. Chemom. Intell. Lab. Syst. 2018, 174, 33−44.
(31) Maggiora, G. M. On Outliers and Activity Cliffs−Why QSAR Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535−1535.
(32) Stumpfe, D.; Hu, Y.; Dimova, D.; Bajorath, J. Recent Progress in Understanding Activity Cliffs and Their Utility in Medicinal Chemistry. J. Med. Chem. 2014, 57, 18−28.