Study of Data Set Modelability: Modelability, Rivality, and Weighted


Computational Chemistry

Study of the Datasets Modelability: Modelability, Rivality and Weighted Modelability Indexes Irene Luque Ruiz, and Miguel Ángel Gómez-Nieto J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00188 • Publication Date (Web): 27 Aug 2018 Downloaded from http://pubs.acs.org on September 1, 2018



Journal of Chemical Information and Modeling

Study of the Datasets Modelability: Modelability, Rivality and Weighted Modelability Indexes Irene Luque Ruiz1, Miguel Ángel Gómez-Nieto. University of Córdoba. Department of Computing and Numerical Analysis. Campus de Rabanales. Albert Einstein building. E-14071, Córdoba, Spain. {iluque, mangel}@uco.es

ABSTRACT: Knowing early in the construction of a QSAR prediction model whether a dataset can be modeled is important, because it can reduce the time and effort that researchers spend selecting or rejecting datasets and refining their composition. The modelability index (MODI) is based on counting the first nearest neighbors of the molecules of the dataset and is a standard measurement accepted in the QSAR community. In this paper, we revisit the calculation of the modelability index and propose a more formal formulation that extends the calculation to the first nearest neighbors belonging to each class present in the dataset. This new formulation also allows the calculation of the rivality index, a measurement of the presence of correctly classifiable molecules and activity cliffs. When the rivality index is weighted by the cardinality of the neighborhood of each molecule of the dataset, the resulting weighted modelability index is highly correlated with the Correct Classification Rate (QSAR_CCR) obtained when building QSAR models with different classification algorithms. The weighted modelability index yields correlations with r² higher than 0.9, slopes close to 1 and bias close to zero for different algorithms.

Corresponding author. Email: [email protected], Phone: +34-957-212082

INTRODUCTION

QSAR prediction proposals are devoted to the development of statistical models that predict, qualitatively or quantitatively, the biological activity of drugs. In these investigations, new dataset representations, new statistical algorithms and new experimental procedures are proposed with the goal of generating more robust and widely applicable prediction models.¹⁻³

In the first stage of the experimental procedure, the balance and the similarity/diversity of the dataset are examined, since the applicability domain (AD) of the model plays a critical role in estimating the uncertainty of the prediction for a specific molecule, based on how similar it is to the compounds used to build the model. A prediction is valid only if the compound being predicted falls within the applicability domain of the model; therefore, the composition of the training and test sets has a significant impact on the resulting model. This is because there is a high possibility of including outliers in the training set (which are actually influential observations for the model) and/or of including in the test set compounds quite dissimilar to those of the training set.⁴,⁵

Although the dataset composition can, in some cases, be refined, some problems are inherent to QSAR model building: i) the information on the activity of the molecules is always subject to experimental error, ii) the activity is known, in some cases, for only a small number of molecules, iii) the molecules are not evenly distributed among the different classes (unbalanced datasets), etc. Thus, the internal validation stage demands considerable effort, efficient algorithms with a low time cost, and techniques such as y-randomization, bootstrapping and cross-validation, in order to improve the prediction and build robust models.⁶,⁷

In order to reduce this effort in the early stages of QSAR model building, Golbraikh et al.⁸ proposed a measurement of the viability of a dataset to be modeled. This measurement, named the MODelability Index (MODI), gives an approximate estimate of the prediction accuracy of the QSAR classification model. The index provides valuable information in the first stage of QSAR model building, allowing researchers to detect early problems in the composition and/or the data representation of the dataset.

Although the value of MODI should be interpreted as an approximation to the expected results of the statistical QSAR algorithm, its application to a large number of datasets in recent years has shown a good correlation with the CCR (Correct Classification Rate) obtained when building the classification model. This high correlation results mainly from two facts: MODI is based on the general QSAR principle that similar compounds possess similar properties,⁹ and statistical algorithms generally also rely on similarity or distance measurements to predict the activity of a molecule from the activity of similar or nearby molecules.

Basically, the calculation of MODI is based on the detection of activity cliffs.¹⁰⁻¹⁵ For datasets with continuous activity values, activity cliffs are regarded as very similar compounds with quite different activities, and indexes such as the one proposed by Guha et al.¹⁶ have been used to detect them. For datasets with discrete activity values, the calculation of MODI instead treats as activity cliffs the pairs of molecules that are closer to each other than to any other molecule of the dataset yet belong to different activity classes. Thus, pairs of molecules not detected as activity cliffs⁸,¹⁷ are counted as truly classifiable, and MODI is calculated as the ratio between the number of these molecules and the total number of molecules of the dataset.

In this calculation, Euclidean distances are obtained between all pairs of molecules of the dataset. The first nearest neighbor of each molecule is found; if both molecules belong to the same class, the pair is counted, and if they belong to different classes, the pair is considered an activity cliff.⁸ Therefore, although different concepts can be used to define activity cliffs, this index adopts a structural concept based on the distance similarity between pairs of molecules.¹³,¹⁸

Adilova et al.¹⁹ proposed a modification of the calculation of this index. Instead of using distances to find the first nearest neighbor of each molecule of the dataset, the authors use Voronin similarity measurements. For binary datasets, in which the activity of each molecule is recorded as belonging to class 0 or class 1 (actives and inactives, etc.), the authors calculate the Voronin similarity as follows:

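$$S(x_i, x_j) = \sum_{n=1}^{N} \Lambda(x_{in}, x_{jn}), \qquad \Lambda(x_{in}, x_{jn}) = 1 - \frac{\ln\left(1 + |x_{in} - x_{jn}|\right)}{\ln\left(1 + \Delta_n\right)} \tag{1}$$

where i and j represent two molecules of the dataset, $x_i$ and $x_j$ are their coordinates in the space of the N descriptors (variables) in which the molecules of the dataset are represented, $\Delta_n = \max x_n - \min x_n$, and the parameter values are computed on the training sample as follows: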




$$\sum_{i,j=1}^{M} \left[\Lambda(a_i, a_j) - \Lambda(A_i, A_j)\right]^2 \to \min \tag{2}$$

where $A_i$ and $A_j$ are the activities of molecules i and j, respectively, and $a_i$ and $a_j$ are the values of those activities.

Testing different similarity thresholds, these authors checked whether, for each molecule of the dataset, the most similar molecule that satisfies the threshold belongs to the same class, and calculated the modelability from the result. The authors describe the use of Voronin similarity for the calculation of the modelability index as limited to large datasets, and it involves a more complex calculation with a higher computational cost than the MODI index.

Recently, Golbraikh et al.²⁰ extended their work by proposing a set of new indexes that compute several statistical criteria for predicting the modelability of a dataset before a QSAR model is built. These modelability criteria are based on the k-nearest neighbors approach and measure the dataset diversity, the activity cliffs and the correct classification rate. The indexes are calculated both for datasets with discrete categories or classes and for datasets with continuous activity values, and the authors carried out a wide study of them against the statistical parameters of the QSAR models built. In this paper, we focus on the indexes devoted to the study of dataset modelability in QSAR classification (datasets with discrete rather than continuous activity values). Thus, Golbraikh et al.²⁰ define the dataset diversity index as follows:


$$\mathrm{MODI\_DIV} = \frac{1}{MK} \sum_{i=1}^{M} \sum_{j=1}^{K} \tilde{D}_{ij} \tag{3}$$

where M is the number of molecules of the dataset, K is the number of nearest neighbors considered, and $\tilde{D}_{ij}$ is the normalized distance between molecule i and its nearest neighbor j. As the authors propose, this normalized distance can be calculated using different criteria and different distance metrics or similarity indexes. The activity cliff index is defined in a similar way, based on the activities of the nearest neighbors, as follows:

$$\mathrm{MODI\_ACI} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{K} w_{ij}\,|A_i - A_j|}{\sum_{j=1}^{K} w_{ij}} \tag{4}$$

where M is the number of molecules of the dataset, K is the number of nearest neighbors considered, $w_{ij}$ is the distance-dependent weight between molecule i and its nearest neighbor j, $A_i$ is the activity value of molecule i and $A_j$ is the activity value of the nearest neighbor j of molecule i. The authors also proposed different distance-dependent weighting schemes based on different normalizations of the values of $\tilde{D}_{ij}$: i) dividing by the square root of the number of descriptors, ii) dividing by the maximum value over all $D_{ij}$ and iii) using Max/Min normalization.

Both MODI_DIV and MODI_ACI were tested against the correct classification rate (QSAR_CCR) obtained when building QSAR models for a large number of datasets with binary activity values (classes 0 and 1, actives and inactives, etc.), and low correlations were obtained: QSAR_CCR = −0.92 × MODI_DIV + 0.84 (R² = 0.48) and QSAR_CCR = −0.67 × MODI_ACI + 0.90 (R² = 0.63). Despite that, these indexes could be used to evaluate dataset modelability, as the authors proposed.

In addition, for datasets with a binary response variable, Golbraikh et al.²⁰ define the modelability index (MODI_CCR) as “the correct classification rate (CCR) for leave-one-out cross-validation with 1-nearest neighbor (1-kNN) in the entire descriptor space”. Thus, for binary datasets the value of this index is obtained by counting the molecules i whose first nearest neighbor belongs to the same class as molecule i. The index is formulated as follows:

$$\mathrm{MODI\_CCR} = \frac{1}{C} \sum_{i=1}^{C} \frac{N_{ii}}{N_i} \tag{5}$$

where C is the number of classes (two for binary datasets), $N_{ii}$ is the number of molecules of class i predicted correctly (molecules whose first nearest neighbor belongs to the same class), and $N_i$ is the total number of compounds of class i in the dataset.

The authors describe the calculation of MODI as follows: “For every compound in a dataset, we determine whether its first nearest neighbor, i.e., a compound with the smallest Euclidean distance from a given compound estimated for the entire descriptor space, belongs to the same activity class or not. In the latter case, the pair can be formally designated as an activity cliff. The number of nearest neighbor pairs that are not activity cliffs is counted for each class of compounds and is used to calculate MODI by means of the eq. 5.”⁸

A similar definition is given by Puzyn and Roy¹⁷ in the manual of their software utility for the calculation of this index: “MODI is defined as an activity class-weighted ratio of the number of nearest-neighbor pairs of compounds with the same activity class versus the total number of pairs”.

Both definitions appear to treat activity cliffs as pairs of molecules, so that the index would count the pairs that are not activity cliffs.¹¹ However, the calculation of the index by means of eq. 5 is performed on molecules, that is, on the number of molecules of each class whose first nearest neighbor belongs to the same class. This allows Golbraikh et al.⁸,²⁰ to describe MODI_CCR as an estimate of QSAR_CCR (Correct Classification Rate), or balanced accuracy, calculated as follows:

$$\mathrm{QSAR\_CCR} = \frac{1}{k} \sum_{i=1}^{k} \frac{N_i^{\text{True-Predicted}}}{N_i^{\text{Total}}} \tag{6}$$

where k is the number of classes, $N_i^{\text{True-Predicted}}$ is the number of molecules correctly predicted as belonging to class i by a QSAR model using 5-fold cross-validation, and $N_i^{\text{Total}}$ is the number of molecules that belong to class i.

In the analysis of this index performed by the authors, correlations of QSAR_CCR = 0.59 × MODI_CCR + 0.31 (R² = 0.67) for 42 datasets and QSAR_CCR = 0.89 × MODI_CCR + 0.064 (R² = 0.84) for a total of 102 datasets were obtained.
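As a minimal sketch of how eqs. 5 and 6 relate, the Python fragment below computes a class-averaged correct classification rate and then obtains MODI_CCR as that rate for leave-one-out 1-nearest-neighbor predictions. The function names `balanced_ccr` and `modi_ccr`, the list-of-lists input format and the toy data are illustrative assumptions, not code from the cited works. Ties between equidistant neighbors are resolved here by taking the first one found.

```python
import math


def balanced_ccr(y_true, y_pred):
    """Class-averaged correct classification rate (the form of eqs. 5 and 6)."""
    classes = sorted(set(y_true))
    rates = []
    for c in classes:
        members = [i for i, label in enumerate(y_true) if label == c]
        hits = sum(1 for i in members if y_pred[i] == c)
        rates.append(hits / len(members))
    return sum(rates) / len(rates)


def modi_ccr(X, y):
    """MODI_CCR as the CCR of leave-one-out 1-nearest-neighbor predictions."""
    predicted = []
    for i, xi in enumerate(X):
        # First nearest neighbor of molecule i (Euclidean distance), excluding i itself.
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(xi, X[j]))
        predicted.append(y[nearest])
    return balanced_ccr(y, predicted)


# Toy one-descriptor dataset (made-up values, for illustration only).
X = [[0.0], [0.1], [1.0], [1.1], [0.3]]
y = [0, 0, 1, 1, 1]
print(modi_ccr(X, y))  # class 0 contributes 2/2, class 1 contributes 2/3
```

In this toy example the molecule at 0.3 (class 1) has its first nearest neighbor in class 0, so it is the only one not counted, and the index is (1 + 2/3)/2 = 5/6.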

Finally, Golbraikh et al.²⁰ proposed a more refined modelability index that uses a similarity search procedure based on a k-nearest neighbors approach, without variable selection and carried out in the entire descriptor space. Using a standard kNN algorithm developed by the authors²¹ with LOO (leave-one-out) and 5-fold cross-validation (CV), a refined MODI_ssCCR index is obtained by combining the results generated in each hold-out prediction.


As can be seen, this index has a higher computational cost than MODI_CCR, since it practically requires building a model with LOO or CV to calculate the estimated response variable. The estimate is obtained with a weighting scheme in which the nearest neighbors of a molecule are assigned a higher weight than the farthest ones, using the following expression:

$$\hat{y}_i = \frac{\sum_{j=1}^{k} y_{ij}\, w_{ij}}{\sum_{j=1}^{k} w_{ij}} \tag{7}$$

where $y_{ij}$ is the observed activity value of the nearest neighbor j of molecule i, $\hat{y}_i$ is the predicted activity value of molecule i, and the weights are defined as follows:

$$w_{ij} = \left(1 + \frac{d_{ij}^{2}}{\sum_{j'=1}^{k} d_{ij'}^{2}}\right)^{-1} \tag{8}$$

where $d_{ij}$ are the Euclidean distances between compound i and each of its k nearest neighbors.
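The weighting of eqs. 7 and 8 can be illustrated with a short Python sketch. It assumes that eq. 8 has the form w_ij = (1 + d_ij² / Σ_j' d_ij'²)⁻¹ over the k nearest neighbors of molecule i. The function names and the equal-weight fallback for coincident neighbors are assumptions made for illustration only.

```python
def distance_weights(dists):
    """Weights of the k nearest neighbors of one molecule (assumed form of eq. 8)."""
    total = sum(d * d for d in dists)
    if total == 0.0:
        # All neighbors coincide with the molecule: equal weights (assumption).
        return [1.0] * len(dists)
    return [1.0 / (1.0 + d * d / total) for d in dists]


def weighted_prediction(neighbor_activities, dists):
    """Predicted activity as the weighted mean of neighbor activities (eq. 7)."""
    weights = distance_weights(dists)
    weighted_sum = sum(a * w for a, w in zip(neighbor_activities, weights))
    return weighted_sum / sum(weights)


# Two neighbors with activities 1 and 0; the active one is closer (made-up values).
print(weighted_prediction([1, 0], [0.0, 1.0]))  # weights are 1.0 and 0.5
```

With these weights a neighbor at distance zero receives weight 1 and more distant neighbors receive smaller weights, so the prediction leans toward the activity of the closest neighbors.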

With this index, correlations of QSAR_CCR = 0.68 × MODI_ssCCR + 0.24 (R² = 0.75) for 42 datasets and QSAR_CCR = 0.90 × MODI_ssCCR + 0.055 (R² = 0.87) for 102 datasets were obtained. Thus, in terms of the coefficient of determination, MODI_ssCCR correlated better with QSAR_CCR than MODI_CCR did, although both indexes have proved to be excellent criteria for predicting dataset modelability.²⁰

As the slopes of the correlations obtained for MODI_CCR and MODI_ssCCR show, the values of these indexes are neither equal nor similar to the values of QSAR_CCR, being generally higher. Thus, the authors propose that datasets with MODI_CCR < 0.5 or MODI_ssCCR < 0.6 are not modelable (all QSAR_CCR < 0.7), that datasets with MODI_CCR ≥ 0.8 or MODI_ssCCR ≥ 0.8 are modelable, and that for datasets with 0.5 ≤ MODI_CCR < 0.8 or 0.6 ≤ MODI_ssCCR < 0.8 it is impossible to say whether they are modelable or not, so QSAR modeling is required to find out.

Although these indexes, and others that have been proposed,²² have also been formulated for predicting the modelability of datasets with continuous activity values, in this manuscript we focus on analyzing the calculation described by Golbraikh et al.⁸,²⁰ for datasets with discrete activity values. We also present a new formulation of the modelability index that takes into account the rivality, or noise, that the nearest neighbors of a given molecule i introduce into the correct prediction of the activity of that molecule. This new formulation preserves the original MODI_CCR concept,⁸,²⁰ adds no extra computational cost and fully reproduces the MODI_CCR values when only the first nearest neighbor is considered. In addition, it allows the calculation of a new index, named the rivality index (RI), which reports the capability of each molecule of the dataset to be classified correctly. Thanks to the rivality index, the characteristics of each molecule, as well as of the whole dataset, can be analyzed in the first stages of QSAR work. We also propose a new modelability index based on weighting the distance between molecules. This weighted index (WMODI*) refines the modelability values and greatly improves the accuracy with which the modelability of the dataset is predicted.

The manuscript is organized as follows. After this Introduction, the next section describes the datasets and algorithms used in the calculations and analyzes the calculation of MODI_CCR; it also describes the rivality index (RI) and introduces a new formulation for the calculation of dataset modelability (MODI*). The following section presents the results obtained with the new formulations of modelability for the fifty-five datasets and describes how the modelability index is improved by considering weighted distance measurements. The experiments performed with the proposed weighted modelability index (WMODI*) are described, showing the excellent correlations obtained with the experimental Correct Classification Rate (QSAR_CCR) compared with the original index. Finally, conclusions supporting the approach proposed in this work are presented.

MATERIALS AND METHODS

Datasets description

Fifty-five datasets were gathered from the Chembench website²³ for the analysis and validation of the proposal presented in this paper (see Fig. S1 and S2, and Table S1 in Supporting Information). These datasets are highly diverse: they contain between 114 and 818 molecules distributed in two classes (classes 0 and 1), with a class distribution varying between 37% and 76% for class 0 and between 24% and 62% for class 1, and a reference modelability index²³ ranging from 0.65 to 0.94 (see Table S1 in Supporting Information). This diversity allows us to test the applicability of our study to different chemical spaces. For all datasets, information about the molecular composition, activity class and reference modelability was available.²³

The datasets are represented in the calculations by their descriptor matrices, also gathered from the Chembench website.²³ The descriptor matrix selected was that of CDK²⁴ descriptors, which includes 149 variables describing 1D and 2D molecular descriptors.²³ In addition, SDF files for all datasets were gathered and MACCS fingerprint matrices were built.

Experimental method

The descriptor matrix of each dataset was analyzed, and the columns (variables) containing any NaN/Inf value were removed. Columns equal to zero in all rows were also deleted. The resulting matrices were normalized (range-scaled) by columns: the maximum and minimum value of each column were calculated and the Max/Min criterion was used to rescale the values of the matrix to the range [0, 1]. These normalized matrices were used in the calculation of the classification models and of the indexes described in this paper.

For the calculation of the nearest neighbors, the Euclidean distance was used, and a distance matrix was generated for each dataset. These are symmetric matrices of size M × M, where M is the number of molecules of the dataset and element (i, j) stores the Euclidean distance between molecules i and j, calculated from the normalized descriptor matrix. The diagonal of each matrix was set to Inf, and each row was sorted in ascending order, so that for each row (molecule) the neighbors are ordered from the nearest (lowest distance) to the furthest (greatest distance).

Matlab 2017b²⁵ was used for the calculations performed in this work. In addition, using the Statistics and Machine Learning Toolbox,²⁶ Complex Tree, Support Vector Machine (SVM) with linear, Gaussian and polynomial (cubic) kernels, k-Nearest Neighbors (kNN), ensemble subspace discriminant kNN and ensemble bagged decision trees (Random Forest) algorithms were applied to build the classification models and calculate the Correct Classification Rate (QSAR_CCR). So that the experiments can be easily reproduced by other researchers, the algorithms were run with the default parameter values established by Matlab 2017b;²⁵,²⁶ no changes were introduced to improve the results of the classification models for the fifty-five datasets.

The classification models were built using 5-fold cross-validation, and the process was repeated five times for each dataset and each algorithm. In this cross-validation the dataset is partitioned randomly, so that the training and test subsets of each fold contain about 80% and 20% of the molecules of the dataset, respectively. For each dataset and each algorithm we thus built five different models, and for each of them the predictions obtained for each molecule and the values of QSAR_CCR were stored. Consequently, for each dataset and each algorithm we created 5 × 5 = 25 different training and test subsets, each with a random and different composition of molecules. As a result, we obtain detailed information about the behavior of each algorithm with each dataset when building the classification model, and we can use grouped values of QSAR_CCR for comparison with the modelability indexes. We also obtain robust information about the behavior of the molecules, because some molecules behave differently depending on the composition of the training and test sets (25 different compositions).

MACCS fingerprint matrices of each dataset were generated with the PaDEL-Descriptor software²⁷ from the SDF files of the fifty-five datasets gathered from the Chembench website,²³ and they were also used as input data for the calculation of the modelability index and the results described in this manuscript.
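The pre-processing described in this section (column filtering, range scaling and sorted neighbor lists) can be sketched in a few lines of plain Python. This is not the authors' Matlab code. The helper names `clean_and_scale` and `sorted_neighbors` are hypothetical, constant columns are mapped to 0.0 because the text does not say how they were handled, and each molecule is skipped in its own neighbor list instead of setting the diagonal to Inf, which gives the same neighbor ordering.

```python
import math


def clean_and_scale(matrix):
    """Drop columns with NaN/Inf or all zeros, then min-max scale columns to [0, 1]."""
    kept = []
    for c in range(len(matrix[0])):
        column = [row[c] for row in matrix]
        if any(math.isnan(v) or math.isinf(v) for v in column):
            continue  # column contains NaN/Inf
        if all(v == 0 for v in column):
            continue  # column is zero for all rows
        kept.append(c)
    scaled = [[0.0] * len(kept) for _ in matrix]
    for k, c in enumerate(kept):
        column = [row[c] for row in matrix]
        low, high = min(column), max(column)
        span = high - low
        for r, v in enumerate(column):
            # Constant columns become 0.0 (assumption, not stated in the text).
            scaled[r][k] = (v - low) / span if span else 0.0
    return scaled


def sorted_neighbors(X):
    """For each molecule, (distance, index) of every other molecule, nearest first."""
    result = []
    for i, xi in enumerate(X):
        row = [(math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i]
        row.sort()
        result.append(row)
    return result


# Made-up descriptor matrix: column 1 is all zeros, column 2 contains NaN.
raw = [[1.0, 0.0, float("nan"), 2.0],
       [3.0, 0.0, 1.0, 4.0],
       [5.0, 0.0, 2.0, 8.0]]
X = clean_and_scale(raw)
neighbors = sorted_neighbors(X)
print(X)
print([row[0][1] for row in neighbors])  # index of each molecule's first nearest neighbor
```

Sorting the neighbor lists once means that the indexes discussed in the text only need to scan each row from the front to find the first neighbor of a given class.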


Rivality index

The rivality index (RI) measures the capability of a statistical algorithm to correctly predict the activity of a molecule. For any molecule i of a given dataset, the rivality index is defined as follows:

$$RI_i = \frac{d_i^{=} - d_i^{\neq}}{d_i^{=} + d_i^{\neq}} \tag{9}$$

where $d_i^{=}$ is the distance between molecule i and its nearest neighbor belonging to the same class as i, and $d_i^{\neq}$ is the distance between molecule i and its nearest neighbor belonging to any class different from that of molecule i.

RI is a normalized index that takes values between −1 and 1. Values less than zero imply that the first nearest neighbor of molecule i belongs to the same class as i, and values greater than zero mean that the first nearest neighbor of molecule i belongs to a different class.

The implementation of eq. 9 is quite simple. The Euclidean distances between all pairs of molecules of the dataset are obtained, and for each molecule the distances to the remaining molecules are sorted (this is done in the pre-processing stage described above). It is then only necessary to find, for each molecule, the first nearest neighbor of the same class and the first nearest neighbor of a different class.
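A minimal Python sketch of eq. 9 follows, assuming the molecules are given as rows of an already normalized descriptor matrix and that every class has at least two members. The function name `rivality_index` is an illustrative choice. The value −10⁻⁶ used when both distances are zero follows the convention the manuscript gives when discussing eq. 10.

```python
import math


def rivality_index(X, y):
    """RI_i for every molecule: (d_same - d_diff) / (d_same + d_diff), as in eq. 9."""
    values = []
    for i, xi in enumerate(X):
        same = [math.dist(xi, xj) for j, xj in enumerate(X)
                if j != i and y[j] == y[i]]
        diff = [math.dist(xi, xj) for j, xj in enumerate(X) if y[j] != y[i]]
        if not same or not diff:
            raise ValueError("each molecule needs a neighbor in its own class "
                             "and in a different class")
        d_same, d_diff = min(same), min(diff)
        if d_same + d_diff == 0.0:
            values.append(-1e-6)  # 0/0 case, set to -10^-6 as in the manuscript
        else:
            values.append((d_same - d_diff) / (d_same + d_diff))
    return values


# Toy one-descriptor dataset (made-up values).
X = [[0.0], [0.1], [1.0], [1.1], [0.3]]
y = [0, 0, 1, 1, 1]
for label, value in zip(y, rivality_index(X, y)):
    print(label, round(value, 3))
```

Only the last molecule (class 1, at 0.3) gets a positive RI here, because its closest neighbor (at 0.1) belongs to class 0.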


Reformulating the Modelability index

The MODelability Index⁸,²⁰ is a quantitative measurement for quickly assessing whether a predictive QSAR model can be obtained for a chemical dataset. MODI_CCR takes values between 0 and 1 and measures the number of molecules of a dataset whose first nearest neighbor belongs to the same class. Consequently, the modelability index (MODI_CCR) can be reformulated in a more general form as follows:

$$\mathrm{MODI}^{*} = \frac{1}{k} \sum_{c=1}^{k} \left[ \frac{1}{M_c} \sum_{i=1}^{M_c} \left(1,\ \forall\, RI_i \le 0\right) \right] \tag{10}$$

where k is the number of classes present in the dataset, $RI_i$ is the value of the rivality index of molecule i and $M_c$ is the number of molecules of class c in the dataset. Analyzing eq. 10 (and eq. 9), it can be seen that the expression in brackets gives the sensitivity/specificity, that is, the modelability of each class for binary datasets.

In addition, the condition $RI_i = (d_i^{=} - d_i^{\neq})/(d_i^{=} + d_i^{\neq}) \le 0$, written with the same-class and different-class nearest-neighbor distances of eq. 9, allows us to include those cases in which several molecules of different classes are at the same distance from a given molecule, that is, cases in which more than one first nearest neighbor could be taken into account. Thus, if a molecule has neighbors of the same class and of a different class at the same distance, it is considered correctly classifiable. In those cases in which $d_i^{=} = d_i^{\neq} = 0$, the result is $(d_i^{=} - d_i^{\neq})/(d_i^{=} + d_i^{\neq}) = 0/0 = \mathrm{NaN}$, and the value of this term is set to $-10^{-6}$ in order to distinguish it from the cases in which $d_i^{=} = d_i^{\neq} \neq 0$ and the term takes the value 0.
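Eq. 10 can be sketched in Python as a per-class count of the molecules with RI ≤ 0, averaged over the classes. The function name `modi_star` and the input format (one RI value and one class label per molecule) are assumptions for illustration; the RI values are taken as already computed.

```python
def modi_star(ri_values, labels):
    """MODI* (eq. 10): mean over classes of the fraction of molecules with RI <= 0."""
    classes = sorted(set(labels))
    fractions = []
    for c in classes:
        members = [ri for ri, label in zip(ri_values, labels) if label == c]
        # A molecule counts as correctly classifiable when RI <= 0;
        # ties (RI = 0) and the -1e-6 special case are therefore included.
        fractions.append(sum(1 for ri in members if ri <= 0) / len(members))
    return sum(fractions) / len(fractions)


# Made-up RI values for a two-class dataset.
ri_values = [-0.5, -0.33, -0.8, -0.82, 0.56]
labels = [0, 0, 1, 1, 1]
print(modi_star(ri_values, labels))  # class 0: 2/2, class 1: 2/3
```

Because the test is RI ≤ 0 rather than RI < 0, a molecule whose nearest same-class and different-class neighbors are equidistant is counted as classifiable, which is the behavior the text describes.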


Molecules with an RIi value close to −1 are those that a QSAR algorithm will classify correctly, and vice versa. The ratio between the sum of the RI values of each class and the number of molecules of that class yields a global RI, which is a global measurement of the modelability of the class. Likewise, the sum of the global RI of all the classes of the dataset produces an absolute measurement of the dataset modelability.

RI combines several concepts in its calculation: the distances from a given molecule i to its nearest neighbors of the same class and of a different class and, therefore, the activity values of these nearest neighbors. Thus, RI can be regarded as an integrated formulation of MODI_DIV and MODI_ACI,²⁰ or as an index similar to the weighted MODI_ACI with distance-dependent weights.²⁰ Global RI values for the dataset close to −1 imply that the dataset is highly modelable. When the RI values for the dataset tend to zero, the modelability of the dataset decreases, and datasets with RI values greater than zero are barely modelable.
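The global RI values described above can be sketched as follows, taking the text literally: the class value is the sum of the RI values of the class divided by its number of molecules, and the dataset value is the sum of the class values. The function name `global_rivality` and the returned structure are assumptions for illustration; whether the dataset value is further normalized by the number of classes is not stated in this passage, so no normalization is applied here.

```python
def global_rivality(ri_values, labels):
    """Per-class global RI (class mean of RI) and their sum over all classes."""
    per_class = {}
    for c in sorted(set(labels)):
        members = [ri for ri, label in zip(ri_values, labels) if label == c]
        per_class[c] = sum(members) / len(members)
    # Literal reading of the text: sum of the class values, without normalization.
    dataset_value = sum(per_class.values())
    return per_class, dataset_value


# Made-up RI values for a two-class dataset.
per_class, dataset_value = global_rivality([-1.0, -0.5, 0.5, 0.1], [0, 0, 1, 1])
print(per_class, dataset_value)
```

In this made-up example class 0 has a clearly negative global RI while class 1 has a positive one, which would point to class 1 as the harder class to model.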

RESULTS AND DISCUSSION

Analysis of the behavior of the modelability index

First, we tested how the modelability values given as reference on the Chembench website23 and the MODI_CCR values calculated using eq. 5 correlate with the average QSAR_CCR values obtained in the five repetitions (models) built using 5-fold cross-validation for the fifty-five datasets. The modelability index values reported on the Chembench website23 showed an acceptable correlation with the MODI_CCR values obtained: $\mathrm{MODI}_{Reference} = 0.95 \times \mathrm{MODI\_CCR} + 0.05$ ($R^2 = 0.71$).


[Figure 1 panels: Mean(QSAR_CCR) (y axis) vs MODI (Reference) (x axis), both on a 0.00 to 1.00 scale. kNN, k=1: y = 0.9498x + 0.0368, R² = 0.7613. kNN, k=2: y = 1.0048x - 0.0237, R² = 0.8328. kNN, k=10: y = 0.9415x + 0.0224, R² = 0.7484. SVM, linear: y = 0.7875x + 0.1781, R² = 0.6647.]

Figure 1. Relationship between the values of the modelability index of reference and the mean QSAR_CCR of the five repetitions (models)

As Fig. 1 shows, the best correlation between the reference modelability23 and the mean QSAR_CCR is obtained when the kNN algorithm with k=2 is used, resulting in r²=0.83 with slope and bias close to ideal. This suggests that considering two neighbors, rather than one, is decisive for correctly classifying a molecule. When kNN with k=10 is used, the correlation is similar to the one obtained with k=1. As observed in Fig. 1, the worst results are, unexpectedly, obtained when the SVM algorithm with a linear kernel is used, with r²=0.66, slope=0.79 and bias=0.18. These values are in agreement with the ones reported by the authors of this index8,20. Thus, we observe that kNN with k=2 generates results very similar to those reported23 using the RF algorithm and, therefore, measuring the two nearest neighbors of each molecule of the dataset would be enough to obtain a valuable modelability measurement. Since the modelability index (MODI_CCR) is based on counting whether the first nearest neighbor of a molecule belongs to the same class as that molecule (eq. 5), we could expect the kNN algorithm with k=1 to generate QSAR_CCR values similar to the MODI_CCR values.
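The counting idea behind MODI_CCR can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes MODI_CCR to be the average over classes of the fraction of molecules whose first nearest neighbor (Euclidean distance) has the same class, and it breaks distance ties by index order rather than by the tie rule described in the text. The function name `modi_ccr` is hypothetical.

```python
import math

def modi_ccr(X, y):
    """Average over classes of the fraction of molecules whose first
    nearest neighbor belongs to the same class (assumed form of eq. 5)."""
    y = list(y)
    hits = {c: 0 for c in set(y)}
    for i, xi in enumerate(X):
        # index of the first nearest neighbor of molecule i
        nn = min((j for j in range(len(X)) if j != i),
                 key=lambda j: math.dist(xi, X[j]))
        if y[nn] == y[i]:
            hits[y[i]] += 1
    return sum(hits[c] / y.count(c) for c in hits) / len(hits)

# Toy one-dimensional example (made-up values, not data from this study)
X = [[0.0], [1.0], [10.0], [11.0], [1.5]]
y = [0, 0, 1, 1, 1]
value = modi_ccr(X, y)
```

Here one of the two class 0 molecules and two of the three class 1 molecules have a same-class first neighbor, so the sketch returns (1/2 + 2/3)/2.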

[Figure 2 panels: Mean(QSAR_CCR) (y axis) vs MODI_CCR (x axis), both on a 0.00 to 1.00 scale. kNN, k=1: y = 0.9474x + 0.031, R² = 0.9519. kNN, k=2: y = 0.9351x + 0.0225, R² = 0.9064. kNN, k=10: y = 0.8668x + 0.0731, R² = 0.7973. SVM, linear: y = 0.7672x + 0.1876, R² = 0.7928.]

Figure 2. Relationship between the mean of QSAR_CCR of the five models and the values of MODI_CCR index for the fifty-five datasets considered.

Fig. 2 shows that, as expected, the best correlation between QSAR_CCR and MODI_CCR is obtained when kNN with k=1 is used (r²=0.95); however, a good correlation is also obtained for k=2. When the number of neighbors considered increases (k=10), the correlation diminishes to values similar to those obtained when SVM with a linear kernel is used.


We observe that MODI_CCR is an excellent index for predicting the modelability of datasets when algorithms based on a low number of nearest neighbors are used, but not when the classification algorithm builds the classification model on any other criterion (e.g., SVM).
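The slope, bias and r² values quoted for Figs. 1 and 2 come from ordinary least-squares fits of mean QSAR_CCR against a modelability index. A minimal Python sketch of such a fit is given below; the helper name `linear_fit` and the four data points are made up for illustration and are not values from this study.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept and its r^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2

# Hypothetical index values and mean QSAR_CCR values for four datasets
index_values = [0.5, 0.6, 0.7, 0.8]
mean_ccr = [0.52, 0.61, 0.70, 0.79]
slope, intercept, r2 = linear_fit(index_values, mean_ccr)
```

A slope close to 1 and a bias close to 0 would indicate that the index reproduces the observed QSAR_CCR directly, which is the sense in which the k=2 fit in Fig. 1 is described as close to ideal.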

[Figure 3 panel: Redundancy generated with kNN (k=1) and SVM (linear). Series: Class 0 for kNN, Class 1 for kNN, Class 0 for SVM, Class 1 for SVM. x axis: Datasets (1 to 55); y axis: Molecules redundancy (0 to 100).]

Figure 3. Mean values of QSAR_CCR for the 5-fold cross-validation using kNN and SVM (left), and molecule redundancy for the five repetitions of 5-fold cross-validation of the classification process using kNN and SVM (right)

As can be observed in Fig. 3 (left) (see Table S4 in the Supplementary Information), kNN (k=1) and SVM with a linear kernel behave similarly overall across the datasets considered. For some datasets SVM classifies better (greater values of QSAR_CCR), whereas for others the best behavior is shown by 1-kNN. However, the SVM algorithm is more selective when assigning a molecule of the dataset to a class. Fig. 3 (right) shows, for the fifty-five datasets, the number of molecules that are assigned to different classes (classified both correctly and incorrectly) across the five models generated using 5-fold cross-validation, that is, the number of molecules that are correctly classified in at least one of the five models but incorrectly classified in at least one of the others (we call this number of molecules the redundancy).


This number of molecules (redundancy) allows us to appreciate the different reproducibility and accuracy of the 1-kNN and SVM algorithms for the datasets used in this study. As observed, 1-kNN produces higher redundancy than SVM, that is, a greater number of molecules are assigned to one class in one model and to another class in some of the other models generated. This fact could be considered a clear indicator that, although the MODI_CCR value can be a good index for a relative prediction of the dataset modelability, this index does not indicate which molecules will be correctly or incorrectly classified.
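The redundancy count described above can be sketched as follows. This is an illustrative Python fragment rather than the code used in this study; the function name `redundant_molecules` and the toy predictions are hypothetical, and each model is assumed to provide one cross-validated prediction per molecule.

```python
def redundant_molecules(y_true, predictions):
    """Flag molecules classified correctly in at least one model and
    incorrectly in at least one other.

    predictions: one list of predicted labels per model (repetition).
    """
    flags = []
    for i, truth in enumerate(y_true):
        n_correct = sum(1 for model in predictions if model[i] == truth)
        flags.append(0 < n_correct < len(predictions))
    return flags

# Toy example with four molecules and five models (made-up predictions)
y_true = [0, 0, 1, 1]
predictions = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
flags = redundant_molecules(y_true, predictions)
redundancy = sum(flags)
```

The first molecule is always classified correctly and the third always incorrectly, so neither counts as redundant; the second and fourth change class between models and give a redundancy of 2.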

Advantages of an absolute measurement of the modelability index

Figure 4 (top-left) shows the perfect correlation between MODI_CCR and MODI*, so it can be inferred that the behavior explained above for MODI_CCR also applies to MODI*. However, MODI* also allows us to calculate the rivality index. The rivality values obtained for the datasets are shown in Fig. 4 (top-right), where we can observe an acceptable correlation between MODI* and the rivality index. The lower rivality values are obtained for datasets with higher modelability values, and vice versa. For instance, for the DS21 dataset a positive rivality value is obtained (0.026), and this dataset is poorly modelable, with MODI*=0.457 (see Fig. 4, bottom). Figure 4 (center) shows, for the fifty-five datasets, the error rate (percentage of molecules predicted as incorrectly classifiable by MODI*) for each of the two classes and their average (the two classes of molecules in the datasets are named class 0 and class 1, in the same way as they are identified in their corresponding dataset23).


[Figure 4 panels: MODI* vs MODI_CCR (top-left): y = x, R² = 1. MODI* vs RI (top-right), x axis: Rivality values: y = -0.4292x + 0.6119, R² = 0.7959. Error rate per dataset for Class 0, Class 1 and Average (center). Normalized Rivality Index Values per dataset for Class 0, Class 1 and All molecules (bottom).]

Figure 4. Correlation between MODI* and MODI_CCR (top-left), correlation between MODI* and RI (top-right), error rate (percentage of molecules detected as incorrectly classifiable) in the MODI* calculation (center), normalized rivality values for the two existing classes of the fifty-five datasets (bottom)


As can be observed, in most datasets the two classes behave clearly differently, with the molecules of one of the two classes being responsible for the decrease in the modelability of the dataset. The molecules with an RI value greater than zero are considered incorrectly classifiable and can be gathered during the calculation of MODI*. For instance, for the DS05 dataset (see Fig. 4, center), which has 516 molecules belonging to class 0 and 163 belonging to class 1, only 10% of the molecules of class 0 are detected as incorrectly classifiable, whereas this percentage is 40% for the molecules of class 1, that is, practically the same number of molecules in each class. However, this count of molecules detected as incorrectly classifiable only provides a global measurement of the high or low capacity of the dataset to be modeled.

Figure 4 (bottom) shows the normalized total value of the rivality index for each class of molecules, and for the whole dataset, for the studied datasets. The normalized values are obtained as the ratio between the sum of the rivality index of the molecules and the number of molecules. We can observe that there is no clear relationship between the percentage of molecules detected as incorrectly classifiable and the rivality values, although the rivality values tend to increase as this percentage increases. Thus, for the DS05 dataset, the larger number of molecules in class 0 (516 molecules) generates a lower global value of the rivality index (-0.24) than that of class 1 (-0.07), with 163 molecules, although both classes have practically the same number of molecules considered incorrectly classifiable by MODI* (516 × 0.1 ≈ 52 and 163 × 0.4 ≈ 65, for classes 0 and 1, respectively). This behavior can be observed in detail in Fig. 5.
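The per-class error rate and the DS05 arithmetic above can be reproduced with a short Python sketch. It assumes, as an illustration only, that a molecule counts as incorrectly classifiable when its RI is greater than zero and that MODI* is the average over classes of the fraction of molecules with RI less than or equal to zero; the function names and the toy RI values are hypothetical.

```python
def error_rate_by_class(ri, y):
    """Fraction of molecules of each class with RI > 0."""
    y = list(y)
    rates = {}
    for c in set(y):
        values = [r for r, yc in zip(ri, y) if yc == c]
        rates[c] = sum(1 for r in values if r > 0) / len(values)
    return rates

def modi_star(ri, y):
    """Average over classes of the fraction of molecules with RI <= 0
    (assumed form of the index, for illustration)."""
    rates = error_rate_by_class(ri, y)
    return sum(1 - rate for rate in rates.values()) / len(rates)

# Toy RI values (made up, not data from this study)
ri = [-0.2, 0.33, -0.8, -0.82, 0.89]
y = [0, 0, 1, 1, 1]
rates = error_rate_by_class(ri, y)
index = modi_star(ri, y)

# DS05 counts quoted in the text: class sizes times error rates
ds05_class0 = round(516 * 0.10)
ds05_class1 = round(163 * 0.40)
```

The last two lines recover the 52 and 65 molecules mentioned for DS05, which shows how very different error rates can correspond to nearly the same number of molecules when the classes are unbalanced.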


[Figure 5 panels: Normalized Rivality Values > 0 per dataset for Class 0, Class 1 and Average (top). Normalized Rivality Values ≤ 0 per dataset for Class 0, Class 1 and Average (bottom).]

Figure 5. Normalized values of the rivality index for the two classes existing in the fifty-five datasets: values greater than zero (top), values less than or equal to zero (bottom)

Figure 5 (top) shows the normalized values of rivality index greater than zero for each class of the studied datasets, obtained by means of the summation of the values of the rivality index greater than zero in each class and dividing it by the number of molecules of the dataset belonging to each class. For instance, as described above, for the dataset DS05 the number of molecules considered incorrectly classifiable by MODI* is very similar for each class (see Fig. 4-center), although the number of molecules of each class is very different.


However, in Fig. 5 (top) we can see that the normalized values of the rivality index greater than zero are practically equal for the two classes despite their different numbers of molecules (0.12 and 0.11, respectively). Therefore, although both classes have virtually the same number of molecules considered incorrectly classifiable (52 and 65, respectively), a smaller number of molecules of class 0 have greater positive values of the rivality index than the molecules of class 1. Therefore, those molecules could clearly be considered activity cliffs.

On the other hand, Fig. 5 (bottom) shows that in the DS05 dataset the normalized value of the rivality index less than or equal to zero is lower for the molecules of class 0 than for those of class 1 (-0.28 and -0.19, respectively). Thus, although a low number of molecules of class 1 are detected as correctly classifiable (163-65=98), these molecules might not be correctly classified by the QSAR algorithms because they have negative rivality values of small magnitude, close to zero. In addition, there is a high number of molecules (516-52=464) belonging to class 0 that generate a low (negative) value of the normalized rivality index, meaning that those molecules have strongly negative values of RI and, therefore, many of them could clearly be correctly classified by the QSAR algorithm. Thus, those molecules belonging to class 0 are responsible for the high modelability value (0.754) of the DS05 dataset.

Even more interesting than the rivality values for the whole datasets is the study of the rivality values of each molecule of the dataset. While a rivality value greater than zero could lead to a molecule being considered incorrectly classifiable, a value lower than zero could lead to it being considered correctly classifiable. Positive values close to zero could correspond to molecules that are properly classifiable by some of the many excellent QSAR algorithms.


[Figure 6 panels: DS12 dataset using kNN (k=1) (top-left) and DS12 dataset using SVM (top-right), x axis: Molecules (0 to 110), with series True Class 0, False Class 1, True Class 1 and False Class 0 for the algorithm and for MODI*, plus Redundancy (Class 0) and Redundancy (Class 1). Rivality values for DS12 dataset (bottom-left and bottom-right), x axis: Molecules (0 to 110), y axis: Rivality values (-0.90 to 0.30), with series Rivality for Class 0 in DS12 dataset, Rivality for Class 1 in DS12 dataset, and kNN or SVM Redundancy (Class 0) and (Class 1).]

Figure 6. Study of MODI* and rivality index for DS12 dataset using 1-kNN (left) and SVM with linear kernel (right) algorithms.

This fact has been tested in the calculations performed. After the five repetitions (models) were carried out with each algorithm, we observed that some molecules were correctly classified in one model and incorrectly in another (see Fig. 3, right). Figure 6 shows detailed information on this behavior using the DS12 dataset as an example (this dataset has been selected only because of its low number of molecules, which makes the graphic representation clearer). In Fig. 6 (top), the molecules of each class that are correctly (blue and green) and incorrectly (red and violet) classified by the kNN (k=1) and SVM algorithms are represented with filled symbols.


In Fig. 6 (top), the legend “True Class” describes the molecules that have always been classified in their correct class in the five repetitions (models) using 5-fold cross-validation, and the legend “False Class” describes the molecules that have always been incorrectly classified in the five models. The molecules correctly classified in some of the five models and incorrectly classified in others (described above as redundancy) are represented by empty circles (yellow and grey). In addition, the molecules considered correctly and incorrectly classifiable in the calculation of MODI* are also represented using empty symbols. Molecules correctly assigned to their class are represented in blue and green, and those incorrectly assigned are represented in red and violet.

As observed in Fig. 6 (top), kNN produces, after the five models, a higher redundancy (22 molecules) than SVM (see empty circles), as is also shown in Fig. 3 (right). In the case of the SVM algorithm, only 8 molecules of class 0 (6, 8, 15, 21, 24, 26, 38, 40) and 10 molecules of class 1 (58, 59, 61, 65, 67, 71, 80, 85, 105, 113) generated redundancy. Observing the filled and empty symbols, it can also be appreciated that the accuracy of MODI* in detecting correctly and incorrectly classifiable molecules is practically the same for both algorithms, with a low number of errors in both cases. MODI* detects all the molecules that are correctly classified by the algorithms in the five models. In addition, MODI* detects as incorrectly classifiable (rivality index values greater than zero) all the molecules that are always incorrectly classified by the 1-kNN algorithm in the five models (see violet triangles), except molecules 80 and 89, both belonging to class 1.


In the case of those molecules that 1-kNN classified correctly in some models but incorrectly in others (redundancy), we can observe two behaviors of MODI*. Some molecules, for instance molecules 6 and 8 (see empty circles), are considered by MODI* as correctly classifiable, and others (for instance, molecule 67) are considered incorrectly classifiable. MODI* behaves in this way because these molecules have a rivality index value close to zero (negative and positive, respectively). We denote these types of molecules as “activity borders”, that is, molecules with a border rivality value because the difference between the distances from the molecule to its nearest neighbors of the two classes is close to zero. These results of MODI* with 1-kNN are close to the ones observed for SVM, as can be observed in Fig. 6 (top).

Figure 6 (bottom) represents the same results for both algorithms together with the rivality index values obtained for the molecules of the DS12 dataset. Filled diamonds (blue and red) represent the rivality index values of the molecules of each class, and empty circles (yellow and violet) represent the molecules assigned by the algorithms to both classes across the five models (redundancy). The normalized standard deviation (nStdev) of the rivality index values for the whole dataset has also been calculated, giving a value of 0.1261. Figure 6 (bottom) shows the threshold of this value in its positive and negative range. With clear exceptions, which are explained below, the molecules with rivality indexes greater than nStdev are always incorrectly classified by the algorithms, so we can easily detect the activity cliffs of a dataset. Likewise, molecules with a negative rivality value considerably below the negative nStdev threshold are always correctly classified by the algorithms, except for molecules 59 and 65 (for kNN) and 59, 65 and 105 (for SVM). The algorithms classify these molecules correctly in some of the five models and incorrectly in others, because the molecules contained in the training and test sets change in each fold of each model (5 folds × 5 models = 25 different folds). Thus, if in a fold some nearest neighbors of the same class are not included, and the next nearest neighbor is a molecule of a different class, the molecules are erroneously classified in that model. This behavior can also be observed, but with rivality values higher than zero, for molecule 67 (for kNN) and molecules 40 and 67 (for SVM).

It is interesting to observe, as commented above, that the molecules classified correctly in some models and incorrectly in others (redundancy) are molecules with rivality values within the threshold defined by nStdev. These molecules, named “activity borders”, may be correctly or incorrectly classified depending on the algorithm used and the composition of the training and test sets in the cross-validation stage.

Figure 7 shows the same study carried out with the DS55 dataset (a dataset with a low number of molecules, but with a higher modelability than the DS12 dataset) using the SVM algorithm with a linear kernel. For this dataset, the reference modelability has a value of 0.91, the average QSAR_CCR of the five models was 0.903 and the MODI* value was 0.911. We can appreciate in Fig. 7 that, with SVM, two molecules (15 and 69) were detected as false class 0 in all five models built (see filled violet diamonds), and eight molecules (7, 43, 50, 118, 119, 120, 121 and 122) were detected as false class 1 in all five models (see filled red diamonds). MODI* detects most of these molecules as incorrectly classifiable, with some exceptions (see empty violet and red diamonds). For instance, MODI* considers molecule 7 as correctly classifiable because this molecule has a negative value of the rivality index. However, the value of the rivality index of molecule 7 is close to zero (see Fig. 7, right), so this molecule could be considered an activity border.

[Figure 7 panels: Redundancy of Molecules in DS55 dataset using SVM (left), x axis: Molecules (0 to 120), with series Class 0 (True Class 0), Class 0 (False Class 1), Class 1 (True Class 1), Class 1 (False Class 0), the corresponding four MODI* series, Redundancy (Class 0) and Redundancy (Class 1). Rivality values for DS55 dataset (right), x axis: Molecules (0 to 120), y axis: Rivality values (-0.90 to 0.30), with series Rivality for Class 0 in DS55 dataset, Rivality for Class 1 in DS55 dataset, SVM Redundancy (Class 0) and SVM Redundancy (Class 1).]

Figure 7. Study of MODI* and rivality index for DS55 dataset using SVM algorithm

This fact can be easily verified by observing molecule 1. This molecule is considered by MODI* as incorrectly classifiable because its rivality index value is greater than zero. However, the SVM algorithm correctly classifies this molecule in all five built models. Again, as the rivality index of this molecule is very close to zero (0.0272), as observed in Fig. 7 (right), it can also be considered an activity border. Other algorithms used in our study (see Supporting Information), such as kNN, generate the same results as MODI*, classifying molecule 7 correctly and molecule 1 incorrectly in all five built models. Therefore, whether activity border molecules are correctly or incorrectly classified depends on the algorithm and the composition of the training set. Figure 7 (right) shows the behavior of the rivality index for this dataset, which has an nStdev of 0.1424. As observed, molecules 68 and 70 have a rivality value close to zero; these molecules can also be considered activity borders and, therefore, they are classified correctly in some of the five models built for this dataset and incorrectly in others, being redundant molecules, as observed in Fig. 7 (left).
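The use of nStdev as a threshold can be summarized with a small Python sketch. It is only an illustration: the threshold is passed in as a number (for example, the 0.1424 reported for DS55) because the way nStdev is normalized is not restated here, and the function name and the label "reliably classifiable" are hypothetical choices, not terms from this study.

```python
def label_by_rivality(ri, threshold):
    """Label molecules by comparing their rivality index with +/- threshold."""
    labels = []
    for r in ri:
        if r > threshold:
            labels.append("activity cliff")         # expected to be misclassified
        elif r < -threshold:
            labels.append("reliably classifiable")  # expected to be classified correctly
        else:
            labels.append("activity border")        # outcome depends on algorithm and folds
    return labels

# 0.0272 is the value quoted for molecule 1 of DS55; the other two are made up
labels = label_by_rivality([0.30, 0.0272, -0.50], threshold=0.1424)
```

With the DS55 threshold, the value 0.0272 falls inside the band and is labeled an activity border, in line with the discussion of molecule 1 above.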


The cases of molecules 114 and 115 are difficult to explain. These molecules are very similar to each other; indeed, the first nearest neighbor of each molecule is the other molecule, and the first neighbor of the other class is quite far away, in positions 6 and 23, respectively. Therefore, for these molecules the rivality index is negative and far from the threshold defined by nStdev. However, these molecules are sometimes incorrectly classified by the SVM algorithm. The only possible explanation is a rather inaccurate behavior of this algorithm (perhaps because the execution parameters were not tuned) for some compositions of the training and test sets, because in all five models built using the kNN algorithms (k=1, k=2 and k=10) these molecules are always correctly classified, supporting the idea that rivality index values far below zero describe easily and correctly classifiable molecules.

Weighted rivality index

We have observed that high positive values of the rivality index allow us to find those molecules with a high probability of being incorrectly classified, and that strongly negative values of the rivality index describe molecules with a high capability of being correctly classified by a QSAR algorithm. However, we have also observed two types of deviations between the information provided by the rivality index for the molecules of the dataset and the predictions obtained by the QSAR algorithm in the experiments carried out: i) some molecules are correctly or incorrectly classified depending on the composition of the training and test subsets, randomly partitioned for the five models built (redundancy), and ii) some molecules with a negative value of the rivality index are incorrectly classified.


The first type of deviation is mainly due to the activity borders, molecules with a rivality index close to zero (positive or negative). These molecules are difficult to classify correctly and they are very dependent on the second type of deviation, that is, the composition of the randomly selected molecules of the training and test sets. In many cases, the first two neighbors of a molecule belong to both the same and a different class, but a considerable number of the following neighbors belong to a single class (equal to or different from that of the considered molecule). Thus, when some of the first N neighbors are selected in the training set, a different result is obtained than when some of these molecules are not in the training set. This lack of correlation between the MODI* index, obtained from the rivality index, and the QSAR_CCR can be avoided by taking these other close neighbors into account. Thus, we defined a weighted rivality index as follows:

$$\overline{RI}_i = \frac{d_{ix} \cdot w_{ix} - d_{iy} \cdot w_{iy}}{d_{ix} \cdot w_{ix} + d_{iy} \cdot w_{iy}} \qquad (11)$$

where: $d_{ix}$ is the Euclidean distance from the molecule i to its first nearest neighbor x belonging to the same class as i, $d_{iy}$ is the Euclidean distance from the molecule i to its first nearest neighbor y that belongs to a different class than that of i, $w_{ix}$ is the weight assigned to the neighbor that belongs to the same class as i and $w_{iy}$ is the weight assigned to the neighbor belonging to a class different from the class of i. These weights are calculated through the following expression:

$$w_{ix} = \frac{CN_i - CN_{ix}}{CN_i}, \quad w_{iy} = \frac{CN_i - CN_{iy}}{CN_i}, \quad w_{ix} + w_{iy} = 1 \qquad (12)$$


where: $w_{ix}$ is the weight assigned to the distance between the molecule i and its first nearest neighbor x of the same class as i, $w_{iy}$ is the weight assigned to the distance between the molecule i and its first nearest neighbor y of a different class than that of i, $CN_i$ is the cardinality of the neighborhood assigned to the molecule i, and $CN_{ix}$ and $CN_{iy}$ are the numbers of molecules in the neighborhood belonging to the same and to a different class than molecule i, respectively. The cardinality of the neighborhood is calculated using a threshold of neighbors (TN) describing the minimum number of neighbors of each class that must exist in the neighborhood. Thus, for a TN selected for the dataset, the cardinality of the neighborhood could be different for each molecule because, for each molecule, its first TN nearest neighbors of each class can be in a different position in the neighbor ordering or at a different distance. Now, we can reformulate eq. 10 and calculate a weighted modelability index as follows:

$$WMODI^{*} = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{N_j} \sum_{i=1}^{N_j} 1, \quad \forall\, \overline{RI}_i \le 0 \qquad (13)$$
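As an illustration of eqs. 11 to 13, the following is a minimal sketch, not the authors' implementation. It assumes two classes, at least two molecules per class, and Euclidean distances computed directly from descriptor tuples. The function names, the toy data, the handling of a class with fewer than TN candidates, and the value returned for a 0/0 case are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def weighted_rivality(points, labels, tn=1):
    """Weighted rivality index (eq. 11) of every molecule, two-class case."""
    n = len(points)
    ri = []
    for i in range(n):
        # All other molecules, sorted by Euclidean distance to molecule i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        same = [j for j in order if labels[j] == labels[i]]
        diff = [j for j in order if labels[j] != labels[i]]
        # Assumption: a class with fewer than TN candidates contributes all of them.
        k_same, k_diff = min(tn, len(same)), min(tn, len(diff))
        # Neighborhood = shortest prefix holding TN neighbors of each class.
        last = max(order.index(same[k_same - 1]), order.index(diff[k_diff - 1]))
        hood = order[:last + 1]
        cn = len(hood)
        cn_same = sum(labels[j] == labels[i] for j in hood)
        cn_diff = cn - cn_same
        w_same = (cn - cn_same) / cn  # eq. 12
        w_diff = (cn - cn_diff) / cn
        d_same = dist(points[i], points[same[0]])
        d_diff = dist(points[i], points[diff[0]])
        num = d_same * w_same - d_diff * w_diff
        den = d_same * w_same + d_diff * w_diff
        # Assumption: a 0/0 case (both distances zero) is reported as 0.
        ri.append(num / den if den else 0.0)
    return ri


def wmodi(ri, labels):
    """Eq. 13: mean over classes of the fraction of molecules with RI <= 0."""
    fractions = []
    for c in sorted(set(labels)):
        values = [r for r, lab in zip(ri, labels) if lab == c]
        fractions.append(sum(r <= 0 for r in values) / len(values))
    return sum(fractions) / len(fractions)


# Toy one-descriptor dataset (hypothetical values).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
labels = [0, 0, 0, 1, 1, 1]
for tn in (1, 2):
    print(tn, round(wmodi(weighted_rivality(points, labels, tn), labels), 3))
```

In this toy set only the molecule at 3.0 has a positive index, because its nearest neighbors all belong to the other class, so one of the two classes contributes 2/3 instead of 1 to the average.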

Figure 8 shows two examples of the behavior of the weighted rivality index, representing a hypothetical arrangement of the first eleven nearest neighbors of a molecule. In the first case (Fig. 8, top), for molecule A, which belongs to the blue class, choosing TN=1 selects a neighborhood of three molecules: two of the same class as A (blue) and one of a different class (orange). The weights are then $w_{ix} = (3-2)/3 = 1/3$ and $w_{iy} = (3-1)/3 = 2/3$ for the distance to the first neighbor of the same class and to the first neighbor of a different class, respectively.

As the value of TN changes, the cardinality of the neighborhood also changes, and this cardinality can be different for each molecule of the dataset. Thus, for values of TN from 1 to 4, the values of $w_{ix}$ are 1/3, 2/5, 3/7, and 4/9. Figure 8 (bottom) clarifies how WMODI* refines the MODI* definition. For TN=1, the weights are $w_{ix} = w_{iy} = 1/2$, and the value of WMODI* is the same as that of MODI*. If the distance from B to C is equal to the distance from B to D ($d_{BC} - d_{BD} = 0$), molecule B would be considered correctly classifiable, and if the distance from B to C is lower than the distance from B to D ($d_{BC} - d_{BD} < 0$), molecule B would be considered incorrectly classifiable.

Figure 8. A hypothetical example of neighborhood to explain the behavior of the threshold of neighbors

However, if the value of TN increases, the weights change to take into account that molecule B has more neighbors of its own class (blue) than of the other class (orange). For instance, for TN=4, this change of weights means that, if the difference between the B-C and B-D distances is small, molecule B changes from being considered incorrectly classifiable to being considered correctly classifiable.
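The effect of TN on the weights, and the resulting change of sign for a molecule such as B, can be sketched with exact fractions. The neighbor orderings and the two distances below are hypothetical values chosen for illustration, not read from Fig. 8, and the helper name `weights` is an assumption.

```python
from fractions import Fraction


def weights(neighbor_classes, own_class, tn):
    """Eq. 12 weights (w_same, w_diff) from a distance-ordered list of neighbor classes."""
    same = diff = 0
    for cls in neighbor_classes:
        if cls == own_class:
            same += 1
        else:
            diff += 1
        if same >= tn and diff >= tn:  # TN neighbors of each class reached
            break
    cn = same + diff  # cardinality of the neighborhood
    return Fraction(cn - same, cn), Fraction(cn - diff, cn)


# Hypothetical neighbor order for a molecule like A (b = blue, o = orange).
seq_a = ["b", "b", "o", "b", "o", "b", "o", "b", "o", "b", "o"]
print([str(weights(seq_a, "b", tn)[0]) for tn in (1, 2, 3, 4)])

# Hypothetical case for a molecule like B: nearest orange neighbor at 1.0,
# nearest blue neighbor at 1.1, followed mostly by blue neighbors.
seq_b = ["o", "b", "b", "b", "b", "b", "o", "o", "o", "b", "b"]
d_same, d_diff = Fraction(11, 10), Fraction(1)
for tn in (1, 4):
    w_same, w_diff = weights(seq_b, "b", tn)
    ri = (d_same * w_same - d_diff * w_diff) / (d_same * w_same + d_diff * w_diff)
    print(tn, w_same, w_diff, float(ri))
```

With these assumed values the index is positive for TN=1, where both weights are 1/2, and becomes negative for TN=4, where the same-class weight drops to 4/9 because five of the nine neighbors are blue.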

Thus, by considering this threshold (TN) in the calculation of the $CN_i$ value, we can partially estimate a neighborhood status for each molecule of the dataset that resembles the one existing in any partition of the dataset in the cross-validation stage.

Golbraikh et al.20 also use weighting schemes for the calculation of MODI_ACI (see eq. 4) and MODI_ssCCR (see eq. 7), considering 2, 3, or 4 neighbors in the weight calculations and carrying out a similarity search in the entire descriptor space. In that case, however, the weighting is based on the distance between a given molecule and its N considered neighbors (see eqs. 4 and 8) instead of the density of neighbors of each class existing in the neighborhood, as we propose. Thus, different values of MODI_ssCCR and WMODI* are obtained because, for instance, k=4 in MODI_ssCCR means that the four nearest neighbors are taken into account for every molecule of the dataset, whereas TN=4 in WMODI* means that each molecule has its own neighborhood cardinality and, therefore, a different number of nearest neighbors is taken into account for each molecule.

In addition, MODI_ssCCR has a higher computational cost than WMODI*, since it needs a similarity search and a more complex procedure that amounts, in practice, to building a kNN QSAR model. In contrast, the calculation of WMODI* is as simple as that of MODI* and requires no measurable additional computational cost.

Figure 9 (left) shows the correlations (r2 values) between WMODI* and the average of the QSAR_CCR values obtained for the five models built, considering different values of TN for the 55 studied datasets (see Supporting Information). In Fig. 9 (left), the dashed lines show the r2 values corresponding to the correlations between QSAR_CCR and MODI*, and the filled squares represent the correlations of QSAR_CCR with the MODI of reference.23

For better visualization, the Y-axis with black labels (right) is used for the correlations of MODI* and WMODI* with QSAR_CCR, and the Y-axis with red labels (left) is used for the correlations of the MODI of reference with QSAR_CCR. We can observe, as expected, that for TN=1 the results obtained for WMODI* are equal to those obtained for MODI*. When the kNN algorithm is used, the correlation between QSAR_CCR and WMODI* is highest when TN is equal to the number of neighbors (k) considered by the algorithm, and it decreases for TN values lower or higher than k.

[Figure 9. Left panel: r2 versus Threshold of Neighbors (TN), with Y-axes "QSAR_CCR vs WMODI*" and "QSAR_CCR vs MODI of reference"; series: SVM (5 folds), SVM (10 folds), kNN (k=1), kNN (k=2), kNN (k=10). Right panel: modelability values versus datasets (1 to 55); series: MODI*, WMODI* (TN=1), WMODI* (TN=2), WMODI* (TN=4).]

Figure 9. Correlation (r2) between QSAR_CCR and WMODI* for different algorithms (left); comparison of MODI* with WMODI* values for different values of TN (right)

When the SVM algorithm is considered, the correlation between QSAR_CCR and WMODI* increases with TN up to TN=4 and decreases for higher values of TN. This optimum value of TN is not considerably affected by the number of folds used in the cross-validation (see red and dark blue lines). Thus, we see in Fig. 9 (left) that, although the r2 values with 10 folds are lower than those with 5 folds, the maximum correlations are obtained for TN=4 in both cases.
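The comparison of r2 across TN values described here can be sketched as follows. This is only an illustrative outline: the function names are assumptions, and the four-dataset numbers are made up for the example and are not taken from the paper or its Supporting Information.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def best_tn(qsar_ccr, wmodi_by_tn):
    """Return (TN, r2) with the highest r2 between QSAR_CCR and WMODI*."""
    scores = {tn: r_squared(values, qsar_ccr) for tn, values in wmodi_by_tn.items()}
    tn = max(scores, key=scores.get)
    return tn, scores[tn]


# Made-up values for four datasets; one WMODI* list per TN.
qsar_ccr = [0.60, 0.70, 0.80, 0.90]
wmodi_by_tn = {
    1: [0.66, 0.64, 0.85, 0.84],
    4: [0.62, 0.71, 0.79, 0.91],
    10: [0.70, 0.68, 0.75, 0.88],
}
print(best_tn(qsar_ccr, wmodi_by_tn))
```

In practice, `qsar_ccr` would hold the average QSAR_CCR of each dataset for one algorithm, and each list in `wmodi_by_tn` would hold the WMODI* values of the same datasets for one TN.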

Comparing the correlations of QSAR_CCR with MODI* and with WMODI*, we can also observe in Fig. 9 (left) the better behavior of WMODI*. For SVM, higher r2 values are always obtained for WMODI* than for MODI*, and for kNN, if the value of TN is equal to the value of k, the r2 values are equal or higher (see solid versus dashed lines).

This improvement of WMODI* over MODI* can be clearly appreciated in Fig. 9 (right) for low values of TN (between 1 and 4). This figure represents, for the 55 datasets, the values of MODI* (blue points) and of WMODI* for TN equal to 1 (red squares), 2 (green triangles), and 4 (violet circles). The values of MODI* are equal to those of WMODI* for TN=1. Fig. 9 (right) also shows that WMODI* varies with TN differently for each dataset. Some datasets are affected little or not at all by changes in TN (for instance, DS20). For other datasets, increasing TN increases WMODI* (for instance, DS51). Finally, for some other datasets, increasing TN decreases WMODI* (for instance, DS05).

These differences arise because the calculation of the weighted rivality index considers the presence of molecules of both classes in the neighborhood of each molecule, which refines the modelability calculation by weighting the distance measurements. This weighting is different for each molecule of each dataset: although the TN value is constant for the whole dataset, the cardinality of the neighborhood differs from molecule to molecule, since it depends on the relative similarity of each molecule to the remaining molecules of the dataset.

Because the behavior of WMODI* can be adjusted through the value of TN, the modelability of a dataset can be predicted according to the algorithm to be used in building the QSAR classification models.

[Figure 10. Top panel: normalized weighted rivality index values versus datasets (1 to 55) for TN=1, 3, 5, and 10. Bottom panel: WMODI* values versus datasets (1 to 55) for TN=1, 3, 5, and 10.]

Figure 10. Effect of the TN value on the rivality index (top); effect of the TN value on WMODI* (bottom)

For the building of models with algorithms based on nearest neighbors, it is advisable to weight the distances considering a neighborhood equal to the one used by the algorithm. In other cases, values of TN between 3 and 5 seem appropriate, because higher values of TN modify the distances by considering neighbors that would not be considered by the algorithms, whereas lower values of TN (for instance TN=1, as MODI_CCR and MODI* consider) may, for many molecules, fail to weight the neighbors' contribution, which affects whether some molecules are properly regarded as correctly or incorrectly classifiable.

In Fig. 10 we can observe more clearly the influence of the TN value on the weighted rivality index and on WMODI*. Figure 10 (top) shows the behavior of the normalized values of the weighted rivality index in relation to TN. The normalized weighted rivality index is calculated as the ratio between the weighted rivality index for the whole dataset and the number of molecules of the dataset. Fig. 10 (top) shows how the values of the rivality index are affected by changes in TN (for clarity, only some values of TN are represented). Although this figure shows the global behavior of the datasets, we can appreciate that changes in TN have a different effect on each dataset. Thus, the RI values of some datasets (e.g., DS21) are almost unaffected by changes in TN, whereas other datasets (e.g., DS05) are clearly affected, meaning that changes in TN cause some molecules of the dataset to change from incorrectly to correctly classifiable, and vice versa.

These changes in the rivality index are captured by WMODI*, as can be observed in Fig. 10 (bottom). Thus, datasets whose rivality index changes little with TN (e.g., DS21) also show few changes in WMODI*, and datasets whose rivality index changes markedly with TN (e.g., DS05) also show changes in WMODI*.

The changes in WMODI* caused by changes in RI do not always follow the same pattern. We can observe in Fig. 10 (top) that for the DS05 dataset the rivality index increases for values of TN from 1 to 10. This increase should produce a decrease in the WMODI* values, as is observed for this dataset in Fig. 10 (bottom). Nevertheless, for the DS40 dataset, the increase of the rivality index with TN produces an increase of WMODI*. This behavior results from some activity-border molecules changing their rivality index from positive to negative values while other molecules change their rivality index value yet maintain its sign (positive or negative). This situation occurs even though the global rivality value increases.

In addition, other datasets, such as DS52, are affected little or not at all by the increase of RI with TN. As we observe in Fig. 10 for the DS52 dataset, the RI value always increases with TN (from 1 to 10). However, the values of WMODI* remain approximately equal to 0.96 for any value of TN. This happens because DS52 has a large number of molecules (731), and the change of the rivality index from negative to positive for a small number of molecules (5 to 7) does not appreciably affect the value of the modelability.

Thus, global measurements of the rivality index for the classes existing in the dataset, and for the whole dataset, provide information about the characteristics of these classes and of the whole dataset (see also Figs. 4 and 5). However, the most important information (see Figs. 6 and 7) is given by the values of the weighted rivality index of the molecules of the dataset. These values measure the capacity of the molecules of the dataset to be classified correctly or incorrectly by an algorithm. Because of the weighting of the distances, these more refined values of the rivality index generate more refined values of WMODI*, leading to more accurate correlations with the QSAR_CCR values.
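The DS40-type behavior, in which the global rivality value rises while more molecules become correctly classifiable, can be reproduced with a small numerical sketch. It assumes that the rivality index of the whole dataset is the sum of the per-molecule values, so that the normalized index is their mean; the function names and the four RI values are invented for illustration.

```python
def normalized_rivality(ri):
    """Assumed definition: sum of per-molecule weighted RI over the number of molecules."""
    return sum(ri) / len(ri)


def fraction_correct(ri):
    """Fraction of molecules with RI <= 0 (considered correctly classifiable)."""
    return sum(r <= 0 for r in ri) / len(ri)


# Made-up RI values for the same four molecules at two TN settings.
ri_low_tn = [0.05, 0.05, -0.90, -0.90]
ri_high_tn = [-0.01, -0.01, -0.50, -0.50]

for name, ri in (("low TN", ri_low_tn), ("high TN", ri_high_tn)):
    print(name, round(normalized_rivality(ri), 3), fraction_correct(ri))
```

Here two border molecules cross from positive to negative while the other two move toward zero without changing sign, so the normalized index increases and the fraction of correctly classifiable molecules increases as well. For a dataset the size of DS52, a change of sign in 7 of 731 molecules alters that fraction by less than 0.01 over the whole dataset, which is consistent with the nearly constant WMODI* reported for it.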

[Figure 11. Left panels: r2 versus Threshold of Neighbors (TN), with Y-axes "QSAR_CCR vs WMODI*" and "QSAR_CCR vs MODI of reference". Right panels: QSAR_CCR versus datasets (1 to 55). Legend entries: Complex Tree, SVM (Linear), SVM (Gaussian), SVM (Cubic), Ensemble (Bag), Ensemble (kNN), kNN (k=1), kNN (k=2), kNN (k=10).]

Figure 11. Correlation values (r2) between the average of QSAR_CCR values obtained in the five models built and WMODI* using descriptor matrices (top-left) and behavior of the average of QSAR_CCR for the different algorithms using descriptor matrices (top-right); correlation values (r2) between the average of QSAR_CCR values obtained in the five models built and WMODI* using fingerprint matrices (bottom-left) and behavior of the average of QSAR_CCR for the different algorithms using fingerprint matrices (bottom-right)

Figure 11 (top-left) shows the correlation coefficients (r2 values) between WMODI* and the average of the QSAR_CCR values obtained in the five models built. Tables S2 and S3 of the Supporting Information gather the complete r2 results, as well as the slope and bias obtained in the correlations for all the applied algorithms; Tables S4 and S5 show the average QSAR_CCR values obtained in building the models for the different algorithms; and Tables S6 and S7 show the WMODI* values. This figure represents the results of those algorithms that generated the best values of QSAR_CCR (in addition to the kNN algorithm results shown in Fig. 9). These algorithms are SVM (with linear, Gaussian, and third-order polynomial kernels), Complex Tree, and Ensemble with Bag (Random Forest) and with discriminant kNN. All algorithms were executed with the default parameters of Matlab 2017b,26 so no particular optimization aimed at improving the correlations was applied.

In Fig. 11, dashed lines represent the correlation coefficient (r2) between QSAR_CCR and MODI*, solid lines represent the correlation coefficient between QSAR_CCR and WMODI*, and filled squares represent the correlation coefficient between QSAR_CCR and the MODI values gathered from the literature.23 For better visualization, the Y-axis with black labels (right) is used for the MODI* and WMODI* correlations, and the Y-axis with red labels (left) for the correlations of the MODI of reference.

As observed in Fig. 11 (top-left), when descriptor-based matrices are used to represent the datasets, the worst correlations are obtained with the MODI of reference; these correlations are always lower than those obtained for MODI* and WMODI*. Besides, when TN is equal to 1, WMODI* generates the same results as MODI*, as described above. Except for the Complex Tree and ensemble discriminant kNN algorithms, increasing TN to values greater than two also increases the correlation values (r2). This behavior is the same as that observed for the kNN algorithm (k=1, k=2) shown in Fig. 9.

In the remaining algorithms, TN plays an important role in the results. Values of TN greater than 1 clearly improve the prediction of the dataset modelability, and values of TN between 3 and 5 generate excellent correlations for almost all algorithms. As can be observed in Tables S2 and S3 of the Supporting Information, in most cases the r2 values are between 0.8 and 0.9, the slopes are close to 1, and the bias values are close to zero. We also observe in Fig. 11 (top-left) that, as described above, the accuracy of WMODI* decreases for high TN values. This implies that considering a high neighborhood cardinality for the molecules of the dataset introduces noise into the rivality index calculation, because an excessive number of neighbors is considered.

The worst correlations are obtained when the ensemble discriminant kNN algorithm is used. This behavior can be explained by observing Fig. 11 (top-right), which represents the average of the QSAR_CCR values obtained in the five models built for all the algorithms used. We can appreciate that ensemble kNN (orange line) is the algorithm with the worst behavior, generating very low values of QSAR_CCR and, therefore, low correlations with WMODI*. We can also observe slight changes of QSAR_CCR among the considered algorithms. These lead to correspondingly slight changes in the correlation between QSAR_CCR and WMODI*, which are easily accommodated by changes in the TN value.

To highlight the behavior of the WMODI* index, Fig. 11 (bottom) shows the same study using fingerprint matrices to represent the datasets. The behavior with fingerprint matrices is practically the same as described above. In all cases, the worst correlations are obtained between QSAR_CCR and the MODI of reference. For algorithms based on neighbors, the best correlations are obtained when the value of TN is equal to the k value and, for the remaining algorithms, intermediate values of TN generate excellent correlations between WMODI* and QSAR_CCR, with high values of r2, slopes close to 1, and bias values close to zero (see Supporting Information).

It should be highlighted that, when fingerprint matrices are used, for a large number of datasets, although the molecules are represented with different fingerprints, the distances between a given molecule and its N nearest neighbors are equal to zero, and these nearest neighbors include molecules of both the same and a different class. However, despite this problem, which is inherent to this type of data representation and affects both the QSAR algorithm and the WMODI* calculation, the correlations between WMODI* and QSAR_CCR are very high, clearly improving on those obtained using MODI_CCR.
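The zero-distance situation described above matters because the weighted rivality index becomes a 0/0 expression when the nearest neighbor of the same class and the nearest neighbor of a different class are both at distance zero. The sketch below shows one possible guard. The function name and the `tie_value` convention are assumptions for illustration only; the text does not state how such ties are resolved.

```python
def rivality_with_tie_guard(d_same, d_diff, w_same=0.5, w_diff=0.5, tie_value=0.0):
    """Weighted rivality index of one molecule, guarding the 0/0 case.

    tie_value is a hypothetical convention returned when both weighted
    distances are zero.
    """
    den = d_same * w_same + d_diff * w_diff
    if den == 0:
        return tie_value
    return (d_same * w_same - d_diff * w_diff) / den


print(rivality_with_tie_guard(0.0, 0.0))  # both nearest neighbors at distance zero
print(rivality_with_tie_guard(0.0, 0.3))  # only the same-class neighbor at distance zero
print(rivality_with_tie_guard(0.3, 0.0))  # only the different-class neighbor at distance zero
```

With a tie value of zero, such a molecule would be counted as correctly classifiable under an RI ≤ 0 criterion, so the chosen convention directly affects the index for datasets with many zero distances.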

CONCLUSIONS

Knowing the capacity of a dataset to be modeled in the first stages of building QSAR prediction models is important because it can reduce the effort and time of researchers when selecting or rejecting datasets and refining their composition, which allows them to generate robust and validated models with a wide applicability domain. The modelability index (MODI_CCR) is a low-cost and easily interpretable measurement of dataset modelability that correlates well with the experimental values of QSAR_CCR that would be obtained in building classification models. However, the simple definition of this index fails to explain some possible cases related to the distances between each molecule of the dataset and its N nearest neighbors.

In this paper, we have proposed a more general formulation of this index. This formulation considers not only the first nearest neighbor of each molecule of the dataset but also its first nearest neighbor in each of the activity classes existing in the dataset. The calculation of the modelability index (MODI*) with this new formulation does not add any computational cost; however, it avoids any misunderstanding in the calculation of the modelability index, and it allows the calculation of a rivality index describing the capacity of each molecule of the dataset to be correctly classified.

By weighting the rivality index through a threshold of neighbors of the existing classes, we are able to redefine the modelability index, obtaining a new and more refined index (WMODI*) that predicts with a higher correlation the correct classification rate (QSAR_CCR) obtained in building the QSAR model. We have proved the robustness of the weighted modelability index for a set of fifty-five benchmark datasets using different statistical algorithms and different data representations of the datasets. By adjusting the value of the threshold of neighbors, we have obtained highly robust correlations between WMODI* and QSAR_CCR, with r2 values greater than 0.9, slopes close to 1, bias close to zero, and very low values of RMSE (root-mean-square error) for different classification algorithms and input data representations.

For instance, for TN=4 and using the ensemble Bag algorithm, the correlation is $QSAR\_CCR = 0.94 \times WMODI^{*} + 0.06$ (R2=0.91, RMSE=0.02, F=539, p=1.9×10^-29) when descriptor matrices are used as input data, and $QSAR\_CCR = 0.89 \times WMODI^{*} - 0.01$ (R2=0.89, RMSE=0.03, F=424, p=6.1×10^-27) when fingerprint matrices are used as input data.
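As a usage sketch, the two linear fits reported above (TN=4, ensemble Bag) can be applied to estimate QSAR_CCR from a WMODI* value. The slopes and intercepts are those given in the text; the function name, the dictionary keys, and the example input of 0.80 are assumptions for illustration.

```python
# Slope and intercept of the fits reported for TN=4 with the ensemble Bag algorithm.
FITS = {
    "descriptors": (0.94, 0.06),
    "fingerprints": (0.89, -0.01),
}


def predict_qsar_ccr(wmodi, representation="descriptors"):
    """Estimate QSAR_CCR from WMODI* with the reported slope and intercept."""
    slope, intercept = FITS[representation]
    return slope * wmodi + intercept


for rep in FITS:
    print(rep, round(predict_qsar_ccr(0.80, rep), 3))
```

Such an estimate applies only to the algorithm, TN value, and data representation for which the fit was obtained.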

These results demonstrate that the values obtained for WMODI* are robust and very close to those that could be obtained for QSAR_CCR in the building of a QSAR classification model, improving the results obtained for other modelability indexes proposed in the literature.

The weighted index not only provides a better-adjusted a priori measurement of dataset modelability but also, by means of the rivality index, provides absolute modelability measurements for each molecule of the dataset. The rivality index has also allowed us to generate a modelability index for the a priori prediction of modelability in QSAR regression models, a subject that we are currently preparing for a forthcoming paper.

ASSOCIATED CONTENT

Supporting Information

Figure S1. Cardinality of the datasets and percentage of molecules of classes 0 and 1.

Figure S2. Values of the MODI of reference for the fifty-five datasets.

Table S1. Datasets used in the manuscript with their MODI of reference.

Table S2. Correlations between MODI of reference, MODI*, and WMODI* with the average of QSAR_CCR values obtained in the five models built using descriptor-based representations of the 55 studied datasets.

Table S3. Correlations between MODI of reference, MODI*, and WMODI* with the average of QSAR_CCR values obtained in the five models built using MACCS fingerprint-based representations of the 55 studied datasets.

Table S4. Average of QSAR_CCR values obtained in the five models built for the 55 datasets using descriptor matrices as input data representation.

Table S5. Average of QSAR_CCR values obtained in the five models built for the 55 datasets using MACCS fingerprint matrices as input data representation.

Table S6. WMODI* values for the fifty-five datasets and different TN values using descriptor matrices as input data representations.

Table S7. WMODI* values for the fifty-five datasets and different TN values using MACCS fingerprint matrices as input data representations.

NOTES: The authors declare no competing financial interest.

REFERENCES [1]

Roy, K.; Kar, S.; Das, R.N. A Primer on QSAR/QSPR Modeling. Springer International Publishing. 2015.

[2]

Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inform. 2010, 29, 476−488.

[3]

Cherkasov, A.; Muratov, E.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R.; Consonni, V.; Kuz’min, V.E.; Cramer, R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A. QSAR Modeling: Where Have You Been? Where Are You Going To?. J. Med. Chem. 2014, 57, 4977-5010.

[4]

Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemometr. Intell. Lab. 2015, 145, 22–29.

[5]

Netzeva, T.I.; Worth, A.P.; Aldenberg, T.; Benigni, R.; Cronin, M.T.D.; Gramatica, P., Jaworska, J.S., Kahn, S., Klopman, G., Marchant, C.A. Current status of methods for defining the applicability domain of (quantitative) structure–activity relationships, ATLA Altern. Lab. Anim. 2005, 33, 155–173.

[6]

Veerasamy, R.; Rajak, H.; Jain. A.; Sivadasan, S.; Varghese, C.P.; Agrawal, R.K. Validation of QSAR Models-Strategies and Importance. Int. J. Drug Des. Discovery. 2011, 2, 511-519.

[7]

Norinder, U.; Carlsson, L.; Boyer, S.; Eklund, M. Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination. J. Chem. Inf. Model. 2014, 54, 1596−1603.

[8]

Golbraikh, A.; Muratov, E.; Fourches, D.; Tropsha, A. Data Set Modelability by QSAR. J. Chem. Inf. Model. 2014, 54, 1−4.

[9]

Maggiora, G. M. On Outliers and Activity Cliffs–Why QSAR Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535-1535.

[10]

Guha, R.; Van Drie, J.H. Structure–Activity Landscape Index: Identifying and Quantifying Activity Cliffs. J. Chem. Inf. Model. 2008, 48, 646−658.

ACS Paragon Plus Environment

47

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

[11]

Page 48 of 50

Stumpfe, D.; Hu, Y.; Dimova, D.; Bajorath, J. Recent Progress in Understanding Activity Cliffs and Their Utility in Medicinal Chemistry. J. Med. Chem. 2014, 57, 18−28.

[12]

Vogt, M.; Huang, Y.; Bajorath, J. From Activity Cliffs to Activity Ridges: Informative Data Structures for SAR Analysis. J. Chem. Inf. Model. 2011, 51, 1848−1856.

[13]

Hu, Y.; Bajorath, J. Extending the Activity Cliff Concept: Structural Categorization of Activity Cliffs and Systematic Identification of Different Types of Cliffs in the ChEMBL Database. J. Chem. Inf. Model. 2012, 52, 1806−1811.

[14]

Guha,

R.

Exploring

Uncharted

Territories:

Predicting

Activity

Cliffs

in

Structure−Activity Landscapes. J. Chem. Inf. Model. 2012, 52, 2181−2191. [15]

Heikamp, K.; Hu, X.; Yan, A.; Bajorath, J. Prediction of Activity Cliffs Using Support Vector Machines. J. Chem. Inf. Model. 2012, 52, 2354−2365.

[16]

Guha, R.; Van Drie, J.H. Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs. J. Chem. Inf. Model. 2008, 48, 646−658.

[17]

Puzyn, T.; Roy, K. MODelability Index 1.0 (MODI) Manual. Laboratory of Environmental Chemometrics. http://www.qsar.eu.org/software. Last accessed March, 2018.

[18]

Hu, X; Hu, Y.; Vogt, M.; Stumpfe, D.; Bajorath, J. MMP-Cliffs: Systematic Identification of Activity Cliffs on the Basis of Matched Molecular Pairs. J. Chem. Inf. Model. 2012, 52, 1138−1145.

[19]

Adilova, F.; Ikramov, A. Data Set Analysis for the Calculation of the QSAR Models Predictive Efficiency Based on Activity Cliffs. Adv. Tech. Biol. Med. 2017, 5:2.

[20]

Golbraikh, A.; Fourches, D.; Sedykh, A.; Muratov, E.; Liepina, I.; Tropsha, A. Modelability Criteria: Statistical Characteristics Estimating Feasibility to Build Predictive QSAR Models for a Dataset. In Practical Aspects of Computational Chemistry III, Leszczynski, J., Shukla, M. K., Eds. Chapter 7. Springer. 2014. pp. 187−230.

[21]

Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative Structure–Property Relationship Approach Based on the k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185−194.

[22]

Marcou, G.; Horvath, D.; Varnek, A. Kernel Target Alignment Parameter: A New Modelability Measure for Regression Tasks. J. Chem. Inf. Model. 2016, 56, 6−11.

[23]

Chembench website. Carolina Exploratory Center for Cheminformatics Research (CECCR). https://chembench.mml.unc.edu/. Last accessed March, 2018.

[24]

The Chemistry Development Kit (CDK). https://cdk.github.io/. Last accessed March, 2018.

[25]

MATLAB and Simulink. MATLAB R2017b. https://www.mathworks.com/products/matlab.html. Last accessed March, 2018.

[26]

Statistics and Machine Learning Toolbox. MATLAB R2017b. https://www.mathworks.com/products/statistics.html. Last accessed March, 2018.

[27]

Yap, C.W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466-1474.

For Table of Contents Use Only

Manuscript title: Study of the Datasets Modelability: Modelability, Rivality and Weighted Modelability Indexes

Authors: Irene Luque Ruiz and Miguel Ángel Gómez-Nieto

Author-created TOC graphic: Irene Luque Ruiz
