Computational Chemistry

Regression Modelability Index: A New Index for the Prediction of the Modelability of the Datasets in the Development of QSAR Regression Models
Irene Luque Ruiz and Miguel Ángel Gómez-Nieto
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00313 • Publication Date (Web): 11 Sep 2018

Journal of Chemical Information and Modeling

Regression Modelability Index: A New Index for the Prediction of the Modelability of the Datasets in the Development of QSAR Regression Models

Irene Luque Ruiz1, Miguel Ángel Gómez-Nieto

University of Córdoba, Department of Computing and Numerical Analysis, Campus de Rabanales, Albert Einstein building, E-14071 Córdoba, Spain. {iluque, mangel}@uco.es

ABSTRACT: Predicting whether a dataset can be modeled by a statistical algorithm is an important issue in the development of regression QSAR models: it allows researchers to avoid unnecessary tasks and wasted time, and/or to curate the molecular composition of the dataset in order to improve the accuracy of the model. In this paper we propose and formulate a new index that correlates with the performance of QSAR models. This index, the regression modelability index, has a very low computational cost and is based on the rivality between the nearest neighbors of the molecules of the dataset. This rivality measures the capability of each molecule of the dataset to be correctly predicted by a regression algorithm. In this study, using forty datasets that differ widely in the number of molecules and in their activity values, we show a high correlation between the proposed regression modelability index and the cross-validated correlation coefficient (Q2), reaching values of r2 = 0.8. In addition, we describe the ability of this index to identify the outliers detected by the regression algorithms, which allows easy dataset curation in the first stages of building QSAR regression models.

1 Corresponding author. Email: [email protected], Phone: +34-957-212082

INTRODUCTION

The development of QSAR models for predicting the activity of the molecules of a dataset requires several stages that must be performed following strict procedures and, therefore, demands a considerable amount of time and effort. The tasks to be performed include, among others, the selection of the dataset composition (including similar and diverse molecules with activity values in an appropriate range), the selection and testing of different statistical algorithms, the use of different validation procedures (y-randomization, bootstrapping, cross-validation, etc.) for building training models, the detection and removal of outliers or activity cliffs, and the partitioning of the dataset for the external validation of the built model. All of them imply time, effort and cost in building robust and applicable predictive QSAR models1-5.

To facilitate these tasks and to avoid wasted time, Golbraikh et al.6 proposed a MODelability Index (MODI_CCR) that correlates with the Correct Classification Rate (QSAR_CCR) that could be obtained by a QSAR classification model. This index gives researchers an excellent tool, capable of anticipating, in the
first stages of the development of a QSAR model, the accuracy of a classification model built for a dataset. Basically, the calculation of the MODI_CCR index is based on the detection of activity cliffs7-12, considering as activity cliffs “the pairs of molecules that are the most similar to each other but belonging to different activity classes”6,16.

In a previous investigation, Luque Ruiz et al.13 reformulated the MODI_CCR index and proposed an improvement based on the rivality index concept. The rivality index (RI) measures the capability of a molecule of the dataset to be correctly classified, and it also allows the detection of the molecules responsible for the formation of activity cliffs. These authors improved the modelability index for classification models by considering the neighborhood of the molecules of the datasets. Thus, a weighted modelability index (WMODI*) was obtained by taking into account the density of the neighbors of each class nearest to each molecule. The WMODI* index is based on the calculation of the RI for all the molecules of the dataset13. The RI is a normalized measurement that takes values in the interval [-1,1]. Thus, considering a threshold of neighbors, the WMODI* index is calculated by counting the number of molecules with an RI value less than or equal to zero. The authors tested the weighting of the modelability index using fifty-five datasets and six classification algorithms, obtaining correlations between WMODI* and QSAR_CCR higher than 0.8.

Proposing an index capable of predicting the modelability of a dataset for building QSAR regression models is more complex, because the activity of the molecules takes continuous values instead of discrete ones.

In spite of that, the foundations of the rivality and WMODI* indexes can also be applied to the prediction of the value of Q2 (the cross-validated correlation coefficient) in the building of regression models, as we describe in sections 2 and 3 of this manuscript. The difficulty of obtaining a dataset modelability index for regression models is reflected in the small number of proposals described in the literature.

Guha et al.14 proposed an index for the identification of activity cliffs, defined as pairs of molecules that are most similar but have the largest change in activity, which can be used as a modelability measurement for regression models of datasets. This index, named the structure-activity landscape index (SALI), is calculated as follows:

$$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)} \tag{1}$$

where Ai and Aj are the activities of the ith and jth molecules, and sim(i,j) is the similarity coefficient between the two molecules. The authors proposed an elegant way to visualize the activity cliffs detected by the SALI index. From an N x N matrix, N being the number of molecules of the dataset, that stores the values of the SALI index for the pairs of molecules (i, j), they built a graph representation of structure-activity cliffs. In this graph, each molecule is a node, and nodes i and j are connected if the element (i, j) of the SALI matrix is greater than a manually specified SALI cutoff. By varying the cutoff, and hence the number of detected activity cliffs, the authors were able to identify the most significant activity cliffs. For low cutoff values a dense graph is obtained and a high number of activity cliffs is detected; conversely, high cutoff values generate sparsely connected graphs, showing a low number of pairs of molecules detected as activity cliffs.
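The SALI matrix of eq. 1 and the cutoff-based graph can be illustrated with a minimal sketch. The function names `sali_matrix` and `cliff_edges` are hypothetical, the similarity matrix is assumed to be precomputed (for example, from fingerprints), and pairs with sim(i,j) = 1 are simply left undefined; none of these choices is taken from Guha et al.

```python
import numpy as np

def sali_matrix(activities, similarity):
    """Pairwise SALI values, eq. 1: |A_i - A_j| / (1 - sim(i, j)).

    Pairs with sim(i, j) == 1 have a zero denominator and are left as NaN
    (an assumption of this sketch); the diagonal is set to 0.
    """
    a = np.asarray(activities, dtype=float)
    sim = np.asarray(similarity, dtype=float)
    diff = np.abs(a[:, None] - a[None, :])   # |A_i - A_j| for every pair
    denom = 1.0 - sim
    sali = np.full(sim.shape, np.nan)
    valid = denom > 0
    sali[valid] = diff[valid] / denom[valid]
    np.fill_diagonal(sali, 0.0)
    return sali

def cliff_edges(sali, cutoff):
    """Graph edges (i, j), i < j, whose SALI value is greater than the cutoff."""
    rows, cols = np.where(np.triu(sali > cutoff, k=1))
    return list(zip(rows.tolist(), cols.tolist()))
```

Calling `cliff_edges` with decreasing cutoff values on the same matrix returns progressively more edges, which corresponds to the denser graphs described above.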

Marcou et al.15 proposed the kernel target alignment index (KTA), the dissimilarity/diversity index (Div) and the similarity index (Sim), obtaining high correlations between KTA and the Root-Mean Square Error (RMSE) calculated in the building of QSAR models for four datasets represented in a large number of descriptor spaces.

More recently, Golbraikh et al.16 presented a broad and detailed proposal of modelability indexes capable of predicting the results of regression models. This proposal is based on a set of indexes aiming to measure the diversity, activity cliffs and modelability of the datasets. Thus, Golbraikh et al.16 defined the dataset diversity index as follows:

$$\mathrm{MODI\_DIV} = \frac{1}{MK} \sum_{i=1}^{M} \sum_{j=1}^{K} d_{ij} \tag{2}$$

where M is the number of molecules of the dataset, K is the number of nearest neighbors considered, and dij is the normalized distance between molecule i and its nearest neighbor j. The activity cliff index is proposed in a similar way, based on the activities of the nearest neighbors, as follows:

$$\mathrm{MODI\_ACI} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{K} w_{ij}\,|A_i - A_j|}{\sum_{j=1}^{K} w_{ij}} \tag{3}$$

where M is the number of molecules of the dataset, K is the number of nearest neighbors considered in the calculation, wij is the distance-dependent weight between molecule i and its
nearest neighbor j, Ai is the activity value of molecule i and Aj is the activity value of the nearest neighbor j of molecule i.

In addition, Golbraikh et al.16 defined, for datasets with binary response variables and also applicable to continuous response variables, the modelability index (MODI_CCR) as “the correct classification rate (CCR) for leave-one-out cross-validation with 1-nearest neighbor (1-kNN) in the entire descriptor space”. Therefore, for binary datasets the value of this index is obtained by counting whether the first nearest neighbor of each molecule i of the dataset belongs to the same class as molecule i. This index is formulated as follows:

$$\mathrm{MODI\_CCR} = \frac{1}{C} \sum_{i=1}^{C} \frac{N_{ii}}{N_i} \tag{4}$$

where C is the number of classes (two for binary datasets), Nii is the number of molecules of class i predicted correctly (molecules whose first nearest neighbor belongs to the same class), and Ni is the total number of compounds of class i in the dataset.

To use this index for datasets with continuous activity values, the authors split the dataset into two or more categories or classes (bins) according to the property values, for instance, very active, active, moderate, etc., or class 0, class 1, class 2, class 3, etc. The calculation is performed by counting the molecules of each category or bin whose first nearest neighbor belongs to the same category or bin. With this count, the authors16 build a square confusion matrix D, with as many rows and columns as categories or classes, storing the number of molecules of each category or class predicted in the different categories or classes. This confusion matrix is normalized by columns, dividing each element Dij
by the sum of all elements of column j. Each element Dij can then be considered as the conditional probability that a compound of category j is predicted to belong to category i. This normalized confusion matrix allows the authors to calculate the prediction error as follows:

$$E = \sum_{i=1}^{C} \sum_{j=1}^{C} D_{ij}\, f(|i - j|) \tag{5}$$

where Dij is the value of the (i,j) element of the normalized confusion matrix, and f is a monotonically increasing function, here f(|i − j|) = |i − j|. Assuming that the expected value of each element of the normalized confusion matrix is 1/C (C being the number of categories, classes or bins), and also assuming that the number of molecules of each bin is a priori unknown, the expected error Eexp can also be calculated using eq. 5. The authors then proposed the following modelability index:

$$\mathrm{MODI\_CAT} = 1 - \frac{E}{E_{exp}} \tag{6}$$
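Eqs. 4 to 6 can be followed step by step in a short sketch. It assumes Euclidean distances in the descriptor space, integer bin labels 0..C-1 that are already assigned, and at least one molecule per bin; the function names are hypothetical and this is not the implementation used by Golbraikh et al.

```python
import numpy as np

def nearest_neighbor_confusion(X, labels, n_classes):
    """Confusion matrix D[i, j]: number of molecules of class j whose first
    nearest neighbor (leave-one-out, Euclidean distance) belongs to class i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=int)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a molecule is not its own neighbor
    nearest = dist.argmin(axis=1)
    D = np.zeros((n_classes, n_classes))
    for true_cls, pred_cls in zip(labels, labels[nearest]):
        D[pred_cls, true_cls] += 1
    return D

def modi_ccr(D):
    """Eq. 4: average over the classes of N_ii / N_i (column sums give N_i)."""
    return float(np.mean(np.diag(D) / D.sum(axis=0)))

def modi_cat(D):
    """Eqs. 5 and 6: 1 - E / E_exp, with f(|i - j|) = |i - j|."""
    C = D.shape[0]
    Dn = D / D.sum(axis=0, keepdims=True)    # normalization by columns
    idx = np.arange(C)
    penalty = np.abs(idx[:, None] - idx[None, :])
    E = float((Dn * penalty).sum())
    E_exp = float((penalty / C).sum())       # every element expected to be 1/C
    return 1.0 - E / E_exp
```

With two bins, a dataset in which every molecule has a first nearest neighbor in its own bin gives a diagonal D, so that both `modi_ccr` and `modi_cat` reach their maximum value of 1.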

Finally, Golbraikh et al.16 proposed a more refined modelability index based on a similarity search procedure that uses a k-nearest neighbors approach without variable selection, carried out in the entire descriptor space. Thus, using a standard kNN algorithm developed by the authors17 with leave-one-out (LOO) and leave-group-out (LGO, 5-fold) cross-validation, the refined MODI_q2 and MODI_ssR2 indexes are obtained by combining the results generated in each hold-out prediction.

The calculation of these refined indexes has a higher computational cost than that of MODI_CCR. The process is also performed by means of a weighting scheme in which the neighbors nearest to a molecule are assigned a higher weight than the farthest ones. Using the similarity search criterion, the authors must define an applicability threshold. This threshold is defined as the maximum distance at which a neighbor is still considered a nearest neighbor of a given molecule, and it is calculated as follows:

$$D_{threshold} = d_{av} + Z s \tag{7}$$

where dav is the average distance of the k nearest neighbors to a given molecule, s is the standard deviation of the distances in the modeling set, and Z is a cutoff set manually for the calculation.

The authors tested all these indexes on 14 datasets with continuous activity values, using different weighting schemes, different values of the Z cutoff, different numbers of categories or bins, and different numbers k of nearest neighbors. When k is greater than 1, the authors generate as many predictions of the indexes as there are k values, in order to finally obtain a consolidated global value of the corresponding indexes. As a result, the authors obtained excellent correlations between MODI_ssR2 and MODI_q2 and the values of Q2 obtained in the regression QSAR models generated with the Random Forest (RF) algorithm. The correlations between MODI_CAT (using 3 and 4 bins) and Q2 showed lower values of R2 and higher values of slope, and the worst results, with low correlation values, were obtained for the MODI_ACI and MODI_DIV indexes.

In this paper we describe a new index capable of predicting, with high accuracy and low computational cost, the results of regression models for datasets of molecules. The new index
is based on the rivality index previously proposed by the authors for classification models. It takes into account the nearest neighbors and the cardinality of the neighborhood of each molecule, and it is able to detect the molecules responsible for the formation of activity cliffs or outliers arising in the building of QSAR regression models.

The modelability index for regression proposed in this manuscript does not depend on the number of categories or bins into which the continuous activity values are categorized. In our approach, the activity values are always discretized into two classes, generating binary datasets. Molecules whose activity values are nearest (greater or lower) to that of a given molecule are considered its nearest activity molecules and, therefore, belong to the same class. Because this categorization is carried out by self-adapting to each molecule of the dataset, the modelability index for regression is obtained in a simple and fast calculation, and its values are very close to the Q2 values obtained with a QSAR regression algorithm.

The results obtained using forty datasets and nine statistical algorithms show excellent correlations (high values of R2) between this index and the cross-validated correlation coefficient (Q2), with slopes close to 1, intercepts close to zero and very low values of the Root-Mean Square Error (RMSE). Thus, we recommend the use of this index in the early stages of the development of QSAR regression models.

The manuscript is organized as follows. After this Introduction, the Materials and Methods section describes the datasets and algorithms used in the experiments and analyzes the calculation of the regression modelability index (RMODI). That section also describes the rivality index and introduces the formulation for the calculation of the dataset's modelability. The next section describes the results obtained for the forty datasets used for the calculation of RMODI and how this modelability index is improved by means of the
consideration of weighted distance measurements. That section also describes the experiments performed with the proposed index, showing the excellent correlations obtained with the experimental cross-validated correlation coefficient (Q2). Finally, the conclusions supporting the proposal made in this work are presented.

MATERIALS AND METHODS

Datasets description

The calculations described in this paper were carried out with forty datasets gathered from different sources, some of them widely used as benchmarks in the literature18-22 (see Table S1 in Supporting Information). From these sources, the molecular information (SDF23 or SMILES24 files) and the activity values of the molecules were extracted. Molecules with a missing or non-numeric activity value were removed from the datasets, for instance, molecules with value of activity as >value, 1), it implies considering a very low certainty in the activity values of the nearest neighbors needed to predict the activity of a given molecule and, therefore, most (or all) of the molecules of the dataset would be assigned to class 0. Then, the WRMODI would be close (or equal) to 1. In this case, the selected molecule and the molecules with an activity value very far from that of the selected molecule would be considered as belonging to the same class; the modeler would be establishing that both structurally similar and structurally very dissimilar molecules are those with very close activity values. Therefore, the value of δ determines the confidence interval (see eq. 9) that the modeler should assign in order to place in the same class the selected molecule and the molecules with activity values within this interval.

On the other hand, the adjustment of TN by the modeler is simpler. Increasing TN changes the number of nearest neighbors of both classes considered in the calculation of the weighted rivality index and, therefore, the density of molecules considered around the selected molecule. High values of TN imply the consideration of more distant neighbors and, therefore, an excessive increase in the certainty of the prediction and a decrease in WRMODI, and vice versa. Values of TN that require only a few neighbors in the prediction should therefore be tested.
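The effect of δ discussed above can be explored with an unweighted sketch. Everything specific in it is an assumption made for illustration and is not taken from this manuscript: the same-class interval is taken as |Aj − Ai| ≤ δ times the activity range of the dataset, the rivality value is taken as the normalized difference between the distance to the nearest same-class neighbor and the distance to the nearest rival (which yields values in [-1,1]), the index is taken as the fraction of molecules with a rivality value less than or equal to zero, and the TN weighting is omitted. The function names are hypothetical.

```python
import numpy as np

def rivality_sketch(X, activities, delta):
    """Per-molecule rivality-like values in [-1, 1] (illustrative assumption).

    For molecule i, class 0 holds the molecules whose activity lies within
    delta * (activity range) of A_i; all other molecules are its rivals.
    """
    X = np.asarray(X, dtype=float)
    a = np.asarray(activities, dtype=float)
    n = len(a)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    half_width = delta * (a.max() - a.min())
    ri = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i
        same = others & (np.abs(a - a[i]) <= half_width)
        rival = others & ~same
        if not rival.any():        # no rivals: nothing contradicts molecule i
            ri[i] = -1.0
        elif not same.any():       # no same-class neighbor at all
            ri[i] = 1.0
        else:
            d_same = dist[i, same].min()
            d_rival = dist[i, rival].min()
            total = d_same + d_rival
            ri[i] = 0.0 if total == 0 else (d_same - d_rival) / total
    return ri

def rmodi_sketch(X, activities, delta):
    """Fraction of molecules with a rivality value <= 0 (assumed definition)."""
    return float(np.mean(rivality_sketch(X, activities, delta) <= 0))
```

Under these assumptions, setting `delta` to 1 or more places every molecule in class 0 of every other molecule, so the sketch returns 1 regardless of the descriptor space, in line with the behavior described for large values of δ.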

CONCLUSIONS

Predicting whether a dataset can be modeled by a statistical algorithm is an important issue in the development of regression QSAR models: it allows researchers to avoid unnecessary tasks and wasted time, and/or to curate the molecular composition of the dataset to improve the accuracy of the model, generating robust and validated models with a wide applicability domain.

In this paper we have proposed a new index, the Regression MODelability Index (RMODI). This index is highly correlated with the cross-validated correlation coefficient Q2 obtained in the building of regression QSAR models. Its formulation is based on a categorization of the continuous activity values of the molecules of the dataset using a self-adapting model that dynamically builds binary datasets depending on the activity value of each molecule. As binary datasets are always generated, and nearest activity molecules are always assigned to the same class, the calculation of this modelability index for regression is fast, has a very low computational cost, and is as simple as the calculation of the modelability index for classification.

The RMODI formulation is based on the measurement of the rivality between the nearest neighbors of the molecules of the dataset. By calculating a rivality index for the molecules of the dataset, predictable molecules and activity cliffs are discovered and the RMODI index is obtained with a very low computational cost. In addition, the correlation between RMODI and Q2 is improved by weighting the rivality index. This weighting is carried out by considering a threshold of neighbors of the molecules of the existing classes. Thanks to this threshold, a different number of nearest neighbors is considered for each molecule of the dataset, which refines the values of RMODI until reaching correlations with Q2 with r2 values greater than 0.8, slopes close to 1 and intercepts close to zero. Moreover, the values of RMODI are very close to the values of Q2, so no criteria other than those already used for Q2 need to be defined to consider a dataset as modelable or not.

The experiments carried out in this paper, using forty very diverse datasets composed of molecules with a wide range of activity values, have allowed us to validate the proposed index. The results show that this index can be a good tool for researchers in the early stages of building regression QSAR models because, apart from predicting the results of the regression models, it is capable of detecting the main outliers of the models as those molecules with an activity cliff behavior.

ASSOCIATED CONTENT

Supporting Information

Figure S1. Frequency histograms of the activity values for the forty datasets used in the experiments.

Table S1. Datasets used in the experiments, with their information source references, name, assigned code, number of molecules and number of descriptors (columns) contained in the descriptor matrices.

Table S2. Information on the descriptor matrices representing the forty datasets.

Tables S3-S11. Results of the regression models generated for the forty datasets with the nine statistical algorithms considered, using five-fold cross-validation.

NOTES

The authors declare no competing financial interest.

REFERENCES

[1] Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inform. 2010, 29, 476-488.

[2] Cherkasov, A.; Muratov, E.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R.; Consonni, V.; Kuz’min, V.E.; Cramer, R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977-5010.

[3] Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemometr. Intell. Lab. 2015, 145, 22-29.

[4] Netzeva, T.I.; Worth, A.P.; Aldenberg, T.; Benigni, R.; Cronin, M.T.D.; Gramatica, P.; Jaworska, J.S.; Kahn, S.; Klopman, G.; Marchant, C.A. Current status of methods for defining the applicability domain of (quantitative) structure–activity relationships. ATLA Altern. Lab. Anim. 2005, 33, 155-173.

[5] Veerasamy, R.; Rajak, H.; Jain, A.; Sivadasan, S.; Varghese, C.P.; Agrawal, R.K. Validation of QSAR Models-Strategies and Importance. International J. Drug Des. Disc. 2011, 2, 511-519.

[6] Golbraikh, A.; Muratov, E.; Fourches, D.; Tropsha, A. Data Set Modelability by QSAR. J. Chem. Inf. Model. 2014, 54, 1-4.

[7] Maggiora, G. M. On Outliers and Activity Cliffs–Why QSAR Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535-1535.

[8] Stumpfe, D.; Bajorath, J. Exploring Activity Cliffs in Medicinal Chemistry. J. Med. Chem. 2012, 55, 2932-2942.

[9] Stumpfe, D.; Hu, Y.; Dimova, D.; Bajorath, J. Recent Progress in Understanding Activity Cliffs and Their Utility in Medicinal Chemistry. J. Med. Chem. 2014, 57, 18-28.

[10] Vogt, M.; Huang, Y.; Bajorath, J. From Activity Cliffs to Activity Ridges: Informative Data Structures for SAR Analysis. J. Chem. Inf. Model. 2011, 51, 1848-1856.

[11] Hu, Y.; Bajorath, J. Extending the Activity Cliff Concept: Structural Categorization of Activity Cliffs and Systematic Identification of Different Types of Cliffs in the ChEMBL Database. J. Chem. Inf. Model. 2012, 52, 1806-1811.

[12] Guha, R. Exploring Uncharted Territories: Predicting Activity Cliffs in Structure-Activity Landscapes. J. Chem. Inf. Model. 2012, 52, 2181-2191.

[13] Luque Ruiz, I.; Gómez-Nieto, M.A. Study of the Datasets Modelability: Modelability, Rivality and Weighted Modelability Indexes. J. Chem. Inf. Model. (ci-2018-001888). Accepted for publication, August, 2018.

[14] Guha, R.; Van Drie, J.H. Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs. J. Chem. Inf. Model. 2008, 48, 646-658.

[15] Marcou, G.; Horvath, D.; Varnek, A. Kernel Target Alignment Parameter: A New Modelability Measure for Regression Tasks. J. Chem. Inf. Model. 2016, 56, 6-11.

[16] Golbraikh, A.; Fourches, D.; Sedykh, A.; Muratov, E.; Liepina, I.; Tropsha, A. Modelability Criteria: Statistical Characteristics Estimating Feasibility to Build Predictive QSAR Models for a Dataset. In Practical Aspects of Computational Chemistry III; Leszczynski, J., Shukla, M. K., Eds.; Chapter 7. Springer, 2014; pp. 187-230.

[17] Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative Structure–Property Relationship Approach Based on The k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185-194.

[18] Cassotti, M.; Ballabio, D.; Consonni, V.; Mauri, A.; Tetko, I.V.; Todeschini, R. Prediction of Acute Aquatic Toxicity Towards Daphnia Magna Using GA-kNN Method. ATLA Altern. Lab. Anim. 2014, 42, 31-41.

[19] Chen, H.; Carlsson, L.; Eriksson, M.; Varkonyi, P.; Norinder, U.; Nilsson, I. Beyond the Scope of Free-Wilson Analysis: Building Interpretable QSAR Models with Machine Learning Algorithms. J. Chem. Inf. Model. 2013, 53, 1324-1336.

[20] Cortés-Ciriano, I. Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR. J. Chem. Inf. Model. 2016, 56, 1576-1587.

[21] Chembench, Carolina Exploratory Center for Cheminformatics Research (CECCR). https://chembench.mml.unc.edu/. Last accessed February, 2018.

[22] The Binding Database. https://www.bindingdb.org/bind/index.jsp. Last accessed February, 2018.

[23] Dalby, A.; Nourse, J.G.; Hounshell, W.D.; Gushurt, A.K.I.; Grier, D.L.; Leland, B.A.; Laufer, J. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 245.

[24] Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html. Last accessed February, 2018.

[25] Yap, C.W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466-1474.

[26] Matlab software. MathWorks. https://www.mathworks.com/. Last accessed March, 2018.

[27] Statistics and Machine Learning Toolbox. Matlab R2017b. https://es.mathworks.com/programs/trials/trial_request.html?ref=ggl&s_eid=ppc_29742641962&q=matlab. Last accessed March, 2018.

[28] Namasivayam, V.; Iyer, P.; Bajorath, J. Prediction of Individual Compounds Forming Activity Cliffs Using Emerging Chemical Patterns. J. Chem. Inf. Model. 2013, 53, 3131-3139.

[29] Roy, K.; Ambure, P.; Kar, S.; Ojha, P.K. Is it Possible to Improve the Quality of Predictions from an “Intelligent” Use of Multiple QSAR/QSPR/QSTR models? J. Chemometrics. 2018, 32, e2992. https://doi.org/10.1002/cem.2992.

[30] Cook, R.D. Detection of Influential Observations in Linear Regression. Technometrics. 1977, 19, 15-18.

[31] Vogt, M.; Iyer, P.; Maggiora, G.M.; Bajorath, J. Conditional Probabilities of Activity Landscape Features for Individual Compounds. J. Chem. Inf. Model. 2013, 53, 1602-1612.
