Perspective
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00088 • Publication Date (Web): 24 May 2016
Journal of Chemical Information and Modeling
A Historical Excursus on the Statistical Validation Parameters for QSAR Models: a Clarification Concerning Metrics and Terminology
Paola Gramatica* and Alessandro Sangion
QSAR Research Unit in Environmental Chemistry and Ecotoxicology, Department of Theoretical and Applied Sciences, University of Insubria, 21100, Varese, Italy
E-mail: [email protected]; http://www.qsar.it; Tel: +39-0332-421573
ABSTRACT
In recent years, external validation of QSAR models has been the subject of intensive debate in the scientific literature. Different groups have proposed different metrics in search of “the best” parameter to characterize the external predictivity of a QSAR model. This editorial summarizes the history of parameter development for the external validation of QSAR models and suggests, once again, the concurrent use of several different metrics to assess the real predictive capability of QSAR models.

From Beware of Q2 (2002) to Beware of R2 (2015)…
The main purpose of this editorial is to summarize the major developments in the QSAR community in recent years that led to the various formulas of statistical parameters applied to the external validation of QSAR models. This clarification seems useful now mainly because a recent paper by Alexander et al. in JCIM, “Beware of R2: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models”,1 could cause some confusion in the QSAR community concerning nomenclature and formulas.

The external validation of models has been a highly significant and widely debated topic in QSAR modelling since the first observations of Oprea and Garcia2 and, above all, the fundamental paper of Golbraikh and Tropsha,3 “Beware of Q2”, which pointed out that a high Q2LOO from cross-validation is a necessary but not sufficient condition for assessing the external predictivity of models. In fact, the real predictive ability of a QSAR model can only be estimated using an external set of compounds never used in model development. Golbraikh and Tropsha formulated a series of criteria to ensure a high predictive power of the model: the squared correlation coefficient R2 of predicted versus experimental data for the external set should be close to 1, the R20 of the regression through the origin should be close to the unconstrained R2, and the slope of the regression through the origin should be close to 1.
It is interesting to note that some shortcomings of these criteria are identified in the recent paper by Alexander et al.1

In the following years, several authors proposed alternative statistical parameters to verify the true predictive performance of QSAR models,4–14 and of course the diversity of these metrics can appear somewhat confusing to those relatively new to the QSAR field, especially when the same formula is used under different names. However, the different proposed metrics are now well consolidated in the literature.9,12,15,16 Herein, we summarize the evolution of, and the important facts concerning, the major formulas used to characterize the predictive power of QSAR models for chemicals not used in model development.

First of all, it is useful to recall the canonical formulation of R2, the coefficient of determination, commonly used for multiple regression models to assess how well the model reproduces the data used for its development, i.e. the goodness of fit to the training set:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS}$$ [eq. 1]
where $y_i$ are the observed values of the response, $\bar{y}$ is the corresponding average, $\hat{y}_i$ are the calculated values, RSS is the residual sum of squares, i.e. the sum of squared differences between observed and
calculated response values over the training set, and TSS is the total sum of squares, i.e. the sum of squared deviations from the dataset mean. It is important to note that R2 is the square of the correlation coefficient between the original and modelled data values for multivariate regression models, while it equals the squared Pearson correlation coefficient of the dependent and explanatory variables only in a univariate linear least-squares regression.

It is fundamental to point out that the notation Q2, instead of R2, is commonly used to differentiate predictive performance (verified by Q2) from fitting (verified by R2). The coefficient of determination R2 uses $\hat{y}_i$ calculated by the model for each object used in model development, while in the formula for Q2, $\hat{y}_i$ are the values predicted by the model when the objects are not in the training set. Therefore, in evaluating model predictivity, the same formulation as R2 is named Q2 when the predicted values are used instead of the calculated ones, so the notation changes to:

$$Q^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{PRESS}{TSS}$$ [eq. 2]
where $y_i$ are the observed values of the response, $\bar{y}$ is the corresponding average, $\hat{y}_i$ are the values predicted for each object when it is not in the training set, and PRESS is the predicted residual sum of squares, i.e., the sum of squared differences between observed and predicted responses. This formula is applied when iterative cross-validation is used for the internal validation of QSAR models, to verify their stability/robustness by Leave-one-out (LOO) or Leave-more-out (LMO) perturbation. Values of Q2LOO and Q2LMO only slightly higher than 0.5 are not indicative of good QSAR models; in my opinion and practical experience, these parameters should also have values as close as possible to each other, as well as not too far from the R2 value for fitting (not less than 0.65), to guarantee high robustness of the models.5,17,18

Regarding external predictivity, one of the first validation parameters used to assess the external predictive performance of QSAR models, proposed by Shi et al.19 on the basis of eq. 2, had the following formulation, in which the sums run over the $n_{EXT}$ objects of the external set and $\bar{y}_{TR}$ is the training set mean:

$$Q^2_{F1} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{TR})^2} = 1 - \frac{PRESS}{TSS_{EXT}(\bar{y}_{TR})}$$ [eq. 3]
where $TSS_{EXT}(\bar{y}_{TR})$ is the total sum of squares of the external set calculated using the training set mean. This external predictivity parameter was reported in 2003 in the paper “The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models”.4 At that time it was applied in the software MOBYDIGS for QSAR modeling20 and it was also suggested in the guidance document on the OECD Principles for QSAR model validation.21,22 The label F1 for this Q2 formula was later proposed by Consonni et al.9 and is commonly used by several QSAR modellers. Later, the Roy group6,8,14 proposed a new parameter series, called r2m, similar to the Golbraikh and Tropsha concept.

In 2008, Schüürmann demonstrated that the formulation of Q2F1 yields too optimistic estimates of the model predictive power and tends to increase with an increasing difference between $\bar{y}_{TR}$ and $\bar{y}_{EXT}$; moreover, it is not applicable if information about the model training set is not available.7 Thus, Schüürmann proposed applying the following formula:

$$Q^2_{F2} = 1 - \frac{PRESS}{TSS_{EXT}(\bar{y}_{EXT})}$$ [eq. 4]
where $TSS_{EXT}(\bar{y}_{EXT})$ is the total sum of squares of the external set calculated using the external set mean; thus all the values in this equation are derived from the external set. This formulation, an alternative to Q2F1, was then named Q2F2 by Consonni et al.9 It is interesting to note that this is the same formula recently proposed by Alexander et al.1 and named R2 (eq. 1 in which all the values relate to the external set). This was also pointed out by Roy et al.23 Therefore, the recent proposal of Alexander et al.1 in our opinion adds confusion concerning the terminology instead of clarifying the topic.
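The difference between eq. 3 and eq. 4 lies only in the mean used for the external total sum of squares, which can be illustrated with a minimal Python sketch. The function names `press`, `q2_f1` and `q2_f2` and all numerical values are assumptions introduced here for illustration; they are not taken from any of the cited papers or software.

```python
from statistics import fmean


def press(y_obs, y_pred):
    """Sum of squared differences between observed and predicted responses."""
    return sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))


def q2_f1(y_ext, y_pred, y_train_mean):
    """Eq. 3: external total sum of squares taken around the training-set mean."""
    tss = sum((y - y_train_mean) ** 2 for y in y_ext)
    return 1.0 - press(y_ext, y_pred) / tss


def q2_f2(y_ext, y_pred):
    """Eq. 4: external total sum of squares taken around the external-set mean."""
    y_ext_mean = fmean(y_ext)
    tss = sum((y - y_ext_mean) ** 2 for y in y_ext)
    return 1.0 - press(y_ext, y_pred) / tss


# Hypothetical external set whose mean (5.0) lies above the training mean (3.0).
y_ext = [4.0, 5.0, 6.0]
y_pred = [4.5, 5.0, 5.5]

f1 = q2_f1(y_ext, y_pred, y_train_mean=3.0)
f2 = q2_f2(y_ext, y_pred)
print(f"Q2F1 = {f1:.3f}, Q2F2 = {f2:.3f}")
```

With these made-up values the script prints Q2F1 = 0.964 and Q2F2 = 0.750. The sum of squared deviations of the external responses around the training mean exceeds the sum around their own mean by $n_{EXT}(\bar{y}_{TR} - \bar{y}_{EXT})^2$, so for the same predictions Q2F1 can never be lower than Q2F2, which is consistent with the optimistic behaviour noted by Schüürmann.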
In 2010 Tropsha, in his paper “Best Practices for QSAR Model Development, Validation, and Exploitation”,11 reaffirmed the criteria for external validation of predictive QSAR models.3 Moreover, he suggested using the coefficient of determination for the external validation set, computed from all the predicted values for the external set: therefore this formulation, called R2abs by Tropsha, like the R2 of Alexander et al.,1 corresponds to the formula of Q2F2 proposed by Schüürmann7 (eq. 4).

In the same years, the Todeschini group gave an overview of the above external validation parameters and commented on some drawbacks of the Q2F1 and Q2F2 formulations.9,10 They demonstrated, mathematically and by extensive simulations, that the Q2F2 formulation does not account for any information about the reference model training, since $\bar{y}_{EXT}$ only encodes information derived from the external set, and therefore it is not independent of the external set composition. “If external set data are not uniformly distributed over the range of the training set, both Q2F1 and Q2F2 suffer from some drawbacks”.9 To overcome these problems they proposed the following formula for Q2F3:

$$Q^2_{F3} = 1 - \frac{\left[\sum_{i=1}^{n_{EXT}} (y_i - \hat{y}_i)^2\right] / n_{EXT}}{\left[\sum_{i=1}^{n_{TR}} (y_i - \bar{y}_{TR})^2\right] / n_{TR}} = 1 - \frac{PRESS / n_{EXT}}{TSS / n_{TR}}$$ [eq. 5]

where $n_{EXT}$ and $n_{TR}$ are the numbers of objects in the external and training sets, respectively.
This formulation makes Q2F3 independent of the distribution and size of the external set. Moreover, according to the authors, Q2F3 perfectly reproduces the ranking given by the root-mean-square error of the external set (RMSEP).

In two subsequent papers, our group at Insubria compared all these parameters in different situations of realistic and extreme data by means of simulations, and proposed12,13 calculating the Concordance Correlation Coefficient (CCC) of Lin24,25 for QSAR models (given below in the notation commonly used for QSAR models) to verify simply whether the predicted values match the observed values, i.e. an assessment of agreement:

$$CCC = \frac{2 \sum_{i=1}^{n_{EXT}} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y})^2 + \sum_{i=1}^{n_{EXT}} (\hat{y}_i - \bar{\hat{y}})^2 + n_{EXT} (\bar{y} - \bar{\hat{y}})^2}$$ [eq. 6]

where $\bar{y}$ and $\bar{\hat{y}}$ are the averages of the observed and predicted values, respectively.
This coefficient measures both precision (how far the observations are from the regression line) and accuracy (how far the regression line deviates from the line of slope 1 passing through the origin, i.e., the diagonal concordance line). Thus, CCC quantifies the similarity of the predicted and experimental values (the agreement) as a single criterion, achieving the same goal as the Golbraikh and Tropsha method, which, however, requires several conditions to be met.

In those studies, we compared the behaviour of all the parameters considered in this paper (eqs 1-6), not just in a few cases but in extensive simulation exercises, including different extreme situations, even if uncommon in QSAR data distributions. Such simulations, similar to the studies performed by Lin to identify cases in which a statistical parameter fails to detect non-reproducibility of data,24 are always necessary to verify the stability and statistical reliability of the compared parameters in general, and not only for particular and more common QSAR situations. A statistical parameter must give reliable results also when it is assessed in various extreme situations.

We verified some drawbacks of Q2F1, Q2F2 and r2m in some situations. In particular, r2m,6,8,14 in cases of specific distributions of the modelled data (location and location plus scale shifts, which are systematic data biases), and Q2F1 and Q2F2, in cases of scale shift, could give less reliable and too optimistic results for the real predictivity of QSAR models, also giving different values for the same RMSE depending on the side of the bias in the data, while CCC and Q2F3 are the most reliable and stable parameters in all the studied situations.
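As a rough illustration of eqs 5 and 6, the following Python sketch computes Q2F3 and CCC for a small invented data set in which every prediction is biased by the same amount (a pure location shift). The function names `q2_f3` and `ccc` and the data are assumptions made for this example only; it is not a reproduction of the simulations described above.

```python
from statistics import fmean


def q2_f3(y_ext, y_pred, y_train):
    """Eq. 5: mean squared prediction error over mean squared training deviation."""
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_pred))
    y_train_mean = fmean(y_train)
    tss_train = sum((y - y_train_mean) ** 2 for y in y_train)
    return 1.0 - (press / len(y_ext)) / (tss_train / len(y_train))


def ccc(y_obs, y_pred):
    """Eq. 6: Lin's concordance correlation coefficient."""
    n = len(y_obs)
    mean_obs, mean_pred = fmean(y_obs), fmean(y_pred)
    s_xy = sum((y - mean_obs) * (p - mean_pred) for y, p in zip(y_obs, y_pred))
    s_xx = sum((y - mean_obs) ** 2 for y in y_obs)
    s_yy = sum((p - mean_pred) ** 2 for p in y_pred)
    return 2.0 * s_xy / (s_xx + s_yy + n * (mean_obs - mean_pred) ** 2)


# Hypothetical data: every prediction is 1 unit too high (a pure location shift).
y_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_ext = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [y + 1.0 for y in y_ext]

print(f"CCC  = {ccc(y_ext, y_pred):.3f}")
print(f"Q2F3 = {q2_f3(y_ext, y_pred, y_train):.3f}")
```

For these invented values the script prints CCC = 0.800 and Q2F3 = 0.750. Because the predictions differ from the observations only by a constant, their squared correlation coefficient would be exactly 1, whereas both CCC and Q2F3 are lowered by the systematic bias.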
Moreover, we also defined inter-comparable values of all these parameters, considering different degrees of accuracy and precision, and proposed threshold values for the acceptability of predictive models.13 Other important parameters that must also be considered, in addition to those discussed above, for the selection of good predictive models are the Root Mean Squared Error (RMSE), as already pointed out by all the above cited authors including Alexander et al.,1 and the Mean Absolute Error (MAE), as recently proposed by Roy et al.23
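For reference, the two error measures can be sketched in a few lines of Python. The function names `rmse` and `mae` and the two sets of predictions are hypothetical and serve only to show how the measures respond to the distribution of the errors.

```python
from math import sqrt


def rmse(y_obs, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_obs, y_pred)) / len(y_obs))


def mae(y_obs, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(y - p) for y, p in zip(y_obs, y_pred)) / len(y_obs)


# Hypothetical observations and two hypothetical sets of predictions.
y_obs = [1.0, 3.0, 5.0, 9.0]
pred_even = [3.0, 5.0, 7.0, 11.0]     # absolute errors 2, 2, 2, 2
pred_outlier = [1.0, 3.0, 5.0, 13.0]  # absolute errors 0, 0, 0, 4

for name, pred in (("even", pred_even), ("outlier", pred_outlier)):
    print(f"{name}: RMSE = {rmse(y_obs, pred):.2f}, MAE = {mae(y_obs, pred):.2f}")
```

Both sets of predictions give RMSE = 2.00, but MAE is 2.00 for the evenly spread errors and 1.00 when the whole error is concentrated in one object, because squaring the residuals gives larger errors a higher weight in RMSE.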
As we stated earlier in the paper “Principles of QSAR models validation: Internal and external”:5 “It is important to note that RMSE values must not only be low, but also as similar as possible for the training, CV and external prediction sets: this suggests that the proposed model has both predictive ability (low values) as well as sufficient generalizability (similar values)”. However, it is also necessary to highlight that, while Q2 parameters allow the comparison of the predictivity of models developed for different endpoints, RMSE depends on the measurement scale of the endpoint values; it is therefore useful mainly for comparing the quality of models developed for the same endpoint and should not be used to compare models for different endpoints.9 To avoid this problem, normalization or scaling by the response range or the response mean could be applied.23

Regarding MAE, Chai and Draxler26 pointed out that, for several sets of errors with the same RMSE, MAE would vary from set to set. Additionally, they stated that, while MAE is not able to adequately reflect some larger errors, RMSE seems better at revealing differences in model performance because it gives higher weights to the larger errors. Roy has recently proposed using MAE-based criteria, while simultaneously considering the data range and the dispersion of errors.23 Thus, in my opinion, as these two parameters are not equivalent26 and there is no consensus on the superiority of one over the other,23,26 they should be verified together.

Having summarized the history of the external validation parameters in QSAR, it is now evident that the R2 calculated for the test data, proposed by Alexander et al. (eq. 1 with external data) as a simple and unambiguous way to verify external predictivity, is precisely the same metric that many QSAR modellers call Q2F2 (eq. 4), which was proposed by Schüürmann et al.7 as early as 2008.
However, since this parameter has known drawbacks when the external set is composed of objects with particular distributions9,10 or when there is a scale shift in the modelled data,9,12,13 it can certainly be considered a simple parameter, but probably not an unambiguous one.

To its credit, the paper of Alexander et al.1 highlighted the important point, already underlined by all the previously cited authors,3,7,9,13 that R2 (the squared correlation coefficient) derived from the simple regression between the observed and predicted values for the external set data, which we named R2EXT,13 is not a suitable indicator of the predictivity of any QSAR model, since it simply gives information about the correlation between two data arrays, i.e. “the degree to which its predictions are correlated with the observations”.1 However, it could be useful if the aim is to rank or prioritize chemicals in a data set.

The parameters for external validation of QSAR models cited and discussed above are all summarized in Table 1.

Table 1: Parameters for external validation of QSAR models (adapted with permission from ref. 13. Copyright 2012 ACS)

Golbraikh and Tropsha criteria: $R^2 \approx 1$, $R_0^2 \approx R^2$, $k \approx 1$, $k' \approx 1$ ($R^2$: squared correlation coefficient between observed and predicted values of the external set; $R_0^2$: coefficient of determination of the regression through the origin; $k$, $k'$: slopes of the regressions through the origin)