Anal. Chem. 2000, 72, 4675-4676
Improved Computation of the Standard Error in the Regression Coefficient Estimates of a Multivariate Calibration Model
Nicolaas (Klaas) M. Faber
Dunantsingel 28, 2806 JB Gouda, The Netherlands
A multivariate calibration model consists of regression coefficient estimates whose significance depends on the associated standard errors. A recently introduced leave-one-out (LOO) method for computing these standard errors is modified to achieve consistency with the jackknife method. The proposed modification amounts to multiplying the LOO standard errors by the factor $(n - 1)/n^{1/2}$, where n denotes the number of calibration samples. The potential improvement for realistic values of n is illustrated using a practical example.

Usually, the goal of multivariate calibration is to model the relationship between a property of interest, which is obtained using a time-consuming or laborious reference method, and a set of predictor variables, which are measured with relatively little effort. An example is the prediction of octane number from a near-infrared (NIR) spectrum. Having a good model enables one to replace the reference method by the multivariate technique. Unfortunately, constructing a good model is a nontrivial task, unless the multivariate technique is customized for the specific application at hand.

A multivariate calibration model consists of regression coefficient estimates. It stands to reason that the predictive ability of a model is related to the significance of these estimates, which in turn depends on the associated standard errors. Following this line of reasoning, Centner et al. (1) developed a method for eliminating uninformative predictor variables by ranking the coefficient estimates according to their ratio to the associated standard error. They proposed a leave-one-out (LOO) method for computing the standard errors and obtained results that compared favorably with those obtained using the full NIR spectra.
Although their study focused on the regression coefficient estimates obtained using partial least squares (PLS), the underlying reasoning implies that eliminating uninformative predictor variables should be equally useful for principal component regression (PCR) or other related methods.

The purpose of this paper is to show that the LOO method proposed by Centner et al. can be improved by multiplying the standard errors by the factor $(n - 1)/n^{1/2}$, where n denotes the number of calibration samples. As explained below, owing to a special precaution, none of the conclusions arrived at by Centner et al. is affected by the proposed correction factor. In other situations, however, the modification is likely to play an important role.

(1) Centner, V.; Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B. G. M.; Sterna, C. Anal. Chem. 1996, 68, 3851-3858.

10.1021/ac0001479 CCC: $19.00
© 2000 American Chemical Society
Published on Web 08/24/2000

STANDARD ERROR IN REGRESSION COEFFICIENT ESTIMATES
It is assumed that the calibration data are related as
$$ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{1} $$
where y (n × 1) contains the true property values, X (n × p) is the true predictor matrix, b (p × 1) is the true regression vector, e is a vector of residuals (n × 1), n is the number of calibration samples, and p is the number of variables. Usually, the true property values and predictor variables are unknown, and the model is constructed from calibration data that carry measurement errors. Centner et al. have developed a procedure for eliminating uninformative variables that is based on ranking them according to the ratios,
$$ c_j = b_j / s(b_j) \quad \text{for } j = 1, \ldots, p \tag{2} $$
where $b_j$ denotes the coefficient estimate for the jth variable and $s(b_j)$ is its estimated standard error. Centner et al. refer to these ratios as the reliability of the regression coefficient estimates. The underlying statistical consideration is that the value of a regression coefficient can be relied on only if it is significantly larger than its standard error. They proposed to determine the $s(b_j)$'s as follows. Leaving out one calibration sample at a time (i = 1, ..., n) yields n reduced data sets for which regression vectors are computed. Estimates of the standard errors are obtained as

$$ s_{\mathrm{LOO}}(b_j) = \left( \sum_{i=1}^{n} (b_{ij} - \bar{b}_j)^2 / (n - 1) \right)^{1/2} \quad \text{for } j = 1, \ldots, p \tag{3} $$
where $b_{ij}$ denotes the regression coefficient estimate for the jth wavelength when leaving out calibration sample i and $\bar{b}_j$ is the mean of the $b_{ij}$'s. Centner et al. referred to this LOO procedure as the jackknife method. However, close examination shows that eq 3 does not yield jackknife standard errors because the latter are derived from the spread in so-called jackknife pseudovalues (2),

$$ b^{\mathrm{pseudo}}_{ij} = n b_j - (n - 1) b_{ij} \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, p \tag{4} $$

Analytical Chemistry, Vol. 72, No. 19, October 1, 2000 4675

One of the conjectures underlying the jackknife method is that a pseudovalue $b^{\mathrm{pseudo}}_{ij}$ (i = 1, ..., n) has approximately the same variance as $n^{1/2} b_j$. Thus, the jackknife standard errors follow as

$$ s_{\mathrm{JK}}(b_j) = \left( \sum_{i=1}^{n} (b^{\mathrm{pseudo}}_{ij} - \bar{b}^{\mathrm{pseudo}}_j)^2 / (n(n - 1)) \right)^{1/2} \quad \text{for } j = 1, \ldots, p \tag{5} $$

where $\bar{b}^{\mathrm{pseudo}}_j$ is the mean of the $b^{\mathrm{pseudo}}_{ij}$'s. Alternatively, the jackknife standard errors can be expressed as (2)

$$ s_{\mathrm{JK}}(b_j) = \left( \sum_{i=1}^{n} (b_{ij} - \bar{b}_j)^2 (n - 1)/n \right)^{1/2} \quad \text{for } j = 1, \ldots, p \tag{6} $$
and combining eqs 3 and 6 yields

$$ s_{\mathrm{JK}}(b_j) = \frac{n - 1}{n^{1/2}} \, s_{\mathrm{LOO}}(b_j) \quad \text{for } j = 1, \ldots, p \tag{7} $$
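The relation between the LOO and jackknife standard errors can be checked numerically. The following is a minimal sketch, not the procedure used in the paper: it assumes a straight-line least-squares fit as a stand-in for PLS or PCR, and the data values and the function names (`fit_line`, `loo_coefficients`, `s_loo`, `s_jk`) are invented for illustration. The agreement it shows follows from the fact that each pseudovalue deviates from its mean by exactly -(n - 1) times the deviation of the corresponding leave-one-out estimate.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x (illustrative stand-in for PLS/PCR)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return [my - b1 * mx, b1]


def loo_coefficients(x, y, fit=fit_line):
    """Refit with each calibration sample left out in turn (n reduced data sets)."""
    return [fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]


def s_loo(b_loo):
    """Eq 3: standard deviation of the leave-one-out estimates, per coefficient."""
    n = len(b_loo)
    result = []
    for j in range(len(b_loo[0])):
        column = [b[j] for b in b_loo]
        mean = sum(column) / n
        result.append(math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1)))
    return result


def s_jk(b_full, b_loo):
    """Eqs 4 and 5: standard error from the spread in the pseudovalues."""
    n = len(b_loo)
    result = []
    for j in range(len(b_full)):
        pseudo = [n * b_full[j] - (n - 1) * b[j] for b in b_loo]
        mean = sum(pseudo) / n
        result.append(
            math.sqrt(sum((v - mean) ** 2 for v in pseudo) / (n * (n - 1)))
        )
    return result


# Made-up calibration data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]

n = len(x)
b_full = fit_line(x, y)
b_loo = loo_coefficients(x, y)
loo = s_loo(b_loo)
jk = s_jk(b_full, b_loo)
factor = (n - 1) / math.sqrt(n)  # correction factor of eq 7

for j in range(len(b_full)):
    # The last two columns should agree to rounding error.
    print(j, loo[j], jk[j], factor * loo[j])
```

Any regression method could be passed as `fit`, provided it returns the coefficient estimates as a list; the factor in eq 7 does not depend on how the coefficients are estimated.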
The proposed correction factor is implied by eq 7.

(2) Shao, J.; Tu, D. The Jackknife and Bootstrap; Springer: New York, 1995.

PRACTICAL EXAMPLE
It is illustrative to look at the reliabilities reported by Centner et al. They recognized that an objective cutoff value for the reliabilities is required to determine whether a variable should be eliminated. This value is determined by extending the predictor matrix with artificial (i.e., pure random) variables. The motivation for doing so is that informative variables should distinguish themselves from the artificial ones by having a significantly larger reliability, which is intuitively appealing. Several objective decision rules are conceivable. Centner et al. investigated a cutoff value equal to the maximum absolute value for the artificial variables and the value that corresponds to a certain quantile (say, 95 or 99%).

For the data set POLY-DAT, Centner et al. reported a 95% quantile corresponding to an absolute reliability of 9 (Figure 4d in ref 1). However, ideally, the reliabilities of the artificial variables should be distributed approximately as Student's t with n degrees of freedom. Thus, the ideal value is 2.06 (n = 26). Clearly, the LOO method has led to a value that is too high. Dividing the reported value by the proposed correction factor, $(n - 1)/n^{1/2} = (26 - 1)/5.1 = 4.9$, leads to the value 1.8, which is much closer to the ideal value.

It is emphasized that none of the conclusions arrived at by Centner et al. is affected by the proposed correction factor. Since the factor is the same for all regression coefficients, the cutoff value is affected in the same way; hence, the ranking of the variables, as well as the eliminated portion, remains unaltered. In the context of the study of Centner et al., the only benefit of the proposed correction factor may be that it enhances the interpretability of the results.

Finally, it is noted that the observed improvement suggests a procedure for determining the cutoff value without using artificial variables. This is understood as follows. If the jackknife reliabilities are indeed "more or less" distributed as Student's t, one could vary the cutoff value between "obviously" nonsignificant (say, 1) and "obviously" significant (say, 5). The optimum value could be determined, for example, by minimizing the cross-validated prediction error (3). This procedure will not work with the LOO reliabilities, since these depend strongly on the number of calibration samples (n). Avoiding the use of artificial variables is attractive because they contribute to the constructed model and thereby indirectly affect which variables are eliminated.
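The arithmetic of the POLY-DAT example, and the dependence of the LOO reliabilities on n, can be retraced in a few lines. This is a small sketch: the function name `correction_factor` and the additional sample sizes are chosen here for illustration, while n = 26 and the reported reliability of 9 are taken from the text. Because the reliability is the ratio of a coefficient to its standard error, multiplying the standard error by the factor divides the reliability by it.

```python
import math


def correction_factor(n):
    """Factor (n - 1)/n^(1/2) that converts LOO standard errors to jackknife ones."""
    return (n - 1) / math.sqrt(n)


n = 26                 # number of calibration samples in POLY-DAT
reported_cutoff = 9.0  # 95% quantile of the LOO reliabilities reported by Centner et al.

factor = correction_factor(n)
corrected_cutoff = reported_cutoff / factor

print(round(factor, 1))            # 4.9
print(round(corrected_cutoff, 1))  # 1.8

# The factor grows with n, so a LOO-based cutoff is not comparable across
# calibration sets of different size (sample sizes below are illustrative).
for size in (10, 26, 100):
    print(size, round(correction_factor(size), 1))
```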
In addition, determining the cutoff value in the same way as the optimum model dimensionality is appealing from a practical point of view.

Received for review February 7, 2000. Accepted July 5, 2000.
AC0001479

(3) Martens, H.; Næs, T. Multivariate Calibration; Wiley: Chichester, 1989.