Anal. Chem. 2005, 77, 1423-1431
Boosting Partial Least Squares

M. H. Zhang,† Q. S. Xu,‡ and D. L. Massart*,†
ChemoAC, Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium, and School of Mathematical Science and Computing Technology, Central South University, Changsha, 410083 P.R. China
A difficulty when applying partial least squares (PLS) in multivariate calibration is that overfitting may occur. This study proposes a novel approach that combines PLS and boosting; the latter is said to be resistant to overfitting. The proposed method, called boosting PLS (BPLS), combines a set of shrunken PLS models, each with only one PLS component. The method is iterative: the models are constructed on the basis of the residuals of the responses that are not explained by previous models. Unlike classical PLS, BPLS does not need to select an adequate number of PLS components to include in the model. On the other hand, two parameters must be determined: the shrinkage value and the iteration number. Criteria are proposed for these two purposes. BPLS was applied to seven real data sets, and the results demonstrate that it is more resistant than classical PLS to overfitting without losing accuracy.

Partial least squares (PLS) was developed and introduced into chemistry by Wold and Martens in 1983.1 Numerous successful analytical applications have made PLS a popular method in multivariate calibration. However, some problems remain. A decision about the suitable number of PLS components is necessary, and this requires an experienced analyst. Too few components lead to underfitting, so that the prediction is not adequate because the information extracted by the model is not enough to explain the data, whereas too many components lead to overfitting; that is, the model cannot be generalized to new data that did not contribute to the model construction. To obtain a PLS version in which the correct determination of the number of PLS components would be less important, we previously proposed a weighted averaged PLS (APLS) method.2 Compared with the classical PLS method, APLS has a comparable prediction ability, and it is also relatively robust to overfitting. Nevertheless, APLS loses somewhat in accuracy when the PLS components included in the averaging process cause high levels of overfitting or underfitting.

Recently, a new method called boosting was brought to our attention.
† Pharmaceutical Institute.
‡ Central South University.
* Corresponding author. Tel: +32-2-477-4734. Fax: +32-2-477-4735. E-mail: [email protected].
(1) Wold, S.; Martens, H.; Wold, H. The multivariate calibration method in chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils; Ruhe, A., Kågström, B., Eds.; Lecture Notes in Mathematics; Springer-Verlag: Heidelberg, 1983; pp 286-293.
(2) Zhang, M. H.; Xu, Q. S.; Massart, D. L. Anal. Chim. Acta 2004, 504, 279-289.
Boosting should allow both accuracy improvement and resistance to overfitting. It originated from the field of machine learning3,4 and was first proposed by Schapire in 1990.5 It is a method that combines a group of weak learners (classifiers or fitting methods) to obtain a powerful learner. Boosting is used to solve classification and regression problems, and many versions of boosting have been developed.6 Up to now, boosting has been successfully applied to classification problems, but not to linear regression problems.

The history of boosting in regression can be divided into two stages. In 1997, Freund and Schapire first theoretically proposed an AdaBoost.R algorithm for regression problems.7 The most important contribution of this method is the majority-vote idea, which combines a group of weighted learners, the weights of which are defined by the accuracy of the learners. Though some other versions were developed later,8-10 those earlier versions are mainly based on this idea. Most of them are restricted to nonlinear learners, and the regression problems have to be transformed into classification problems. An exception is a modification of AdaBoost.R provided by Drucker; this was also the first actual experiment using boosting regression models.11 The advantage of this method is its ad hoc ability; i.e., any learner, whether linear or nonlinear, can be incorporated. We applied it in multivariate calibration for NIR (near-infrared spectroscopy) data, but the results were not satisfactory.

More recently, Friedman explained that boosting could be viewed as a gradient additive model12 and suggested a method called the "stochastic gradient model".13 This was the start of the second stage of boosting in regression. The gradient additive form of boosting is a combination of a set of learners. These learners are constructed through iterative steps, always using the same basic learning algorithm. In each step, a new learner is established by relating the predictors X to the residuals of the responses y that are not fitted by the previous learners. Soon after, Bühlmann et al. proposed an L2-boost method based on a functional gradient descent algorithm with the L2-loss function.14

(3) Valiant, L. G. Commun. ACM 1984, 27 (11), 1134-1142.
(4) Kearns, M.; Valiant, L. J. ACM 1994, 41 (1), 67-95. See also Proceedings of the 21st ACM Symposium on the Theory of Computing; ACM Press: 1989; pp 433-444.
(5) Schapire, R. E. Mach. Learning 1990, 5 (2), 197-227.
(6) Friedman, J.; Hastie, T.; Tibshirani, R. Ann. Stat. 2000, 28 (2), 337-407.
(7) Freund, Y.; Schapire, R. E. J. Comput. Syst. Sci. 1997, 55, 119-139.
(8) Ridgeway, G.; Madigan, D.; Richardson, T. Boosting methodology for regression problems. In Proc. Artif. Intell. Stat. '99; Heckerman, D., Whittaker, J., Eds.; 1999; pp 152-161.
(9) Duffy, N.; Helmbold, D. Mach. Learning 2002, 47 (2-3), 153-200.
(10) Borra, S.; Di Ciaccio, A. Comput. Stat. Data Anal. 2002, 38 (4), 407-420.
(11) Drucker, H. Improving regressors using boosting techniques. In Proc. 14th Int. Conf. Mach. Learning; Nashville, TN, July 8-12, 1997; pp 107-115.
(12) Friedman, J. H. Ann. Stat. 2001, 29 (5), 1189-1232.
(13) Friedman, J. H. Comput. Stat. Data Anal. 2002, 38 (4), 367-378.
They applied it to nonlinear learners. PLS is not strictly a linear method,15,16 and it can be considered as a conjugate gradient descent method.16,17 PLS works similarly to boosting in that it also contains stepwise iterations for adding new learners. The difference is that PLS considers residuals from both X and y, whereas boosting considers only residuals from y.18 These studies led us to investigate the possibility of combining PLS and boosting, in the hope that this would lead to better resistance to overfitting without losing accuracy. The method proposed here is verified by applying it to several, mainly analytical, data sets.

THEORY

PLS (Partial Least Squares). Suppose a data set {X, y}, where X is an n × p matrix that contains p predictor variables for n samples, and y is the corresponding dependent variable vector of size n × 1. PLS relates y to X through latent variables (PLS components):
X = t1p′1 + t2p′2 + ... + tap′a + Ea    (1)

y = t1q1 + t2q2 + ... + taqa + ka    (2)

where ta is the score vector (n × 1), pa is the loading vector (p × 1) of X, and q = (q1, q2, ..., qa) is the loading vector (a × 1) of y. Ea and ka are the residuals of X and y, respectively. The standard PLS procedure is based on the NIPALS (nonlinear iterative partial least squares) algorithm.19-23 If na PLS components are used, then for a = 1, 2, ..., na, define

wa = E′a-1ka-1    (3)

where wa is the weight vector with size p × 1, and

ta = Ea-1wa    (4)

Note that ta is based on the residuals of both X and y. The residuals are then deflated,

Ea = Ea-1 - tap′a,  ka = ka-1 - taqa    (5)

where the loadings are determined by least-squares fitting:

qa = k′a-1ta/t′ata    (6)

pa = E′a-1ta/t′ata    (7)

When a PLS components are used, the regression coefficients are

ba = Wa(P′aWa)-1qa    (8)

b0a = ȳ - x̄ba    (9)

where Wa = (w1, w2, ..., wa) is the weight matrix and Pa = (p1, p2, ..., pa) is the loading matrix of X when a PLS components are used, and ȳ and x̄ are the mean of y and the mean row of X. The prediction of a sample i can be written as

ŷi,a = xiba + b0a    (10)

where ba is the coefficient vector, b0a is the offset when a PLS components are used, and xi is the predictor vector (1 × p) of sample i.

(14) Bühlmann, P.; Yu, B. J. Am. Stat. Assoc. 2003, 98, 324-339.
(15) Frank, I. E.; Friedman, J. H. Technometrics 1993, 35, 109-135.
(16) Friedman, J. H.; Popescu, B. E. Gradient Directed Regularization; Department of Statistics, Stanford University, 2004; http://www-stat.stanford.edu/~jhf/pathlite.pdf.
(17) Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. J., III. SIAM J. Sci. Stat. Comput. 1984, 5, 735-742.
(18) Betzin, J. PLS-regression in the boosting framework. In Proc. 3rd Int. Symp. PLS Relat. Methods; Lisbon, Portugal, September 15-17, 2003; pp 261-269.
(19) Wold, H. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis; Krishnaiah, P. R., Ed.; Academic Press: New York, 1966; pp 391-420.
(20) Martens, H. Chemom. Intell. Lab. Syst. 2001, 58, 85-95.
(21) Martens, H.; Næs, T. Multivariate Calibration; Wiley: Chichester, 1989; pp 116-165.
(22) Helland, I. S. Chemom. Intell. Lab. Syst. 2001, 58, 97-107.
(23) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1-17.

In APLS,2 if the first a PLS components are considered, the prediction of a sample i is the average of the predictions of the first a PLS models, each containing 1, 2, ..., a PLS components, respectively:

ŷi,a = xi((b1 + ... + ba)/a) + (b01 + ... + b0a)/a

The prediction ability of the model can be evaluated in terms of the RMSECV (root mean square error of cross validation) of the training data, which is defined as

RMSECVa = sqrt(∑i=1..n (yi - ŷ\i,a)²/n)    (11)

where ŷ\i,a is the prediction of yi when the model is constructed without sample i. The performance of the model can also be evaluated by means of the RMSEP (root mean square error of prediction) using a test set that was not involved in the model construction. The RMSEP is calculated as

RMSEPa = sqrt(∑i=1..nt (yi - ŷi,a)²/nt)    (12)

where nt is the number of samples of the test set.
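To make the recursion concrete, the following sketch implements eqs 3-10 for a single response in Python with NumPy. The function and variable names are ours, not from the original paper, and the weight vectors are normalized for numerical stability (eq 8 is unaffected by this scaling).

```python
import numpy as np

def pls_nipals(X, y, n_components):
    """One-response PLS via the NIPALS recursion of eqs 3-10 (a sketch).
    E is the X-residual and k the y-residual, as in the text."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                       # E_0: centered X
    k = y - y_mean                       # k_0: centered y
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = E.T @ k                      # eq 3: weight vector
        w /= np.linalg.norm(w)
        t = E @ w                        # eq 4: score vector
        tt = t @ t
        q = (k @ t) / tt                 # eq 6: loading of y
        p = (E.T @ t) / tt               # eq 7: loading of X
        E -= np.outer(t, p)              # eq 5: deflate the X residual
        k -= t * q                       #        and the y residual
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    b = W @ np.linalg.solve(P.T @ W, Q)  # eq 8: regression coefficients
    b0 = y_mean - x_mean @ b             # eq 9: offset
    return b, b0

def rmsep(y_true, y_pred):
    """Root mean square error of prediction (eq 12)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Prediction for a test sample (eq 10): y_hat = x_i @ b + b0
```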
Boosting as an Additive Model. The basic idea of boosting as an additive model is to sequentially construct additive regression models by fitting a basic learner to the current residuals that are not fitted by the previous models.12 Boosting in regression tries to find an F(x) that minimizes a loss function (such as the squared-error loss function indicated in eq 14) between y and F(x). F(x) is estimated by means of an additive expansion,

Fˆ(x) = ∑m=1..M fm(x)    (13)

where

fm(x) = argminf(|y - Fˆm-1(x) - f(x)|²)    (14)

Here, argmin(*) is the function that finds the optimal f(x) minimizing (*), and m is the number of the f(x) added. The prediction for sample i is

ŷi = ŷi,1 + ŷi,2 + ... + ŷi,M = ∑m=1..M ŷi,m    (15)
Fitting the training data too closely may result in bad prediction due to overfitting; therefore, a too large value for M should be avoided. Copas suggested that using shrinkage could lead to better results than limiting the size of M.24 Friedman then suggested using a shrinkage parameter 0 < v ≤ 1 to control the learning rate of the procedure, that is,

Fˆ(x) = ∑m=1..M vfm(x)    (16)

However, there is no absolute rule to define M and v. A generic sketch of this shrunken additive scheme is given below.
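The following sketch implements eqs 13-16 for an arbitrary base learner with a fixed shrinkage v; fit_base is a hypothetical callable of ours, not something defined in the paper.

```python
import numpy as np

def l2_boost(X, y, fit_base, n_iter=100, v=0.1):
    """Additive L2-boosting with shrinkage (eqs 13-16), a sketch.
    fit_base(X, r) must fit one weak learner to the residual r and
    return a function that predicts on new data."""
    r = y.astype(float) - y.mean()      # start from the centered response
    learners = []
    for _ in range(n_iter):
        f = fit_base(X, r)              # eq 14: fit the current residual
        r -= v * f(X)                   # shrunken update of the residual
        learners.append(f)
    def F(X_new):                       # eqs 13/16: sum of shrunken learners
        return y.mean() + v * sum(f(X_new) for f in learners)
    return F
```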
BPLS (Boosting PLS). The BPLS proposed here combines PLS and boosting. Initially, a PLS model with one component is built by fitting the response y with the predictors X, and the initial residual of y is calculated. From then on, BPLS sequentially adds a series of PLS models, each with one PLS component. Unlike classical PLS, which fits the residual of y with the residual of X, BPLS considers only the residual of y; i.e., it always fits the residual of y with the original X. Each constructed model is shrunk by a shrinkage parameter v, which is a positive value (0 < v ≤ 1). The final prediction is the sum of the shrunken predictions of the individual models,

ŷ = v1ŷ1 + v2ŷ2 + v3ŷ3 + ... + vMŷM = ∑m=1..M vm(Xbm + b0m)    (19)

where bm and b0m are the coefficient vector and the offset of the mth one-component PLS model.

In the variant with a changeable shrinkage value, each calibration sample i carries a weight wi,m that is updated according to its loss in iteration m. Equation 22 ensures that samples with higher loss values receive higher weights; thus, in the next iteration, for the same loss made, they contribute more to the total loss value than samples with lower weights. The weights are normalized as

wi,m+1 = wi,m+1/∑j=1..n wj,m+1    (23)

As indicated in eq 24, the total loss Lm of a model determines the extent to which the next model is used:

vm+1 = 1 - Lm    (24)

This means that an individual model with a smaller loss value leads to a next model that is used to a greater extent.
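A minimal sketch of the BPLS iteration follows: a one-component PLS model is fitted to the current residual of y with the original (centered) X, shrunk, and accumulated into the coefficients of eq 19. Because the per-sample loss and weight-update formulas (eqs 17, 18, and 20-22) are not reproduced above, the adaptive branch assumes an AdaBoost.R2-style squared loss scaled to [0, 1] and an exponential weight update; only eqs 23 and 24 are taken literally, and all names are ours.

```python
import numpy as np

def bpls(X, y, n_iter=500, v0=0.9, adaptive=False):
    """Sketch of BPLS with a fixed shrinkage (v = v0) or the adaptive
    variant v_{m+1} = 1 - L_m (eq 24). Returns (b, b0) so that the
    combined prediction of eq 19 is X_new @ b + b0."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    r = y - y_mean                     # current residual of y
    sw = np.full(n, 1.0 / n)           # sample weights (adaptive variant)
    b_total, v = np.zeros(p), v0
    for _ in range(n_iter):
        w = Xc.T @ r                   # one-component PLS on (X, r):
        w /= np.linalg.norm(w)         #   weight vector,
        t = Xc @ w                     #   score vector,
        q = (r @ t) / (t @ t)          #   y loading
        b = q * w                      # coefficients of the weak model
        f = Xc @ b                     # its fit of the current residual
        if adaptive:
            l = (r - f) ** 2           # assumed per-sample squared loss
            l /= l.max() + 1e-12       # scaled to [0, 1]
            L = float(sw @ l)          # weighted total loss L_m
            sw *= np.exp(l)            # assumed weight update (cf. eq 22)
            sw /= sw.sum()             # eq 23: normalize the weights
            v_next = 1.0 - L           # eq 24
        else:
            v_next = v0
        b_total += v * b               # accumulate the shrunken model (eq 19)
        r -= v * f                     # update the residual of y
        v = v_next
    b0 = y_mean - x_mean @ b_total     # offset of the combined model
    return b_total, b0
```

With v0 = 0.9 this corresponds to the fixed-shrinkage setting studied below; the adaptive branch corresponds to the v = 1 - L variant.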
Figure 1. The RMSECV (-) and RMSEP (O) curves of the data sets using classical PLS.

Both shrinkage methods are applied to the data sets. The results for the hydrogen data and the green tea data (Figure 3) show that the RMSEP curves using v = 1 - L are flatter than those using v = 0.9. When a changeable v is used, v ranges from 0.37 to 0.93 for the hydrogen data and from 0.37 to 0.78 for the green tea data (Figure 4). Most v's for both data sets are smaller than 0.5, which explains why v = 1 - L leads to flatter RMSEP curves than v = 0.9 as M increases. On the other hand, it should be noted that for the hydrogen data both shrinkage methods achieve a similar minimal RMSEP, whereas for the green tea data the minimal RMSEP obtained with v = 1 - L is higher than that obtained with v = 0.9. Therefore, at this stage it is not possible to conclude that one of the two methods for computing v should be preferred, and both will still be considered in the next step, the determination of an optimal iteration number M.
Table 1. The RMSEP Values of the Data Sets Obtained by Classical PLS (LOO-RMSEP) and APLS (APLS RMSEP)a

data        LOO-RMSEP    optimal PLS number    APLS RMSEP
hydrogen    0.059        5                     0.071
wheat       0.256        7                     0.216
green tea   1.759        13                    1.837
CR          1.446        1                     0.952
HIV         0.619        3                     0.603
cream       0.049        15                    0.142
meat        3.761        19                    3.001

a For classical PLS, the optimal number of PLS components is determined by LOOCV. For APLS, 20 components are considered.
Determination of M. As seen in Figure 3, within a wide range of iterations (hydrogen data, [1-2000]; green tea data, [3000-6000]), the RMSEP values obtained by BPLS are comparable to the LOO-RMSEP and do not change much, indicating that they are insensitive to M within those ranges. It is, therefore, not necessary to select a precise value of M. In what follows, several methods are proposed to decide on the optimal value of M through fast computation. The maximal M considered is 6000, since a preliminary study indicated that within this number all data sets reach the LOO-RMSEP for both v = 1 - L and v = 0.9.

Monte Carlo Determination. In each Monte Carlo determination, 50% of the samples are randomly drawn from the calibration set without replacement. These samples are used to build the BPLS model and to predict the remaining 50% of the samples in the calibration set. The M for which the minimal RMSEP value is obtained is considered optimal in this Monte Carlo determination. The procedure is repeated 50 times, so that 50 optimal M values are obtained, and the mean of these 50 values is taken as the final value for M. The results are listed in Table 3. Most results are comparable to the LOO-RMSEP. An advantage of this method is its repeatability: when the method is repeated several times, the obtained optimal M's are quite similar. However, the method is not economic in computation. Fewer Monte Carlo repetitions (10 or 20) using 80 or 90% of the samples for training were also investigated. In most cases, satisfactory results, comparable to the LOO-RMSEP, are obtained; however, the optimal M's are less repeatable for some data sets, suggesting that more Monte Carlo repetitions are to be preferred. Cross-validations of 3-, 5-, and 10-fold were also considered, but the overall results demonstrate that the Monte Carlo determination is superior. A sketch of this procedure is given below.
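In the sketch, rmsep_curve is a hypothetical helper of ours (not part of the original paper) that is assumed to return the validation RMSEP after each boosting iteration.

```python
import numpy as np

def monte_carlo_M(X, y, rmsep_curve, n_repeats=50, frac=0.5,
                  max_iter=6000, seed=None):
    """Monte Carlo determination of the iteration number M (a sketch).
    rmsep_curve(X_tr, y_tr, X_val, y_val, max_iter) is assumed to return
    an array with the validation RMSEP after iterations 1..max_iter."""
    rng = np.random.default_rng(seed)
    n, optima = len(y), []
    for _ in range(n_repeats):
        idx = rng.permutation(n)        # draw without replacement
        k = int(frac * n)
        tr, val = idx[:k], idx[k:]      # e.g., a 50/50 split
        curve = rmsep_curve(X[tr], y[tr], X[val], y[val], max_iter)
        optima.append(int(np.argmin(curve)) + 1)
    return int(round(np.mean(optima)))  # final M: mean of the optima
```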
Determination Based on the Quality of the Fit of the Calibration Data. The RMSEC (root mean square error of calibration) value describes the quality of the fit for the calibration set. Here,

RMSECm = sqrt(∑i=1..n (yi - ŷi,m)²/n)    (25)

where ŷi,m is the prediction of yi when the model is constructed with all calibration samples. As shown in Figure 5, the RMSEC curve often drops dramatically during the first iterations, and the decrease tends to slow as the iteration continues. This suggests a possible stopping criterion based on the change of the RMSEC: when the difference between successive RMSECs is less than a chosen threshold, the boosting can stop. It is not necessary to calculate the change of the RMSEC at each iteration; it can be checked every k (e.g., 100) iterations. Meanwhile, the size of the RMSEC is correlated with the magnitude of y (Figure 5a vs Figure 5b); therefore, the mean of y is also taken into account.

Figure 2. The RMSEP curves of (a) the hydrogen data and (b) the green tea data obtained by BPLS for different fixed shrinkage values.

The proposed stopping criterion is

Δλs = (RMs+1 - RMs)/mean(y) < threshold    (s = 1, ..., m/k; k = 50 or 100)    (26)

where

RMs = ∑i=k×(s-1)+1..k×s RMSECi
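A sketch of this check follows. It assumes the RMSEC of eq 25 is recorded at every iteration and compares consecutive blocks of k values; the absolute difference is used here, since the RMSEC decreases, although whether the original uses the signed or the absolute change is not stated.

```python
import numpy as np

def rmsec_stop(rmsec_history, y, k=100, threshold=1e-4):
    """Stopping check of eq 26 (a sketch). rmsec_history contains
    RMSEC_1 ... RMSEC_m; RM_s is the sum of the sth block of k values."""
    m = len(rmsec_history)
    if m < 2 * k:
        return False                              # need two full blocks
    RM = [sum(rmsec_history[i:i + k]) for i in range(0, (m // k) * k, k)]
    delta = abs(RM[-1] - RM[-2]) / float(np.mean(y))   # eq 26
    return delta < threshold
```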
Table 2. The RMSEP Values of the Data Sets Obtained by BPLS Using Different Fixed v's and with Different Iteration Numbers

                            iteration numbers                              RMSEP (min),
v      100    500    1000   1500   2000   3000   4000   5000   6000   corresponding iteration no.

Hydrogen Data
0.9    0.058  0.069  0.074  0.076  0.077  0.093  0.097  0.098  0.137  0.057, 126
0.5    0.059  0.069  0.074  0.075  0.078  0.075  0.097  0.098  0.098  0.057, 109
0.1    0.064  0.060  0.065  0.067  0.068  0.075  0.076  0.077  0.076  0.058, 274
0.01   0.124  0.060  0.059  0.061  0.063  0.065  0.066  0.067  0.068  0.058, 724

Green Tea Data
0.9    2.546  2.308  1.974  1.953  1.938  1.848  1.797  1.633  1.602  1.602, 5896
0.5    2.821  2.364  2.089  2.002  1.966  1.922  1.909  1.857  1.815  1.815, 5968
0.1    2.914  2.691  2.500  2.442  2.367  2.218  2.104  2.035  2.005  2.005, 5995
0.01   3.475  2.892  2.728  2.627  2.559  2.486  2.449  2.422  2.397  2.397, 6000

Wheat Data
0.9    0.222  0.228  0.258  0.263  0.260  0.259  0.258  0.257  0.256  0.217, 242
0.5    0.231  0.225  0.247  0.259  0.260  0.260  0.259  0.258  0.253  0.217, 152
0.1    0.332  0.221  0.220  0.231  0.237  0.240  0.248  0.255  0.258  0.219, 858
0.01   0.639  0.225  0.220  0.219  0.219  0.221  0.223  0.226  0.228  0.219, 1659

CR Data
0.9    1.110  1.018  1.064  1.148  1.185  1.231  1.285  1.317  1.289  1.011, 765
0.5    1.145  1.020  1.022  1.050  1.056  1.216  1.265  1.286  1.296  1.002, 309
0.1    1.226  1.085  1.005  1.002  1.008  1.015  1.062  1.079  1.085  1.002, 1437
0.01   1.390  1.151  1.073  1.031  1.012  1.002  1.005  1.009  1.012  1.002, 2963

HIV Data
0.9    0.613  0.648  0.689  0.693  0.693  0.693  0.693  0.693  0.693  0.608, 142
0.5    0.615  0.610  0.658  0.687  0.693  0.693  0.693  0.693  0.693  0.609, 436
0.1    0.617  0.611  0.610  0.619  0.628  0.649  0.658  0.670  0.678  0.609, 825
0.01   0.638  0.615  0.614  0.610  0.609  0.610  0.613  0.618  0.624  0.608, 2217

Cream Data
0.9    0.276  0.113  0.101  0.100  0.079  0.059  0.057  0.055  0.050  0.050, 5935
0.5    0.533  0.171  0.147  0.129  0.108  0.076  0.070  0.062  0.060  0.060, 5998
0.1    0.676  0.363  0.253  0.203  0.169  0.134  0.120  0.114  0.105  0.104, 5971
0.01   1.078  0.499  0.338  0.279  0.243  0.203  0.180  0.169  0.161  0.161, 5997

Meat Data
0.9    3.517  2.694  2.468  2.497  2.510  2.544  2.541  2.732  2.703  2.432, 769
0.5    3.965  2.832  2.478  2.412  2.446  2.458  2.486  2.546  2.548  2.392, 1390
0.1    5.946  3.478  3.223  3.030  2.914  2.684  2.609  2.512  2.491  2.491, 5974
0.01   8.606  4.024  3.543  3.423  3.331  3.199  3.097  3.008  2.933  2.932, 5998
Threshold values of 0.005, 0.001, 0.0005, 0.0001, 0.00005, and 0.00001 were considered. For all the data sets, the selected M tends to increase when the threshold value decreases. Generally, if a data set requires a high number of components in classical PLS, the M needed in BPLS to reach the LOO-RMSEP is also high. The results indicate that 0.0001 or 0.0005 can be considered acceptable threshold values. Table 4 lists the results obtained by BPLS when the threshold is 0.0001. Most results are acceptable with respect to the LOO-RMSEP. An additional advantage of this method is its fast computation when far fewer than 6000 iterations are required; however, for data sets that require very low or very high numbers of components in classical PLS, for instance, the HIV data and the cream data, the result is sometimes not satisfactory.

Determination Based on the Variance of the Scores. The variance of the scores in the mth PLS model is equal to t′mtm. When this variance becomes too small, the extracted information is also small and is probably due to noise. As a variance criterion, we use |b|, the norm of the vector of the regression coefficients, and more specifically 1/|b|, which is considered an indication of the sensitivity of the model.34

(34) Kalivas, J. H.; Green, R. L. Appl. Spectrosc. 2001, 55 (12), 1645-1652.
The stopping criterion is

30 × t′mtm ≤ 1/|bm|    (27)
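In code, the check of eq 27 is a one-liner; here t is the score vector of the current one-component model and b the coefficient vector of the mth model. The constant 30 is passed as a parameter, since the combined procedure described below relaxes it to 50.

```python
import numpy as np

def variance_stop(t, b, c=30.0):
    """Variance-based stopping check (eq 27, a sketch): stop boosting
    when c * t'_m t_m falls to or below the sensitivity 1/|b_m|."""
    return c * float(t @ t) <= 1.0 / float(np.linalg.norm(b))
```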
The results obtained with this stopping criterion are listed in Table 5. Most results are comparable to the LOO-RMSEP. Fast computation is also an advantage of this method.

Determination by Combining Monte Carlo Estimation and the Variance Stopping Criterion. The combination of Monte Carlo estimation and the variance stopping criterion was also studied for the determination of the optimal M. Each time, 90% of the samples are randomly selected from the training set without replacement. These samples are used to construct a BPLS model, and 50 × t′mtm ≤ 1/|bm| is used as a stopping criterion with a maximum of 6000 iterations. The obtained BPLS model predicts the remaining 10% of the samples in the training set. The iteration number that gives the minimal RMSEP value is considered optimal. The procedure is repeated 10 times, and the final optimal M is the mean of the 10 optimal iteration numbers. The selection of 90% of the samples is preferred to the selection of 50%, because when combined with the variance stopping criterion, the optimal M obtained by both selections is repeatable, but the former needs only 10 repetitions instead of 50. The variance stopping criterion in eq 27 is relaxed by using 50 instead of 30 because, when combined with Monte Carlo, this was found to result in a more stable and better RMSEP.

The results listed in Table 6 indicate that this method gives the overall best results of all the proposed methods. In addition to the advantage in accuracy, the combination of Monte Carlo with the variance-of-scores stopping criterion has two further advantages: first, it keeps the stability of 50-repetition Monte Carlo with selection of 50% of the samples, while needing fewer Monte Carlo repetitions (10); second, it profits from the computational speed of eq 27. As shown in Table 6, the two shrinkage methods achieve similar accuracy. Since using v = 0.9 is simple and time-saving, this method is preferred. For convenience, Table 6 also repeats the results for classical PLS from Table 1. It can be verified that, in general, BPLS yields results that are at least comparable in quality to those obtained by classical PLS. The results in Table 6 are also compared with the minimal RMSEP value and the corresponding iteration number obtained with v = 0.9, indicated in the last two columns of Table 2. In Table 6, four of the seven data sets (hydrogen, wheat, CR, and HIV) select a number of iterations that is not far from the number that gives the minimal RMSEP, and two data sets (green tea and cream) select a much smaller number of iterations. Only the meat data select a much larger number of iterations. This indicates that BPLS tends to select a smaller model, and thus, overfitting may be avoided. For the cream and the meat data, although the difference between the iteration numbers is large, the RMSEP is similar to the minimal RMSEP, suggesting that the BPLS model is stable within a wide range of iterations.

Figure 3. The RMSEP curves of (a) the hydrogen data and (b) the green tea data obtained by BPLS using fixed (v = 0.9) or changeable (v = 1 - L) shrinkage values.

Figure 4. The change of the shrinkage value (v) for the hydrogen data and the green tea data using the v = 1 - L method.

Figure 5. The RMSEC curves of (a) the hydrogen data and (b) the green tea data obtained by BPLS using fixed (v = 0.9) or changeable (v = 1 - L) shrinkage values.

Table 3. The RMSEP Value Obtained by BPLS Using the Optimal M Determined by Monte Carlo

            v = 0.9               v = 1 - L
data        RMSEP   optimal M     RMSEP   optimal M
hydrogen    0.060   38            0.057   125
green tea   1.944   1739          2.089   2744
wheat       0.250   597           0.247   1193
CR          1.017   687           1.021   787
HIV         0.628   7             0.642   11
cream       0.052   5605          0.060   5840
meat        2.540   3871          2.478   4432

Table 4. RMSEP Value Obtained by BPLS Using Δλ < 0.0001 as a Stopping Criteriona

            v = 0.9                          v = 1 - L
data        k = 50          k = 100          k = 50          k = 100
hydrogen    0.059 (150)     0.069 (500)      0.059 (200)     0.066 (400)
green tea   2.224 (700)     1.971 (1100)     2.293 (1100)    2.141 (1700)
wheat       0.228 (500)     0.258 (1000)     0.237 (700)     0.239 (800)
CR          1.097 (1350)    1.231 (3100)     1.064 (2750)    1.139 (4600)
HIV         0.657 (600)     0.657 (700)      0.611 (500)     0.643 (1000)
cream       0.102 (1200)    0.078 (2100)     0.090 (1400)    0.089 (1600)
meat        2.496 (1150)    2.505 (1700)     2.650 (1000)    2.435 (1800)

a The number in parentheses indicates the selected M.

Table 5. RMSEP Value Obtained by BPLS Using 30 × t′mtm ≤ 1/|bm| as a Stopping Criterion

            v = 0.9                 v = 1 - L
data        RMSEP   iteration no.   RMSEP   iteration no.
hydrogen    0.059   116             0.059   175
green tea   1.633   5011            1.822   6000
wheat       0.228   508             0.232   514
CR          1.177   6000            1.177   6000
HIV         0.657   504             0.640   821
cream       0.060   2485            0.063   4768
meat        2.703   6000            2.535   6000

Table 6. RMSEP Value Obtained by BPLSa,b

            v = 0.9                 v = 1 - L
data        RMSEP   iteration no.   RMSEP   iteration no.   LOO-RMSEP
hydrogen    0.060   74              0.059   91              0.059
green tea   1.966   1270            1.983   3453            1.759
wheat       0.221   323             0.234   564              0.256
CR          1.065   1090            1.028   1749            1.446
HIV         0.610   164             0.614   263             0.619
cream       0.058   3529            0.061   5503            0.049
meat        2.539   3160            2.463   3605            3.761

a The optimal M is determined by the combination of Monte Carlo with 50 × t′mtm ≤ 1/|bm| as a stopping criterion. b The RMSEP value obtained by classical PLS (LOO-RMSEP) is also listed.

CONCLUSIONS

PLS is often and very successfully used in multivariate calibration for analytical purposes; however, the practicing analytical chemist may have difficulties in avoiding overfitting. This study investigates an alternative approach by combining PLS and boosting. BPLS does not require the selection of an adequate number of PLS components, as is the case for classical PLS, but two other parameters must be determined: the shrinkage value and the number of iterations. A standard value is proposed for the former, and a method is proposed for the latter. As a result, BPLS has at least comparable prediction ability with respect to classical PLS. It is relatively resistant to overfitting, since the prediction is more stable when the adjustable parameter of BPLS, namely, the
number of iterations, increases than in classical PLS, in which large changes can be seen when the number of components increases. The selection of an exact optimal number is not strictly required, which leads to a saving in computation that may be meaningful for model updating. When new samples are added to an existing classical PLS model, the number of components should always be reconsidered; since our method does not need a very precise M, simpler solutions might be available.

Changing the order of the calibration samples may result in somewhat different prediction results. This is due to rounding errors in the calculation and their propagation through the iterations. This problem was never mentioned in previous articles about boosting in regression and was, therefore, perhaps never investigated. To check the size of this problem, the calibration sets were reordered several times, and the BPLS models were reconstructed. The results for our data sets show that, using our criteria to select a suitable number of boosting iterations, the numerical stability problem is not critical, since the difference in prediction is acceptable; however, the problem should not be neglected. One possible solution may be to find a more numerically stable algorithm for PLS.

Our overall conclusion is that BPLS is a promising new approach to PLS that has the potential of correcting some difficulties that arise with the use of the latter in practical analytical applications.

ACKNOWLEDGMENT

The authors thank Xavier Capron for drawing their attention to the numerical stability problem. They also thank P. Dardenne and J. A. Fernandez for supplying the meat data.
Received for review September 27, 2004. Accepted December 1, 2004. AC048561M