Anal. Chem. 1995, 67, 4295-4301

Genetic Algorithms as a Tool for Wavelength Selection in Multivariate Calibration

Delphine Jouan-Rimbaud and Désiré-Luc Massart*
ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium

Riccardo Leardi
Istituto di Analisi e Tecnologie Farmaceutiche ed Alimentari, Via Brigata Salerno (ponte), I-16147 Genova, Italy

Onno E. De Noord
Koninklijke/Shell-Laboratorium, Amsterdam (Shell Research B.V.), P.O. Box 38000, 1030 BN Amsterdam, The Netherlands

A comparison of multiple linear regression (MLR) with partial least-squares (PLS) regression is presented for the multivariate modeling of the hydroxyl number of a certain polymer, using a heterogeneous near-IR spectroscopic data set. The MLR model was built by selecting the variables with a genetic algorithm. A good model could be obtained with both methods, and the MLR and PLS solutions turn out to be very similar: the two models include the same number of variables, and the first variables in each model have similar, chemically understandable functions. It is concluded that both solutions are equivalent and that each has some advantages and disadvantages. This also means that, even in very complex situations such as the one studied here, MLR can replace PLS in certain cases.

Two popular methods for the quantitative analysis of spectral data by near-infrared (near-IR) spectroscopy are principal component regression (PCR) and partial least squares (PLS),1 the latter usually being preferred. PCR and PLS have the advantage of being full-spectrum methods, which means that all the information contained in the spectra is available for the modeling. Another approach is multiple linear regression (MLR). Some applications of MLR modeling of near-IR data are given in the literature.2 In such cases, MLR is either performed on near-IR data measured on a fixed-filter instrument, which records the spectra at a few wavelengths only, or after a preselection of wavelengths, which is most of the time performed in a stepwise manner. The advantage of MLR is that, for a person who has relatively little knowledge of chemometric techniques, it should be easier to interpret in chemical terms than methods based on latent vectors. Moreover, one may expect the statistical interpretation of the results to be more straightforward, as the theory of MLR is better understood at present than that of PCR and PLS. This should also be the case for quality control, where the interpretation of a situation that is out of control is probably easier than with

(1) Martens, H.; Næs, T. Multivariate Calibration; John Wiley and Sons: New York, 1991.
(2) Hildrum, K. I.; Isaksson, T.; Næs, T.; Tandberg, A. Near Infra-Red Spectroscopy, Bridging the Gap between Data Analysis and NIR Applications; Ellis Horwood Limited: Chichester, U.K., 1992.

PCR and PLS. The main problem for MLR is the selection of the wavelengths to be retained in the model. A stepwise selection does not always yield satisfactory results; in particular, the fitted models resulting from a stepwise selection may have a poor predictive ability. Another possible drawback of the method is that the wavelengths are selected one at a time, whereas for multivariate calibration one may prefer a truly multivariate approach and select the whole subset at the same time. Therefore, in the case of near-IR spectra containing many variables, an appropriate variable selection must be performed before modeling with MLR. Genetic algorithms (GAs) have already been used in variable selection problems3-9 and seem to offer a truly multivariate selection of variables. There are indications in the literature that MLR after selection by GAs yields models with the same number of regression variables as PCR or PLS, usually with somewhat better predictive ability, and we wanted to investigate this phenomenon in more detail.

In order to study the performance of genetic algorithms in variable selection, we have chosen to work with a complex data set, for which simple methods of variable selection, such as the stepwise selection procedure, did not yield satisfactory results in terms of the predictive ability of the models obtained. The near-IR data set used is heterogeneous and contains a few outliers. The samples analyzed are polyether polyols with different chemical structures, which is the reason for the main heterogeneity. The presence of water in different amounts in the samples is another source of differences in the absorption spectra. We decided to make a single model with the whole data set, instead of making separate models for the main clusters.

THEORY

Genetic Algorithms as a Tool for Wavelength Selection. The ultimate goal of the genetic algorithms is the optimization of a given response function.

(3) Goldberg, D. E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley Publishing Co. Inc.: Reading, MA, 1989.
(4) Lucasius, C. B.; Kateman, G. Chemom. Intell. Lab. Syst. 1993, 19, 1-33.
(5) Lucasius, C. B.; Kateman, G. Chemom. Intell. Lab. Syst. 1994, 25, 99-145.
(6) Hibbert, D. B. Chemom. Intell. Lab. Syst. 1993, 19, 277-293.
(7) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135-153.
(8) Leardi, R.; Boggia, R.; Terrile, M. J. Chemom. 1992, 6, 267-281.
(9) Leardi, R. J. Chemom. 1994, 8, 65-79.


These algorithms are inspired by the theory of evolution: in a living environment, the "best" individuals have a greater chance to survive and a greater probability to spread their genomes by reproduction. The mating of two "good" individuals causes the mixing of their genomes, which may result in "better" offspring. The terms "good", "best", and "better" refer to the fitness of the individuals to their environment. Simulated annealing, an optimization method similar to the genetic algorithms, has already been used for variable selection.7,10,11

In genetic algorithms, a population of strings is randomly created. In feature selection, each string is a row vector containing as many elements as there are variables, each element being coded as 1 if the corresponding variable is selected and 0 if it is not. The fitness of the string is equal to the evaluation response which is to be optimized by the algorithm, for instance, the percentage of explained variance with a given subset of selected wavelengths. This randomly initiated population of strings then evolves, its fittest members being selected and then undergoing crossover (i.e., combination of their elements) and, to a lesser extent, mutations. This procedure is repeated a certain number of times. There are many different versions of genetic algorithms, depending on the way reproduction, crossover, etc. are performed. The algorithm we have used is specifically devoted to the problem of variable selection. Its characteristics, such as the protection of some strings, and the validation procedure are described in detail in refs 8 and 9. As validation is a very important step of the algorithm, we briefly recall how it is performed; a full description can be found in ref 9:

(1) The data set is separated into g groups; a training set is built from g − 1 groups, and the gth group constitutes the test set. Hence there exist g different training sets, associated with g different test sets.

(2) For each training set, the genetic algorithm described in the flow diagram presented in Figure 1 is performed, and the subsets of variables found are validated by predicting the test set from the models built with the calibration objects and each subset of selected variables. Therefore, the genetic algorithm itself is run g times (each time on a different training set).

(3) After g runs, some of the evaluated strings have a validated response larger than the minimum acceptable value (which is given by the user). For each of these subsets: (a) perform g models, with the g training sets, and for each model, predict the corresponding test set; (b) to each subset there then correspond g values. If the g responses are all larger than the minimum acceptable value, their geometric mean is calculated; it is this value which is then used to compare the strings with one another in order to find the "best" solution.

This validation approach comes as close to full validation12 as possible, but one might argue that it is not a full validation. Indeed, there is a risk of chance effects: not only "chance correlations" but also "chance predictions". Because many different subsets are evaluated, the probability that something is found by chance is increased. Moreover, the prediction results of the "validation" step are used in the subset selection procedure to choose the best string from a number of acceptable solutions, and therefore, the validation results are not completely independent from the selection procedure.

(10) Kalivas, J. H. Chemom. Intell. Lab. Syst. 1992, 15, 1-12.
(11) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 61, 2024-2030.
(12) Lanteri, S. Chemom. Intell. Lab. Syst. 1992, 15, 159-169.
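To make the validation of a candidate subset concrete, the following sketch (ours, in Python; the published implementation was written in QuickBasic) evaluates one binary string over g training/test splits with MLR and returns the geometric mean of the g validated responses, as in step 3 above. The function names, the simple interleaved grouping, and the placement of the acceptability threshold are our illustrative assumptions.

```python
import numpy as np

def predicted_variance(y_obs, y_pred):
    """Percentage of predicted variance, the response the GA maximizes
    (degrees-of-freedom corrections omitted for brevity)."""
    press = np.sum((y_pred - y_obs) ** 2)
    tss = np.sum((y_obs - y_obs.mean()) ** 2)
    return 100.0 * (1.0 - press / tss)

def validate_string(X, y, string, g=5, min_response=99.0):
    """Evaluate one binary string (1 = wavelength kept) over g splits.

    Returns the geometric mean of the g validated responses, or None
    if any split falls below the minimum acceptable value.
    """
    cols = np.flatnonzero(string)            # indices of selected wavelengths
    groups = np.arange(len(y)) % g           # illustrative deletion groups
    responses = []
    for k in range(g):
        train, test = groups != k, groups == k
        Xtr = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        Xte = np.column_stack([np.ones(test.sum()), X[test][:, cols]])
        b, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)   # MLR fit
        r = predicted_variance(y[test], Xte @ b)
        if r < min_response:                 # reject the string outright
            return None
        responses.append(r)
    return float(np.exp(np.log(responses).mean()))           # geometric mean
```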


Figure 1. Flow diagram of the genetic algorithm.

The genetic algorithm itself is described in the flow diagram of Figure 1. Some modifications have been brought to the algorithm since the publication of refs 8 and 9. They are as follows.

The Check of Subsets: This step is performed during the initiation of the population and after the evaluation of each new offspring. A string s can enter the population only if none of the solutions rating better than s uses a combination of variables which is a subset of the variables used by solution s. After solution s has entered the population, the solutions rating worse than s and using a combination of variables of which string s is a subset are eliminated. This allows the elimination of some useless information.

The Use of a Hybrid Algorithm: In order to improve the performance of the genetic algorithm, a stepwise procedure was added. With a given frequency, the best solution (or one of the best) undergoes a backward elimination,13 which is performed in the following way. If n variables are selected in the string which undergoes the backward elimination step, n different subsets of n − 1 elements of this string are evaluated, each one lacking one of the n variables. If the best of these subsets yields a response better than the original string, the backward elimination is repeated with its n − 1 subsets of n − 2 elements, etc. All subsets of variables deriving from the stepwise selection are considered as new strings and can therefore enter the population if they satisfy the constraints described above. After the stepwise selection, the genetic algorithm starts again, and the two steps alternate until a termination criterion is reached. A backward elimination was preferred to a forward selection, or a forward and backward selection, because the many variables present in near-IR data would make a cycle of forward selection too time-consuming and because the probability that useless variables are present in the string is rather high. The stepwise selection has a double goal: to find new strings that have a higher fitness and that are at the same time simpler, since they use a smaller number of variables.
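The backward elimination step can be summarized in a few lines. This is our own sketch, not the published code, with `fitness` standing for whatever validated response the algorithm optimizes (higher is better):

```python
def backward_eliminate(string, fitness):
    """Hybrid GA step: repeatedly drop one variable if the best of the
    leave-one-out subsets beats the current string, as described above.

    `string` is a tuple of selected variable indices; `fitness` maps such
    a tuple to its validated response.
    """
    best = tuple(string)
    best_fit = fitness(best)
    while len(best) > 1:
        # all subsets lacking exactly one of the currently selected variables
        candidates = [best[:i] + best[i + 1:] for i in range(len(best))]
        fits = [fitness(c) for c in candidates]
        i_best = max(range(len(fits)), key=fits.__getitem__)
        if fits[i_best] <= best_fit:
            break                  # no simpler subset does better: stop
        best, best_fit = candidates[i_best], fits[i_best]
        # each evaluated subset can also be offered to the GA population
    return best, best_fit
```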

Check of Orthogonality: The selection of variables is performed for an MLR model. It is known that collinearity of the variables can seriously affect the stability of this model and therefore its quality. Different diagnostics exist to detect and evaluate the collinearity in a data set, and an overview is presented, for instance, in ref 14. Two of these diagnostics were used, namely, the variance inflation factor (VIF) and the condition indexes.

Variance Inflation Factor: A useful parameter for the description of the interdependency in a data set is the tolerance, which for a variable i is calculated as

$$\mathrm{Tol}(i) = 1 - R^2(i)$$

R^2(i) being the coefficient of multiple determination for the regression between variable i, which is considered dependent, and the other independent variables. R^2(i) is defined as

$$R^2(i) = 1 - \frac{\sum_{j=1}^{n}\left(x_j(i) - \hat{x}_j(i)\right)^2}{\sum_{j=1}^{n}\left(x_j(i) - \bar{x}(i)\right)^2}$$

where x_j(i) is the value of variable i for object j, x̂_j(i) is the corresponding value fitted by the regression, and x̄(i) is the mean value of x(i) over the n objects. The larger R^2(i), the more relation exists between variable i and the other independent variables, and the smaller the tolerance. The VIF is defined as the inverse of the tolerance:

$$\mathrm{VIF}(i) = \frac{1}{1 - R^2(i)}$$

The VIFs are the diagonal elements of the inverse of the correlation matrix.

(13) Draper, N. R.; Smith, H. Applied Regression Analysis, 2nd ed.; John Wiley and Sons: New York, 1981.
(14) Belsley, D. A.; Kuh, E.; Welsch, R. E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; John Wiley and Sons: New York, 1980; Chapter 3.

This factor is linked with the variance of the regression coefficient calculated for variable i: the larger the VIF, the larger the variance of the regression coefficient, and therefore the less stable the model. Some authors15 consider a VIF of 5 or 10 to be large (these values correspond to R^2 values of 0.8 and 0.9, respectively).

Condition Indexes: Any matrix X of rank r has r positive singular values. Let us call λ1, λ2, ..., λr the r singular values, such that λ1 ≥ λ2 ≥ ... ≥ λr > 0. The condition number of the matrix, c0, is

$$c_0 = \lambda_1 / \lambda_r$$

Similarly, a condition index can be defined for any singular value e as

$$c(e) = \lambda_1 / \lambda_e$$

and it is stated that there are as many near dependencies among the variables as there are large condition indexes: "weak dependencies are associated with condition indexes around 5 or 10, whereas moderate to strongly linear relations are associated with condition indexes of 30 to 100".14
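Both diagnostics reduce to a few lines of linear algebra. A sketch (ours) for a matrix X whose columns are the selected wavelengths; note that the SPSS runs reported below append a column of 1s for the intercept before computing these quantities:

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (VIFs, condition indexes) for the columns of X (objects x variables)."""
    # the VIFs are the diagonal elements of the inverse of the correlation matrix
    R = np.corrcoef(X, rowvar=False)
    vifs = np.diag(np.linalg.inv(R))
    # condition indexes: largest singular value divided by each singular value
    s = np.linalg.svd(X, compute_uv=False)   # returned in decreasing order
    condition_indexes = s[0] / s
    return vifs, condition_indexes
```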

EXPERIMENTAL SECTION

The Data. The data set consists of 248 near-IR spectra of polyether polyols (X-matrix) and the accompanying hydroxyl numbers (y-vector), expressed in mg of KOH/g. Different product types are present, with different chemical structures, making the data set inhomogeneous. The spectra of 87 samples were recorded on a Pacific Scientific 6250 scanning spectrometer (NIRSystems Inc., Silver Spring, MD), between 1100 and 2158 nm (one data point every two nanometers). An offset correction was performed, in which the absorbance at 1100 nm (the first variable) was subtracted from each variable; the first column of the X-matrix therefore became the null vector and was removed from X, leaving 529 variables. Each measurement was duplicated or triplicated, and the two or three spectra of each sample were averaged, so that the data set was composed of 87 objects. The variables being highly correlated, only one-fifth of them (i.e., every fifth wavelength) were used, in order to reduce the complexity of the search and the ratio of variables to objects, and thereby the random correlations; the spectral matrix which was input to the wavelength selection with the genetic algorithm was thus 87 × 105.
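The preprocessing just described is simple enough to state as code. A sketch under our stated assumptions: `spectra` holds one averaged spectrum per row, with 530 absorbances from 1100 to 2158 nm in 2-nm steps; which of each five wavelengths is kept is our guess, chosen so that 105 variables remain, matching the 87 × 105 matrix:

```python
import numpy as np

def preprocess(spectra):
    """Offset-correct the spectra and keep every fifth wavelength."""
    offset_corrected = spectra - spectra[:, [0]]  # subtract absorbance at 1100 nm
    X = offset_corrected[:, 1:]                   # drop the null first column: 529 variables
    return X[:, 4::5]                             # every fifth wavelength: 105 variables
```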

Parameters Input to the Genetic Algorithm.

number of outer cycles                              5
number of chromosomes in the population             20
probability for a variable to undergo crossover     50%
probability of mutations                            1%
minimal accepted predicted variance                 99%
frequency of the backward stepwise selection        every 100 evaluations
stop criterion for outer cycles                     200 evaluations + one cycle of backward selection

Response to Be Maximized. The response to be maximized is the percentage of predicted variance, defined as

$$\%\ \mathrm{predicted\ variance} = 100\left(1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2 / n}{\sum_{i=1}^{n}(y_i - \bar{y})^2 / k}\right)$$

where ŷ_i is the predicted value of object i, ȳ is the mean of the observed y values, n is the number of objects to be predicted, and k = n − 1 in the case of cross-validation and k = n in the case of prediction of an external validation set.

(15) Snee, R. D. Technometrics 1977, 19, 415-428.


Figure 2. Position of the seven selected wavelengths on a spectrum of X.

Table 1. Results Obtained for the Selection of Wavelengths with MLR

no. of wavelengths    % of pred variance(a)    cross-validated RMSEP    RMSEC
7                     99.80                    1.39                     1.1676
6                     99.79                    1.46                     1.1331
5                     99.63                    1.72                     1.6245

(a) Geometric mean of the five predictions.

This formula is usually applied to describe the percentage of variance explained by a model, which expresses the goodness of fit of this model. Here, it was used in the genetic algorithm to estimate the predictive ability of the model (in the validation step), and for this reason we call it the "predicted" variance.

Evaluation of the Performance of Each Model. The results are presented in terms of the root-mean-square error of prediction (RMSEP), calculated by predicting y by cross-validation, which gives an estimate of the prediction error to be expected from the model, and of the root-mean-square error in calibration (RMSEC), which has the same equation but in which y is predicted after the model is built. Since these objects belong to the modeling set, this parameter describes the lack of fit of the model. The RMSEP (respectively RMSEC) is defined as

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{n}}$$

In fact, the number of degrees of freedom for the RMSEC should be n − p − 1, p being the number of variables used in the model, but n being much larger than p, we have decided to use the approximated formula presented above.

Software Used. The genetic algorithm used to select the wavelengths was written in QuickBasic. The models were built in Matlab for Windows 4.0. The check for orthogonality was performed with SPSS 5.0.1 for Windows (in SPSS, the VIF and condition indexes are calculated for a spectral matrix to which a column of 1s is added for the calculation of an intercept, so for an n-variable subset, n + 1 eigenvalues can be calculated).
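In code, the two error measures share one formula and differ only in whether ŷ comes from cross-validation (RMSEP) or from refitting the calibration objects (RMSEC); a minimal sketch (ours):

```python
import numpy as np

def rms_error(y_obs, y_pred):
    """RMSEP if y_pred is cross-validated, RMSEC if y_pred is fitted."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)))
```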

RESULTS AND DISCUSSION

Results Concerning the Genetic Algorithm Applied to MLR. The best result obtained yielded a predicted variance (geometric mean of the prediction results for the five deletion groups) of 99.80%. The corresponding subset contains only seven variables, whose positions on a spectrum are shown in Figure 2. The output of the algorithm consists of all the subsets of variables which were found to yield a predicted variance greater than (or equal to) 99% in each of the five deletion groups.

Table 2. Largest Variance Inflation Factor and Condition Index for the Subsets Selected with MLR

no. of variables in the subset    largest VIF    largest condition index
7                                 81.665         284.986
6                                 10.681         213.674
5                                 7.927          48.777

We have also looked for the best subsets of fewer than seven variables which also yield a predicted variance greater than 99% (geometric mean over the five deletion groups). The results are presented in Table 1. Subsets of six and five variables were found. The fit of the model built with six variables is slightly better than that of the model built with seven variables, but the smallest prediction error is obtained with the seven-variable model; this is why the seven-variable subset was selected by the genetic algorithm, which optimizes the prediction error, not the fit. The six-wavelength model has five variables in common with the seven-wavelength subset, and the five-wavelength subset is a subgroup of the six-wavelength one (four variables in common with the seven-wavelength subset). This shows the similarity of the numerous potentially good strings found by the algorithm. Moreover, it explains why the RMSEC slightly increases when the complexity goes from six to seven variables: here, increasing the complexity does not mean adding one variable but going from one subset to another.

The check for orthogonality reveals that there is a strong dependency among the selected X-variables (see Table 2). As smaller subsets were used, the interdependency among the variables decreased. Independent variables that are too collinear can lead to unstable estimates of the regression coefficients, which can have a very harmful effect on the prediction of new independent samples. Although the MLR models obtained with the genetic algorithm were validated inside the algorithm by predicting independent samples, we considered that a further study of the MLR model was needed. Indeed, there seems to be a contradiction between the good results obtained in validation and the large values of the VIF and condition index, which reveal a strong collinearity and, therefore, a possible instability of the model. The probable interpretation of these apparently conflicting results is that the model is good within the modeled region only; it is therefore important to make sure that new samples lie within that region, otherwise they may be very badly predicted.

To understand which chemical information is important for the MLR model, we should first investigate the structure of the data.

Figure 3. PCA score plot: PC 1-PC 2 score plot; PC 1 explains 54.96% of the variance.

Table 3. Correlation Coefficient (r) between the Absorbance at a Selected Wavelength and the Hydroxyl Number Vector

selected wavelength (nm)    r
1320                        -0.1056
1400                        0.5379
1430                        0.9539
1460                        0.9139
1560                        0.8400
1800                        -0.1059
1910                        -0.0710

Table 4. Lack of Fit (Expressed in Terms of RMSEC) as a Function of the Number of Original Variables Included in the MLR Model

no. of variables    lack of fit (RMSEC)
1                   9.0301
2                   6.1837
3                   4.9596
4                   4.4578
5                   4.1843
6                   1.3651
7                   1.1676

The PC1-PC2 score plot is shown in Figure 3, where two main clusters can be seen. The cluster to which a sample belongs depends on its chemical structure, as indicated by the presence or absence of a peak at 1690 nm in its spectrum. A total of 26 samples belong to the cluster of spectra with no peak at 1690 nm, and 60 samples to the cluster of spectra with a peak. The spectrum of sample 63 does not have a peak at 1690 nm, but this sample is a chemical outlier on the first PC. PC 1, which reveals the heterogeneity, is in fact a descriptor of the type of CH groups. It would seem that, within the two clusters, subclusters can be defined. The heterogeneity of the data is due to differences in the chemical structure of the samples. A further analysis of the PCA score plots revealed that the second PC is a descriptor of the OH groups and that PC 3 describes the presence of water in the samples. These three PCs explain more than 95% of the variance in X; this enables us to understand which chemical groups contribute most to the absorbance.

Table 3 presents the correlation coefficients between the selected wavelengths and the hydroxyl number. If some of the selected wavelengths are strongly correlated with the hydroxyl number, this is not the case for all of them (particularly wavelengths 1320, 1800, and 1910 nm). Because of the difficulties with cross-validation when variable selection is involved and with deciding on the complexity of the model, it was decided to study the role of the seven wavelengths in detail. To do so, they were arranged, using a forward selection procedure, in the following order: 1430, 1800, 1460, 1910, 1560, 1320, and 1400 nm. One of the reasons for doing this is to be able to compare the MLR solution with the PLS solution, in which components are ordered according to importance (but it must not be forgotten that the genetic algorithm always considers the solutions as points in the experimental domain, without establishing any hierarchy among the selected variables). The first selected wavelength is that yielding the best correlation with the hydroxyl number (see Table 3), and the next two wavelengths showing high correlation with y (1460 and 1560 nm, respectively) are selected in positions 3 and 5. The second, fourth, and sixth selected wavelengths have a very low correlation with the hydroxyl number. In order to understand why these wavelengths were selected, we built one- to seven-variable models by including the wavelengths in their order of selection. After the addition of each variable to the model, the lack of fit was calculated in terms of RMSEC. The results are presented in Table 4. The RMSEC decreases regularly as the number of variables included in the MLR model increases.

Figure 4a plots the hydroxyl number versus the absorbance at 1430 nm, the wavelength which yields the best correlation with the hydroxyl number among the seven selected. This represents the one-variable MLR model. On this plot, the linear relation between the wavelength considered and the hydroxyl number is evident, but the spread of some points around the straight line is quite wide. It is interesting to note that these points are grouped along two lines, largely according to the two main clusters seen in the score plot of Figure 3. One evident conclusion is that better results might be obtained by splitting the data set in two according to these clusters and developing two separate models. We will report on this in a further article; here we treat the whole data set as such, including the outliers one may distinguish, because the purpose of this article is to study the performance of genetic algorithms in MLR for complex data sets.

Figure 4b shows the plot of y-predicted versus y-observed for the model built from 1430 and 1800 nm. The objects now have much smaller residuals. The role of the 1800-nm variable is therefore to take into account the difference between the two main clusters. We can notice that it was chosen instead of 1690 nm, but it carries the same information: indeed, it is situated on the absorption band of the first CH overtone, and, as was already explained, the difference between the two clusters depends on the chemical structure. As shown in the plot of Figure 4c, this variable does separate the two main clusters of the PCA plot. The result is one single model with visually one clear outlier (63) and, less clearly, 72, 53, and 67. For the reasons explained before, we keep the outliers in the data set.

The third variable included, 1460 nm, has a quite large correlation with the hydroxyl number, so its selection is not surprising. Its effect is largely to reduce the overall residual variance. Wavelengths 1430 and 1460 nm belong to the absorption band of the first OH overtone, but in this region water also absorbs. Therefore, a variable should be added to correct for this water contribution: this is done with the fourth variable, 1910 nm, which is situated on the water absorption band. The corresponding model is shown in Figure 4d, where it can be seen that its effect is to improve the fit for objects 53 and 67, whose residuals become much smaller.
Figure 4. (a) One-variable MLR model, hydroxyl number versus absorbance at 1430 nm; (b) two-variable MLR model, y-predicted versus y-observed; (c) intensities at 1800 nm versus intensities at 1430 nm; (d) four-variable MLR model, y-predicted versus y-observed.

These two samples have a large water content, which is why their fit was improved by the addition of 1910 nm. Variable 1560 nm being correlated with the hydroxyl number, its selection will not be commented on here. There are still quite large residuals for this model (particularly objects 63 and 72), and wavelength 1320 nm essentially corrects for them; this wavelength belongs to the CH absorption band, and object 72 has an extreme value at 1320 nm. The last wavelength, which also carries information about CH absorption, essentially decreases the residuals of the two new extreme values (37 and 83). From this analysis, it follows that the MLR model can indeed be explained on a chemical basis. Although much attention was paid to the validation, there is always the fear that a random selection took place; this analysis shows that the model is chemically plausible.

Comparison with Full-Spectrum PCR and PLS. PLS is considered a reference method, since it is the standard method applied for multivariate calibration. The optimal number of factors to use in PCR and PLS was determined by cross-validation, using five cross-validation segments. The results are presented in Table 5. The PCR and PLS models are not as good as the six- and seven-variable MLR models, in terms of fit as well as prediction error, which agrees with earlier conclusions made by one of us.9

Table 5. Results Obtained with PCR and PLS Applied to the Whole Spectra

method (optimal no. of factors)    cross-validated RMSEP    RMSEC
PCR (8 factors)                    1.84                     1.5176
PLS (7 factors)                    1.78                     1.3884
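As an illustration of how the number of PLS factors was chosen, the sketch below runs a five-segment cross-validation and reports the cross-validated RMSEP for a given number of factors. It is our own reconstruction written with scikit-learn (the original models were built in Matlab), and the segment assignment is illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def cv_rmsep(X, y, n_factors, n_segments=5):
    """Cross-validated RMSEP of a PLS model with n_factors latent variables."""
    y = np.asarray(y, dtype=float)
    y_pred = np.empty_like(y)
    for train, test in KFold(n_splits=n_segments).split(X):
        pls = PLSRegression(n_components=n_factors)
        pls.fit(X[train], y[train])
        y_pred[test] = pls.predict(X[test]).ravel()
    return float(np.sqrt(np.mean((y_pred - y) ** 2)))

# pick the number of factors giving the smallest cross-validated RMSEP
# rmseps = {k: cv_rmsep(X, y, k) for k in range(1, 11)}
```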

Table 6. Lack of Fit (Expressed in Terms of RMSEC) as a Function of the Number of Latent Variables Included in the PLS Model

no. of latent variables    lack of fit (RMSEC)
1                          8.2802
2                          5.3940
3                          3.7448
4                          3.0079
5                          2.2152
6                          1.6333
7                          1.3884
8                          1.2828
9                          1.2596
10                         1.1918

In analogy with the detailed study of each MLR variable, we have tried to see whether we could interpret the PLS factors in the model. As said earlier, the PLS model was determined to be optimal with seven factors (the same number as for MLR). In Table 6, the lack of fit corresponding to each PLS model is presented.

Figure 5. (a) One-factor PLS model, y-predicted versus y-observed; (b) PLS1 score plot, factor 2 versus factor 1; (c) two-factor PLS model, y-predicted versus y-observed.

The plots comparing y-predicted with y-observed for the different PLS models are shown in Figure 5. With the one-factor PLS model (Figure 5a), the objects are grouped along two lines, corresponding more or less to the two main clusters (except for objects 23, 83, and 84). This is similar to the one-variable MLR model. The PLS score plot of factor 1 versus factor 2, presented in Figure 5b, shows that the second factor separates the two clusters, so that the two-factor model no longer distinguishes them (Figure 5c). Again, this is similar to the role of the 1800-nm variable in MLR. The third factor included in the model has the general effect of improving the overall fit, as for MLR. Factors four to seven improve the general fit and take a few outliers into account. Therefore, the behavior of the variables in MLR and of the latent variables in PLS is remarkably similar: not only do both models have seven variables, but the first variables in each case have a very similar role.

CONCLUSION

The MLR solution is very similar to the PLS solution (same number of components, same role of the components of the model). The question is then which model to prefer. Earlier results in the literature for food analysis9 and the present results seem to indicate that MLR is more precise than full-spectrum PLS. This must be due to the inclusion of nonrelevant wavelengths in the PLS solution. It is probable that similar results would be obtained by variable selection on the PLS model, using methods such as the selection of wavelengths with important loadings on the first factors, or the same genetic algorithm, as it has been shown that variable selection can improve the quality of PLS solutions. The MLR solutions are certainly easier to interpret for a person who is not proficient in interpreting loading plots. Some other possible advantages, which were however not tested here, are described in the introduction. A disadvantage might be the collinearity problem. It is demonstrated here that the MLR solutions select collinear wavelengths, but that this does not affect the quality of prediction. It is to be feared, however, that the MLR solution might be less good at predicting objects situated outside the boundaries of the calibration model. Also, single wavelengths might be more prone to shift, and therefore MLR solutions might be less robust in time. To avoid this, we are now developing a version of the algorithm which selects bands of wavelengths instead of single wavelengths.

ACKNOWLEDGMENT

D.J.-R. thanks B. Walczak and J. Smeyers-Verbeke for constructive discussions and valuable remarks. Theo Meier is also thanked for collecting the data used in this research.

Received for review June 7, 1995. Accepted September 18, 1995. Abstract published in Advance ACS Abstracts, November 1, 1995.

AC9505536
