Anal. Chem. 1988, 60, 2756-2760
Data Reduction of Bilinear Matrices Prior to Calibration
Jerker Ohman*
Department of Analytical Chemistry, University of Umeå, S-901 87 Umeå, Sweden
Walter Lindberg
National Swedish Laboratory for Agricultural Chemistry, Box 5097, S-900 05 Umeå, Sweden
Svante Wold
Research Group for Chemometrics, University of Umeå, S-901 87 Umeå, Sweden
A partial least squares (PLS) calibration model is used for quantitative analysis of bilinear data. In order to reduce the number of independent variables (X block) of the PLS model, the first few principal components for each chromatogram were calculated and the resulting scores were used as X variables. The actual number of principal components was determined by cross validation. Application of the method is demonstrated by using both simulated and real data, the latter from an analysis of phenanthrene and anthracene mixtures by high-performance liquid chromatography with an ultraviolet-diode array detector. Results show good predictive properties and stability compared with normal PLS.
INTRODUCTION
The introduction of multivariate detectors (e.g., diode array detectors, scanning electrochemical detectors, etc.) in high-performance liquid chromatography (HPLC) has given the analytical chemist new ways to deal with the problem of overlapped peaks. If there are no selective wavelengths (potentials or other types of parameters) where the eluting analytes can be detected without interference from coeluting compounds, then two distinct approaches to making a quantitative analysis may be considered. The first possibility is to numerically resolve the peaks and then use the resulting peak heights or integrals of the resolved peaks. Alternatively, one can identify each of the coeluting compounds and perform multivariate calibration by using these compounds. To this end, partial least squares regression (PLS) (1) has proven to give better predictions than multiple linear regression (2).

During the past few years, several methods to resolve overlapping peaks (usually based on rotation of principal components) have been presented (3). This approach is, however, not without difficulties, as experience from our laboratory has shown (4). Furthermore, a decrease in precision and accuracy can be expected as compared to multivariate calibration, since the latter uses the additional information given by the concentrations in the calibration set. Thus, if the individual components in an unresolved peak can be analyzed as standard samples, a multivariate calibration should be used. One problem with this latter approach is, however, the amount of data that must be handled.
If, as in the example that follows, the calibration set consists of nine chromatograms and the duration of the unresolved peak of interest is 30 s, then, for a sampling rate of 1 spectrum per second recorded between 200 and 300 nm with a resolution of 2 nm, the X matrix in the calibration set would have 30 × 50 = 1500 variables (columns) in each of nine sample rows, giving a total of 13 500 elements (henceforth called unreduced data). Handling this amount of data on a microcomputer system is cumbersome, and so reducing the amount of data without significant loss of information is clearly desirable in order to make the multivariate calibration approach practically acceptable.
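The size of the unreduced X matrix quoted above can be checked with a minimal NumPy sketch. The array name `chromatograms` and the zero-filled placeholder data are assumptions made only for illustration; the dimensions are the ones given in the text.

```python
import numpy as np

n_samples = 9        # chromatograms in the calibration set
n_spectra = 30       # 30 s peak sampled at 1 spectrum per second
n_wavelengths = 50   # 200-300 nm at 2 nm resolution, as counted in the text

# Placeholder data (hypothetical): one wavelengths-by-spectra matrix per chromatogram.
chromatograms = np.zeros((n_samples, n_wavelengths, n_spectra))

# Unfolding: every chromatogram is flattened into one row of X.
X_unreduced = chromatograms.reshape(n_samples, -1)

print(X_unreduced.shape)  # (9, 1500)
print(X_unreduced.size)   # 13500
```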
DATA REDUCTION
A chromatogram recorded with an array detector can be organized as a matrix, H, containing i successive spectra (columns) and k chromatographic profiles (rows). Fortunately, this matrix can be decomposed into a small number of principal components. In the ideal case (e.g., linear detectors, no interaction, structureless noise, no drift, etc.) the number of principal components is equal to the number of compounds in the unresolved peak. The scores from the principal component analyses (PCA) of each chromatogram can then be used as the X variables in a PLS calibration (henceforth called reduced data), thereby reducing the X matrix variables (columns) from 1500 in the example above to 2 × 50 or 100 (in the case of two components). A schematic representation of the two methods (PLS with and without data reduction) can be seen in Figure 1.

Each chromatographic profile in a chromatogram can be thought of as a point in a multidimensional coordinate system where each axis represents one retention time. The principal components then represent the plane where the chromatographic profiles lie, and the scores represent the coordinates for each chromatographic profile in this plane. Thus, PCA can be seen as a transformation of data from i space to a subspace of lower dimensionality.

The approach used here is unorthodox in the respect that new principal components are calculated for each chromatogram, which means that the scores which are later used as X variables in the PLS calibration are coordinates from different subspaces of i space and cannot, therefore, formally be compared. It is, however, a reasonable assumption that the orientation of the principal components in i space varies in a regular way with the concentrations of the compounds; i.e., the scores contain information about the concentrations such that comparison of scores from different chromatograms is possible.

A good calibration method should be able to perform at least two tasks.
Primarily, of course, accurate predictions of the concentrations should be obtained, but the method should also be able to detect outliers, i.e., samples that differ from the calibration model in any way that makes the predictions uncertain. In the case of chromatography, this usually results from the presence of unknown constituents in the sample that coelute with the compounds calibrated for. One of the advantages of PLS is that the detection of outliers is very simple and straightforward, but since the method described here involves X variables which represent different properties in different samples, it is essential in the present context to investigate whether outliers can still be detected and whether predictions are stable.
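The data reduction step described in this section can be sketched as follows. This is only an illustration under stated assumptions: the PCA is computed through a singular value decomposition (the text does not say which algorithm was used), and the spectra, elution profiles, and concentrations are invented, noise-free placeholders. The helper names `pca_scores` and `chromatogram` are hypothetical. Sign alignment between chromatograms is not handled in this sketch.

```python
import numpy as np

def pca_scores(H, n_components):
    """Scores P (k by f) of an uncentered PCA of H (k rows, i columns)."""
    # SVD: H = U S V'. The loadings are the columns of V (orthonormal),
    # and the scores are P = U S (mutually orthogonal columns).
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
k, i, f = 50, 30, 2   # wavelengths (rows), time points (columns), components

# Hypothetical pure spectra (k by 2) and Gaussian elution profiles (2 by i).
spectra = rng.random((k, 2))
t = np.arange(i)
profiles = np.vstack([np.exp(-0.5 * ((t - c) / 3.0) ** 2) for c in (12.0, 17.0)])

def chromatogram(conc):
    """Bilinear, noise-free chromatogram for two concentrations."""
    return spectra @ (np.asarray(conc)[:, None] * profiles)

# One row of reduced X per chromatogram: the k-by-f score matrix, unfolded.
concs = [(1.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
X_reduced = np.vstack([pca_scores(chromatogram(c), f).ravel() for c in concs])

print(X_reduced.shape)  # (3, 100)
```

With two compounds and no noise, each simulated chromatogram has rank two, so two components capture all of its variation; with real data the number of components would have to be chosen, for example by cross validation as stated in the abstract.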
0003-2700/88/0360-2756$01.50/0 © 1988 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 60, NO. 24, DECEMBER 15, 1988
Figure 2. Chromatographic profiles for 42.6 ppm phenanthrene and 23.8 ppm anthracene at 250 nm.
Figure 1. Schematic representation of the two calibration methods. For unreduced data, each chromatogram is simply unfolded to give wavelengths times spectra variables in X. With data reduction, the scores, P, are calculated for each chromatogram and this matrix is, in turn, unfolded.

PCA AND PLS
Since PCA and PLS have been extensively described elsewhere (1, 5), only a brief resume will be given here. In the following we use boldface capital letters for matrices, primes for transposed matrices, boldface lowercase characters for vectors, and italic lowercase letters for scalars. A vector entirely composed of ones is denoted by $\mathbf{1}$.

PCA. In principal component analysis a matrix $\mathbf{H}$ (k rows and i columns) is modeled, for f principal components, by an optional vector $\mathbf{a}_h$ (i elements), which consists of the average of each column in $\mathbf{H}$, and two matrices, $\mathbf{P}$, the scores (k by f), and $\mathbf{Q}$, the principal component loadings (i by f), which describe the systematic part of $\mathbf{H}$

$$\mathbf{H} = \mathbf{1}\mathbf{a}_h' + \mathbf{P}\mathbf{Q}' + \mathbf{E}_h$$

where $\mathbf{E}_h$ is the unexplained part of $\mathbf{H}$. The principal components ($\mathbf{P}$ and $\mathbf{Q}$) are calculated in such a way that the first component describes the single greatest contribution to the variance in $\mathbf{H}$, the second the largest portion of the remaining variance, and so on. The principal component loadings are orthonormal and the scores are orthogonal. In the present study the chromatograms were not centered; i.e., the vector $\mathbf{a}_h$ was not used.

PLS. In PLS two matrices $\mathbf{X}$ (n by m) and $\mathbf{Y}$ (n by j) are modeled in a similar way as in PCA, but a relation between the scores of each matrix is established

$$\mathbf{X} = \mathbf{1}\mathbf{a}_x' + \mathbf{T}\mathbf{B}' + \mathbf{E}_x$$

$$\mathbf{Y} = \mathbf{1}\mathbf{a}_y' + \mathbf{U}\mathbf{C}' + \mathbf{E}_y$$

$$\mathbf{U} = \mathbf{T}\mathbf{D} + \mathbf{E}_u$$

where $\mathbf{a}_y$ (j elements) and $\mathbf{a}_x$ (m elements), $\mathbf{U}$ and $\mathbf{T}$ (n by f), $\mathbf{C}$ (j by f) and $\mathbf{B}$ (m by f), and $\mathbf{E}_y$ and $\mathbf{E}_x$ are similar to $\mathbf{a}_h$, $\mathbf{P}$, $\mathbf{Q}$, and $\mathbf{E}_h$, respectively. $\mathbf{D}$ is a diagonal matrix that contains the regression coefficients between corresponding columns in $\mathbf{U}$ and $\mathbf{T}$. In calibration problems, the matrix $\mathbf{X}$ contains the measured variables (columns) for each sample (rows) and $\mathbf{Y}$ contains the concentrations of the j compounds in the n samples calibrated for. The concentrations in an unknown sample (s) are obtained by

$$\mathbf{t}_s' = (\mathbf{x}_s' - \mathbf{a}_x')\mathbf{B}$$

$$\mathbf{y}_s' = \mathbf{t}_s'\mathbf{D}\mathbf{C}' + \mathbf{a}_y'$$

The standard deviation of the residual matrix $\mathbf{E}_x$ can be compared to the standard deviation of the residuals $\mathbf{e}_s$, where $\mathbf{e}_s$ is

$$\mathbf{e}_s' = \mathbf{x}_s' - \mathbf{a}_x' - \mathbf{t}_s'\mathbf{B}'$$

by means of a simple F test to determine the probability that the sample is an outlier.
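The prediction and residual equations above can be illustrated with a small NumPy sketch. Everything numerical here is an assumption made for the example: the model matrices are random placeholders rather than the output of a PLS fit, B is made orthonormal so that projecting onto B recovers the scores exactly (a full PLS algorithm generally uses a separate weight matrix for this step), and both the divisor `m - f` and the calibration residual standard deviation `s_cal` are hypothetical.

```python
import numpy as np

def predict(x_s, a_x, a_y, B, C, D):
    """Scores, predicted concentrations, and X residual for one sample."""
    t_s = (x_s - a_x) @ B          # t_s' = (x_s' - a_x') B
    y_s = t_s @ D @ C.T + a_y      # y_s' = t_s' D C' + a_y'
    e_s = x_s - a_x - t_s @ B.T    # e_s' = x_s' - a_x' - t_s' B'
    return t_s, y_s, e_s

rng = np.random.default_rng(1)
m, j, f = 100, 2, 2                # X variables, compounds, components

# Hypothetical model (not fitted): B orthonormal by construction.
B, _ = np.linalg.qr(rng.standard_normal((m, f)))
C = rng.standard_normal((j, f))
D = np.diag([1.5, 0.7])
a_x = rng.standard_normal(m)
a_y = np.array([10.0, 20.0])

t_true = np.array([0.3, -1.2])
x_ok = a_x + t_true @ B.T                      # sample that fits the model
x_out = x_ok + 0.5 * rng.standard_normal(m)    # sample with unmodeled variation

t_ok, y_ok, e_ok = predict(x_ok, a_x, a_y, B, C, D)
_, _, e_out = predict(x_out, a_x, a_y, B, C, D)

# Variance ratio that an F test would be based on.
s_cal = 0.05                                   # hypothetical calibration residual SD
s_out = np.sqrt(e_out @ e_out / (m - f))       # assumed divisor (m - f)
F_out = (s_out / s_cal) ** 2
print(F_out > 1.0)
```

The ratio `F_out` would then be converted to a probability with the F distribution (for instance `scipy.stats.f.sf`, not used here); the appropriate degrees of freedom are not given in the text and would have to be chosen by the analyst.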
SIGN IDENTIFICATION
There is one problem with reduced data that must be dealt with, namely that the principal component solution is ambiguous regarding the signs of the scores and loadings; i.e., if both the scores and loadings are multiplied by -1, the solution remains unchanged. If, however, the scores are to be used as X variables, it is necessary that they point in the "same" direction. To solve this, the scores from one of the calibration set chromatograms containing equal amounts of all compounds were compared with the scores of the other chromatograms. If the scores from the norm chromatogram are denoted by $\mathbf{P}_n$ and those of the sample by $\mathbf{P}_s$, then for each component and sample chromatogram s calculate

$$z = \mathbf{p}_n'\mathbf{p}_s$$

If $z < 0$, then $\mathbf{p}_s = -1 \cdot \mathbf{p}_s$.

DATA SETS
To test the predictive and outlier detection properties of the data reduction scheme compared to "normal" PLS, three data sets (one measured and two simulated) were analyzed with and without prior data reduction. Part of each data set was used to calculate a PLS calibration model (the calibration set), which was then used to predict the concentrations in the rest of the data set (the test set). The predicted concentrations were then subtracted from the true concentrations to obtain the prediction error.

The first data set (henceforth called ANFE) consisted of 21 chromatograms, of which 9 were used in the calibration set. Each chromatogram was measured on a mixture of anthracene and phenanthrene ranging in concentration from 0 to 23.8 ppm for anthracene and 0 to 42.6 ppm for phenanthrene. Figure 2 shows the chromatographic profiles, Figure 3 the spectra, and Figure 4 a typical chromatogram.

The other two data sets consisted of simulated data (henceforth called SIM1 and SIM2). Each chromatogram comprised two peaks that were generated by multiplying a normally distributed chromatographic profile by a spectrum.
Figure 3. Spectra of anthracene and phenanthrene.
Figure 4. A typical chromatogram from the data set ANFE: 17.6 ppm anthracene and 27.4 ppm phenanthrene.
Figure 5. A typical "chromatogram" from the data set SIM1, equal "amounts" of the two constituents.
Each peak was then multiplied by a "concentration" and added to give one sample matrix. Noise was introduced by randomly altering the position of the peak maxima of the chromatographic profiles, shifting the baseline, and finally adding normally distributed noise. A typical chromatogram from the data set SIM1 can be seen in Figure 5. SIM1 and SIM2 each consisted of 24 chromatograms, of which 9 were used in the calibration set. To test the sensitivity to outliers, another 16 chromatograms containing a third "unknown" peak were generated for each of the two simulated data sets. The unknown peak was added between the two others in eight of these "chromatograms" and directly under the "left" peak for the remaining eight. The prediction error sum of squares (PRESS) and estimated probability that these chromatograms were different from the calibration set were calculated. The probability was estimated by comparing the pooled residual standard deviations from the test set with the residual standard deviation of each chromatogram. The F probability was calculated by using a method originally proposed by Dorrer (see ref 6).
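The simulation recipe and the sign identification step described above can be sketched together as follows. This is a loose illustration, not the procedure actually used for SIM1 and SIM2: the spectra, peak positions, widths, and noise levels are all hypothetical, the baseline shift is modeled as a single constant offset, and the PCA is computed through an SVD. The function names `simulate`, `scores`, and `align_signs` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_wl, n_t = 50, 30                     # wavelengths (rows), time points (columns)
time = np.arange(n_t)

# Hypothetical smooth "spectra" for the two constituents (not measured data).
wl = np.linspace(0.0, 1.0, n_wl)
spectra = np.stack([np.exp(-((wl - 0.3) / 0.15) ** 2),
                    np.exp(-((wl - 0.6) / 0.20) ** 2)])

def simulate(conc, centers=(12.0, 17.0), width=3.0,
             jitter=0.2, baseline=0.01, noise=0.001):
    """One simulated chromatogram (n_wl by n_t) for two "concentrations"."""
    H = np.zeros((n_wl, n_t))
    for c, center, spectrum in zip(conc, centers, spectra):
        shift = rng.normal(0.0, jitter)            # random peak-position error
        profile = np.exp(-0.5 * ((time - center - shift) / width) ** 2)
        H += c * np.outer(spectrum, profile)       # spectrum times elution profile
    H += rng.normal(0.0, baseline)                 # constant baseline shift
    H += rng.normal(0.0, noise, size=H.shape)      # normally distributed noise
    return H

def scores(H, f=2):
    """Uncentered PCA scores (n_wl by f) via the SVD."""
    U, s, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :f] * s[:f]

def align_signs(P_s, P_n):
    """Flip each score vector of P_s whose inner product with P_n is negative."""
    z = np.sum(P_n * P_s, axis=0)                  # z = p_n' p_s, one per component
    return P_s * np.where(z < 0, -1.0, 1.0)

P_norm = scores(simulate((1.0, 1.0)))              # equal "amounts": the norm
samples = [(0.5, 1.0), (1.0, 0.5), (0.8, 0.8)]
X_reduced = np.vstack([align_signs(scores(simulate(c)), P_norm).ravel()
                       for c in samples])
print(X_reduced.shape)  # (3, 100)
```

After alignment, every score vector has a non-negative inner product with the corresponding score vector of the norm chromatogram, which is the condition stated in the sign identification step.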
EXPERIMENTAL SECTION
Apparatus. The chromatographic system consisted of a Constametric III pump (Laboratory Data Control), a Rheodyne 7125 injection valve with a 20-μL injection loop, and an LKB 2140 diode array detector. For separation a 150 × 4 mm Nucleosil C18 column with 5-μm particle diameter was used. Data recording and manipulation were provided by an Ericsson-PC microcomputer with LKB software. The stored data were analyzed by the standard software package of the detector and software developed at this laboratory.

Chemicals. Methanol, anthracene, and phenanthrene were prepared and chromatographed with a 90/10 (v/v) methanol/water eluent. All chemicals used were of PA quality.

RESULTS AND DISCUSSION
The standard deviations of the prediction errors for the two methods and the three data sets are listed in Table I. As can be seen, reduced data actually gives better predictions than unreduced data, the first Y variable of the data set ANFE being the only exception. This favorable result is probably due to the fact that data reduction corresponds to "feeding" additional information to the calibration, namely that the concentration-related variation in the chromatograms is bilinear, i.e., that it can be expressed as chromatographic profiles multiplied by spectra. Another possibility is that data reduction eliminates noise due to drift in retention times. The scores are a linear combination of the spectra of the eluting compounds and are not much affected by a small change in retention time.

Table I. Standard Deviations of Prediction Errors, See Discussion^a

data set         method       Y1         Y2
ANFE (f = 11)    unreduced    0.29009    0.80458
                 reduced      0.36343    0.57650
SIM1 (f = 15)    unreduced    0.00477    0.00614
                 reduced      0.00245    0.00217
SIM2 (f = 15)    unreduced    0.00364    0.00324
                 reduced      0.00190    0.00194

^a f means degrees of freedom. For the ANFE data set, units are parts per million; Y1 represents phenanthrene and Y2 represents anthracene.

Figure 6 shows the prediction error sum of squares (PRESS) as a function of the volume of the interfering third peak for the data sets SIM1 and SIM2. It is evident that the prediction deterioration is not the same for the two methods and that neither method shows better predictive properties in all cases. Figure 7 is a plot of the logarithm of the probability that the sample chromatogram is of the same kind as the calibration set versus the volume of the interfering third peak. Closer inspection reveals that the rejection of outliers is sharper for unreduced data, but if only the region down to the rejection limit (which usually is 1%, or -2 in Figure 7) is considered, the difference between the two methods is slight. Above all, Figures 6 and 7 show that calibration with reduced data indeed handles outliers just as well as without data reduction. In fact, data reduction gives smoother and more monotone curves for both prediction deterioration and outlier rejection. To summarize, the data reduction scheme proposed here seems to give good and stable predictions as well as robust handling of outliers.

Figure 6. Prediction error sum of squares in arbitrary units vs the volume of the interfering peak expressed as percent of the total peak volume. Panels: SIM1 and SIM2, each with the interferent between the peaks and under the "left" peak.

Figure 7. Logarithm of the probability that a chromatogram results from the same kind of sample as used in the calibration set vs the volume of the interfering peak expressed as percent of the total peak volume. Panels: SIM1 and SIM2, each with the interferent between the peaks and under the "left" peak.
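The two summary statistics used in this discussion, the standard deviation of the prediction errors and PRESS, can be written out in a few lines. The numbers below are invented for illustration and are unrelated to Table I; the use of `ddof=1` in the standard deviation is an assumption, since the text does not state which divisor was used.

```python
import numpy as np

# Hypothetical true and predicted concentrations (rows: samples, columns: Y1, Y2).
y_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y_pred = np.array([[1.1, 1.9], [1.9, 4.2], [3.0, 5.9]])

# Prediction error: predicted concentrations subtracted from the true ones.
errors = y_true - y_pred

# Prediction error sum of squares over all samples and Y variables.
press = float(np.sum(errors ** 2))

# Standard deviation of the prediction errors, one value per Y variable
# (ddof=1 is an assumed choice of divisor).
error_sd = np.std(errors, axis=0, ddof=1)

# A 1% rejection limit on a log10 probability scale.
log_limit = np.log10(0.01)

print(press, error_sd, log_limit)
```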
LITERATURE CITED
(1) Wold, S.; Geladi, P.; Esbensen, K. J. Chemom. 1987, 1, 41.
(2) Otto, M.; Wegscheider, W.; Lankmayr, E. P. Anal. Chim. Acta 1985, 171, 13.
(3) Vandeginste, B. G. M.; Leyten, F.; Gerritsen, M.; Noor, J. W.; Kateman, G.; Frank, J. J. Chemom. 1987, 1, 57.
(4) Lindberg, W.; Ohman, J.; Wold, S. Anal. Chem. 1986, 58, 299.
(5) Wold, S.; Esbensen, K.; Geladi, P. Chemom. Intell. Lab. Syst. 1987, 2, 37.
(6) Kennedy, J. W.; Gentle, J. E. Statistical Computing; Marcel Dekker: New York, 1980; p 112.
RECEIVED for review December 23, 1987. Accepted September 22, 1988.
Sequential Determination of Biological and Pollutant Elements in Marine Bivalves
Rolf Zeisler* and Susan F. Stone
Center for Analytical Chemistry, National Institute of Standards and Technology, Gaithersburg, Maryland 20899
Ronald W. Sanders
Pacific Northwest Laboratory, Richland, Washington 99352
A unique sequence of instrumental methods has been employed to obtain concentrations for 44 elements in marine bivalve tissue. The techniques used were (1) X-ray fluorescence, (2) prompt gamma activation analysis, and (3) neutron activation analysis. It is possible to use a single subsample and follow it nondestructively through the three instrumental analysis techniques. A final radiochemical procedure for tin was also applied after completing the instrumental analyses. Comparison of results for elements determined by more than one technique in sequence showed good agreement, as did results from certified reference material samples analyzed along with the samples. The concentrations found in the bivalve samples ranged from carbon at more than 50% dry weight down to gold at several micrograms per kilogram.
When studying the elemental composition of biological and environmental samples, analytical chemists are frequently confronted with the limitations of a particular analytical technique when applied to unique or small samples. Generally, any analytical technique or even several techniques can only determine a fraction of the elements contained in the sample. It can be assumed that most of the elements in the periodic table occur at various levels in every biological or environmental matrix; therefore, an analytical technique or combination of procedures that covers all elements may be desirable. This goal, on the one hand, is not attainable because of the cost involved, the analytical sophistication and skill required, and frequently the amount of sample necessary to determine low levels of trace elements or a sizable number of elements in destructive analysis procedures. On the other hand, knowledge about the role and fate of trace elements is limited to about 30 elements. These include the major and minor constituents in biological matrices, mineral elements, essential trace elements, and a small number of elements that have known adverse effects in biological and environmental systems at trace levels. Commonly, all of these elements are not determined simultaneously or even nearly completely in several aliquots of a given material, let alone the elements that are rarely or never determined. Consequently, the majority of the elements are little-known and are not considered in
0003-2700/88/0360-2760$01.50/0 © 1988 American Chemical Society