
Anal. Chem. 1987, 59, 1890-1896

Selection of Representative Wavelength Sets for Monitoring in Liquid Chromatography with Multichannel Ultraviolet-Visible Detection

F. Vincent Warren, Jr.,* and Brian A. Bidlingmeyer
Waters Chromatography Division of Millipore Corporation, 34 Maple Street, Milford, Massachusetts 01757

Michael F. Delaney
Department of Chemistry, Boston University, 590 Commonwealth Avenue, Boston, Massachusetts 02215

Four alternatives for the selection of representative wavelength sets from a collection of 101 ultraviolet-visible (UV-vis) spectra are compared, with a view toward applications in liquid chromatography with multichannel UV-vis detection. The cumulative information content, corrected for correlations between selections, is used to evaluate the set of 10 representative wavelengths selected by each procedure. Cluster analysis is a more effective selection approach than the use of a criterion which balances information content with the average correlation to previously selected wavelengths. The key set factor analysis technique provides the best overall performance.

The use of programmable, multichannel ultraviolet-visible (MUV) detectors for liquid chromatography (LC) has increased steadily during recent years. The capability of these detectors for the convenient measurement of spectra, absorbance ratios, and multiple chromatograms offers new possibilities for the identification and quantitation of solutes eluting from a chromatographic column. For example, recent publications have discussed the use of multichannel detectors for the deconvolution of overlapped peak profiles (1-3), as well as the identification of chromatographic peaks on the basis of absorbance ratio plots (4-6), manual interpretation of spectra (7), and computerized searching of a stored reference library (8).

Many applications of MUV detectors in liquid chromatography (LC-MUV) require that a small set of representative wavelengths be selected for monitoring. Thus, Osten and Kowalski (1) monitored six wavelengths in their application of self-modeling curve resolution for the deconvolution of overlapped chromatographic peaks. Drouen et al. (9) showed the value of plotting multiple chromatograms at selected wavelengths as an aid to recognizing the number of overlapping peaks which contribute to a complex elution profile. The monitoring of absorbance ratios (4-6, 10-13) clearly requires the judicious selection of a limited set of wavelengths as well.

There is a need for an objective approach for choosing an optimal set of representative wavelengths, particularly one that can be rapidly and easily executed and that gives results which agree with the conclusions of a trained spectroscopist. At present, the process of selecting a representative wavelength set is often approached subjectively, on the basis of a visual inspection of available spectra (9) or by reference to tabulated spectral properties of the solutes of interest.
In some cases, wavelengths that have conveniently been available with fixed-wavelength detectors (e.g., 254 or 280 nm) are selected to facilitate comparison with existing ratio data (11-13). Frequently, the rationale for the selection of wavelengths to be monitored is not discussed at all in literature reports.

BACKGROUND AND THEORY

When spectra are available for the components of interest, the process of selecting a set of representative wavelengths is simplified. In single-component spectroscopic analyses, a well-established practice (14, 15) involves the selection of the wavelength of maximal absorbance for the analyte. For sufficiently simple multicomponent cases, this practice may continue to be useful. The selection of absorbance maxima was advised by Yost et al. in their early work on absorbance ratioing (10). Unfortunately, for spectrally similar compounds the absorbance maxima may not be useful choices, and it may be necessary to look for other wavelengths at which the spectra are maximally dissimilar (15). For two-component analyses, this can be approached by plotting a ratio spectrum (14). For more complex situations, an appropriate measure of the variation among the spectra can be calculated at each wavelength. This agrees with the approach of Drouen et al. in selecting “wavelengths where the absorbance shows a large variation” (9).

The selection of wavelengths is of direct importance for multicomponent spectrophotometric analyses (MSA), and the needs in this area should be closely related to those of LC-MUV. Several groups have considered the influence of wavelength selection on MSA results. For purposes of quantitation, it is generally agreed that increasing the number of wavelengths to be monitored has a beneficial effect on accuracy and precision, up to a point. Sustek's work (16) indicates that a 3- to 4-fold overdetermination is desirable. With regard to the selection of specific wavelengths for monitoring, some works advocate the use of evenly spaced wavelengths (17-19), while others (16, 20, 21) attempt to achieve desired levels of accuracy and precision by selecting a small set of selective wavelengths. For qualitative purposes of peak recognition in LC-MUV, we expect the latter approach, based on selective wavelengths, to be most valuable.
The problem of selecting a representative set of wavelengths is an example of the general problem of “preferred set” selection (22), which analytical chemists have most often encountered in the areas of chromatography and spectroscopy. For example, Eskes et al. (23) used cluster analysis in combination with a criterion derived from information theory to select the most informative set of gas chromatographic stationary phases. De Clerq and Massart (24) took a similar approach in determining an optimal set of eluents for TLC analyses. Interest in the storage of efficiently coded reduced spectral representations for computerized library searching has led to additional needs for the selection of optimal subsets. Thus, van Marlen and Dijkstra (25) considered both correlation and information content in selecting 120 of 300 available masses per spectrum such that the information contained in a library of mass spectra was completely retained.

© 1987 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 59, NO. 15, AUGUST 1, 1987

Chemists have approached the preferred set selection problem in several ways, as indicated in recent reviews (22, 26). Each strategy must address both the selection of candidate sets of wavelengths and the quantitative evaluation of the quality of each set. The selection aspect is necessary to avoid an exhaustive search of all possible wavelength sets of a given size. While an exhaustive search will ensure that the optimal set is found, the time required for such a search is prohibitive in all but very limited cases. For this reason, most approaches employ a sequential selection process which begins with the selection of a single wavelength and then adds further choices until a user-defined limit is reached or the criterion for quality levels off. Others (27) have taken the opposite approach, dropping wavelengths sequentially from the total set until an optimal subset is found.

Several criteria have been used to evaluate the quality of candidate sets. Many early efforts (28-30) relied on calculation of the information content according to Shannon’s formula (31). For cases in which the influence of correlation between selections can be neglected, direct application of the Shannon equation is reasonable, since the total information content for a set of selections is simply the sum of the individual information contents. When correlation becomes significant, however, a simple summation of this sort provides an upper limit but may significantly overestimate the overall information content (22). For UV-vis spectra a strong correlation exists between intensity values in adjacent wavelength channels, and this must be considered in the application of information theory. An equation derived by Dupuis and Dijkstra (32) for the calculation of total information content explicitly corrects for correlation among the selections and should be applicable to the present studies.
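To give a sense of scale for the exhaustive search mentioned above, the following minimal Python sketch counts the candidate sets for the case considered in this paper (10 wavelengths chosen from 91 channels) and compares that count with the number of criterion evaluations in a forward sequential search. The function names are hypothetical, and the sequential count assumes that every remaining channel is scored exactly once per step.

```python
import math


def n_candidate_sets(n_channels: int, set_size: int) -> int:
    """Number of distinct wavelength sets of a given size (order ignored)."""
    return math.comb(n_channels, set_size)


def n_sequential_evaluations(n_channels: int, set_size: int) -> int:
    """Criterion evaluations needed by a forward sequential search.

    Assumption: at step k (k = 0 .. set_size - 1) each of the
    n_channels - k channels not yet selected is scored once.
    """
    return sum(n_channels - k for k in range(set_size))


if __name__ == "__main__":
    exhaustive = n_candidate_sets(91, 10)
    sequential = n_sequential_evaluations(91, 10)
    print(f"exhaustive search: {exhaustive:.2e} candidate sets")
    print(f"sequential search: {sequential} criterion evaluations")
```

The gap between the two counts is the practical argument for sequential selection, at the price of no longer guaranteeing that the optimal set is found.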
Another common criterion finds its roots in Kaiser’s concept of “selectivity” (33, 34) for the absorptivity matrix used in MSA calibration

$$ d = A\,c \tag{1} $$

In eq 1, d is an absorbance data vector (i.e., mixture spectrum), c is a concentration vector, and A is the matrix of molar absorptivities for the compounds to be analyzed. According to Kaiser, a set of wavelengths with high selectivities for each sample component would lead to an absorptivity matrix (A) having large diagonal elements and low off-diagonal elements. For a square matrix A, the selectivity can be monitored by calculation of the determinant of A. A high selectivity corresponds to a large value for the determinant, giving a criterion which should be maximized in searching for an optimally selective set of wavelengths. To extend Kaiser’s formulation to the case of a nonsquare (i.e., overdetermined) absorptivity matrix, Junker and Bergmann (27) calculated the determinant of the covariance matrix formed by premultiplication of A with its transpose (A’) and demonstrated this criterion to be equivalent to that of Kaiser. More recently, Jochum et al. (20) and others (19, 35) have used the condition number of A as a measure of selectivity. Kalivas (35) suggested that the circumstances that maximize the determinant of A (or A’A) will also minimize the condition number.

In this paper, several alternative procedures for wavelength selection are surveyed to determine their utility for use in conjunction with LC-MUV. The first procedure is that of Dupuis and Dijkstra (32), in which selections are made sequentially by application of the ΔI criterion, which balances the information contribution of each prospective choice against its correlation with previous choices. The total information content, corrected for correlations, is calculated after each selection as a cumulative measure of quality. The second procedure is closely related, substituting cluster analysis for the ΔI selection procedure. The third procedure follows the approach of Junker and Bergmann (27), sequentially discarding the worst choices until a small preferred set remains. The quality criterion in this case is the determinant of the covariance matrix (|A’A|). The final procedure is Malinowski’s key set factor analysis (KSFA) technique (36), which locates a mutually orthogonal set of rows (or columns) of the absorbance data matrix. KSFA has previously been applied to select wavelengths which would facilitate isolation of pure spectra from a mixture spectrum (37), and it was expected that wavelengths selected by KSFA might also be useful for peak recognition in LC-MUV.
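The determinant-based selectivity criterion and the sequential discarding of wavelengths described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN implementation: A is stored as a list of rows (one row per wavelength, one column per component), the function names and the toy absorptivity values are hypothetical, and the determinant is computed by plain Gaussian elimination with partial pivoting.

```python
def gram_determinant(rows):
    """Return det(A'A), where A is a list of rows (wavelengths x components)."""
    n = len(rows[0])
    # Form the n x n matrix A'A by summing products over the wavelength rows.
    g = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    det = 1.0
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(g[r][col]))
        if abs(g[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: no selectivity left
        if pivot != col:
            g[col], g[pivot] = g[pivot], g[col]
            det = -det
        det *= g[col][col]
        for r in range(col + 1, n):
            factor = g[r][col] / g[col][col]
            for c in range(col, n):
                g[r][c] -= factor * g[col][c]
    return det


def sequential_rejection(a, keep):
    """Drop one wavelength (row) at a time, retaining the subset whose
    det(A'A) is largest, until `keep` wavelengths remain."""
    kept = list(range(len(a)))
    while len(kept) > keep:
        candidates = ([i for i in kept if i != drop] for drop in kept)
        kept = max(candidates, key=lambda idx: gram_determinant([a[i] for i in idx]))
    return kept


if __name__ == "__main__":
    # Hypothetical absorptivities of two components at four wavelengths.
    wavelengths = [220, 250, 280, 310]
    a = [[1.0, 0.0], [0.8, 0.2], [0.5, 0.5], [0.0, 1.0]]
    print([wavelengths[i] for i in sequential_rejection(a, keep=2)])
```

In this toy case the two wavelengths at which each component absorbs alone survive the rejection. When the columns of A are strongly correlated, `gram_determinant` returns zero for every subset and the criterion can no longer rank the candidates.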

EXPERIMENTAL SECTION

A spectral library consisting of 220 spectra was digitized from a published atlas of UV-vis spectra (38). Every fifth entry in the atlas was selected for inclusion in the digitized collection, leading to a varied library composition with respect to compound type, solvent used, wavelength range spanned, etc. This library was not designed to be representative of the solute spectra which would commonly be encountered by chromatographers but rather to serve as a test set for the evaluation of various alternatives for wavelength selection.

The spectra were digitized with a Hipad digitizer (Houston Instruments, Austin, TX). Spectra were sampled unevenly, with more data points taken in regions of fine structure. The number of points collected for each spectrum depended on the degree of structure present as well as the wavelength range spanned. For each spectrum, the sampling was sufficient to allow a visually satisfactory reproduction based on linear interpolation between collected data points. The original wavenumber axis was converted to a wavelength axis for greater compatibility with common usage of MUV detectors for LC.

From the starting set of 220 spectra, a 101-member sublibrary was selected such that a common wavelength range (220-310 nm) and resolution (1 nm) was shared by all members. Linear interpolation was used to determine data points at values between the originally digitized data points. Use of a sublibrary having a common wavelength axis facilitates the application of information theory, cluster analysis, and eigenanalysis, and simulates a situation which is expected to be typical for LC-MUV users.

All programs were written either in PRO/BASIC or FORTRAN-77 and executed on a DEC Professional 350 (or 380) microcomputer (Digital Equipment Corp., Maynard, MA) which was configured as part of a Model 840 chromatographic control station (Waters, Division of Millipore, Milford, MA).
Determinants were calculated according to the subroutine published by Bevington (39). The cluster analysis routine SOKAL was taken from Zupan’s text (40) and used without modification. The KSFA software was part of the TARGET package (E. R. Malinowski, Stevens Institute of Technology, Hoboken, NJ) and was modified to work with larger data sets. Some of the graphics and data processing were performed with the RS/1 integrated data analysis system (BBN Software Products Corp., Cambridge, MA). All other programs were specifically written for use in this study.
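The resampling step described in this section (linear interpolation of an unevenly digitized spectrum onto the common 220-310 nm axis at 1-nm resolution) can be sketched in a few lines of Python. This is an illustrative reconstruction only: the function name, argument names, and the example data points are hypothetical, and the digitized wavelengths are assumed to be strictly increasing.

```python
from bisect import bisect_right


def resample_spectrum(wavelengths, intensities, start=220, stop=310, step=1):
    """Linearly interpolate an unevenly sampled spectrum onto a regular grid.

    Assumes `wavelengths` is strictly increasing and covers [start, stop].
    """
    if wavelengths[0] > start or wavelengths[-1] < stop:
        raise ValueError("spectrum does not span the requested range")
    grid = list(range(start, stop + 1, step))
    resampled = []
    for w in grid:
        hi = bisect_right(wavelengths, w)  # first digitized point beyond w
        if hi == len(wavelengths):
            # w coincides with the last digitized point.
            resampled.append(intensities[-1])
            continue
        lo = hi - 1
        frac = (w - wavelengths[lo]) / (wavelengths[hi] - wavelengths[lo])
        resampled.append(intensities[lo] + frac * (intensities[hi] - intensities[lo]))
    return grid, resampled


if __name__ == "__main__":
    # Hypothetical digitized points: (wavelength in nm, log molar absorptivity).
    wl = [215.0, 225.0, 300.0, 310.0]
    log_eps = [1.0, 2.0, 2.0, 4.0]
    grid, values = resample_spectrum(wl, log_eps)
    print(len(grid), values[0], values[-1])
```

Spectra that do not span the full range are rejected rather than extrapolated, which mirrors the restriction of the sublibrary to members sharing the common wavelength range.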

RESULTS AND DISCUSSION

Spectral Representation. The wavelength axis for each library member consisted of 91 entries over the wavelength range 220-310 nm at 1-nm resolution. Initially the intensity axis had units of log molar absorptivity, but it was not clear that this would be the most appropriate representation for all selection approaches. In particular, the calculation of information content with explicit correction for correlation between selections assumes a multivariate normal distribution of intensities for the spectral library. Some attention must therefore be given to the intensity axis, particularly regarding the distribution of intensity values. While a rigorous evaluation of the validity of an assumption of multivariate normality is not practical (22), testing for univariate normality within each wavelength channel is a feasible alternative (41). The Kolmogorov-Smirnov test (42) for normality was used in this work.

Figure 1. Kolmogorov D statistic for each wavelength channel of six transformations of the full spectral library. The exponents referred to in the legend are values of the parameter b from eq 2. The double line indicates the critical D statistic for a sample of size 101. [Plot not reproduced. Legend: b = -1.0, b = -0.5, b = 0.0 (log), b = +0.5, b = +1.0 (parent), b = +2.0; cutoff for N = 101. Horizontal axis: wavelength.]

In addition to testing the normality of each wavelength channel for the original (logarithmic) version of the library, transformation of the intensity axis was considered systematically according to the method of Box and Cox (43). This approach is based on application of eq 2

$$ Y_i(b) = X_i^{\,b} \tag{2} $$

where Xi is an intensity value from the parent library and Yi is the corresponding transformed intensity value. The exponent b can take on positive or negative values, with the value b = 0 reserved for the special case of a log transformation. For this study, the parent library had units of molar absorptivity vs. wavelength and b took on values of -1 to 2.

Figure 1 presents the results from the transformation study. Each curve in the figure corresponds to a transformed library and indicates the Kolmogorov D statistic obtained for each wavelength channel. The horizontal double line shows the pass/fail cutoff value of the Kolmogorov-Smirnov test for a sample of size 101. Kolmogorov D values falling below the line correspond to wavelength channels having a normal distribution of intensity values. Inspection of Figure 1 reveals that only two transformed libraries have any passing channels at all: the square root (b = +1/2) and logarithmic (b = 0) libraries. Of these, the square root library has a higher percentage of normal channels (81.3%) than the logarithmic library (60.4%). Only these two options should be considered for representation of the spectral library if the assumption of multivariate normality is to be most nearly satisfied.

Despite the slight advantage offered by the square root transformation, we chose to represent intensity as the logarithm of molar absorptivity throughout these studies. The logarithmic library is a more natural choice due to the common usage of this representation for absorbance spectra. The use of log absorbance spectra has the additional advantage of converting the concentration and path length effects on absorbance to additive contributions. This facilitates the comparison of concentration-dependent absorbance spectra when a logarithmic transformation is used.
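The transformation study can be sketched as follows: each intensity is raised to the power b (with b = 0 taken as the log transform), and a Kolmogorov D statistic is computed for every wavelength channel against a normal distribution. This is a minimal illustration with hypothetical function names. Two points are assumptions rather than details given in the text: the logarithm is taken to base 10, and the reference normal distribution uses the mean and standard deviation estimated from the channel itself. The critical D value for N = 101 must be supplied from an appropriate table and is not hard-coded here.

```python
import math
from statistics import mean, stdev


def power_transform(x, b):
    """Power transformation of eq 2; b = 0 is treated as a base-10 log (assumed)."""
    return math.log10(x) if b == 0 else x ** b


def kolmogorov_d(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    n = len(values)
    mu, s = mean(values), stdev(values)
    d = 0.0
    for k, v in enumerate(sorted(values), start=1):
        cdf = 0.5 * (1.0 + math.erf((v - mu) / (s * math.sqrt(2.0))))
        # Compare with the empirical step just after and just before v.
        d = max(d, k / n - cdf, cdf - (k - 1) / n)
    return d


def d_by_channel(library, b):
    """Return one D statistic per wavelength channel.

    `library` is a list of spectra (molar absorptivities), one list per compound,
    all sharing the same wavelength axis.
    """
    n_channels = len(library[0])
    return [
        kolmogorov_d([power_transform(spectrum[ch], b) for spectrum in library])
        for ch in range(n_channels)
    ]


def fraction_passing(d_values, critical_d):
    """Fraction of channels whose D statistic falls below the chosen cutoff."""
    return sum(d < critical_d for d in d_values) / len(d_values)
```

Calling `d_by_channel` once per value of b and plotting the results against wavelength gives curves of the kind summarized in Figure 1, and `fraction_passing` gives the percentage of channels below the cutoff for each transformation.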

Information Content of Individual Wavelength Channels. The ΔI selection procedure (32) begins with the selection of the single wavelength channel having the largest information content. Shannon’s formula is used to calculate the information content for a single wavelength channel

$$ I_h(i) = -\sum_{j=1}^{N} p(i,j)\,\log_2 p(i,j) \tag{3} $$

where p(i,j) is the probability of finding an intensity value at level j in channel i and Ih(i) is the information content for channel i. Direct use of eq 3 involves setting up an intensity histogram in each wavelength channel, based on assignment of a limited number of discrete intensity levels. For this work, 26 levels were used to span the observed intensity range based on an intensity axis increment of 0.02 in log molar absorptivity units. The number of allowed intensity levels (N) sets an upper limit of log2 N on the information content per channel. When there is evidence for a Gaussian distribution of intensities, a simpler equation (32) can be applied for the calculation of information content

$$ I_g(i) = \frac{1}{2}\log_2\!\left[\frac{2\pi e\,s^2}{(dx)^2}\right] \tag{4} $$

where s is the standard deviation for the intensity distribution in channel i and dx is the intensity axis resolution. On the basis of the results of the transformation study (Figure 1), application of eq 4 is reasonable for the logarithmic library being used. For a given library representation, it should be noted that s is the only variable on the right side of eq 4. Thus, information content parallels the breadth of the intensity distribution, giving maximal response for channels in which the intensity values show the greatest variation. Equations 3 and 4 generally reveal the same trends in information content (see Figure 2), although Ig(i) will be offset from Ih(i) by an amount that depends on the correctness of the variance estimates as well as the validity of the Gaussian model for each channel. Selection of the wavelength having the highest information content can be made on the basis of eq 3 or 4. However, the smoother Ig(i) curve provides a more stable selection. Figure 2 indicates that 310 nm will be selected first on the basis of either Ih(i) or Ig(i).

Table I. Cumulative Information for Wavelength Sets from Various Procedures

procedure                             1      2      3      4      5      6      7      8      9     10
ΔI                wavelength, nm    310    253    220    309    254    223    308    307    255    221
                  cum info, bits   4.23   8.09  11.45  11.44  10.96  12.24  10.69   8.30   9.23   6.10
ΔF                wavelength, nm    310    253    220    309    254    223    308    307    255    221
                  cum info, bits   4.23   7.42  10.36  10.13   9.20  10.10   8.35   5.34   6.49   2.70
ΔG                wavelength, nm    310    253    220    307    223    254    309    255    221    308
                  cum info, bits   4.23   7.42  10.36  11.68  12.59  11.68   9.96   7.98   5.42   2.70
ΔG2               wavelength, nm    310    253    220    296    239    267    262    309    221    308
                  cum info, bits   4.23   7.42  10.36  12.97  15.45  17.43  17.60  16.98  16.08  14.14
CMPL              wavelength, nm    310    251    237    291    253    228    279    245    301    262
                  cum info, bits   4.23   7.44  10.03  12.74  12.47  14.56  16.48  16.60  17.84  18.83
CMPL (reordered)  wavelength, nm    310    251    291    237    228    279    262    301    245    253
                  cum info, bits   4.23   7.44  10.16  12.74  14.90  16.88  18.42  19.64  19.75  18.83
KSFA              wavelength, nm    310    248    227    277    262    295    220    239    304    287
                  cum info, bits   4.23   7.37  10.26  13.11  14.96  17.25  19.12  20.62  22.16  23.09
KSFA (reordered)  wavelength, nm    310    248    227    277    295    220    262    304    239    287
                  cum info, bits   4.23   7.37  10.26  13.11  15.46  17.34  19.12  20.67  22.16  23.09

ΔI Selection Procedure. Given the selection of the first wavelength, which corresponds to the highest individual information content (eq 4), the ΔI procedure adds additional wavelength choices sequentially. The wavelength channel that maximizes the ΔI criterion (eq 5) is selected


where Ig(i) is the information content for channel i and R(i,j) is the correlation coefficient between channels i and j. This criterion is maximal for channels which combine high information content and low average correlation to previous wavelength selections. Three variations of the ΔI approach were also implemented. The “ΔF” variation utilizes the “f” correction factors recommended by van Marlen and Dijkstra (25). Each f value is a linear correction for the offset between Ih(i) and Ig(i) (see Figure 2).

The ΔI and ΔF procedures lead to identical wavelength selections but give different results for the calculated overall information content (see below). Another variation, “ΔG”, uses Gauss-Jordan elimination to reorder the wavelength selections in such a way that the diagonal elements of the triangularized covariance matrix descend in order. This variation was suggested by Massart et al. (22) and ensures that each wavelength contributes maximally to the cumulative information content. A final variation, “ΔG2”, substitutes a different measure of average correlation for the arithmetic mean used in eq 5. Due to the close relationship between the correlation coefficient (R) and the variance of a sample (42), we expect R² values, rather than R values, to be additive. An average correlation value can therefore be calculated as the square root of the mean R² value. Unlike the other variations on the ΔI approach, the ΔG2 procedure can result in a different set of wavelength selections.

Table I summarizes the 10 representative wavelengths selected by all four variations of the ΔI procedure. The selections of the ΔG2 procedure are seen to be substantially different from the other wavelength sets. In addition, the ΔG2 selections are distributed throughout the wavelength axis, unlike the wavelengths chosen by the other procedures, which tend to occur in a few clusters.

Figure 2. Information content according to eq 3 (lower curves) and eq 4 (upper curves) for each wavelength channel of the spectral library. [Plot not reproduced. Horizontal axis: wavelength.]

Information Content for Wavelength Sets. In order to compare the quality of the wavelength sets given in Table I, the total information content for each set may be calculated. For a set of M wavelength selections, the total information content is less than the sum of the individual information contributions due to correlation among the selections. The total information content is therefore calculated in a manner which corrects for correlation, using the equation derived by Dupuis and Dijkstra (32)

$$ I_g(1,2,\ldots,M) = \frac{1}{2}\log_2\!\left[\left(\frac{2\pi e}{(dx)^2}\right)^{M} |\mathrm{COV}|\right] \tag{7} $$

where |COV| is the determinant of the covariance matrix for the M selections and the other variables are as in eq 4. Note that eq 7 is related by use of |COV| to the approach of Junker and Bergmann (27) mentioned above. Equation 7 is applied in a sequential fashion, to monitor the cumulative information content. This allows a determination of the number of selections which are needed for purposes of recognition of library members, as the cumulative information is expected to level off once the most significant wavelengths have been considered.

Figure 3. Cumulative information content (eq 7) for wavelength sets selected by the ΔI procedure and three variations (see text). [Plot not reproduced. Legend: ΔI, ΔF, ΔG, ΔG2. Horizontal axis: selection number, 1-10.]

For each of the four variations of the ΔI procedure, the cumulative information content is plotted in Figure 3. The information content is maximal after the selection of six or seven wavelengths. Beyond this point, the calculated information content drops off, as the correction for correlation begins to dominate the results from eq 7. In practice, the use of more than the minimum number of wavelength selections is not expected to “remove” information, and the curves in Figure 3 should therefore reach a plateau rather than curve downward. The observed drop in the calculated information content may reflect a failure to fulfill the assumed multivariate normal distribution of intensities.

As anticipated, the ΔI and ΔF curves in Figure 3 have essentially the same shape, the ΔF curve lying below the ΔI curve due to the use of the f correction factors. The ΔG procedure is effective at reordering the ΔF selections to produce a sequence which rises directly to the maximum information value and shows a smooth overall trend. It is interesting that the ΔG2 procedure provides a dramatic improvement over the ΔG results. The reasons for this become apparent upon inspection of Table I. The selections from the ΔG procedure occur in three tight clusters in the wavelength regions 220-223, 253-255, and 307-310 nm. The ΔG2 selections include these regions but also contain selections at 296, 239, and 267 nm which contribute significantly to the improved information content for the ΔG2 set.
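The information measures used in this comparison can be sketched in Python as follows. The histogram function follows Shannon's formula (eq 3). The Gaussian function assumes the form (1/2) log2[(2πe/dx²)^M |COV|] for a set of M channels, which reduces to the single-channel Gaussian expression when M = 1. Function names are hypothetical, the covariance is the ordinary sample covariance, and the determinant is obtained by Gaussian elimination rather than by the Bevington subroutine used in the original work.

```python
import math


def shannon_information(values, lowest, dx, n_levels):
    """Histogram (Shannon) information in bits for one wavelength channel."""
    counts = [0] * n_levels
    for v in values:
        level = int((v - lowest) / dx)
        counts[min(max(level, 0), n_levels - 1)] += 1  # clamp to allowed levels
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def covariance_matrix(channels):
    """Sample covariance between channels (each channel: one value per spectrum)."""
    n = len(channels[0])
    centered = []
    for ch in channels:
        m = sum(ch) / n
        centered.append([v - m for v in ch])
    return [[sum(a * b for a, b in zip(x, y)) / (n - 1) for y in centered]
            for x in centered]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def gaussian_information(channels, dx):
    """Information in bits for M channels, corrected for their correlation."""
    m = len(channels)
    det = determinant(covariance_matrix(channels))
    if det <= 0.0:
        return float("-inf")  # perfectly correlated selections
    return 0.5 * (m * math.log2(2.0 * math.pi * math.e / dx ** 2) + math.log2(det))


def cumulative_information(selected_channels, dx):
    """Apply the set formula after each selection, in the order given."""
    return [gaussian_information(selected_channels[:k], dx)
            for k in range(1, len(selected_channels) + 1)]
```

For uncorrelated channels `gaussian_information` equals the sum of the single-channel values; for correlated channels it falls below that sum, and each added channel can lower the cumulative value when its variance not explained by the earlier selections is small relative to dx. That is one way to read the downward trend of the calculated curves.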
Figure 3 should be useful for determining the minimum number of wavelength selections that can be used to discriminate among the library members. The ΔG2 curve indicates that the information content begins to level off after about six wavelength selections. If the ΔG2 selections are the best available set of wavelengths, the use of additional wavelength selections will not significantly enhance discrimination of the 101 library members.

Selection by Cluster Analysis. Hierarchical cluster analysis has been used previously to select representative members of highly correlated data sets (24, 40). With the wavelength axis grouped into a small number of highly correlated clusters, a dendrogram such as shown in Figure 4 is produced. The selection process consists of choosing one member of each of the M most significant clusters. The process of dissecting the dendrogram begins with breaking the linkage at the highest level of cohesion (100% in Figure 4) to form two clusters. From each of these clusters the wavelength having the highest information content is chosen. Of these two choices, the wavelength with the higher information content becomes the first selection, and the other becomes the second. The third and subsequent selections are made by breaking, in turn, the next lower linkage. Two new clusters are formed, one of which contains a previously selected wavelength. The new wavelength selection is taken from the other cluster. In Figure 4, the double line at a 22% level of cohesion cuts across nine vertical lines, producing ten clusters from which the selections indicated in Table I are taken.

Figure 4. Dendrogram produced by application of the complete linkage procedure to the full spectral library. The double line indicates the level at which the linkages among the ten most significant clusters are severed. [Plot not reproduced. Horizontal axis: wavelength (nm).]

A number of agglomerative hierarchical clustering procedures exist (40), of which seven were considered in this study. One of these, the single linkage method, failed to produce useful dendrograms due to excessive “chaining”, a behavior that has been noted previously (40). Figure 5 shows the cumulative information content for the wavelength sets selected according to each of the remaining six algorithms. Note that these wavelength sets have not been reordered as with the ΔG procedure. This allows comparison of the ability of each method to select the most useful wavelengths at an early stage. Because Figure 5 provides no strong guidance for the selection of the best clustering approach, the complete linkage method was selected on the basis of other considerations. This procedure is conceptually simple and tends to produce easily interpreted dendrograms (Figure 4) due to its monotonicity (40).
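The idea of taking one representative wavelength from each cluster can be sketched as follows. This is not the SOKAL routine used in the study. It is a naive complete-linkage agglomeration written for illustration, with several assumptions: the distance between two channels is taken as 1 - R (the similarity measure behind the "level of cohesion" is not specified in the text), the sample variance stands in for the information content when choosing the representative of a cluster (by eq 4 the Gaussian information increases with s), and the function returns only the set of selections, not the order produced by dissecting the dendrogram from the top.

```python
import math
from statistics import mean, variance


def correlation(x, y):
    """Pearson correlation coefficient between two channels (must not be constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def complete_linkage(channels, n_clusters):
    """Merge channels until n_clusters remain; returns lists of channel indices."""
    n = len(channels)
    # Assumed dissimilarity: 1 - R, so highly correlated channels are close.
    dist = [[1.0 - correlation(channels[i], channels[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_representatives(channels, n_clusters):
    """Pick, from each cluster, the channel with the largest variance."""
    clusters = complete_linkage(channels, n_clusters)
    return sorted(max(c, key=lambda i: variance(channels[i])) for c in clusters)


if __name__ == "__main__":
    # Hypothetical intensities of four channels across four library spectra.
    channels = [
        [1.0, 2.0, 3.0, 4.0],
        [1.1, 2.0, 3.1, 4.2],
        [4.0, 1.0, 3.0, 2.0],
        [8.0, 2.0, 6.1, 4.0],
    ]
    print(select_representatives(channels, n_clusters=2))
```

With 91 channels this brute-force version is slow but still workable; a production implementation would update the distance matrix after each merge instead of recomputing the cluster distances from scratch.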
To facilitate comparison of the quality of the clustering selections with the results in Figure 3, the selections were reordered as in the ΔG procedure. In Figure 6, the cumulative information plots for the ΔG2 and clustering selections are compared, along with the results for key set factor analysis (see below). The clustering selections are superior to the ΔG2 results, with the information content reaching a higher level overall and leveling off after eight or nine wavelength selections.

Figure 5. Cumulative information content (eq 7) for the wavelength sets selected by six agglomerative hierarchical clustering procedures. No reordering procedure has been applied to the selections. Key: CMPL, complete linkage method; UPGA, unweighted pair-group average; WPGA, weighted pair-group average; UPGC, unweighted pair-group centroid; WPGC, weighted pair-group centroid; WARD, Ward’s method. [Plot not reproduced. Horizontal axis: selection number, 1-10.]

Figure 6. Cumulative information content (eq 7) for the wavelength sets selected by the ΔG2, clustering (CMPL), and key set factor analysis procedures. All wavelength selections were subjected to the same reordering procedure (see text). [Plot not reproduced. Legend: ΔG2, clustering (CMPL), key set FA. Horizontal axis: selection number.]

Sequential Rejection of Least Selective Wavelength. The concept of selectivity introduced by Kaiser (33, 34) and modified by Junker and Bergmann (27) applies to calibration in MSA according to eq 1. The focus is on the selectivity of the molar absorptivity matrix (A). As described above, the selectivity of A may be monitored by calculation of the determinant of A or the condition number of A, provided that A is a square matrix. Since the libraries used in this work consist of log molar absorptivity spectra, we form the A matrix simply by assembling the spectral library. We therefore attempted to execute the selection procedure suggested by Junker and Bergmann (27), using as an example a sublibrary composed of the spectra for 39 aromatic compounds.

To select a subset of maximally selective wavelengths, Kaiser required an exhaustive search of all possible combinations. Thus, to select 10 wavelengths, we would need to evaluate 91!/(81!·10!), or about 6.4 × 10^12, combinations, a task that is entirely unreasonable since we fully expect the degree of interchannel correlation to be high, making many sets of adjacent wavelengths poor candidates. To avoid excessive calculations, the procedure of Junker and Bergmann looks first at all subsets formed by dropping one wavelength from the full set. The subset having the highest selectivity is retained, resulting in the exclusion of one wavelength of low selectivity. Selectivity is monitored by calculation of the determinant of the covariance matrix (A’A). The process is repeated until a small enough subset remains.

For the aromatic sublibrary, the high degree of correlation present in the library led to a determinant of zero for all early subsets. The selection process cannot proceed on this basis since all possible subsets have an equal (zero) selectivity. To apply the procedure successfully, it would be necessary to reduce the correlation by an appropriate preselection technique such as cluster analysis. Such an approach would be redundant, however, and from a practical standpoint, we chose not to pursue the sequential rejection procedure. Note that the determinant of the covariance matrix is already a part of the calculation of information content according to eq 7. Thus the selectivity concept is implicitly considered in the evaluation of all the methods being considered.

Key Set Factor Analysis (KSFA). Principal components analysis (PCA) is frequently used to generate simplified representations of complex data sets on the basis of a relatively small number of underlying “factors”. In general, this procedure begins with an eigenanalysis step to determine a basis set of eigenvectors which span the multidimensional data space.
The key set factor analysis technique (36) selects the set of rows (or columns) from the data matrix that can most nearly be substituted for the significant eigenvectors. These selections will therefore be maximally orthogonal to one another such that they ideally span the data space as well as the principal eigenvectors do. An important aspect of PCA and related techniques is the indication of the underlying dimensionality of the data. For highly correlated data sets such as the spectral library used here, the actual dimensionality will generally be considerably less than either dimension of the starting data matrix. The eigenvectors that emerge from the eigenanalysis step consist of a primary set of significant eigenvectors (one for each underlying factor) as well as a secondary set. The secondary set consists of eigenvectors that are associated with noise in the data matrix and that are not needed for reproduction of the data. The key to achieving a simplified but accurate data representation with PCA lies in determining the size of the primary set of eigenvectors (i.e., the number of factors). In the application of KSFA, it is important to select a number of wavelengths that does not exceed the number of underlying factors. Continuing to match data rows (or columns) to eigenvectors beyond the primary set would lead to selections which match eigenvectors describing the noise in the data. An estimate of the number of underlying factors for each library is therefore required. A variety of procedures have been used to determine the cutoff between the primary and secondary eigenvector sets (44), many of which depend on having a reliable estimate of the noise associated with the entries in the data matrix. Because the spectra used in this study came from a variety of sources, it was difficult for us to apply procedures of this type. Another category of methods for determining the number of factors operates without an estimate of the noise (44).
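One criterion of this second kind is the indicator function of Malinowski (45), which can be evaluated from the eigenvalues alone. The sketch below assumes the commonly quoted form IND(n) = RE(n)/(c − n)², with RE(n) = [Σ(j > n) λj / (r(c − n))]^(1/2) for an r × c data matrix D (r ≥ c) and eigenvalues λj of D′D; the number of factors is taken as the n that minimizes IND. The function name and the synthetic test matrix (three hypothetical factors plus noise) are illustrative and are not taken from this work.

```python
import numpy as np

def indicator_function(data):
    """Return (n, IND(n)) for n = 1 .. c-1, assuming the form stated above."""
    data = np.asarray(data, dtype=float)
    r, c = data.shape
    if r < c:                                   # work in the orientation with r >= c
        data = data.T
        r, c = c, r
    eigvals = np.linalg.eigvalsh(data.T @ data)[::-1]   # descending order
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negative round-off
    tail = np.cumsum(eigvals[::-1])[::-1]       # tail[k] = sum of eigvals[k:]
    n = np.arange(1, c)
    real_error = np.sqrt(tail[1:] / (r * (c - n)))      # RE(n) from secondary eigenvalues
    return n, real_error / (c - n) ** 2

# Synthetic check: three hypothetical factors plus uniform random noise.
rng = np.random.default_rng(0)
true_factors = 3
clean = rng.normal(size=(40, true_factors)) @ rng.normal(size=(true_factors, 20))
noisy = clean + 0.01 * rng.normal(size=clean.shape)

n, ind = indicator_function(noisy)
n_estimated = int(n[np.argmin(ind)])
print("estimated number of factors:", n_estimated)
```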



Malinowski’s indicator function (45) and Wold’s cross validation procedure (46) are examples from this category. We applied the indicator function to the eigenvectors determined and found that more than 10 factors were required. Since we only desire 10 wavelength selections (to allow comparison with the results of the previously described procedures), this finding is sufficient. We can seek the desired number of wavelength selections without further investigation of the actual number of underlying factors.

The wavelength sets generated by application of the KSFA procedure are given in Table I. Inspection of the table indicates that the KSFA selections are well distributed across the wavelength axis. The selection aspect of KSFA is not supplemented with an explicit evaluation of the quality of the results. We therefore calculated the cumulative information content for the wavelength sets selected by KSFA. The results, presented in Figure 6, indicate that the KSFA wavelength selections have the highest overall information content of any of the procedures investigated. In fact, for each of the wavelength sets generated by application of KSFA, the information content does not reach a plateau level with 10 selections. This is in agreement with the finding that the number of factors exceeds 10.

Comparison of Methods. A superior trend in information content according to eq 7 implies (1) high variability among the library spectra at each selected wavelength and (2) low correlation among the wavelengths selected. For qualitative purposes of compound recognition, it appears that these are useful goals. According to the results summarized in Figure 6, the KSFA technique generates superior sets of representative wavelengths. Comparison of the cumulative information content before and after application of the reordering procedure indicates that the wavelengths selected by the KSFA procedure also emerged in the most nearly optimal sequence.
The AI and clustering selections benefitted more from reordering. In addition to its superior performance, the KSFA technique is uniquely able to provide guidance regarding the dimensionality of the data set. To take advantage of this information, however, additional effort is required to deduce the number of factors. One caution associated with KSFA should also be mentioned (37). Wavelength channels in which the spectra show little variability consist essentially of noise and must be excluded from consideration by the KSFA procedure. During the selection process, a scaling is carried out which gives “noise channels” equal consideration with all other wavelengths. This can lead to spurious selections in some cases. For the libraries studied here, prescreening was not necessary. The spectral variability was reasonably high across the entire wavelength axis, leading to high information contents (see Figure 2) and absence of noise channels. The key set factor analysis technique is preferred on the basis of these studies, due to the higher information content of the selections obtained. A separate publication (47) describes our efforts to check the practical value of this technique for the selection of wavelengths for use in absorbance ratioing for purposes of solute recognition in LC-MUV.
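The caution about noise channels can be made concrete with a short sketch. It assumes, for illustration only, that the scaling step amounts to mean-centering each wavelength channel and normalizing it to unit length, and that prescreening is done with a simple threshold on the standard deviation of each channel. The threshold value, the function name `prescreen_channels`, and the synthetic data are hypothetical.

```python
import numpy as np

def prescreen_channels(spectra, min_std):
    """Indices of wavelength channels whose spread across spectra is >= min_std."""
    channel_std = spectra.std(axis=0, ddof=1)
    return np.flatnonzero(channel_std >= min_std)

rng = np.random.default_rng(1)
# Hypothetical library: 39 spectra, 8 wavelength channels of log-absorptivity-like values.
spectra = rng.normal(loc=3.0, scale=0.5, size=(39, 8))
# Channel 5 is nearly flat, so its variation is essentially noise.
spectra[:, 5] = 3.0 + rng.normal(scale=1e-3, size=39)

# Assumed scaling: center each channel, then normalize it to unit length.
centered = spectra - spectra.mean(axis=0)
scaled = centered / np.linalg.norm(centered, axis=0)

print("std before scaling:", np.round(spectra.std(axis=0, ddof=1), 4))
print("length after scaling:", np.round(np.linalg.norm(scaled, axis=0), 4))

# Prescreening removes the flat channel before any selection is attempted.
kept = prescreen_channels(spectra, min_std=0.05)
print("channels kept:", kept.tolist())
```

After scaling, the flat channel has the same length as the informative ones, which is why it has to be removed beforehand rather than left to the selection step.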

ACKNOWLEDGMENT The authors thank J. Hallowell, D. Mauro, S. Rhode, R. King, G. Verghese, and R. Kim for helpful discussions. The original idea for the modified AGz procedure was suggested by S. Rhode. We thank D. Chapman and K. Crocker for providing the digitized spectral library used in this study.

LITERATURE CITED
(1) Osten, D. W.; Kowalski, B. R. Anal. Chem. 1984, 56, 991.
(2) Vandeginste, B.; Essers, R.; Bosman, T.; Reijnen, J.; Kateman, G. Anal. Chem. 1985, 57, 971.
(3) Kim, R. Ph.D. Thesis, Massachusetts Institute of Technology, June 1985.
(4) Webb, P. A.; Ball, D.; Thornton, T. J. Chromatogr. Sci. 1983, 27. A47 ....
(5) Wegner, J. W. M.; Grunbauer, H. J. M.; Fordam, R. J.; Karcher, W. J. Liq. Chromatogr. 1984, 7, 809.
(6) Drouen, A. C. J. H.; Billiet, H. A. H.; DeGalan, L. Anal. Chem. 1984, 56, 971.
(7) Zech, K.; Huber, R.; Elgass, H. J. Chromatogr. 1983, 282, 161.
(8) Fell, A. F.; Clark, B. J.; Scott, H. P. J. Chromatogr. 1984, 376, 423.
(9) Drouen, A. C. J. H.; Billiet, H. A. H.; DeGalan, L. Anal. Chem. 1985, 57, 962.
(10) Yost, R.; Stoveken, J.; MacLean, W. J. Chromatogr. 1977, 734, 73.
(11) Baker, J. K.; Skelton, R. E.; Ma, C. J. Chromatogr. 1979, 768, 417.
(12) Krstulovic, A. M.; Brown, P. R.; Rosie, D. M. Anal. Chem. 1977, 49, 2237.
(13) Hartwick, R. A.; Assenza, S. P.; Brown, P. R. J. Chromatogr. 1979, 786, 647.
(14) Willard, H. H.; Merritt, L. L.; Dean, J. A.; Settle, F. A. Instrumental Methods of Analysis, 6th ed.; Van Nostrand: New York, 1981; Chapter 3.
(15) Skoog, D. A.; West, D. M. Principles of Instrumental Analysis; Holt, Rinehart and Winston: New York, 1971; Chapter 4.
(16) Sustek, J. Anal. Chem. 1974, 46, 1676.
(17) McDowell, A. E.; Harner, R. S.; Pardue, H. L. Clin. Chem. (Winston-Salem, N.C.) 1976, 22, 1862.
(18) Kisner, H. J.; Brown, C. W.; Kavarnos, G. J. Anal. Chem. 1983, 55, 1703.
(19) Otto, M.; Wegscheider, W. Anal. Chem. 1985, 57, 63.
(20) Jochum, C.; Jochum, P.; Kowalski, B. R. Anal. Chem. 1981, 53, 85.
(21) Thijssen, P. C.; Vogels, L. J. P.; Smit, H. C.; Kateman, G. Fresenius’ Z. Anal. Chem. 1985, 320, 531.
(22) Massart, D. L.; Dijkstra, A.; Kaufman, L. Evaluation and Optimization of Laboratory Methods and Analytical Procedures; Elsevier: Amsterdam, 1978; Chapters 8 and 17.
(23) Eskes, A.; Dupuis, F.; Dijkstra, A.; DeClerq, H.; Massart, D. L. Anal. Chem. 1975, 47, 2168.
(24) DeClerq, H.; Massart, D. L. J. Chromatogr. 1975, 775, 1.
(25) van Marlen, G.; Dijkstra, A. Anal. Chem. 1978, 48, 595.
(26) Kateman, G.; Pijpers, F. W. Quality Control in Analytical Chemistry; Wiley: New York, 1981; Chapter 4.
(27) Junker, A.; Bergmann, G. Z. Anal. Chem. 1974, 272, 267.
(28) Souto, J.; De Valesi, A. G. J. Chromatogr. 1970, 46, 274.
(29) Grotch, S. L. Anal. Chem. 1970, 42, 1214.
(30) Massart, D. L. J. Chromatogr. 1973, 79, 157.
(31) Shannon, C. E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, 1949.
(32) Dupuis, F.; Dijkstra, A. Anal. Chem. 1975, 47, 379.
(33) Kaiser, H. Z. Anal. Chem. 1972, 260, 252.
(34) Kaiser, H. Spectrochim. Acta, Part B 1978, 33B, 551.
(35) Kalivas, J. H. Anal. Chem. 1983, 55, 565.
(36) Malinowski, E. R. Anal. Chim. Acta 1982, 134, 129.
(37) Malinowski, E. R.; Cox, R. A.; Haldna, U. L. Anal. Chem. 1984, 56, 778.
(38) D.M.S. UV Atlas of Organic Compounds; Plenum: New York, 1968.
(39) Bevington, P. R. Data Reduction and Error Analysis for the Physical Sciences; McGraw-Hill: New York, 1969.
(40) Zupan, J. Clustering of Large Data Sets; Research Studies Press: Chichester, 1982.
(41) D’Agostino, R. Boston University Department of Mathematics, Boston, MA, personal communication, 1984.
(42) Kreyszig, E. Introductory Mathematical Statistics; Wiley: New York, 1970; Chapters 15 and 18.
(43) Box, G. E. P.; Cox, D. R. J. R. Stat. Soc. 1964, B26, 211.
(44) Malinowski, E. R.; Howery, D. G. Factor Analysis in Chemistry; Wiley: New York, 1980; Chapter 4.
(45) Malinowski, E. R. Anal. Chem. 1977, 49, 612.
(46) Wold, S. Technometrics 1978, 20, 397.
(47) Warren, F. V., Jr.; Bidlingmeyer, B. A.; Delaney, M. F. Anal. Chem., following paper in this issue.

RECEIVED for review November 7, 1985. Resubmitted December 2, 1986. Accepted March 17, 1987.