Anal. Chem. 1991, 63, 936-944
In rs = In
-
y In (4Dt/r,2) - 47
-
2 In ( 4 D t / r O 2 ) 37
If 4Dt/r,2 >> e47,i.e., t >> 10 ps, the last term can be neglected, yielding
which is equivalent to eq 7. In our case, we use microstep electrodes and not hemicylindrical electrodes. It has, however, been found that the characteristics of a band electrode can be described by those of an equivalent hemicylindrical electrode, with an equivalent radius of r0 = w/π or r0 = w/4, depending on the detailed conditions (15). In any case, this is of less importance, since the growth of the depletion layer with time according to eq A7, which determines the current, is independent of the size of the electrode and hence of its form.
LITERATURE CITED
(1) Wightman, R. M. Anal. Chem. 1981, 53, 1125A.
(2) Wightman, R. M. Science 1988, 240, 415.
(3) Wehmeyer, K. R.; Deakin, M. R.; Wightman, R. M. Anal. Chem. 1985, 57, 1913-1916.
(4) Thormann, W.; van den Bosch, P.; Bond, A. M. Anal. Chem. 1985, 57, 2764-2770.
(5) Bond, A. M.; Henderson, L. E.; Thormann, W. J. Phys. Chem. 1988, 90, 2911-2917.
(6) Morita, M.; Longmire, M. L.; Murray, R. W. Anal. Chem. 1988, 60, 2770-2775.
(7) Kittlesen, G. P.; White, H. S.; Wrighton, M. S. J. Am. Chem. Soc. 1984, 106, 7389-7396.
(8) Seibold, J. D.; Scott, E. R.; White, H. S. J. Electroanal. Chem. 1989, 264, 281-289.
(9) Kovach, P. M.; Caudill, L.; Peters, D. G.; Wightman, R. M. J. Electroanal. Chem. 1985, 185, 265-295.
(10) Bartelt, J. E.; Deakin, M. R.; Amatore, C.; Wightman, R. M. Anal. Chem. 1988, 60, 2167-2169.
(11) von Stackelberg, M.; Pilgram, M.; Toome, V. Z. Elektrochem. 1953, 57, 342.
(12) Szabo, A.; Cope, D. K.; Tallman, D. E.; Kovach, P. M.; Wightman, R. M. J. Electroanal. Chem. 1987, 217, 417-423.
(13) Deakin, M. R.; Wightman, R. M.; Amatore, C. A. J. Electroanal. Chem. 1988, 215, 49-61.
(14) Amatore, C. A.; Deakin, M. R.; Wightman, R. M. J. Electroanal. Chem. 1988, 206, 23-36.
(15) Amatore, C. A.; Fosset, B.; Deakin, M. R.; Wightman, R. M. J. Electroanal. Chem. 1987, 225, 33-48.
RECEIVED for review June 18, 1990. Revised manuscript received October 1, 1990. Accepted January 28, 1991. This work has been made possible by a grant from The National Swedish Board for Technical Development (STU).
Development and Optimization of Piecewise Linear Discriminants for the Automated Detection of Chemical Species
Thomas F. Kaltenbach and Gary W. Small*
Department of Chemistry, The University of Iowa, Iowa City, Iowa 52242
A pattern recognition technique based on piecewise linear discriminant analysis (PLDA) is described. Algorithms for the calculation and optimization of piecewise linear discriminants are presented. A simplex optimization of the individual discriminants is described, and a new method to optimize a piecewise linear discriminant is proposed and shown to produce significantly improved results over the nonoptimized method. This methodology is demonstrated through the use of a set of Fourier transform infrared interferograms collected by a remote sensor. The discriminant analysis methods produce a yes/no decision about the presence of a target analyte. The results obtained from the PLDA technique are compared with previous results from single linear discriminants and shown to be superior with respect to the separation statistics and the signal-to-noise ratio of the response.
INTRODUCTION
Pattern recognition techniques provide automated capabilities for classifying unknown observations into predefined categories or classes. In analytical chemistry applications, these techniques are used most often in qualitative analyses. In this context, the observations typically consist of digitized instrumental responses, and the categories are defined as the possible targets of the analysis being performed. The observations can thus be considered as points in a multidimensional data space. The location of a given observation in this data space is indicative of its category. As an example, pattern recognition techniques have been widely used to interpret spectral data automatically (1-3). Here, the spectrum of an unknown compound is categorized as representative of a particular chemical structural class.
One prominent pattern recognition technique is linear discriminant analysis (LDA), which allows observations to be placed into two or more classes separated by multidimensional linear surfaces. The separating surfaces are termed discriminants, as they define boundaries in the data space that allow the classes to be discriminated. For applications in which nonlinear separating surfaces are appropriate, piecewise linear discriminant analysis (PLDA) can be used. The piecewise linear discriminant consists of multiple linear discriminants that collectively form a piecewise approximation of a nonlinear separating surface.
Several methods for computing piecewise linear discriminants have been reported. Duda and Fossum (4) have described several of these algorithms, while Isenhour et al. (5) have applied piecewise linear discriminants to the interpretation of mass spectral data. Chang (6) proposed a method for computing piecewise linear convex and concave surfaces. Mangasarian (7) has described a recursive method utilizing multiple parallel discriminants, and Takiyama (8) has adapted Mangasarian's method to an iterative training procedure. Each of these methods calculates the individual discriminants in a stepwise manner. A "one-at-a-time" calculation, however, often does not produce an optimum piecewise linear discriminant, since the approximation of the separating surface utilizes the set of linear discriminants collectively.
0003-2700/91/0363-0936$02.50/0 © 1991 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 63, NO. 9, MAY 1, 1991
Figure 1. Graphical representation of the application of a band-pass digital filter to a single-beam spectrum and corresponding interferogram. The upper three plots contain analyte information, while the lower three plots contain no analyte information. [Plots not reproduced: the left panels are plotted against wavenumbers; the center and right panels are plotted against interferogram point.]
In this paper, a method is described that combines simplex pattern recognition techniques and a novel recalculation technique to optimize discriminants on an individual and collective basis. The techniques are described through the use of an application involving remote sensing data collected with a passive Fourier transform infrared (FTIR) spectrometer.
EXPERIMENTAL SECTION
The data analysis described here was performed by use of software written in Pascal, FORTRAN-77, and Assembler languages. The calculations were executed on a Hewlett-Packard Vectra RS/20C, a 20-MHz 80386 IBM PC-compatible microcomputer with 4-Mbyte RAM (Hewlett-Packard, Inc., Sunnyvale, CA). The MS-DOS 3.3 operating system was used. The compilers, assembler, and operating system were manufactured by Microsoft, Inc. (Redmond, WA). The software was executed under the Desqview-386 multitasking environment (Quarterdeck Office Systems, Santa Monica, CA). The computation time required for the set of five discriminant vectors was 10.7 h, including a total of 15000 iterations in the simplex optimization. Each simplex iteration required 1.3-2.2 s, depending on the number of patterns in the mixed-class subset.
The FTIR data used for this research were collected with a passive FTIR sensor built by Honeywell Corp. to the specifications of the U.S. Army Chemical Research, Development, and Engineering Center, Edgewood, MD. This instrument has been used for previous work and is described there (9, 10). The spectrometer design is based on a flex-pivot interferometer coupled with a liquid nitrogen cooled Hg:CdTe detector that responds in the range 8-12 μm. The collected data consisted of 1024-point interferograms, with a corresponding spectral resolution of approximately 4 cm⁻¹. The data collection was performed with the instrument mounted on moving vehicles. Approximately 49% of the data was collected with the spectrometer mounted on a helicopter, and the remaining 51% was collected with the spectrometer mounted on a truck. The data were collected at various speeds of the two vehicles and at various altitudes of the helicopter and included several passes by ground sources of SF6 (MG Industries, Jessup, MD, 97% purity). SF6 was selected as a target because of its use as a standard test compound in pollution monitoring. It has a single strong absorption at 940 cm⁻¹. Due to the great variety of infrared backgrounds observed, the collected data contained both SF6 absorption and emission bands.
RESULTS AND DISCUSSION
Description of Test Data. The data used in the development of the PLDA methodology consisted of interferograms collected by the FTIR remote sensor described above. These sensors are FTIR spectrometers designed to collect infrared background emission in the outside environment. The infrared emission signals are subsequently analyzed for the presence of the characteristic spectral bands of one or more target analytes. The sensor can thus serve in a variety of applications as an environmental monitoring device.
The analysis of remote sensing data is complicated by the absence of a stable infrared background measurement for use in computing ratioed spectra. To overcome this limitation, we have developed an analysis scheme based on the application of band-pass digital filters directly to interferogram data (9, 10). These filters work similarly to spectral band-pass filters, effectively removing frequencies that lie outside of a band of interest.
Figure 1 shows a graphical representation of the action of a band-pass digital filter in both the spectral and interferogram domains. The plot in the upper left is a single-beam spectrum with an absorption band at 940 cm⁻¹ for the analyte, SF6. The lower left plot depicts an analogous single-beam spectrum with no SF6 present. Superimposed on both spectra is a Gaussian-shaped frequency response function of a digital filter. The frequency response has a width at half-maximum of 45.4 cm⁻¹ and is centered on the SF6 absorption band. This filter can be applied in the spectral domain by multiplying the frequency response function by the single-beam spectrum. The resulting filtered spectrum is zeroed outside of the filter band-pass, and the SF6 absorption is superimposed on the filter band-pass function. The same filtering procedure can be performed in the interferogram domain.
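The equivalence between filtering in the two domains can be checked numerically. The following is a minimal sketch, not the FIRM filter used in this work: it applies a Gaussian frequency response to a synthetic signal once by multiplication in the frequency domain and once by direct circular convolution with the filter's impulse response. The signal, the normalized frequency axis, and the center and width values are all illustrative assumptions.

```python
import numpy as np

def gaussian_response(freq, center, fwhm):
    """Unit-height Gaussian frequency response (all arguments in the same units)."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    # abs() makes the response symmetric about zero frequency, so the
    # corresponding impulse response is real.
    return np.exp(-0.5 * ((np.abs(freq) - center) / sigma) ** 2)

def circular_convolve(x, h):
    """Direct circular convolution: y[n] = sum_k h[k] * x[(n - k) mod N]."""
    reversed_x = x[::-1]
    return np.array([np.dot(h, np.roll(reversed_x, n + 1)) for n in range(len(x))])

rng = np.random.default_rng(0)
n_points = 256
interferogram = rng.normal(size=n_points)   # synthetic stand-in, not real FTIR data

# Frequency axis in cycles per sample; center and width are illustrative only.
freq = np.fft.fftfreq(n_points)
response = gaussian_response(freq, center=0.20, fwhm=0.03)

# Route 1: multiply the spectrum by the frequency response.
filtered_spectral = np.fft.ifft(response * np.fft.fft(interferogram)).real

# Route 2: convolve the signal with the filter's impulse response.
impulse_response = np.fft.ifft(response).real
filtered_direct = circular_convolve(interferogram, impulse_response)

print(np.allclose(filtered_spectral, filtered_direct))
```

The two routes agree to within rounding error, which is the convolution theorem for the discrete Fourier transform. A practical filter would use a truncated impulse response rather than the full-length circular convolution shown here.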
Here, the corresponding operation is the convolution of the interferogram and the time-domain representation of the frequency response function. A variety of techniques are available for performing this operation rapidly (11). The approach used here is termed a finite impulse response matrix (FIRM) filter. This filtering scheme was developed in our laboratory to provide optimum performance with FTIR interferograms (10).
In the interferogram, the filtering operation suppresses those sinusoidal signals whose frequencies lie outside of the filter band-pass. The filtered interferogram is thereby reduced to two features: (1) the interferogram representation of the Gaussian frequency response function and (2) the corresponding representation of the analyte band. As the Gaussian feature is wider than the absorption band, its interferogram representation damps at a faster rate. Thus, beyond the point in the filtered interferogram where the representation of the Gaussian feature has damped to zero, the dominant information is a sinusoidal signal whose amplitude is related to the height of the analyte absorption band.
The center and right plots in Figure 1 illustrate these concepts. The center plots depict points 175-250 (relative to the centerburst) in the unfiltered interferograms corresponding to the single-beam spectra discussed above. The right plots depict the corresponding segments in the filtered interferograms. The filtering operation clearly produces a signal that encodes the presence of the analyte band. These 76-point sinusoidal signals form the test data used in this work in the development of the PLDA methodology. The computed piecewise linear discriminants implement an automated detection algorithm for determining analyte presence.
The test data set was constructed by selecting 3000 patterns from a pool of 12141 interferograms collected during 17 different experimental runs. Based on visual inspections of transformed spectra, 1425 of these interferograms were determined to contain SF6 information. The remaining interferograms were randomly selected from among those interferograms judged to contain no SF6 information.
The 3000 interferograms were filtered by use of a FIRM filter that approximated the Gaussian frequency response function plotted in Figure 1. The resulting set of 3000 76-point filtered interferogram segments formed the test data set for the pattern recognition work.
Pattern Recognition Analysis of Interferogram Signals. Pattern recognition techniques are ideally suited to the problem of classifying interferogram signals ("patterns") into predefined classes. The application described here is a two-class problem, where class 1 denotes those interferograms that contain SF6 information (termed SF6-active) and class 2 denotes those that contain no SF6 information (termed SF6-inactive). To investigate the feasibility of developing a pattern recognition method for this application, principal components analysis (12) was performed on the 76-dimensional data set. The NIPALS algorithm (13, 14) was used to compute the first three principal components for the mean-centered data. Together, these components explained 99.0% of the total variance in the data. Figure 2 is a plot of the 3000 76-dimensional patterns projected onto the first three principal components. In the figure, the class 1 (SF6-active) patterns are represented by open circles, while the class 2 patterns are displayed as solid triangles. The plot shows that the two classes are distributed differently in the data space and suggests that a pattern recognition technique may be able to classify the patterns based on their different clustering characteristics. The overlap of the two classes in the figure implies that, at the limit of detection, the signal from a weakly SF6-active pattern will be within the variation observed among the non-SF6 patterns.
Figure 2. Principal components score plot of 3000 filtered interferogram segments. SF6-active interferograms are represented by open circles.
In an initial study, LDA was used to implement a detection algorithm for the digitally filtered interferogram data (15). Figure 3 is a graphical example of the use of LDA with two-dimensional data. The two data classes (circles and squares) are divided by a separating surface, represented in the plot by the line positioned between the classes. Mathematically, this surface is defined by the locus of points lying orthogonal to an n-dimensional vector termed a weight vector or discriminant, where n is the dimensionality of the pattern data. The weight vector, w, is calculated such that
$$\mathbf{w}^{T}\mathbf{x}_{1} > 0 \qquad (1)$$
$$\mathbf{w}^{T}\mathbf{x}_{2} \le 0 \qquad (2)$$
where $\mathbf{x}_{1}$ represents a pattern from class 1 and $\mathbf{x}_{2}$ represents a pattern from class 2. The dot products in eqs 1 and 2 are termed discriminant scores. To offset the separating surface from the origin, the $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ vectors are typically augmented with a constant element. In the present example, the resulting pattern vectors are of dimension 77.
Figure 3. Graphical representation of linear discriminant analysis. The discriminant is represented by the line separating the two classes of symbols.
There are both statistical and empirical methods for computing linear discriminants (16-19). The statistical methods assume that the data are sampled from a known statistical distribution and are therefore less desirable. For example, the distribution of the SF6-active points in Figure 2 is such that a single mean vector and variance-covariance matrix cannot adequately describe the SF6-active cluster. The empirical methods make no assumptions about the distribution of the data and are based on iterative calculations.
Figure 3 indicates that LDA is a viable technique for a two-class pattern recognition problem when a single linear surface can adequately separate the data classes. The principal components plot in Figure 2 suggests, however, that the class boundaries existing in the filtered interferogram data are too complex to be represented by a single linear discriminant. SF6 emission and absorption signals are observed on either side of the non-SF6 signals. In addition, phase variations in the data produce the conical shape of the plot. In the previous LDA study, an attempt was made to minimize these problems by taking the absolute value of each point in the filtered interferogram segments. The corresponding principal components plot of the transformed data is presented in Figure 4. Effectively, the data transform folds the SF6-active points in Figure 2 onto the same side of the data space, relative to the non-SF6 points. This procedure produced good pattern recognition results, although a comparison of Figures 2 and 4 suggests that some discrimination between the class 1 and class 2 data points is lost in performing the transform.
Figure 4. Principal components score plot of 3000 filtered interferogram segments after taking the absolute value of each segment coefficient. SF6-active interferograms are represented by open circles.
Overview of Piecewise Linear Discriminant Analysis. The above discussion motivates the investigation of a more complex (i.e., nonlinear) separating surface for the nontransformed patterns. As noted above, the PLDA method for approximating a nonlinear separating surface is based on the use of multiple linear discriminants to form a piecewise approximation of a nonlinear surface. The linear discriminants are calculated sequentially, with each discriminant separating a portion of the patterns in the data set. The PLDA algorithm used for this work involves calculating discriminants that have a pure-class subset on one side of the discriminant and a mixture of the two classes on the other side (20). A discriminant of this type is called "single-sided". After a single-sided discriminant has been calculated, those patterns on the pure-class or single-side of the discriminant are removed from the calculation, and another discriminant is computed in the same manner.
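The discriminant scores of eqs 1 and 2, the single-sided test, and the removal of separated patterns can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the two-dimensional data, and the weight vector are hypothetical, and class 1 is taken to lie on the pure-class (positive-score) side.

```python
import numpy as np

def augment(patterns):
    """Append a constant element so the surface can be offset from the origin."""
    patterns = np.asarray(patterns, dtype=float)
    return np.hstack([patterns, np.ones((patterns.shape[0], 1))])

def scores(w, patterns):
    """Discriminant scores w^T x for each augmented pattern."""
    return augment(patterns) @ np.asarray(w, dtype=float)

def is_single_sided(w, class2):
    """True when no class 2 pattern lies on the positive (pure-class) side."""
    return bool(np.all(scores(w, class2) <= 0.0))

def remove_separated(w, class1):
    """Keep only the class 1 patterns not yet placed on the pure-class side."""
    class1 = np.asarray(class1, dtype=float)
    return class1[scores(w, class1) <= 0.0]

# Hypothetical two-dimensional data; the last element of w is the offset term.
class1 = np.array([[3.0, 0.0], [2.5, 1.0], [0.5, 0.5]])
class2 = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 2.0]])
w = np.array([1.0, 0.0, -2.0])   # separating surface at x1 = 2

print(scores(w, class1))
print(is_single_sided(w, class2))
print(remove_separated(w, class1))
```

In this example the first two class 1 patterns receive positive scores and are removed, all class 2 scores are negative, and the single remaining class 1 pattern would be passed on to the calculation of the next discriminant.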
The result is a set of discriminants in which each discriminant separates a different pure-class subset. Collectively, the set of discriminants defines a separating surface.
The piecewise linear discriminant is often termed a committee classifier, since the classification of an unknown pattern requires the entire set of discriminants. This committee of discriminants consists of several members that are applied in the order in which they were computed. Figure 5 shows a graphical representation of such a classifier in which each discriminant has the same class on its pure-class side. To classify an unknown pattern, each discriminant is applied to the pattern, producing one of two possible results: (1) the pattern falls on the pure-class side of the discriminant; or (2) the pattern falls on the mixed-class side of the discriminant. The first discriminant that classifies the pattern onto the pure-class side determines the class of the pattern. The last discriminant determines the class if the unknown pattern is never classified on the pure-class side.
Figure 5. Pictorial representation of a piecewise linear discriminant. The discriminant approximates a nonlinear surface through the use of multiple linear discriminants.
Calculation of Piecewise Linear Discriminants. A novel multistep procedure was devised to calculate and optimize the set of discriminants comprising the piecewise linear separating surface. Each of the calculation steps is described separately below.
Initialization. A Bayes classification algorithm is used to calculate an initial approximation to the discriminant (21). The Bayes classifier is a statistical method that estimates the position of the discriminant based on the sample variance-covariance matrix and the assumption that the data belong to the multivariate normal (Gaussian) distribution. As noted above, assumptions of this type may be invalid, and consequently, the Bayes discriminant is used solely as a starting approximation for the final vector. The Bayes weight vector is then evaluated to determine if it is single-sided, as required by the PLDA algorithm. If this initial vector does not reflect a single-sided discriminant, it is corrected to be single-sided via a fixed-increment perceptron algorithm (22). This algorithm is implemented such that it iteratively moves the discriminant through the data space toward the misclassified class 2 patterns (given that the class 1 patterns are on the pure-class side of the discriminant). The discriminant is moved a fraction of the distance toward all of the class 2 patterns, and the discriminant is evaluated for the single-sided criterion. The process is repeated until all class 2 patterns have been correctly classified.
Discriminant Optimization. The implementation of PLDA used for this work includes two methods of optimization.
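The perceptron correction used in the initialization step can be sketched as follows. This is one plausible reading of a fixed-increment update, not the authors' implementation: the function name, the step size, the batch form of the update, and the example data are assumptions. Patterns are assumed to be already augmented with a constant element, and class 1 is assumed to lie on the positive-score side.

```python
import numpy as np

def make_single_sided(w, class2_aug, step=0.1, max_iter=10_000):
    """Shift w until no class 2 pattern has a positive discriminant score."""
    w = np.array(w, dtype=float)
    class2_aug = np.asarray(class2_aug, dtype=float)
    for _ in range(max_iter):
        misclassified = class2_aug[class2_aug @ w > 0.0]
        if len(misclassified) == 0:
            return w
        # Fixed increment: step the weight vector against the offending patterns,
        # which lowers their scores and moves the surface past them.
        w -= step * misclassified.sum(axis=0)
    raise RuntimeError("no single-sided discriminant found within max_iter")

# Hypothetical augmented class 2 patterns; the third one starts on the wrong side.
class2_aug = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 1.0], [3.0, 0.0, 1.0]])
w_start = np.array([1.0, 0.0, -2.0])
w_fixed = make_single_sided(w_start, class2_aug)

print(np.all(class2_aug @ w_fixed <= 0.0))
```

Because every augmented pattern carries the same constant element, a weight vector that scores all class 2 patterns negatively always exists, so the loop terminates for finite data; the `max_iter` guard is only a safety net.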
Both methods can be used at two different levels, as the discriminants can be optimized either individually or collectively. At the individual level, discriminants can be optimized using translation and simplex optimization. At the collective level, an algorithm has been developed that utilizes a recalculation procedure to effect an optimization of the discriminants as a set. The following discussion addresses first the two types of individual optimizations and then the collective optimization.
Translation. Discriminants are translated by use of the fixed-increment perceptron algorithm that moves the discriminant toward all of the class 2 patterns in a manner similar to that described above. With each increment, the discriminant is moved a fraction of the distance toward each class 2 pattern. This generally places additional class 1 patterns on the pure-class side, while retaining a single-sided discriminant. If the resulting discriminant is not single-sided, the previous discriminant is kept and the distance fraction is reduced by an order of magnitude. The process is then repeated until the fraction reaches a specified minimum value. In practice, this procedure is only useful when the initial discriminant computed by the Bayes algorithm is single-sided (without any perceptron correction).
Simplex Optimization. Simplex optimization (23, 24) is a generalized algorithm for use in finding the optimum values for a set of experimental variables. The simplex algorithm has been applied to the optimization of both single linear discriminants (25, 26) and multiple discriminants (1). The implementation used here is termed the super-modified simplex algorithm (24). The simplex algorithm computes a new weight vector w by moving the original w in an optimal direction in the data space. The algorithm consists of a set of rules that governs this movement based on a numerical response function that reflects the performance of the weight vector.
For the current work, many variations of response functions were implemented and evaluated. To be effective, the response function must encode several characteristics related to the performance of each weight vector, including the number of patterns separated and whether the discriminant is single-sided. Since single-sided discriminants are critical in the implementation of this algorithm, those discriminants that are not single-sided are penalized by use of a "purity" factor, P, defined as
$$P = N_{1}/N_{t} \qquad (3)$$
where $N_{1}$ is the number of class 1 patterns separated and $N_{t}$ is the total number of patterns placed on the single side of the discriminant. The purity factor ranges from 0 to 1.0, where P = 1.0 reflects that a true pure-class subset is separated by the discriminant.
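The purity factor of eq 3 follows directly from the discriminant scores. A minimal sketch, assuming augmented pattern arrays, class 1 on the positive-score side, and a hypothetical convention of P = 0 when no patterns are separated (the text does not address that case):

```python
import numpy as np

def purity(w, class1_aug, class2_aug):
    """Purity P = N1 / Nt over the patterns with positive discriminant scores."""
    n1 = int(np.sum(class1_aug @ w > 0.0))        # class 1 patterns separated
    n2 = int(np.sum(class2_aug @ w > 0.0))        # class 2 patterns on the same side
    nt = n1 + n2
    return n1 / nt if nt > 0 else 0.0             # assumption: P = 0 if nothing separated

# Hypothetical augmented patterns (last element is the constant term).
class1_aug = np.array([[3.0, 0.0, 1.0], [2.5, 1.0, 1.0], [0.5, 0.5, 1.0]])
class2_aug = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 1.0], [3.0, 0.0, 1.0]])

p_mixed = purity(np.array([1.0, 0.0, -2.0]), class1_aug, class2_aug)
p_pure = purity(np.array([0.0, 1.0, -0.75]), class1_aug, class2_aug)
print(p_mixed, p_pure)
```

The first weight vector places two class 1 patterns and one class 2 pattern on the positive side, giving P = 2/3; the second separates a single class 1 pattern and no class 2 patterns, giving P = 1.0.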
To prevent a discriminant that is not single-sided from having a larger score than a single-sided discriminant that separates fewer patterns on its pure-class side, the purity factor is raised to the power $\alpha$. Appropriate values of $\alpha$ have been determined empirically to be in the range 10-200, depending on the magnitude of $N_{1}$. For this work, a fixed value of 200 was used. In future analyses, this parameter could be dynamically adjusted as needed during the iterative calculation of the discriminants.
In addition to single-sidedness, the simplex response function must also reflect the number of patterns correctly separated by the discriminant. These two factors are combined in the single-sided response, S, where
$$S = P^{\alpha} N_{1} \qquad (4)$$
For a single-sided discriminant, S is equal to the number of class 1 patterns separated by the discriminant (i.e., P = 1.0). Otherwise, S reflects the penalty for not separating a pure-class subset.
The function represented by S is discrete whenever the discriminants are single-sided. As a result, the movement of the discriminants during the optimization frequently results in local optima. This is due simply to the inability of the response function to differentiate among discriminants that classify the same number of class 1 patterns correctly. Rather than defining a continuous profile, the response surface for the optimization consists of discrete steps.
To produce a smoother response surface for the optimization, a continuous response function was implemented. The design of this function is motivated by the desire that the computed discriminants be positioned to define the limit of detection of the analyte. The function used is somewhat analogous to a signal-to-noise ratio, where the signal component is represented by S in eq 4. The noise is based on the standard deviation of the discriminant scores evaluated for those patterns in class 2. A smaller standard deviation value produces a larger (i.e., more optimum) value of the response function. It was hypothesized that by minimizing the variation of the class 2 discriminant scores, the resultant discriminant would be more nearly aligned with the interface between the class 1 and class 2 patterns.
For the data used here, the value of the standard deviation is typically on the order of 10” and consequently must be scaled to reduce its influence on the response function. Furthermore, since the "signal" is actually the number of correctly classified class 1 patterns, the numerator can vary widely, requiring that the scaling of the noise must be relative to the signal for the noise to have a similar influence on each discriminant. Many variations on this premise were implemented and evaluated. The overall best equation for scaling the noise was found to be
R = [1.0 - s’/’]S
where f is the scaling factor, $N_{1}$ is as defined above, R is the final scaled response function value, s is the standard deviation of the discriminant scores for the class 2 patterns, and S is the single-sided score from eq 4. The value of R can be interpreted as the value of S (i.e., number of correct class 1 patterns) that has been penalized based on the degree of variation among the discriminant scores for the class 2 patterns.
In testing, the response function in eq 6 was significantly more resistant to the effects of local optima than the discrete response function in eq 4. For this reason, the continuous response function was used in computing the piecewise linear discriminants reported here. The existence of an unlimited number of response function values was observed to slow the progress of the optimization somewhat, although no significant increase in the number of iterations was required.
While the method outlined above is general in nature, two of the parameters described are dependent on the magnitudes of the data values comprising the patterns. The s parameter is dependent on the vector magnitudes of the class 2 patterns, necessitating that the f parameter be adjusted for the specific data used. While the calculation of f must include the $N_{1}$ parameter to allow separation of differently sized subsets, the exact scaling of $N_{1}$ is dependent on s. For example, an increase in the vector magnitudes of the class 2 patterns would produce an increase in the magnitude of s, causing the discriminant score to be more influenced by the noise than by the number of patterns separated in the pure subset. This imbalance in the scaling would necessitate a greater number of discriminants to separate the same number of patterns. Thus, for the response function described here to be used in another application, eq 5 must be modified to achieve the correct balance between the signal and noise terms.
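The effect of the exponent on the purity factor in the single-sided response of eqs 3 and 4 can be seen with a short worked example. This sketch covers only the discrete response S; the scaled continuous response is not reproduced because it depends on the data-specific scaling factor. The function name and the pattern counts are hypothetical.

```python
def single_sided_response(n1, nt, alpha=200.0):
    """S = P**alpha * N1, with purity P = N1 / Nt (eqs 3 and 4)."""
    if nt == 0:
        return 0.0                      # assumption: no patterns separated scores zero
    return (n1 / nt) ** alpha * n1

# A single-sided discriminant separating 400 class 1 patterns.
s_pure = single_sided_response(400, 400)

# A discriminant separating 500 class 1 patterns but also 5 class 2 patterns.
s_mixed_unpenalized = single_sided_response(500, 505, alpha=1.0)
s_mixed = single_sided_response(500, 505)          # alpha = 200, as used in this work

print(s_pure, s_mixed_unpenalized, s_mixed)
```

With an exponent of 1, the impure discriminant would outscore the single-sided one (about 495 versus 400). With an exponent of 200, its score drops to roughly 68, so the single-sided discriminant is preferred even though it separates fewer class 1 patterns.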
Current work in our laboratory is focusing on the design of an analogous response function that is independent of the pattern magnitudes. Recalculation. The simplex optimization described above is an effective means of optimizing each weight vector. However, optimizing each weight vector individually may not produce the optimum piecewise linear discriminant, since the discriminant consists of a set of weight vectors. Some method of performing a collective optimization of all weight vectors is needed to obtain a truly optimal discriminant. Unfortunately, no algorithms for a collective optimization of a piecewise linear discriminant exist in the literature. One method for producing a collective optimization was developed for this study. The procedure used here is motivated by considering that the calculation of the initial set of weight vectors is hierarchical in nature. The calculation of each weight vector is influenced by the performance of weight vectors that have been previously computed. Each of these
ANALYTICAL CHEMISTRY, VOL. 63, NO. 9, MAY 1, 1991
Calculate weight vectors
941
*
\
0 Figure 7. Pictorial depiction of the recalculation of a two-vector piecewise linear discriminant. The recalculationallows the discriminants to be repositioned in order to separate the two classes. I
YES
Figure 6. Flow chart showing the process for calculating and recalculating (optimizing) piecewise linear discriminants.

Table I. Classifications Produced by Individual Discriminants

          class 1 patterns separated (a)
                      recalculated
vector    original    during (b)    after (c)
1            521           2           431
2            772         395           766
3             18          29             9
4             34         502            97
5             18         392            23
total       1363        1320          1326

(a) 1425 class 1 patterns total. (b) Number of class 1 patterns separated during calculation. (c) Number of class 1 patterns separated after discriminant reapplied.

vectors is computed such that it separates as many of the remaining patterns as possible. In order to effect a collective optimization, a method must be developed to allow subsequent weight vectors to influence the calculation of previous weight vectors. Our research has produced an algorithm to perform a recalculation of a set of weight vectors, allowing for the existence of all weight vectors that comprise the piecewise linear discriminant. The recalculation is performed in a manner identical with the calculation described above, but the data set of patterns is altered to reflect the presence of other vectors in the set. Prior to recalculating a given weight vector, those patterns classified by later weight vectors are removed from the data set. This procedure allows the earlier weight vectors to be repositioned based on the classification performance of the later weight vectors.

The recalculation algorithm can be outlined as follows: (1) the data set initially consists of all class 2 patterns and any class 1 patterns not separated by the entire set of weight vectors; (2) the class 1 patterns separated by the first weight vector are then added back into the data set, and the first weight vector is recalculated; and (3) the class 1 patterns separated by the recalculated first weight vector are removed. Steps 2 and 3 are then repeated for each of the remaining weight vectors in the set. This process is summarized in the flow chart in Figure 6. For this application, it has been found that repeated recalculations will achieve improved results but that eventually an optimum is reached and further recalculations are ineffective.

Another way to perform the recalculation is to modify the algorithm for regenerating the data set. In this approach, before the patterns are added back into the data set in step 2 above, the other discriminants in the set are applied to the patterns, and only those patterns that are not correctly classified by other weight vectors are added back into the data set. A graphical representation of a two-vector discriminant before and after this recalculation is shown in Figure 7. In the left plot of the figure, the shaded area indicates the area of overlap; both discriminants would separate the class 1
patterns in this area. By eliminating these patterns in the recalculation of the first discriminant, the first discriminant can be placed in an optimal position. The placement of the optimized discriminants is depicted in the right plot of the figure. This recalculation method has been found to approach the optimal discriminant more quickly than the first method outlined above. For this reason, the latter method was used in computing the results presented below.

Pattern Recognition Results. For the results described below, a discriminant was calculated that consisted of a set of five weight vectors. After each vector was approximated via the Bayes calculation, the simplex algorithm was initialized and 1500 iterations were performed. The best vector was saved, and the simplex procedure was then reinitialized with this vector and another 1500 iterations were performed. This process was repeated until five vectors had been calculated. The set of five vectors was then recalculated, and each vector was optimized as before. The recalculation was repeated four times, with the best results being obtained from the first recalculation. For simplicity, only the results from the original five-vector discriminant and the discriminant from the first recalculation will be presented.

Table I displays the number of patterns separated by the individual vectors in the two discriminants. For the recalculated discriminant, two separation results are shown. The first separation reflects the performance of the discriminant as it was calculated, while the second reflects the performance after the sequential application of the vectors as a committee classifier. The number of patterns correctly classified often increases when the discriminant is reapplied, since the recalculation algorithm effectively hides patterns from weight vectors appearing earlier in the set. The classification statistics achieved in computing the two discriminants are presented in Table II.
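The two recalculation variants outlined above, and the committee-style reapplication of the vectors, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `one_sided_fit` is a hypothetical stand-in for the Bayes-plus-simplex calculation, the last element of each weight vector is assumed to be a bias term, and a class 1 pattern is taken to be "separated" by a vector when its score is positive. All function names are illustrative.

```python
def score(w, x):
    """Discriminant score of pattern x; the last element of w is the bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def is_class1(vectors, x):
    """Committee rule: class 1 if any weight vector gives a positive score."""
    return max(score(w, x) for w in vectors) > 0

def one_sided_fit(X1, X2):
    """Hypothetical stand-in for the Bayes + simplex step (NOT the paper's method).

    Points from the class 2 mean toward the class 1 mean, then sets the bias
    so that every class 2 pattern scores below zero.
    """
    dim = len(X2[0])
    m1 = [sum(x[k] for x in X1) / len(X1) for k in range(dim)]
    m2 = [sum(x[k] for x in X2) / len(X2) for k in range(dim)]
    d = [a - b for a, b in zip(m1, m2)]
    bias = -max(sum(dk * xk for dk, xk in zip(d, x)) for x in X2) - 1e-9
    return d + [bias]

def recalculate(vectors, X1, X2, fit, skip_covered=True):
    """One recalculation pass over all weight vectors.

    skip_covered=False follows steps 1-3 as outlined in the text;
    skip_covered=True adds back only patterns that no other vector separates.
    """
    vectors = [list(w) for w in vectors]
    n = len(X1)
    # owner[j]: first vector that separates class 1 pattern j, or None
    owner = [next((i for i, w in enumerate(vectors) if score(w, X1[j]) > 0), None)
             for j in range(n)]
    in_set = [o is None for o in owner]            # step 1: unseparated patterns
    for i in range(len(vectors)):
        train = list(in_set)
        for j in range(n):
            if owner[j] != i:
                continue
            covered = any(score(w, X1[j]) > 0
                          for k, w in enumerate(vectors) if k != i)
            if not (skip_covered and covered):
                train[j] = True                    # step 2: add patterns back
        subset = [X1[j] for j in range(n) if train[j]]
        if subset:
            vectors[i] = fit(subset, X2)           # recalculate vector i
        # step 3: remove the patterns the recalculated vector separates
        in_set = [train[j] and score(vectors[i], X1[j]) <= 0 for j in range(n)]
    return vectors

# Small synthetic example (illustrative values only)
X2 = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2)]
X1 = [(3.0, 0.0), (3.2, 0.3), (2.8, -0.2), (0.0, 3.0), (0.3, 3.1), (-0.2, 2.9)]
initial = [one_sided_fit(X1[:3], X2), one_sided_fit(X1[3:], X2)]
recalculated = recalculate(initial, X1, X2, one_sided_fit)
print(sum(is_class1(recalculated, x) for x in X1), "of", len(X1), "class 1 patterns separated")
```

Passing the fitting routine in as an argument keeps the data set bookkeeping, which is the part described in the text, separate from the choice of optimizer.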
For reference, the single linear discriminant from previous work is also shown (15). This discriminant represents the best performance observed from a single-vector discriminant. The piecewise linear discriminants both separated 97% or greater of the patterns
ANALYTICAL CHEMISTRY, VOL. 63, NO. 9, MAY 1, 1991

Figure 8. Discriminant score plots showing prediction results for three different discriminants: a single linear discriminant (upper), the initial piecewise linear discriminant (center), and the recalculated piecewise discriminant (lower). Horizontal axis of each panel: interferogram number.
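As the text explains, the value plotted for each pattern in the piecewise cases is the most positive score obtained over all weight vectors. A minimal sketch of that reduction follows; the two weight vectors and the test patterns are hypothetical values chosen for illustration, and the last element of each vector is assumed to be a bias term.

```python
def score(w, x):
    """Discriminant score of pattern x; the last element of w is the bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def plotted_score(vectors, x):
    """Most positive score over all weight vectors, and the index of that vector."""
    values = [score(w, x) for w in vectors]
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best

# Hypothetical two-vector discriminant in two dimensions
vectors = [[1.0, 0.0, -1.0],   # positive when x[0] > 1
           [0.0, 1.0, -1.0]]   # positive when x[1] > 1

for x in [(2.0, 0.0), (0.0, 3.0), (0.5, 0.25)]:
    value, which = plotted_score(vectors, x)
    label = "class 1" if value > 0 else "class 2"
    print(x, value, which, label)
```

For a class 2 pattern every score is negative, so the maximum is the least negative one. It is proportional to a geometric distance from the separating surface only if the weight vectors are normalized, which is an assumption of this sketch rather than something stated in the text.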
Table II. Classification Results for Discriminant Development

                    overall correct (a)    missed alarms (b)    false alarms (c)
discriminant        no.       %            no.      %           no.     %
single linear       2795      93.2         203      14.2        2       0.1
piecewise linear    2938      97.9         62       4.4         0       0.0
recalcd PLD         2901      96.7         99       6.9         0       0.0

(a) 3000 interferograms total. (b) 1425 interferograms judged to contain SF6 information. (c) 1575 interferograms judged to contain no SF6 information.

Table III. Classification Results for Predicted Data Sets

                    overall correct (a)    missed alarms (b)    false alarms (c)
discriminant        no.       %            no.      %           no.     %
single linear       2491      98.0         28       13.1        24      1.0
piecewise linear    2506      98.5         18       8.4         19      0.8
recalcd PLD         2506      98.5         21       9.8         16      0.7

(a) 2543 interferograms total (excluding indeterminate interferograms). (b) 214 interferograms judged to contain SF6 information. (c) 2329 interferograms judged to contain no SF6 information.
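The tabulated percentages follow directly from the counts and class sizes given in the table footnotes, and the weighted means reported in Table IV follow from the per-set signal-to-noise ratios and data set sizes. The short check below reproduces a few of these values. It is a sketch of the arithmetic only: the helper names are illustrative, and the definition of the signal-to-noise ratio itself is not reproduced because the text does not give a formula.

```python
def percent(count, total):
    """Percentage rounded to one decimal place, as in the tables."""
    return round(100.0 * count / total, 1)

def weighted_mean(values, weights):
    """Mean of values weighted by the number of interferograms in each set."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Table II: missed alarms among the 1425 SF6-active development interferograms
missed = {"single linear": 203, "piecewise linear": 62, "recalcd PLD": 99}
for name, count in missed.items():
    print(name, percent(count, 1425))

# Table IV: signal-to-noise ratios for sets 1-3 and the sizes of those sets
sizes = [684, 1022, 837]
snr = {
    "single linear": [21.9, 7.3, 12.4],
    "piecewise linear": [40.2, 13.0, 27.2],
    "recalcd PLD": [72.7, 13.5, 28.8],
}
means = {name: weighted_mean(values, sizes) for name, values in snr.items()}
for name, mean in means.items():
    relative = round(100 * mean / means["single linear"])
    print(name, round(mean, 1), relative)
```

Computed this way, the 194% and 267% figures quoted in the text correspond to the ratios of the weighted means (about 25.0/12.9 and 34.5/12.9), that is, the piecewise values expressed as a percentage of the single linear value.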
Table IV. Signal-to-Noise Ratios for Prediction Set

             single linear    piecewise linear    recalcd PLD
set 1 (a)        21.9              40.2               72.7
set 2 (b)         7.3              13.0               13.5
set 3 (c)        12.4              27.2               28.8
mean (d)         12.9              25.0               34.5

(a) Helicopter data set containing 684 known interferograms. (b) Truck data set containing 1022 known interferograms. (c) Helicopter data set containing 837 known interferograms. (d) Weighted mean signal-to-noise ratio.

in the data set, while the single linear discriminant separated slightly more than 93% of the data set. While the difference between a single-vector discriminant and a five-vector discriminant is only 4% of the data set, that 4% represents those patterns with very weak analyte signals. Consequently, the piecewise discriminants should exhibit higher sensitivity for analyte detection.

The best test of any analytical method is to apply it to data that were not included in the development of the method. To evaluate the prediction performance of the discriminants, three data sets were employed that were not represented among the 3000 interferograms in the original data set. The prediction data were collected with the remote sensor mounted on vehicles that made a total of 10 passes by a source of SF6. Together, the three data sets contained 2568 interferograms, corresponding to 214 SF6-active interferograms, 2329 SF6-inactive interferograms, and 25 interferograms that were judged indeterminate in terms of SF6 presence. The judgement of class membership was performed through visual interpretation of the corresponding single-beam spectra, ratioed to various known background spectra. The indeterminate interferograms were not used in computing any performance statistics.

The application of a piecewise linear discriminant to a set of unknown patterns is performed by computing the discriminant score for each pattern. In a graphical representation, the results can be displayed as a plot of the discriminant scores vs pattern number. Since multiple weight vectors are used, there are multiple discriminant scores that could be plotted. For the purposes of this analysis, the most positive discriminant scores obtained by applying all weight vectors to each pattern were used. For class 1 patterns, the signal is then maximized, and for class 2 patterns, the plotted values reflect the minimum distance from the pattern to the nonlinear separating surface.

Figure 8 shows the resulting plots of discriminant scores for the three prediction data sets. The upper, middle, and lower plots correspond, respectively, to the single linear discriminant from the previous study, the initial five-vector discriminant, and the five-vector discriminant from the first recalculation. The vertical dashed lines in the plot indicate the three prediction data sets. For presentation purposes, excess baseline was removed from the plots, leaving 650 discriminant scores from each data set, for a total of 1350 plotted points. The peaks in the plots correspond to passes of the sensor by the source of SF6.

Table III displays statistics describing the prediction performance of each discriminant. While the overall prediction percentages are excellent for each of the three discriminants, the piecewise linear discriminants are clearly more sensitive than the single linear discriminant, producing fewer missed detections. Both piecewise discriminants incorrectly classify only 37 of the 2568 patterns. The results in Table III are useful, but they do not fully represent the increased sensitivity of the piecewise linear
discriminant that is visually apparent in Figure 8. This is due to the small number of weak SF6 signals in the three prediction data sets. To quantify this enhanced sensitivity, a signal-to-noise ratio was computed on the discriminant scores, employing all the interferograms containing SF6 information to represent the "signal" and all others to represent noise. These ratios were computed for the three prediction data sets and are shown in Table IV. The average improvements in signal-to-noise observed were 194% for the initial five-vector discriminant and 267% for the recalculated discriminant, relative to the best single linear discriminant. These results show clearly that the recalculation procedure is highly effective in placing the piecewise discriminant in an optimum orientation relative to the interface between the data classes.

CONCLUSIONS

The multistep procedure described above for optimizing the placement of piecewise linear discriminants is a general approach that is not limited to use with interferogram data. The techniques developed in this work should be applicable to any pattern recognition problem in which the interface between the data classes is complex. The only modification required is the adjustment of eq 5 to account for patterns of different relative magnitudes. Other areas of research in which an automated detection algorithm of this type should be applicable include mass spectrometric or FTIR detection in chromatography. The optimized discriminants are particularly suited to problems in which it is important that the discriminants define the limit of detection of a species.

Work is continuing in our laboratory on the problem of collective optimization of the weight vectors comprising a piecewise linear discriminant. We are exploring the possibility of operating the simplex optimization with a response function based on the performance of all weight vectors simultaneously.

ACKNOWLEDGMENT

Robert Kroutil and co-workers of the U.S.
Army Chemical Research, Development, and Engineering Center are acknowledged for providing the passive remote sensing data used in this work. Silvio Emery is acknowledged for providing the three-dimensional plotting software used in the generation of the principal components plots.

LITERATURE CITED

(1) Harrington, P. de B.; Voorhees, K. J. Anal. Chem. 1990, 62, 729-734.
(2) Devaux, M. F.; Bertrand, D.; Robert, P.; Qannari, M. Appl. Spectrosc. 1988, 42, 1015-1019.
(3) Zippel, M.; Mowitz, J.; Koehler, I.; Opferkuch, H. J. Anal. Chim. Acta 1982, 140, 123-142.
(4) Duda, R. O.; Fossum, H. IEEE Trans. Electron. Comput. 1966, 15, 220-232.
(5) Frew, N. M.; Wangen, L. E.; Isenhour, T. L. Pattern Recognit. 1970, 3, 281-296.
(6) Chang, C. IEEE Trans. Electron. Comput. 1973, 22, 859-862.
(7) Mangasarian, O. L. IEEE Trans. Inf. Theory 1968, 14, 801-807.
(8) Takiyama, R. Pattern Recognit. 1980, 12, 75-82.
(9) Small, G. W.; Kroutil, R. T.; Ditillo, J. T.; Loerop, W. R. Anal. Chem. 1988, 60, 264-269.
(10) Small, G. W.; Harms, A. C.; Kroutil, R. T.; Ditillo, J. T.; Loerop, W. R. Anal. Chem. 1990, 62, 1768-1777.
(11) Childers, D.; Durling, A. Digital Filtering and Signal Processing; West Pub.: St. Paul, MN, 1975.
(12) Hotelling, H. J. Educ. Psychol. 1933, 24, 417.
(13) Wold, H. Multivariate Analysis; Krishnaiah, P. R., Ed.; Academic Press: New York, 1966; pp 391-420.
(14) Martens, H.; Naes, T. Multivariate Calibration; Wiley: New York, 1989; p 111.
(15) Small, G. W.; Carpenter, S. E.; Kaltenbach, T. F.; Kroutil, R. T. Anal. Chim. Acta, in press.
(16) Anderson, T. W.; Bahadur, R. R. Ann. Math. Stat. 1962, 33, 420.
(17) Pietrantonio, P. C.; Jurs, P. C. Pattern Recognit. 1972, 4, 391.
(18) Kaminuma, T.; Watanabe, S. Pattern Recognit. 1972, 4, 289.
(19) Moriguchi, I.; Komatsu, K.; Matsushita, Y. J. Med. Chem. 1980, 23, 20.
(20) Lee, T.; Richards, J. A. Pattern Recognit. 1984, 17, 453-464.
(21) Tou, J. T.; Gonzalez, R. C. Pattern Recognition Principles; Addison-Wesley: Reading, MA, 1974; pp 119-123.
(22) Tou, J. T.; Gonzalez, R. C. Pattern Recognition Principles; Addison-Wesley: Reading, MA, 1974; pp 158-169.
(23) Nelder, J. A.; Mead, R. Comput. J. 1965, 7, 308.
(24) Routh, M. W.; Swartz, P. A.; Denton, M. B. Anal. Chem. 1977, 49, 1422-1428.
(25) Ritter, G. L.; Lowry, S. R.; Wilkins, C. L.; Isenhour, T. L. Anal. Chem. 1975, 47, 1951-1956.
(26) Brissey, G. F.; Spencer, R. B.; Wilkins, C. L. Anal. Chem. 1979, 51, 2295-2297.
RECEIVED for review October 25, 1990. Accepted January 25, 1991. This work was supported by the U.S. Army Chemical Research, Development, and Engineering Center, Edgewood, MD, under Contract DAAA15-89-C-0010.