Anal. Chem. 1999, 71, 4263-4271

Improved Probabilistic Neural Network Algorithm for Chemical Sensor Array Pattern Recognition

Ronald E. Shaffer* and Susan L. Rose-Pehrsson

Chemistry Division, Naval Research Laboratory, Code 6116, 4555 Overlook Avenue, SW, Washington, D.C. 20375

An improved probabilistic neural network (IPNN) algorithm for use in chemical sensor array pattern recognition applications is described. The IPNN is based on a modified probabilistic neural network (PNN) with three innovations designed to reduce the computational and memory requirements, to speed training, and to decrease the false alarm rate. The utility of this new approach is illustrated with the use of four data sets extracted from simulated and laboratory-collected surface acoustic wave sensor array data. A competitive learning strategy, based on a learning vector quantization neural network, is shown to reduce the storage and computation requirements. The IPNN hidden layer requires only a fraction of the storage space of a conventional PNN. A simple distance-based calculation is reported to approximate the optimal kernel width of a PNN. This calculation is found to decrease the training time and requires no user input. A general procedure for selecting the optimal rejection threshold for a PNN-based algorithm using Monte Carlo simulations is also demonstrated. This outlier rejection strategy is implemented for an IPNN classifier and found to reject ambiguous patterns, thereby decreasing the potential for false alarms.

In recent years, the use of signal processing methods for detecting and identifying analyte vapor signatures from an electronic nose or chemical microsensor array has become widespread. Chemical microsensor arrays are typically composed of multiple partially selective detectors or sensors (e.g., polymer-coated surface acoustic wave (SAW) chemical sensors,1 dye-impregnated polymer films with fiber-optic fluorescence detection,2 metal oxide sensors,3 and carbon-black polymer composite chemiresistors4). Each individual detector in the array is constructed such that it responds differently to various classes of chemical vapors.5 A chemical microsensor array consisting of just a few semiselective detectors can potentially respond to thousands of different vapors.

* Corresponding author: (e-mail) [email protected]; (phone) 202-404-3361; (fax) 202-404-8119.
(1) Grate, J. W.; Rose-Pehrsson, S. L.; Venezky, D. L.; Klusty, M.; Wohltjen, H. Anal. Chem. 1993, 65, 1868-1881.
(2) Johnson, S. R.; Sutter, J. M.; Engelhardt, H. L.; Jurs, P. C.; White, J.; Kauer, J. S.; Dickinson, T. A.; Walt, D. R. Anal. Chem. 1997, 69, 4641-4648.
(3) McCarrick, C. W.; Ohmer, D. T.; Gilliland, L. A.; Edwards, P. A.; Mayfield, H. T. Anal. Chem. 1996, 68, 4264-4269.
(4) Doleman, B. J.; Lonergan, M. C.; Severin, E. J.; Vaid, T. P.; Lewis, N. S. Anal. Chem. 1998, 70, 4177-4190.
(5) McGill, R. A.; Abraham, M. H.; Grate, J. W. CHEMTECH 1994, 24, 27-37.

10.1021/ac990238+ Not subject to U.S. Copyright. Publ. 1999 Am. Chem. Soc.

Published on Web 08/31/1999

Vapor identification is possible when the collection of sensor responses for a given vapor numerically encodes different types of chemical information.6,7 Treated together, the collection of detector responses is often termed a chemical fingerprint and serves as the basis for applying pattern recognition methods to chemical microsensor array data. An array of m sensors or detectors produces an m-dimensional pattern vector. With judiciously chosen detectors, chemically different vapors generate fingerprints that cluster far away from each other in m-dimensional pattern space. This numerical separation allows them to be distinguished mathematically using multivariate pattern recognition algorithms.4

The choice of pattern recognition algorithm for a chemical sensor array application can be critical: selecting the wrong algorithm can result in lower detection rates or an increased number of false alarms. One of the challenging aspects of this application is the presence of multimodal and overlapping distributions of chemical fingerprints. Difficulties also arise from the need to distinguish among chemically similar species and the inability to collect thorough training sets for building classification rules. This situation is further complicated by the need to collect data in controlled environments (e.g., a laboratory setting) for training the pattern recognition algorithm and then ultimately to deploy the sensor in an uncontrolled environment (e.g., a battlefield). In such a complex environment, there are numerous opportunities for the sensor to be exposed to multiple vapors, or even to vapors that it was not trained to detect or discriminate.1

It is well known that no single algorithm is always best for a particular application. However, within a class of applications with similar attributes, some algorithms may consistently outperform others. On the basis of this premise, in previous work we compared seven different pattern recognition algorithms, evaluating their ability to meet the criteria for an ideal chemical sensor array pattern recognition algorithm.7 This comparison was performed with four chemical microsensor array data sets using five qualitative measures (speed, training difficulty, memory requirements, robustness to outliers, and ability to produce a measure of uncertainty) and one quantitative measure (classification accuracy). It was concluded that the artificial neural network-based approaches produced the most accurate classifications. After considering the five qualitative features, two algorithms, learning vector quantization (LVQ) neural networks and probabilistic neural networks (PNN), stood out as superior for this application.

(6) Zellers, E. T.; Park, J.; Hsu, T.; Groves, W. A. Anal. Chem. 1998, 70, 4191-4201.
(7) Shaffer, R. E.; Rose-Pehrsson, S. L.; McGill, R. A. Anal. Chim. Acta 1999, 384, 305-317.


Table 1. Data Set Class Statistics

data set   no. of sensors   no. of classes   training   prediction   comments
SIM1       4                6                480        480          equal class sizes
SIM2       6                4                480        480          equal class sizes
SAW1       3                3                196        216          nerve (227), blister (134), and nonagent (51)
SAW2       5                7                427        237          GA (105), GB (93), GD (104), VX (66), HD (97), DMMP (99), and CEE (100)

The study also pointed out the shortcomings of each algorithm. In this work, we describe a new pattern recognition method, the improved probabilistic neural network (IPNN). The IPNN algorithm consists of a modified PNN with three innovations designed to reduce the computation and memory requirements, speed training, and decrease the false alarm rate. The success of the new IPNN algorithm is demonstrated using chemical sensor array data.

EXPERIMENTAL SECTION

Four data sets, representative of typical chemical sensor array data, were employed in this study. These data sets were also used in the previously reported comparison study.7 Table 1 describes the four data sets; the numbers in parentheses in the comments column give the number of patterns in each output data class. Two of these data sets are simulated (SIM1, SIM2), while the remaining data (SAW1, SAW2) were collected using SAW chemical sensor systems. The two simulated data sets were generated using mean vectors and variances designed to mimic SAW sensor array data and are described in detail elsewhere.8

The SAW1 data set was collected using a four-sensor SAW array with a preconcentrator sampling system.1 SAW data for two types of chemical warfare agents (blister and nerve agents), simulants, and potential interferences (nonagents) were collected under a variety of experimental conditions (e.g., changing humidity, concentrations, and mixtures). These data were collected as part of a study to determine the feasibility of using SAW sensor arrays for chemical warfare agent detection. In this study, the nerve agent class included dimethyl methyl phosphonate (DMMP), pinacolyl methylphosphonofluoridate (GD), and O-ethyl S-(2-isopropylaminoethyl) methyl phosphonothiolate (VX). The blister agent class contained bis(2-chloroethyl) sulfide (HD) and dichloropentane (DCP). Toluene, water, diesel exhaust, jet fuel exhaust, dichloroethane, cigarette smoke, bleach, ammonia, sulfur dioxide, and 2-propanol were in the nonagent class. Because one of the sensors in the array saturated (stopped oscillating) at high humidity, it was removed, resulting in a three-dimensional pattern vector. A description of the SAW system, the methods of collecting the data, the procedures for extracting pattern vectors from the raw SAW data, and a description of how the training and prediction sets were formed is given elsewhere.1,7,9

(8) Anderson, M. A.; Venezky, D. L. Naval Research Laboratory Memorandum Report 6170-96-7798, 1996.
(9) Shaffer, R. E.; Rose-Pehrsson, S. L.; McGill, R. A. Naval Research Laboratory Formal Report 6110-98-9879, 1998.


The SAW2 data set consists of 664 pattern vectors obtained from SAW data collected using a six-sensor array with a preconcentrator sampling system.7,9,10 SAW data for seven chemical warfare agents or simulant vapors were collected. These data were originally collected to study the ability of SAW sensor systems to discriminate between specific chemical warfare agents. In addition to some of the vapors from the SAW1 data set, SAW2 includes the nerve agents ethyl N,N-dimethylphosphoramidocyanidate (GA) and isopropyl methylphosphonofluoridate (GB) and a simulant for blister agent, chloroethyl ether (CEE). The methods for extracting pattern vectors from the raw SAW data and the division of the data into training and prediction sets are described elsewhere.7,9,10 Similar to the SAW1 array, one of the sensors in the SAW2 array did not provide any additional information and was removed, resulting in a five-dimensional pattern vector.

All the calculations were done using software written in MATLAB (version 4.2c, Mathworks, Inc., Natick, MA) on a Dell Pentium-120 personal computer running Windows 3.1 (Microsoft Corp., Redmond, WA). The LVQ code was taken from the Neural Network Toolbox (version 2.0A, Mathworks, Inc., Natick, MA).11 Routines from the PLS_Toolbox (version 1.5, Eigenvector Technologies Inc., Manson, WA) were also used.

THEORY

Probabilistic Neural Networks. PNNs are a class of neural networks that combine some of the best attributes of statistical pattern recognition methods and feed-forward neural networks.12,13 They have been described as the neural network implementation of kernel discriminant analysis and were first introduced into the neural network literature by Specht in the late 1980s.14 Originally developed for radar classification, the PNN has in recent years spread to other applications,15,16 including chemistry.8,10,17-19

The PNN operates by defining a probability density function (PDF) for each data class. During the training phase, the pattern vectors (chemical fingerprints) in the training set are simply copied to the hidden layer of the PNN. Unlike other types of artificial neural networks, the basic PNN has only a single adjustable parameter. This parameter, termed sigma (σ) or kernel width, along with the members of the training set, defines the PDF for each data class. Other types of PNNs that employ multiple kernel widths (e.g., one for each output data class or each input dimension) have become popular recently16 but were not considered in this work. In a PNN, each PDF is composed of Gaussian-shaped kernels of width σ located at each pattern vector; these PDFs essentially determine the boundaries for classification.

(10) Shaffer, R. E.; Rose-Pehrsson, S. L.; McGill, R. A. Proceedings of the 1996 ERDEC Scientific Conference on Chemical and Biological Defense Research; ERDEC SP-048, 1996; pp 939-945.
(11) Demuth, H.; Beale, M. Neural Network Toolbox User's Guide; Mathworks Inc.: Natick, MA, 1995.
(12) Masters, T. Practical Neural Network Recipes in C++; Academic Press Inc.: Boston, MA, 1993.
(13) Masters, T. Advanced Algorithms for Neural Networks; John Wiley: New York, 1995.
(14) Specht, D. F. Neural Networks 1990, 3, 109-118.
(15) Blue, J. L.; Candela, G. T.; Grother, P. J.; Chellappa, R.; Wilson, C. L. Pattern Recognit. 1994, 27, 485-501.
(16) Specht, D. F. In Computational Intelligence: A Dynamic System Perspective; Palaniswami, M., Attikiouzel, Y., Marks, R. J., Fogel, D., Fokuda, T., Eds.; IEEE Press: New York, 1995.
(17) Chtioui, Y.; Bertrand, D.; Barba, D. Chemom. Intell. Lab. Syst. 1996, 35, 175-186.
(18) Chtioui, Y.; Bertrand, D.; Devaux, M. F.; Barba, D. J. Chemom. 1997, 11, 111-129.
(19) Magelssen, G. R.; Ewing, J. W. J. Chromatogr., A 1997, 775, 231.

Figure 1. Contour plots illustrating the probability density function for each class from a simulated PNN.

The kernel width is critical because it determines the amount of interpolation that occurs between adjacent pattern vectors. As the kernel width approaches zero, the PNN essentially reduces to a nearest-neighbor classifier. This point is illustrated by the contour plots in Figure 1, which show four two-dimensional pattern vectors for two classes (A, B). The PDF for each class is shown as circles of decreasing intensity. The probability that a pattern vector will be classified as a member of a given output data class increases the closer it is to the center of the PDF for that class; in this example, any pattern vector that falls inside the innermost circle for each class would be classified with nearly 100% certainty. As σ is decreased (upper plot), the PDF for each class shrinks, and for very small kernel widths the PDF consists of groups of small circles scattered throughout the data space. A large kernel width (lower plot) has the advantage of producing a smooth PDF and good interpolation properties for predicting new pattern vectors, whereas small kernel widths reduce the amount of overlap between adjacent data classes. The optimized kernel width must strike a balance between a σ that is too large and one that is too small.

Prediction of new chemical fingerprints using a PNN is more complicated than the training step. Every member of the training set of pattern vectors (i.e., the patterns stored in the hidden layer of the PNN and their respective classifications) and the optimized kernel width are used during each prediction. As new pattern vectors are presented to the PNN for classification, they are serially propagated through the hidden layer by computing the dot product, d, between the new pattern and each pattern stored in the hidden layer. The dot product scores are then processed through a nonlinear transfer function, exp(−(1 − d)/σ²), which approximates a Gaussian kernel.
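To make the mechanics concrete, the following minimal NumPy sketch implements the full prediction pass, including the summation and output layers described in the next paragraph. It assumes unit-normalized pattern vectors (so the dot product acts as a similarity score); the function and variable names are ours, not those of the original MATLAB software.

```python
import numpy as np

def pnn_predict(x, hidden, labels, sigma, n_classes):
    """One PNN prediction: hidden-layer kernels -> class sums -> probabilities.

    x      : (m,) unit-normalized query pattern
    hidden : (n, m) unit-normalized training patterns stored in the hidden layer
    labels : (n,) integer class index of each hidden-layer pattern
    sigma  : optimized kernel width
    """
    d = hidden @ x                        # dot product with every stored pattern
    k = np.exp(-(1.0 - d) / sigma**2)     # transfer function approximating a Gaussian kernel
    sums = np.bincount(labels, weights=k, minlength=n_classes)   # summation layer
    return sums / sums.sum()              # output layer: class probabilities summing to 1
```

Every stored pattern participates in every prediction, which is the source of the speed and storage costs discussed next.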

Because each pattern in the hidden layer is used during each prediction, the execution speed of the PNN is considerably slower than that of other algorithms. The storage requirements can also be quite large, since every pattern in the hidden layer is needed for prediction. The summation layer consists of one neuron for each output class and simply collects the outputs from all hidden neurons of each respective class. The products of the summation layer are forwarded to the output layer, where the estimated probability of the new pattern being a member of each class is computed. In the PNN, the sum of the output probabilities equals 100%.

Learning Vector Quantization. LVQ combines some of the features of nearest-neighbor pattern recognition and competitive learning artificial neural networks (ANNs).20 A detailed description of the algorithm and its successful use in chemical sensor array pattern recognition can be found elsewhere.2,7 The basic premise of LVQ is that the underlying PDF for each class can be estimated by using a small number of reference vectors that span the same space as the original training set patterns. In this training scheme, the reference vectors are considered hidden neurons, are randomly assigned a classification, and are forced to compete against each other to learn the structure of the pattern space. After initialization, the patterns in the training set are repeatedly presented to the hidden layer in random order. At each iteration, the hidden neuron closest in Euclidean distance to the current pattern, termed the "winning" neuron, is moved toward the current pattern if the classification of the winning neuron matches the classification of the pattern; otherwise, it is moved away from the pattern. The distance that the winning neuron is moved is determined by the learning rate, which is slowly lowered over the course of training to decrease the likelihood of settling into local optima. Competitive learning attempts to reinforce correct classifications by making the winning neuron more similar to the presented pattern. Eventually, the reference vectors stabilize and training is complete (a sketch of this update rule follows below). The classification of new patterns is done by computing the Euclidean distance between the new pattern and each of the reference vectors; the new pattern is assigned the classification of the closest reference vector.

An important parameter for the correct implementation of the LVQ classifier is the initial number of neurons. If too many neurons are selected, training becomes unnecessarily lengthy and potentially produces so-called "dead neurons". Since the training phase only rewards or penalizes neurons that win a given competition (i.e., "winning" neurons), it is quite possible that some neurons will never move or will simply stop winning during the training phase. These neurons do not provide any new information and only serve to consume extra memory and computational resources. Conversely, if too few neurons are selected, then the existing neurons are burdened with learning too much of the pattern space, and poor classification performance may result.

(20) Kohonen, T. Self-Organization and Associative Memory; Springer-Verlag: Berlin, 1988.
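The competitive update just described can be sketched in a few lines. The following is a minimal LVQ1-style version, assuming a linearly decaying learning rate; the MATLAB implementation used in this work (including its time bias, discussed below) differs in detail, and all names here are ours.

```python
import numpy as np

def lvq_train(X, y, neurons, neuron_labels, epochs=100, lr0=0.01, seed=0):
    """Competitive learning: move the winning neuron toward patterns whose
    class it matches and away from patterns whose class it does not."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)      # learning rate lowered over training
        for i in rng.permutation(len(X)):      # present patterns in random order
            dists = np.linalg.norm(neurons - X[i], axis=1)
            w = np.argmin(dists)               # "winning" neuron: closest in Euclidean distance
            step = lr * (X[i] - neurons[w])
            neurons[w] += step if neuron_labels[w] == y[i] else -step
    return neurons
```

Prediction then assigns a new pattern the class of its nearest reference vector.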


Improved Probabilistic Neural Network. The IPNN algorithm implements three key developments: (1) a modified PNN algorithm that utilizes an LVQ-like competitive learning scheme to reduce the size of the hidden layer, (2) a faster method of optimizing σ, and (3) a general protocol for developing an automated outlier rejection scheme within the PNN paradigm. As a group, these improvements attempt to make the entire training process autonomous. However, the three modifications to the PNN algorithm can also be treated separately if desired.

Several research groups have tackled the problem of reducing the size of the hidden layer of the PNN. A smaller hidden layer greatly reduces the memory requirements and the number of distance calculations. One solution, suggested first by Burrascano, was to use the optimized neurons of the LVQ hidden layer as the hidden layer of the PNN.21 As pointed out earlier, the LVQ hidden layer forms a good approximation of the data space using a minimal number of reference vectors, making it a good candidate to replace the entire training set in the hidden layer of the PNN. This reduction method is based on the assumption that the LVQ neurons converge to centroids having the same distribution as the full training set. A combined LVQ-PNN approach simply replaces the nearest-neighbor prediction step of the conventional LVQ with a PNN prediction decision. In the comparison study,7 it was noted that the PNN approach had slightly better predictive performance than a nearest-neighbor classifier; thus, replacing the nearest-neighbor prediction step in the LVQ with a PNN classifier should result in improved classification performance. Burrascano found that, for manufactured data sets, a combined LVQ-PNN approach reduced the memory requirements and improved the speed of operation of the PNN, while keeping the training times short and retaining the ability to produce a confidence measure for each classification.21 An alternative approach to reducing the size of the hidden layer of a PNN was proposed by Chtioui et al.17 For the analysis of color image data, principal component analysis (PCA) was used first for dimensionality reduction, followed by reciprocal neighbors (RN) hierarchical clustering, which is similar in operation to LVQ.

The IPNN algorithm described here uses a hidden layer reduction method similar to the one used by Burrascano. Most of the differences arise from our desire to make the entire training process autonomous. Ideally, new patterns would simply be loaded into the software, IPNN training would initiate, and, without further user interaction, a trained IPNN would be created. However, this type of training process is hindered by the need to make several assumptions that may or may not be valid for all types of data. On the basis of our experience with chemical sensor array data, we were able to develop several "rules of thumb" that streamline the IPNN training process.

It has been established that the number of hidden layer neurons is an important parameter in LVQ training. Rather than make this a user-selected option, a procedure was developed to estimate the optimal value for this parameter. The LVQ training step begins with a large number of hidden layer neurons. During the competitive learning phase, a time bias is used to increase the chances that "losing" neurons will be selected as the winner; large numbers of dead neurons may result if precautions are not taken to force some poorly placed neurons to move. The time bias procedure implemented in the MATLAB LVQ function used in this work modifies the score used to select the winning neuron so as to encourage poorly placed neurons to be more competitive.11 Once LVQ training is complete, the network structure is interrogated by passing each pattern in the training and monitoring subsets through the hidden layer of the LVQ once and storing each winning neuron. A sketch of this pruning step follows.

(21) Burrascano, P. IEEE Trans. Neural Networks 1991, 2, 458-461.
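A minimal sketch of this interrogation and pruning step (the function name is ours; X would be the stacked training and monitoring subsets):

```python
import numpy as np

def prune_dead_neurons(X, neurons, neuron_labels):
    """Pass each pattern through the LVQ hidden layer once, record the winning
    neurons, and drop every neuron that never wins (a "dead neuron")."""
    winners = {int(np.argmin(np.linalg.norm(neurons - x, axis=1))) for x in X}
    keep = sorted(winners)
    return neurons[keep], neuron_labels[keep]
```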


The neurons in the hidden layer that never become a winning neuron are considered dead neurons and are removed from the hidden layer. Despite the time bias adjustment used during LVQ training, a large number of dead neurons that contribute no new information will typically be found. Pruning these neurons from the network does not noticeably hurt network performance, can be done without prompting the user, and has the benefit of further decreasing the size of the hidden layer passed to the PNN. If no dead neurons are removed, then all of the neurons are significant; if this occurs and the training results are not satisfactory, the initial number of neurons can be increased and training repeated.

Another important issue is when to stop training. One method widely used in the neural network community is termed "train and test"; this protocol was used previously in the comparison study to train LVQ as well as other types of artificial neural networks.7 First, the entire training set is subdivided randomly into separate training and monitoring subsets. Ideally, the monitoring subset covers the same pattern space as the entire training set. During the LVQ training step of the IPNN, the monitoring subset is withheld from the competitive learning phase of LVQ. Periodically, the classification of the patterns in the monitoring subset is predicted by the current LVQ hidden layer; when the prediction performance of the monitoring set starts to degrade, LVQ training is stopped. This allows the LVQ training phase to proceed without any user intervention and helps to guard against overtraining.

The IPNN utilizes an efficient method of optimizing σ. For a typical PNN application, the optimal kernel width is selected by minimizing training error using a combination of cross-validation and univariate optimization (CV-OPT).13 However, the time needed to optimize σ increases exponentially with the number of training set patterns. Research into the theory of probability density functions has led to various methods for approximating the optimal kernel width for kernel discriminant algorithms such as the Parzen classifier22 and for LVQ-PNN.23 Recently, Kraaijveld reported a general approach for Parzen classifiers using a simple distance-based calculation.24 The derivation of this equation is based on maximum likelihood principles and assumes independence of the samples in the training set. The equation for σ optimization studied in this work follows directly from the one developed by Kraaijveld,

σopt = cf √[(1/(nm)) ∑j=1…n ||xj* − xj||²]    (1)

where cf is a correction factor, m is the number of sensors in the array, n is the number of patterns in the hidden layer of the PNN or IPNN, and xj* represents the nearest neighbor of pattern xj. Kraaijveld suggested a correction factor (cf) in the range of 1.2-1.5.24 In this work, a correction factor of 1.44 was found to work well with sensor array data.

(22) Hand, D. J. Kernel Discriminant Analysis; Research Studies Press: Letchworth, UK, 1982.
(23) Voz, J. L.; Verleysen, M.; Thissen, P.; Legat, J. D. From Natural to Artificial Neural Computation; Springer-Verlag: Berlin, 1995; pp 404-411.
(24) Kraaijveld, M. A. Pattern Recognit. Lett. 1996, 17, 679-689.
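Equation 1 amounts to a nearest-neighbor distance computation over the hidden-layer patterns. A brute-force O(n²) sketch, adequate for the small hidden layers used here (names are ours):

```python
import numpy as np

def sigma_opt(hidden, cf=1.44):
    """Eq 1: kernel width from the mean squared nearest-neighbor distance,
    scaled by the correction factor cf (1.44 worked well for this data)."""
    n, m = hidden.shape
    d2 = np.sum((hidden[:, None, :] - hidden[None, :, :]) ** 2, axis=2)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)        # exclude each pattern's zero distance to itself
    nn2 = d2.min(axis=1)                # squared distance to each pattern's nearest neighbor
    return cf * np.sqrt(nn2.sum() / (n * m))
```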

Figure 2. Conceptual picture of a two-dimensional pattern space consisting of two data classes represented by X’s and O’s, respectively, and two outlier patterns, A and B.

The basic assumption underlying this approximation is that, because the PDF for each class is estimated as the sum of the individual Gaussian kernels, the density estimate at a specific location in the pattern space is determined primarily by the nearest kernel. All other kernels are assumed to be so far away that their contribution to the density estimate is minimal. This concept can be visualized using Figure 1. Thus, the approximation to the optimal kernel width should be based on the mean distance between nearest neighbors, adjusted for the number of sensors.

The IPNN algorithm also features a general method for automatically rejecting outlier patterns. For chemical sensor array systems employed in real-world environments, the ability to reject ambiguous sensor signals is critical. Ambiguous patterns are encountered when an analyte belonging to a class of compounds not included in the training set is presented or when rapid changes in the environment surrounding the sensor system cause random fluctuations in the sensor signal. The pattern recognition technique must be "smart" enough to know when it does not know how to classify something. Pattern recognition algorithms usually require that the pattern being classified be placed into one of the existing classes (i.e., winner take all), thereby increasing the chances for a false alarm to occur. Teaching the algorithm to classify a pattern as an unknown is sometimes a difficult task.7

In terms of statistical pattern recognition, an ambiguous pattern is treated as an outlier. The goal is to identify any new pattern that does not fall within the statistical distribution of the current training set. Pattern recognition algorithms that use a distance metric to make classification decisions can identify outliers by defining a distance threshold (sometimes called a measure of proximity); if the closest distance is greater than some preset threshold, the new pattern is classified as an outlier. Another approach to outlier rejection with the PNN is to use the posterior probability.7 If the probability of a pattern being a member of any class is less than 90% (or some probability that is application dependent), it can be rejected as an outlier. This approach is compatible with both two-class and multiclass problems. One disadvantage of using the probability to detect outliers can be illustrated by way of example. Figure 2 shows a simple two-dimensional data space with two data classes, represented by X's and O's for classes 1 and 2, respectively, and two outliers, A and B. Outlier B would easily be found because the PNN would assign it a 50% probability of being a member of either class. Outlier A, however, presents a more difficult situation.

Because outlier A is so far away from class 1 in the data space, it would be assigned a high probability of being a member of class 2, despite clearly being an outlier, because the Bayesian rule requires that the class probabilities sum to 1. Thus, in addition to the probability criterion for outlier detection, a second approach based on distance is needed. For a PNN, a distance-based outlier rejection approach was employed by Bartal and co-workers in the analysis of nuclear power plant transient diagnostics.25 The challenge to outlier rejection in a PNN is the choice of the cutoff criterion. In PNN outlier rejection, the cutoff threshold operates on the aggregate score of the summation neurons: if this value is less than the cutoff threshold, the pattern is considered an outlier. One of the goals of this research was to develop a general protocol for setting the optimal rejection level for the IPNN algorithm; a sketch combining the probability-based and distance-based checks follows below.

Monte Carlo simulation studies were used in this work to determine outlier rejection threshold levels. Monte Carlo methods comprise a branch of experimental mathematics concerned with experiments on random numbers.26 In this work, random pattern vectors were generated and presented to the PNN. Many of these patterns (e.g., outliers A and B in Figure 2) were not located near any of the "true" data and thus did not fall within the distribution of the existing patterns. By generating thousands of random patterns, a cutoff level that rejects a majority of these patterns can be determined. If the rejection threshold is set too high, the probability of rejecting a pattern that falls within the statistical bounds of the distribution increases; selecting a rejection level that is too small decreases the chances that a potential outlier will be flagged. Because we are interested primarily in developing protocols for distance-based outlier detection, probability-based outlier detection was not included in these studies, although, for actual implementation on sensor systems in the field, the combination of both methods is recommended.

The optimal rejection threshold will certainly be data set and application dependent. Each pattern space has a certain distribution, thereby potentially requiring a different rejection criterion. In the IPNN, the optimal rejection level is based upon the neurons in the hidden layer and the kernel width; any change made to either will require a new choice for the rejection threshold. The rejection threshold selection is also application dependent: by changing the target rejection goal (discussed below), the rejection can be made more or less stringent. For example, if the patterns generated by a sensor system are extremely stable over time, then a more stringent target rejection goal can be used.

(25) Bartal, Y.; Lin, J.; Uhrig, R. E. Nucl. Technol. 1995, 110, 436-449.
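A sketch of the combined scheme, reusing the prediction pass from the earlier PNN sketch; the names r and p_min are ours, and the 90% probability cutoff is the application-dependent example given above:

```python
import numpy as np

def classify_with_rejection(x, hidden, labels, sigma, n_classes, r, p_min=0.90):
    """Return a class index, or None if the pattern is rejected as an outlier."""
    d = hidden @ x
    k = np.exp(-(1.0 - d) / sigma**2)
    sums = np.bincount(labels, weights=k, minlength=n_classes)   # summation layer
    if sums.sum() < r:          # distance-based check: far from all training data (outlier A)
        return None
    probs = sums / sums.sum()
    if probs.max() < p_min:     # probability-based check: ambiguous between classes (outlier B)
        return None
    return int(np.argmax(probs))
```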

RESULTS AND DISCUSSION

Kernel Width Optimization. The best metric for comparing kernel width optimization methods is prediction classification performance. Initially, the performance of the fast kernel optimization procedure defined by eq 1 was tested using an unmodified PNN. Table 2 lists the kernel widths and the percentages of patterns correctly classified in the prediction subset using both the CV-OPT procedure and eq 1 for the four data sets employed in this study. These results suggest that eq 1 is a valid approximation to the optimal kernel width. For one data set (SAW2) the prediction results improved, while for the two simulated data sets the classification was not as successful.

Table 2. Optimized Kernel Widths and Prediction Classification Results for PNN

data set   CV-OPT kernel width   PNN prediction (%)   eq 1 kernel width   PNN prediction (%)
SIM1       0.0358                93.13                0.0274              92.29
SIM2       0.0678                82.50                0.0316              79.79
SAW1       0.0138                90.74                0.0317              90.74
SAW2       0.0062                94.09                0.0076              94.51

Table 3. Classification Performance of LVQ and IPNN for SIM1, SIM2, and SAW1

run no.   LVQ training (%)   winning neurons   IPNN training (%)   IPNN prediction (%)   kernel width

SIM1
1         95.31              69                94.79               93.33                 0.0394
2         94.79              82                94.79               94.17                 0.0365
3         94.79              87                94.27               94.17                 0.0339
4*        95.31              86                95.31               94.38                 0.0344
5         95.05              71                94.53               93.33                 0.0397

SIM2
1         88.54              89                87.50               83.75                 0.0267
2*        88.54              88                88.54               82.29                 0.0273
3         88.02              85                87.50               82.71                 0.0259
4         88.28              66                87.50               82.71                 0.0265
5         88.28              90                87.76               83.13                 0.0269

SAW1
1         97.44              29                96.15               93.52                 0.0605
2*        98.08              36                96.79               94.44                 0.0457
3         98.08              39                94.87               93.52                 0.0492
4         98.08              33                94.87               93.52                 0.0473
5         97.44              37                96.79               94.44                 0.0445

* Run judged optimal by the LVQ/IPNN training criterion described in the text.

For two of the four data sets, the kernel widths found by the two methods were similar and thus produced similar classification results. Nevertheless, the speed and ease of using eq 1 justify its further study: the CV-OPT method requires some user input (e.g., to set boundaries and to know when to stop), while the use of eq 1 is totally autonomous.

IPNN. The next set of experiments was designed to test the combination of LVQ for reducing the size of the PNN hidden layer and eq 1 for determining the kernel width. To determine the stability of the LVQ training step, each experiment was performed five times with a different seed for the random number generator. The initial number of hidden neurons was set to 100 for the simulated data sets (SIM1, SIM2) and 50 for the real data sets (SAW1, SAW2). The LVQ training procedure was performed for 10 000 epochs, stopping every 500 epochs to predict the patterns in the monitoring subset. The hidden neurons at the minimum training and monitoring set classification error were employed as the input to the PNN, and eq 1 was used to compute the optimal kernel width. Table 3 and the top half of Table 4 list the optimized kernel width, the size of the hidden layer, and the percentages of patterns correctly classified in the training subsets for the standard LVQ (same implementation as in ref 7) and IPNN algorithms for each data set and training run.

For each group of replicate training runs, the set of hidden neurons and kernel width that produced the best LVQ and IPNN training results was selected as optimal; prediction performance cannot be used to choose the best configuration without biasing the conclusions.

The percentage of patterns correctly classified by LVQ in training was considered the first criterion, and IPNN training performance was used only to break ties. Considering both the LVQ and IPNN training performance results in much better correlation with prediction performance (especially for the SAW2 data set). In Tables 3 and 4, the training run (i.e., hidden layer and kernel width) judged optimal by this criterion is marked with an asterisk.

These results demonstrate that the IPNN approach produces a very good classifier of chemical sensor array data. The best IPNN prediction performances (94.38, 82.29, 94.44, and 96.20% for SIM1, SIM2, SAW1, and SAW2, respectively) compare favorably with the results from the comparison study (94.38, 83.54, 95.37, and 94.94%, respectively).7 The IPNN classification performance is better than or equal to the best performance of the seven pattern recognition methods from the comparison study (which include the basic LVQ and PNN) for SIM1 and SAW2 and only slightly worse than the best algorithm on the other two data sets. When compared to the conventional PNN alone, in addition to better predictive performance, this new approach greatly reduces the memory and computation requirements: the number of patterns in the hidden layer was reduced on average to 15% of its original size (e.g., for SAW2 the PNN hidden layer was reduced from 427 to 32 hidden neurons). The IPNN approach is also an improvement over the plain LVQ algorithm because of its slightly better prediction performance and its simple method for generating statistically significant confidence measures.

As expected, the IPNN classification performances are repeatable: the prediction results vary ±1% across the five training runs for three of the four data sets. However, as evident in the top half of Table 4, the percentages of patterns correctly identified in the prediction subset for the different training runs on the SAW2 data set varied greatly. There are two potential causes for this phenomenon: (1) the LVQ training step may not always produce a good set of hidden neurons for the PNN classifier, or (2) the procedure for computing the optimal kernel width (eq 1) may not be valid for this data set. To investigate this issue further, two additional sets of experiments were performed.

Five additional training runs, identical to the first five except for the seed value of the random number generator, were done using the SAW2 data set. The LVQ training classification performance, IPNN training and prediction classification performance, number of winning neurons, and calculated kernel width are shown in the lower half of Table 4. To compare the first five training runs with the second group of five, the average number of winning neurons and the average IPNN prediction classification performance are also given. The second group of replicate runs also shows a large variance in IPNN prediction performance, although the mean values for the two groups are nearly identical. Similar to the observation made from the first five training runs, the combination of LVQ and IPNN training classification ability provides a reliable indicator of IPNN prediction classification performance.

The second experiment involved the use of the CV-OPT method for determining the optimal kernel width. This alternative kernel width optimization approach was used on the 10 optimized hidden layers from the LVQ training step for the SAW2 data set. The final two columns of Table 4 list the optimal kernel widths

Table 4. Classification Performance of LVQ, IPNN, and CV-OPT PNN for SAW2

run no.   LVQ training (%)   winning neurons   IPNN training (%)   IPNN prediction (%)   kernel width eq 1   CV-OPT PNN prediction (%)   kernel width CV-OPT
1         95.86              31                96.96               87.34                 0.0454              86.50                        0.0694
2         96.45              34                92.04               96.20                 0.0442              91.98                        0.0948
3*        97.34              32                92.97               96.20                 0.0465              90.30                        0.3275
4         96.15              32                96.72               86.50                 0.0525              94.94                        0.0207
5         96.45              30                96.96               94.51                 0.0469              91.98                        0.0235
mean                         31.8                                  92.15                 0.0471              91.14                        0.1071
6         97.34              31                91.80               91.56                 0.0514              91.56                        0.0648
7         97.04              30                91.33               91.56                 0.0540              91.14                        0.3778
8         96.15              30                95.32               86.50                 0.0487              91.56                        0.1800
9         97.34              32                95.08               100.0                 0.0463              85.65                        0.3277
10        97.34              32                92.51               91.98                 0.0534              91.56                        0.1282
mean                         31                                    92.32                 0.0508              90.29                        0.2157

* Run judged optimal by the LVQ/IPNN training criterion described in the text.

selected by this method and the percentages of patterns correctly classified by the PNN using those kernel widths. These results demonstrate that the poor prediction classification performance seen for run 1 using eq 1 was not due to a poorly chosen kernel width, since the prediction results did not improve using CV-OPT and the kernel widths selected by the two methods were similar (0.0694, 0.0454). For the LVQ hidden layers from runs 4 and 8, however, the PNN did see a major improvement in predictive performance (87 to 95% and 87 to 92%, respectively), indicating that the poor performance of the IPNN was due to a poorly chosen kernel width; in these cases, the kernel widths selected using eq 1 were much different from those selected by CV-OPT. Thus, it appears that in a few isolated cases using eq 1 to select the kernel width can hinder prediction performance. For runs 3 and 9, excellent PNN prediction performance was observed when the kernel width selected by eq 1 was used, whereas the kernel width selected by CV-OPT produced very poor prediction performance. Comparing the prediction performance across all 10 runs, the PNN trained using CV-OPT actually performed worse for 7 of the 10 LVQ hidden layers. The great variation in the kernel widths selected by CV-OPT, and the inability of CV-OPT to find the optimal kernel width (as illustrated by the poorer prediction classification performances), can be attributed either to local minima in the search method used by this algorithm or to leave-one-out cross-validation not being representative when only a small fraction of the data set remains. It is therefore concluded that the few isolated cases of poor predictive performance of the IPNN approach are more likely due to a poor match between the LVQ-selected hidden layer and the PNN classifier than to a poorly chosen kernel width. This might occur less frequently if some aspects of the PNN were included in the LVQ hidden layer competition; however, this would further increase the training times and, since the inconsistency appears for only one of the four data sets, is most likely unnecessary.

To better understand how the LVQ training step works, the optimized hidden layer for run 9 in SAW2 was studied, since it produced the best classification results. Figure 3A shows a principal component scores plot of the 427 patterns in the SAW2 training set projected onto their first two principal components (capturing 92.19% of the variance). To highlight the class distributions in this figure, lines have been drawn around the patterns in each class; these lines are for illustrative use only and are not meant to be interpreted with any statistical significance.

A principal component scores plot of the prediction set can be found elsewhere.7 Figure 3B is a plot of the hidden layer reference vectors from run 9 projected onto the first two principal components of the SAW2 training subset. Comparing parts A and B of Figure 3, it is easy to observe that the 32 reference vectors in the LVQ hidden layer form a good approximation to the full training set (427 patterns). In cases where tight clustering is observed in the full training set (e.g., DMMP), only a few LVQ hidden neurons are required to approximate the cluster, while for less structured clusters (e.g., VX) or highly overlapping clusters (e.g., GB and GD), more LVQ neurons are required. Thus, an optimized LVQ hidden layer is a good hidden layer for the PNN, and this good coverage of the pattern space provides additional justification for the excellent classification performance of the IPNN. The LVQ pattern spaces for the cases in which the IPNN does not classify very well do not look much different from the one seen in Figure 3B. It appears that the high degree of overlap between the GB and GD clusters is the primary cause of the variation in prediction classification performance: a small change in the LVQ hidden layer can result in poor IPNN classification performance.

These studies illustrate that the IPNN produces results better than or equivalent to those of either the LVQ or the PNN (CV-OPT or eq 1) alone. The classification accuracy of the IPNN was also compared with results obtained in previous work using five other pattern recognition algorithms. These results demonstrate that the IPNN is an excellent pattern recognition algorithm for chemical sensor array systems. In terms of the criteria for an ideal pattern recognition algorithm for chemical sensor array systems discussed in ref 7, the IPNN approach scores well on all points. The IPNN algorithm features several improvements over the basic PNN and LVQ methods and is well-suited for implementation on a microcontroller for operation on sensor systems in the field. IPNN training can also be implemented in an autonomous fashion, thereby greatly increasing its usability.

However, there is still more work to be done on this algorithm. The coupling of LVQ, PNN, and eq 1 is sometimes not a good match, and in a few isolated instances eq 1 does not appear to be a very good method for selecting the optimal kernel width; further understanding of these phenomena is needed. Also, the scalability of eq 1 is not known. In the four data sets studied here, the dimensionality of the pattern vectors was very small, ranging from 3 to 6.


Figure 3. (A) Principal component scores plot of the 427 patterns in the training subset of SAW2. The lines surrounding each class of patterns are drawn for illustrative purposes only and do not represent any statistical significance. (B) Plot of the 32 reference vectors in the optimized LVQ hidden layer for run 9 in SAW2 projected onto the first two principal components of the training subset. Patterns from the classes GA, GB, GD, VX, HD, CEE, and DMMP are represented as open squares, open circles, solid triangles, *, solid diamonds, +, and ×, respectively.
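The projection used in Figure 3 amounts to fitting principal components on the training subset and applying the same transformation to the LVQ reference vectors. A sketch using the SVD (names are ours):

```python
import numpy as np

def pca_project(train, vectors, k=2):
    """Project vectors onto the first k principal components of the training patterns."""
    mu = train.mean(axis=0)
    _, s, Vt = np.linalg.svd(train - mu, full_matrices=False)
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()   # fraction of variance captured
    return (vectors - mu) @ Vt[:k].T, explained
```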

Although eq 1 includes a correction term for the dimensionality (m), it is not known whether it will still be a good approximation for higher dimensional data (e.g., spectroscopy).

Automated Outlier Rejection. To study methods for determining the best rejection threshold, the SAW2 data set was employed. Pattern vectors for toluene and octane were available for this data set. These organic vapors are not similar to any of the organosulfur and organophosphorus vapors included in the SAW2 data set and thus should be flagged as outliers.

We empirically determined that a rejection threshold (r) of σ⁴ was a good initial estimate. The optimized IPNN (run 9 in Table 4) was presented with 10 000 random normalized pattern vectors, and the strengths of the summation neurons were tested. If the aggregate summation neuron strength was less than r, the pattern was labeled an outlier. The target rejection goal for this implementation was to flag between 85 and 90% of the random patterns as outliers. If less than 85% of the patterns were rejected, then r was not strict enough and was increased by 20%, and the simulation was repeated; if greater than 90% of the random pattern vectors were determined to be outliers, then r was too strict and was decreased by 20%, and the simulation was repeated.
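A sketch of this Monte Carlo search (function and parameter names are ours, and drawing uniform random vectors normalized to unit length is a simplifying assumption about what "random normalized pattern vectors" means here):

```python
import numpy as np

def find_rejection_threshold(hidden, sigma, r0=None, n_random=10_000,
                             lo=0.85, hi=0.90, seed=0, max_iter=100):
    """Adjust the rejection threshold r until 85-90% of random normalized
    patterns are flagged as outliers by the aggregate summation score."""
    rng = np.random.default_rng(seed)
    m = hidden.shape[1]
    r = sigma**4 if r0 is None else r0       # empirical initial estimate from the text
    for _ in range(max_iter):                # cap iterations in case the 20% steps oscillate
        X = rng.random((n_random, m))
        X /= np.linalg.norm(X, axis=1, keepdims=True)    # random normalized pattern vectors
        agg = np.exp(-(1.0 - X @ hidden.T) / sigma**2).sum(axis=1)
        frac = np.mean(agg < r)              # fraction of random patterns rejected
        if frac < lo:
            r *= 1.2                         # not strict enough: increase r by 20%
        elif frac > hi:
            r *= 0.8                         # too strict: decrease r by 20%
        else:
            break
    return r
```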

The choice of 85-90% for the target rejection goal is somewhat arbitrary and definitely application dependent. Using the Monte Carlo simulation procedure, the rejection threshold for the SAW2 data set, with the optimized LVQ hidden layer for the PNN and a kernel width of 0.0463, was found to be 0.00000458 (4.58 × 10⁻⁶). For this threshold, the IPNN approach rejected 89.53% of the random patterns in the simulation. The variation in the percentage of random patterns rejected was determined to be approximately ±0.5% based on replicate experiments; of course, with a large number of random patterns (n = 10 000), the variation would be expected to be quite small.26

To test this threshold on real data, two experiments were performed. First, the entire SAW2 data set was presented to the same IPNN configuration to determine whether any of the original patterns would be flagged as outliers. Two patterns in the original data set (2 out of 664 = 0.3%) were found to be outliers (aggregate summation neuron scores on the order of 10⁻⁹). These same two patterns (VX data class) can be seen as outliers in a PC scores plot (ref 7, Figure 6; the outliers are located at approximately 0.2, −0.2). Because these patterns were present in the prediction subset, the LVQ training procedure could not properly account for them; labeling these two patterns as outliers is therefore acceptable, although not ideal. The second experiment involved the pattern vectors for toluene and octane. After these patterns were presented to the same IPNN configuration as above, both were correctly flagged as outliers, with aggregate summation neuron scores less than 1 × 10⁻²⁵.

(26) Guell, O. A.; Holcombe, J. A. Anal. Chem. 1990, 62, 529A-542A.
(27) Gottuk, D. T.; Hill, S. A.; Schemel, C. F.; Strehlen, B. D.; Rose-Pehrsson, S. L.; Shaffer, R. E.; Tatem, P. A.; Williams, F. W. Naval Research Laboratory Memorandum Report NRL/MR/6180-99-8386, June 18, 1999.
(28) Hart, S. J.; Shaffer, R. E.; Rose-Pehrsson, S. L.; McDonald, J. R., to be submitted to IEEE Trans. Geosci. Remote Sensing.

The results from this part of the study indicate that the Monte Carlo-based procedure for determining the optimal rejection threshold is valid. A general procedure for selecting the rejection threshold is critical for applying a PNN-based pattern recognition algorithm to chemical sensor array applications. By combining the distance-based and probability-based outlier rejection schemes, the chances of a false alarm occurring will decrease.

CONCLUSIONS

This work has shown that the IPNN algorithm has several advantages over the standard PNN, including reduced computation and memory requirements, faster training, and a lower false alarm rate. Although the work described here has focused on chemical sensor array pattern recognition, there are certainly other applications where this method will prove valuable. Many of the attributes seen in chemical sensor array pattern recognition, such as highly overlapping multimodal classes and low dimensionality, can be found in other applications. For example, we have had tremendous success using both the IPNN and the standard PNN for fire detection27 and unexploded ordnance (UXO) identification.28 Research in our laboratory is underway to further characterize the performance of the PNN and IPNN for other types of pattern recognition problems. We are also studying PNN algorithms utilizing multiple kernel widths.

ACKNOWLEDGMENT

Portions of this research were performed while R.E.S. was a National Research Council-Naval Research Laboratory Postdoctoral Research Associate. Dr. R. Andrew McGill is thanked for providing the SAW2 data set. Dr. Mark Anderson is thanked for helpful discussions of PNN and for providing the two simulated data sets. Dr. David Venezky is thanked for his interest in this work.

Received for review March 2, 1999. Accepted July 29, 1999.

AC990238+
