Artificial Neural Networks for the Automated Detection of

Artificial neural networks are applied to the automated classification of ... laboratory has developed an analysis strategy based on the use of short ...
0 downloads 0 Views 93KB Size
Anal. Chem. 2000, 72, 1680-1689

Artificial Neural Networks for the Automated Detection of Trichloroethylene by Passive Fourier Transform Infrared Spectrometry Cheryl L. Hammer† and Gary W. Small*

Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979 Roger J. Combs, Robert B. Knapp, and Robert T. Kroutil

U.S. Army, SBCCOM, Edgewood Chemical Biological Center, Aberdeen Proving Ground, Maryland 21010-5424

Artificial neural networks are applied to the automated classification of trichloroethylene (TCE) signatures from passive Fourier transform infrared remote sensing interferogram data. Through the use of three data collection methods, a combination of laboratory and field data is acquired that allows the methodology to be evaluated under a variety of infrared background conditions and in the presence of potentially interfering compounds such as sulfur hexafluoride, methyl ethyl ketone, acetone, carbon tetrachloride, and ammonia. To maximize the computational efficiency of the network optimization, experimental design techniques are employed to develop a training protocol for the network that takes into account the relationships among five variables that are related either to the network architecture or to the training process. This protocol is implemented for the case of a back-propagation neural network (BNN) and is used to develop an optimized network for the detection of TCE. The classification performance of the network is assessed by comparing both TCE detection capabilities and false detection rates to similar classification results obtained with the technique of piecewise linear discriminant analysis (PLDA). When applied to prediction data withheld from the optimization of both the BNN and PLDA algorithms, the BNN method is observed to outperform PLDA overall, with TCE detection rates in excess of 99% and false detection rates less than 0.5%. Passive Fourier transform infrared (FT-IR) remote sensing measurements are based on the use of an FT-IR spectrometer to collect naturally occurring infrared radiation in the outdoor environment.1,2 The resulting data are analyzed for the presence of absorption and emission signatures of compounds of interest. † Present address: Clorox Services Co., 7200 Johnson Dr., Pleasanton, CA 94588. (1) Beer, R. Remote Sensing by Fourier Transform Spectrometry; Wiley: New York, 1992. (2) Kroutil, R. T.; Ditillo, J. T.; Small, G. W. In Computer-Enhanced Analytical Spectroscopy; Meuzelaar, H. L. C., Ed.; Plenum Press: New York, 1990; Vol. ii, pp 211-243.

1680 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

These measurements can be used in a variety of environmental monitoring scenarios and can be implemented from both groundbased and airborne platforms. Two characteristics of the passive remote sensing experiment have a significant impact on the data collection and data handling strategies that can be employed. First, since the sample is not constrained within the field of view (FOV) of the spectrometer, signal averaging is of limited utility unless the spectrometer scan speed is very fast. Adding interferograms in which the sample has moved out of the FOV of the spectrometer only serves to reduce the analyte signal in the collected data. This problem is accentuated in measurements performed from moving aircraft since in this case both the spectrometer and the sample are moving. Second, in the passive measurement, no control over the infrared background radiance is possible. The background can change constantly, thus eliminating the possibility of making a stable background measurement for use in extracting analyte signals from the collected data. These considerations dictate that an ideal data analysis algorithm for use with passive FT-IR remote sensing data must be insensitive to changes in the background radiance and must facilitate rapid data collection. To address both issues, our laboratory has developed an analysis strategy based on the use of short segments of the interferogram data collected by the FTIR spectrometer.3-9 By optimal selection of the interferogram segment location, a significant amount of information representing the infrared background can be removed. Application of bandpass digital filters to the selected segment then allows extraction of analyte information on the basis of characteristic spectral (3) Small, G. W.; Kroutil, R. T.; Ditillo, J. T.; Loerop, W. R. Anal. Chem. 1988, 60, 264-269. (4) Small, G. W.; Kaltenbach, T. F.; Kroutil, R. T. Trends Anal. Chem. 1991, 10, 149-155. (5) Small, G. W.; Carpenter, S. E.; Kaltenbach, T. F.; Kroutil, R. T. Anal. Chim. Acta 1991, 246, 85-102. (6) Kroutil, R. T.; Combs, R. J.; Knapp, R. B.; Small, G. W. Appl. Spectrosc. 1994, 48, 724-733. (7) Bangalore, A. S.; Small, G. W.; Combs, R. J.; Knapp, R. B.; Kroutil, R. T. Anal. Chim. Acta 1994, 297, 387-403. (8) Kaltenbach, T. F.; Small, G. W. Anal. Chem. 1991, 63, 936-944. (9) Bangalore, A. S.; Small, G. W.; Combs, R. J.; Knapp, R. B.; Kroutil, R. T.; Traynor, C. A.; Ko, J. D. Anal. Chem. 1997, 69, 118-129. 10.1021/ac991075s CCC: $19.00

© 2000 American Chemical Society Published on Web 03/04/2000

frequencies. For the automated identification of target analytes, a pattern recognition algorithm can then be applied to the filtered interferogram data. This approach has been shown to perform well when compared to corresponding analyses of spectral data.9 By performance of the analysis with a short interferogram segment, the data collection requirements for the measurement are reduced and the spectrometer scan speed can be increased significantly. Furthermore, since the required optical retardation of the interferometer is decreased, it may be possible to adopt alternative interferometer designs that are more rugged. The pattern recognition algorithm used in previous work was piecewise linear discriminant analysis (PLDA).8 This technique is based on the construction of a collection of linear discriminant functions that operate in the multidimensional data space formed by the filtered interferogram data. Taken together, these discriminants form piecewise linear approximations to nonlinear boundaries in the space between data categories. A boundary constructed to isolate the region of the data space occupied by the filtered interferogram segments corresponding to a target analyte can thus be used to classify unknown interferograms as either analyte-active or analyte-inactive. The PLDA method is limited in its ability to model nonlinearities in the boundaries between data categories. In the analysis of filtered interferogram data from passive FT-IR remote sensing measurements, these nonlinear boundaries are caused by the likely occurrence of both absorption and emission signals from the analyte.5 To address this limitation, the work presented here explores the use of an artificial neural network (ANN) to implement an automated classifier for the detection of analyte signatures from filtered interferogram data. The ANN is a true nonlinear modeling method and, relative to PLDA, has a greater ability to model complex relationships in the data space. The effective use of the ANN method requires the design of a protocol for use in optimizing and testing the computed networks, however. This paper describes the use of experimental design techniques in the development of such a protocol in the context of the automated detection of trichloroethylene (TCE) from passive FTIR remote sensing measurements. Once developed, this protocol is used to optimize an ANN for the detection of TCE, and comparisons are made to corresponding results obtained with PLDA. EXPERIMENTAL SECTION Instrumentation. The majority of the data used in this work have been employed in a previous study.9 The data collection will be summarized here. Interferograms were acquired with a Brunswick emission spectrometer (model M21, Brunswick Technical Group, DeLand, FL). The FOV of the spectrometer was 1.5° and could be reduced to 0.5° with an antireflection-coated germanium refractive optic telescope designed for open-air monitoring applications. The spectrometer was interfaced to a Dell 486P/50 computer (Dell Computer, Austin, TX), which operated under MS-DOS (Microsoft, Redmond, WA). Data acquisition was accomplished by use of the MIDAS package.10 Interferogram points were collected every eighth zero crossing of the helium-neon (HeNe) (10) Combs, R. J.; Kroutil, R. T.; Knapp, R. B. Unclassified Technical Report ERDEC-TR-298; Edgewood Research, Development, and Engineering Center: Aberdeen Proving Ground, MD, April 1996.

reference laser, causing the maximum spectral frequency to be 1975 cm-1. Interferograms consisting of 1024 points were sampled with a corresponding spectral point spacing of approximately 4 cm-1. Each interferogram collected was a single scan; i.e., no signal averaging was performed. Methods. Interferogram data were collected by three different procedures in which the infrared radiation source was at nearambient temperature. These three procedures were termed passive cell laboratory, passive cell terrestrial, and open-air terrestrial experiments. The terms “cell” and “open-air” denote whether the TCE was contained in a cell or released into the open air. Similarly, the terms “laboratory” and “terrestrial” indicate whether the infrared background was a laboratory blackbody or a naturally occurring infrared background. The passive cell laboratory and passive cell terrestrial experiments utilized a cell with a 62 cm2 aperture and an 8.2 cm path length. The windows of the cell were composed of low-density polyethylene (0.0005 in. thickness), and it was aligned along the principal optical axis determining the spectrometer FOV. The source for the passive cell laboratory measurements, which provided an adjustable temperature from 5 to 50 °C, was a 4 × 4 in. extended blackbody (Model SR-80, CI Systems, Agoura, CA). The source temperature was accurate to (0.03 °C and precise to (0.01 °C. During the passive cell terrestrial collections, the blackbody was removed and the cell was held by a bracket attached to the spectrometer housing. To simulate the conditions found in open-air monitoring applications in the passive cell laboratory measurements, the concentrations of TCE were controlled by an equilibrium vapor cell technique which depended on a diluent of carbon tetrachloride (CCl4).11 Mixtures of reagent grade TCE (Aldrich, Milwaukee, WI) and reagent grade CCl4 (MCB Manufacturing, Cincinnati, OH) were evaporated to generate the various TCE concentrations. The volume fraction of TCE in CCl4 ranged from 1/2 to 1/64 corresponding to vapor pressures of 34.1 and 0.95 Torr at 25 °C, which were calculated by use of the Wilson equation.12 While the passive cell terrestrial experiments were performed, pure TCE vapor was introduced into the cell and various terrain backgrounds were viewed through the cell. Since these measurements were made outside the laboratory, it was not convenient to control the TCE concentrations. Concentrations were also not controlled for the open-air terrestrial experiments, where TCE vapor was released with an evaporative emission source. In this case, the spectrometer FOV was restricted to help ensure that the vapor cloud filled the FOV. The spectrometer was positioned 5-25 m from the vapor emission source. The backgrounds consisted of terrain, water, low-angle sky, or some combination of both sky and terrain for both openair and passive cell terrestrial experiments. Terrain backgrounds resembled laboratory blackbody backgrounds, whereas sky backgrounds were more complex because of the presence of trace gases encountered along the long atmospheric path viewed. Assembly of Data. The data employed in this work consisted of sets of interferograms labeled training, monitoring, and prediction sets. The training set was used to compute the neural (11) Field, P. E.; Combs, R. J.; Knapp, R. B. Appl. Spectrosc. 1996, 50, 13071313. (12) Gmehling, J.; Onken, U.; Arlt, W. Vapor-Liquid Equilibrium Data Collection; DECHEMA: Frankfurt, Germany, 1984; Vol. 1, Part 8, p 81.

Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

1681

Table 1. Data Set Composition interferogram type TCE-actives passive cell laboratory open-air/passive cell terrestrial TCE-inactives passive cell laboratory open-air/passive cell terrestrial total

training set

monitoring set

prediction set 1

prediction set 2

1 000 2 500

6 000 14 000

6 123 6 944

10 605 0

1 000 5 500

13 000 27 000

24 549 22 902

79 347 0

10 000

60 000

60 518

89 952

networks, while the monitoring set was used to test the computed networks for the purpose of determining the values of several variables related to network optimization. The prediction sets were not used during the training or optimization of the networks and thus served as data sets for use in validating the network classification performance. The training and monitoring sets were assembled by Bangalore et al.9 Their study utilized PLDA to implement the pattern recognition component of the TCE detection, thus providing a useful set of results for comparison to the ANN work described here. In the description of the data set assembly, interferograms collected with TCE in the FOV of the spectrometer will be termed TCE-actives. For the open-air terrestrial interferograms, there was no guarantee that TCE vapor was present in the FOV of the spectrometer for any given interferogram because of the dispersal effects of wind and the fact that no signal averaging was performed. It was therefore necessary to determine the actual presence or absence of TCE spectral bands visually in order to create a data set suitable for use with the supervised pattern recognition algorithms employed in this work. Each of the open-air terrestrial interferograms was examined for the presence of TCE by visual inspections of both spectral bands at 845 and 938 cm-1.9 First, the interferograms were Fourierprocessed to single-beam spectra; then, a Fourier-processed background, known to contain no TCE, was subtracted. If both TCE bands were observed at an approximate signal-to-noise ratio of 3.0 or greater (i.e., at or greater than the conventional limit of detection), the interferogram was deemed TCE-active and placed in a pool of TCE-active interferograms. Those spectra that were collected when TCE was known to be present but which exhibited no visible spectral bands were removed from the data set to ensure that accurate classifications were available for use in training and testing the pattern recognition algorithms. Finally, background interferograms collected when no TCE was present were inspected for data integrity and then placed in a separate pool. The training and monitoring sets were built from the pools of TCE-active and background interferograms by use of a subset selection procedure developed by Carpenter and Small.13 Details regarding the composition of the training and monitoring sets are presented in Table 1. The training and monitoring sets consisted of 10 000 and 60 000 interferograms, respectively. Some of the interferograms were collected with blackbody, water, low-angle sky (elevations 20 and 45° to the horizontal), and terrain backgrounds, which included manmade objects such as buildings and vehicles in the spectrometer FOV. Also, acetone, methyl ethyl ketone (MEK), and sulfur hexafluoride (SF6) were present with (13) Carpenter, S. E.; Small, G. W. Anal. Chim. Acta 1991, 249, 305-321.

1682 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

some of the field background interferograms to serve as potential interfering compounds. The 938 cm-1 band of TCE overlaps with the 945 cm-1 absorption band of SF6. The primary absorption bands of acetone (1217 cm-1) and MEK (1175 cm-1) do not overlap with either of the absorption bands of TCE; instead they provide a means for evaluating the spectral selectivity of the interferogram-based detection algorithm. All of the passive cell laboratory TCE-active data had CCl4 present during the collection, and some of the collected backgrounds contained pure CCl4. The CCl4 band centered at 790 cm-1 does not overlap with either of the TCE spectral bands, although it allows for the further testing of the frequency selectivity of the filtering procedure. Two prediction sets were generated for use in testing the optimized pattern recognition algorithms. The interferograms in prediction set 1 were collected during the same time period as the data comprising the training and monitoring sets. By contrast, the interferograms in prediction set 2 were collected nearly 9 months later, after repairs had been made to the spectrometer. For prediction set 1, interferograms were collected by the same three methods: passive cell laboratory, passive cell terrestrial, and open-air terrestrial. Samples with TCE and without TCE were collected. As before, interferograms were collected with blackbody, low-angle sky, and terrain backgrounds. Prediction set 2 consisted solely of passive cell laboratory interferograms. For this data set, acetone and ammonia (NH3) were included with the backgrounds for use as potential interferences. As noted previously, both the ANN and PLDA pattern recognition algorithms were developed with training and monitoring sets that included acetone data. Neither algorithm was exposed to NH3 data during development, however. Both prediction sets are described in Table 1. Optimization of Digital Filters and Interferogram Segments. Bangalore et al. optimized the band-pass filter position, band-pass filter width, interferogram segment length, and segment location in their work with PLDA.9 These values were also used in the work presented here. Table 2 lists the best-performing filter/ segment combinations for each spectral band and interferogram segment length. These eight filter/segment combinations were selected for use in developing the ANN methodology described in this paper. While there is no guarantee that the segments chosen as optimal for use with PLDA are necessarily the best for use with an ANN, these segments should be representative and should allow a fair comparison between the PLDA and ANN pattern recognition methods. Data Analysis. Analysis of the data sets was performed with a Silicon Graphics Indigo2 IMPACT computer employing a single processor (Silicon Graphics, Mountain View, CA). This computer

Table 2. Optimal Filters and Interferogram Segments filter position (cm-1) 845 cm-1 band 843.1 850.8 848.9 850.8 938 cm-1 band 931.8 947.3 939.5 939.5

fwhha (cm-1)

interferogram segment locationb

interferogram segment size

82.6 86.9 78.3 69.7

131/180 131/200 131/220 71/180

50 70 90 110

127.3 142.7 123.4 123.4

71/120 151/220 131/220 111/220

50 70 90 110

a Width at half-height of the filter band-pass. b Starting/ending points relative to the centerburst.

used the Irix operating system (version 6.2, Silicon Graphics). The ANN software was original code written in C and compiled with version 7.0 of the Silicon Graphics C compiler (optimization level 3). Analysis of variance computations, normal score calculations, and the construction of the main effects and interaction plots were accomplished with the Minitab statistical software package (version 11.13, Minitab, Inc., State College, PA) executed on a Dell Dimension XPS Pro 200n computer (Dell Computer) operating under Microsoft Windows NT Workstation (version 4.0, Microsoft). RESULTS AND DISCUSSION Overview of Data Analysis Methodology. The data analysis methodology employed in this work was based on the application of band-pass digital filters to short segments of the interferogram data collected by the FT-IR spectrometer. Windowing the interferogram to select an optimal segment allows the separation of information on the basis of spectral bandwidth. Since the characteristic features of the infrared background radiance are typically very broad (e.g., the temperature-dependent Planck function that characterizes the background) or very narrow (e.g., the characteristic spectral bands of water vapor), selection of an interferogram segment slightly displaced from the centerburst allows the rejection of much of the background information. This windowing step relies on the fact that the interferogram representation of a broad spectral feature is very short, while the corresponding representation of a narrow feature is very long. Since by Parseval’s relation the interferogram and spectral representations are proportional in area, a short interferogram segment located after the point at which the representations of the broad background features have largely damped to zero contains little contribution from either the broad or the narrow spectral features. By the design and application of a band-pass digital filter to isolate the modulated interferogram frequencies corresponding to a characteristic spectral band of the target analyte, the information corresponding to interfering chemical species can be further rejected. This allows the removal of contributions of spectral bands that have widths similar to that of the analyte band but which are located at different frequencies. The digital filtering method used in this work was an implementation of time-varying

finite impulse response filters.14 The filters used here were designed to isolate the interferogram information corresponding to either the 845 or 938 cm-1 spectral band of TCE. It is also possible to use multiple filters to incorporate information about several spectral bands into the interferogram-based analysis.15 The band-pass positions and widths of the filters used here are presented in Table 2. The filtered interferogram intensities in the chosen segment were used as the inputs to an ANN classifier. ANNs are a general category of nonlinear modeling methods that can be applied to classification problems.16-21 The key advantage of ANNs in this application derives from their ability to construct nonlinear decision boundaries between the data classes. The ANN investigated in this research was the back-propagation neural network (BNN).16 Yang and Griffiths have used a BNN previously with spectral data collected during FT-IR remote sensing measurements.21 The network implemented here had two processing layers, consisting of a hidden layer and an output layer. In this design, each of the p inputs (i.e., the intensities of the p points in the filtered interferogram segment) is connected to the units in the hidden layer by interconnects or synapses, each of which is characterized by a weight or strength. The hidden units are also connected to the output nodes by interconnects. The network was thus fully interconnected; i.e., every node in one layer was connected to every node in the adjacent layer. Each node has three basic elements: a set of synapses, an adder for summing the input signals weighted by their respective synapses, and an activation function for limiting the amplitude of the output of the node. A bias, θ, is added which has the effect of lowering the input to the activation function. A node can be described in mathematical terms as p

∑w x

(1)

y ) φ(v)

(2)

v)

j j

j)0

and

where x1, x2, ..., xp are the inputs, w1, w2, ..., wp are the weights, v is the linear combination of the inputs and weights, φ is the activation function, and y is the output from the node. Here the zeroth term is the bias, whose input is x0 ) -1 and whose weight is w0 ) θ. The activation function defines the output of the node (14) Small, G. W.; Harms, A. C.; Kroutil, R. T.; Ditillo, J. T.; Loerop, W. R. Anal. Chem. 1990, 62, 1768-1777. (15) Idwasi, P. O.; Small, G. W. In Fourier Transform Spectroscopy: 11th International Conference; de Haseth, J. A., Ed.; American Institute of Physics: Woodbury, NY, 1998; pp 227-230. (16) Haykin, S. Neural Networks: A Comprehensive Foundation; Macmillan College Publishing: Englewood Cliffs, NJ, 1994. (17) Anderson, J. A. An Introduction to Neural Networks; MIT Press: Cambridge, MA, 1995. (18) Masters, T. Practical Neural Network Recipes in C++; Academic Press: Boston, MA, 1993. (19) Mehrotra, K.; Mohan, C. K.; Ranka, S. Elements of Artificial Neural Networks; MIT Press: Cambridge, MA, 1997. (20) Zupan, J.; Gasteiger, J. Neural Networks for Chemists: An Introduction; VCH: Weinheim, Germany, 1993. (21) Yang, H.; Griffiths, P. R. Anal. Chem. 1999, 71, 751-761.

Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

1683

as a function of the activity level of its input. Here, the activation function was the sigmoid function

φ(v) )

1 1 + e-v

(3)

This function assumes a continuous range of values from 0 to 1 and is differentiable, which facilitates the algorithm used for optimizing the weights. The back-propagation training algorithm begins by initializing the synaptic weights to small uniformly distributed random numbers. Then the network is presented with the training patterns one-by-one. The forward computation begins with the presentation of the first training pattern. Equations 1-3 are used for the forward computation. An error signal is then computed:

ek(n) ) dk(n) - ok(n)

(4)

where ok(n) and dk(n) are the output and desired output, respectively, from output node k at epoch n. For classification problems, the desired network outputs for node k are 1 or 0, corresponding to membership or no membership, respectively, in class k. One epoch finishes after each pattern has been presented to the network once. The ultimate purpose of training is to minimize a cost function based on the error signal, ek(n). A common cost function is the mean squared error. This involves correcting the weights in such a way as to minimize the error. The classical implementation of the BNN was used here in which the generalized delta rule was used to move along the error surface in a stepwise manner toward the global minimum. Weight correction for the synapse connecting node j to node k is calculated by

wkj(n + 1) ) wkj(n) + R[wkj(n) - wkj(n - 1)] + ηδk(n) yj(n) (5) where η is the learning rate, δk(n) is the local gradient of the error surface at node k, R is the momentum, and yj(n) is the output from node j. The learning rate controls how quickly the weights will be corrected, while the momentum term is included to increase the rate of learning without adding instability, since the increase is based on the previous weight correction. This procedure is repeated until the mean squared error has reached an acceptable level. Description of Network Variables Studied. The computation of an optimal BNN requires the optimization of five variables. Two of these variables, the number of inputs and the number of hidden units, define the basic functional form or architecture of the network. The other three variables, learning rate, momentum, and the number of epochs, are control variables that allow the network to be trained in an optimal manner. The learning rate affects the rate of change of the weights. If it is too great, the global minimum on the error surface may be missed, though when its value is too small, optimization can take a significant length of time. The momentum can be used to accelerate the weight-updating process without causing instability, due to its dependence on previous updates. 1684

Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

The number of hidden units is important for modeling the nonlinearity of the problem. More hidden units are needed as the nonlinearity in the data set increases. In addition, the number of inputs to the network needs to be optimized to ensure that only the input points that are essential for pattern recognition are included. Too many input points will slow the network, and too few may not include all the necessary information. The number of epochs is another variable that needs to be optimized. Training results always improve as epochs are added, though prediction results usually improve with added epochs to a certain level, and then the results worsen as the number of epochs increases. This is a consequence of overfitting, i.e., allowing the weights to fit the noise in the training patterns. Each of the five variables may also have joint effects on the network. Optimization of the five network variables involves choosing appropriate levels for each, followed by the construction and testing of each network. The key to computational efficiency lies in the ability to minimize the number of networks that must be constructed and tested. This must be done carefully, however, as the relationships among the variables determine the degree to which the optimal value of one variable depends on the value of another. Two-way, three-way, and higher order relationships may exist among the five network variables. Computational efficiency would dictate that only those interactions that are significant should be studied. An elaborate study of two variables that are not closely related would be a waste of computational resources, as those variables could very easily be optimized independently. Therefore, an understanding of the relationships among the variables is critical to the development of an optimization procedure that will both generate an overall optimal network design and maximize computational efficiency. A principal goal of this study was to establish a protocol for network optimization by examining the significance of the relationships among the five network variables. Experimental Design. An experimental design was performed to study the relationships among the five variables. This work was inspired by a previous study conducted by Shaffer and Small that explored the relationships among digital filtering and interferogram segment variables used in conjunction with PLDA.22 Both the main effects and the interaction effects of the experimental variables can be examined by employing an experimental design. A main effect is assigned to each variable, which shows the effect of a change in the value of that variable on the experimental result. An interaction effect is assigned to each combination of variables, which shows the effect of simultaneously changing the values of all the variables in the combination. Since all main and interaction effects should be completely explored, each combination of variables must be studied. This experimental design is designated a full factorial design. A full factorial experimental design employs a response function that describes the overall performance or effectiveness of the variable settings. This function has the form

R ) f(p1, p2, ..., pn)

(6)

where R is the value of the response function for a particular (22) Shaffer, R. E.; Small, G. W.; Combs, R. J.; Knapp, R. B.; Kroutil, R. T. Chemom. Intell. Lab. Syst. 1995, 29, 89-108.

Table 3. Variables and Levels Used in Factorial Design variable

levels

no. of inputs (in) no. of hidden units (hu) learning rate (lr) momentum (mo) no. of epochs (ep)

50, 110 10, 20 0.1, 0.3, 0.5, 0.7, 0.9 0.1, 0.3, 0.5, 0.7, 0.9 250-2000 in steps of 250

setting of the n variables, pi. The value of the response function is related to how the variable settings affect the optimal result. In this application, the response function used was based on the classification performance of the network when presented with the patterns in the monitoring set. The function used was

R)

(

)

100 acorrect icorrect + 2 atotal itotal

(7)

where R is the value of the response, acorrect is the number of TCEactive patterns classified correctly, atotal is the total number of TCEactive patterns, icorrect is the number of TCE-inactive patterns classified correctly, and itotal is the total number of TCE-inactive patterns. This response function gives equal weighting to the performance of the network in classifying TCE-active and TCEinactive patterns. The optimal result corresponds to the largest value of the response function. When combined with the values of the n variables, the response function defines a multidimensional surface whose shape illustrates the manner in which the individual variables and their joint effects impact the classification performance of the network produced through the use of those variables. Each main and interaction effect is expressed in response function units. A full factorial experimental design study was performed with the levels of the variables specified in Table 3. Two interferogram segments based on the optimal filtered segments found for the TCE spectral band of 938 cm-1, two different numbers of hidden units, five values for both the learning rate and momentum, and eight values for the number of epochs were studied. The levels of learning rate and momentum were varied over the range 0.01.0, and the longest and shortest interferogram segments were used. The choice of the number of epochs was found from a preliminary study showing the rate of convergence of the network, and the levels for the number of hidden units were chosen on the basis of the dimensionality of the input patterns. As defined in Table 3, this full factorial experimental design study corresponded to a total of 2 × 2 × 5 × 5 × 8 ) 800 possible combinations of the variables. To obtain a value for the experimental error, three replicate training procedures were performed for each combination of variables, resulting in a total of 2400 evaluations of the response function. Replication was achieved by changing the seed for the random number generator that was used to assign the initial weights of the network. This caused the weights to be updated differently, and as a consequence, the final weights had different values. Analysis of Variance. Analysis of variance (ANOVA) techniques were used to assess the main and interaction effects from the BNN results. The response function values corresponding to

the variable settings were fit to a least-squares model that separated the variance in the response function into assignable main and interaction effects and random variation.23-25 The model employed for this work had the form

Y ) µ + in + hu + lr + mo + ep + in × hu + in × lr + in × mo + in × ep + hu × lr + hu × mo + hu × ep + lr × mo + lr × ep + mo × ep + in × hu × lr + in × hu × mo + in × hu × ep + in × lr × mo + in × lr × ep + in × mo × ep + hu × lr × mo + hu × lr × ep + hu × mo × ep + lr × mo × ep + in × hu × lr × mo + in × hu × lr × ep + in × hu × mo × ep + in × lr × mo × ep + hu × lr × mo × ep + in × hu × lr × mo × ep + e (8)

where Y is the response function value for the specific combination of variables, µ is the overall mean of the response, and e is the error term that estimates random variation. The other terms are the main and interaction effects based on the combinations of number of inputs (in), number of hidden units (hu), learning rate (lr), momentum (mo), and number of epochs (ep). For example, hu × lr is the two-way interaction effect involving the number of hidden units and the learning rate. Fitting the model described by eq 8 produces a sum of squares and mean square for each main and interaction effect, along with a sum of squared errors and mean squared error. F tests can be utilized to test the significance of each main and interaction effect, where each mean square is compared to the mean squared error. A probability can be computed from the F test results, which reports whether a certain effect is significant. The null hypothesis associated with each probability is that the corresponding mean square is equal to the mean squared error, taking into consideration the sampling variability, which is encoded in the degrees of freedom. When the probability is near zero, the null hypothesis can be rejected with a high degree of confidence, demonstrating that the effect is significant. The validity of any conclusions drawn from the F values and associated probabilities is based on the assumptions that the model in eq 8 is correct and that the residuals of the model are drawn from a normal distribution. To evaluate these assumptions, the residuals from the model were studied. Figure 1A plots the residuals vs the corresponding fitted response function values predicted by the model. The fan-shaped profile observed in the figure suggests that the variance in the response function values is not constant. The probabilities computed from the F values may not be accurate in this case. Normality was investigated by constructing a normal probability plot of the residuals. These results can be seen in Figure 2A. The expected residuals were found by employing Blom’s method to estimate the residuals that would be found from the case of a normal distribution.26 If the plot of residuals vs expected (23) Massart, D. L.; Vandeginste, B. G. M.; Deming, S. N.; Michotte, Y.; Kaufman, L. Chemometrics: A Textbook; Elsevier: Amsterdam, 1988. (24) Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters; Wiley: New York, 1978. (25) Mason, R. L.; Gunst, R. F.; Hess, J. L. Statistical Design and Analysis of Experiments; Wiley: New York, 1989. (26) Blom, G. Statistical Estimates and Transformed Beta Variables; Wiley: New York, 1958.

Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

1685

Figure 1. Plots of the residuals calculated from the least-squares fit of the ANOVA model vs the corresponding fitted values predicted by the model: (A) results obtained with the response function defined by eq 7; (B) results obtained with the transformed response function in eq 9.

residuals were nearly linear, the assumption that the residuals are normally distributed would be correct. However, visual inspection of Figure 2A reveals that the residuals are not derived from a normal distribution. The correlation coefficient for this plot is 0.882, which for 2400 trials is not statistically significant. Since the assumptions of normality and constant variance were shown to be incorrect, measures were taken to address these concerns. One possibility is to perform a transformation of the response function values, with the goal of computing a more normally distributed response function. Two transformations were investigated, and the ANOVA study was repeated with each set of transformed response function values. The Box-Cox transformation27,28 was found to produce no improvement in the degree of nonconstant variance or in the linearity of the plot of residuals vs expected residuals. Also evaluated was the logit transformation,28 defined as

R′ ) ln(R/(100 - R))

(9)

where R′ is the transformed response function (termed the logit mean response) and R is the original response function defined by eq 7. Figures 1B and 2B are the residual and normal probability plots constructed after application of the logit transformation and recalculation of the model in eq 8. (27) Box, G. E. P.; Cox, D. R. J. R. Stat. Soc. B 1964, 26, 211-252. (28) Neter, J.; Kutner, M. H.; Nachtsheim, C. J.; Wasserman, W. Applied Linear Statistical Models, 4th ed.; Irwin: Chicago, IL, 1996.

1686 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

Figure 2. Normal probability plots of the residuals calculated from the least-squares fit of the ANOVA model vs the estimated residuals: (A) results obtained with the response function defined by eq 7 (correlation coefficient 0.882); (B) results obtained with the transformed response function in eq 9 (correlation coefficient 0.956).

Comparison of parts A and B of Figure 1 reveals that the logit transformation reduces the degree of nonconstant variance somewhat, although a marked fan-shaped profile still exists in the plot. A similar comparison of parts A and B of Figure 2 confirms that the transformation produces an improvement in linearity, with the correlation coefficient now increased to 0.956. However, this value is still not significant for 2400 observations. On the basis of these findings, we concluded that the probabilities calculated from the F values are not reliable for use in assessing the significance of the effects. As an alternate strategy, we used the relative values of the F statistic as the criterion for the significance of each effect and its influence on the performance of the BNN. Since the use of the logit transformation produced residuals with somewhat improved behavior, the F values obtained with the transformed response function values were employed. The ANOVA table obtained with the transformed response function is provided in Table 4. The first column of the table is the source of variance (main and interaction effects), and the second column contains the corresponding degrees of freedom. The values in the third column are the sum of squares, while each value in the fourth column is the mean square for each main and interaction effect. The computed F values are given in the fifth column of the table. Inspection of the F values in Table 4 reveals that three main effects and two two-way interactions are clearly the most significant effects (F > 30). Two main effects, the number of inputs and the momentum value, appear very significant (F > 600). The number of epochs appears somewhat significant. The two-way

Table 4. Analysis of Variance Table source in hu lr mo ep in × hu in × lr in × mo in × ep hu × lr hu × mo hu × ep lr × mo lr × ep mo × ep in × hu × lr in × hu × mo in × hu × ep in × lr × mo in × lr × ep in × mo × ep hu × lr × mo hu × lr × ep hu × mo × ep lr × mo × ep in × hu × lr × mo in × hu × lr × ep in × hu × mo × ep in × lr × mo × ep hu × lr × mo × ep in × hu × lr × mo × ep

degrees of sum of freedom squares 1 1 4 4 7 1 4 4 7 4 4 7 16 28 28 4 4 7 16 28 28 16 28 28 112 16 28 28 112 112 112

Fa

2442.55 2442.55 12655.68 2.08 2.08 10.76 8.34 2.09 10.80 507.05 126.76 656.25 45.87 6.55 33.92 5.90 5.90 30.54 6.85 1.71 8.87 364.97 91.24 472.36 1.29 0.18 0.95 3.16 0.79 4.09 3.63 0.91 4.70 2.57 0.37 1.90 10.91 0.68 3.53 1.79 0.06 0.33 3.61 0.13 0.67 4.59 1.15 5.94 2.57 0.64 3.33 1.25 0.18 0.93 10.02 0.63 3.24 1.95 0.07 0.36 3.04 0.11 0.56 16.87 1.05 5.46 2.53 0.09 0.47 2.48 0.09 0.46 9.71 0.09 0.45 21.22 1.33 6.86 3.51 0.13 0.65 3.63 0.13 0.67 8.58 0.08 0.40 8.85 0.08 0.41 8.30 0.07 0.38

error

1600

309.06

total

2399

3828.71

a

mean square

Figure 3. Main effects plots illustrating the mean transformed response function value (R′ in eq 9) for each variable setting. The overall mean transformed response function value is indicated by the dashed line. The values on the vertical axis are in units of the transformed response function. The tic marks on the horizontal axis correspond to the levels for each variable.

0.19

Effects judged significant are indicated in boldface type.

interaction of inputs and momentum is very significant (F > 400). The two-way interaction of inputs and hidden units appears somewhat significant. This implies that the number of inputs must be optimized in conjunction with the momentum and the number of hidden units. The smaller F value for the two-way interaction of hidden units and momentum suggests that the interaction between momentum and inputs can be studied with the number of hidden units held constant. After optimization of the momentum value, the interaction between inputs and hidden units can then be studied. Other two-way, three-way, four-way, and five-way interactions are deemed to be statistically insignificant according to this criterion. These observations imply that the learning rate can be optimized independently of the other variables. Figures 3 and 4 further illustrate these points. Inspection of Figure 3, the main effects plot, confirms that the number of inputs and the momentum value are the most significant factors. The transformed response function values are strongly dependent on these factors, as evidenced by the large change in R′ across the different levels of the factors. Figure 4, the interaction plot, illustrates the effects of the two-way interactions between the variables. In this plot, the two-way interaction of momentum and inputs is shown to be very significant, as evidenced by the divergence in the traces for the input levels of 50 and 110. This confirms that a change in the number of inputs affects the manner in which the momentum value affects R′. By the same criterion, the two-way interaction of inputs and hidden units appears to be

Figure 4. Interaction plots illustrating the mean transformed response function values (R′ in eq 9) for every possible pairing of the five variables. Within each box, the vertical axis is in units of the transformed response function, whereas the horizontal axis units correspond to the variable specified for that column. The numbers appearing on each response curve are the values for the levels of the row variable (see Table 3). The most significant joint effects are indicated by divergent curves.

somewhat significant. The traces for the other two-way interactions are almost parallel, indicating that a change in the level of one factor has little effect on the manner in which a change in the other factor affects R′. These observations agree with the results obtained previously through the inspection of the F values. Protocol for Network Optimization. The results of the experimental design study suggest a protocol for use in developing a BNN classifier for filtered FT-IR interferogram data. It is clear that significant efforts must be expended to optimize the number of inputs. The number of hidden units must also be studied in conjunction with the number of inputs. A complicating factor is that there is clear evidence that the momentum value is significant in determining the success of the training and therefore must be optimized together with the number of inputs. Once the optimal momentum value is determined for a given number of inputs, the momentum can be held at that fixed level and the optimal number of hidden units can then be investigated. The experimental design indicated that the learning rate could be optimized separately from the other variables. In the study described here, a relatively small value of the learning rate such as 0.1 was observed to perform Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

1687

Table 5. Variables and Levels Used in Network Optimization variable no. of inputs (in) no. of hidden units (hu) learning rate (lr) momentum (mo) no. of epochs (ep)

Table 6. Optimized Network Parameters and Monitoring Results

momentum

epochs

monitoring set classification results (%)a

levels 50, 70, 90, 110 5, 7, 10, 12, 14 0.1 0.1, 0.3, 0.5, 0.7, 0.9 250-4000 in steps of 250

adequately. This value was employed without further optimization. Finally, the effect of the number of epochs used in training the network weights must be evaluated during the optimizations. A workable procedure is to use a fixed number of levels for the epochs and then to employ a monitoring set of patterns to test for the optimal level of training. This monitoring step would be employed during the investigation of each selected combination of the other variables. Neural Networks for the Detection of TCE. This study utilized prediction sets 1 and 2, as well as the calibration and monitoring sets. The patterns in the training set were presented to the network for optimizing the weights, and the monitoring set was presented to the network for prediction. Results from that prediction were then compared to results from other network designs. The network that exhibited the best results was chosen as the optimal network. Hence, the monitoring set was independent in that it was not used to train the network. However, it was not independent in the sense that it was used for the purpose of selecting the optimal network. Prediction sets 1 and 2 were not used at all during the optimization of the network and thus were completely independent test sets. Network Optimization. The results of the experimental design study defined an optimization protocol for the five variables associated with training a BNN. As previously stated, analysis of variance results showed that the learning rate had no significant interaction with the other factors, which allowed it to be optimized independently. The smallest level that was tested, 0.1, was chosen. Small values for the learning rate are beneficial because they allow the network to train slowly enough not to bypass optima on the response surface of the optimization. The levels tested for the other variables are listed in Table 5. For each combination of levels tested, three replicate networks were constructed by changing the seed of the random number generator used to initialize the network weights. The numbers of inputs were fixed at 50, 70, 90, and 110 points to maintain compatibility with the previous work of Bangalore et al.9 Hidden unit values were chosen on the basis of the initial experiments with 10 units from the experimental design study described previously and then were increased and decreased from there. Since the momentum needs to be set between 0 and 1, even increments of 0.2 between 0.1 and 0.9 were chosen. The levels for the number of epochs were found by preliminary experiments. The key here was to ensure that the maximum number of epochs did indeed encompass the maximum in the response function score. The previous results revealed that the momentum and inputs should be studied together, while the number of hidden units is held constant. This experiment found the optimal momentum for each input level at a fixed number of hidden units (10). The 1688 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

band

inputs

hidden units

845 cm-1

50 70b 90 110

12 10 10 12

0.3 0.7 0.7 0.1

2250 1250 1750 1750

98.61 99.03 98.47 93.71

938 cm-1

50 70 90 110

12 10 10 10

0.1 0.7 0.5 0.7

1250 3500 4000 1750

97.41 97.70 98.48 98.50

a Calculated with eq 7. b This segment with the indicated parameters was found to give the best monitoring results and was used in subsequent experiments.

optimal settings were found by presenting the trained network with the monitoring set. The response function used was the same function employed previously in the experimental design (eq 7). The mean response across the three replicate networks was calculated for each combination of inputs and momentum. For a given replicate network, the weights used corresponded to the number of epochs that produced the best value of the response function. The momentum that gave the best mean response was said to be optimal for that number of inputs. After optimization of the momentum, the numbers of inputs and hidden units were studied. A procedure similar to that used in the study of momentum was employed, with three replicate networks again being trained. The mean response for the monitoring set was found for each combination of hidden units and inputs. Again, the specific weights chosen to be optimal corresponded to the number of epochs that produced the best value of the response function. From this work, the single replicate that produced the overall best response was chosen to be the final network for testing with prediction sets 1 and 2. Table 6 lists the optimized variables and response function values for each interferogram segment. From these, the best result arises from the 70-point interferogram segment filtered with the 845 cm-1 filter. The optimal momentum value for this segment was 0.7. In Figure 5A, response function values for networks based on this segment are plotted vs hidden units. The error bars in Figure 5A are constructed on the basis of the 95% confidence intervals computed across the three replicate networks at each level of hidden units. For the optimal case of 10 hidden units, Figure 5B plots the response function values vs the number of training epochs. The network trained with 1250 epochs was judged optimal. Figure 5B plots separate curves for each replicate network. Comparison to PLDA Results. The optimized BNN described above based on an input segment of 70 interferogram points was employed for classification of the patterns in prediction sets 1 and 2. The classification results for prediction set 1 were tabulated across all of the TCE-active and TCE-inactive interferograms. Statistics computed were the overall percent correct classification of the TCE-active interferograms and the false detection percentage of the TCE-inactive interferograms. Prediction set 2 was subdivided into three subsets according to the type of analyte-inactive patterns. Patterns containing acetone and NH3

The BNN also exhibited a lower false detection rate for the monitoring set. The training results for PLDA showed no false detections because of the single-sided requirement of the PLDA training algorithm that penalizes false detections during the optimization of the discriminants.8 For prediction set 1, the classification percentages produced by both methods were quite high, although the BNN appeared to outperform PLDA slightly. False detection percentages were very low for both techniques. With prediction set 2, which consisted of data collected after instrumental changes had occurred, both the BNN and PLDA methods suffered a reduced recognition percentage for TCE and an increased rate of false detections. False detections were particularly bad for the PLDA method with prediction set 2. The false detection rate exceeded 10% for blackbody backgrounds, 30% for the acetone backgrounds, and 95% for the NH3 backgrounds. The BNN performed much better in this regard, although the false detection rate was still more than 48% for the NH3 data. As noted previously, no NH3 data were present in either the training or the monitoring sets. This underscores the unpredictability in performance of all pattern recognition methods when presented with patterns that are fundamentally different from those included in the training set.

Figure 5. (A) Plot of the mean response function values (eq 7) vs hidden units for the networks trained with the 70-point interferogram segment filtered with the 845 cm-1 filter. A momentum value of 0.7 was used. Error bars are computed as the 95% confidence intervals based on the response function values obtained from the three replicate networks. (B) Plot of the response function values (eq 7) vs epochs for the three replicate networks trained with the 70-point interferogram segment filtered with the 845 cm-1 filter. Each network consisted of 10 hidden units, and a momentum value of 0.7 was used. Table 7. BNN and PLDA Results % TCE correct

% false detection

data set

BNN

PLDA

BNN

PLDA

training monitoring prediction 1 prediction 2 lab/backgrounds lab/acetone lab/NH3

99.40 98.40 99.31

98.50 96.20 98.94

0.22 0.35 0.30

0 0.50 0.26

89.83

94.15

0.72 4.84 48.14

11.76 31.25 95.88

were tabulated separately. The remaining analyte-inactive patterns were taken together as backgrounds. All of the TCE-containing interferograms were tabulated together. Results from the application of the optimal network to each data set and subset are given in Table 7 along with the corresponding results from the application of PLDA. Classification results from the training and monitoring sets are also included in the table to facilitate comparisons. The best piecewise linear discriminant computed by Bangalore et al.9 was used to produce these classification results. That discriminant was based on the use of the 938 cm-1 filter and the 110-point interferogram segment. Results for the training and monitoring sets showed slightly improved TCE detection capabilities for the BNN relative to PLDA.

CONCLUSIONS Through the use of experimental design techniques, the research presented in this paper demonstrated that an effective protocol could be developed for applying ANN techniques to the automated detection of TCE vapor from passive FT-IR remote sensing measurements. Overall, the optimized BNN developed here demonstrated improved pattern recognition performance relative to the PLDA method used previously. While the principal drawback of the BNN is the increased computational overhead associated with training, the training protocol developed here serves to maximize the computational effort in the optimization of the variables that have the greatest significance in determining the network performance. In summary, the computed BNNs performed well for the identification of TCE signatures in FT-IR remote sensing interferogram data. One of the keys to the growth of FT-IR remote sensing as a general-purpose environmental monitoring method lies in the enhancement of its selectivity and sensitivity through improved data handling methods. On the basis of the results presented here, the use of ANNs with band-pass-filtered interferogram segments to implement automated compound recognition algorithms shows significant promise for meeting these performance goals. ACKNOWLEDGMENT This research was funded by the Department of the Army. Initial results from this work were presented at the 28th Central Regional Meeting of the American Chemical Society, Dayton, OH, June 1996.

Received for review September 17, 1999. Accepted January 20, 2000. AC991075S Analytical Chemistry, Vol. 72, No. 7, April 1, 2000

1689