Anal. Chem. 1986, 58, 860-866
RECEIVED for review August 2,1985. Accepted November 14, 1985. This research was supported in part by a grant from the National Science Foundation.
Detection of Hazardous Gases and Vapors: Pattern Recognition Analysis of Data from an Electrochemical Sensor Array

Joseph R. Stetter¹
Energy and Environmental Systems Division, Argonne National Laboratory, Argonne, Illinois 60439

Peter C. Jurs
Department of Chemistry, The Pennsylvania State University, University Park, Pennsylvania 16802

Susan L. Rose*
Chemistry Division, Naval Research Laboratory, Washington, D.C. 20375-5000
A portable device is being developed by Argonne National Laboratory to detect, identify, and warn U.S. Coast Guard emergency response personnel of the presence of hazardous gases and vapors. The device operates in the situation of a spill or ship cargo hold where a puddle of the contaminant may be present. The prototype device uses an array of four different electrochemical sensors, which can be operated in four different modes. Thus, the array yields 16 channels of data for each chemical species detected. Pattern recognition is one way of determining the uniqueness of the information obtained and the capacity of each of the channels for classification. The matrices generated by detecting 22 gases and vapors with a prototype instrument are being examined by use of pattern recognition methods. Results to date indicate that approximately half of the channels provide unique information.
¹Present address: Transducer Research, Inc., Naperville, IL 60540.

While cleaning up chemical spills and waste sites or inspecting the hulls of chemical tankers, U.S. Coast Guard personnel may be exposed to hazardous gases and vapors. In such situations the chemical is often leaking into the air, so a sample of the vapor in the air is easily obtained. Background gases may be insignificant, and one is left with the problem of identifying and quantifying a single unknown vapor in air. Certain classes of compounds are considered hazardous because of their toxicity, possible carcinogenicity, or flammability. Although commercial devices are available for detecting a single hazardous gas, the equipment cannot identify the compounds that are present. Therefore, it is not possible to assess the potential health risk associated with exposure. Further, there are hazardous gases or vapors that cannot be detected with any of the commercially available portable instruments. In addition, it would be beneficial to have a single instrument for all compounds rather than two or three units each providing some analyses.

A portable instrument capable of detecting, identifying, and monitoring hazardous gases and vapors at and above the parts-per-million (ppm) level is being developed by Argonne National Laboratory to help prevent personnel exposures. The instrument, which is called a chemical parameter spectrometer (CPS) (1-3), is a small, microprocessor-controlled device composed of a compact array of amperometric gas sensors, each of which is adjusted to respond differently to electrochemically active gases and vapors. The instrument's sensor array yields more information than single-sensor instruments and can be used to identify and quantify many airborne chemicals.

Pattern recognition techniques use modern mathematical methods based on multivariate statistics and numerical analysis to elucidate relationships in multidimensional data sets. These methods, which are without human bias, can improve measurements by enhancing the extraction of chemical information from chemical data. They can also reduce interference effects and improve selectivity in analytical measurements. Multidimensional data sets lend themselves to computer-assisted pattern recognition, particularly when the complexity of the information makes data analysis by ordinary means very difficult.

The fundamental premises of pattern recognition as applied to electrochemical sensor array analysis are as follows:

1. The compound and the instrument's response are related.
This article not subject to U.S. Copyright. Published 1986 by the American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 58, NO. 4, APRIL 1986 861
2. A compound can be adequately represented as a set of sensor responses.
3. A relation can be discovered between compounds and their responses by applying pattern recognition methods to a set of tested compounds.
4. The relation can be extrapolated to untested compounds from similar classes.

The electrochemical responses, which encode chemical information in a numerical form, define the axes of a coordinate system in space. Each compound can be thought of as a point whose position in space is defined by the values produced by each sensor. Further, related objects tend to cluster in a space defined by the sensor responses. Pattern recognition is a set of methods for investigating such clusters in space.

The large number of sensor responses generated by the CPS for each species increases the usefulness of the instrument only if the instrument produces a unique response for each compound. Pattern recognition methods can be used to determine the discrimination capacity of each of the sensors and each of the filament conditions as well as the uniqueness of the information in the data set.

The investigations described here are part of a program designed to explore the use of pattern recognition techniques to improve the ability of instruments to accurately identify trace levels of specific compounds in real environments. The immediate goals of the program are to investigate the clustering of a set of hazardous gases and vapors and to minimize the number of sensors necessary to obtain good separation and analysis of the different chemical species.

Twenty-two hazardous gases and vapors of interest to the U.S. Coast Guard were presented to the CPS. These compounds were chosen because they are typical of the environment in which the CPS will be used and because they represent several different classes of hazardous chemicals.

Table I. Data Set

no.  name                    no.  name
1    nitrogen dioxide        12   ethyl acrylate
2    nitric oxide            13   nitromethane
3    hydrogen sulfide        14   chloroform
4    ammonia                 15   pyridine
5    benzene                 16   toluene
6    carbon monoxide         17   sulfur dioxide
7    cyclohexane             18   tetrahydrofuran
8    acetic acid             19   chlorine
9    carbon tetrachloride    20   nitrobenzene
10   formaldehyde            21   acetone
11   benzyl chloride         22   tetrachloroethylene

Figure 1. Typical normalized patterns obtained from the sensor array. [Four panels: nitrogen dioxide, cyclohexane, benzene, and carbon monoxide; axis: channel, 1-16.]
EXPERIMENTAL SECTION

This study was conducted in two stages: (a) collection of data with the CPS and (b) application of pattern recognition analysis methods to the collected data.

Data Set. The 22 gases and vapors listed in Table I, ranging in concentration from 20 to 300 ppm in air, were presented individually to the sensor array. The array consists of four different electrochemical sensors preceded by two heated noble metal filaments (2, 3). Sixteen channels of data are produced from the electrochemical sensors, which can be set at different oxidation and reduction potentials and operated in four different modes: (a) no filament, (b) platinum filament at a fixed temperature, (c) rhodium filament at a fixed temperature, and (d) rhodium filament at a second fixed temperature (2). The heated filaments partially oxidize some compounds before they pass over the electrochemical sensors. The types of electrochemical sensors and filaments used, as well as their potentials and temperature settings, have not been optimized for the compounds of interest, but the technique has been reported (4).

Data were collected under the conditions described in Table II and ref 2. Vapor mixtures at parts-per-million levels were passed through the four sensors sequentially, for each of the four modes of operation. At the end of this sequence, the 16 signals, or channels of information, were recorded for each compound. For the 22 compounds, the 16 channels provide a 22 × 16 matrix of data. Figure 1 shows several normalized patterns (i.e., the largest signal is adjusted to 128 counts and the other signals are scaled proportionately) obtained by using the prototype device. The differences in the histograms for NO2, cyclohexane, benzene, and CO are sufficient to permit identification of each compound even by eye. The responses of each channel as a function of concentration have been reported to be linear for such sensor systems in the concentration range studied (3). Other compounds have been studied in this system (2, 5, 6), and the instrument system that incorporates the sensor array has been described elsewhere in some detail (3).

The data set was collected over a concentration range of 20-300 ppm. Each vapor had a known concentration, which was different from the other vapors in the data set. For pattern recognition analysis, the response to individual compounds was corrected for concentration by dividing the responses for a given compound in each channel by the known concentration. This normalized the data, making all the responses equivalent to the response of each at 1 ppm.

Methodology. The methods were implemented with a computer software system known as ADAPT, which stands for Automated Data Analysis and Pattern Recognition Toolkit (7). This system includes a wide variety of techniques for performing complex compound-response analysis. The responses of the amperometric sensors to compounds of interest are numerical physicochemical descriptors. Experiments were conducted to collect these descriptors. The sensor and
Table II. Channel Numbers as Determined by Operating Modes and Sensors

                                  channel number by sensor operating mode
cell  working                     no        Pt filament,  Rh filament,  Rh filament,
no.   electrode   biasᵃ           filament  750 °C        450 °C        600 °C
1     Au          -200 mV         1         5             9             13
2     Au          +300 mV         2         6             10            14
3     Pt          +200 mV         3         7             11            15
4     Pt black    0 mV            4         8             12            16

ᵃ Volts vs. a Pt-air reference electrode in the same electrolyte.
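The channel numbering in Table II can be expressed as a small lookup. The sketch below assumes that channels are numbered mode by mode (channels 1-4 for no filament, 5-8 for the Pt filament, 9-12 for the Rh filament at 450 °C, and 13-16 for the Rh filament at 600 °C), which is how the table layout reads; the function names are hypothetical and are not part of the CPS software.

```python
# Hypothetical helpers; the numbering rule is an assumption read off Table II.
MODES = {
    1: "no filament",
    2: "Pt filament, 750 C",
    3: "Rh filament, 450 C",
    4: "Rh filament, 600 C",
}
N_CELLS = 4


def channel_number(mode: int, cell: int) -> int:
    """Channel for an operating mode (1-4) and cell (1-4), assuming mode-by-mode numbering."""
    if mode not in MODES or not 1 <= cell <= N_CELLS:
        raise ValueError("mode and cell must both be in 1..4")
    return N_CELLS * (mode - 1) + cell


def channels_for(modes, cells=range(1, N_CELLS + 1)):
    """Sorted channel numbers produced by the given modes and cells."""
    return sorted(channel_number(m, c) for m in modes for c in cells)


all_channels = channels_for(MODES)          # every mode, every cell
heated_subset = channels_for([2, 4])        # Pt filament and Rh filament at 600 C only
print(len(all_channels), heated_subset)
```

Selecting channels by mode and cell rather than by raw number keeps a reduced feature set readable when some modes are later dropped.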
filament conditions needed to be carefully selected to obtain the best results. To thoroughly represent the data with a minimum number of features, the experimenter must rely on his or her knowledge of the data set.

The sensor responses (descriptors) for each compound were entered into the data set and stored as a data matrix. Each row in the data matrix is represented by a pattern vector

$$X_i = (x_1, x_2, \ldots, x_n)$$

where $X_i$ is the pattern for compound $i$ and $x_1, \ldots, x_n$ are the $n$ sensor responses.

Because descriptor values can differ by several orders of magnitude, it was important to make them more compatible. Autoscaling, a preprocessing method that scales and normalizes the data, was useful for correcting the imbalance between descriptors. Each pattern vector component was standardized to a mean of zero and a standard deviation of unity. Even though autoscaling altered the actual values of the descriptors, it did not alter the number of features or the basic geometry of the clustering (7).

Because many of the available descriptors encode similar information, collinearities between descriptors may cause numerical instabilities in the analysis phase. Using a multiple linear regression program to locate such collinearities resulted in a set of descriptors in which each contributes unique information.

Once a satisfactory set of descriptors had been chosen, the pattern recognition phase of the analysis was initiated. In general, the compounds are considered as points defined by n descriptors and projected onto n-dimensional space, where the coordinate system is defined by the descriptors. Three different types of pattern recognition methods are available: display, mapping, and clustering. Because it is impossible to imagine the data points clustering in n-dimensional space, two display methods were used to transform the data into two-dimensional space for easier visualization.
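The two preprocessing steps described so far, concentration correction and autoscaling, can be sketched as follows. This is a minimal illustration, not the ADAPT implementation: the function names and the sample numbers are made up, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def concentration_correct(responses, concentrations_ppm):
    """Divide each compound's channel responses by its known concentration (response per ppm)."""
    return [[x / c for x in row] for row, c in zip(responses, concentrations_ppm)]


def autoscale(matrix):
    """Scale every column (descriptor) to zero mean and unit standard deviation."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]   # assumption: sample standard deviation
    if any(sd == 0 for sd in sds):
        raise ValueError("a constant descriptor cannot be autoscaled")
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in matrix]


# Made-up responses for three compounds on two channels (not data from the paper).
raw = [[40.0, 2000.0], [90.0, 1500.0], [300.0, 9000.0]]
ppm = [20.0, 30.0, 100.0]

per_ppm = concentration_correct(raw, ppm)
scaled = autoscale(per_ppm)
print(per_ppm)
print(scaled)
```

After autoscaling, a channel with large raw signals no longer outweighs a channel with small ones in any distance-based comparison.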
The first display method plots the patterns of a data set as circular profiles, that is, as polygonal figures consisting of lines connecting points equally spaced around the center like spokes on a wheel. The distance of each point from the center depends on the actual sensor response. This routine provides a visual, graphical way of presenting the sensor responses, so that similarities or differences in patterns can be more easily recognized.

The second display method, the Karhunen-Loeve transformation, finds the axes in the data space that account for the major portion of the variance while maintaining the least amount of error. A correlation matrix for the stored data set is computed, and the eigenvalues and eigenvectors are then extracted. The two-principal-components plot presents the plane that best represents the data (8).

The nonlinear mapping routine transforms a set of points from n space to two-space by maintaining the similarities and dissimilarities between the points. It does this by minimizing an error function of $d_{ij}$ and $d_{ij}^*$, where $d_{ij}$ is the distance in n space and $d_{ij}^*$ is the distance in two-space between points i and j (9).

Clustering techniques are considered unsupervised learning techniques because the routines are given only the data and not the class membership of the points. Such methods group together similar compounds according to some criterion. By examination of the different clustering results, a clearer insight is gained into the actual clustering in n space (8).

ADAPT includes a variety of agglomerative hierarchical clustering routines, which group the data by progressively fusing them into subsets, two at a time, until the entire group of patterns is a single set. The routines maintain a particular within-group homogeneity, depending on the criterion and the fusing strategy used. Three dissimilarity metrics were used: (a) correlation coefficient, (b) Euclidean distance, and (c) Canberra distance. The fusing strategies investigated were (a) nearest neighbor, (b) furthest neighbor, and (c) flexible fusion. The data are displayed in a dendrogram (10).

ADAPT also includes four cluster-seeking routines described in ref 8 as (a) simple cluster seeking, (b) maximin distance, (c) K means, and (d) Isodata. Each routine groups the data into individual clusters according to Euclidean distance, and each uses a different threshold to define proximity. The cluster routines vary in complexity, generating clusters based strictly on the distance threshold. These routines are exploratory in the sense that performance depends on the data set, the starting parameters, and the chosen measure of similarity. Compounds are lumped together when the distances separating them are within the specified threshold. The results of these routines can be compared to see how consistent the clusters remain. The distance between members within each cluster is calculated, as well as the distance between the cluster centers. These results assist the investigator in visualizing the clustering, if any exists, in the data set (8).

Figure 2. K means clustering results: (a) 16 channels and (b) 8 channels. [Labels appearing in the panels: HCHO, SO2, H2S, NO, Cl2, NH3, and "remaining gases".]
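Of the cluster-seeking routines listed above, K means is the easiest to illustrate. The sketch below is a generic textbook version using Euclidean distance, not the ADAPT routine; seeding the centers with the first k patterns is an assumption made only to keep the example deterministic, and the toy patterns are invented.

```python
from math import dist


def k_means(patterns, k, max_iter=100):
    """Plain K means with Euclidean distance; the first k patterns seed the centers."""
    centers = [list(p) for p in patterns[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every pattern to its nearest center.
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in patterns]
        if new_labels == labels:
            break                      # assignments are stable: converged
        labels = new_labels
        # Move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(patterns, labels) if lab == j]
            if members:                # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Invented two-channel patterns forming two obvious groups.
toy = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0], [1.0, 0.0]]
labels, centers = k_means(toy, k=2)
print(labels)
print(dist(centers[0], centers[1]))    # distance between the two cluster centers
```

Because the outcome depends on the starting centers and on k, comparing runs with different settings is one way to judge how stable a grouping is, in line with the exploratory use described above.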
RESULTS AND DISCUSSION

Regression analysis revealed high correlations between the channels representing different modes or filament conditions. Eigenanalysis showed that 99% of the variance was contained in the first six principal components. The results also suggested that sensor responses for the modes corresponding to no filament and the rhodium filament operated at 450 °C do not contribute any new information. In other words, when modes 1 and 3 were deleted, the results for the remaining eight descriptors (modes 2 and 4) were almost identical with the results obtained from all 16 channels. The K means results shown in Figure 2 illustrate typical results.

The circular profiles for eight channels (modes 2 and 4) shown in Figure 3 are presented to assist in visualizing the information contained in the sensor responses. Many of the compounds appear very similar; H2S produces a strong response from all the sensors.

Figure 4 is a plot of the 22 compounds in two-dimensional space as defined by the first two principal components, which represent 93% of the total variance of the data set. Figure 4a is dominated by H2S, Cl2, and NO, which are labeled. Figure 4b shows an expanded view of the most densely clustered region. Even here the compounds are relatively well separated. The good separation of the compounds in the cluster suggests that the sensors characterized the compounds, but that the analysis was dominated by a few outliers. Since it is important that the instrument be able to determine the outliers as well as the other members of the data set, they cannot simply be removed.

The Euclidean distance (Figure 2) and hierarchical clustering (Figure 5) routines verified the principal-components plot by consistently separating the following compounds from the others: H2S, NO, Cl2, HCHO, and SO2. These compounds were separated into their own classes, and the remaining gases in the data set were clustered together.
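The eigenanalysis figures quoted above (the share of total variance carried by the leading principal components) come from the eigenvalues of the correlation matrix. The sketch below shows one way to obtain such cumulative fractions without external libraries, using power iteration with deflation; this numerical method is chosen here for brevity and is not claimed to be what ADAPT uses. The function names and the example correlation matrix are made up.

```python
from math import sqrt


def eigenvalues_descending(matrix, iters=500):
    """Eigenvalues of a symmetric positive semidefinite matrix by power iteration with deflation."""
    m = [row[:] for row in matrix]
    n = len(m)
    values = []
    for _ in range(n):
        v = [i + 1.0 for i in range(n)]          # arbitrary, non-symmetric start vector
        lam = 0.0
        for _step in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in m]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:                     # nothing left to extract
                lam = 0.0
                break
            v = [x / norm for x in w]
            lam = norm
        values.append(lam)
        # Deflate: remove the component just found before looking for the next one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return values


def cumulative_variance(eigenvalues):
    """Running fraction of total variance explained by the leading components."""
    total = sum(eigenvalues)
    running, fractions = 0.0, []
    for ev in eigenvalues:
        running += ev
        fractions.append(running / total)
    return fractions


# Made-up correlation matrix for three channels, two of them strongly correlated.
corr = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]
eigs = eigenvalues_descending(corr)
print(eigs)
print(cumulative_variance(eigs))
```

Reading off the first index at which the cumulative fraction passes a chosen level (99% in the text) gives the number of components needed to represent the data at that level.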
A closure (normalization) method was implemented to reduce the effects of sensor sensitivity to some of the gases and vapors. The method, which is referred to as pattern normalization, divides the pattern descriptor values by the square root of the sum of the squared values. All the pattern vectors then have a length of one, which forces all the patterns onto a hypersphere with its center at the origin. Pattern normalization should be carried out prior to autoscaling to avoid numerical difficulties. Pattern normalization allows the data to be considered independent of both concentration and sensor sensitivity to specific compounds.

When pattern normalization is used prior to autoscaling, the results are much different. The principal-components plot (see Figure 6a) shows little clustering of the gases and vapors tested. The 16 channels of data are less correlated when pattern normalization is used, as shown by the correlation matrix in Table III. Eigenanalysis revealed that 99% of the variance was present in the first ten components.

Figure 3. Circular profiles of the data set using 8 rather than 16 channels. [One profile per compound of the 22-compound data set.]

Table III. Correlation Matrix for 16 Channels following Pattern Normalization

channel       1       2       3       4       5       6       7       8
      1   1.000
      2   0.991   1.000
      3   0.997   0.993   1.000
      4   0.557   0.594   0.535   1.000
      5   0.202   0.222   0.172   0.333   1.000
      6   0.082   0.006   0.060   0.120   0.888   1.000
      7   0.180   0.204   0.159   0.146   0.958   0.856   1.000
      8   0.024  -0.003   0.044  -0.633  -0.012   0.153   0.205   1.000
      9   0.164   0.221   0.157   0.323   0.731   0.645   0.690  -0.173
     10   0.098   0.124   0.092   0.331   0.601   0.658   0.511  -0.201
     11   0.216   0.253   0.200   0.360   0.720   0.565   0.693  -0.210
     12   0.224   0.166   0.236  -0.152  -0.408  -0.242  -0.397   0.349
     13   0.428   0.454   0.419   0.414  -0.034  -0.159  -0.089  -0.368
     14   0.413   0.461   0.411   0.398  -0.071  -0.239  -0.101  -0.378
     15   0.391   0.438   0.394   0.386  -0.057  -0.162  -0.100  -0.354
     16   0.339   0.352   0.326   0.495  -0.217  -0.490  -0.285  -0.467

channel       9      10      11      12      13      14      15      16
      9   1.000
     10   0.823   1.000
     11   0.920   0.781   1.000
     12  -0.499  -0.264  -0.377   1.000
     13   0.106  -0.079   0.028  -0.418   1.000
     14   0.087  -0.186   0.012  -0.416   0.920   1.000
     15   0.150  -0.060   0.030  -0.419   0.969   0.947   1.000
     16  -0.296  -0.424  -0.201  -0.120   0.346   0.532   0.324   1.000
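Pattern normalization as described above amounts to dividing each pattern vector by its Euclidean length. A minimal sketch follows; the function name and the sample patterns are illustrative only.

```python
from math import sqrt


def pattern_normalize(pattern):
    """Divide a pattern by the square root of the sum of its squared values (unit length)."""
    length = sqrt(sum(x * x for x in pattern))
    if length == 0:
        raise ValueError("an all-zero pattern has no direction to normalize")
    return [x / length for x in pattern]


# The same made-up response shape at two signal levels, e.g. two concentrations.
low = [3.0, 4.0, 0.0]
high = [30.0, 40.0, 0.0]

print(pattern_normalize(low))    # [0.6, 0.8, 0.0]
print(pattern_normalize(high))   # [0.6, 0.8, 0.0]
```

Both patterns map to the same point on the unit hypersphere, which is why the normalized data can be treated as independent of concentration and of overall sensor sensitivity.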
With eigenanalysis and regression analysis, the features can be reduced from 16 to 10 channels. As shown in the principal-components plot, Figure 6b, mode 3 and sensor responses from cells 1 and 2 for mode 2 were deleted without any loss of classification information.

All of the clustering routines available were used to investigate the data set. By use of 10 channels, the clusters shown in Figure 7 are generated by Euclidean distance clustering routines and are typical of the clustering results for both 16 and 10 channels. Grouping according to chemical class is more apparent using pattern normalization than using concentration correction. In addition, the circular profiles, shown in Figure 8, illustrate the uniqueness of each pattern. Because the information has been normalized, all patterns are the same size and equally important.
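Redundant channels can also be flagged with a much simpler screen than the multiple linear regression used in this work. The sketch below is a swapped-in pairwise technique, shown only to illustrate the idea: it keeps a channel unless its correlation with an already kept channel exceeds a threshold. The threshold of 0.95, the function names, and the data are all assumptions, and a pairwise screen cannot detect collinearities that involve three or more channels at once.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences (neither constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def screen_channels(data, threshold=0.95):
    """Greedy filter: keep a channel unless |r| with an already kept channel exceeds threshold."""
    columns = [list(c) for c in zip(*data)]
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(idx)
    return [k + 1 for k in kept]       # 1-based channel numbers


# Made-up responses: channel 2 nearly duplicates channel 1, channel 3 does not.
data = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.8, 2.0],
    [5.0, 10.1, 3.0],
]
print(screen_channels(data))
```

The result depends on the order in which channels are visited, so a screen of this kind is best treated as a first look before a fuller regression or eigenvalue analysis.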
Figure 6. Two-principal-components plot of the responses after pattern normalization and autoscaling: (a) using 16 channels and (b) using 10 rather than 16 channels. Numbers refer to the 22-compound data set given in Table I.

[Cluster diagram: columns "Cluster" and "Cluster Members"; legible entries include CO, NO, H2S, SO2, CH3COOH, CCl4, and NO2.]