Environ. Sci. Technol. 1987, 21, 891-897
Classification and Identification of Hazardous Organic Compounds in Ambient Air by Pattern Recognition of Mass Spectral Data
Donald R. Scott*
Environmental Monitoring Systems Laboratory, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711
William J. Dunn III and Silvio L. Emery
College of Pharmacy, University of Illinois at Chicago, Chicago, Illinois 60612
A new method for classification and identification of target air pollutants based on class modeling pattern recognition of autocorrelation transformed mass spectra has been developed. Target and other compounds can be classified into three classes: chloro compounds, bromoalkanes and bromoalkenes, and non-halobenzenes. After classification, the 78 target compounds are identified by comparing their mass spectra with those of their three nearest neighbors in the target set. This pattern recognition scheme has been applied to gas chromatography-mass spectrometry data from three field samples obtained in routine ambient air monitoring. The accuracy of target compound classification and identification was 88% and 85%, respectively. Seventy-five other compounds were tentatively identified in these samples by library search methods. Classification results showed good agreement between pattern recognition and library search techniques for the limited number of chloro and non-halobenzene compounds in the samples. However, a large number of hydrocarbons were misclassified as chloro compounds.
Introduction
In the routine monitoring of ambient air quality, gas chromatography-mass spectrometry (GC-MS) is used to determine a target group of organic pollutants. There are approximately 80 organic compounds on the present target list, including substituted benzenes, haloalkanes, and haloalkenes, with most containing chloro and/or bromo substituents. Compounds on the list are identified according to gas chromatography retention times and a combination of forward and reverse library search of the mass spectra (1).

The most common methods of computer-assisted identification of nontarget compounds from mass spectra are library search and interpretative procedures (2, 3). These methods have been reviewed recently by Martinsen (4). The use of these procedures during routine analysis of large numbers of samples with hundreds of compounds in each sample requires extensive computer time and considerable expense. Also, the identification of unknown spectra at trace levels in complex samples such as ambient air by library search techniques is difficult and frequently gives incorrect results.

We have been developing alternatives to these traditional techniques for compound classification and identification by using computational pattern recognition methods (5, 6). In a recent study (7) of the 78 compounds considered here, it was found that the use of information theory and SIMCA pattern recognition (5) resulted in the successful classification of binary-encoded reference mass spectra into four main groups involving chloro and bromo substituents. In another recent study of these compounds (8), a proposed hierarchical classification and identification scheme was successfully applied to the autocorrelation transformed mass spectra of relatively simple routine GC-MS calibration data files. The classes used in the hierarchical classification were chloro compounds, bromo compounds, and non-halobenzenes. This scheme uses SIMCA class modeling and k nearest-neighbor pattern recognition techniques and achieved ca. 86% classification and identification accuracy for the target compounds. In this paper, the application of this latter pattern recognition scheme to more complex GC-MS data files obtained from actual field samples is discussed.
Class Modeling Pattern Recognition
The full mass spectrum of a compound can be represented numerically as a vector in which the elements of the vector are the intensities of the ions observed in the mass spectrum. This is shown in eq 1 for the mass spectrum of compound k.

$$I_{k,i} = (I_{k,35},\ I_{k,36},\ \ldots,\ I_{k,256}) \qquad (1)$$

Each spectral vector also represents a point in a multidimensional measurement space. The dimensionality of the measurement space is determined by the number of mass intensities considered. For the purpose of this paper, only the mass interval 35-256 will be of interest.

It has been suggested by McGill and Kowalski (9) and by Wold and Christie (10) that the autocorrelation transform (11) of the mass spectrum rather than the full spectrum is more appropriate for pattern recognition applications. This transform of a mass spectrum is given in eq 2 and is equivalent to multiplication of the mass spectrum by itself after being shifted j mass units.

$$r_j = \frac{\sum_i (I_i - \bar{I})(I_{i+j} - \bar{I})}{\sum_i (I_i - \bar{I})^2} \qquad (2)$$

We use here the transform centered relative to the mean, $\bar{I}$. The autocorrelation coefficient, $r_j$, represents the correlation between all ions separated by j mass units throughout the spectrum. It is calculated here for $1 \le j \le 100$, which assumes that there is little information or statistical significance in correlations of ions separated by more than 100 mass units.

The usual mass spectrum and autocorrelation transformed spectrum of toluene are shown in Figure 1. The mass spectrum is characterized by ions at m/e 39, 50, 51, 63, 64, 65, 91, and 92. In alkylbenzenes these ions are separated by fragments that are of general formula $\mathrm{C}_x\mathrm{H}_y$, where x = 1 and 2 and y = 2, 3, and 4. These fragments are found in the autocorrelation transformed spectrum of toluene at lags of 26-29. Thus, the autocorrelation transform converts the conventional mass spectrum to a spectrum of frequency of loss of neutral fragments of a given mass and allows the comparison of compounds and unknowns on this basis. The autocorrelation transformed mass spectrum is a more natural presentation of the fragmentation processes occurring in a mass spectrometer than is the usual mass spectrum. It also has a number of
Figure 1. Mass spectrum (intensity vs m/e) and autocorrelation transformed mass spectrum ($r_j$ vs mass of fragment lost) for toluene.
advantages for pattern recognition applications.

The transformed mass spectrum of a compound, expressed in vector notation as in eq 3, provides a means of comparing compounds with regard to similarity of fragmentation patterns.

$$r_j = (r_1,\ r_2,\ r_3,\ \ldots,\ r_p) \qquad (3)$$

If the spectra for a class of similar compounds are projected into this fragmentation pattern space, their point representations will be located in a well-defined cluster. The spatial spread of the points in the cluster will represent the variation in fragmentations, i.e., chemical structure, for the class. This class variation can be modeled with principal components models (5) as shown in eq 4.

$$r_{kj} = \bar{r}_j + \sum_{a=1}^{A} t_{ka} b_{aj} + e_{kj} \qquad (4)$$

Here, $\bar{r}_j$ is the mean correlation coefficient for lag j, and $e_{kj}$ is the residual of the coefficient for compound k at lag j. The $t_{ka}$ are principal component scores for the compounds, k, and the $b_{aj}$ are the loadings for the respective components, a. The number A of statistically significant components that model the systematic class variation discussed above is determined by cross-validation (5).

The transformed mass spectra can be tabulated in matrix form. Each row consists of the autocorrelation spectrum for the respective compound, and the columns contain the respective coefficients for each mass fragment over all compounds. After exploratory data analysis, the compounds that are found to be similar are placed in classes, i.e., training sets, and principal component models are derived for each of the training sets. The residuals for a class are used to calculate a standard deviation, which is used to establish a confidence interval around that class. A class model with its confidence interval therefore represents a volume in pattern space where the members of a class and unknowns similar to it have the highest probability of being observed. The fit of unknown spectra to class models on the basis of distance from the models determines the class assignments of the unknowns. Ideally, a compound will fall inside the confidence interval for one class and outside that for the others.

Figure 2. Classification and identification scheme for application of class modeling pattern recognition to mass spectral data.

Methods
Hardware and Software. An IBM PC-XT was used in this study. The microcomputer was equipped with 640K of memory and an 8087 math coprocessor. Commercial software was used to download mass spectral data files from the Finnigan mass spectrometer via a modem. Software was written in BASIC to format, transform, and plot spectral data. A modified version of the SIMCA program was used for the principal components analysis. Additional software was written for the k nearest-neighbor analysis, the correlation coefficient calculations, and the model fitting. An output file of the classification and identification results was generated, which could be listed as a summary report.

Derivation of Class Models. Since this aspect of the problem has been discussed elsewhere (8), only a brief discussion will be presented here. The low-resolution mass spectra of the 78 target compounds were obtained from the EPA-NIH Mass Spectral Library on an INCOS data system. A list of the 78 compounds is given in Table I. The internal standards for retention times, 1-fluoro-2-iodobenzene and perfluorotoluene, are included, as are two ethers, dioxane and tetrahydrofuran. The latter two compounds are on the target list but do not fit into the training sets. Three chemical classes of compounds were found in the target set from preliminary principal component analysis of the transformed mass spectra (8). These are nonhalogenated benzenes (class 1), chlorobenzenes, chloroalkanes, and chloroalkenes (class 2), and bromo- and bromochloroalkanes and -alkenes (class 3). Principal components models were derived from the autocorrelation transformed spectra for each class by use of modeling power (5) for variable selection. Twelve variables were included in the class 1 model, 18
Table I. Compounds Included in This Pattern Recognition Study

no.  compound                      class^a
1    perfluorotoluene              0
2    1-fluoro-2-iodobenzene        0
3    p-xylene                      1
4    1,3,5-trimethylbenzene        1
5    isopropylbenzene              1
6    n-butylbenzene                1
7    1-methyl-4-isopropylbenzene   1
8    o-dichlorobenzene             2
9    p-dichlorobenzene             2
10   1-chloro-2-methylbenzene      2
11   1-chloro-4-methylbenzene      2
12   p-chlorostyrene               2
13   1,1-dichloroethane            2
14   1,1,1,2-tetrachloroethane     2
15   1,2,3-trichloropropane        2
16   3-chloro-1-propene            2
17   2-chlorobutane                2
18   1,3-dichlorobutane            2
19   1,4-dichlorobutane            2
20   cis-1,4-dichloro-2-butene     2
21   3,4-dichlorobut-1-ene         2
22   1,4-dioxane                   0
23   1-chloro-2,3-epoxypropane     2
24   2-chloroethoxyethene          2
25   acetophenone                  1
26   benzonitrile                  1
27   benzene                       1
28   toluene                       1
29   o-xylene                      1
30   m-xylene                      1
31   ethylbenzene                  1
32   styrene                       1
33   chlorobenzene                 2
34   bromobenzene                  3
35   m-dichlorobenzene             2
36   1-chloro-3-methylbenzene      2
37   chloroform                    2
38   carbon tetrachloride          2
39   bromochloromethane            3
40   bromotrichloromethane         3
41   dibromomethane                3
42   bromoform                     3
43   1,2-dichloroethane            2
44   1,1,1-trichloroethane         2
45   1,1,2-trichloroethane         2
46   1,1,2,2-tetrachloroethane     2
47   pentachloroethane             2
48   1,1-dichloroethene            2
49   trichloroethene               2
50   tetrachloroethene             2
51   bromoethane                   3
52   1,2-dibromoethane             3
53   1-chloropropane               2
54   2-chloropropane               2
55   1,2-dichloropropane           2
56   1,3-dichloropropane           2
57   1-bromo-3-chloropropane       3
58   1,2-dibromopropane            3
59   2,3-dichlorobutane            2
60   tetrahydrofuran               0
61   benzaldehyde                  1
62   1-bromo-1-chloroethane        3
63   2,2-dibromopropane            3
64   2-bromopropene                3
65   2-bromopropane                3
66   3-bromopropene                3
67   1-bromopropane                3
68   1-chlorobutane                2
69   1-bromo-2-chloroethane        3
70   bromodichloromethane          3
71   1-bromobutane                 3
72   2,2-dichlorobutane            2
73   dibromochloromethane          3
74   1,1,2-trichloropropane        2
75   1,3-dibromopropane            3
76   1,1,1,2-tetrachloropropane    2
77   1,2,2,3-tetrachloropropane    2
78   1,3-dibromobutane             3
79   1,1,2,3-tetrachloropropane    2
80   1,4-dibromobutane             3

^a Classes: class 1 = non-halo aromatics; class 2 = chlorohydrocarbons including aromatics and aliphatics; class 3 = bromo- and bromochlorohydrocarbons; class 0 = internal standards and dioxane and tetrahydrofuran.
were included in class 2, and 40 were included in class 3. No more than three principal components were required for the models. Validation of the class models by classification of the training compounds resulted in 86% being correctly assigned within two standard deviations of the appropriate models.

Criteria for Classification and Identification. The spectral classification and identification scheme is given in Figure 2. To identify a specific target compound, the distances from the unknown to the three nearest neighbors in its assigned class were obtained, and the correlation coefficients of the unknown mass spectrum with the spectra of these three nearest neighbors were calculated. The correlation coefficient was calculated for the usual mass spectra rather than for the transformed spectra. This was necessary since some compounds have identical transformed spectra even though their mass spectra are quite different, which occurs when mass spectra differ only by being shifted by a constant mass. A minimum correlation coefficient of r ≥ 0.90 is required for identification. This allows for variation in spectra due to instrument differences and to impurities in the unknown.

In order for a compound to be assigned to a chemical class, it must be within two standard deviations of a specific class model. In some cases overlap of classes is observed, and an unknown may fall within these bounds for two different classes. In such cases, the correlation coefficients of the unknown spectrum with the mass spectra of the three nearest neighbors in the class of closer fit are calculated first. If the unknown is not identified as a member of the class of closer fit, the process is continued for the other class. If an identification is not obtained at this point, the unknown is assumed not to be a member of the training sets.

Compound Identification in Field Samples. The GC-MS data files for three field samples were obtained during routine monitoring of ambient air by the Analytical Services Branch, Environmental Monitoring Systems Laboratory, U.S. EPA, Research Triangle Park, NC. Organic pollutants in ambient air are concentrated on Tenax solid sorbent and subsequently thermally desorbed into a Finnigan 4510 GC-MS equipped with an INCOS operating system. The identities of the target compounds were determined by the Analytical Services Branch using both GC retention times and a combination of forward and reverse spectral matching techniques with stringent matching parameters (1). The identification of other compounds not on the target list was based on a Finnigan search technique using "FIT" as the primary matching parameter and "PURITY" and "RFIT" as secondary bounded parameters (12). PURITY is the square of the correlation coefficient between the unknown and library spectra computed over all peaks. FIT
Table II. Classification and Identification of Target Compounds in Tenax GC-MS Field Samples
identity^a: 1,1,1-trichloroethane; benzene; carbon tetrachloride; toluene; trichloroethene; tetrachloroethene; ethylbenzene; m/p-xylene^f; o-xylene; isopropylbenzene; styrene; m/p-dichlorobenzene^f; benzaldehyde; 1,2,4-trimethylbenzene; 1,3,5-trimethylbenzene
library search PURITY for sample 56 58 63 94 98 75 95 14 96 96 93 87 90
95 82
93 97
96 12
95
96 93 89 93
97 94 89 88
RFITb for sample 56 58 63 96 98 15 95 75 98
97 83
95 98
2 1 2
96 73
95
1
96
96 94 90 98
97 94 90 98 80 93
94 87 92
80 90 83 93
81
81
92
93
class'
85 95 82
83 92
95
2 2 1 1 1 1 1 2 1 1 1
pattern recognition daisd for sample 56 58 63 identitye 2 1
2 1
2 1
C C
1 U
1
C
1 1 1
1 1 1 1 2
1 1 U 2 1 1 1 1
1
C C C C Cg
2
1
C C C C
1
1
1
1
1
^a Identified by GC retention times and a combined forward and reverse search.
^b PURITY is the square of the correlation coefficient between the library and unknown spectra taken over all masses. RFIT is similar, but taken only over the masses in the unknown spectrum. Maximum value is 99.
^c 1 = non-halobenzene compound; 2 = chloroalkane, chloroalkene, and chlorobenzene compounds; 3 = bromo- and bromochloroalkanes and -alkenes; U = unclassified.
^d Class determined by pattern recognition modeling. No entry means that the sample did not have the compound present.
^e C designates a correct identity.
^f These isomers are not resolved by the GC and have identical mass spectra.
^g In sample 58, isopropylbenzene was misidentified.
and RFIT are similar parameters computed over only the peaks in the library and unknown spectra, respectively. PURITY is a measure of the similarity of the unknown and library spectra. FIT is a measure of the degree to which the library spectrum is contained in the unknown spectrum. RFIT is a measure of the degree to which the unknown spectrum is contained in the library spectrum. A minimum FIT value of 70 out of 99.9 was required for a reportable identification, and the compound with the highest FIT value was reported. The RFIT/PURITY ratio also had to be less than 1.05. The searches were conducted on the 25 400 mass spectra in the EPA-NIH data base and were restricted to compounds with molecular weights less than 321. Some of the compounds identified in the field samples are probably due to artifacts (13). Each sample contained approximately 50 identifiable compounds, with a total of about 100 different compounds for the three samples. Other spectra were found but could not be identified.
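The software steps described in this section (transforming spectra, locating the three nearest neighbors, and computing correlation coefficients) can be sketched in a few lines. The Python code below is a minimal illustration and is not the BASIC and SIMCA software used in the study. The function names, the use of NumPy, and the Euclidean distance used to rank neighbors are assumptions made for the example, and `library` is assumed to contain only the reference spectra of the class already assigned to the unknown.

```python
import numpy as np


def autocorrelation_transform(intensities, max_lag=100):
    """Mean-centered autocorrelation coefficients r_1..r_max_lag (eq 2)."""
    x = np.asarray(intensities, dtype=float)
    d = x - x.mean()                      # center relative to the mean intensity
    denom = np.sum(d * d)
    if denom == 0.0:                      # flat spectrum: no correlation structure
        return np.zeros(max_lag)
    return np.array([np.sum(d[:-j] * d[j:]) / denom
                     for j in range(1, max_lag + 1)])


def identify(unknown, library, names, k=3, r_min=0.90):
    """Return the name of the best-matching reference compound, or None.

    unknown : intensity vector of the unknown spectrum
    library : 2-D array, one reference spectrum per row (same mass range)
    names   : compound names, one per library row (hypothetical labels)
    """
    unknown = np.asarray(unknown, dtype=float)
    library = np.asarray(library, dtype=float)
    u_t = autocorrelation_transform(unknown)
    lib_t = np.array([autocorrelation_transform(s) for s in library])
    # Nearest neighbors are ranked in the transformed space.
    # Euclidean distance is an assumption of this sketch.
    nearest = np.argsort(np.linalg.norm(lib_t - u_t, axis=1))[:k]
    best_name, best_r = None, r_min
    for i in nearest:
        # The correlation is computed on the ordinary spectra, as in the text.
        r = np.corrcoef(unknown, library[i])[0, 1]
        if r >= best_r:
            best_name, best_r = names[i], r
    return best_name
```

Under these assumptions, a call such as `identify(spectrum, class_spectra, class_names)` returns a compound name only when one of the three nearest neighbors reaches r ≥ 0.90, and `None` otherwise, which corresponds to an unknown that is not identified within that class.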
Figure 3. Mass spectrum of unknown (sample 63, scan 703) and styrene.

Results and Discussion
Target Compounds. Target compounds identified and classified in these samples are listed in Table II. The target compounds were identified by the use of both gas chromatography retention times and a combined forward and reverse library search. Therefore, the identification of the target compounds is much more reliable than that for other compounds, which used only a library search. The matching parameters, PURITY and RFIT, also are listed in the table. Fifteen different target compounds were identified from the retention times and library search in the three samples. There were 33 target compound occurrences detected in the samples (13, 10, and 10 in samples 56, 58, and 63, respectively), and no bromo compounds (class 3) were found in any sample. The pattern recognition results show the class for each compound as determined from the class modeling and an indication of correct compound identification if one could be made from the nearest-neighbor mass spectral correlation coefficient results.

The application of the pattern recognition scheme to the transformed data for the target compounds resulted in 11 of 13 of the compounds in sample 56, 9 of 10 in sample 58, and 9 of 10 in sample 63 being correctly classified. Overall, 29 of 33 classifications (88%) were correct. The compound identification results were 85% accurate for sample 56, 80% for sample 58, and 90% for sample 63, for an overall accuracy of 85%. The accuracy of identification and classification of the target compounds in these field samples is very good in view of the fact that only mass spectral information is used for this purpose. The accuracy rates in these field samples are almost identical with those found for the same pattern recognition procedures used with GC-MS calibration data containing all of the target compounds (8). This indicates that the presence of compounds other than the target compounds in the field samples does not cause any major problems with the classification and identification of the target compounds. This should be true as long as the resolution of all compounds in the samples by the gas chromatography is adequate to yield reasonable mass spectra.

The target compounds that were incorrectly classified and identified were carbon tetrachloride in sample 56, trichloroethene in samples 56 and 58, and styrene in sam-
Table III. Classification and Identification of Other Compounds in Tenax GC-MS Field Samples
Columns: identity^a; library search PURITY^b for samples 56, 58, and 63; class^c; class^d by pattern recognition for samples 56, 58, and 63
identity^a: 2-methylpropane; trichlorofluoromethane; oxalic acid; butane; 2-methylbutane; 2,2-dimethylbutane; isopropyl formate; 2-methyl-1-propen-1-one; 1,1,2-trichloro-1,2,2-trifluoroethane; 2-butenal; 1,3-pentadiene; pentane; 2-methylpentane; 2-butanone; 2-methylfuran; hexane; 4-methylpent-1-ene; perfluorotoluene^e; 3-methyl-2-butanone; cyclohexane; 2-methylhexane; 2,2,3,3-tetramethylbutane; 2,2,4-trimethylpentane; heptane; 2-heptene; methylcyclohexane; 2,4-dimethylhexane; 2,5-dimethylhexane; dimethyl disulfide; 5-(pentyloxy)-1-pentene; 4-methyl-2-pentanone; 1-heptanol; 3-ethyl-4-methylhexane; 1,2,4-trimethylcyclopentane; 1,2,3-trimethylcyclopentane; N-pentylidenemethanamine; 2,3-dimethylhexane; 2-ethyl-1-butanol; 2-methylheptane; 3-methyleneheptane; hexanal; 1-ethyl-2-methylcyclohexane; 1,2-dimethylcyclohexane; octane; 4,5-dimethyl-1-hexene; oct-2-ene; (2-methylpropyl)cyclopentane; 1,1,3,3-tetramethylcyclopentane; 2,2-dimethylheptane; hexamethylcyclotrisiloxane; ethylcyclohexane; 1,1,3-trimethylcyclohexane; 1,3,5-trimethylcyclohexane; 2,5-dimethylheptane; 5-methyl-1-hexanol; 2,3-dimethylheptane; 3-heptanone; 4-ethylheptane; 5-methyl-3-hexanone; heptanal; 1-ethyl-2-methylcyclohexane; 1-ethyl-4-methylcyclohexane; 2,3-dihydro-4-methylfuran; 2,2,4-trimethylheptane; nonane; 2,3-dimethyloctane; 2,3,3-trimethylbut-1-ene; 1,7,7-trimethyltricyclo[2.2.1]heptane; 2-buten-1-ol; 1-ethyl-2-methylbenzene; 1-methylhexyl hydroperoxide; decane
94 84
91 89
63
classC
91 93 81
U 2? U U U U U U 2? U U U U U U U U 2? U U
88 90
83 80 85 92 95 72 82
88 79 78 82 78 75 93 90 95 84
97 84 98 96
85 99 90 96
91 96 96 85 76 97 95 94 93 97 94 89
94 99 77 86 82 81 96
94 97 75 96
94 98 91 96 98
98 92 94 95 98
95 93 93 98 81 94
83 89 94 91 89 95 89
89 95 94 91
86 86 72
93
92 92 92
91 92 90
95 93 91
90 81
97 96 90
96 96 94 75 87
81 90 74
73
91 71 94
U U U U U U U U U
classd by pattern recognition for sample 56 58 63 2 2
2 2 2
2
1 2
2 2 2 1 2
U 2 2 3
2
2 2 2
2 2
2 2 2
2
U U U
3 2 2 2
U U U U U U U U U U U 1
U U
2
2 2 2 2 2 1
2 2
2 2
2 2 2 2 2 2 2
2 U 2 2 2
U U U U U U U U U U U U U
1 2 2 2 3 2
2
U U U U U U U
2
2 3
U U U U U U
2 2 2
1 2 2 2 1 2
2 U U
3 2 1 2 1
2 1 1 2 1 2 2
2 2 2
2 2 1 3 2 1
2 U 2
3
2 2
2
2 2 2
2 1 2
2
1 2 2
Table III (Continued)
Columns: identity^a; library search PURITY^b for samples 56, 58, and 63; class^c; class^d by pattern recognition for samples 56, 58, and 63
identity^a: 1-fluoro-2-iodobenzene^e; 2-ethylhexanal; 2-methylnonane
96
96 91 91
96 91
2? U U
2
58
63
2
2 2
3 2
^a Identified by mixture mass spectral library search only.
^b PURITY is the square of the correlation coefficient between the library and unknown spectra taken over all masses. Maximum is 99.
^c Class assuming that the spectral search compound identification is correct. U is unclassified in the three-class scheme: (1) non-halobenzene compounds; (2) chloroalkane, chloroalkene, and chlorobenzene compounds; (3) bromo- and bromochloroalkanes and -alkenes.
^d Class determined by pattern recognition modeling.
^e Internal standards.
Figure 4. Mass spectra for unknown (sample 63, scan 667) identified by library searching methods to be 5-methyl-1-hexanol and that of xylene.
ple 63. Isopropylbenzene was correctly classified in sample 56, but was not correctly identified. The first four misidentified spectra contain ions of coeluting compounds. This is shown in the low values of PURITY and RFIT (72-80) for these compounds compared to the much higher values for the other compounds, i.e., PURITY and RFIT values of 81-98. PURITY measures the overall match of the two mass spectra, unknown and library, and RFIT measures the match with the unknown as the reference spectrum. Trichloroethene, isopropylbenzene, dichlorobenzene, and 1,3,5-trimethylbenzene were not identifiable without the GC retention times, i.e., by mass spectral search alone, in these samples. The latter three compounds, except for isopropylbenzene in sample 56, were correctly identified with the pattern recognition scheme, which does not require GC retention times.

An example of the coelution problem is the spectrum identified with retention time and spectral search as styrene in sample 63. The mass spectrum of this substance and the training data for styrene are given in Figure 3. The correlation coefficient of this unknown mass spectrum with the styrene spectrum in the training set data is only 0.85, which is less than the minimum of 0.90 required for identification in the pattern recognition scheme. Therefore, this spectrum would not be identified as that of styrene in the proposed scheme. It can be seen that the spectrum of the unknown contains a number of ions below m/e 56 that do not belong to the styrene spectrum.

Other Compounds. The results of the spectral search identification and the pattern recognition classification results for the other compounds found in these samples
are given in Table III. The values of the search parameter, PURITY, are also listed. Since only spectral search results and no retention times were used in the compound identifications, these identifications are much less certain than those of the target compounds. A class was assigned to each of the compounds from the three pattern recognition classes, assuming that the search identifications were correct. These class assignments are compared with those obtained from the pattern recognition scheme in Table III. Since the pattern recognition scheme cannot identify compounds that were not in the target set, this is the only comparison that can be made.

There is the strong possibility that some search identifications may be incorrect although the PURITY values are generally good. An example of an incorrect spectral identification is illustrated in Figure 4. This is the mass spectrum identified as that of 5-methyl-1-hexanol by spectral search with a relatively low PURITY and FIT of 72 and 85, respectively. Visual examination of this spectrum shows that this is obviously not the case. The molecular ion for the alcohol (m/e 116) does not appear in the spectrum. This is probably the spectrum for m-xylene, a target compound, as it is identified by the pattern recognition scheme.

There were 75 different compounds identified in 120 occurrences in the three samples. The classification results agree very well for the two class 1 spectra (1-ethyl-2-methylbenzene) and the class 2 spectra, if the fluoro and fluorochloro compounds are identified with class 2, with 2 of 2 and 10 of 11 agreements, respectively. No class 3 (bromo) compounds were identified, but 8 of 120 compound occurrences were erroneously classified as such.

Incorrect Classification. One obvious problem with applying the class modeling approach to these data is that a large number of compounds that should be unclassified are erroneously classified into a target class.
For example, a very large number of the alkanes and alkenes, which were not on the target list and were not modeled, were incorrectly classified into class 2. This is due to the presence of aliphatic fragments in the transformed spectra in both the alkyl chlorides and hydrocarbons giving natural class overlap. This reveals a problem with classification based on similarity modeling and its use for class discrimination of some chemical data. Models developed for classification using modeling power contain variables optimal for determining similarity of the data within the training sets. These variables may, or may not, be optimal for class discrimination, especially if there is no class model developed for nonmembers of the training sets. In the classification step, it may be appropriate to use all of the data in the unknown, rather than just those variables selected by modeling power in the training phase. In this way, information about class differences (discrimination) can be used in the classification step.
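The class-modeling step behind these results can be illustrated with a short sketch. The Python code below fits a principal components model to one training class and accepts an unknown when its residual standard deviation lies within two class standard deviations, in the spirit of the SIMCA procedure described under Class Modeling Pattern Recognition. The function names, the degrees-of-freedom convention, and the use of NumPy are assumptions for illustration; the modified SIMCA program used in the study may differ in detail.

```python
import numpy as np


def fit_class_model(X, n_components):
    """Fit a principal components model to one training class.

    X : rows = training compounds, columns = selected autocorrelation lags.
    Returns (mean, loadings, s0), where s0 is the class residual
    standard deviation.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of vt are the principal component loadings.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = vt[:n_components]
    resid = Xc - (Xc @ loadings.T) @ loadings
    n, p = X.shape
    # Degrees of freedom follow a common SIMCA convention (an assumption here).
    dof = (n - n_components - 1) * (p - n_components)
    s0 = np.sqrt(np.sum(resid ** 2) / dof)
    return mean, loadings, s0


def residual_sd(x, model):
    """Residual standard deviation of one spectrum after projection on a model."""
    mean, loadings, _ = model
    xc = np.asarray(x, dtype=float) - mean
    e = xc - (xc @ loadings.T) @ loadings
    return np.sqrt(np.sum(e ** 2) / (len(xc) - loadings.shape[0]))


def classify(x, models, limit=2.0):
    """Labels of all classes whose model fits x within `limit` class SDs."""
    return [label for label, m in models.items()
            if residual_sd(x, m) <= limit * m[2]]
```

An empty list from `classify` corresponds to an unclassified compound, and a list with two labels corresponds to the class overlap case. Because only the variables retained in each model enter the residual, a nonmember that resembles a class on those variables can be accepted, which is the behavior discussed in this section.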
Therefore, it should be possible to improve class assignment by use of variables in the unknown that are not included in the class model. This could be done by giving these variables nonzero weights in the classification step. This may be considered by some to be a major variation from the philosophy of classification based on class modeling. In the training phase, variables that contain noise or no information about class assignment are deleted. This can be interpreted as reducing the dimensionality of the pattern space. This could also be interpreted as nulling those variables in the training sets but keeping them active for use in discrimination. Since this is a major question in classification based on class modeling, it will be addressed in another paper with additional examples.
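A hypothetical illustration of this idea is sketched below in Python. Variables retained in a class model receive a weight of 1, and variables deleted during training receive a small nonzero weight, so that a nonmember that matches the class on the modeled variables can still be rejected. For brevity the distance is taken from the class mean rather than from a principal components model. The function name, the weights, the threshold, and the numerical values are assumptions and are not taken from the study.

```python
import numpy as np


def weighted_distance(x, class_mean, weights):
    """Weighted root-mean-square deviation of x from a class mean."""
    w = np.asarray(weights, dtype=float)
    d = np.asarray(x, dtype=float) - np.asarray(class_mean, dtype=float)
    return np.sqrt(np.sum(w * d ** 2) / np.sum(w))


# Hypothetical four-variable example: the first two variables were retained
# in the class model, the last two were deleted during training.
class_mean = [0.8, 0.1, 0.0, 0.0]
nonmember = [0.8, 0.1, 0.6, 0.5]   # matches the class on modeled variables only
threshold = 0.1                    # illustrative acceptance limit

model_only = weighted_distance(nonmember, class_mean, [1, 1, 0, 0])
with_extra = weighted_distance(nonmember, class_mean, [1, 1, 0.5, 0.5])

print("accepted with model variables only:", model_only <= threshold)
print("accepted with nonzero extra weights:", with_extra <= threshold)
```

In this toy case the nonmember is accepted when only the modeled variables are used and rejected once the deleted variables carry nonzero weight. How such weights should be chosen for real spectra is the open question noted above.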
Conclusions
The proposed pattern recognition procedure performed very well in classifying and identifying the target compounds in these complex field samples. Its performance in classifying other compounds found in the field samples was less satisfactory. The major problem with nontarget compounds was misclassification of compounds that were not members of any modeled class as members of modeled classes. To improve its performance on these latter compounds, models for the nontarget compounds should be added to the scheme, and/or the weighting of the variables should be changed in the classification step to improve discrimination. Overall, the results on these complicated field samples are encouraging. It should be emphasized that no GC retention time information is used in the pattern recognition scheme, only mass spectral information. One cannot expect perfect results with the scheme as proposed here, but this procedure certainly can yield useful screening results. It could also be used to select mass spectral data files for more detailed manual and/or computer interpretation.

Acknowledgments
We gratefully acknowledge the assistance of Lynn Wright, Curtis Morris, and Joe Bumgarner of EMSL, U.S. EPA, Research Triangle Park, NC, in providing the GC-MS data files. We thank the Research Resources Center of the University of Illinois at Chicago for the use of the data processing unit on the Finnigan 4510, which was used for data transfer.

Registry No. p-Xylene, 106-42-3; 1,3,5-trimethylbenzene, 108-67-8; isopropylbenzene, 98-82-8; butylbenzene, 104-51-8; 1-methyl-4-isopropylbenzene, 99-87-6; o-dichlorobenzene, 95-50-1; p-dichlorobenzene, 106-46-7; 1-chloro-2-methylbenzene, 95-49-8; 1-chloro-4-methylbenzene, 106-43-4; p-chlorostyrene, 1073-67-2; 1,1-dichloroethane, 75-34-3; 1,1,1,2-tetrachloroethane, 630-20-6; 1,2,3-trichloropropane, 96-18-4; 3-chloro-1-propene, 107-05-1; 2-chlorobutane, 78-86-4; 1,3-dichlorobutane, 1190-22-3; 1,4-dichlorobutane, 110-56-5; 1,4-cis-dichloro-2-butene, 1476-11-5; dibromomethane, 74-95-3; bromoform, 75-25-2; 1,2-dichloroethane, 107-06-2; 1,1,1-trichloroethane, 71-55-6; 1,1,2-trichloroethane, 79-00-5; 1,1,2,2-tetrachloroethane, 79-34-5; pentachloroethane, 76-01-7; 1,1-dichloroethene, 75-35-4; trichloroethane, 25323-89-1; tetrachloroethene, 127-18-4; bromoethane, 74-96-4; 1,2-dibromoethane, 106-93-4; 1-chloropropane, 540-54-5; 2-chloropropane, 75-29-6; 1,2-dichloropropane, 78-87-5; 1,3-dichloropropane, 142-28-9; 1-bromo-3-chloropropane, 109-70-6; 1,2-dibromopropane, 78-75-1; 2,3-dichlorobutane, 7581-97-7; tetrahydrofuran, 109-99-9; 3,4-dichlorobut-1-ene, 760-23-6; 1,4-dioxane, 123-91-1; 1-chloro-2,3-epoxypropane, 106-89-8; 2-chloroethoxyethane, 110-75-8; acetophenone, 98-86-2; benzonitrile, 100-47-0; benzene, 71-43-2; toluene, 108-88-3; o-xylene, 95-47-6; m-xylene, 108-38-3; ethylbenzene, 100-41-4; styrene, 100-42-5; chlorobenzene, 108-90-7; bromobenzene, 108-86-1; m-dichlorobenzene, 541-73-1; 1-chloro-3-methylbenzene, 108-41-8; chloroform, 67-66-3; carbon tetrachloride, 56-23-5; bromochloromethane, 74-97-5; bromotrichloromethane, 75-62-7; benzaldehyde, 100-52-7; 1-bromo-1-chloroethane, 593-96-4; 2,2-dibromopropane, 594-16-1; 2-bromopropene, 557-93-7; 2-bromopropane, 75-26-3; 3-bromopropene, 106-95-6; 1-bromopropane, 106-94-5; 1-chlorobutane, 109-69-3; 1-bromo-2-chloroethane, 107-04-0; bromodichloromethane, 75-27-4; 1-bromobutane, 109-65-9; 2,2-dichlorobutane, 4279-22-5; dibromochloromethane, 124-48-1; 1,1,2-trichloropropane, 598-77-6; 1,3-dibromopropane, 109-64-8; 1,1,1,2-tetrachloropropane, 812-03-3; 1,2,2,3-tetrachloropropane, 13116-53-5; 1,3-dibromobutane, 107-80-2; 1,1,2,3-tetrachloropropane, 18495-30-2; 1,4-dibromobutane, 110-52-1.
Literature Cited
(1) Berkley, R. E.; Bumgarner, J. E.; Driscoll, D. J.; Morris, C. M.; Wright, L. H. Standard Operating Procedure for the GC/MS Determination of Volatile Organic Compounds Collected on Tenax; U.S. Environmental Protection Agency, Environmental Monitoring Systems Laboratory: Research Triangle Park, NC, October 1983; EMSL/RTP-SOPEMD-020.
(2) Venkataraghavan, R.; McLafferty, F. W.; Amy, J. W. Anal. Chem. 1967, 39, 178.
(3) McLafferty, F. W.; Stauffer, D. B. J. Chem. Inf. Comput. Sci. 1985, 25, 245.
(4) Martinsen, D. P. Appl. Spectrosc. 1981, 35, 255.
(5) Wold, S. Pattern Recognition 1976, 8, 127.
(6) Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics; Wiley: New York, 1986; Chapter 6, p 242.
(7) Scott, D. R. Anal. Chem. 1986, 58, 881.
(8) Dunn, W. J., III; Koehler, M. G.; Emery, S. L.; Scott, D. R. Chemometrics Intell. Lab. Sys., in press.
(9) McGill, J. R.; Kowalski, B. R. J. Chem. Inf. Comput. Sci. 1978, 18, 52.
(10) Wold, S.; Christie, H. J. Anal. Chim. Acta 1984, 165, 51.
(11) Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters; Wiley: New York, 1978; pp 62-63.
(12) Sokolow, S.; Karnofsky, J.; Gustafson, P. The Finnigan Library Search Program; Finnigan Application Report on Incos Data System, No. 2; Finnigan Instruments: San Jose, CA, March 1978.
(13) Walling, J. F.; Bumgarner, J. E.; Driscoll, D. J.; Morris, C. M.; Riley, A. E.; Wright, L. H. Atmos. Environ. 1986, 20, 51.
Received for review December 9, 1986. Accepted May 29, 1987. Although the research described in this article has been funded by the U.S. Environmental Protection Agency under Cooperative Agreement CR-811617 with the University of Illinois at Chicago, it has not been subjected to Agency review. It therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred. The mention of trade names or commercial products does not constitute endorsement or recommendation for use.