Anal. Chem. 1999, 71, 4134-4141
Prediction of Substructure and Toxicity of Pesticides with Temperature Constrained-Cascade Correlation Network from Low-Resolution Mass Spectra

Chunsheng Cai† and Peter de B. Harrington*
Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979
Artificial neural networks are trained to predict the toxicity or active substructures of organophosphorus pesticides and then are applied to screening GC/MS data for environmentally hazardous compounds. Every mass spectral scan in the chromatographic run is classified, and separate chromatograms are obtained for either toxicity or substructure classes. Classification of mass spectra allows the detection of chromatographic peaks from potentially hazardous compounds that may be missing from the reference database. The neural network models predict substructures and toxicity from mass spectra without first determining the complete configurational structure of the pesticides. Temperature constrained-cascade correlation networks (TCCCN) were used because they are self-configuring networks that train rapidly and robustly. The toxicity classes are defined by the World Health Organization, and the substructure classes are standard organophosphorus pesticide groupings. The TCCCN models are used to mathematically resolve peaks in the chromatograms by substructure and toxicity. Evaluations yielded classification rates of 97 and 84% for substructure and toxicity, respectively.

Organophosphorus pesticides, which have been favored over more persistent organochlorine pesticides because of their faster degradation rates, have played an important role in global agricultural chemistry for many years and are still widely used today.1 Most organophosphorus pesticide (OPP) compounds are phosphates, phosphorodithioates (phosphorothionothiolates), and phosphorothioates (phosphorothionates). Other OPPs fall into the following structural categories: phosphorothiolate, phosphorodithiolate, phosphoramide (phosphoramidate), phosphorodiamidate, phosphonate, and phosphinate. OPP compounds can be categorized into six classes based on their basic structure, as given in Table 1. The structures of some of the pesticides studied are also given in this table. The pesticides have a broad range of toxicity for humans.
† Hoechst Marion Roussel, Inc., 10236 Marion Park Drive, Mailstop C1-MO444, Kansas City, MO 64137-1405.
(1) Chambers, J. E.; Levi, P. E. Organophosphates: Chemistry, Fate, and Effects; Academic Press: San Diego, 1992.
Table 1. The Substructure Categories, the Structures and Some Examples of OPP Compounds
According to the World Health Organization, pesticides can be classified into five categories according to acute toxicity: extremely hazardous (IA), highly hazardous (IB), moderately hazardous (II), slightly hazardous (III), and unlikely to present a hazard in normal use (O), as given in Table 2.2,3

(2) Kidd, H.; James, D. R. The Agrochemicals Handbook, 3rd ed.; The Royal Society of Chemistry: Cambridge, England, 1991.
(3) Organophosphorus Insecticides: A General Introduction; International Programme on Chemical Safety, Environmental Health Criteria 63; World Health Organization: Geneva, 1986.
Table 2. Toxicity Classification of Pesticides by the World Health Organization
(LD50 for the rat, mg/kg of body weight)

                                                          oral                     dermal
class                                                solids      liquids       solids      liquids
IA    extremely hazardous                            5 or less   20 or less    10 or less  40 or less
IB    highly hazardous                               5-50        20-200        10-100      40-400
II    moderately hazardous                           50-500      200-2000      100-1000    400-4000
III   slightly hazardous                             over 500    over 2000     over 1000   over 4000
IV    unlikely to present acute hazard in normal use
For trace analysis of OPP compounds in environmental samples, gas chromatography is a favored method, though other methods such as Raman and infrared spectroscopy,4 liquid chromatography,5 and electrochemical sensors6 are actively investigated. Gas chromatography/mass spectrometry (GC/MS) is a powerful method that can provide more information than the other detection methods.7,8 The chromatographic peaks may be identified by retention time if the experimental conditions are fixed.9 Most commonly, however, peaks are identified by comparing chromatographic mass spectra to reference spectra of known compounds in a database. The identity of the peak is taken to be that of the database spectrum that best matches the experimental spectrum. Identification by spectral matching can be problematic if the mass spectrum of a compound is not in the reference database or if the experimental spectrum suffers in quality. A common cause of unsatisfactory chromatographic mass spectra is peak-skewing: even when the chromatography is good, the concentration of the analyte may change faster than the scan rate of the mass spectrometer. This problem is typically encountered with quadrupole spectrometers that scan a large range (e.g., 50-550 m/z). A second problem is that in some cases the chromatographic peaks may not be completely resolved or may contain multiple components. A solution to peak-skewing is to average all the mass spectra across the chromatographic peak. However, averaging the spectral scans discards key information that could be used to resolve underlying components if a peak contains more than a single component. To overcome these problems, alternative methods for mass spectral identification and classification have been investigated.10-13 Among the many computer-assisted classification methods used to expedite and aid the analysis of GC/MS data, supervised learning algorithms such as expert systems have been used to classify mass spectra.10,11 The fuzzy rule-building expert system (FuRES) is a pattern recognition method that combines fuzzy logic with a multivariate rule-building expert system.12

(4) Tanner, P. A.; Leung, K. M. Appl. Spectrosc. 1996, 50, 565-571.
(5) Lacorte, S.; Molina, C.; Barcelo, D. J. Chromatogr., A 1998, 795, 13-26.
(6) Hart, A. L.; Collier, W. A.; Janssen, D. Biosens. Bioelectron. 1997, 12, 645-654.
(7) Rosen, J. D. Application of New Mass Spectrometry Techniques in Pesticide Chemistry; John Wiley & Sons: New York, 1987.
(8) Lin, Y. W.; Hee, S. S. Q. J. Chromatogr., A 1998, 814, 181-186.
(9) EPA Method 8141B: Organophosphorus Compounds by Gas Chromatography, 1998.
(10) Scott, D. R. Anal. Chim. Acta 1988, 211, 11-29.
(11) Scott, D. R. Chemom. Intell. Lab. Syst. 1994, 23, 351-364.
(12) Harrington, P. B. J. Chemom. 1991, 5, 467-486.
(13) Tandler, P. J.; Butcher, J. A.; Tao, H.; Harrington, P. B. Anal. Chim. Acta 1995, 312, 231-244.
FuRES was demonstrated in the classification of GC/MS data from plastic recycling products.13 The mass spectra were compressed by a modulo routine that reduced the number of m/z measurements to 14 points for each spectrum. The application of FuRES to GC/MS data for the classification of OPP compounds was also studied.14 The spectra of OPP compounds do not have a simple repeating fragmentation pattern and were not as amenable to the modulo compression routine as the plastic recycling products (i.e., alkanes, alkenes, and dienes) were. In addition, the FuRES training time was prohibitively long when large sets of mass spectra were used for training.

Neural networks are an alternative method for classification. Back-propagation neural networks (BNNs) are popular among chemists.15,16 BNNs are feed-forward networks trained by propagating the error back through the network. The network size and architecture (number of processing units, layers, and interconnections) must be determined before training. Disadvantages of BNNs include the determination of the optimal network architecture and long training times. The cascade correlation neural network (CCN) was developed to alleviate these problems.17 The CCN configures its own architecture as it trains. It starts with a minimal network (i.e., input neurons and output neurons) and then sequentially adds hidden units until the error decreases below a user-defined threshold. Each new hidden unit is connected to the network inputs and to the outputs from the previously installed hidden units. Therefore, the outputs from the previously added hidden units cascade into each new unit. Several candidate units can be trained in parallel, and the one with the largest covariance is selected as the next hidden unit to install into the network. Once trained, a hidden unit is no longer adjusted; thus only one unit is trained at a time. This trait is somewhat unique to the CCN and eliminates the chaos of simultaneously adjusting all processing units and all adjustable parameters, as in BNN training. The number of hidden units of the CCN increases until the desired error is obtained or the error converges above the threshold. The CCN, like all other neural networks, is prone to overfitting the training data. Temperature-constrained units were introduced to reduce overfitting by controlling the length of the hidden unit weight vectors, and the temperature constrained-cascade correlation network (TCCCN) was developed.18 The length of the hidden unit weight vectors is controlled by a parameter called the computational temperature. Training a TCCCN is significantly faster than training FuRES; however, the TCCCN does not provide a mechanism of inference as readily.

As Wold noted, research on the complicated relationships between chemical structure and chemical reactivity, biological activity, and other chemical and physical properties is "doing less well".19 Research so far has focused on quantitative structure/activity relationships (QSAR), which require descriptors to encode molecular information.

(14) Cai, C.; Harrington, P. B. Presented at the 23rd Meeting of the Federation of Analytical Chemistry and Spectroscopy Societies, Kansas City, MO, October 1996.
(15) Zupan, J.; Gasteiger, J. Anal. Chim. Acta 1991, 248, 1-30.
(16) Wythoff, B. J. Chemom. Intell. Lab. Syst. 1993, 18, 115-155.
(17) Fahlman, S. E.; Lebiere, C. The Cascade-Correlation Learning Architecture; Carnegie Mellon University Technical Report CMU-CS-90-100, August 1991.
(18) Harrington, P. B. Anal. Chem. 1998, 70, 1297-1306. (19) Wold, S. Some Reflections on Chemometrics. Newsletter 18 for the North American Chapter of the International Chemometrics Society, October 1998; pp 3-4.
Figure 1. General objective. By training on input spectra with known properties, the relationship between the spectra and the properties can be modeled. This model can then be used to rapidly predict the properties for every mass spectral scan in a chromatographic run.
These descriptors are used to represent a molecule digitally and are needed in order to build a relationship (either qualitative or quantitative) between the molecule and its property. Because mass spectra are highly characteristic and can be considered representative of structure, a qualitative relationship between mass spectra and toxicity is established in this work. In this study, the TCCCN is used to predict substructure and toxicity properties of OPP compounds directly from their mass spectra. Each point on the total ion current (TIC) chromatogram is the summation of the peak intensities of a mass spectrum, which makes it suitable for further chemometric processing. Subchromatograms are produced by multiplying the output from the neural network by the total ion current (i.e., the sum of the mass spectral peaks). The TCCCN is applied to every mass spectral scan in the GC/MS data, and the chromatogram is decomposed into subchromatograms either by substructure or by toxicity. Toxicity is the direct property that is sought. The substructure classifications are indirectly related to toxicity and can be used to validate the screening results.

THEORY

The desired goal is to rapidly screen chromatograms for peaks that are composed of spectra with characteristic properties. For environmental monitoring, a key property is toxicity. Other properties, such as chemical substructures that are related to toxicity, may also be screened. The goal is to use mass spectra to predict these properties and thereby achieve a chemometric detector that is selective for the different functional groups or toxicity classes (see Figure 1). In some cases, reference spectra may not be available in the reference databases, and recognizing key substructures is then a useful approach. A mass spectrum can be considered a coding of a compound's structure, and some electronic properties may manifest themselves in the fragmentation pathways that generate the mass spectrum. Therefore, it may be possible to predict or observe electronic properties of a compound directly from a mass spectrum. Mass spectra may not be sufficient to entirely encode the chemical information for precise prediction of toxicity. However, for screening chromatograms, toxicity is the key property of interest and not necessarily the chemical structure. This approach is advantageous because chemical pesticides may decompose or react in the environment to produce new compounds that are not contained in the reference library.
Figure 2. Architecture of the TCCCN. O stands for the hidden layer outputs, and Ŷ is a matrix of predicted class assignments. Each row corresponds to an input spectrum, and each column corresponds to a class assignment. The column with the largest output value designates the predicted class.
Although the mass spectra may not encode information regarding configurational isomers, they do encode electronic properties, such as the electronegativity of the organophosphorus bond, that are related to toxicity.

The architecture of the TCCCN is given in Figure 2. The inputs are multiplied by the weights and adjusted by a bias,

$$\mathrm{net}_{ij} = \sum_{m=1}^{v} w_{jm} i_m + b_j \qquad (1)$$
for which v is the number of input connections to unit j, w_jm is a component of the weight vector normalized to unit Euclidean length, and i_m is the input activation coming from the mth neuron in the preceding layer. The result (net_ij) is input to the transfer function, and the output of the jth hidden unit (o_ij) is obtained by
$$o_{ij} = f(\mathrm{net}_{ij}) = \frac{1}{1 + e^{-\mathrm{net}_{ij}/t_j}} \qquad (2)$$
for which t_j is the computational temperature. The TCCCN training maximizes the covariance between a candidate neuron's output and the residual error. The temperature-constrained transfer function is applied to both the hidden layer and the output layer. The weights of a hidden unit are trained by maximizing the covariance between the unit's output and the residual error. The covariance magnitude (|C_j|) between the output of candidate unit j and the residual errors of the p output units is obtained from

$$|C_j| = \sum_{k=1}^{p} \left| \sum_{i=1}^{n} (o_{ij} - \bar{o}_j)(e_{ik} - \bar{e}_k) \right| \qquad (3)$$
for which the covariance is calculated with respect to the n observations in the training set. The absolute values of the covariances are summed over the p output units. The averages of the hidden unit output (ō_j) and the error (ē_k) are taken over the n objects in the training set. The denominator of n - 1 is omitted from the calculation because it is constant throughout the training procedure. The weights are adjusted so that |C_j| is maximized.
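As a concrete illustration of eqs 1-3, a minimal NumPy sketch of the candidate-unit computation is given below. This is not the authors' C++ implementation; the function names, and the explicit renormalization of the weight vector to unit Euclidean length, are illustrative assumptions only.

```python
import numpy as np

def candidate_output(inputs, weights, bias, temperature):
    """Eqs 1 and 2: weighted net input followed by the
    temperature-constrained sigmoid transfer function.

    inputs      : (n, v) array, one row per training spectrum
    weights     : (v,) weight vector w_j
    bias        : scalar bias b_j
    temperature : computational temperature t_j
    """
    w = weights / np.linalg.norm(weights)            # unit Euclidean length, as in the text
    net = inputs @ w + bias                          # eq 1
    return 1.0 / (1.0 + np.exp(-net / temperature))  # eq 2

def candidate_covariance(output, residual_error):
    """Eq 3: summed magnitudes of the covariances between the candidate
    output (n,) and the residual errors (n, p) of the p output units.
    The 1/(n - 1) factor is omitted, as noted in the text."""
    o_centered = output - output.mean()
    e_centered = residual_error - residual_error.mean(axis=0)
    return np.abs(o_centered @ e_centered).sum()

# A candidate unit is trained by adjusting its weights, bias, and temperature
# so that candidate_covariance(...) is maximized; the best of several
# candidates is then frozen and installed as the next hidden unit.
```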
Figure 3. Mass spectra of diazinon. Panels A-C are from the reference database and panel D is a single scan from the GC/MS. The base peak could be 137 or 179 amu.
For the temperature-constrained neural networks, the direction of the weight vector is adjusted by changing the weights. The weight vector is constrained to unit Euclidean length, and the temperature (t_j), which controls the effective length of the weight vector, is adjusted. In a conventional neural network, the magnitudes of the weights, and therefore the length of the weight vector, can range freely. Overfitting usually occurs with weight vectors that are relatively long. The temperature-constrained models restrict the temperature so that it maximizes the magnitude of the first derivative of the covariance between the output and the residual error with respect to temperature. This objective function is advantageous because it causes the error surface to remain steep, which facilitates gradient training. In addition, the outputs are continuously distributed throughout their range when the derivative is maximized, which ensures fuzzy interpolation by the hidden unit. The training of the output layer follows the same procedure, except that it uses the outputs from the hidden layer as inputs. The sigmoid output units were unconstrained because their inputs were overdetermined (i.e., there were more spectra in the training set than hidden units in the network). After training, the weights and the temperatures for both the hidden layers and the output layer are stored. For testing, the input data are passed through eqs 1 and 2 to generate the hidden layer outputs, and these outputs are then passed through eqs 1 and 2 again for the output layer, generating the predicted values.

To evaluate the TCCCN method, the prediction accuracy, i.e., the percentage of correctly classified spectra, is used as the performance metric. For classification, each class is assigned a value of unity for true and zero for false. During prediction, the class output with the largest value indicates the predicted classification. Root-mean-square error (RMSE) is not considered a good measure
for classification problems because it is biased toward outliers;20 thus it was not used as a criterion to compare the classification results.

Generally, any calibration or classification model requires training and testing data sets that are samples from the same population. If high-quality reference mass spectra are used to build the models, the models will generalize better to spectra that are lower in quality. Low concentrations and peak-skewing are the two factors that cause experimental spectra to differ from the reference spectra. Another benefit of using high-quality reference spectra is that the networks tend to train faster, because experimental variances associated with peak-skewing and low concentrations are removed from the training set.

EXPERIMENTAL SECTION

In this study, all standard MS data in the training and testing sets were obtained from the Wiley Registry of Mass Spectral Data, 6th ed. (John Wiley & Sons, Inc.). Of the approximately 90 OPP compounds mentioned in ref 2, 75 were found to have spectra in this database, yielding 197 mass spectra. Note that some compounds have more than one mass spectrum in the database and that mass spectra from the same compound were not identical. An example for diazinon is given in Figure 3. Three spectra from the library are plotted in panels A-C; they vary in their relative peak intensities, and the position of the base peak changes. Panel D is a spectrum from the GC/MS run.

(20) Buckheit, J. B.; Donoho, D. L. Improved Linear Discrimination Using Time-Frequency Dictionaries. In Wavelet Applications in Signal and Image Processing III; Laine, A. F., Unser, M. A., Wickerhauser, M. V., Eds.; SPIE: Washington, DC, 1995; pp 540-551.
For classification by substructure, some structural categories were merged because they have similar chemical bonds and because only a few examples of those substructures were available in the mass spectral database. The OPP compounds were therefore grouped into the six substructure categories given in Table 1. For classification by toxicity, the last two categories (III and O) were treated as a single toxicity category in this work because only a few OPP compounds were available in each of these two classes.

For each spectrum, the m/z values were rounded to the nearest integer. When several peaks in the same spectrum rounded to the same m/z unit, their intensities were coadded. For example, if a spectrum had three peaks [given as m/z (intensity)], 144.6 (111), 145 (333), and 145.3 (222), they were treated as a single peak, 145 (666). The MS data were formatted into a matrix for which the rows corresponded to spectra and the columns corresponded to m/z units. The m/z axis was the union of all the m/z variables for the data in the training set. For the prediction sets, spectral peaks at m/z units that were missing from this axis were omitted from the calculation, and m/z units with no peak in the prediction spectrum were assigned intensity values of zero. The number of variables (i.e., the columns of the data matrices) varied among the data sets and ranged between 335 and 370.

The cascade neural network software was programmed in C++, compiled with the Borland C++ compiler (V5.02, Borland Inc.), and run on a Pentium Pro 200 MHz computer equipped with 64 MB of RAM, operated under MS-Windows NT 4.0. For training, a pool of five candidate neurons was trained simultaneously, and the one that had the largest covariance was installed as the hidden neuron. The weight updates were accomplished with the quickprop algorithm.21 The learning rate was set to 0.01, and for the quickprop parameters, the maximum growth factor (µ) was set to 1.0 and the shrink factor was set to 1/(1 + µ). The typical training time for data sets with 98 mass spectra was 5 min.

The GC/MS data were collected with a Hewlett-Packard HP5890 gas chromatograph directly interfaced to an HP5988A quadrupole mass spectrometer operated in EI mode. An HP9000 series 300 computer with HP59979 MS Chemstation software was used to record the GC/MS results. The column was an HP-1 (cross-linked methyl silicone gum, 12 m × 0.2 mm × 0.33 µm). The temperature program consisted of an initial temperature of 70 °C followed by a ramp of 8 °C/min to 250 °C. The mass spectrometer was tuned with PFTBA. The pesticide standards were pesticide kit no. 52 from PolyScience (Niles, IL, lot no. LA35505). The pesticides and their purities were mevinphos (2-methoxycarbonyl-1-methylvinyl dimethyl phosphate, 60% plus 40% active related compounds), phorate (O,O-diethyl S-ethylthiomethyl phosphorodithioate, 98%), dimethoate (O,O-dimethyl S-methylcarbamoylmethyl phosphorodithioate, 98%), disulfoton (O,O-diethyl S-2-ethylthioethyl phosphorodithioate, 99%), diazinon (O,O-diethyl O-2-isopropyl-6-methylpyrimidin-4-yl phosphorothioate, 98%), methyl-parathion (O,O-dimethyl O-4-nitrophenyl phosphorothioate, 99%), and malathion (S-1,2-bis(ethoxycarbonyl)ethyl O,O-dimethyl phosphorodithioate, 95%).

(21) Fahlman, S. E. An Empirical Study of Learning Speed in Back-Propagation Networks; Carnegie Mellon University Technical Report CMU-CS-88-162, September 1988.
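A minimal sketch of this preprocessing step (integer binning with coaddition of intensities, followed by alignment to the training m/z axis) is shown below. The function names and the example axis range are hypothetical, and this is not the original C++ code.

```python
import numpy as np

def bin_spectrum(mz, intensity):
    """Round each m/z value to the nearest integer and coadd intensities
    that fall on the same unit, e.g., 144.6 (111), 145 (333), and
    145.3 (222) become a single peak 145 (666)."""
    binned = {}
    for m, i in zip(mz, intensity):
        unit = int(round(m))
        binned[unit] = binned.get(unit, 0.0) + i
    return binned

def spectrum_to_row(binned, mz_axis):
    """Map a binned spectrum onto the training m/z axis (the union of all
    m/z units in the training set).  Peaks at units missing from the axis
    are dropped; units with no peak are assigned zero intensity."""
    return np.array([binned.get(m, 0.0) for m in mz_axis])

# Example with the peaks quoted in the text (the axis range is hypothetical):
row = spectrum_to_row(bin_spectrum([144.6, 145.0, 145.3], [111, 333, 222]),
                      mz_axis=range(40, 400))
```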
Table 3. OPP MS Data Set Composition for the Prediction of Structure

category   substructure          no. of OPPs   no. of spectra
H          phosphorodithioate        11              33
I          phosphorothioate          24              66
J          phosphorothiolate          7              12
K          phosphate                 21              64
L          phosphonate                7              10
M          phosphoramide              5              12
total                                75             197
Table 4. Confusion Matrix of a Substructure Classification

                        known class
                  H     I     J     K     L     M
H predicted      28     1     1     0     1     0
I predicted       3    31     1     1     0     0
J predicted       1     1     4     0     0     1
K predicted       0     0     0    15     2     0
L predicted       0     0     0     0     3     1
M predicted       0     0     0     0     0     3
total (98)       32    33     6    16     6     5
RESULTS AND DISCUSSION

Two data sets were used for each neural network model: a training set to build the model and a testing set to evaluate it. The temperature constraint already gives the networks a certain ability to avoid overfitting. Therefore, only two data sets were used, and the networks were trained until the error was 5% of the initial error (i.e., the standard deviation about the mean of the classes) or convergence was achieved at a higher error.

Prediction of Substructure. The composition of the MS data is given in Table 3. To simplify the problem, 163 spectra from the 3 major categories (i.e., H, I, and K) were first used to test the TCCCN. The training set contained 50% of the spectra from each substructure category (82 spectra); the other 81 spectra made up the testing set. No replicate spectra were used. The spectra were randomly partitioned into 10 training-test set pairs. Because the TCCCN requires a random number to initialize the network, the program was run 10 times with different initial conditions for each of the 10 partitions. Over these 100 runs, the average prediction accuracy was 93.0 ± 0.5% (95% confidence interval). From these results for the three major categories, the TCCCN works very well.

For the prediction of spectra from all six OPP categories, the same computational procedure was used, except that only three training-testing pairs were used. Of the total 197 spectra, 99 were in the training set and 98 were in the testing set. The prediction accuracy was 83.5 ± 2.4%. The confusion matrix of one substructure prediction is given in Table 4. The rows give the class predicted by the network, and the columns give the known class of the predicted spectrum. Of the 98 spectra in the test set, 84 were correctly classified, which is the sum of the diagonal of the matrix. Of the 32 spectra of class H, 3 were misidentified as class I and 1 was misidentified as class J. The errors tended to be related to the sizes of the training sets, with the classes that had smaller training sets having larger numbers of misclassifications.
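The confusion matrix and the classification rate reported here can be computed from the known and predicted class labels as in the short sketch below. This is a hedged illustration (the function names are not from the published software); rows index the predicted class and columns the known class, as in Tables 4 and 6.

```python
import numpy as np

def confusion_matrix(known, predicted, classes):
    """Rows: predicted class; columns: known class."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for k, p in zip(known, predicted):
        cm[index[p], index[k]] += 1
    return cm

def classification_rate(cm):
    """Percentage of correctly classified spectra (diagonal sum over total)."""
    return 100.0 * np.trace(cm) / cm.sum()

# For the partition shown in Table 4, 84 of the 98 test spectra fall on the
# diagonal, giving a classification rate of about 85.7% for that single run.
```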
Table 5. OPP MS Data Set Composition for the Prediction of Toxicity^a

                    no. of OPPs by substructure category
class            H     I     J     K     L     M    total    no. of spectra
IA               5     5     0     6     4     2      22           65
IB               6     4     4     3     0     3      20           44
II               9    10     2     1     0     0      22           65
III + O          1     5     1     1     1     2      11           23
total           21    24     7    11     5     7      75          197

^a For each toxicity class, the breakdown of the number of each structure class is also given.
Table 6. Confusion Matrix of a Toxicity Classification

                         known class
                     IA    IB    II    III + O
IA predicted         27     2     4       2
IB predicted          1    17     6       3
II predicted          4     2    22       1
III + O predicted     0     1     0       5
total (97)           32    22    32      11

Figure 4. GC/MS total ion current chromatogram of a pesticide mixture.
Prediction of Toxicity. The composition of the MS data is given in Table 5. Of the 197 spectra, 100 were used in the training set and 97 in the testing set. Five randomly selected training-test partitions were used. Table 5 also shows the relationship between substructure and toxicity. General trends, such as phosphate pesticides (K) being more toxic than the others, can be observed. However, the toxicity is also largely affected by the functional groups, so the relationship is more complicated than one in which only the substructure is considered. Each data set was used to build network models with 10 different initial conditions. The prediction accuracy was 72.3 ± 1.5%. A confusion matrix is given in Table 6.

Recognition Ability. The prediction accuracy results are reasonable, considering the limited number of compounds in the training data. If all 197 spectra from the 75 OPP compounds are included in the training set, and assuming that the analyte in a real sample is one of the 75 OPP compounds, then the problem is reduced to pure pattern recognition and higher prediction accuracy can be obtained. An experiment was performed with one spectrum from each OPP in the training set and all the others in the testing set. In this way, 75 spectra from all 75 OPP compounds were in the training set, and 122 spectra from 49 OPP compounds were in the testing set. The classification rate for substructure was 97.0 ± 1.1%; for toxicity, it was 83.7 ± 2.1%. These results indicate that the TCCCN can recognize and predict the OPP spectra from GC/MS experiments.

Discriminant partial least-squares (DPLS) regression is a standard multivariate analysis method.22,23 The same data sets were used to obtain DPLS results. The selection of the number of latent variables is important in DPLS; here, however, this number was optimized by using the test data set and choosing the number that yielded the lowest prediction error.

(22) Stahle, L.; Wold, S. J. Chemom. 1987, 1, 185-196.
(23) Vong, R.; Geladi, P.; Wold, S.; Esbensen, K. J. Chemom. 1988, 2, 281-296.
Fifteen latent variables were found optimal for both the substructure and toxicity data sets. For the optimized DPLS method, classification accuracies of 98.4% for substructure recognition and 87.7% for toxicity recognition were obtained. The TCCCN results are comparable to the optimized DPLS results. In addition, the TCCCN has a certain ability to prevent overfitting; for example, the training error often did not decrease to 5% of its initial value.

Screening GC/MS Chromatograms. For the experimental GC/MS data, OPP compounds must be distinguished from a large variety of non-OPP compounds and other pesticides. For the method to be practically useful, a means to differentiate OPP compounds from non-OPP compounds was devised by creating a non-OPP category. The non-OPP category consisted of non-OPP spectra that were similar to the OPP spectra, with similarity measured by the Euclidean distance between two spectra (shorter distances indicate greater similarity). For each OPP spectrum, 15 similar non-OPP spectra were obtained from the reference database, and duplicates were removed from the training set. The resulting training set comprised 934 spectra (OPP and non-OPP) and 397 variables. The TCCCN was then used to predict each MS scan from the GC/MS runs.

Figure 4 gives a GC/MS total ion current chromatogram obtained by injection of the mixture of seven pesticides: mevinphos (substructure class K, toxicity class IA), phorate (substructure class H, toxicity IA), dimethoate (substructure class H, toxicity II), disulfoton (substructure class H, toxicity IA), diazinon (substructure class I, toxicity II), methyl-parathion (substructure class I, toxicity IA), and malathion (substructure class H, toxicity III). For each mass spectrum, the result from the TCCCN was an array of values between 0 and 1, one for each class. These values were normalized by the sum of the array so that they represented the possibility of each pesticide class. Multiplying these possibilities by the TIC of each mass spectrum then generated the subchromatograms from the GC/MS data. Figure 5 gives the subchromatograms for the substructure classification. Each OPP was detected and correctly classified into one of the subchromatograms; note that the baseline was mostly classified into the non-OPP category. The TCCCN results for the toxicity classification are given in Figure 6. Each OPP was correctly recognized.
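The subchromatogram construction described above can be summarized in a few lines: the raw class outputs for each scan are normalized to sum to one and then scaled by that scan's total ion current. The sketch below is illustrative (the array names are assumptions), not the original processing code.

```python
import numpy as np

def subchromatograms(class_outputs, tic):
    """class_outputs : (n_scans, n_classes) raw TCCCN outputs, one row per scan
    tic             : (n_scans,) total ion current, i.e., the sum of each
                      mass spectrum

    Returns an (n_scans, n_classes) array whose column c is the
    subchromatogram for class c."""
    # Normalize each scan's outputs so that they behave like class possibilities.
    possibilities = class_outputs / class_outputs.sum(axis=1, keepdims=True)
    # Scale by the TIC so that intense chromatographic peaks remain intense.
    return possibilities * tic[:, np.newaxis]
```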
Figure 5. Substructure subchromatograms predicted by TCCCN. Of the six substructure categories, only three are shown.
Figure 6. Toxicity subchromatograms predicted by TCCCN. Class IB has been omitted because it does not have any features.
CONCLUSIONS

Organophosphorus substructures and toxicity can be predicted from low-resolution mass spectra using neural networks. The neural network models allow the rapid screening of every mass spectral scan in a chromatographic run. The outputs of the neural network may be used to mathematically resolve chromatographic peaks by substructure or toxicity (potential hazard). The TCCCN is a powerful and robust pattern recognition method that can
effectively predict substructure and toxicity, making the approach a useful tool for screening environmental samples. Because each scan is classified, overlapping and contaminated peaks may be resolved.

For toxicity, a classification rate of 72.3 ± 1.5% was obtained. Although this result is marginal, most of the errors occurred in recognizing classes that were underrepresented in the training set. The variance of the classification rates accounted for the different training
and test set partitions and the different neural network models. For the prediction of substructure, the classification rate was 83.5 ± 2.4%, with a similar trend of relatively more misclassifications for classes that had fewer training examples. When the same compounds were represented in both the training and test sets, the classification rates improved to 97.0 ± 1.1 and 83.7 ± 2.1% for substructure and toxicity, respectively. These results suggest that larger training sets would improve the prediction accuracy for both structural features and toxicity. In addition, because the recognition accuracy (i.e., predicting different spectra from the same compounds used in the training set) was relatively higher, the errors were not caused by a disparity between chromatographic and reference spectra. Future work will expand the range of pesticide classes. However, the training set sizes may increase to thousands of mass spectra when nonpesticide spectra are included. The fast training rates of the TCCCN, coupled with data compression, may make training on such large sets practical.
Databases of reference spectra are increasing in size. It is important that the reference databases also increase in breadth, so that important classes of compounds are not represented by many spectra from only a few compounds but rather that each compound class is well represented.

ACKNOWLEDGMENT

Elaine Saulinskas is thanked for her assistance in running the GC/MS experiments. This work was presented in part at the 23rd Meeting of the Federation of Analytical Chemistry and Spectroscopy Societies (FACSS), 1996, Kansas City, MO; at the 1998 Pittsburgh Conference, New Orleans, LA; and at the 25th Meeting of FACSS, 1998, Austin, TX.
Received for review February 10, 1999. Accepted July 19, 1999. AC990159Y