
Anal. Chem. 1987, 59, 1914-1917


size range of 0.1-1.6 µm a.d. IC data were not available for stages 4 and 5, covering the size range of 1.6-25 µm a.d. TXRF data for stages 4 and 5 (not shown) indicate that potassium concentrations were below 0.10 µg m⁻³ over the sampling period. Figure 5 is a plot of potassium values obtained by IC vs. TXRF for stages 1-3. The slope of the regression line is 0.982 ± 0.082 with an intercept of -0.026 ± 0.046 and an R = 0.951.

Figure 6 is a plot of chlorine data obtained by TXRF and IC. The data show generally higher values for TXRF compared to IC; this is especially prevalent in the results from stage 5 for particles in the 6.4-25 µm size range. The discrepancy between IC and TXRF was much larger for stage 5 and may be due to contamination during the storage or preparation of the specimen. Data shown for stage 5 were not used in calculating the regression line in Figure 6. The line has a slope of 0.822 ± 0.041, an intercept of 0.050 ± 0.036, and an R = 0.968.

Initial results have shown that TXRF is a potentially valuable method for the determination of elements in aerosol particulates, especially when limited sample is available, as in the case of size fractionation. The principal advantages of TXRF are the small sample size requirements, the low limits of detection, simple standardization, lack of matrix effects, and the relatively simple sample preparation. Modification of the instrumentation to optimize the excitation could lead to a method for the determination of elements of lower concentration.

[Figure 6 plot omitted; horizontal axis: Cl by TXRF (µg m⁻³).]

Figure 6. Plot of chloride determined by IC vs. TXRF results. Symbols indicate size fraction (µm a.d.): 0, 0.04-0.1; 0, 0.1-0.4; 0.4-1.6; X, 1.6-6.4; A, 6.4-25.

ACKNOWLEDGMENT
The assistance of C. Streli and V. Casensky is gratefully acknowledged.

Registry No. S, 7704-34-9; Cl, 7782-50-5; K, 7440-09-7; Ca, 7440-70-2.

LITERATURE CITED
(1) Whitby, K. T. Atmos. Environ. 1978, 12, 135-159.
(2) Broekaert, J. A. C.; Wopenka, B. W.; Puxbaum, H. Anal. Chem. 1982, 54, 2174-2179.
(3) Nelson, J. W. In X-ray Fluorescence Analysis of Environmental Samples; Dzubay, T. G., Ed.; Ann Arbor Publishers: Ann Arbor, MI, 1977; pp 19-34.
(4) Johansson, T. B.; Van Greiken, R. E.; Nelson, J. W.; Winchester, J. W. Anal. Chem. 1975, 47, 855-860.
(5) Lannefors, H.; Carlsson, L. E. X-ray Spectrom. 1983, 12, 138-147.
(6) Yoneda, Y.; Horiuchi, T. Rev. Sci. Instrum. 1971, 42, 1069-1070.
(7) Wobrauschek, P.; Aiginger, H. Anal. Chem. 1975, 47, 852-855.
(8) Michaelis, W.; Knoth, J.; Prange, A.; Schwenke, H. Adv. X-Ray Anal. 1985, 28, 75-83.
(9) Stossel, R.; Prange, A. Anal. Chem. 1985, 57, 2880-2885.
(10) Aiginger, H.; Wobrauschek, P. Adv. X-Ray Anal. 1985, 28, 1-10.
(11) Preining, O.; Berner, A. EPA-600/2-79-105, 1979; Environmental Protection Agency, Research Triangle Park, NC.
(12) Knoth, J.; Schwenke, H. Fresenius' Z. Anal. Chem. 1980, 301, 7-9.
(13) Currie, L. A. Anal. Chem. 1968, 40, 568-593.

RECEIVED for review February 20, 1987. Accepted April 20, 1987. This work was supported in part by a grant from Tracor X-ray Inc. D.J.L. and D.B.B. thank the Fulbright Commission of the U.S. Information Agency for funds to conduct this research.

Selective Reduction of Infrared Data

Robert J. Anderegg* and Dong-jin Pyo

Department of Chemistry, University of Maine, Orono, Maine 04469

As gas chromatography/infrared spectrometry (GC/IR) becomes routinely available, methods must be developed to deal with the large amount of data produced. We demonstrate computer methods that quickly search through a large data file, locating those spectra that display a spectral feature of interest. Based on a modified library search routine, these selective data reduction methods retrieve all or nearly all of the compounds of interest, while rejecting the vast majority of unrelated compounds. A greater degree of selectivity is observed than with chemigram-type routines.

0003-2700/87/0359-1914$01.50/0 © 1987 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 59, NO. 15, AUGUST 1, 1987

The coupling of chromatographs to various types of spectrometers has led to the development of a number of extremely powerful instrument systems for the analysis of complex mixtures. Gas chromatography/mass spectrometry (GC/MS) and, more recently, gas chromatography/infrared spectrometry (GC/IR) are probably the two most widely used examples. These computerized systems are prodigious data producers, capable of generating hundreds of spectra per hour. One of the great challenges of modern analytical chemistry is to try to make sense of data as fast as instruments can spew it forth. To that end, we have been interested for some time in developing computerized methods of selective data reduction. The goal of these methods is the rapid sorting of hundreds of spectra to find the relatively few which may be important enough in a given analysis to warrant further attention.

For example, if a complex mixture of unknown compounds is to be analyzed, the sample might be chromatographed and the GC effluent directed into a continuously scanning spectrometer. After 30 min of analysis, 900 spectra might have been collected and stored in a data file. Let us assume, in this particular case, that the analyst is interested in finding and identifying all of the compounds in the mixture that are chlorinated. If the GC effluent had been detected with a chlorine-selective detector, such as the electron-capture detector (ECD), the chlorinated compounds would have been easy to locate; but the ECD contributes little structural information for the identification of unknowns. In the present case, the spectrometer would provide full spectral data, useful for identification, assuming the spectra of the chlorinated compounds could be located.

Each of the 900 spectra could be visually examined, searching for spectral indications of the presence of chlorine, a tedious and time-consuming approach. The data system could perform a library search on each of the 900 spectra; again, this is time-consuming and inefficient. If the data system were sufficiently sophisticated, it could perform some sort of artificial intelligence or automated interpretation scheme on each of the 900 spectra, but the time involved would surely be prohibitive. The sensible approach is to somehow select some of the 900 spectra for further attention but to select them in a way that maximizes the probability of selecting chlorinated compounds. All of the usual identification routines (manual interpretation, library search, or automated interpretation) could be applied to this smaller subset. This selection is the process we refer to as selective data reduction (1).

The selection criteria often used are the total intensity of the spectra or the presence of some spectral feature of interest. Commercial spectrometers frequently have simple routines built into the data systems: peak finding routines are just selective data reduction using spectral intensity as a criterion. Routines based on spectral features vary with the type of spectrometer used. In GC/MS, one has mass chromatograms (2) or the presence of isotope clusters (3-5) to use as indicators of compound class. In the example above, the presence of chlorine isotope clusters can be used to indicate chlorinated compounds (4, 6).

In GC/IR, two basic options for selective data reduction are currently available. The peaks with the highest intensity in the Gram-Schmidt plot can be selected for further study. Alternatively, a “chemigram” (7) can be used to indicate the presence of absorptions of interest. These are plots of integrated absorbance in a defined spectral window. Rather than looking at all 900 spectra, with the use of chemigrams, the analyst sorts out only those spectra that are likely to contain a particular functional group. Although useful, chemigrams are not always very selective, in that they show only the integrated absorbance over a chosen frequency window.

In this paper, we provide a third alternative: selective data reduction using patterns of absorbances. Although the goal of the reduction is the same as that of a chemigram, the use of patterns provides a more selective criterion for data reduction. We have written computer algorithms to search through hundreds of spectra, retrieving only those that display the pattern of interest, and these algorithms have great potential for the analysis of GC/IR data.

Although our routines are based on the presence of spectral patterns, they are distinct from “pattern recognition” methods (8, 9) in both purpose and approach. Pattern recognition statistically sorts a large database into a number of clusters, and assigns a spectrum to a compound class based on the nearness of some metric representing the spectrum to one of the clustered units. Our approach seeks only to reduce the number of spectra which must be further interpreted by the analyst, and so looks only for similarity within a defined spectral window. The database is not really required, and in fact, one need not know in advance what functional group is responsible for the pattern of interest.
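The chemigram described above can be illustrated with a minimal Python sketch. It is not the authors' FORTRAN code; the function names, the trapezoidal integration, and the two made-up scans are assumptions chosen only to show why integrated absorbance alone carries no information about band shape.

```python
def integrated_absorbance(wavenumbers, absorbances, lo, hi):
    """Trapezoidal integral of absorbance over the window lo..hi (cm^-1)."""
    points = sorted((w, a) for w, a in zip(wavenumbers, absorbances) if lo <= w <= hi)
    total = 0.0
    for (w0, a0), (w1, a1) in zip(points, points[1:]):
        total += 0.5 * (a0 + a1) * (w1 - w0)
    return total


def chemigram(scans, wavenumbers, lo, hi):
    """One integrated-absorbance value per scan, in acquisition order."""
    return [integrated_absorbance(wavenumbers, scan, lo, hi) for scan in scans]


# Made-up data on a 2 cm^-1 grid: a sharp, strong band and a broad, weak band.
grid = list(range(3400, 3801, 2))
sharp = [1.0 if 3550 <= w <= 3570 else 0.0 for w in grid]
broad = [0.1 if 3450 <= w <= 3650 else 0.0 for w in grid]
trace = chemigram([sharp, broad], grid, 3400, 3800)
```

In this hypothetical case the two scans give nearly the same integrated absorbance although their band shapes are very different, which is the limitation a pattern-based criterion is meant to address.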

EXPERIMENTAL SECTION

The computer programs described were written in FORTRAN and run on an IBM 370 computer. They are available from the authors. An IR database was used because it provided a large number of spectra, representing a variety of compound types. The database chosen was the EPA Vapor Phase Collection of 3300 spectra, available from James de Haseth at the University of Georgia. The spectra were stored in digital format on magnetic tapes. A general information “header” record, including compound name, formula, molecular weight, Chemical Abstracts Service (CAS) registry number, melting point, boiling point, Wiswesser line notation (WLN), etc., was followed by the digital spectra, measured at 2 cm⁻¹ resolution from 4000 to 450 cm⁻¹. The format of the records has appeared in the literature (10).

The basic strategy of our method was to use spectra from the database to identify patterns of absorbance that characterize certain functional groups, and then to search for those patterns in a series of “unknown” spectra. Representatives of a functional group were identified by computer searching the Wiswesser line notation (WLN) in the database header records. The list of spectra retrieved by WLN was checked against the compound names to avoid coding errors. An “average spectrum” was calculated by taking the mean absorbance of all of the normalized spectra at each frequency interval (2 cm⁻¹) throughout the range. Since the goal of the project is rapid screening, only a small portion of the full IR range was used, a portion chosen surrounding a characteristic band of that functional group. For example, when searching for carboxylic acids or alcohols, the -OH stretching region was used (3800-3400 cm⁻¹). The average spectrum was considered to represent the functional group. Other spectra were then tested in the same frequency window to see if they exhibited the same pattern of absorbance as the average. A score was assigned to each spectrum reflecting the degree of similarity to the average.
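The averaging step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' program: the text does not say how spectra were normalized, so scaling each full spectrum to a maximum absorbance of 1 is an assumption, and all function names are hypothetical.

```python
def normalize(absorbances):
    """Scale a spectrum so its strongest absorbance is 1 (assumed normalization)."""
    peak = max(absorbances)
    if peak <= 0:
        return [0.0 for _ in absorbances]
    return [a / peak for a in absorbances]


def window(wavenumbers, absorbances, lo, hi):
    """Keep only the points with lo <= wavenumber <= hi (cm^-1)."""
    return [a for w, a in zip(wavenumbers, absorbances) if lo <= w <= hi]


def average_pattern(spectra, wavenumbers, lo, hi):
    """Mean normalized absorbance at each frequency interval inside the window."""
    cut = [window(wavenumbers, normalize(s), lo, hi) for s in spectra]
    count = len(cut)
    return [sum(column) / count for column in zip(*cut)]
```

All spectra are assumed to share one wavenumber grid, as they would for library spectra recorded at the same resolution.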
Since this process is similar to a library search routine, except in that it is applied only to a small region of the spectrum, we use the same metrics reported in the literature for library searching (11). In most of the work described herein, the “difference squared” metric was used

$$M_{SQ} = \sum_i (s_i - r_i)^2$$

where M_SQ is the similarity indicator and s_i and r_i are the absorbance values of the sample and reference spectra in a frequency interval i. Clearly, the smaller the value of M_SQ, the better the match between the unknown and the reference (or average) spectrum; a perfect match would give M_SQ = 0.
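A minimal Python sketch of this scoring step is given below, assuming the metric is the plain sum of squared differences over the window with no further scaling. The names `msq` and `screen` are hypothetical, and the inputs are assumed to be already windowed and normalized.

```python
def msq(sample, reference):
    """Sum of squared absorbance differences over one spectral window."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must cover the same intervals")
    return sum((s - r) ** 2 for s, r in zip(sample, reference))


def screen(spectra, pattern, threshold):
    """Return (index, score) pairs with score <= threshold, best match first."""
    scored = [(i, msq(s, pattern)) for i, s in enumerate(spectra)]
    hits = [(i, score) for i, score in scored if score <= threshold]
    return sorted(hits, key=lambda pair: pair[1])
```

A call such as `screen(windowed_spectra, average, 1.5)` would return only the spectra that "probably did" show the pattern, leaving the rest out of any later library search.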

RESULTS AND DISCUSSION

In the evaluation of the performance of our selective data reduction methods, it was necessary to keep in mind the goals we set out to accomplish. The methods were not intended to be an automated interpretation scheme (12, 13) or library search routine (11, 14). The goal was simply to screen large numbers of spectra in a short time and to determine for each spectrum whether it “probably did” or “probably did not” contain the pattern of interest. Assuming the pattern would be displayed by all members of a compound class, the list of “probably dids” should include all of the spectra of interest in the data file. In our GC/IR data set example, this would correspond to sorting through the 900 spectra and selecting out the few that warrant further attention. These few could then be subjected to the usual procedures of spectral identification or full library search.

The spectral pattern to be used must be general enough that all members of the compound class are recovered, but selective enough that most of the other functional groups are eliminated. To select patterns, we adopted an approach used by Kowalski et al. (13). A series of spectra known to contain the functional group were averaged. Absorbances due to the functional group should average to a high value; absorbances due to other parts of the molecule should (ideally) be random and average to a low value.

The database from which we drew our spectra was the EPA Vapor Phase FTIR Database. Since it is one of the few FTIR spectral collections with all spectra in full, digital form, spectral averaging was particularly easy. Vapor-phase spectra were chosen since the major application of the selective data reduction routines will be in GC/IR data processing. The number of spectra included in the average need not be large; in the frequency window 3800-3400 cm⁻¹, we saw no difference in the average spectrum for carboxylic acids whether we used 20 known acid spectra or 185. This number will depend, to some extent, on the variability observed in the patterns exhibited by the members of that compound class. Whenever practical, we have tried to use at least ten spectra to determine a spectral average, although in one case discussed below, five spectra provided a suitable average spectrum.

To test our approach, we selected carboxylic acids as our first compound class. In the spectral range 3800-3400 cm⁻¹, carboxylic acids should show an O-H stretching vibration, so we restricted our attention to that region. As a control, we integrated the absorbance in the 3800-3400 cm⁻¹ window for the first 100 spectra in the database. This is the information we would get in a chemigram experiment. Not surprisingly, the spectra that were highest in total absorbance were of 14 alcohols, 4 carboxylic acids, and 1 amine.

If the pattern of absorbance, rather than just integrated intensity, is considered, one would expect a higher degree of selectivity. One hundred eighty-five spectra of carboxylic acids were averaged in the 3800-3400 cm⁻¹ window. The resulting pattern (Figure 1) showed a single sharp band centering around 3560 cm⁻¹.

[Figure 1 plot omitted; horizontal axis: wavenumbers (cm⁻¹).]

Figure 1. Average spectrum of 185 carboxylic acids from the EPA database (3800-3400 cm⁻¹).

The same 100 spectra from the database were again considered; this time, M_SQ for each was calculated, a measure of the similarity of the spectrum to the average pattern for carboxylic acids. The results were striking. Four carboxylic acids (100%) had the lowest M_SQ values: 3-chlorobutyric acid, 0.14; butyric acid, 0.26; heptanoic acid, 0.30; isobutyric acid, 0.40. The next lowest M_SQ was for 2-bromo-p-cresol, with an M_SQ = 1.70. Not only were the acids located as best matching the average pattern, but there was a large distance between the worst acid (M_SQ = 0.40) and the next closest nonacid (M_SQ = 1.70).

When the experiment was repeated on the entire database, more than 92% of the carboxylic acid spectra had M_SQ values of 1.5 or less; fewer than 4% of the non-carboxylic acids had M_SQ of 1.5 or less. Figure 2 shows the percentage of compounds (either acids or nonacids) vs. the M_SQ. By choosing a threshold value of 1.5, the vast majority of the acids are retrieved, while only a small fraction of the nonacids are included. This threshold value will vary with the functional group and search parameters; but a graph such as Figure 2 is easy to construct, and the proper value of the threshold can be determined from it. The graph, in essence, gives one a rather conservative statistical estimate of the performance of the data reduction for any given threshold value. The program generally performs somewhat better on real data than on library spectra, owing to the variability of database spectra (6). If it is important to miss none of, or very few of, the target compounds, one would choose a high threshold and tolerate the increased occurrence of false positives. If maximum discrimination is sought, one would select a threshold where the difference between the two curves in Figure 2 is the greatest.

[Figure 2 plot omitted; horizontal axis: threshold value (M_SQ).]

Figure 2. Percentage of carboxylic acids (--) and nonacids (-) with an M_SQ value less than the threshold shown.

The spectra of the carboxylic acids that were not assigned M_SQ < 1.5 were examined to see why they had not been found. In several cases, the problem was a badly sloping base line. This causes the “pattern” of absorbance in the spectral window to be unlike the average pattern. Such problems could be corrected by a base line subtraction routine (15) or use of a different search metric (11). In other cases, the spectrum in the database was mislabeled and was, in fact, not that of a carboxylic acid. The selective data reduction techniques have been shown to be useful in locating database errors (5, 6).
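The threshold curves of Figure 2 and the "maximum discrimination" rule can be sketched in Python as below. This is a hedged illustration, not the authors' procedure: the function names and the list of candidate thresholds are assumptions, and of the scores in the usage note only the four acid values and 1.70 come from the text.

```python
def percent_at_or_below(scores, threshold):
    """Percentage of scores that fall at or below the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(1 for s in scores if s <= threshold) / len(scores)


def best_threshold(target_scores, other_scores, candidates):
    """Candidate threshold giving the largest gap between the two curves."""
    def gap(t):
        return percent_at_or_below(target_scores, t) - percent_at_or_below(other_scores, t)
    return max(candidates, key=gap)
```

For example, with acid scores `[0.14, 0.26, 0.30, 0.40]` and nonacid scores `[1.70, 2.5, 3.0]` (the last two invented for illustration), any candidate between 0.40 and 1.70 retrieves all of the acids and none of the nonacids.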
Similarly, the spectra of the nonacids that received M_SQ < 1.5 were examined. In almost all cases these were phenolic compounds with a bulky group in the ortho position. It has been reported that the presence of large ortho substituents shifts the O-H stretching vibration to the region of 3600-3650 cm⁻¹ (16), causing the overlap with the carboxylic acids. It is interesting that most of the interfering compounds belong to a single compound class.

Another functional group that also would absorb in the 3800-3400 cm⁻¹ window is the alcohol. The average of 161 spectra of alcohols from the database is shown in Figure 3a. The pattern is somewhat broader than that for carboxylic acids, both because the band for alcohols is usually somewhat broader than for acids and because the alcohols show more variability in the position of the maximum absorbance. The search results were less satisfactory than for carboxylic acids, as well, although still far better than the results of a chemigram-type search. At a threshold level of M_SQ ≤ 0.60, 88% of the alcohols were found, along with 8% of the nonalcohols. Most of the nonalcohols found were amines, in which the N-H stretch overlapped the pattern of the average alcohol.

[Figure 3 plots omitted; horizontal axes: wavenumbers (cm⁻¹), panels a and b.]

Figure 3. Average spectra of (a) 161 alcohols from the EPA database (3800-3400 cm⁻¹) and (b) five intramolecular hydrogen-bonding alcohols (3800-3100 cm⁻¹).

Several alcohols that did not match well with the average OH stretch were compounds capable of intramolecular hydrogen bonding. This results in a broadening of the OH stretch absorption and a shift to lower frequency (Figure 3b). Five spectra of these hydrogen-bonding alcohols were averaged, and that pattern was searched against the database. The search located 53 additional spectra, all showing the characteristic hydrogen-bonding OH absorption. Although this was a predictable result, it points out a powerful feature of the selective data reduction routines. One need not know what functionality is responsible for the spectral pattern, only that such a pattern is shared by more than one spectrum. Retrieving spectra from the database that share a common spectral feature, and then studying the structures of the retrieved molecules, is an ideal way to learn more about structure/spectra relationships.


While the selectivity of data reduction methods based on a single pattern of absorbance is much greater than that based solely on integrated absorbance, a still higher degree of specificity can be achieved by use of multiple patterns in the same spectrum. For example, in the work described above on carboxylic acids, if the patterns in the OH stretching region and in the carbonyl stretching region are both included, a much greater selectivity results. The ortho-substituted phenols, for example, do not interfere, and fewer false positive results are recovered. The compromise is, of course, search time, which approximately doubles with the addition of a second spectral pattern. However, for patterns in the fingerprint region, where substantially more variability occurs, the multiple pattern approach may be necessary. Neither chemigrams nor pattern-search data reductions perform very well in this region. Experiments are under way to more fully assess the value of multiple pattern searches, as well as to apply the data reduction methods described to data generated in GC/IR analyses of real samples.

Two further aspects of the selective data reduction deserve comment. In all of the work described, the spectra were recorded and manipulated at 2 cm⁻¹ resolution. If one intends to collect spectra at some other resolution, the library spectra used for calculating an average pattern for each functional group should be deresolved to match the resolution of the data collection, making subsequent calculations more straightforward.

The effect of noise on the data reduction is also important. Since our search is based upon a library search metric (11), the susceptibility of our method to interference by noise reflects the susceptibility of the metric itself. Certain metrics are less prone to interference than others (for a discussion, see ref 11).
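The multiple-pattern search and the idea of substituting another metric can be sketched together in Python. Everything here is an assumption for illustration: the window names, the thresholds, the dictionary layout, and the choice of a sum of absolute differences as the alternative metric are not taken from the paper.

```python
def msq(sample, reference):
    """Sum of squared absorbance differences (the metric used in this work)."""
    return sum((s - r) ** 2 for s, r in zip(sample, reference))


def mab(sample, reference):
    """Sum of absolute differences, shown only as one possible alternative."""
    return sum(abs(s - r) for s, r in zip(sample, reference))


def matches_all(windows, patterns, metric=msq):
    """True only if every named window scores within its own threshold.

    windows:  {name: windowed absorbances of the unknown spectrum}
    patterns: {name: (average pattern, threshold)}
    """
    return all(
        metric(windows[name], reference) <= limit
        for name, (reference, limit) in patterns.items()
    )
```

Requiring both an OH-stretch window and a carbonyl window to match is one way to express the two-pattern acid search described above; a spectrum with the OH pattern but no carbonyl band would then be rejected.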
In our work, the difference squared metric proved satisfactory; but if noisy spectra are anticipated, other metrics could be incorporated into the algorithms instead.

In conclusion, selective reduction of IR data based on the rapid search for spectral patterns has potential for increasing the efficiency of GC/IR analyses. The methods can be used to select spectra which warrant further interpretation, whether that interpretation be by use of library searching, artificial intelligence, or manual spectral identification.

LITERATURE CITED
(1) Anderegg, R. J. Am. Lab. (Fairfield, Conn.) 1985, 17, 20.
(2) Hites, R. A.; Biemann, K. Anal. Chem. 1970, 42, 855.
(3) Anderegg, R. J. J. Chromatogr. 1983, 275, 154.
(4) LaBrosse, J. L.; Anderegg, R. J. J. Chromatogr. 1984, 314, 83.
(5) Anderegg, R. J. Anal. Chim. Acta 1985, 176, 175.
(6) LaBrosse, J. L.; Anderegg, R. J. J. Chromatogr. 1984, 314, 93.
(7) Mattson, D. R.; Julian, R. L. J. Chromatogr. Sci. 1979, 17, 416.
(8) Jurs, P. C.; Isenhour, T. L. Applications of Pattern Recognition; Wiley: New York, 1975.
(9) Frankel, D. S. Anal. Chem. 1984, 56, 1011.
(10) Griffiths, P. R.; Azarraga, L. V.; de Haseth, J. A.; Hannah, R. W.; Jakobsen, R. J.; Ennis, M. M. Appl. Spectrosc. 1979, 33, 543.
(11) Lowry, S. A.; Huppler, D. A.; Anderson, C. R. J. Chem. Inf. Comput. Sci. 1985, 25, 235.
(12) Woodruff, H. B.; Smith, G. M. Anal. Chim. Acta 1981, 133, 545.
(13) Kowalski, B. R.; Jurs, P. C.; Isenhour, T. L.; Reilley, C. N. Anal. Chem. 1969, 41, 1945.
(14) De Haseth, J. A.; Azarraga, L. V. Anal. Chem. 1981, 53, 2292.
(15) Koenig, J. L. Appl. Spectrosc. 1975, 29, 293.
(16) Delaney, M. F.; Warren, F. V., Jr. Anal. Chem. 1981, 53, 1460.

RECEIVED for review December 16, 1986. Accepted April 24, 1987.