Anal. Chem. 1983, 55, 1548-1553
Spectral Compression Procedure for Resolution Independent Infrared Search Systems
P. M. Owens¹ and T. L. Isenhour*
Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27514
A compound identification procedure that utilizes the time domain representations of infrared absorbance spectra is presented. This algorithm is shown to be independent of spectral resolution. Compression of the time domain representations through single bit quantization is described. This compression allows a 100-fold decrease in infrared spectral storage requirements. Search results using clipped and unclipped time domain representations from spectra collected at different resolutions are reported. The spectral transform representations are found to produce reliable search results for infrared spectra calculated from 512 and 256 point interferograms.
The increased automation of sample analysis has created a need for versatile and efficient spectral matching techniques. While large spectral libraries enhance the reliability of an analysis, they also require systems with large storage capacities. Additionally, spectral libraries continue to grow with increasing data availability and with an increasing demand for reliable sample analysis. With this continued expansion of the data base, a need for spectral compression techniques has developed.

One method often employed involves encoding only peak locations into library data files. The infrared spectral library collected by the American Society for Testing and Materials (ASTM) locates absorption maxima to the nearest tenth of a micron. While this technique allows efficient data compression, much of the spectral intensity information is removed. A comparison of search algorithms which use only peak locations with those which incorporate intensity data has shown that the latter yield significantly superior search results (1).

Incorporating spectral intensity information while implementing data compression requires more sophisticated mathematical techniques. In one case, factor analysis has been used to effect a significant reduction in infrared spectral library size (2). Additionally, spectral Fourier transform representations have been utilized in compressing mass spectra (3) and infrared spectra (4). The Gram-Schmidt orthogonalization procedure has also been used to establish interferometric libraries (5), while phase correction has been applied to this procedure to remove instrumental effects (6).

This paper presents a spectral compression technique utilizing the Fourier transform representations of infrared spectra. The developed compression algorithm is independent of spectral resolution. Search results comparing 4 cm⁻¹ resolution library spectra with 64 cm⁻¹ and 128 cm⁻¹ resolution spectra are reported.
Further compression of spectral transformations through clipping (single bit quantization) is presented. Search results using complete spectra are compared to those obtained with clipped spectral transform representations.

¹ Present address: Department of Chemistry, United States Military Academy, West Point, NY 10996.
THEORY

Spectral Transform Representations. An effective spectral compression algorithm should incorporate intensity information. One such technique involves calculating the inverse Fourier transform, s(t), of an infrared absorbance spectrum, S(f), as denoted in eq 1.

$$s(t) = \int S(f)\, e^{-i 2\pi f t}\, df \tag{1}$$

The procedure for calculating the -i (inverse) transform of a function has been previously described (7). The complex S(f) array (eq 1) is first reflected about its last point. The complex conjugate of this extended array is then passed to the fast Fourier transform (FFT) routine to produce s(t).

A simpler procedure will yield identical results. The s(t) function can be obtained from the cosine transform of the real S(f) array which has been zero filled to twice its original size. This second inverse (-i) transform procedure is valid for all arrays obtained from forward (+i) transforms of time representations with a mean of 0. Since the zero filling is simply a time domain interpolation, its inclusion is not necessary, and a simple time domain spectral representation can be determined from the cosine transform of the infrared absorbance spectrum (eq 2).

$$s(t) = \int S(f) \cos(2\pi f t)\, df \tag{2}$$
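As an illustration of eq 2, the following minimal sketch computes a discrete cosine sum over a sampled absorbance spectrum. The discretization (kernel cos(2πkm/N), output limited to the first N/2 points), the function names, and the 8-point spectrum are assumptions made for this example and are not taken from the original work. The sketch also checks numerically that zero filling the spectrum to twice its length only interpolates the time domain representation.

```python
import math

def cosine_transform(absorbance):
    """Discrete analogue of eq 2: s[m] = sum_k S[k] * cos(2*pi*k*m / N).

    Only the first N/2 time points are returned, because the cosine sum
    of a real array is symmetric about N/2.
    """
    n = len(absorbance)
    return [
        sum(a * math.cos(2.0 * math.pi * k * m / n) for k, a in enumerate(absorbance))
        for m in range(n // 2)
    ]

def normalize_to_time_zero(s):
    """Scale a time domain representation so that its value at time 0 is 1."""
    return [v / s[0] for v in s]

# Hypothetical 8-point absorbance spectrum (absorbance is 0 at frequency 0).
spectrum = [0.0, 0.1, 0.7, 0.2, 0.0, 0.4, 0.1, 0.0]

s = cosine_transform(spectrum)
s_norm = normalize_to_time_zero(s)

# Zero filling to twice the length interpolates the time domain:
# every second point of the zero-filled result reproduces s.
filled = cosine_transform(spectrum + [0.0] * len(spectrum))
```

Here `s[0]` is simply the sum of all absorbances, which is why normalizing to the value at time 0 removes the dependence on overall band intensity.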
All subsequent references either to time domain representations or to spectral transforms refer to inverse transformed absorbance spectra. The time domain representations were obtained with eq 2 using nonzero-filled absorbance spectra. The term interferograms will be used for the measurements which are transformed and ratioed in calculating absorbance spectra.

The calculation of each point in the time domain representation s(t) requires intensity information from all frequencies. Since these intensities are characteristic of a particular compound, a segment of points from s(t) would also exhibit unique characteristics. This allows the replacement of complete library spectra with short segments from time domain representations. Application of this technique to infrared spectral compression has been demonstrated by Azarraga, Williams, and de Haseth (4).

Comparison of Different Resolution Spectra. Utilization of time domain representations allows comparison of infrared spectra collected at different resolutions. Figure 1 illustrates the absorbance spectra of p-anisaldehyde calculated from collected 2048- and 512-point interferograms. These spectra can be compared by partially zeroing each spectral array so that data from the same frequency range are kept. The inverse transform of each spectrum is then calculated and normalized to its value at time 0. The initial segments of these normalized time domain representations can then be compared. The length of the comparable segments is N/2, where N represents the number of points in the lower resolution spectrum. In Figure 1, the 256 point (32 cm⁻¹) lower resolution spectrum allows the first 128 points (past the burst) to be used in time domain representation comparisons. Figure 2 shows the comparable segments obtained from the inverse transforms
0003-2700/83/0355-1548$01.50/0 © 1983 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 55, NO. 9, AUGUST 1983
[Figure 1: panels A and B; horizontal axis: wavenumber, 0 to 4000.]
Figure 1. Infrared spectra of p-anisaldehyde obtained from (A) 2048 point interferograms and (B) 512 point interferograms.
of the spectra in Figure 1. The degree of similarity between the two is evident.

Searches employing time domain representations can be extended to sample spectra with either lower or higher resolutions than library spectra. The lowest usable spectral resolution is dictated by the frequency range used in calculating library time domain representations. This frequency range is determined by which absorbances are retained prior to inverse transforming library spectra. For example, library time domain representations obtained by using absorbances between 740.7 and 3704 cm⁻¹ can be used with those obtained from 256 cm⁻¹ resolution sample spectra. This is possible since 256 cm⁻¹ spectra contain information at both library end points, 740.7 cm⁻¹ and 3704 cm⁻¹. On the other hand, 512 cm⁻¹ resolution sample spectra could not be directly used (without interpolation) since the only low frequencies available are at 0, 493.8, and 987.6 cm⁻¹. Thus, extremely low resolution spectral searches are possible by matching library frequency ranges with those available from low resolution spectra. Search results from spectra with a higher resolution than library spectra can also be obtained. This capability is limited only by the time necessary to calculate time domain representations of high-resolution sample spectra.

Clipped Spectral Transform Representation. Clipping allows further compression of spectral time domain representations. Since the infrared absorbance is 0 at a frequency of 0, the infrared time domain representation oscillates above and below 0. The clipping operation replaces a set of data points with a corresponding set of bits. Bit values of 1 correspond to data points greater than 0, while the remaining points are assigned a bit value of 0. The clipping operation has been applied to NMR (8) and mass (3) spectra. In both cases, it was determined that little information loss had occurred in clipping time domain representations.
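The clipping rule described above (bit value 1 for points greater than 0, bit value 0 otherwise) can be sketched as follows. This is only an illustration: the two 8-point segments are invented, and the helper names `clip`, `pack_bits`, and `matching_bits` are hypothetical and do not come from the original programs.

```python
def clip(segment):
    """Single bit quantization: 1 where a point is greater than 0, else 0."""
    return [1 if v > 0 else 0 for v in segment]

def pack_bits(bits):
    """Pack a list of 0/1 values into one integer (first bit most significant)."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word

def matching_bits(a, b, n_bits):
    """Count the positions at which two packed n_bits-long patterns agree."""
    return n_bits - bin(a ^ b).count("1")

# Invented time domain segments for a sample and a library entry.
sample_segment = [0.42, -0.10, -0.31, 0.05, 0.27, -0.02, 0.0, 0.11]
library_segment = [0.40, -0.12, 0.02, 0.07, 0.25, -0.05, -0.01, 0.09]

sample_bits = clip(sample_segment)
library_bits = clip(library_segment)
score = matching_bits(pack_bits(sample_bits), pack_bits(library_bits), len(sample_bits))
```

Packing the bits into a machine word means one exclusive-or followed by a bit count compares many points at once, which is one reason a clipped library is cheap to search as well as to store.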
A graphical depiction of information content can be obtained by comparing the inverse transform of the clipped representation with the complete spectrum. Figure 3 compares the isopropyl alcohol infrared spectrum obtained from the transform of the
Figure 2. Infrared time domain representations of p-anisaldehyde spectra with resolutions of (A) 8 cm⁻¹ and (B) 32 cm⁻¹.
128 bit clipped representation (Figure 3C) with the library spectrum (Figure 3A) and with the spectrum calculated from 512 point interferograms (Figure 3B). While the clipped representation yields a spectrum with large oscillations, some major peak information has been retained. Figure 4 illustrates a similar comparison for p-xylene. Once again, some major peak location and intensity information is present. In both cases, the entire spectrum has been compressed into 128 bits and the illustrated distortions are not totally unexpected. Additionally, any distortions introduced by the clipping operation should be similar for similar sample and library spectra. The compression of time domain intensities into single bits results in a 16-fold reduction in data storage requirements. In spite of this significant data reduction, it appears that much of the unique spectral information has been retained. Clipping of the time domain representations can therefore be extended to infrared spectral compressions.

EXPERIMENTAL SECTION

Instrumentation. The unknowns for search comparisons were collected from several sources in order to obtain a representative set of GC/FT-IR data. One data set was obtained from Leo V. Azarraga of the Environmental Protection Agency, Athens, GA. This data set was produced from a 0.8-μL injection of a solution containing 0.5 μg/μL each of acetophenone, acenaphthene, 2,3,5-trimethylphenol, methyl salicylate, bis(2-chloroethyl) ether, and 2,4,6-trimethylphenol. Three-fourths of the effluent was directed through a 2 mm inner diameter (i.d.) by 30 cm light pipe which was interfaced to a Digilab FTS-14. Effluents were detected with a mercury-cadmium-telluride (MCT) detector.

Three data sets were obtained from previous work at the University of North Carolina (UNC). All three data sets were collected with a Digilab FTS-14 through a 1.5 mm i.d. by 54 cm gold-coated light pipe. A triglycine sulfate detector was used to analyze a mixture containing m-methylanisole, 2,4-pentanedione, o-allylphenol, and propanal. An MCT detector was used to collect the other two data sets.

A fifth data set was collected at the Research Triangle Institute with a Nicolet 7199 infrared spectrometer interfaced to a Varian 3700 gas chromatograph. The Nicolet data were produced from a 3.0-μL injection containing 6.91 μg of cyclohexanol, 3.46 μg of dodecane, 5.59 μg of 1,2,4-trimethylbenzene, and 4.83 μg of isopentyl hexanoate. Data were collected through a 3 mm i.d. by 42 cm gold-coated light pipe with an MCT detector. A 1/8 in. by 6 ft column packed with 15% SE30 was used. The carrier gas flow rate was set at 28 mL/min of helium, while light pipe and transfer line temperatures were kept at 200 °C. The injector temperature was kept at 220 °C and a column temperature of 128 °C was maintained. The FT-IR scan rate was set at one interferogram per second.

Procedure. Program development and search computations utilized a 32K Nova minicomputer at UNC and a Prime 850 minicomputer network at the United States Military Academy. The 2300 compound EPA vapor phase library was used in all searches. Search results were obtained by using both unclipped and clipped time domain representations (inverse transforms) of absorbance spectra. For unclipped searches, a normalized vector was calculated by using a selected segment from the time domain representation of a sample spectrum. The dot products between this sample vector and the corresponding library vectors were then determined. The dot products were used to identify matches within the library. The bit searches involved clipping the time domain representations obtained from sample and library absorbance spectra. The top hits were then identified by determining the number of matching bits between these representations.

Table I. Search Results for Unknown Not Present in the Library
Target: Pentyl Propionate

hit no.  compound                          bits matched
1        propionic acid, butyl ester       122
2        propionic acid, propyl ester      120
3        citric acid, tributyl ester       119
4        propionic acid, isopentyl ester   119
5        butyric acid, methyl ester        116
Clipped Time Domain Searches. The clipped transform search algorithm was initially tested with unknowns not present in the library to determine if chemical similarity could be identified. Table I lists the search results for a pentyl propionate sample. The top five hits are all esters, and three of the top four hits are esters of propionic acid. Figure 5 depicts the spectrum of pentyl propionate with the library spectra from the top two hits. From the spectral similarities,
Figure 3. Infrared absorbance spectra of isopropyl alcohol: (A) EPA vapor phase library spectrum, (B) spectrum obtained from 512 point interferograms, and (C) spectrum obtained from the inverse transform of the 128 bit clipped time domain representation.

Figure 4. Infrared absorbance spectra of p-xylene: (A) EPA vapor phase library spectrum, (B) spectrum obtained from 512 point interferograms, and (C) spectrum obtained from the inverse transform of the 128 bit clipped time domain representation.
RESULTS AND DISCUSSION
Table II. Search Results by Using Different Segments of the Transformed Absorbance Spectrum

Target: Acenaphthene
         displacement of 1                           displacement of 60
hit no.  compound                     dot product    compound                             dot product
1        acenaphthene                 0.870          acenaphthene                         0.956
2        naphthalene, 1-methyl-       0.827          propylamine, 3,3'-(methylimino)di-   0.814
3        naphthalene, 1-ethyl-        0.825          bitolyl, m,m'-                       0.791
4        naphthalene, 1,5-dimethyl-   0.792          1,4-butanediamine                    0.778
5        naphthalene, 1,4-dimethyl-   0.782          hexanonitrile, 6-amino-              0.775

Target: Dodecane
         displacement of 1                           displacement of 256
hit no.  compound                     dot product    compound                             dot product
1        dodecane                     1.000          octyl disulfide                      0.977
2        tridecane                    0.999          tridecane                            0.976
3        hendecane                    0.998          dihexylamine                         0.972
4        tetradecane                  0.998          dodecane, 1-bromo-                   0.971
5        pentadecane                  0.997          1-undecanol                          0.964
Table III. List of Compounds Used in Library Searches

 1. acenaphthene               12. hexane
 2. acetone                    13. methanol
 3. acetophenone               14. 2,4-pentanedione
 4. p-anisaldehyde             15. o-allylphenol
 5. m-methylanisole            16. 2,3,5-trimethylphenol
 6. benzene                    17. propanal
 7. 1,2,4-trimethylbenzene     18. isopropyl alcohol
 8. 1-butanol                  19. m-xylene
 9. cyclohexanol               20. o-xylene
10. dodecane                   21. p-xylene
11. geraniol

it appears that the clipping operation has resulted in little information loss.

An analysis was then conducted to determine which portion of the spectral transform representation to select as the search vector. Table II lists the results from two searches by using different vector segments. In both, optimum chemical similarity was obtained with vectors beginning one point past the burst. For a displacement of 1, acenaphthene (naphthalene with a C₂H₄ group substituted across adjacent carbons) yields substituted naphthalenes for hits 2-5. However, if the search vector begins 60 points past the burst, the top hits do not appear to be chemically similar. Since the first segment of the spectral transform dictates the overall spectral shape, its omission from the search analysis apparently decreases the chemical similarity of the results. With unknowns that are not present in the library, search algorithms which do not include this initial portion of the spectral transform could prove to be unreliable.

For compounds present within the library, it appears that a displacement of 60 gives optimum specificity. Acenaphthene is more clearly identified as the correct compound with a displacement of 60. Its dot product is closer to 1.0 while those of the other hits are significantly lower. This specificity at a displacement of 60 agrees with previously reported work (9). Additionally, since segments
located away from the burst contain the higher frequency information needed to differentiate between similar compounds, one would expect the observed increase in specificity. The clipped transform search algorithm was then tested by using the several GC/FT-IR data sets described previously.

Figure 5. Search results using the 128 bit clipped time domain representation of an unknown not present in the library: (A) spectrum of the unknown, pentyl propionate, (B) library spectrum of the top hit, butyl propionate, and (C) library spectrum of the second hit, propyl propionate.

Table IV. Search Results for All 21 Compounds by Using Different Segments of Spectral Fourier Transform Representations

                                         search vector
correct      complete   256 point   256 bit     64 bit       64 bit
compound     spectra                (clipped)   (at burst)   (60 past burst)
top hit      100.0%     100.0%      90.5%       71.4%        76.2%
top 2 hits   100.0%     100.0%      95.2%       71.4%        76.2%
top 5 hits   100.0%     100.0%      95.2%       81.0%        76.2%
top 10 hits  100.0%     100.0%      100.0%      81.0%        90.5%
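The two search modes compared in Table IV, the dot product of normalized time domain segments and the count of matching bits after clipping, can be sketched as below. This is a minimal illustration under stated assumptions: the three library entries and the sample segment are invented numbers, the entry names are placeholders, and the function names are hypothetical rather than those of the original Nova or Prime programs.

```python
import math

def unit_vector(segment):
    """Normalize a time domain segment to unit length."""
    norm = math.sqrt(sum(v * v for v in segment))
    return [v / norm for v in segment]

def clip(segment):
    """Single bit quantization: 1 where a point is greater than 0, else 0."""
    return [1 if v > 0 else 0 for v in segment]

def dot_product_search(sample, library):
    """Unclipped search: rank library entries by dot product with the sample."""
    s = unit_vector(sample)
    scores = {
        name: sum(a * b for a, b in zip(s, unit_vector(seg)))
        for name, seg in library.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def bit_search(sample, library):
    """Clipped search: rank library entries by the number of matching bits."""
    s = clip(sample)
    scores = {
        name: sum(1 for x, y in zip(s, clip(seg)) if x == y)
        for name, seg in library.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented library of short time domain segments (placeholder names).
library = {
    "entry A": [0.9, -0.4, 0.3, -0.2, 0.1, -0.1],
    "entry B": [0.8, 0.5, -0.3, 0.2, -0.1, 0.1],
    "entry C": [0.7, -0.3, -0.4, 0.3, 0.2, -0.2],
}
# Invented sample segment: a slightly perturbed copy of entry A.
sample = [0.88, -0.42, 0.28, -0.18, 0.12, -0.09]

dot_hits = dot_product_search(sample, library)
bit_hits = bit_search(sample, library)
```

With only six points the bit scores are coarse, and ties between library entries become likely; this mirrors the loss of specificity reported for the shorter bit vectors.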
Table V. Search Results for Clipped Time Domain Representations
Target: Dodecane

128 bit search vector
hit no.  compound          bits matched
1        octyl disulfide   128
1        hendecane         128
1        dodecane          128
1        tridecane         128
5        17 compounds      127

256 bit search vector
hit no.  compound          bits matched
1        octyl disulfide   256
1        dodecane          256
3        tridecane         254
4        decyl disulfide   252
4        tetradecane       252
Table VI. Search Results for All 21 Compounds with Different Numbers of Initial Interferogram Points to Calculate the Infrared Absorbance Spectra (a)

                  number of interferogram points used to calculate spectra
correct compound  2048     512      256      128      64
top hit           100.0%   100.0%   90.5%    61.9%    33.3%
top 2 hits        100.0%   100.0%   95.2%    81.0%    52.4%
top 5 hits        100.0%   100.0%   95.2%    90.5%    76.2%
top 10 hits       100.0%   100.0%   100.0%   95.2%    85.2%

(a) Search vectors are from time domain representations (inverse transforms of absorbance spectra) and begin 1 point past the burst.
Table VII. Comparison of Search Results with Different Length Interferograms to Calculate Absorbance Spectra (a)

                  16 point search vector             32 point search vector
correct compound  2048 point      64 point           2048 point      128 point
                  interferogram   interferogram      interferogram   interferogram
top hit           61.9%           33.3%              81.0%           61.9%
top 2 hits        71.4%           52.4%              95.2%           81.0%
top 5 hits        85.7%           76.2%              100.0%          90.5%
top 10 hits       95.2%           85.2%              100.0%          95.2%

(a) Search vectors are from time domain representations obtained by inverse transforming the different resolution absorbance spectra.
A total of 21 compounds (all present within the library) were used in all search comparisons, with sample sizes ranging from 400 ng to over 10 μg. These compounds are listed in Table III. Search results for both clipped and unclipped representations are given in Table IV. In all cases the complete spectra correctly identified the unknown, as did the 256 point time domain search vector. The clipped 256 bit search vector correctly identified 19 of 21 unknowns as the top hit, with the other two being listed second and seventh, respectively. Further data compression using only a 64 bit search vector resulted in lower hit rates but correctly identified the unknown in over 70% of the samples. Placement of the 64 bit search vector at 60 points past the burst gave better search results than those with the vector beginning at the burst. This again illustrates the specificity of search vectors which begin away from the burst.

The lower hit rates attained by the 64 bit search vector were found to be due to similarities of spectra within the library. With dodecane as the unknown, 44 long chain hydrocarbon compounds were identified as having the same bit pattern. While the 64 bit search vector could be used to identify a class of compounds, its lack of specificity lessens its reliability in search algorithms. Table V compares dodecane search results using 128 and 256 bit search vectors. With a 128 bit vector, 4 compounds match all 128 bits and 17 others match 127 bits. The 256 bit search vector is a more discriminating function, yielding only two compounds which match the sample's bit pattern.

From these results the clipped infrared Fourier transform representation appears to contain sufficient information for utilization in search algorithms. These representations produce reliable search results for GC/FT-IR sample sizes as small as 400 ng/component. In order to maintain compound specificity, at least a 256 bit search vector is needed. Compression of the EPA library spectra into this 256 bit format allows more than a 100-fold reduction in spectral storage requirements.

Resolution Independent Time Domain Searches. As discussed previously, the spectral Fourier transform representation can be used in search algorithms which are independent of spectral resolution. A comparison was made to determine the required resolution for reliable searches. The spectral transform library was initially established (eq 2) by utilizing the infrared absorbances between 740.7 and 3704 cm⁻¹. Selection of this frequency range required that the interferograms used to calculate absorbance spectra contain at least 64 points. The GC/FT-IR data were 2048 point interferograms. Resolution was varied by using from 64 to 2048 interferogram points in spectra calculations.

Search results from spectra calculated at five different resolutions are listed in Table VI. All search vectors used were calculated from the time domain representations (inverse transforms) of these different resolution absorbance spectra. By use of 512 point interferograms to calculate absorbance spectra, all 21 compounds were correctly identified as the top hit. With 256 point interferograms, 19 of 21 compounds were listed as the top hit while the other two unknowns were second and eighth, respectively. The 256 point interferogram segments utilized in spectra calculations were centered at the light burst and yielded spectra with a resolution of approximately 128 cm⁻¹. Further reduction of collected interferogram points resulted in poorer search results. The 64 point interferograms (yielding 512 cm⁻¹ resolution spectra) correctly identified 33% of the unknowns and contained over 75% of the unknowns among the top five hits.

Table VII illustrates a comparison of search results obtained from 16 and 32 point search vectors. The 16 point search vectors were calculated from the time domain representations of 8 cm⁻¹ and 512 cm⁻¹ spectra, while the 32 point search vectors were obtained from the time domain representations of 8 cm⁻¹ and 256 cm⁻¹ spectra. Since these search vectors were calculated from short segments of time domain representations (beginning 1 point past the burst), they contained only low resolution information. In both cases search results improved significantly with search vectors calculated from the inverse transforms of higher resolution spectra. These superior search results give an indication of the magnitude of leakage and phase error effects induced in calculating absorbance spectra from extremely short interferograms. Since all of the search vectors contained only low resolution information, it also appears that extremely low resolution spectra maintain many of the unique characteristics of higher resolution spectra.

The infrared spectral time domain representation can be applied to GC/FT-IR analyses. For sample sizes greater than 400 ng, reliable search results are obtainable from 256 and 512 point interferograms using presently available data bases. With these shorter interferograms, a greater percentage of data collection occurs near the light burst, the region offering the highest interferometric signal to noise ratios. Griffiths has shown that a doubling of interferogram length requires a 4-fold increase in measurement time in order to match the signal to noise ratios of the lower resolution spectrum (10). Thus, it appears that the shorter interferograms may yield increased signal to noise ratios, provided, of course, that these shorter interferograms result in sufficiently increased scan rates. Future efforts are being directed toward this and toward determining resolution requirements for identifying compounds present in quantities approaching current GC/FT-IR sensitivity limits.
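The 4-fold figure attributed to Griffiths (10) can be recovered with a short worked step. The scaling relation used here is an assumption introduced for illustration and is not stated in the text: the signal to noise ratio is taken to be proportional to the width of the resolution element and to the square root of the measurement time.

```latex
% Assumed scaling (illustrative, not taken from the text):
% SNR is proportional to the resolution element \Delta\tilde{\nu}
% and to the square root of the measurement time t.
\[
  \mathrm{SNR} \;\propto\; \Delta\tilde{\nu}\,\sqrt{t}
\]
% Doubling the interferogram length halves \Delta\tilde{\nu}.
% Requiring equal SNR at measurement times t_1 (short) and t_2 (long):
\[
  \Delta\tilde{\nu}\,\sqrt{t_1} \;=\; \frac{\Delta\tilde{\nu}}{2}\,\sqrt{t_2}
  \quad\Longrightarrow\quad
  t_2 = 4\,t_1
\]
```

Under the same assumption, each halving of the interferogram length doubles the signal to noise ratio obtained in a fixed measurement time, which is the potential gain from the shorter interferograms discussed above.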
ACKNOWLEDGMENT

The authors wish to thank Leo V. Azarraga of the Environmental Protection Agency, Athens, GA, Dan T. Sparks and previous workers at the University of North Carolina, Chapel Hill, NC, and the Research Triangle Institute, Research Triangle Park, NC, for supplying and assisting in the collection of the GC/FT-IR data. Appreciation is also extended to R. B. Lam of Foxboro Analytical, Norwalk, CT, for his many helpful comments.
LITERATURE CITED

(1) Rasmussen, G. T.; Isenhour, T. L. Appl. Spectrosc. 1979, 33, 371-378.
(2) Hangac, G.; Wieboldt, R. C.; Lam, R. B.; Isenhour, T. L. Appl. Spectrosc. 1982, 36, 40-47.
(3) Lam, R. B.; Foulk, S. J.; Isenhour, T. L. Anal. Chem. 1981, 53, 1679-1684.
(4) Azarraga, L. V.; Williams, R. R.; de Haseth, J. A. Appl. Spectrosc. 1981, 35, 466-489.
(5) Small, G. W.; Rasmussen, G. T.; Isenhour, T. L. Appl. Spectrosc. 1979, 33, 444-450.
(6) de Haseth, J. A.; Azarraga, L. V. Anal. Chem. 1981, 53, 2292-2296.
(7) Lam, R. B.; Wieboldt, R. C.; Isenhour, T. L. Anal. Chem. 1981, 53, 889A-895A.
(8) Crawford, E. F.; Larsen, R. D. Anal. Chem. 1977, 49, 508-510.
(9) de Haseth, J. A.; Leclerc, D. F., presented at the 1982 FACSS meeting, paper 496.
(10) Griffiths, P. R. Anal. Chem. 1972, 44, 1909-1913.
RECEIVED for review January 13, 1983. Accepted April 22, 1983. This work was supported by the National Science Foundation Gr