Computer Storage and Search System for Infrared Spectra Including Peak Width and Intensity

Elwin C. Penski, Daniel A. Padowski, and James B. Bouck

Physical Chemistry Branch, Chemical Research Division, Chemical Laboratory, Edgewood Arsenal, Aberdeen Proving Ground, Md. 21010

In recent years, many reports have been written on systems for searching spectral data files, particularly those collected by the American Society for Testing and Materials (ASTM). Methods for searching the mass spectra files have been reported by many authors, among them Grotch (1); Wangen, Woodward, and Isenhour (2); and Hertz, Hites, and Biemann (3). A number of methods (4-6) for searching the ASTM infrared (IR) data file have been developed. Some workers, such as Horlick (7), have developed systems for comparing spectra in a more detailed manner than peak-by-peak methods. These latter methods have not been applied to the searching of large files. Codding and Horlick (8) have developed a small binary coded file of 35 direct current arc emission spectra and a method to search the file. Kowalski, Jurs, and Isenhour (9) have applied computerized learning machine methods to classify rather than match spectra. The studies listed above have been built on foundations such as statistics, file research, computer logic, learning machines, artificial intelligence, cross-correlation techniques, information theory, graph theory, and pattern recognition.

It has been found in this laboratory that while the ASTM file of infrared spectra is by far the largest file available (10), the data storage and data search system (11) has a number of drawbacks. The spectra file card format is not compatible with most card readers. The magnetic tape file is not compatible with some computers without programming to rearrange the format. There is no gradation in the intensity data stored with a spectrum and, in addition, the intensity data are ambiguous in that they are applied not to a specific peak but to all peaks within a 1-micron range. The search techniques are based mainly on the types of matching possible with a card sorter and, as a result, do not take into account many variables such as differences in spectrometers and the spectroscopist's interpretation of spectra. The ASTM system leads to a large number of matches if not used with additional information. The ASTM system does provide means to narrow these matches down by specifying considerable chemical data for each of the spectra, but the coding techniques for these chemical data are rather difficult to adapt to standard searching methods.

(1) S. L. Grotch, Anal. Chem., 45, 2 (1973).
(2) L. E. Wangen, W. S. Woodward, and T. L. Isenhour, Anal. Chem., 43, 1605 (1971).
(3) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
(4) D. H. Anderson and G. L. Covert, Anal. Chem., 39, 1288 (1967).
(5) D. S. Erley, Anal. Chem., 40, 894 (1968).
(6) R. W. Sebesta and G. G. Johnson, Jr., Anal. Chem., 44, 260 (1972).
(7) G. Horlick, Anal. Chem., 45, 319 (1973).
(8) E. Codding and G. Horlick, Appl. Spectrosc., 27, 366 (1973).
(9) B. R. Kowalski, P. C. Jurs, and T. L. Isenhour, Anal. Chem., 41, 1945 (1969).
(10) L. H. Gevantman, Anal. Chem., 44 (7), 30A (1972).
(11) "Codes and Instructions for Wyandotte-ASTM Punched Cards," American Society for Testing and Materials, Philadelphia, Pa., 1964.

The object of the study described in this report is the development of an IR spectral search system with fewer limitations, designed for up-to-date spectrometers and computers. In addition, the method described here should have greater utility in devices for detecting and determining the composition of pollutants, illicit materials, and military weapons. At present, it is not applicable to large files of 100,000 spectra, since the large files now available do not have the detailed spectral information used by this system. When detailed files have been developed, computer speeds have increased, and the system has been optimized, it should be applicable to any type of search method, including use with conversational time-sharing terminals. Such searches should yield more definitive results than present methods.

EXPERIMENTAL

In the method described in this report, the positions of the largest peaks are stored on a card along with a code number for the peak intensity and shape. The format and intensity codes are given in Tables I and II. Code numbers for the type of sample medium, cell, and spectrometer are stored as shown in Tables III, IV, and V, and the compound is given an arbitrary identification number on each card. It is assumed that a match between two sharp peaks is not equivalent to a match between a broad, weak peak and a sharp, strong peak, nor is a match between two sharp, intense peaks equivalent to a match between two sharp, weak peaks. Table II shows the weights given to a match between like peaks, and Table VI shows the extent to which the preceding weights are reduced by a match between unlike peaks. The weights given in Tables II and VI are based solely on the authors' judgement.
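The card layout of Table I can be decoded mechanically. The following is a minimal sketch, not the authors' program: the names `SpectrumCard` and `parse_card` are hypothetical, unused peak slots are assumed to be left blank, and the example card and its codes are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpectrumCard:
    peaks: List[Tuple[float, int]]  # (position, intensity/shape code)
    sample_code: int                # column 71
    cell_code: int                  # column 72
    spectrometer_code: int          # column 73
    unit_code: int                  # column 74: 0 = microns, 9 = cm^-1
    spectrum_id: str                # columns 75-80


def parse_card(card: str) -> SpectrumCard:
    """Decode one 80-column card laid out as in Table I."""
    card = card.ljust(80)
    unit_code = int(card[73])
    peaks = []
    for k in range(14):
        start = 5 * k                       # peak k+1 starts at column 5k+1
        position = card[start:start + 3].strip()
        code = card[start + 3].strip()
        if not position or not code:        # assumption: unused slots are blank
            continue
        raw = int(position)
        # Stored value is wavelength x 10 (microns) or wavenumber x 10^-1 (cm^-1).
        value = raw / 10 if unit_code == 0 else raw * 10
        peaks.append((value, int(code)))
    return SpectrumCard(peaks, int(card[70]), int(card[71]), int(card[72]),
                        unit_code, card[74:80].strip())


# Invented example: two peaks at 3.4 and 9.7 microns; the sample, cell,
# spectrometer, and identification fields are illustrative only.
example = "0343 0970 ".ljust(70) + "1010" + "000042"
card = parse_card(example)
print(card.peaks, card.spectrum_id)
```

Each peak occupies five columns (three for the position, one for the code, one blank), which is why the slot for peak k+1 begins at zero-based offset 5k.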

THEORY

The match sum, a measure of how good a match has been obtained, is calculated by summing several factors. The first factor is based on the assumption that the separation in wavelength of peaks reduces their probability of being a match by a functional relationship based on the normal distribution (12). Two peaks with exactly the same wavelengths have a distance factor of 1. The second and third factors are the weights taken from Tables II and VI, respectively. The match result is obtained by dividing the match sum between the unknown and known by the match sum between the unknown and itself. The following equations are used to calculate the match sum and the match results, respectively, for a pair of spectra.

$$M_r = \frac{M \times 100.0}{M_0} \qquad (2)$$

where

$M$ = match sum
$M_0$ = match sum of the unknown with itself
$M_r$ = match result
$Y_j$ = wavelength of known spectral peak $j$
$X_i$ = wavelength of unknown spectral peak $i$
$\sigma_i$ = estimated standard deviation at the $i$th peak; values of $\sigma_i$ were assumed to be $X_i/60$
$W_{ij}$ = weight from Table VI for peak intensity and shape for the $i$th peak of the unknown and the $j$th peak of the known
$R_i$ = maximum contribution of the $i$th unknown peak, or weight of a match between like peaks (taken from Table II)

(12) A. J. Duncan, "Quality Control and Industrial Statistics," Richard D. Irwin, Inc., Homewood, Ill., 1959.

ANALYTICAL CHEMISTRY, VOL. 46, NO. 7, JUNE 1974

Table I. Spectrum Card Format

| Peak | Column | Item |
|---|---|---|
| 1 | 1-3 | Wavelength × 10 or wavenumber × 10^-1 |
| 1 | 4 | Intensity code (Table II) |
| 2 | 6-8 | Wavelength × 10 or wavenumber × 10^-1 |
| 2 | 9 | Intensity code |
| ... | ... | ... |
| 14 | 66-68 | Wavelength × 10 or wavenumber × 10^-1 |
| 14 | 69 | Intensity code |
| | 71 | Sample preparation (Table III) |
| | 72 | Cell (Table IV) |
| | 73 | Spectrometer (Table V) |
| | 74 | Wavelength or wavenumber unit code (a) |
| | 75-80 | Spectra identification |

(a) Code: 0 for microns and 9 for cm^-1.

Table II. Intensity and Shape Coding System: Weights of Match

| Code | Intensity (a) | Shape | Weight of like peak match |
|---|---|---|---|
| 0 | Strong | Sharp (50) (b) | 1.0 |
| 1 | Strong | Medium to broad (200) (b) | 0.7 |
| 2 | Strong | Very broad (>200) | 0.3 |
| 3 | Medium | Sharp (40) (b) | 0.7 |
| 4 | Medium | Medium to broad (150) (b) | 0.5 |
| 5 | Medium | Very broad (>150) | 0.2 |
| 6 | Weak | Sharp (30) (b) | 0.5 |
| 7 | Weak | Medium to broad (100) (b) | 0.3 |
| 8 | Weak | Very broad (>100) | 0.1 |

(a) The peak transmissions for spectra with ca. 80% background transmission were coded as 0-30%, strong; 30-60%, medium; and >60%, weak.
(b) Maximum band half width (cm^-1) for designated shape.

Table III. Type of Sample Preparation

| Code (col. 71) | Type | Ineffective ranges, μ |
|---|---|---|
| 0 | Gas phase | |
| 1 | Liquid film | |
| 2 | Solid film | |
| 3 | KBr pellet | 25.0-40.0 |
| 4 | CCl4 solution | 12.0-15.0 |
| 5 | CHCl3 solution | 8.0-8.4, 12.0-15.0 |
| 6 | CS2 solution | 4.2-4.8, 6.2-7.2, 23.0-28.0 |
| 7 | Nujol mull | 3.4-3.6, 6.8-7.4 |
| 8 | Other mull or pellet | |
| 9 | Solutions and/or other | |

Table IV. Cell Code and Range of Effectiveness

| Code (col. 72) | Type of cell window or pellet | Effective range, μ |
|---|---|---|
| 0 | CsI | 1.0-40.0 |
| 1 | KBr | 1.0-25.0 |
| 2 | NaCl | 1.0-15.0 |
| 3 | KRS-5 | 1.0-40.0 |
| 4 | Polyethylene | 15.0-1000.0 |
| 5 | CsBr | 1.0-35.0 |
| 6 | CaF2 | 1.0-9.0 |
| 7 | No cell (a) | Unlimited |

(a) For plastic films.

Table V. Spectrometer Code and Range of Effectiveness

| Code (col. 73) | Type of spectrometer | Effective range, μ |
|---|---|---|
| 0 | NaCl prism | 2.0-15.0 |
| 1 | Grating | 2.5-40.0 |
| 2 | Grating | 2.5-25.0 |
| 3 | KBr prism | 10.0-25.0 |
| 4 | CsBr prism | 15.0-34.3 |
| 5 | NaCl prism | 5.5-15.0 |
| 6 | LiF prism | 1.5-6.0 |
| 7 | CaF2 prism | 1.5-9.0 |
| 8 | Grating | 2.5-15.0 |
| 9 | Prisms | 2.0-25.0 |

Table VI. Reduction Factor for Matches between Unlike Peaks

| Code | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.8 | 0.3 | 0.6 | 0.2 | 0.1 | 0.2 | 0.1 | 0.0 |
| 1 | 0.8 | 1.0 | 0.7 | 0.4 | 0.6 | 0.3 | 0.1 | 0.2 | 0.0 |
| 2 | 0.3 | 0.7 | 1.0 | 0.3 | 0.4 | 0.6 | 0.1 | 0.1 | 0.1 |
| 3 | 0.6 | 0.4 | 0.3 | 1.0 | 0.7 | 0.3 | 0.5 | 0.2 | 0.1 |
| 4 | 0.2 | 0.6 | 0.4 | 0.7 | 1.0 | 0.4 | 0.3 | 0.6 | 0.1 |
| 5 | 0.1 | 0.3 | 0.6 | 0.3 | 0.4 | 1.0 | 0.1 | 0.3 | 0.5 |
| 6 | 0.2 | 0.1 | 0.1 | 0.5 | 0.3 | 0.1 | 1.0 | 0.3 | 0.1 |
| 7 | 0.1 | 0.2 | 0.1 | 0.2 | 0.6 | 0.3 | 0.3 | 1.0 | 0.2 |
| 8 | 0.0 | 0.0 | 0.1 | 0.1 | 0.1 | 0.5 | 0.1 | 0.2 | 1.0 |
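The match sum and match result described in the Theory section can be illustrated with a short sketch. The exact form of Equation 1 is an assumption here: the code takes the match sum to be a double sum over unknown peaks i and known peaks j of R_i × W_ij × exp(-(Y_j - X_i)^2 / (2σ_i^2)) with σ_i = X_i/60, which agrees with the symbol definitions and gives a distance factor of 1 for coincident peaks, but it may differ in detail from the authors' equation. The function names and the example peaks are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, int]  # (wavelength in microns, intensity/shape code 0-8)

# Reduction factors for unknown code i (row) against known code j (column),
# as read from Table VI.
TABLE_VI: List[List[float]] = [
    [1.0, 0.8, 0.3, 0.6, 0.2, 0.1, 0.2, 0.1, 0.0],
    [0.8, 1.0, 0.7, 0.4, 0.6, 0.3, 0.1, 0.2, 0.0],
    [0.3, 0.7, 1.0, 0.3, 0.4, 0.6, 0.1, 0.1, 0.1],
    [0.6, 0.4, 0.3, 1.0, 0.7, 0.3, 0.5, 0.2, 0.1],
    [0.2, 0.6, 0.4, 0.7, 1.0, 0.4, 0.3, 0.6, 0.1],
    [0.1, 0.3, 0.6, 0.3, 0.4, 1.0, 0.1, 0.3, 0.5],
    [0.2, 0.1, 0.1, 0.5, 0.3, 0.1, 1.0, 0.3, 0.1],
    [0.1, 0.2, 0.1, 0.2, 0.6, 0.3, 0.3, 1.0, 0.2],
    [0.0, 0.0, 0.1, 0.1, 0.1, 0.5, 0.1, 0.2, 1.0],
]


def match_sum(unknown: List[Peak], known: List[Peak],
              like_weight: Dict[int, float]) -> float:
    """Assumed form of the match sum M (see the lead-in for the caveat)."""
    total = 0.0
    for x, code_u in unknown:
        sigma = x / 60.0                      # sigma_i = X_i / 60
        for y, code_k in known:
            distance = math.exp(-((y - x) ** 2) / (2.0 * sigma ** 2))
            total += like_weight[code_u] * TABLE_VI[code_u][code_k] * distance
    return total


def match_result(unknown: List[Peak], known: List[Peak],
                 like_weight: Dict[int, float]) -> float:
    """Equation 2: match sum scaled by the unknown's match sum with itself."""
    m0 = match_sum(unknown, unknown, like_weight)
    return 100.0 * (match_sum(unknown, known, like_weight) / m0)


# Like-peak weights for the three codes used in this invented example.
R = {0: 1.0, 1: 0.7, 3: 0.7}
unknown = [(3.4, 3), (5.8, 0), (9.7, 1)]
known = [(3.45, 3), (5.8, 1), (9.7, 1)]
print(round(match_result(unknown, known, R), 1))
```

Because the denominator is the unknown's match sum with itself, any spectrum compared with itself scores 100, as reported in the results.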

Table VII. Match of “Unknown” Spectrum (Diethylchlorophosphate, Liquid Film)

| Compounds | Compared peaks | Match result | Cell | Spectrometer |
|---|---|---|---|---|
| Diethylchlorophosphate (Liquid film) | 14 | 100 | CsI | 2.5-40 μ Grating |
| Diethylchlorophosphate (Liquid film) | 12 | 82.0 | NaCl | NaCl prism |
| Triethylphosphate (Liquid film) | 12 | 70.3 | KBr | NaCl prism |
| Triethylphosphate (Liquid film) | 14 | 64.2 | CsI | 2.5-40 μ Grating |
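The "compared peaks" counts in a search such as Table VII come from restricting each comparison to the wavelength region that is effective for both spectra. A minimal sketch of that filter follows, assuming each spectrum's effective range has already been reduced to a single (low, high) interval in microns. The function names and the peak list are illustrative and not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[float, float]  # (low, high) in microns


def common_range(a: Range, b: Range) -> Optional[Range]:
    """Overlap of two effective ranges, or None if they do not overlap."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


def compared_peaks(peaks: Sequence[float], unknown_range: Range,
                   known_range: Range,
                   ineffective: Sequence[Range] = ()) -> List[float]:
    """Keep peaks inside the common range and outside every ineffective range."""
    window = common_range(unknown_range, known_range)
    if window is None:
        return []
    low, high = window
    return [p for p in peaks
            if low <= p <= high
            and not any(a <= p <= b for a, b in ineffective)]


# Invented peak list; the ranges follow the 2-15 vs 10-25 micron example in the text.
peaks = [3.4, 5.8, 9.7, 10.2, 12.9, 14.1]
print(compared_peaks(peaks, (2.0, 15.0), (10.0, 25.0)))  # [10.2, 12.9, 14.1]
# With an ineffective interval such as 12.0-15.0 microns (one of those in Table III):
print(compared_peaks(peaks, (2.0, 15.0), (10.0, 25.0), [(12.0, 15.0)]))  # [10.2]
```

Treating the ineffective ranges of the sample medium as exclusions inside the common window is a design assumption of this sketch; the paper states only that wavelengths must fall in the effective ranges of both spectra.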

Not all peaks in the two spectra are counted in the match sums if the spectra are run with a different medium, cell, or spectrometer. Tables III, IV, and V list the ineffective and effective ranges for different media, cells, and spectrometers. Only those wavelengths which fall in the effective ranges of both spectra are compared. Matches between each pair of spectra may be run twice; the second match utilizes a correction for linear wavelength displacement obtained from the first attempted match if such a correction is required. The wavelength correction was calculated by replacing $R_i$ in Equation 1 by $R_i(Y_j - X_i)$ and by dividing the resulting value obtained from the modified right side of Equation 1 by the absolute value of $M_0$. This displacement correction procedure was tested with spectra where the wavelengths of the peaks were deliberately shifted less than 1 micron. It worked satisfactorily.

RESULTS AND DISCUSSION

All of the spectra in a 57-spectrum file of organophosphorus compounds were run as “unknowns” against the total file. As would be expected, each produced a match result with itself of 100. The first sample “search” is shown in Table VII. In the sample searches, the number of compared peaks is the number of peaks in the overlapping region of the two spectra being examined. For example, if an “unknown” spectrum has a range of 2 to 15 microns and the “known” a range of 10 to 25 microns, only those peaks in the 10- to 15-micron region can contribute to the match result and are called the compared peaks.

In all of the searches, it was found that when a match result above 90 occurred, the compounds compared were identical. Generally, all comparisons of different spectra of the same compounds yielded a match result of 80 or more as illustrated in Table VIII. The differences between spectra of the same compound from various types of spectrometers or cells result from actual variation in the spectra and not the search system. When compounds are run on different spectrometers, with unlike cells, using samples of varying purity, and by many spectroscopists, some dissimilarities in spectra must be expected. Also, there may be some differences in the coding of the spectral data, but this should be minimized by the assigning of the lower weights to matches between weak peaks.

Table VIII. Search Using Tri-n-butylphosphate Spectrum (Liquid Film, CsI Cell, and Grating Spectrometer (2.5-40 μ)) as the “Unknown”

| Compounds | Spectra type | Match result | Compared peaks |
|---|---|---|---|
| Tri-n-butylphosphate | Liquid film, CsI cell, grating (2.5-40 μ) | 100 | 14 |
| Tri-n-butylphosphate | Liquid film, KBr cell, grating (2.5-25 μ) | 88.6 | 14 |
| Tri-n-butylphosphate | Liquid film, KBr cell, prisms (2-25 μ) | 82.0 | 14 |
| Di-n-butyl n-butylphosphonate | Liquid film, KBr cell, grating (2.5-25 μ) | 77.7 | 14 |

High match results were often found for different spectra of the same compound even where only a few peaks could be compared as in Table IX. The “unknown” listed in Table IX yielded no other spectra with match results above 50 besides the ones listed in that table.

Table IX. Search Using Triphenylphosphate Spectrum (KBr Pellet and KBr Prism) as the “Unknown”

| Compounds | Spectra type | Match result | Compared peaks |
|---|---|---|---|
| Triphenylphosphate | KBr pellet, KBr prism | 100 | 12 |
| Triphenylphosphate | KBr pellet, NaCl prism | 100 | 4 |

Generally, there were very few linear shifts of one spectrum with respect to another to improve matches. The largest observed improvement was with the “unknown” of dichlorophenylphosphine [liquid film, CsBr cell, and grating (2.5-40 μ)]. The match result was improved by 9.3 with a shift of this spectrum by -0.07 micron. The “known” in this case had a final match result of 70.5 and was dichlorophenylphosphine sulfide [liquid film, CsI cell, and grating (2.5-40 μ)]. In the case of the above mentioned dichlorophenylphosphine “unknown”, there were five match results above 50. A spectral shift for each of these resulted in an improved match result. The distances of these shifts were -0.07, -0.06, -0.06, -0.01, and -0.03 microns. The consistency of these shifts, despite their small size, indicates that the particular “unknown” spectrum is probably slightly in error in the designation of specific wavelengths.

The effectiveness of the system in aiding the user in determining an unknown structure was demonstrated in a number of cases. For example, the use of diethyl 4-nitrophenyl phosphate (CCl4 solution, NaCl cell, and NaCl prism) as an “unknown” produced only one match result above 50, diethyl 4-aminophenyl phosphate (liquid film, KBr cell, and NaCl prism), which had a match result of 60.1. Thus, even though a compound may not be in the file, the highest match results generally correspond to very similar compounds.

Currently the file has about 60 spectra of organophosphorus compounds and a search requires about 1.3 seconds. Therefore, a search of 10,800 spectra would take about 4 minutes of Univac 1108 time. This time could probably be reduced if some effort were devoted to reprogramming for speed.

ACKNOWLEDGMENT

We thank John J. Callahan and Donald R. Bowie for their helpful comments.

Received for review January 26, 1973. Accepted January 22, 1974. This project was supported in part by the U.S. Army Materiel Command Computer-Aided Design and Engineering Program.
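The linear wavelength displacement correction described in the Theory section (replace R_i by R_i(Y_j - X_i) in Equation 1 and divide by the absolute value of M_0) can be sketched as follows. This is an illustration under assumptions rather than the authors' program: Equation 1 is assumed to be the double sum of R_i × W_ij × exp(-(Y_j - X_i)^2 / (2σ_i^2)) with σ_i = X_i/60, the function names are hypothetical, and the peaks, weights, and 0.03-micron shift are invented.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, int]  # (wavelength in microns, intensity/shape code)


def weighted_sum(unknown: List[Peak], known: List[Peak],
                 like_weight: Dict[int, float],
                 reduction: Dict[Tuple[int, int], float],
                 with_offset: bool = False) -> float:
    """Assumed Equation 1; with_offset=True multiplies each term by (Y_j - X_i)."""
    total = 0.0
    for x, cu in unknown:
        sigma = x / 60.0
        for y, ck in known:
            term = (like_weight[cu] * reduction[(cu, ck)]
                    * math.exp(-((y - x) ** 2) / (2.0 * sigma ** 2)))
            total += term * (y - x) if with_offset else term
    return total


def displacement(unknown, known, like_weight, reduction) -> float:
    """Estimated linear shift of the unknown relative to the known, in microns."""
    m0 = weighted_sum(unknown, unknown, like_weight, reduction)
    return weighted_sum(unknown, known, like_weight, reduction, True) / abs(m0)


def match_result(unknown, known, like_weight, reduction) -> float:
    m0 = weighted_sum(unknown, unknown, like_weight, reduction)
    return 100.0 * (weighted_sum(unknown, known, like_weight, reduction) / m0)


def apply_shift(peaks: List[Peak], shift: float) -> List[Peak]:
    return [(x + shift, code) for x, code in peaks]


# Invented example: every peak has code 0; the known is the unknown moved by 0.03 micron.
R = {0: 1.0}
W = {(0, 0): 1.0}
unknown = [(3.4, 0), (5.8, 0), (9.7, 0)]
known = apply_shift(unknown, 0.03)

shift = displacement(unknown, known, R, W)
first = match_result(unknown, known, R, W)
second = match_result(apply_shift(unknown, shift), known, R, W)
print(shift, first, second)
```

Under these assumptions the estimate is a weighted mean of the peak offsets, so it slightly underestimates the true displacement, and the second match run with the shifted unknown scores higher than the first.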
