Computer Search System for Retrieval of Infrared Data
Don H. Anderson and G. L. Covert
Eastman Kodak Company, Rochester, N. Y. 14650

A system of indexing and sorting infrared spectral data using punched cards has been in operation for about 15 years. The number of curves to be indexed is now about 100,000 and the punched card sorting technique is time-consuming. The search system described here uses an IBM 7080 computer to search the data on magnetic tape. Searches that previously took hours can now be made in minutes. Two examples are presented and a detailed explanation of the basis for the searches is given.
INFRARED SPECTRA of reference compounds are used routinely for identification purposes and may also be employed for structure correlation studies. The first widely accepted method of searching libraries of infrared spectral data was that described by L. E. Kuentzel in ANALYTICAL CHEMISTRY (1). The effort that went into the development and use of his proposals was extensive and centered around ASTM Committee E-13 on Absorption Spectroscopy. The subcommittee on reference data served as the coordination agency. The number of spectra available for reference increased to such an extent that Dr. Kuentzel along with many other spectroscopists recognized that the next logical step was to computerize the handling of data. Efforts of R. A. Sparks (2), Lee D. Smithson (3), E. A. Diephaus and T. A. Entzminger (4), Sadtler Research Laboratories (5), and others who worked to develop computer systems must be recognized.

The system described here compares spectra at the rate of about 10,000 per minute and arranges them in order on a probability of match basis. This is a faster and more sophisticated system than any others so far described in any published or private disclosures made to us.

The basic data for the system are derived from the spectral data absorption cards such as those available from ASTM. The initial program converts these data from punched card to magnetic tape format. This can be accomplished by using any computer which has a column binary feature. This is necessary because of the non-Hollerith coding of the spectral and chemical data fields in the original Wyandotte-ASTM spectral data cards. By using column binary conversion, a card punched in non-Hollerith code can be converted to Hollerith form in computer memory. In addition this conversion program also performs a preliminary audit which checks for proper identification coding, which was in columns 79-80 of the original spectral data cards. An error list is generated with a message for each discrepancy.
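To make the column binary idea concrete, the following is a minimal sketch in Python, not the authors' conversion program. It assumes each card column is read as a 12-bit integer with the Y overpunch row in the most significant bit; the names `ROWS` and `punches` and that bit ordering are hypothetical choices made for illustration.

```python
# Row labels of an 80-column card, top to bottom. Y and X are the two
# overpunch rows. The bit ordering used below is an assumption.
ROWS = ["Y", "X", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]


def punches(column_bits: int) -> list[str]:
    """Return the rows punched in one card column given as a 12-bit integer.

    Bit 11 (most significant) is taken to be the Y row and bit 0 the 9 row.
    """
    if not 0 <= column_bits < 1 << 12:
        raise ValueError("a card column holds exactly 12 punch positions")
    return [row for i, row in enumerate(ROWS) if column_bits & (1 << (11 - i))]


# A column with both overpunches set, and one with the 1 and 9 rows set.
print(punches(0b110000000000))  # ['Y', 'X']
print(punches(0b000100000001))  # ['1', '9']
```

Reading the column as a bit pattern keeps every punch, which is the point of the column binary feature: a spectral data column may carry several independent punches that do not correspond to a single character.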
(1) L. E. Kuentzel, ANAL. CHEM., 23, 1413 (1951).
(2) R. A. Sparks, “Storage and Retrieval of Wyandotte-ASTM Infrared Spectral Data Using an IBM 1401 Computer,” ASTM, Philadelphia, Pa., 1964.
(3) L. D. Smithson, L. B. Fall, F. D. Pitts, and F. W. Bauer, “Storage and Retrieval of Wyandotte-ASTM Infrared Spectral Data Using a 7090 Computer,” Technical Documentary Report No. RTD-TDR-63-4265, Research and Technology Division, Wright-Patterson Air Force Base, Ohio, 1964.
(4) T. A. Entzminger and E. A. Diephaus, “Storage and Retrieval of Wyandotte-ASTM Infrared Spectral Data Using a Honeywell-400 Computer,” U. S. Public Health Service, Robert Taft Sanitary Engineering Center, Cincinnati, Ohio, 1964.
(5) Sadtler Research Laboratories, 1517 Vine St., Philadelphia, Pa.
ANALYTICAL CHEMISTRY
As a result of the 3 to 1 expansion of some of the punched card fields, the tape records from this conversion are 200 characters long. As a safety precaution this tape is duplicated and used as input to a 7080 computer data condensation program. This program accomplishes the following:
1. Gives a detailed audit on each record and discrepancy messages where necessary.
2. Builds in a ±0.1 micron tolerance from each original coded spectral peak.
3. Selects from this tape only the data fields actually to be used in searching and puts these on another tape. This condenses the record length to 150 characters. This eliminates, for example, card columns 16 through 28, shortens the tape record and, as a consequence, the computer read time is lessened by as much as 25%.

When supplemental or corrected data are to be added, the same approach is initially followed to the point where the data are on tape and have been audited. The corresponding incorrect data are then removed from the old master tape using either an IBM 1460 or 7080 computer and a utility extract program, specifying the serial numbers of spectra to be deleted. The new data from the card to tape and audit program, after being sorted into numerical sequence, are then merged with the previous master file, less the deletions, to create a new file.

Data for one infrared spectrum are represented by 150 character records on the tapes. The tape is blocked 10 records per block. A higher blocking factor is not used in order to conserve computer memory for the storage of matching spectra.

To initiate a search, the spectroscopist first fills out his request form. A maximum of 20 spectral terms and 15 chemical classification terms together with either melting point or boiling point request terms can be used in a single search. Each term may be mandatory or desirable, and may require a feature to be present or absent, but there must be at least one mandatory term. If there are no search terms, then the letters M and D are to be used. Search logic is based tentatively on accepting all curves which fulfill all mandatory request terms, and the best answers are those fulfilling the largest number of spectral terms including both mandatories and desirables. Failure to meet a mandatory term results in immediate rejection. Spectral data terms are used exactly as they are read off the spectrogram.
Chemical terms are the same as the codes in the Wyandotte-ASTM manual of instruction for indexing spectral absorption data (6). In other words the complete request is in the people-language of the spectroscopist. Melting point or boiling point request terms may be recorded as mandatory if coded, that is, they are enforced only when the reference record carries these data. This approach permits one to retain a record if it fulfills all other mandatories even if the data have not been originally recorded. In addition the search may request all boiling and melting point data on all answers without stating specific requirements.

After the request form has been completed, the data are key-punched into cards and forwarded to the computer area together with control cards.

(6) “Codes and Instructions for Wyandotte-ASTM,” ASTM, 1916 Race Street, Philadelphia, Pa.
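The acceptance rule described above (mandatory and desirable terms, present or absent, plus the "mandatory if coded" treatment of melting and boiling points) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Autocoder program: the `Term` class, the `score` and `point_ok` functions, and the representation of a spectrum as a set of code strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    code: str        # e.g. "122" for a peak at 12.2 microns (assumed encoding)
    mandatory: bool  # True for an M term, False for a D term
    present: bool    # True = feature must be present, False = must be absent


def point_ok(recorded: Optional[float], target: float, tolerance: float) -> bool:
    """Melting/boiling point test; uncoded data never cause rejection."""
    if recorded is None:
        return True
    return abs(recorded - target) <= tolerance


def score(codes, terms, recorded_mp=None, mp_request=None):
    """Return None on rejection, else the number of fulfilled terms."""
    if not any(t.mandatory for t in terms):
        raise ValueError("a request needs at least one mandatory term")
    if mp_request is not None and not point_ok(recorded_mp, *mp_request):
        return None
    fulfilled = 0
    for t in terms:
        ok = (t.code in codes) == t.present
        if t.mandatory and not ok:
            return None  # failure of a mandatory term: immediate rejection
        if ok:
            fulfilled += 1
    return fulfilled


terms = [Term("122", True, True), Term("14A", True, False), Term("072", False, True)]
print(score({"122", "072"}, terms))  # 3: both mandatories and the desirable met
print(score({"122", "14A"}, terms))  # None: a mandatory negative term failed
```

Because every accepted spectrum fulfills all mandatory terms, ranking by the total count is equivalent to ranking by the number of desirable terms fulfilled.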
Figure 1. Infrared spectrum of p-dichlorobenzene, selected as typical of many types of unknowns with only a few absorption bands [plot; horizontal axis: wavelength (microns)]

The data on the search cards and the control cards are then put on magnetic tape via a 1460 computer. The search program and this search request tape are then read into the 7080 computer to search the master file. The program, written in Autocoder for searching only IR data on a 7080 computer, first checks that the correct master search tape has been mounted. It then reads the search request and checks for such items as the following and writes out error messages when necessary:
1. Proper use of logical connectors
2. Card sequence within the request
3. Compliance with the maximum of 20 spectral terms and 15 chemical terms
4. Proper range of search terms, for example, no spectral terms below 2 or above 15 microns
5. At least one mandatory term
6. Overlapping of positive and negative terms
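Several of the checks listed above lend themselves to a short sketch. The Python function below is a hypothetical illustration, not the program's actual audit: `validate_request`, the tuple layout, and the error wording are assumptions, and wavelengths are held as integer tenths of a micron (122 for 12.2 microns). Connector and card-sequence checks are omitted, and the overlap test only catches a wavelength requested as both present and absent.

```python
def validate_request(spectral, chemical):
    """Return a list of error messages for one search request.

    spectral: list of (tenths_of_micron, mandatory, present) tuples
    chemical: list of (code, mandatory, present) tuples
    """
    errors = []
    if len(spectral) > 20:
        errors.append("more than 20 spectral terms")
    if len(chemical) > 15:
        errors.append("more than 15 chemical terms")
    for tenths, _, _ in spectral:
        if not 20 <= tenths <= 150:
            errors.append(f"spectral term {tenths / 10} outside 2-15 microns")
    if not any(mandatory for _, mandatory, _ in spectral + chemical):
        errors.append("no mandatory term")
    positives = {t for t, _, present in spectral if present}
    negatives = {t for t, _, present in spectral if not present}
    for t in sorted(positives & negatives):
        errors.append(f"term {t / 10} requested as both present and absent")
    return errors


print(validate_request([(122, True, True), (99, True, True)], []))  # []
print(validate_request([(122, False, True), (122, False, False)], []))
```

Collecting all messages before returning, rather than stopping at the first failure, mirrors the idea of writing out an error message for each discrepancy.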
The search program stores in memory up to five requests before searching the master files. The master file is then read sequentially, each spectrum being compared against each request stored. Hits or matches are stored internally in computer memory. The 7080 computer has 80,000 positions of core storage. Up to 100 matches for each request can be stored. When the number of matching curves exceeds 100, the best 100 answers are stored in memory based on the fulfillment of desirable terms in the request. Obviously all mandatory terms must be fulfilled. When the master file has been completely passed, the best 100 or fewer stored answers for each stored request are written on tape for subsequent printing on an off-line computer. The print-out answer sheet together with control cards and original request cards is returned to the spectroscopist.
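Keeping only the best 100 answers during a single sequential pass can be sketched with a bounded min-heap. The Python below illustrates the idea only and makes no claim about how the 7080 program organizes its memory. The function name `best_matches` is hypothetical, and the tie-break (earlier records in the file win) is an assumption, since the text does not say how equal scores are ordered.

```python
import heapq


def best_matches(scored, limit=100):
    """Keep the `limit` best answers from one pass over the master file.

    scored: iterable of (serial, score) pairs for spectra that already met
    all mandatory terms; score counts the fulfilled terms.
    """
    heap = []  # min-heap of (score, -file_order, serial); weakest answer on top
    for order, (serial, score) in enumerate(scored):
        entry = (score, -order, serial)
        if len(heap) < limit:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # drop the weakest stored answer
    # Best score first; among equal scores, earlier file position first.
    return [serial for _, _, serial in sorted(heap, reverse=True)]


print(best_matches([("A", 1), ("B", 3), ("C", 2), ("D", 3)], limit=2))  # ['B', 'D']
```

Memory use stays fixed at `limit` entries per request no matter how many spectra meet the mandatory terms, which matches the constraint of a small core store shared by up to five requests.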
Figure 3. Illustration of fields A, B, and C on the IBM card

VOL. 39, NO. 11, SEPTEMBER 1967
If more than 5 requests are submitted at one time, the sixth and later requests are read into the computer after the search has been completed for the first set of five. The master tape is rewound and another sequential search is made for the additional requests, up to 5 at a time. The computer can only store up to 5 requests and their best 100 answers on any given pass of the master file.

A discussion of two examples, p-dichlorobenzene and resorcinol, will illustrate how the system can be used. Let us assume that the spectrum in Figure 1 represents an unidentified Sample No. 1. Because this spectrum is quite simple, one can use many exact negative terms in order to eliminate much unwanted data. The initial step in employing the computer search technique is to fill in the necessary data on the infrared retrieval request form as shown in Figure 2. The request provides for a maximum of the best 100 answers to be listed in the computer print-out. Even though 100 are requested, the requester may get fewer if the number of mandatory terms is large or if a good match is not on file. The search could be limited to the best 25, 10, or even one. In our experience, however, there appears to be little reason to restrict oneself to fewer than 100 or to ask for more than this number, because there is no significant difference in cost for 10 or 100 answers to be printed out.

Search terms are entered in the appropriate boxes and require one or more punched cards to be prepared by the key punch operator. The 20 boxes permit one to employ up to 20 spectral terms for search purposes. The letter M, for mandatory, in the first box signifies that the answer spectrum must have absorption peaks at the wavelengths designated, i.e., 12.2, 9.9, 9.2, and 6.7 microns, respectively. When a peak at 12.2 microns is specified, the computer automatically searches and includes curves coded at 12.1, 12.2, and 12.3 microns, i.e., ±0.1 micron for each absorption peak requested.
This minimizes the possibility of losing data due to slight differences in interpretation of the exact wavelength.

[Figure residue (request cards and print-out for Example 1): requester George L. Covert, Industrial Laboratory; request string beginning M122*099*092*067*; column headings REQUEST NO., CURVE NO., PEAKS, BP OR MP.]

The asterisk * denotes logical AND. It serves as a connector between successive terms in the request. These mandatory absorption peaks are followed by a series of mandatory negative requirements for no absorption, indicated by an asterisk and minus sign, *-, which denotes AND NOT, together with the codes -14A, -14B, -14C, etc. The letters A, B, and C enable one to eliminate references having certain specific coded data. The letters can be used to eliminate data in a condensed manner and so conserve space for other important terms when necessary.

To illustrate this usage let us refer to Figure 3, which represents a blank IBM spectral data card. When the letter A is used, it represents the combined Y overpunch, X overpunch, 0 and 1 positions in a vertical column such as the 14 column as outlined in the figure. The letter B represents the positions 2, 3, 4, and 5 in column 14 and C represents 6, 7, 8, and 9. This of course is applicable to any one of the first 15 columns, which are used for spectral data. In addition, the letters may also be used in columns 32 through 57 for chemical structure terms. In this case, of course, the letters represent groups of chemical structure data as presented in the Wyandotte-ASTM instruction manual. For example, the terms -32A, -32B, and -32C would eliminate all organic compounds for a search except hydrocarbons.

In the example, -14A means that the answer must have no strong absorption in the 14-micron region and no absorption peaks at 14.0 and 14.1 microns; -14C indicates that there are none at 14.6, 14.7, 14.8, and 14.9 microns. Keeping in mind that a ±0.1 range is also included, the total micron range covered by these terms is really from 13.9 through 15.0 microns. Any reference data on file which do not meet all of these mandatory requirements are immediately eliminated from any further consideration. These mandatory terms are then followed by three desirable spectral terms written as D12Y, 072 and 07Y.
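The A, B, and C shorthand and the 13.9 through 15.0 micron figure can be checked with a small sketch. The Python below is illustrative only: `FIELDS`, `expand`, and `excluded_wavelengths` are hypothetical names, terms are assumed to be written as a two-digit column followed by a letter (as in 14A), and wavelengths are held as integer tenths of a micron. The Y and X overpunch positions are expanded but contribute no specific wavelength.

```python
# Punch positions covered by each letter within one card column.
FIELDS = {
    "A": ["Y", "X", "0", "1"],
    "B": ["2", "3", "4", "5"],
    "C": ["6", "7", "8", "9"],
}


def expand(term: str) -> list[str]:
    """Expand a lettered term, e.g. '14A' -> ['14Y', '14X', '140', '141']."""
    column, letter = term[:-1], term[-1]
    return [column + position for position in FIELDS[letter]]


def excluded_wavelengths(terms, tolerance=1):
    """Tenths of a micron ruled out by negative terms, with a ±0.1 tolerance."""
    excluded = set()
    for term in terms:
        for code in expand(term):
            if code[-1].isdigit():  # Y and X overpunches carry no wavelength
                tenths = int(code)  # '146' -> 146, i.e. 14.6 microns
                excluded.update(range(tenths - tolerance, tenths + tolerance + 1))
    return sorted(excluded)


covered = excluded_wavelengths(["14A", "14B", "14C"])
print(covered[0] / 10, covered[-1] / 10)  # 13.9 15.0
```

Three lettered terms thus stand in for twelve individual punch positions, which is how the shorthand conserves boxes on the request form.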
The “D” is written once, before the first desirable term, and does not precede each term. The term “Y” has been designated in the Wyandotte-ASTM Instruction Book for coding as the symbol for a very strong absorption in an entire micron region. In this case the maximum number of desirable terms could have been 5.

Assuming that our knowledge of the chemical composition of the sample is limited, we continue our search requirements with the three requests that it must not be a liquid, gas, or an inorganic material as indicated by the codes -381, -382 and -39Y, respectively. (These 381, 382, 39Y correspond to a 1 punch and a 2 punch in the 38 column on an IBM card and a Y overpunch in column 39 on the original IBM spectral data card.) If one wishes to use melting point or boiling point request terms, this may be done by specifying a range of temperature. In this particular example, the requester stipulated that melting point data in the range of 52° ± 3° be included in the search. This request is written as 52 ± 03. Temperatures expressed as units are preceded by a zero on the request form. The program has been designed so that even if the boiling point or melting point data are not coded on the original spectral data card, the spectrum will not be rejected if it fulfills all other mandatories. Another option available to the searcher is to request that the boiling point or melting point, if available, be listed on all answers without stating specific requirements for it. This is accomplished by inserting the letter "L" in the box before END. These and other chemical codes which might be used are a part of the Wyandotte-ASTM system. The search request is concluded with the term END.

At this point, the spectroscopist has completed his part of the job. These data are then key-punched into two or more cards as indicated in Figure 4. These cards are forwarded to the computer section for processing. The answer will be a print-out list of curve serial numbers as shown in the left column of Figure 5, arranged in order according to the best match with the desirable spectral data terms. The last line of the print-out lists the number of spectra which meet all the mandatory spectral terms. The data printed across the top of the answer sheet repeat the information in the original request form. This enables the spectroscopist to check the exact requested data supplied to the computer for the search against the data which he submitted initially. The asterisks opposite the answers indicate compliance with the request terms tabulated at the head of each column. A blank space indicates non-compliance. An X, when printed out, means no information available in the reference for that particular term.

Figure 5. Scroll from computer search on Example 1 [print-out; its final line reads "CURVES MEET THE MANDATORY REQUIREMENTS"]

The print-out, Figure 5, shows that only 22 spectra met the mandatory terms. Spectrum W3298BA was the first complete match to the example. This search request would be considered a very tight one because of the mandatory negative requirements. If these specific requirements together with the melting point had been deleted as shown in Figure 6, and replaced by the more general negative “Y” codes, the results would have been unsatisfactory. The requester would have found that 467 spectra fulfilled the few mandatory terms and the print-out would have totaled the maximum 100 spectral numbers. Included in these answers would have been many which were not good because they would have had absorption peaks in regions where absorption was unwanted.

Figure 6. Illustration of request data using more general search terms

A second example shown in Figure 7 has a more complex spectrum and it has been established by chemical tests that the compound does not contain primary, secondary, or tertiary amine groups and that there are no nitro groups in the molecule. In addition, the spectrum indicates that it is an aromatic material. This information is included in the chemical codes using the Wyandotte-ASTM symbols -444, -445, -446, -490, and 342, respectively. Also a melting point of 110°C has been included with a ± 03° range. The form for this request is shown in Figure 8. The answer sheet in Figure 9 shows a small list of answers because of the extensive use of negative demands. The first nine answers are reference curves from various sources and match the example.

Figure 7. Infrared spectrum of resorcinol, selected as typical of unknowns with many absorption bands [plot; horizontal axis: wavelength (microns), with a wavenumber scale (cm-1) along the top]

Figure 8. A portion of request form to initiate a computer search on Example 2

Figure 9. Scroll from computer search of Example 2

The system has proved to be a very workable one over the past three years. It has been in service not only for Kodak Park but also for subsidiary plants such as the Tennessee Eastman Co., Texas Eastman, Kodak Pathé, Kodak Limited, and Kodak Australasia.
Within the Kodak Park Plant, search service is provided on a 24-hour basis routinely with provision for faster service when necessary.

One of the outstanding features of the search system is the use of the “people language” for input of a search request and for interpretation of the answer sheet. A technician or data clerk can take the computer answer sheet together with copies of all the necessary spectra and determine whether a match has been attained. These features, coupled with the fact that it is unnecessary for the user to learn a new complicated code system, should make for widespread use of the system.

When one inspects the print-out answer sheet, the best match or matches are at the top of the list followed by the next best and so on to the end of the answers. All the answers meet the mandatory terms and their order or arrangement is based on the best matches to the desirable terms. This feature is a very distinct advantage which was lacking in the card sorting technique. It has been our experience that a newcomer to the field finds it easier to acquire facility in the use of the computer system than he does with the card sorter systems. The elimination of the burden of handling thousands of punched cards provides the spectroscopist with more time to devote to more worthwhile assignments after devoting a maximum of five minutes to preparing his search request. In addition search time is reduced from several hours to a very few minutes.

At present, data for approximately 110,000 spectra are the maximum for one magnetic tape. There are possibilities that this capacity may be increased. As the quantity of data becomes significantly larger, modifications can be introduced to reduce search time. The master tape could be sub-divided, for instance, into two segments, one having carbonyl absorption in the 5.6- to 5.9-micron range and one not having these data. An alternative to splitting the file would be offered if magnetic tape facilities that have a packing density of 1600 bits per inch were used instead of the present 800 bits per inch.

Our experience with the system has extended over a period of four years since it was completed. In addition, about three years were devoted to the development and debugging phases so that ample opportunity has existed to appraise the system and determine the several ways in which it has been useful.

ACKNOWLEDGMENT
The authors express their thanks to Paul Horowitz and Graydon Loomis for their assistance in the development of the detailed computer program.

RECEIVED for review January 16, 1967. Accepted July 3, 1967.