Combined retrieval system for infrared, mass, and ... - ACS Publications

the maximum likelihood and Bayes classifiers. When the data in Table III are considered, it is apparent that the Bayes classifier achieves far more balanced behavior than the maximum likelihood method. Non-class categorizations (P(k|n)) are made with approximately the same high reliability, averaging 92.6% and 93.9% correct, respectively. However, class categorizations are far more reliable for the Bayes method, averaging 56.4% correct, instead of 35.1%, which is the value for the other method. It is possible that class categorization reliability could be improved by using a higher spectral resolution encoding of the data (e.g., 0.1-ppm intervals, rather than 1 ppm) or by use of a feature selection preprocessing step. Our visual examination of class/non-class histogram pairs suggests that the level of performance attained may be near the maximum possible using linear discriminants. In any case, the present results strongly suggest that the Bayes classifier performs in precisely the manner most desirable for an on-line interpretation system intended to help guide the spectroscopist in spectral interpretation. In the absence of the prediction of membership in a structural feature class, there is a high probability that the compound whose spectrum is examined does not possess that feature. On the other hand, when a structural classification is made, there is a reasonably high possibility that the compound may possess that structure. Thus, a system using such categorizers should be easily capable of directing the spectroscopist's attention toward those structural features most probably present and, at the same time, minimizing the time spent on less likely alternatives.


RECEIVED for review July 18, 1977. Accepted September 22, 1977. Support of this research by the National Science Foundation under Grants MPS-74-01249 and CHE-76-21295 is gratefully acknowledged. The computer-readable 13C NMR data base is being compiled under United States Environmental Protection Agency Contract No. 68-01-3344 with the University of Nebraska-Lincoln.

Combined Retrieval System for Infrared, Mass, and Carbon-13 Nuclear Magnetic Resonance Spectra

Jure Zupan* and Matej Penca
Chemical Institute Boris Kidrič, Ljubljana, Yugoslavia

Dušan Hadži
Faculty of Natural Sciences and Technology, University of Ljubljana, Yugoslavia

Jože Marsel
Institute Josef Stefan, University of Ljubljana, Ljubljana, Yugoslavia

A combined retrieval system based on three types of spectrometric data (infrared, mass, and 13C NMR) is described, together with the corresponding data file structures. The identification of the compounds depends entirely upon their spectra. The retrieval system may be used for searching all three types of spectra, any two, or only one. The results of 300 test searches are presented, and the accuracy, precision, performance, and similarity coefficients for all three search types are evaluated. The performances of the infrared, mass, and 13C NMR spectra searches are 0.77, 0.91, and 0.95, respectively. The similarity coefficients for compound pairs are calculated with an algorithm based on the WLNs of both compounds.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 14, DECEMBER 1977

The utility of several spectrometric methods for fast and reliable identification of chemical compounds has been vastly augmented by the use of computers in searching collections of spectral data. A number of search systems have been developed (1-36); besides spectra, they may retrieve other information such as fragments, molecular formula, chemical name, or the complete structure. Most of the search systems are designed for a single type of spectra or even for one particular data base (1-29). The evaluation algorithms, as well as experience, show that although the reliability of searching may be quite high, it is never absolute. Very often the fault lies in the quality of the data base, but the incomplete transformation of the information contained in the original spectra into the coded form may also cause misses in the search. One possibility for improvement is to combine the searching for several types of spectra (30-36). This should also be advantageous in the development of methods that will ultimately allow the construction of the molecular structure

starting with spectral data and possible auxiliary information.

In comparison with other combined systems dealing with different spectrometric data, our intentions in constructing a new one were: (1) it should work as a black box, using only the spectra as input data. No protocols should be needed to describe the problem. In this respect, the system clearly differs from those of Gray (35) and Gribov (33), partially from that of Koptjug (34), but is related to those of Clerc (30-32) and Heller (23, 36); (2) all data files should be uniform (32) so that no difficulties arise in adding new spectra collections; (3) the search strategy should be carried out in two parts: the first is very fast and rather loose, passing roughly 1% of the data to a goodfile, and the second, performed only on the goodfile, sorts the spectra using rating algorithms that are normally slower by an order of magnitude than the search algorithms.

As a result of these premises, and on the basis of our previous work on search systems (8, 13, 15), the system COSMOSS (Combined Spectral and MOlecular Structure Search) was developed. At the moment three types of data files can be used simultaneously or separately, i.e., infrared, mass, and 13C NMR. The system is, of course, far from perfect at the moment, but we hope to improve it with more practical experience.

Table I. Record Structure and Data Available in the Present Data Bases of the COSMOSS System

                              IR               MS               13C NMR
No. of bits per feature
  Position                    240              540              240
  Intensity                   120              180              120 (b)
  Third feature               120              One word (60)    120
  Structure                   420              420              420
  Identification No.          One word (60)    One word (60)    One word (60)
  Total                       960              1860             960
Structure description
  WLN                         Yes (a)          No               Yes
  Chemical name               Yes (a)          Yes              Yes
  Molecular formula           Yes (a)          Yes              Yes
Total number of records       92 000           16 000           1600

(a) For 2000 compounds only. (b) For 650 compounds only.

Table II. Classification of Intensity Values and Third Rated Features for Different Types of Spectra

                               Third feature
Code   Intensity, %     IR relative half-width    13C NMR multiplicity
000    less than 5.0    undefined                 not known
001    5.0-16.9         less than 0.5             singlet
010    17.0-33.9        0.5-0.9                   doublet
011    34.0-50.9        1.0-1.9                   triplet
100    51.0-67.9        2.0-4.9                   quartet or higher
101    68.0-84.9        5.0-6.9                   odd
110    85.0-99.9        7.0-10.0                  even
111    100.0            more than 10.0            not detectable

The IR half-width ranges are grouped into sharp, normal, and broad peaks.

DATA BASE

The record structure in the data files should be chosen so that it may be easily applied to different types of spectrometric information. In general terms, each spectrum can be regarded as a group of signals plotted on some kind of "energy" or "time" scale vs. absolute or relative intensity. Each peak or signal has, of course, other properties relevant to the structure of the compound that could be stored and used for identification (e.g., half-intensity width, asymmetry, shape, multiplicity, etc.), but the most important ones are certainly the position and the intensity. Thus the guideline of our data base organization is to code the position and the intensity of each peak separately and to provide an additional place for the so-called "third feature" that might be useful for some special identification purposes.

The "energy" scale is divided for each type of spectra into the appropriate number of intervals, represented by bits that are initially set to zero and turned to 1 if a peak occurs in the particular interval. This part of the record is called the position section. As shown in Table I, the full infrared, mass, and 13C NMR spectral ranges are coded with 240, 540, and 240 bits, respectively. The digitation step for mass and 13C NMR spectrometric data is the same over the whole spectral range, i.e., 1 m/e from 1-540 m/e and 1 ppm from 1-240 ppm, respectively, while for the infrared spectra this is not so. The digitation interval in the ASTM-WYANDOTTE collection (37) of infrared spectra is based on the μm scale from 1.0-15.9 μm. This description ceases to be adequate or efficient in the region somewhere above 12.5 μm (below 800 cm⁻¹). Most modern infrared spectrophotometers have an operating interval up to 50 μm (200 cm⁻¹) and require linear scaling in cm⁻¹. This transformation causes the same kind of trouble, though smaller, in the other part of the spectrum. Therefore we decided on two digitation steps: the first is 33.3 cm⁻¹ wide in the region 2000-4000 cm⁻¹ and the second is 10 cm⁻¹ wide in the region 200-2000 cm⁻¹, amounting together to 240 bits.

In the next section, called the intensity section, a three-bit place is provided for each peak, in the order of the peaks in the position section. Infrared and 13C NMR spectra are assumed to have a maximum of 40 peaks (120 bits), while for mass spectra this limit is extended to 60 (180 bits). If the spectrum has more than the allowed number of peaks, those with the lowest intensity are rejected and only 40 or 60 are coded. The full intensity range is divided into eight intervals (Table II), each coded with three bits (000-111). The intensities in mass and 13C NMR spectra are coded relative to the highest peak, while the infrared spectra are coded on the transmission scale.

As mentioned before, the special "third feature" for infrared and 13C NMR spectra is chosen to be the relative half-width of the peak (classed as sharp, normal, or broad) and the multiplicity (not known, singlet, doublet, triplet, quartet or higher, odd, even, or not detectable), respectively. The third part of the record provides a three-bit place for coding eight different values (000-111) of the third feature of each peak. For low resolution mass spectra, where the peaks are very uniform, the molecular weight of the compound has been chosen as the third representative feature of the whole spectrum. This choice has another advantage: it enables one to organize the molecular weight search for mass spectra very easily.
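The position and intensity coding described above can be illustrated with a short Python sketch. It is not the FORTRAN IV code of COSMOSS: the function names, the counting of bits upward from the low end of each range, and the treatment of two peaks falling into the same interval are assumptions made only for illustration, and intensities are taken as percentages in which a larger value means a stronger peak.

```python
from typing import Dict, List, Tuple

# Position bits per spectrometry (Table I).
POSITION_BITS = {"IR": 240, "MS": 540, "NMR": 240}
# Peak limits per record: 60 for mass spectra, 40 otherwise.
MAX_PEAKS = {"IR": 40, "MS": 60, "NMR": 40}
# Exclusive upper bounds (in %) of intensity codes 0-6 (Table II);
# code 7 (binary 111) is reserved for 100.0%.
INTENSITY_BOUNDS = [5.0, 17.0, 34.0, 51.0, 68.0, 85.0, 100.0]


def position_bit(kind: str, position: float) -> int:
    """Return the 0-based bit index for a peak position.

    Assumption: bits are counted upward from the low end of each range.
    """
    if kind == "IR":
        if not 200.0 <= position <= 4000.0:
            raise ValueError("IR position outside 200-4000 cm-1")
        if position < 2000.0:
            # 10 cm-1 steps from 200 to 2000 cm-1: bits 0-179.
            return int((position - 200.0) // 10.0)
        # 33.3 cm-1 steps from 2000 to 4000 cm-1: bits 180-239.
        return min(180 + int((position - 2000.0) * 60.0 / 2000.0), 239)
    limit = POSITION_BITS[kind]
    unit = int(round(position))
    if not 1 <= unit <= limit:
        raise ValueError(f"{kind} position outside 1-{limit}")
    return unit - 1  # one bit per m/e or ppm unit


def intensity_code(percent: float) -> int:
    """Map an intensity in percent to the three-bit code of Table II."""
    for code, bound in enumerate(INTENSITY_BOUNDS):
        if percent < bound:
            return code
    return 7


def encode_spectrum(kind: str,
                    peaks: List[Tuple[float, float]]) -> Tuple[int, List[int]]:
    """Return (position mask, intensity codes in order of the set bits)."""
    # If there are too many peaks, the weakest ones are dropped.
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:MAX_PEAKS[kind]]
    strongest: Dict[int, float] = {}
    for position, percent in kept:
        bit = position_bit(kind, position)
        # Assumption: peaks sharing an interval keep the larger intensity.
        strongest[bit] = max(strongest.get(bit, 0.0), percent)
    mask = 0
    codes = []
    for bit in sorted(strongest):
        mask |= 1 << bit
        codes.append(intensity_code(strongest[bit]))
    return mask, codes
```

With this layout the 200-2000 cm⁻¹ region occupies 180 bits and the 2000-4000 cm⁻¹ region 60 bits, which reproduces the 240-bit infrared position section. A Python integer stands in here for the bit string that the original program spreads over 60-bit machine words.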

The fourth part of the record contains the information about the structure of the compound and is thus called the information part. It has 420 bits, which is sufficient for 70 Hollerith characters of the 6-bit character set. The complete structure information should be composed of the Wiswesser line-formula chemical notation (WLN) (38), the chemical name (CN), and the molecular formula (MF), entered into the information section in this order. Each item is separated from the preceding one by two blanks. If some part of the structure information is not available or not known, it is omitted. If the information is longer than 70 characters, the rest is cut off. However, for more than 75% of the data that contain the full information, there is still some space left over. There is no doubt that the WLN is the most valuable information, since the complete structure, molecular formula, and molecular weight can be obtained from it straightforwardly. Unfortunately, with the exception of the 13C NMR collection (39) and a small part of the infrared file (40), no other data have the complete WLN-CN-MF information.

The last, 60-bit-long part links the particular record with other information sources. Normally the corresponding catalog number is stored here.

The weakest point in the present data organization scheme is the 420-bit-long information part in each file, which occupies more than 35% of all space on drums or tapes. This immense waste of space, carrying the same information three or more times (if further spectrometric data were included), can be reduced by creating a new random access file containing only the structure and catalog number information. The complete 420-bit-long information part in each file could then be omitted, and the relationship to the structure information would be provided via the last number in the record: the address of the beginning of the particular record in the random access file. Such a solution requires, of course, larger computer facilities with direct memory access routines that are not always available to the users. It is a very time-consuming procedure to create such a file and to get the proper connectivity addresses for each record in the data files but, fortunately, this has to be done only once.

Finally, the data sources should be mentioned. The large majority of the 92 000 IR spectra is based on the ASTM-WYANDOTTE collection (37), and only very few, about 2000 of them (40), have the full intensity, shape, and WLN-CN-MF information. In practice the search very seldom runs over the whole collection. For laboratory use, smaller files (polymers, surface active agents, etc.) were created for standard or special use. The most frequently used files are the Sadtler (41) and DMS (42) collections, marked CA and FA and containing about 35 000 and 13 000 spectra, respectively. For mass spectra, the Aldermaston collection (43) with about 16 000 spectra is used. It has the complete intensity information, chemical name, and molecular formula, but completely lacks the WLN. According to some tests in our and other laboratories (44), the collection has many duplicates and a surprisingly large number of wrongly coded spectra. The smallest collection, but in all other respects the best, is the 13C NMR data file, containing only about 1600 spectra. For a large part it has the complete intensity, multiplicity, and WLN-CN-MF information.
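The layout of the 420-bit information part described earlier in this section (WLN, chemical name, and molecular formula separated by two blanks and cut off at 70 characters) can be sketched as follows. The function names and the blank padding of short entries are assumptions for illustration only; the sample strings in the usage note are placeholders, not entries from the data files.

```python
from typing import Optional

FIELD_SEPARATOR = "  "   # two blanks between items, as described in the text
INFO_CHARS = 70          # 420 bits at 6 bits per character
BITS_PER_CHAR = 6


def build_info_section(wln: Optional[str], name: Optional[str],
                       formula: Optional[str]) -> str:
    """Join WLN, chemical name, and molecular formula into a 70-character field."""
    # Items that are not available are simply omitted.
    items = [item for item in (wln, name, formula) if item]
    text = FIELD_SEPARATOR.join(items)
    # Anything beyond 70 characters is cut off.
    # Assumption: shorter text is padded with blanks to the fixed length.
    return text[:INFO_CHARS].ljust(INFO_CHARS)


def info_fraction(record_bits: int) -> float:
    """Fraction of a record occupied by the information part."""
    return INFO_CHARS * BITS_PER_CHAR / record_bits
```

For example, `build_info_section("WLN", "NAME", "C9H9N")` yields a 70-character field beginning with `WLN  NAME  C9H9N`, and `info_fraction(960)` gives 0.4375 for the 960-bit infrared and 13C NMR records of Table I, in line with the statement that the information parts take up more than 35% of the storage.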

THE SYSTEM

The retrieval strategy of the search system COSMOSS is tightly bound to this data base structure. The complete process of one search is done in three clearly separated parts. First, the input data are read in free format and the corresponding masks of peak positions and no-peak regions are formed. Second, the search through only the position part of each record is performed over the entire file, and the records

Table III. Default Values and Range Limits for Different Spectrometries if Not Specified Together with the Band Position

                                            IR          MS       13C NMR
Units of band position                      cm⁻¹        m/e      ppm
Range limits set when exceeded (message)    200-4000    1-540    1-240
Intensity                                   50%         10%      50%
Band shape                                  normal      ...      ...
Multiplicity                                ...         ...      1 (singlet)

Obligate presence if not requested (column assignment not recoverable from the source): if intensity is greater than 33%; always; only if specified; never. A further IR entry of 2% follows the intensity default.

that pass the logical .AND. operations with all masks are saved on a goodfile. Third, the rating algorithm sorts the spectra from the goodfile according to the discrepancies in the intensity, position, and presence or absence of each peak in comparison with the corresponding peaks in the input spectrum. After this sorting, the spectra with the five highest scores are output. It should be mentioned that the masks for the peak positions (with the exception of the mass spectra search) are set up in a way that allows a peak to be found with a tolerance of ±1 digitation step.

The system COSMOSS is intended to assist the analyst who has an infrared, mass, or 13C NMR spectrum, or any combination of these, in finding out whether identical or very similar spectra are stored in the data files. We trust that the system is able to provide, using the rating algorithms, some valuable hints about the structure of unknown compounds even if the requested spectra are not in the files.

As input, only the spectral data are requested. These data are input in the form of groups. Each group, separated by slashes, describes one peak or one no-peak region in the particular spectrum. The first two groups are reserved for the search identification (IR, MS, or 13C NMR) and the heading. If a group of data is incomplete, default values are substituted for the missing or incorrect items. The default values are collected in Table III. After each search, the rank or top list of the five most probable compounds is issued (even if there are more of them on the goodfile). The rating criteria depend on the respective spectra. They are the subject of our greatest concern and are being continuously developed. Example output lists are shown in Figure 1. In one run, any number of spectra can be input.
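The two-stage strategy described in this section (a fast mask search that fills the goodfile, followed by a slower rating of the goodfile only) can be sketched in Python as below. The text does not specify the rating algorithms, so the `rate` function is a purely hypothetical score based on intensity-code discrepancies; the `Record` class and all function names are likewise assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    ident: str
    position_mask: int          # one bit per digitation interval
    intensity_codes: List[int]  # three-bit codes, in order of the set bits


def set_bits(mask: int) -> List[int]:
    """Indices of the bits that are set, in ascending order."""
    return [b for b in range(mask.bit_length()) if (mask >> b) & 1]


def peak_mask(bit: int, tolerance: int, n_bits: int) -> int:
    """Mask covering a peak bit and its neighbours within the tolerance."""
    mask = 0
    for b in range(max(0, bit - tolerance), min(n_bits - 1, bit + tolerance) + 1):
        mask |= 1 << b
    return mask


def region_mask(first_bit: int, last_bit: int) -> int:
    """Mask with every bit of a no-peak region set."""
    return ((1 << (last_bit - first_bit + 1)) - 1) << first_bit


def prefilter(records, peak_masks, clear_masks):
    """Stage 1: keep the records that pass the AND test with every mask."""
    good = []
    for rec in records:
        has_peaks = all(rec.position_mask & m for m in peak_masks)
        is_clear = all(not (rec.position_mask & m) for m in clear_masks)
        if has_peaks and is_clear:
            good.append(rec)
    return good


def rate(rec: Record, query: List[Tuple[int, int]], tolerance: int) -> float:
    """Stage 2 (hypothetical score, 0-100): penalise intensity discrepancies,
    missing query peaks, and extra peaks in the record."""
    rec_peaks = dict(zip(set_bits(rec.position_mask), rec.intensity_codes))
    penalty, matched = 0, set()
    for bit, code in query:
        near = [b for b in rec_peaks if abs(b - bit) <= tolerance]
        if not near:
            penalty += 7            # query peak absent from the record
            continue
        best = min(near, key=lambda b: abs(rec_peaks[b] - code))
        penalty += abs(rec_peaks[best] - code)
        matched.add(best)
    penalty += len(rec_peaks) - len(matched)   # one point per extra peak
    worst = 7 * len(query) + len(rec_peaks)
    return 100.0 * (1.0 - penalty / worst)


def search(records, query, no_peak_regions, n_bits, tolerance=1, top=5):
    """Run both stages and return the top list as (identifier, score) pairs."""
    peak_masks = [peak_mask(bit, tolerance, n_bits) for bit, _ in query]
    clear_masks = [region_mask(a, b) for a, b in no_peak_regions]
    goodfile = prefilter(records, peak_masks, clear_masks)
    scored = [(rec.ident, round(rate(rec, query, tolerance), 2)) for rec in goodfile]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]
```

Setting `tolerance=0` corresponds to the mass spectra search, for which the text states that no position tolerance is applied. The design point the sketch tries to convey is that stage 1 touches only the position section with cheap bitwise tests, so the more expensive comparison in stage 2 runs on a small fraction of the file.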

TEST OF THE RETRIEVAL ABILITY

A test of the retrieval system should be made keeping in mind that such a system is almost always used as a black box and that it is only the best hit that counts. If, for example, the correct compound is retrieved and placed in the second position of the top list, this achievement is very good from the computational point of view but is of very limited value for the user if the top-rated compound is completely different from the correct one (45). This is the reason why the COSMOSS system outputs only the first five top-rated compounds. As a test, 300 spectra (100 for each spectrometry: IR, MS, and 13C NMR) were randomly chosen from the compounds whose spectra were known to be in the three data files, and coded by hand. The retrieval strategies and the rating algorithms, which differ for each spectrometry, were then checked according to Erley's suggestions (46) for accuracy, precision, and performance. The results of the test are shown in Table IV. It can be readily seen that the performance gap

between the infrared and the other two types of spectra is very impressive. It is believed that there are two main sources of this large discrepancy.

First, the spectrum description (in digitized form) is, for 13C NMR and especially for mass spectra, very clear and precise in comparison with that of infrared spectra. The positions and peak shapes are much better defined and sharper for MS and 13C NMR spectra; thus shifts in positions can be detected and classified very easily: either the spectrum is coded wrongly or it does not belong to the requested compound. By contrast, the peaks in infrared spectra are normally much broader, and hence the coding is much more difficult and subject to some arbitrariness (40).

Second, the data collection for infrared spectra has, besides the known shortcomings in intensity information, which make the rating algorithms rather inefficient, a relatively high number of completely miscoded spectra. The latter is true, though to a lesser extent, for the Aldermaston collection as well. However, in the case of mass spectra, this shortcoming is balanced by the extensive intensity information.

Figure 1. Output examples of all three search types. A low similarity between the input and retrieved spectra indicates either errors in the coding or that there is no counterpart of the input spectrum in the file. [Listings not reproduced: for each search type the output shows the input peak data (position in cm⁻¹, m/e, or ppm; intensity; shape or multiplicity; obligatory-peak and no-peak-region entries), the top list of the most probable compounds with their scores and WLN-CN-MF information, and the retrieval time in seconds.]

Figure 2. The similarity between the structures of the top rated and requested compounds for all unsuccessful searches. To give a visual impression of the meaning of the terms very similar, similar, and slightly similar, an example with the corresponding similarity coefficient is drawn for each.

In order to give the user more interesting information about the practical efficiency of the system, we have compared the structures of the top-rated compounds of all unsuccessful searches with those of the requested ones. The comparison was made on the basis of the similarity coefficient calculated from the WLNs of the top-rated and the requested compound. The similarity coefficient S_IJ between two structures I and J, expressed as WLN(I) and WLN(J), is obtained from a formula in which p_IJ is the percentage of common elements and symbols, d_IJ is the average displacement of all common symbols in the WLN(I) and WLN(J) notations, and k_IJ is the weight factor describing the influence of the displacement on the similarity coefficient. In the present calculation k_IJ was taken to be 0.5 for comparing the acyclic parts of the notations and zero for the others. A detailed explanation of the S_IJ calculation can be found in (47). The distribution of the similarity coefficients calculated for all the unsuccessful searches is shown in Figure 2. It is apparent that a very easy distinction

between the satisfactory and unsatisfactory searches can be made on this basis. As expected from the reasoning in the preceding paragraphs, the mass and 13C NMR searches yield much better results when they fail to retrieve the correct spectrum than does the infrared search.

It should be stressed that neither the accuracy, precision, and performance nor the similarity coefficients depend exclusively on the quality of the search and rating algorithms; rather, they reflect the accommodation of these algorithms to the particular data base. If the data file has too many errors, is deficient in information (as may be seen in the case of the infrared collection), or is too small (the case of the 13C NMR file), no retrieval system is able to provide the user with a compound very similar to the correct one if the latter does not exist in the file or is wrongly coded.

It is also interesting to mention the adaptation period of an untrained user to the retrieval responses of the system. While there were no difficulties with the 13C NMR searches and very few with the mass searches (only in indicating the intensities), at least 40-50 searches were necessary for the user to learn how to input infrared spectra so as to obtain better results than at the beginning, in spite of the fact that the coding rules for the formation of the data base are very well known (37) and require all peaks with intensities above 30% to be coded.

Table IV. Test Results for 300 Searches

                                               IR      MS      13C NMR
Number of test compounds (Ntot)                100     100     100
Number of correct answers in the ith place
of the top list, N(i)
  i = 1                                        77      91      95
  i = 2                                        0       2       2
  i = 3                                        3       2       0
  i = 4                                        0       1       0
  i = 5                                        1       0       1
Total compounds found (NT)                     81      96      98
Compounds not found (NF)                       19      4       2
Accuracy                                       0.81    0.96    0.98
Precision                                      0.97    0.97    0.98
(row label not legible)                        0.79    0.93    0.96
Performance                                    0.77    0.91    0.95

ABOUT THE IMPLEMENTATION

The program COSMOSS is written completely in FORTRAN IV, using strictly ANSI-approved statements. All machine-dependent constants are stored in common blocks, and all calculations using them (i.e., finding the proper word address from the bit position in a bit string, shifts for different word lengths, and so on) are made in separate routines which can be very easily exchanged. The program is now operating on a CDC-Cyber 172 computer, using the 60-bit-long word, and requires 55000 (octal) words of central memory. The average search speed, including the encoding of the free-format input and the calculation and output of the top list, is about 1100 spectra per second. At the present stage, the system is used in batch processing via terminals, but an implementation for interactive work is under development.

CONCLUSION

In such an ambitious project as the construction of an analytical black box system based on different spectra, the question "What next?" is present at each stage of development. Besides the standard maintenance of the system, such as filtering and updating the data files, devising more refined rating algorithms, or reprogramming some I/O routines in a more efficient way, some essential improvements have to be made. In our opinion, two main goals for future work are of first importance:

To develop a preprocessing program that identifies the structure fragments of the unknown compound on the basis of the input spectra before the actual search runs are performed. This can be achieved either by hierarchical tree decisions based on average vectors (48) or by using some standard method of artificial intelligence (49).

To evaluate the top lists obtained by the different searches for each spectrometry relative to each other, forming a final multispectrometry top list that includes, of course, all classifying hints obtained from the preprocessing routine. The evaluation of this part of the problem depends on full structure information in all files. We wish to emphasize that it is the WLN description which is most important for this routine, and that it is the one that makes possible (47) a quantitative computation of the similarity between two structures.
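The figures of merit reported in Table IV can be cross-checked with a few lines of Python. The text refers to Erley's suggestions (46) without restating the definitions, so the sketch below rests on assumptions: accuracy is taken as the fraction of test compounds found anywhere in the top list, the reported performance as the fraction placed first, and precision as the mean reciprocal place of the compounds found. The last definition is only a guess that happens to reproduce the tabulated values, and the split of the infrared answers over places 2-5 is likewise an assumption recovered from the damaged table.

```python
from typing import List

# Correct answers per top-list place (1-5), as read from Table IV.
# Assumption: the IR split over places 2-5 is reconstructed, not certain.
HITS = {
    "IR": [77, 0, 3, 0, 1],
    "MS": [91, 2, 2, 1, 0],
    "13C NMR": [95, 2, 0, 0, 1],
}
N_TEST = 100


def accuracy(hits: List[int], n_test: int = N_TEST) -> float:
    """Fraction of test compounds found anywhere in the top list."""
    return sum(hits) / n_test


def first_hit_rate(hits: List[int], n_test: int = N_TEST) -> float:
    """Fraction of test compounds placed first (the reported performance)."""
    return hits[0] / n_test


def reciprocal_rank_precision(hits: List[int]) -> float:
    """ASSUMED definition: mean of 1/place over the compounds found."""
    found = sum(hits)
    return sum(n / place for place, n in enumerate(hits, start=1)) / found


if __name__ == "__main__":
    for kind, hits in HITS.items():
        print(kind,
              round(accuracy(hits), 2),
              round(reciprocal_rank_precision(hits), 2),
              round(first_hit_rate(hits), 2))
```

Under these assumptions the rounded values agree with the accuracy, precision, and performance rows of Table IV for all three spectrometries, and the product of the tabulated accuracy and precision matches the remaining row (0.79, 0.93, 0.96).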

LITERATURE CITED R . A. Sparks, Storage and Retrieval of WYANDOTTE-ASTM Infrared Spectral Data Using an IBM 1401 Computer, American Society for Testing Materials, Philadelphis, Pa., 1964. D. H. Anderson, and G. L. Covert, Anal. Chem., 39, 1288 (1967). D. S. Erley, Anal. Chem., 40, 894 (1968). F. E. Lytle and T. L. Brazie, Anal. Chem., 42, 1532 (1970). G. A. Massios, A m . Lab., 3, 55 (1971). R. W. Sebasta and G. G. Johnson, Jr., Anal. Chem., 44, 260 (1972). C. S. Rann, Anal. Chem., 44, 1669 (1972). J. Zupan, D. Hadii, and M. Penca, Kem. Ind., 23, 275 (1974). E. C. Penski, D. A. Padowski, and J. R. Bouck, Anal. Chem., 46, 955 (1974). K. Schaarschmidt, R. Riemer, and E. Steger, Z. Chem., 14, 374 (1974). H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, J. Chem. Inf. Compuf. Sci., 15, 207 (1975). R. C. Fox, Anal. Chem., 48, 717 (1976). J. Zupan, D. Hadii, and M. Penca, Compuf. Chem., 1, 71 (1976). E. M. Kirby, R. N. Jonesand D. G. Cameron, CODATA Bull., 21, 18 (1976). J. Zupan, J. T. Clerc, and D. Hadii, Vesfn. Slov. Kem. Drus., 23, 73 (1976). B. Pettersson and R. Ryhage, A r k . Kemi., 26, 293 (1967). L. R. Crawford and J. D. Morisson, Anal. Chem., 40, 1464 (1968). S. L. Grotch, Anal. Chem., 42, 1214 (1970). B. A. Knock, I. C. Smith, D. E. Wright. R. G. Ridley, and W. Kelly, Anal. Chem., 42, 1526 (1970). H. S. Hertz, R. A . Hites, and K. Beiemann, Anal. Chem., 43, 681 (1971). S. L. Grotch. Anal. Chem.. 43. 1362 (1971). L. E. Wangen, W. S. Woodward, and T:L. Isenhour, Anal. Chem., 43, 1605 (1971). S. R. Heller, DCRTKIS, MSSS Users Manual, Division of Computer Research and Technology, Bethesda, Md, (1972). S . R. Heller, Anal. Chem., 44, 1951 (1972). S . R . Heller, H. M. Fales, and G. W. A. Milne, Org. Mass. Specfrom., 7 , 107 (1973). S. L. Grotch, Anal. Chem., 45, 2 (1973). P. R. Naegeli and J. T. Clerc, Anal. Chem., 45, 739A (1974). R. Schwarzenbach, J. Meili, H. Konitzer, and J. T. Clerc, Org. Magn. Res., 8, 11 (1976). R. J. Feldman, S . R. Heller, K. P. 
Shapiro, and R. S. Heller, J . Chem. Doc., 12, 41 (1972). J. T. Clerc. C .Jost, T. Meier, and R . Schwarzenbach, Cbimia, 2 7 , (12) 665 (1973). J. T. Cierc and F. Erni, Top. Curr. Chem., 39, 91 (1973). J. T. Clerc, Computers in Chemical Research and Education, Proceedings, D. Hadii, Ed.. Elsevier Publishing Co., Amsterdam, 1973, Vol. 2, p 3/109. L. A. Gribov, V. A. Dementyev, M. E. Elyashberg, and E. 2. Yakupov, J . Mol. Sfrucf.,22, 161 (1974). V. A. Koptjug, 2. Chem., 15, 41 (1975). N. A. Gray, Anal. Chem., 47, 2426 (1975). The NIH-EPA Chemical Information Systems, Status Report No. 4, Dec. 1976. "Codes and Instructions for WYANDOTTE-ASTM", American Society for Testing Materials, 1916 Race St., Philadelphia, Pa., (1964).

ANALYTICAL CHEMISTRY, VOL. 49, NO. 14, DECEMBER 1977

2145

(38) E. G. Smith and P. A. Baker, The Wiswesser Line-Formula Chemical Notation, 3rd ed., CIMI, Cherry Hill, N. J., 1975. (39) OCETH 13C NMR Data Collection, Laboratorium fur Organische Chernie, ETH, Zurich (40) J. T. Cierc, R. Knutti, H. Koenitzer, and J. Zupan, Fresenlus' 2 . Anal. Chem.. 283. 177 (1977). (41) Sadtler' Standard Spectra, Sadtler Research Laboratories, Philadelphia, Pa, (42) DMS, Documentation of Molecular Spectroscopy, Weinheim, Verlag Chemie, and Lond0n:Butterworth. (43) Aldermaston Mass Spectrometry Data Collection, Reading, U.K. (44) J. T. Cierc, private communication. (45) J. T. Clerc and J. Zupan, Proceedings of the IUPAC, International Symposium on Technique for the Retrieval of Chemical Information, London, Nov. 1976.

(46) D. S. Erley, Appl. Spectrosc., 25, 200 (1971).
(47) J. Zupan and D. Hadži, "Computers in Chemical Research Education and Technology", E. V. Ludena and F. Brito, Ed., Advances Studies Center IVIC, Caracas, Appartado 1827 (1977).
(48) M. Penca, J. Zupan, and D. Hadži, Anal. Chim. Acta, in press.
(49) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971), and references cited therein.

RECEIVED for review May 9, 1977. Accepted June 29, 1977. Financial support of the Research Community of Slovenia is acknowledged. The work was supported in part also by the Pharmaceutical Factory KRKA.

Positional and Geometric Characterization of Olefinic Double Bonds by Fluorine Magnetic Resonance Spectrometry

Michelle V. Buchanan, David F. Hillenbrand, and James W. Taylor*

Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706

Conversion of n-alkenes to hexafluoroacetone ketals has allowed the characterization of the double bond position and geometry using ¹⁹F NMR. Geometry can be determined from the chemical shift, with a 0.775 ppm separation between the cis and trans isomers. Double bond position can be determined from the value of δAB for the A3B3 pattern, which decreases as the bond moves toward the center of the molecule. A method by which ΔνAB may be estimated without the use of computer simulation is discussed. Application of these findings to the quantitative determination of geometric mixtures of linear alkenes is examined. Prior separation is required for analysis of positional mixtures but not for geometric mixtures.
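As a rough illustration of the quantitative use mentioned above, the following Python sketch converts the 0.775 ppm cis/trans separation into hertz and estimates the composition of a geometric mixture from integrated peak areas. The 94.1 MHz observe frequency, the example integrals, and the function names are assumptions chosen for illustration and are not taken from the paper. The sketch also assumes that equal peak areas correspond to equal molar amounts, which seems reasonable because each ketal carries the same number of fluorine atoms.

```python
def ppm_to_hz(delta_ppm, observe_freq_mhz):
    """Convert a chemical shift difference in ppm to Hz.

    1 ppm corresponds to observe_freq_mhz Hz, because the observe
    frequency in MHz is one millionth of the frequency in Hz.
    """
    return delta_ppm * observe_freq_mhz


def geometric_fractions(cis_area, trans_area):
    """Return (cis, trans) mole fractions from integrated peak areas.

    Assumes both isomers give the same response per mole, i.e. the
    same number of fluorine nuclei contribute to each signal.
    """
    total = cis_area + trans_area
    if total <= 0:
        raise ValueError("integrated areas must sum to a positive value")
    return cis_area / total, trans_area / total


# Separation reported in the abstract.
CIS_TRANS_SEPARATION_PPM = 0.775
# Hypothetical 19F observe frequency; substitute the actual instrument value.
ASSUMED_OBSERVE_FREQ_MHZ = 94.1

separation_hz = ppm_to_hz(CIS_TRANS_SEPARATION_PPM, ASSUMED_OBSERVE_FREQ_MHZ)

# Hypothetical integrals for the cis and trans signals of one mixture.
cis_fraction, trans_fraction = geometric_fractions(30.0, 70.0)

print(f"cis/trans separation: {separation_hz:.1f} Hz")
print(f"cis: {cis_fraction:.0%}, trans: {trans_fraction:.0%}")
```

A separation of this size in hertz would normally be large compared with typical line widths, which is consistent with the statement that geometric mixtures can be analyzed without prior separation.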

The determination of both the geometry and the position of the double bond in large alkenes has proven to be a difficult problem because of the minor differences in physical properties of isomeric alkenes. One approach to characterizing these compounds is an analytical technique which provides a derivative wherein the differences between the isomers are maximized. For efficiency, the chemical preparation of the sample derivative needs to be simple, and the analysis should yield both the geometry and the position of the double bond without the use of standards.

The derivatives which were first used to give information on both geometry and position of the double bond were cyclic esters. The first of these were cycloboronate esters (I), which were prepared from 1,2- and 1,3-diols (1-3). In another approach, a cyclic acetone ketal (II) was prepared by stereospecific conversion of the alkene to its diol using OsO4, followed by condensation of the diol with acetone to form the corresponding cyclic ketal (4, 5). Mass spectral examination of the electron impact induced fragmentation gave ions which could be related to both the geometry and position of the original double bond. These isomers could be separated by gas-liquid chromatography to provide geometric identity between isomers of the same olefin, but positional isomers could not be distinguished if the original double bond was near the center of a long chain. Furthermore, this derivative was quite easily hydrolyzed, so all handling procedures had to exclude moisture.

[Structures I, II, and III]

The hexafluoroacetone (HFA) analogue of the acetone ketal (III) was synthesized to give more easily interpreted mass spectra because it would allow for the distinction between fragments originating in the ring as opposed to those from the hydrocarbon chain (6, 7). This derivative was successful in determining the bond position and was easily handled, but it could not be used to distinguish between geometric isomers without reliance on gas chromatographic retention times and standards. Because the HFA ketal has two sets of three equivalent fluorine atoms in the ring, it was thought that these fluorines might provide a sensitive probe into both the geometry and position of the double bond. The purpose of the present paper is to examine the use of HFA and ¹⁹F NMR as a tool for the characterization of both the geometry and the position of double bonds in linear alkenes.

EXPERIMENTAL

The 2,2-bis(trifluoromethyl)-1,3-dioxolanes were prepared by first stereospecifically converting the olefin into the corresponding bromohydrin (IV). Then the bromohydrin was condensed with HFA to form the ketal (V).

[Structures IV and V]

All olefins were used as received from Chem Samples Co., Columbus, Ohio, and hexafluoroacetone was obtained from PCR, Inc., Gainesville, Florida. A detailed account of the experimental process for the preparation of the derivatives may be found in the literature (7). Preparative gas chromatography was performed using a Varian-Aerograph Model 705 gas chromatograph with a 0.95 mm o.d. × 6.1 m aluminum column packed with 10% OV-1 on 60/80 mesh Gas Chrom Q. Normal operating conditions were: injector, 200 °C; flame ionization detector, 200 °C; detector split ratio, 10:1;