Conversational Mass Spectral Retrieval System and Its Use as an Aid in Structure Determination

Stephen R. Heller
Heuristics Laboratory, Division of Computer Research and Technology, National Institutes of Health, Public Health Service, Bethesda, Md. 20014

An interactive, conversational mass spectral retrieval system consisting of a collection of computer programs designed to give immediate retrieval of mass spectral data is described. The system options include a peak/intensity search, a molecular weight search, a complete molecular formula search, an imbedded molecular formula search, and printout of the peaks and intensities of the entire spectrum. The programs used to generate and search the files, as well as the file structure, are described.

OF ALL THE PHYSICAL DATA available to the organic chemist, mass spectral data are the most readily compatible with the computer. In particular, the so-called “low resolution” mass spectra, with peaks at unit mass intervals, are widely available in large numbers and of great value in structural elucidation. Furthermore, when only very small samples of the material are available, such as from a gas chromatograph, the mass spectrum may be the best available information for use in the determination of the structure of the unknown chemical. A number of investigators (1-7) have developed methods for processing and searching mass spectral data. None of the current methods, however, includes an interactive search capability in which the bench chemist can modify his demands while the search is under way.

Our experience in the development of a Chemical Information System (8) has led us to the conclusion that computer programs should be highly interactive, conversational, and available over ordinary telephone lines from the chemist’s lab. The system is designed to be used both for routine identification and for assistance in the determination of unknown structures. The programs described here are used daily and routinely from laboratories at NIH. The programs give virtually instantaneous response at the chemist’s teletype. In addition, an extensive user’s manual is available which describes how the program is used and gives numerous examples of the different types of searches possible with the system (9). The philosophy behind the development of the system was to design programs that respond rapidly to the chemist’s query directly from his lab and that stimulate his consideration of the next logical response.

The data file being used in the search program is the same file as used by Hertz, Hites, and Biemann (7), and was kindly made available by the authors. In this method of spectral data abbreviation, the two most intense peaks in each interval of fourteen mass units are selected from the complete spectrum for matching. Other methods include using n peaks per m interval, where n and m are other than 2 and 14, as well as the technique of selecting five to ten of the largest peaks from the spectrum irrespective of mass value. The abbreviated (or compressed) spectrum used here has the virtue of following the chemist’s thought pattern analysis, which is to look for patterns and peak clusters. For example, a difference of fourteen amu (CH2) between groupings indicates the presence of straight chain molecules. Indeed, the choice of a search interval of fourteen amu was specifically made so that the computer would give a homologous series of ions the same relative importance as a chemist would give the series of ions. Figure 1 shows the distribution of the abbreviated spectra as a function of m/e value. Normally the low mass region contains many intense peaks and the above method of selection clearly discriminates against ions in this region.

The main reasons for compressing or abbreviating spectra have been the storage limitations and the speed of the search system. Third and fourth generation computers with vast on-line storage capacity (being shared by many users in a time-shared computer) as well as increased speed make these reasons less valid. In addition, a well structured file can improve the “apparent” speed of the program. While the system described here does use an abbreviated spectrum, abbreviation does not appear to be necessary and, indeed, proves costly in the time needed for the selection of the peaks that constitute the abbreviated spectrum, as well as in the loss of some information. Those masses with more than 1000 occurrences in the selected file are shown in Table I.

(1) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 293 (1967).
(2) S. L. Grotch, ANAL. CHEM., 42, 1214 (1970).
(3) Ibid., 43, 1362 (1971).
(4) L. E. Wangen, W. S. Woodward, and T. L. Isenhour, ibid., p 1605.
(5) L. R. Crawford and J. D. Morrison, ibid., 40, 1464 (1968).
(6) B. A. Knock, I. C. Smith, D. E. Wright, and R. G. Ridley, ibid., 42, 1526 (1970).
(7) H. S. Hertz, R. A. Hites, and K. Biemann, ibid., 43, 681 (1970).
(8) R. J. Feldmann, S. R. Heller, K. P. Shapiro, and R. S. Heller, J. Chem. Doc., 12, 41 (1972).
(9) S. R. Heller, DCRT/CIS, “Mass Spectral Search System User’s Manual,” Division of Computer Research and Technology, Bethesda, Md., March 1972.
In addition to a peak and intensity search, it was felt that molecular weight and molecular formula searches would also be of value in both structural elucidation and routine identification problems. An overview of the system is shown in Figure 2.

EXPERIMENTAL

Peak and Intensity Search. The main search program in the system is the peak and intensity search program. The data base being searched uses the two largest peaks in every 14 m/e interval, starting at m/e of 6 (i.e., 6-19, 20-33, etc.). The original file of 8124 spectra contained 762,162 peaks and their intensities. The first computer program pulled or selected the two largest peaks and their intensities in 14 m/e intervals from the original file. The resulting 185,396 peaks (ranging from m/e 9 to 1337) represent 24.3% of the file vs. a theoretical reduction to 1/7 or 14.3%. The reason for the file being considerably larger than expected is that in many 14 m/e intervals, particularly at higher mass values, there are considerably fewer than fourteen peaks and, thus, the two largest must be selected from fewer than fourteen possibilities.
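The selection step described above can be illustrated with a minimal Python sketch. It is not the original FORTRAN program: the function name `abbreviate`, its parameters, the sample spectrum, and the tie-breaking rule (the lower mass wins when intensities are equal) are all assumptions made for illustration. The last lines simply recompute the reduction figures quoted in the text.

```python
from collections import defaultdict


def abbreviate(spectrum, start=6, width=14, keep=2):
    """Keep the `keep` most intense peaks in each `width`-wide m/e window.

    `spectrum` is an iterable of (m/e, intensity) pairs with integer m/e.
    Windows begin at `start`, i.e. 6-19, 20-33, ... for the defaults.
    """
    windows = defaultdict(list)
    for mass, intensity in spectrum:
        if mass < start:
            continue  # assumption: peaks below the first window are ignored
        windows[(mass - start) // width].append((mass, intensity))

    selected = []
    for index in sorted(windows):
        # Most intense first; ties broken by lower mass (an assumption).
        best = sorted(windows[index], key=lambda p: (-p[1], p[0]))[:keep]
        selected.extend(sorted(best))  # report each window in mass order
    return selected


# Hypothetical spectrum used only to exercise the function.
spectrum = [(15, 12.0), (18, 3.0), (19, 40.0), (20, 5.0),
            (27, 100.0), (29, 60.0), (33, 8.0), (41, 75.0)]
abbreviated = abbreviate(spectrum)
print(abbreviated)

# Reduction figures from the text: peaks retained vs. the 2-in-14 ideal.
retained_percent = round(100 * 185396 / 762162, 1)
theoretical_percent = round(100 * 2 / 14, 1)
print(retained_percent, theoretical_percent)
```

Changing `keep` and `width` gives the other "n peaks per m interval" schemes mentioned in the introduction.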

ANALYTICAL CHEMISTRY, VOL. 44, NO. 12, OCTOBER 1972


Table I. Masses with More than 1000 Occurrences in the Abbreviated File

m/e  Occurrences    m/e  Occurrences
 27     4344         83     1740
 41     4277        105     1680
 29     3673         65     1651
 55     3271         14     1618
 43     3172         71     1613
 15     3170         50     1499
 39     3084         79     1495
 77     2574         97     1465
 69     2498         67     1373
 91     2478        115     1370
 57     2385        119     1365
 51     2348         85     1361
 18     2271         81     1292
 28     2207
 63     1966

Figure 1. Distribution of the selected peaks (horizontal axis: m/e value)

It might logically be assumed that the proper way for the chemist to enter data from an unknown mass spectrum would be to similarly select, beginning at m/e 6, the two most intense peaks in each fourteen amu. Specifically, selection of more than two peaks within fourteen amu is dangerous, unless the fourteen amu crossover point has been located by ticking off the spectrum beginning at m/e 6. In practice, it is unnecessary to stress this point, because real mass spectra usually have their peaks more uniformly distributed according to mass. An important case where an error might be introduced is in spectra where the fragment ion is so intense as to require inclusion of its 13C satellite in the abbreviated spectrum. In such a case, another important fragment occurring within fourteen amu would be overlooked.

After the peaks were sorted by increasing mass (m/e value), the next program took each mass and the list of references (i.e., spectrum ID numbers) and generated a disk file of the m/e values with pointers to a second disk file containing the references to these m/e values. It will be helpful to describe some details of the file structure for a clear understanding of the search and retrieval method. The first disk file is the pointer file and contains a cell (being one computer word) for each m/e value up to 1337. The second disk file is the reference file, which contains the references or ID numbers associated with a given m/e value. At the beginning of the file generation all the cells in both files contain a -1. As the generation program proceeds, if there are any references at a given m/e, the -1 in the cell or word of the pointer file is replaced by a pointer number. At the same time, the references (in this case, the ID numbers

Figure 2. Overview of the mass spectral search system (diagram labels: pointers, references, intensities, name lookup)

Table I (continued)

m/e  Occurrences
103     1238
111     1236
 53     1199
 95     1174
 70     1145
121     1133
133     1110
107     1090
 56     1077
 75     1061
139     1055
 93     1040
 45     1024


Table II. Printout of a Peak Pointer File Block

FILE NAME=MS  EXTENSTION=PNT  BLOCK NUMBER 1

[Printout body: 26 numbered rows of pointer values; the individual values are not legible in this copy.]

Table III. Printout of the Peak Reference File of ID Number and Intensity

FILE NAME=MS  EXTENSTION=REF  BLOCK NUMBER 1

[Printout body: 26 numbered rows of packed ID number and intensity words; the individual values are not legible in this copy.]

and the intensity of the peak with the ID number packed together into one 36-bit computer word) are being sequentially put into the disk reference file, one at a time. After all the references for a particular m/e value are put in the file, the next word is left as -1 to indicate a breakpoint between reference lists.

The pointer number system works quite simply. The first peak found in the file is at m/e 9. There are three references to this value. The next peak is at m/e 10. There are four references to this value. In cell 9 the program stores the value 1, indicating that entries for this m/e value exist. In cell 10 the program stores the value five, which is the previous pointer value + number of references at the previous m/e value + 1 for the breakpoint cell between references, which is 1 + 3 + 1 = 5. In cell 11, the program stores the value 10 (5 + 4 + 1). To find the number of references that contain m/e 10, one simply goes to the 10th cell, and since it is not -1, then proceeds to subtract the value in cell 10 (5) from the value in cell 11 (10) minus one, giving 10 - 1 - 5 = 4 references. Thus, the value in cell 10 tells one where in the reference file the ID numbers and intensities are.

In the PDP-10 computer, the basic disk file unit is a block of 128 36-bit computer words. Therefore, to find the location of the reference to m/e 10, one must go to block ((mass - 1)/128) + 1, which is 1 in the case of m/e = 10. Because a block number must be an integer, the block number is found by a computer technique called integer divide. Integer divide simply truncates any fractional part of a number. Thus 8/3 is 2 and not 3, and 10/128 or 100/128 or 1/128 are all 0. In block 1, one then proceeds to word:

modulo (cell 10 value - 1, 128) + 1 = modulo (4, 128) + 1 = 4 + 1 = 5.

The modulo function or operation of modulo (a, b) is defined as a - (a/b)b, where a/b is an integer division. Also, in the computer, the division is performed before the multiplication, so that when a is a number from 1 to 127, and b is 128, a/b will always be 0. The next cells are read until a cell containing a -1 is encountered, which indicates the end of that list of references. Thus the file structure is as simple to generate and understand as it is rapid in its retrieval of the data. Printouts of the first block of the peak pointer file and the first block of the peak reference file are presented in Tables II and III. A schematic diagram of the peak and intensity search program is shown in Figure 3.

Figure 3. Schematic diagram of the peak and intensity search program

One feature considered a necessity was the ability to filter out masses with intensities very different from the intensity of the unknown mass or to look selectively at certain masses [e.g., the base peak (relative intensity 100.0%) at a given m/e value]. Rather than having a fixed intensity factor filter, such as allowing the largest peak in the unknown to be up to 25% larger or smaller than the known spectrum, the program search begins by allowing the chemist to select the value he wants. The values of the intensities of the peaks range from 0 to 100 (i.e., 0.01 to 100.0%). Typically, a factor of 2-4 is used. This means that the intensity values entered into the search program will be divided and multiplied by this intensity range factor to give a lower and an upper limit for the search. It has been found that for low m/e values, lower intensity factors should be used for two reasons. First, the intensity variations between instruments at a low m/e (
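The pointer file, the reference file, and the intensity range factor described in this section can be sketched together in Python. This is a hedged illustration, not the author's FORTRAN code: the files are modeled as in-memory lists rather than disk blocks, all function names are hypothetical, the 18/18-bit split used to pack an ID number and an intensity into one word is an assumption, and applying the block and word arithmetic to the pointer value is one reading of the text. The sample data reproduce the worked example (three references at m/e 9, four at m/e 10).

```python
WORDS_PER_BLOCK = 128   # PDP-10 disk block size given in the text
EMPTY = -1              # unused cell, or breakpoint between reference lists
INTENSITY_BITS = 18     # assumed split of the 36-bit word


def pack(spectrum_id, intensity):
    """Pack an ID number and an intensity into one integer word."""
    return (spectrum_id << INTENSITY_BITS) | intensity


def unpack(word):
    """Recover (ID number, intensity) from a packed word."""
    return word >> INTENSITY_BITS, word & ((1 << INTENSITY_BITS) - 1)


def build_files(peaks, max_mass):
    """Build 1-indexed pointer and reference lists (index 0 is unused).

    `peaks` maps m/e -> list of (spectrum_id, intensity).
    """
    pointers = [EMPTY] * (max_mass + 1)
    references = [EMPTY]
    for mass in sorted(peaks):
        pointers[mass] = len(references)   # position of the first reference
        references.extend(pack(i, inten) for i, inten in peaks[mass])
        references.append(EMPTY)           # breakpoint after each list
    return pointers, references


def locate(position):
    """Block and word for a 1-based position, via integer divide and modulo."""
    block = (position - 1) // WORDS_PER_BLOCK + 1
    word = (position - 1) % WORDS_PER_BLOCK + 1
    return block, word


def lookup(pointers, references, mass):
    """Read references for `mass` until the -1 breakpoint is reached."""
    if mass < 1 or mass >= len(pointers) or pointers[mass] == EMPTY:
        return []
    found = []
    position = pointers[mass]
    while references[position] != EMPTY:
        found.append(unpack(references[position]))
        position += 1
    return found


def search(pointers, references, mass, intensity, factor):
    """IDs whose stored intensity lies within intensity/factor .. intensity*factor."""
    low, high = intensity / factor, intensity * factor
    return [sid for sid, stored in lookup(pointers, references, mass)
            if low <= stored <= high]


# Hypothetical ID numbers and intensities, chosen to match the worked example.
peaks = {
    9: [(101, 40), (102, 100), (103, 5)],
    10: [(101, 12), (104, 30), (105, 55), (106, 90)],
    11: [(107, 20)],
}
pointers, references = build_files(peaks, max_mass=1337)
print(pointers[9], pointers[10], pointers[11])
print(locate(pointers[10]))
print(search(pointers, references, 10, 50, 2))
```

With an intensity of 50 and a factor of 2, the search window is 25 to 100, so the reference with intensity 12 is rejected and the other three are returned.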


minimum intensity level. An example of the printout is found in Figure 11. To conserve space in the computer, a peak and its intensity are stored in one 36-bit computer word, each occupying 18 bits.

Figure 11. Example of the spectrum printout lookup search

Computer Programs, Files, and Times. All of the file generation and file search programs are written in FORTRAN IV (except for a few assembly language subroutines) and run on a time-sharing Digital Equipment Corp. PDP-10. The main memory of the system has a 1.8-μsec cycle time. All of the programs and files are stored on disk packs. The collection of 8124 spectra was generously provided by Professor K. Biemann. The reference file contains numerous duplicates, and no attempt was made to remove any of these. Some effort has been made to correct errors in the data base; however, it is felt that the upgrading of the quality of the file will proceed more rapidly as the system is used by more chemists. There are approximately 2400 spectra from the ASTM Committee E-14 Subcommittee IV spectrum collection (11), 2000 spectra from the Dow Chemical Company (12), 1800 from the American Petroleum Institute (13), and the remaining 1900 from the Mass Spectrometry Data Centre (14) and the laboratory of Professor Biemann at MIT (15).

The various file generation programs require 6,000-14,000 words of computer core to run. File generation times vary from about 1 minute of cpu time for the generation of the two molecular weight files up to about 125 cpu minutes to generate the spectrum printout files. The search system requires 12,000 words of core and most searches use from 2-6 seconds of cpu time, including the searching and printout, and 5 to 15 minutes of elapsed or human time. In almost all cases, the printout programs require more cpu time than the search programs. While the search system has been designed to be used interactively, it would be possible to change the programs to run in a batch programming environment, with the ability to enter an entire spectrum (or spectra from a GC/MS run) and have the program automatically select the two most intense peaks from every 14 m/e interval, perform the search, and print out a list of results.

The files used by the search program are all stored on the PDP-10 RPO-2 disk packs and are called into core by individual blocks, not the entire file. This allows the search system programs to remain small. The sizes of disk files are shown in Table IX. The various intermediary files used by the intersection program require 1-5 blocks. The entire set of files requires about [fraction missing in source] of a standard PDP-10 disk pack.

(11) Uncertified Mass Spectra, Subcommittee IV, ASTM Committee E-14 (1960).

RESULTS AND DISCUSSION

It appears that the abbreviated peak file is a good fingerprint for retrieving a compound for identification, and a valuable guide for human interpretation of a mass spectrum. The ability to sit at a terminal and interact with a highly conversational program has been found to stimulate the chemist’s interpretation. Clearly, some results found are probably unrelated to the compound in hand, but their rejection is in itself useful, since the chemist is now free to consider other alternatives. The extensive options as to types of searches and options within searches (e.g., the variable intensity factor), along with the instantaneous interactive nature of the system, have been found to make the chemist feel that the system has been written and tailored to his needs.

As the value of the system depends on the size of the data base, plans are under way to expand the file. Of even greater value to the chemist would be the ability to do a partial (or complete) structure search on the file, rather than a partial or complete formula search, which is not as specific. For instance, an ability to find examples of fragmentation patterns for molecules with the nitrogen mustard group, N-C-C-Cl, might be very useful. This technique, known as substructure searching, is being developed in this laboratory both for the Wiswesser Line Notation (WLN) (16) and the Chemical Abstracts Service connection tables (17). In the case of the latter type of data base, a computer search system is under study which will allow for interactive file searching.

While the search system is small and efficient, the files are quite large and require a large on-line disk storage capacity, available at few computer installations. The largest file is the full spectrum file, containing all the peaks and their intensities. One possible alternative to storing such a large file in the computer is to put the full spectrum file in a microfiche retrieval unit in the chemist’s laboratory driven remotely by the search system program. In such a device, the ID number would be a pointer to a given microfiche card and page number, in a manner identical to the pointer system used for the peak, molecular weight, and spectrum lookup disk files described in the previous section. As the file grows in size, the microfiche becomes economically very attractive compared to the cost of on-line computer disk storage. Also, the microfiche reader can be operated manually and used for other storage purposes.

(12) R. S. Gohlke, Ed., Uncertified Mass Spectral Data, Dow Chemical Company, Midland, Mich., 1963. Distributed through the ASTM Committee E-14.
(13) Catalog of Selected Mass Spectral Data, American Petroleum Institute Research Project 44.
(14) Mass Spectrometry Data Centre, AWRE, Aldermaston, Berks, England.
(15) Professor K. Biemann, MIT, private communication, 1971.
(16) R. J. Feldmann and D. A. Koniver, J. Chem. Doc., 11, 151 (1971).
(17) R. J. Feldmann and S. R. Heller, ibid., 12, 48 (1972).

ACKNOWLEDGMENTS

The author wishes to express his appreciation to Richard J. Feldmann for the extremely efficient intersecting list algorithm. The author also wishes to thank Henry M. Fales, G. W. A. Milne, Robert J. Highet, D. J. Pedder, and J. W. Wheeler for their generous use and criticism of the search system, and K. Biemann for the data base.

RECEIVED for review March 23, 1972. Accepted June 13, 1972. Presented in part at the 163rd National Meeting of the American Chemical Society, Boston, Mass., April 9-14, 1972.

Computer Acquisition and Analysis of Gas Chromatographic Data

R. A. Landowne,1 R. W. Morosani,2 R. A. Herrmann, R. M. King, Jr.,3 and H. G. Schmus4
American Cyanamid Company, Central Research Division, Stamford, Conn. 06904

A computerized system for a multiple instrument gas chromatographic laboratory is described. Simultaneous operation of all chromatographs is possible in real time even while the computer performs other functions. A set of resident programs controls the entire process, which requires a minimal amount of operator interaction regardless of the complexity of the chromatographic analysis. In either method development or routine analysis, only a few input parameters are required to choose several modes of data handling, with each instrument capable of operating in its own independent fashion. A teletypewriter is used almost exclusively for outputs of results, while most sample information and mode selection is entered through simple data switch boxes. Peak resolution and baseline determination are accomplished for almost all situations encountered without resorting to special routines.

THE USE OF DIGITAL COMPUTERS at least in some, if not in all, stages of the handling of gas chromatographic data has been demonstrated and practiced in many laboratories for several years. The possible approaches are numerous depending upon the type of laboratory (e.g., research or analytical control), the volume of the work, the kind of computer and its peripheral equipment, the capital investment to be made, and the caliber and training of manpower available to operate the system. Examples of present day systems that fall into three major categories are: off-line electronic integration followed by computerized data processing from the data collected on tape (1); on-line system totally dedicated to a large number of gas chromatographs (2); on-line system using a time-shared computer (3). Many of these systems are commercially available from either computer manufacturers or gas chromatograph vendors.

This last type of configuration was determined to be most suitable for our laboratory, which performs the complete spectrum of gas chromatographic functions from routine quality control analyses to complex method development work, in the midst of other analytical functions that could also be computerized (e.g., mass spectrometry, NMR, etc.). The computer on hand was, in fact, already functioning as an on-line instrument for mass spectrometry data acquisition. Subsequent reduction of data and other batch processes were time shared with the real time acquisition. This software was transferred from an existing XDS 930 to XDS Sigma 2 when the system was expanded to handle gas chromatographs.

The chromatography system was designed to operate under the Real Time Batch Monitor of the Sigma 2. This consists of a Random Access Disc of 3 megabytes. Total memory available is 36 K words. New software was developed as a set of programs to cover various analytical situations encountered in the gas chromatography laboratory. These were capable of operating simultaneously with the acquisition of mass spectrometry data and other batch processing. For the most part, the software is written in X-Symbol assembly language. There are six controller tasks which handle

1 To whom inquiries should be addressed.
2 Present address, Laurel Ridge, Litchfield, Conn.
3 Present address, Xerox Data Systems, 1701 Research Blvd., Rockville, Md.
4 Present address, Avon Products, Inc., Suffern, N.Y.

(1) C. Merritt, J. T. Walsh, R. E. Kramer, and D. H. Robertson, in “Gas Chromatography 1968,” C. L. A. Harbourn, Ed., Institute of Petroleum, London, 1969, pp 338-40.
(2) H. R. Felton, H. A. Hancock, and J. K. Knupp, Jr., Instrum. Contr. Syst., 40, 83 (1967).
(3) R. D. McCullough, J. Gas Chromatogr., 5, 635 (1967).