Computer Applications in Mass Spectrometry - ACS Symposium

17 Computer Applications in Mass Spectrometry

Downloaded by UNIV OF MASSACHUSETTS AMHERST on May 21, 2018 | https://pubs.acs.org Publication Date: June 1, 1978 | doi: 10.1021/bk-1978-0070.ch017

F. W. McLAFFERTY and R. VENKATARAGHAVAN
Department of Chemistry, Cornell University, Ithaca, NY 14853

The information content of the mass spectrum of an average organic compound is unusually high, containing at least 50 bits of information (1); further, with modern instrumentation a mass spectrum can be obtained on a nanogram of compound in one second. Obviously, to utilize this tremendous quantity of data it must be acquired, reduced, and interpreted in a very rapid and efficient manner. The modern digital computer (COM) has proved to be ideal for all three of these duties, and its rapid increase in capabilities, with the concomitant reduction in cost, in the last decade has thus led to a similar revolution in mass spectrometry (MS).

Acquisition and Reduction of MS Data

The book of Waller (2) was very timely in showing the surprising advances in MS/COM systems that had occurred in the few years before 1971. This book contained reports from many leading MS research laboratories concerning the automated data acquisition and reduction systems then in use. There was a wide variety of these, the majority of which had been developed to a substantial extent in the reporting laboratory. In the intervening six years the field has changed dramatically; literally hundreds of mass spectrometry laboratories now have computerized data acquisition and reduction systems, and most of these are manufacturer supplied. Only a small fraction of these acquire data on, for example, magnetic tape for later processing on a central computer; the on-line minicomputer is the rule, but it has increasing competition from microprocessors and even sophisticated calculators. A computer-controlled GC/MS is now available for less than $50,000 (3), and a microprocessor-driven MS is available which gives the confidence of identification as well as quantitative analyses for 16 preselected compounds every second (4).


Gross; High Performance Mass Spectrometry: Chemical Applications ACS Symposium Series; American Chemical Society: Washington, DC, 1978.


There are still challenging areas of high potential for further applications of on-line computers in MS data acquisition and reduction, such as high-resolution MS, simplification of MS operation, maintenance and record keeping, collisional activation MS, and continuous analyzers (e.g., patient monitoring, process control) (5). However, just as mass spectrometry evolved so that most laboratories no longer constructed their own instruments, the field has suddenly progressed to where computer data acquisition and reduction equipment which is superior for most applications is now available commercially. This is not true, however, for a highly promising and challenging area, that of the computer identification of unknown mass spectra (6), and this will be emphasized in the remainder of our discussion. If efficient mass spectrometer systems with totally automated data acquisition, reduction, and identification could be made available at reasonable cost, this would surely open many important new areas of application for MS.

Computer Identification of Unknown Mass Spectra

The identification of unknown compounds from their mass spectra, whether as pure compounds or in mixtures, is a problem for which we feel it is better to use two approaches, retrieval and interpretation, with these applied sequentially. The unknown mass spectrum is first matched against a library of all available reference spectra; if no spectrum is retrieved which matches sufficiently well, the unknown spectrum is then interpreted to obtain as much structural information as possible.
A variety of systems for both retrieval and interpretation have been proposed (6); interpretation using pattern recognition systems has been described at this meeting by Isenhour (7) and by Wilkins (8), and the Artificial Intelligence system (9) by Smith; thus this report will emphasize the "Probability Based Matching" (PBM) (10, 11) and the "Self-Training Interpretive and Retrieval System" (STIRS) (12-15) developed at Cornell for these purposes. Both of these systems are actually available internationally to outside users over the TYMNET computer networking system (16).

Probability Based Matching

For document retrieval in libraries (17), it is well known that optimization requires "weighting" of the descriptors or requirements sought; some are more important than others to the person conducting the search. Low resolution mass spectra contain two principal types of data, masses and abundances; the PBM system (10, 11) employs a probability weighting of these. As first proposed by Grotch (18), the probability of occurrence of particular abundances (based on 100% for the most abundant peak) should follow a log normal distribution. This was shown to be true for a data base of 18,806 different compounds (19); abundance
ranges differing by a factor of two in their occurrence probability are >0.24%, >1.0%, >3.4%, >9.0%, >19%, >38%, and >73%. The probability of occurrence of the different mass values also varies widely. Because larger molecular fragments tend to decompose to give smaller fragments, higher m/e values are less common in mass spectra, the probability decreasing by a factor of 2 approximately every 130 mass units. Although this is a surprisingly smooth function at high m/e values, lower values show rather large differences in their probability of occurrence, with these variations repeated every 14 mass units. Thus, although m/e 39, 41, and 43 ions of 1% or greater abundance are each found in more than two-thirds of all reference spectra, peaks m/e 33, 34, and 35 each occur less than 10% as frequently (19). Thus if abundant ions at m/e 34, 241, and 343 in an unknown mass spectrum are matched by comparably abundant ions in a reference spectrum, this is far more significant in indicating that a correct retrieval has been made than if the unknown's m/e 39, 41, and 43 peaks were matched in a reference spectrum. These abundance and mass uniqueness weightings are used to calculate the probability that the match occurred by chance and is thus a "false positive"; the base-2 logarithm of the reciprocal of this probability (the "confidence index," K) is used to rate the degree of match.

A second unique feature of PBM, which has been proposed independently by Abramson (20), is "reverse searching", which is valuable for the identification of components in mixtures. In this, PBM ascertains whether the peaks of the reference spectrum are present in the unknown spectrum, not whether the unknown's peaks are in the reference. Thus the reverse search in effect ignores peaks in the unknown which are not in the reference spectrum, as they could be due to other components of the mixture.
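The weighting and reverse-search ideas above can be illustrated with a short Python sketch. It is not the Cornell PBM program: the function names, the smooth 130-mass-unit rule applied at all masses (the text notes that low m/e values actually fluctuate with a 14-unit period), the zero point of the mass scale, and the factor-of-two abundance tolerance are all assumptions made here for illustration. Each abundance threshold passed and each 130 mass units is counted as one bit, and K is taken as the sum of bits over the matched reference peaks.

```python
# Abundance thresholds (% of base peak) quoted in the text; each step
# corresponds to a factor-of-two drop in probability of occurrence (1 bit).
ABUNDANCE_THRESHOLDS = [0.24, 1.0, 3.4, 9.0, 19.0, 38.0, 73.0]


def abundance_bits(abundance):
    """Bits of uniqueness for a peak of the given abundance (% of base peak)."""
    return sum(1 for t in ABUNDANCE_THRESHOLDS if abundance > t)


def mass_bits(mz):
    """Bits of uniqueness for a mass value.

    Assumption: probability halves every 130 mass units at all masses,
    measured from an arbitrary zero at m/e 0.
    """
    return mz / 130.0


def reverse_search_k(reference, unknown, tolerance=2.0):
    """Reverse search: are the reference peaks present in the unknown?

    reference, unknown: dicts mapping m/e -> abundance (% of base peak).
    Peaks found only in the unknown are ignored, since they may come from
    other mixture components. Returns K (in bits), or None if any
    reference peak is absent or outside the (hypothetical) abundance
    tolerance.
    """
    k = 0.0
    for mz, ref_ab in reference.items():
        unk_ab = unknown.get(mz)
        if unk_ab is None:
            return None
        if not (ref_ab / tolerance <= unk_ab <= ref_ab * tolerance):
            return None
        k += abundance_bits(ref_ab) + mass_bits(mz)
    return k


reference = {34: 50.0, 241: 100.0}
unknown = {34: 40.0, 39: 80.0, 41: 90.0, 241: 100.0}
k_value = reverse_search_k(reference, unknown)
```

In this toy case the unknown's extra peaks at m/e 39 and 41 do not affect the result, which is the point of the reverse search; a reference containing a peak absent from the unknown is rejected outright.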
Although reverse searching should thus reduce the capabilities of PBM for matching unknown spectra of pure compounds, this apparently is more than offset by the increased capabilities resulting from the data weighting. The system has been tested with over 800 "unknown" mass spectra taken from a large collection of mass spectra from diverse sources (21), and its performance has been shown to be generally superior (22) to the widely accepted Biemann-MIT system (23), which matches the two largest peaks in each 14 mass unit region. Performance has been measured utilizing "recall/reliability" plots, a methodology developed for library retrieval systems (17) in which recall (RC) is defined as the proportion of relevant spectra actually retrieved, and reliability (RL, or "% correct" × 0.01) is the proportion of retrieved spectra which are actually relevant:

$$\mathrm{RC} = \frac{I_c}{P_c} \qquad (1)$$

$$\mathrm{RL} = \frac{I_c}{I_c + I_f} \qquad (2)$$

$$\mathrm{FP} = \frac{I_f}{P_f} \qquad (3)$$

Here FP = proportion of false positives, $I_c$ = number of correct identifications, $I_f$ = number of false identifications, $P_c$ = number of possible correct identifications, and $P_f$ = number of possible false identifications ($P_c$ + $P_f$ = total unknowns examined). The "recall/reliability" plot is then constructed by determining the pair of these values achieved for particular K (or "ΔK") value thresholds (11); the higher the K value demanded by the user for an identification, the higher the probability that it is correct (RL), but the lower the chance of making an identification for a particular unknown (RC). Note that if the evaluation is made considering only the highest K selection as the match, the number of compounds tested will be equal to the number of possible correct identifications ($P_c$), to $P_f$, and also to ($I_c$ + $I_f$), so that for this evaluation RC = RL = (1 - FP).

Two sets of "unknown" spectra of over 400 each were selected at random from the molecular weight (MW) ranges 144-160 and 232-312; at 15% recall these showed reliability values of 83% and 85%, respectively; at 25% RC, RL = 76% and 62%; and at 50% RC, RL = 65% and 42%. As discussed later, the 65% RL value corresponds to a false positive rate of ~1/50,000. The lower results for the high MW set are mainly due to the use of only 15 peaks per spectrum, and this has been increased, depending on MW, to as many as 26. Unknowns present as 30% and 10% components in mixtures gave reliabilities of 60% and 20%, respectively, at the 25% recall level; however, these results were far superior to those from the forward search system on the same unknowns. To repeat, for unknowns giving matches of unsatisfactory reliability, STIRS should be used also.
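As a worked illustration of the recall, reliability, and false-positive definitions, the following Python sketch computes the three quantities and sweeps a K threshold over a small set of results. The function names and the toy data are hypothetical; only the three ratios come from the definitions in the text.

```python
def recall_reliability(i_c, i_f, p_c, p_f):
    """Return (RC, RL, FP) from identification counts.

    i_c, i_f: numbers of correct and false identifications made.
    p_c, p_f: numbers of possible correct and possible false identifications.
    """
    retrieved = i_c + i_f
    rc = i_c / p_c
    rl = i_c / retrieved if retrieved else 0.0
    fp = i_f / p_f
    return rc, rl, fp


def sweep_thresholds(results, thresholds, p_c, p_f):
    """Build recall/reliability points for a list of K thresholds.

    results: list of (k_value, is_correct) pairs, one per top-ranked match.
    An identification is counted only when its K meets the threshold.
    """
    points = []
    for t in thresholds:
        i_c = sum(1 for k, ok in results if k >= t and ok)
        i_f = sum(1 for k, ok in results if k >= t and not ok)
        points.append((t,) + recall_reliability(i_c, i_f, p_c, p_f))
    return points


# Toy data: five unknowns, best match only, so P_c = P_f = 5.
results = [(40, True), (35, True), (30, False), (20, True), (10, False)]
points = sweep_thresholds(results, thresholds=[0, 30], p_c=5, p_f=5)
```

With the threshold at 0 every top-ranked match is accepted, so RC, RL, and 1 - FP coincide, as noted in the text for the highest-K evaluation; raising the threshold to 30 raises RL while lowering RC.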
Because retrieval of the spectrum of a compound whose structure is similar to, but not exactly the same as, that of the unknown can be helpful to the user, four "classes of match" have been defined: I, identical compound or stereoisomer; II, class I or ring position isomer; III, class II or homolog; and IV, class III or an isomer of class III compound formed by moving only one carbon atom. The recall/reliability performance of PBM was evaluated separately for these four matching criteria; at the 50% recall level 95% of the compounds (low MW set) selected matched within the class IV criteria. As has been suggested previously (24), the subtraction of the reference spectrum of an identified compound from the mixture spectrum should produce a residual unknown spectrum which is easier to identify. This approach has been implemented so that at the end of the PBM run the computer automatically subtracts the best matching reference spectrum (or, at user command, some
other reference spectrum) from the unknown spectrum and notifies the user of any significant residual peaks. Although PBM is a reverse search system, its recall is lower for components in lower proportion of the mixture; thus subtracting out the contribution of an abundant component improves the PBM recall for other mixture components. Recently In Ki Mun in our laboratory has applied the PBM principles to the identification of the number of chlorine and bromine atoms present in a fragment ion from the abundances of the isotopic peaks. In a test of more than 1,000 "unknown" spectra taken from our large data base, the predictions of this PBM program were correct 90% of the time, and most of the incorrect predictions were due to inaccurate data.

Self-Training Interpretive and Retrieval System

In some interpretive methods such as Artificial Intelligence (9) the computer is programmed to recognize the mass spectral behavior of particular types of compounds or substructures. In pattern recognition or learning machine approaches, the computer trains itself for such recognition based on reference spectra which do and do not contain a particular structural feature. In contrast, there is no pretraining of STIRS (12) for the mass spectral behavior of particular structural moieties; rather, 15 classes of mass spectral data (Table I) have been selected which are indicative of different compound types or substructures, but without designating what these types or substructures are. STIRS then, in effect, trains itself to interpret the unknown mass spectrum by matching the unknown data for each of these classes against the corresponding data for all of the reference spectra; particular substructures or types of compounds which are found in a substantial proportion of the best matching reference compounds have a correspondingly high probability of being present in the unknown.
Thus STIRS does not have to be pretrained to recognize the presence of any particular structural feature; however, it cannot identify any feature which is not present in some compounds of the reference file. In further contrast to some applications of pattern recognition methods, we recommend that STIRS be used for positive information only; the presence of a structural feature in many of the selected reference compounds indicates the presence of that substructure in the unknown, but the absence of a substructure in the selections is not a reliable indication of its absence in the unknown. The present Cornell PBM system employs a data base of 41,429 different spectra of 32,403 different compounds. To increase the selectivity of STIRS, it uses a data base of only the best spectrum (25) of each compound, limiting this further to the 29,468 compounds which contain only the common elements H, C, N, O, F, Si, P, S, Cl, Br, and/or I. A number of improvements made recently to the system will be described below.
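The self-training principle described above (rank the reference compounds by how well they match the unknown in one data class, then report substructures common among the best matches) can be sketched in Python as follows. This is an illustration only, not the STIRS program: the match factor used here (overlap of peak masses), the number of best matches kept, the reporting threshold, and the toy library are all hypothetical.

```python
from collections import Counter


def match_factor(unknown_peaks, reference_peaks):
    """Hypothetical similarity for one data class: overlap of peak masses."""
    u, r = set(unknown_peaks), set(reference_peaks)
    union = u | r
    return len(u & r) / len(union) if union else 0.0


def predict_substructures(unknown_peaks, library, top_n=15, min_fraction=0.5):
    """Report substructures found in a large share of the best matches.

    library: list of dicts with "peaks" (masses for this data class) and
    "substructures" (labels known for that reference compound).
    Only positive information is returned: a substructure missing from
    the result is not evidence of its absence in the unknown.
    """
    ranked = sorted(
        library,
        key=lambda entry: match_factor(unknown_peaks, entry["peaks"]),
        reverse=True,
    )
    best = ranked[:top_n]
    if not best:
        return {}
    counts = Counter(s for entry in best for s in set(entry["substructures"]))
    return {
        s: c / len(best)
        for s, c in counts.items()
        if c / len(best) >= min_fraction
    }


library = [
    {"peaks": [51, 77, 105], "substructures": ["phenyl", "carbonyl"]},
    {"peaks": [51, 77, 91], "substructures": ["phenyl"]},
    {"peaks": [43, 57, 71], "substructures": ["alkyl"]},
    {"peaks": [77, 105, 120], "substructures": ["phenyl", "carbonyl"]},
]
found = predict_substructures([51, 77, 105], library, top_n=3, min_fraction=0.6)
```

No rule about phenyl or carbonyl groups is coded anywhere; the labels emerge only because they recur among the best matching reference entries, which also means a feature absent from the library can never be reported.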


Table I. Mass Spectral Data Classes Used in STIRS

Data class   Description, maximum number of peaks          Range of mass or mass loss
1            Ion series (14 amu separation)
2-4          Characteristic ions                           [lost in source; surviving fragment: "175)"]
6B           Five                                          76-149 (MW>250)
5C           Five                                          16-20, 30-38, 44-!, 59-65, 72-76
6C           Five                                          26-28, 39-42, 52-[...], 66-70, 80-84
7, 8         Secondary neutral losses from most abundant
             odd-mass (MF7) and even-mass (MF8) loss
11           [description lost in source]

99% of cases this will be due to the presence of phenyl in the unknown compound, i.e., in