Computer Identification of Mass Spectra Using Highly Compressed Spectral Codes

Stanley L. Grotch
Jet Propulsion Laboratory, Pasadena, Calif. 91103

In file search methods, storage is often a significant problem, particularly with mini-computers. Spectral abbreviation alleviates the problem by coding only 2 peaks/14 amu. This concept may be further exploited by noting that the mass position of any peak in a 14-amu window may be coded using only four bits. For nearly 7000 spectra in the Aldermaston collection, this code requires an average of 48 bits/spectrum. Tests indicate that this code is highly specific and, with appropriate matching algorithms, will produce very effective identifications. Further improvements in identification accuracy are obtained when two bits of intensity information are added to the peak position. Using an IBM 360/44, a 7000-spectrum library can be searched in less than 10 seconds. Since most computers now manufactured have word sizes which are multiples of four bits, this technique should lend itself well to most machines.
BECAUSE OF THE STEADILY INCREASING use of computers coupled to analytical instruments and the resultant high data rates, much work has been done on the use of computers for the interpretation of chemical spectra (1-20). In interpreting mass spectra, three general approaches have been pursued:

1. File search (1-4, 9, 10-13, 16).
2. Learning machine (5-7, 14, 15).
3. Artificial intelligence (17).

(1) S. Abrahamsson, S. Stallberg-Stenhagen, and E. Stenhagen, Biochem. J., 92, 2 (1964).
(2) B. Pettersson and R. Ryhage, Arkiv Kemi, 26, 293 (1967).
(3) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1464 (1968).
(4) R. A. Hites and K. Biemann, Advan. Mass Spectrom., 4, 37 (1968).
(5) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ANAL. CHEM., 41, 690 (1969).
(6) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, ibid., p 21.
(7) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ibid., p 1945.
(8) S. L. Grotch, ibid., 42, 1214 (1970).
(9) B. A. Knock, I. C. Smith, D. W. Wright, W. Kelly, and R. G. Ridley, ibid., p 1516.
(10) F. E. Lytle and T. L. Brazie, ibid., p 1532.
(11) S. L. Grotch, ibid., 43, 1362 (1971).
(12) L. E. Wangen, W. S. Woodward, and T. L. Isenhour, ibid., p 1605.
(13) H. S. Hertz, R. A. Hites, and K. Biemann, ibid., p 681.
(14) P. C. Jurs, Appl. Spectrosc., 25, 483 (1971).
(15) D. D. Tunnicliff and P. A. Wadsworth, ANAL. CHEM., 45, 12 (1973).
(16) R. G. Ridley, in "Biochemical Applications of Mass Spectrometry," G. R. Waller, Ed., Wiley, New York, N.Y., Chap. 6, 1972.
(17) J. Lederberg, in "Biochemical Applications of Mass Spectrometry," G. R. Waller, Ed., Wiley, New York, N.Y., Chap. 7, 1972.
(18) A. J. Baker, T. Cairns, G. Eglinton, and R. J. Preston, "More Spectroscopic Problems in Organic Chemistry," Heyden & Son, Ltd., London, 1967 (problem 12).
(19) D. H. Robertson, J. Cavagnaro, and J. B. Holz, Twentieth Annual Conference on Mass Spectrometry and Allied Topics, Dallas, Texas, June 1972.
(20) B. S. Finkle and D. M. Taylor, J. Chromatogr. Sci., 10, 312 (1972).
ANALYTICAL CHEMISTRY, VOL. 45, NO. 1, JANUARY 1973
Primarily because of its innate simplicity, generality, and the availability of large spectral libraries, the file search procedure has received the widest attention and is today used in most automated identification systems. An excellent summary and comparison of a number of search techniques has been given recently by Ridley (16). In file searching, a library of known spectra is coded, typically by extracting from the complete spectrum some subset of the total data. Generally, the spectrum is transformed into a code directly suitable for computer searching, thereby reducing both computer storage requirements and search times. Each unknown spectrum, coded in the same manner, is compared with either the complete set of library codes or some selected subset of these codes to find those known spectra which "best fit" the unknown according to some criterion. The coding method and the criterion of "best fit" will influence the efficacy of the identification. This paper examines several algorithms used in conjunction with several simple codes.

In the learning machine approach, the answer to a "yes/no" question is generally sought: "Does the unknown compound have a double bond?", "Does it contain oxygen?", etc. The technique has demonstrated high reliability in answering suitably posed questions. This approach has several advantages when compared to the file search procedure. Once the weighting vector for a particular question is obtained by "training," the answer to that question requires only a simple vector multiplication. Hence, search times are independent of library size. Since no library as such is used, answers are obtained even when the unknown compound is not in the original training library.
A major deficiency of this technique, however, is that it does not directly answer the question which is generally of most interest, "What specific compound gave rise to this spectrum?"

In the artificial intelligence approach followed at Stanford, an ambitious attempt is being made to model the chemist's reasoning processes in interpreting mass spectra (17). The method has demonstrated its ability to differentiate between the many isomers of complex molecules using mass spectra. However, because it is still primarily a research tool, it has not been developed to the point where it is economically attractive for routine spectral interpretation.

FILE SEARCH SYSTEMS
The two most generally desirable attributes of any search scheme are (1) high efficiency and (2) low cost. Low costs are closely related to the computer system needed in the search, primarily in the areas of storage and search speeds. The cost factor has been the driving impetus in most search schemes where spectra are "compressed" or "abbreviated" both to reduce storage and to increase search speeds. It is now generally agreed that mass spectra are highly overdetermined and that much information can be discarded without seriously sacrificing analysis reliability. Earlier studies (8, 12) showed that if intensity information is effectively ignored, good identifications could still be achieved if only the
mass positions of peaks above a threshold were known. In this case, both storage requirements and search times were reduced by an order of magnitude without significantly degrading search reliability. Although it has generally been found that most search techniques yield reliable results, there do appear to be differences between the efficacy of various search algorithms. Further work is needed to more clearly determine which codes and matching algorithms are “best” in a particular application. In this paper, several coding schemes and algorithms are compared. A code reducing spectral storage by nearly two orders of magnitude yields results which may be satisfactory in many applications.
[Figure 1. Four-bit code for expressing peak position. Axes: nominal mass and peak intensity. Windows 1 to 4 cover nominal masses 6-19, 20-33, 34-47, and 48-61.]
SPECTRAL ABBREVIATION
Several workers (9, 13) have exploited a technique known as spectral abbreviation. Here, a spectrum is divided into N regions (windows), generally each of a constant number of amu. In each window, the M most intense peaks are encoded (usually M = 1, 2, or 3) with varying degrees of intensity information included. Spectral abbreviation results in (1) a significant decrease in data storage (and, generally, in search times) and (2) a reduction in the effect of instrumental differences, yielding improved identifications.

To assess the significance of abbreviation, a library of 6880 spectra (about 5000 different compounds) was examined. This library is essentially the Dow, ASTM, MIT, and MCA collections of organic compounds with an average molecular weight of 168 (8). In this library, peak intensities are reported from 0.01% base peak to 100% base peak in increments of 0.01%. To fully encode this dynamic range, 14 bits/peak are required. With computers based on a byte structure (i.e., multiples of 8 bits), 32 bits would be used for each spectral peak (16 for mass, 16 for intensity). Since the average library spectrum considered has 95 peaks, an average of 3040 bits would be required for encoding.

For the same library with an abbreviation scheme using window widths of 14 amu and a starting mass of 6 (i.e., window 1 covers the mass range 6-19, etc.), an average of 11.7 windows is required for coding. Using the two most intense peaks in each window, an average of 19.6 peaks/spectrum is encoded (this is less than 2 × 11.7, since many windows contain only one peak). Thus, for this library, retaining the same intensity accuracy, abbreviation requires only 640 bits/spectrum, or about one-fifth of the storage needed for the complete spectrum.
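The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a restatement of the numbers given in the text; the variable names are illustrative. Multiplying 19.6 peaks by 32 bits gives about 627 bits, close to the 640 bits quoted, and the small difference presumably reflects rounding in the original figures.

```python
import math

# Figures quoted in the text for the 6880-spectrum library.
INTENSITY_LEVELS = 10_000        # 0.01% to 100% base peak in steps of 0.01%
BITS_PER_PEAK = 32               # 16 bits for mass + 16 bits for intensity
AVG_PEAKS_FULL = 95              # average peaks in a complete spectrum
AVG_PEAKS_ABBREVIATED = 19.6     # two strongest peaks per 14-amu window

# Smallest whole number of bits that can distinguish 10^4 intensity levels.
intensity_bits = math.ceil(math.log2(INTENSITY_LEVELS))

full_bits = AVG_PEAKS_FULL * BITS_PER_PEAK
abbreviated_bits = AVG_PEAKS_ABBREVIATED * BITS_PER_PEAK

print(intensity_bits)                            # 14
print(full_bits)                                 # 3040
print(round(abbreviated_bits))                   # 627
print(round(abbreviated_bits / full_bits, 2))    # 0.21
```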
12-bit code 0000 1000 1010. The extension of this coding to higher masses or to more than one peak every 14 amu is self-evident. It should be emphasized that this 4-bit coding scheme for peak position permits exact reconstruction of all peak positions in the abbreviated spectrum and thus involves no loss in information content.

With this method of coding, the subgroup "1111" will not occur, and it may be utilized for special purposes. One possible use might be in indicating the termination of a non-fixed length code (punctuation). A second use might be to flag successive "0000" windows within a code. In both cases, storage could be cut even further using this device. This concept could be extended even further by using more sophisticated schemes which exploit statistical properties of these codes (10). However, in most cases, the computational simplicity of a fixed code will generally outweigh any further storage reductions achieved using more complex schemes.

The 4-bit grouping is doubly fortuitous since most computers currently manufactured have word lengths which are multiples of 4 bits. Thus, this code should be readily implemented on most computers.

The 4-bit concept considered above encodes only mass positions. Further compression can be achieved by reducing the number of bits allocated for intensity information. However, here a loss in information content occurs since the original intensity information can no longer be reconstructed exactly from the coded data. Intensity information, while important, generally plays a secondary role in the identification process. In any case, in most situations, encoding a dynamic range of 10^4 using 14 bits is excessive. In this study, very crude coding of intensity, to effectively only one or two bits, provides accurate identifications.
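A minimal sketch of the 4-bit position code may make the scheme concrete. It rests on assumptions not spelled out in the surviving text: windows are 14 amu wide starting at mass 6, "0000" marks a window with no coded peak, and a peak is coded as its offset within the window plus one (values 1 to 14), which leaves "1111" unused. The function names and the example peaks at masses 27 and 43 are hypothetical; they were chosen because, under these assumptions, they reproduce the 12-bit pattern 0000 1000 1010 quoted above.

```python
WINDOW_WIDTH = 14   # amu per window, as in the text
START_MASS = 6      # window 1 covers masses 6-19

def encode_positions(peaks, n_windows):
    """Return one 4-bit code per window for the strongest peak in it.

    peaks maps nominal mass -> intensity. Code 0 means "no peak";
    codes 1-14 give the position inside the window (assumed convention).
    """
    codes = [0] * n_windows
    strongest = [0.0] * n_windows
    for mass, intensity in peaks.items():
        window, offset = divmod(mass - START_MASS, WINDOW_WIDTH)
        if window < 0 or window >= n_windows:
            continue  # peak lies outside the coded mass range
        if intensity > strongest[window]:
            strongest[window] = intensity
            codes[window] = offset + 1
    return codes

def decode_positions(codes):
    """Recover the nominal masses of the coded peaks exactly."""
    return [START_MASS + w * WINDOW_WIDTH + code - 1
            for w, code in enumerate(codes) if code]

def to_bits(codes):
    """Render the codes as space-separated 4-bit groups."""
    return " ".join(format(code, "04b") for code in codes)

example = {27: 100.0, 43: 60.0}          # hypothetical spectrum
codes = encode_positions(example, 3)
print(to_bits(codes))                    # 0000 1000 1010
print(decode_positions(codes))           # [27, 43]
```

Because decoding returns the original masses, the sketch also illustrates the point that position coding alone loses no information about the abbreviated spectrum.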
COMPRESSED CODE FOR PEAK POSITION
This concept may be exploited even further by observing that the number 14 is fortuitously

5%, the number would increase by about a factor of three. Additionally, any unknown with this condition would also have to be coded twice, requiring the examination of twice as many answers for identification. Since the alternative of coding two peaks per window would only double the storage required, while probably improving recognition accuracy, it would be preferable to multiple encoding.

A second alternative for circumventing this problem is to always arbitrarily encode either the upper or the lower mass position in a window when two peaks have a height ratio less than a specified value. For example, adopting the upper mass convention, for a spectrum with two peaks of comparable intensity at masses 41 and 42, mass 42 would be encoded. With the unknowns tested here, this approach improves the reliability of identification.

SEARCH CONDITIONS AND TECHNIQUES
ENCODE ONLY ONE PEAK POSITION PER WINDOW
The question arises, "How unique is a code in which only one mass in each window is given?" To answer this question, all library spectra were encoded using a fixed length code of 128 bits, covering 32 windows of 14 amu each for the mass range 6-453. This 128-bit code was considered as a binary number, and the file was sorted on this number. All spectral codes which are identical must be adjacent members in the sorted file. It is, therefore, very simple to find all codes which are redundant in the sense that different compounds would be indistinguishable in terms of their codes. Since the sorting process takes only a matter of seconds, this technique also provides a practical means for removing redundancies in library files. Here, the library was reduced from 6880 spectra to less than 6000. Excluding compounds which are isomers or close homologs (e.g., ethylphenol, dimethylphenol), only about 50 sets of compounds were found to have identical codes. In virtually all cases, these were related and generally had the same molecular weight (e.g., 1-methylcyclopentene and 2,4-hexadiene; 1-pentyne and cyclopentene). Statistically, the probability that two compounds picked at random from the file would have identical codes is about 10^-4. These results indicate, at least semi-quantitatively, that this code is likely to be very specific.

A second difficulty arises if two peaks in a window have virtually identical intensities. Which one is coded, and what ambiguity results when this is done? To examine this question, each library spectrum was divided into 16 windows covering the mass range 6-229. In each window, the ratio of the intensities of the two strongest peaks was calculated when both peaks were greater than a specified threshold. These results are summarized in Figure 2. About 50% of the spectra have at least one window in which this intensity ratio is less than 1.20 where both peaks have intensities > 5% base peak.
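The sort-based test for redundant codes described above can be sketched as follows. This is an illustration under stated assumptions rather than the original program: each spectrum is represented by a list of 4-bit window codes (0 to 14), the codes are packed into a single integer with the first window most significant, and the compound names and codes in the example are hypothetical.

```python
from itertools import groupby

def pack_code(window_codes):
    """Pack 4-bit window codes into one integer (first window highest)."""
    value = 0
    for code in window_codes:
        if not 0 <= code <= 14:
            raise ValueError("window code must lie in 0..14")
        value = (value << 4) | code
    return value

def find_identical_codes(library):
    """Return groups of names whose packed codes are identical.

    library is a list of (name, window_codes) pairs of equal code length.
    After sorting on the packed code, identical codes are adjacent, so a
    single pass over the sorted file finds every redundant group.
    """
    keyed = sorted((pack_code(codes), name) for name, codes in library)
    groups = []
    for _, members in groupby(keyed, key=lambda pair: pair[0]):
        names = [name for _, name in members]
        if len(names) > 1:
            groups.append(names)
    return groups

# Hypothetical three-window codes, for illustration only.
toy_library = [
    ("compound A", [0, 8, 10]),
    ("compound B", [3, 8, 10]),
    ("compound C", [0, 8, 10]),
]
print(find_identical_codes(toy_library))   # [['compound A', 'compound C']]
```

With 32 windows the packed value is the 128-bit number mentioned in the text; Python integers are unbounded, so no special handling is needed here.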
This result suggests that encoding only the mass of the most intense peak might cause some ambiguity. Two means for alleviating this difficulty can be suggested. First, if the peak height ratio of the two most intense peaks in any window is less than a given value, say 1.20, that spectrum
The effectiveness of these codes for identification was tested using a set of 125 “unknown” spectra from an earlier study (11). These unknowns range in molecular weight from 57 to 256 with an average of 121. Each unknown spectrum appears at least once in the library but is a different measurement from that found in the library. Two preconditions could be imposed on any search:
(1) The base peak of the library spectrum must correspond to a peak in the unknown with an intensity greater than Y% of the base peak (generally, Y = 75).
(2) The molecular weight of the library spectrum must be the same as the molecular weight of the unknown.

In this study, on the average, imposing restriction 1 alone (Y = 75) reduces the spectra to be searched to 206 per unknown. Imposing restriction 2 alone yields 56, and imposing both 1 and 2, an average of only about 10. Since in many situations the molecular weight is unknown, this study has focused on the case where restriction 2 cannot be imposed.

Note that restriction 1 applies only to the library spectrum, and not vice versa; e.g., the base peak of the unknown is not required to be a major peak in the library spectrum. If the reverse condition is imposed, additional information must be stored giving the masses of all peaks with intensities > Y% for each library spectrum. Earlier (11), it was found that adding the reciprocal condition contributes little to searching effectiveness.

To computationally exploit restriction 1 in the search, the library was first sorted by base peak. A correspondence table was established giving the starting positions in the file for each value of base peak. As an example, consider the "unknown" 1-octyne (18). Three peaks have an intensity >75% base peak: masses 41, 43, and 81. Restriction 1 requires that only library spectra with base peaks at these masses be considered. From the correspondence table, only library spectra at positions 390-610, 670-1195, and 2870-2932 need be examined. With this technique, searches can routinely be made in less than one second (IBM 360/44).

To achieve these rapid search speeds, in each comparison a 4-bit code grouping of the library is placed adjacent to the corresponding 4-bit code of the unknown in one byte. The
Table I. Search Results Encoding Only the Mass of One Peak Every 14 amu
% of Unknowns Identified with Confusion ≤ C0

Confusion, C0   p = 0   p = 1   p = 2   p = ∞   p = 1 + % base   p = 3 + log   p = 2 (a)   Complete
0               64.0    70.4    70.4    60.8    72.8             77.6          77.6        83.2
1               74.4    80.0    82.4    68.8    83.2             85.6          84.8        90.4
2               82.4    82.4    87.2    74.4    85.6             89.6          89.6        93.6
3               85.6    85.6    88.8    78.4    89.6             91.2          92.0        94.4
5               86.4    88.8    91.2    87.2    92.8             93.6          95.2        94.4
10              91.2    92.0    94.4    88.8    94.4             94.4          95.2        96.8

(a) Upper mass peak encoded when two most intense peaks in a window have intensity ratio < 1.25.
resulting 8-bit pattern is examined using a table look-up procedure, and the value found in the table increments the disagreement criterion. At the conclusion of the search, the ten "best fit" compounds are displayed. Since it is simple to decode the 4-bit mass grouping, a detailed side-by-side comparison of the relevant peak information used in the search can be given. This permits the user to more readily determine which of the ten spectra is the correct answer.

MATCHING RESULTS USING ONE PEAK/14 amu
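Restriction 1 and the base-peak ordering of the library, described under Search Conditions and Techniques, can be illustrated with a short sketch. It is not the original implementation: instead of a stored correspondence table of starting positions, it uses binary search (`bisect`) on the sorted base-peak masses, which yields the same file ranges. The toy library and the intensities assigned to the unknown are hypothetical; only the masses 41, 43, and 81 come from the 1-octyne example in the text.

```python
from bisect import bisect_left, bisect_right

def build_index(library):
    """Sort (base_peak_mass, name) entries by base peak.

    Returns the sorted file and the parallel list of base-peak masses
    used for locating the block belonging to each mass.
    """
    ordered = sorted(library)
    masses = [mass for mass, _ in ordered]
    return ordered, masses

def candidate_ranges(masses, unknown_peaks, threshold=75.0):
    """Return (mass, start, stop) file slices allowed by restriction 1.

    unknown_peaks maps mass -> intensity in % base peak. Only library
    entries whose base peak coincides with an unknown peak stronger
    than the threshold are kept.
    """
    ranges = []
    strong = sorted(m for m, inten in unknown_peaks.items() if inten > threshold)
    for mass in strong:
        start = bisect_left(masses, mass)
        stop = bisect_right(masses, mass)
        if start < stop:
            ranges.append((mass, start, stop))
    return ranges

# Hypothetical library entries and unknown intensities.
toy_library = [(43, "b"), (81, "e"), (41, "a"), (57, "d"), (43, "c")]
unknown = {41: 100.0, 43: 80.0, 55: 40.0, 81: 90.0}

ordered, masses = build_index(toy_library)
for mass, start, stop in candidate_ranges(masses, unknown):
    print(mass, [name for _, name in ordered[start:stop]])
# 41 ['a']
# 43 ['b', 'c']
# 81 ['e']
```

Only the entries inside the returned slices need to be compared with the unknown, which is what keeps the search time short.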
At the heart of any search procedure is the algorithm used to measure "goodness of fit." In this study, several disagreement criteria were used. One class of algorithm minimizes a weighted function of disagreements minus agreements. Let $m_i^{u}$ and $m_i^{L}$ denote the masses in the ith window of the encoded peak in the unknown and in the library, respectively ("0" denotes no peak coded). One disagreement criterion is of the form:

$$C = \sum_{i=1}^{N} \delta_i \tag{1}$$

$$\delta_i = 0 \quad \text{if } m_i^{u} = m_i^{L} = 0 \text{ (both have no peaks)} \tag{2}$$

$$\delta_i = +1 \quad \text{if } m_i^{u} \neq m_i^{L} \text{ and either or both are non-zero} \tag{3}$$

$$\delta_i = -p_i \quad \text{if } m_i^{u} = m_i^{L} \text{ and both are non-zero} \tag{4}$$
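As an illustration, the disagreement criterion can be evaluated either directly, window by window, or through a 256-entry table indexed by the byte formed from a library code and an unknown code, in the spirit of the table look-up procedure mentioned earlier. The sketch below assumes a constant weight p for all windows and places the library code in the high four bits of the byte; that bit order, the function names, and the example codes are assumptions made for illustration.

```python
def disagreement(unknown, library, p=2.0):
    """Evaluate C directly from two equal-length lists of window codes."""
    total = 0.0
    for u, l in zip(unknown, library):
        if u == 0 and l == 0:
            continue                  # neither spectrum has a peak: no effect
        total += -p if u == l else 1.0
    return total

def build_lookup(p=2.0):
    """Precompute the contribution of every (library, unknown) code pair."""
    table = [0.0] * 256
    for byte in range(256):
        l, u = byte >> 4, byte & 0x0F
        if l == 0 and u == 0:
            table[byte] = 0.0
        elif l == u:
            table[byte] = -p          # coded peaks agree
        else:
            table[byte] = 1.0         # coded peaks disagree
    return table

def disagreement_lookup(unknown, library, table):
    """Evaluate C with one table access per window."""
    return sum(table[(l << 4) | u] for u, l in zip(unknown, library))

# Hypothetical four-window codes.
unknown_codes = [0, 8, 10, 3]
library_codes = [0, 8, 10, 0]
table = build_lookup(p=2.0)
print(disagreement(unknown_codes, library_codes, p=2.0))          # -3.0
print(disagreement_lookup(unknown_codes, library_codes, table))   # -3.0
```

In the example, two windows agree (contributing -2 each), one disagrees (+1), and one is empty in both spectra, so C = -3. The table form trades a small fixed amount of memory for the removal of all branching from the inner loop.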
The summation index i denotes the mass window and runs to the maximum number of windows examined, N. The "best" library compounds are those which minimize C. When neither spectrum has a peak in a window, there is no effect; when peaks do not agree, +1 is added to C; and when they agree, $p_i$ is subtracted from C. This criterion is analogous to the criterion XOR − p(AND) used in earlier work with one-bit coding (11).

All unknowns were searched using restriction 1 for the mass range 34-229. The effect of the weighting factor p on these identifications is summarized in Table I. The results are expressed in terms of the % of unknowns identified with a "confusion" less than a certain value. Confusion is defined here as the number of "different" compounds found (exclusive of the correct answer) which have a value of the disagreement criterion which is as good as or better than the correct answer. (For purposes of defining confusion here, isomers yielding similar mass spectra are considered to be the same compound.) A confusion of zero implies that the library best fit to the unknown is unique, and is the correct compound. The ideal search system would always yield a confusion of zero.

For p the same for all windows, the best identifications result for about p = 2. Noticeably poorer is p = ∞, where only the coincidences between coded peaks are maximized and
disagreements are not considered in any way. Equivalent results were found earlier using one-bit coded spectra (11), and maximizing coincidences in the top N peaks (9). For p = 2, with restriction 1 imposed (Y = 75), if three different answers are considered, the correct compound is one of these in 109 of the 125 cases (87%). If, in addition, the molecular weight of the unknown is presumed known, adding restriction 2 improves the above figure to 115 of 125 compounds (92%).

The column labeled "complete" summarizes the matching results using as the disagreement criterion the sum of the absolute differences in intensity levels for all peaks in the spectrum with each intensity expressed as 32 bits. Therefore, these results represent comparable performance using about 100 times more storage. Note that of the seven spectra poorly identified here (confusion >5), four failed to satisfy restriction 1 with Y = 75, and the remaining three did produce identifications similar to the correct answer but were counted as "different" compounds. For example, the unknown, benzil (1,2-diphenyl-1,2-ethanedione), gave as "best" answers: 1-phenyl-1,2-propanedione, benzoyl chloride, methyl benzoate, acetophenone. This illustrates several potential dangers in accepting these results in a strictly quantitative sense: (a) preconditions will affect the results, and (b) the definition of "different" compounds will generally depend upon the application.

Since all peak intensities of the unknown are known, using a weighting p which increases with intensity seems logical. A factor, $p = 3 + \log_{10}(\text{intensity})$ ($1 \le p \le 5$), appears to improve the results somewhat, particularly for unknowns found uniquely (confusion = 0). A non-constant weighting slows the search, but since search times are still of the order of seconds, this is probably not significant in most cases. These searches were repeated coding the higher mass when the two most intense peaks in a window had an intensity ratio