will allow each scan to be associated with the total ion current observable at the time the scan was obtained. Further, it will allow the intensities observed during a scan to be adjusted for the change in total ion current taking place during the course of the scan itself (14). Another benefit of this interface will be the ability of the computer, under program control, to select the appropriate intensity channel and threshold value.
(14) C. H. Sederholm, IBM Scientific Center, Palo Alto, Calif.,
private communication, 1970.
ACKNOWLEDGMENT
The authors are grateful to Jack Harten for technical assistance.

RECEIVED for review March 15, 1970. Accepted August 5, 1970. This investigation was supported in part by research grants from NIAMD-NIH (AM 12434), DRRF-NIH (RR 00480) and (FR-05656) of the PHS as well as National Science Foundation (GB 7856 and GU 2293). It is published with the approval of the Director of the Michigan Agricultural Experimental Station as Journal Article 5060.
Compound Identification by Computer Matching of Low Resolution Mass Spectra
B. A. Knock, I. C. Smith, D. E. Wright, and R. G. Ridley
Mass Spectrometry Data Center, AWRE, Aldermaston, Berks, England
William Kelly
Unilever Research Laboratory, Colworth/Welwyn, Colworth House, Sharnbrook, Bedford, England
Computer programs have been written to identify an unknown compound by matching its mass spectrum against a library file of 8000 standard spectra. Various methods of matching have been compared and extensively tested. Peak intensity values are not used directly; only the relative order of intensities of a few of the total number of peaks is needed. Successful identifications have been made even when there were variations in the spectra due to instrumental and other factors. An essential oil, geranium oil, was analyzed by GC-MS and most of the components were identified by computer matching.
A LOW RESOLUTION mass spectrum is to an increasing extent the first or only piece of structural information available on an unknown compound. Complex mixtures after analysis by combined GC-MS present the chemist with large numbers of unknown spectra. Any rapid method of identification from the low resolution mass spectrum alone is thus of considerable assistance to the analyst. The computer analysis of the spectrum, from first principles (1), demands some a priori knowledge of the compound. This information is not always available, and as yet only compounds of relatively simple structure can be dealt with. This paper deals with the development and testing of computer programs which are designed to match unknown spectra against a library file of standard spectra and to print out a list of compounds according to the degree of matching. It is well known that the different operating characteristics of mass spectrometers (2), variations in inlet systems and in sample handling prior to ionization (3), can affect the “cracking pattern” of a compound. Any matching system must be able to cope with these spectral variations. Various methods of matching have been described (4-7), although no extensive testing has been reported. Peak intensity data are an important parameter in the calculation of degree of matching in these methods. The present paper shows that matching routines which compare mass peaks arranged in order of intensity are effective.

(1) J. Lederberg et al., J. Amer. Chem. Soc., 91, 2977 (1969).
(2) V. J. Caldecourt, J. Appl. Spectrosc., 12, 167 (1958).
(3) See, e.g., G. Spiteller and M. Spiteller-Friedmann, Ann. Chem., 690, 1 (1965).

MATCHING PROCEDURES
Four methods were tested against a range of compound types. For a considerable time, the mass spectroscopist has found printed indexes to mass spectra, ordered on the six or ten strongest peaks (8, 9), extremely valuable for manual compound identification. As it is unusual to have an exactly matching order, the user of tabulations places more emphasis on the m/e values than on their order. Method 1 follows this approach. The m/e values of the n largest intensities are compared, irrespective of order, and the number of agreements A is noted. The “degree of matching” is then given by

$$P_1 = \frac{A}{n} \qquad (1)$$
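Method 1 can be sketched in a few lines of Python. This is only an illustration: the dictionary representation of a spectrum, the function names, and the tie-breaking rule (lower m/e first when intensities are equal) are assumptions, not details of the original program. The two spectra are the Cassione peaks listed for instruments 1 and 5 in Table I.

```python
from typing import Dict, List

Spectrum = Dict[int, float]  # m/e value -> relative intensity


def strongest_peaks(spectrum: Spectrum, n: int) -> List[int]:
    """Return the m/e values of the n most intense peaks, strongest first."""
    # Assumed tie-break: equal intensities are ordered by increasing m/e.
    ranked = sorted(spectrum.items(), key=lambda peak: (-peak[1], peak[0]))
    return [mz for mz, _ in ranked[:n]]


def match_method_1(unknown: Spectrum, library: Spectrum, n: int = 10) -> float:
    """P1 = A / n, where A counts m/e values shared by the two sets of n strongest peaks."""
    shared = set(strongest_peaks(unknown, n)) & set(strongest_peaks(library, n))
    return len(shared) / n


# Ten strongest peaks of Cassione from instruments 1 and 5 of Table I.
instrument_1 = {135: 100, 192: 50, 149: 19, 119: 15, 43: 13,
                91: 10, 77: 8, 136: 8, 51: 5, 65: 5}
instrument_5 = {135: 100, 192: 49, 43: 38, 119: 23, 149: 22,
                77: 16, 91: 16, 51: 14, 65: 10, 39: 10}

p1 = match_method_1(instrument_1, instrument_5, n=10)
print(p1)
```

Nine of the ten m/e values agree (only 136 and 39 differ), so the degree of matching is 0.9 even though the intensity order of the two spectra is different.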
(4) S. Abrahamsson et al., presented at the 14th Annual Conference on Mass Spectrometry and Allied Topics, Dallas, Texas, May 1966. See also Chimia, 20, 354 (1966); Sci. Tools, 14 (3), 29 (1967).
(5) (a) C. Cone, K. Biemann et al., presented at the 15th Annual Conference on Mass Spectrometry and Allied Topics, Denver, Colo., May 1967; (b) R. A. Hites and K. Biemann, “Advances in Mass Spectrometry,” Vol. 4, E. Kendrick, Ed., Institute of Petroleum, London, 1968, p 37.
(6) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1464 (1968).
(7) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 293 (1967).
(8) “Index of Mass Spectral Data AMD 11,” ASTM, Philadelphia, Pa., 1969.
(9) “Compilation of Mass Spectral Data,” A. Cornu and R. Massot, Heyden and Son Ltd., London, 1966.
ANALYTICAL CHEMISTRY, VOL. 42, NO. 13, NOVEMBER 1970
Table I. Ten Strongest Peaks of Cassione in Spectra from Twelve Different Mass Spectrometers
m/e (relative intensity), strongest peak first

Instrument(a)
1    135 (100), 192 (50), 149 (19), 119 (15), 43 (13), 91 (10), 77 (8), 136 (8), 51 (5), 65 (5)
2    135 (100), 192 (68), 43 (39), 119 (25), 149 (21), 17 (16), 42 (16), 51 (13), 65 (13), 39 (10)
3    135 (100), 192 (48), 43 (32), 149 (25), 149 (23), 77 (21), 91 (16), 51 (14), 65 (10), 136 (12)
4    135 (100), 192 (53), 43 (44), 119 (23), 119 (23), 91 (21), 77 (17), 77 (15), 91 (13), 136 (11)
5    135 (100), 192 (49), 43 (38), 119 (23), 149 (22), 77 (16), 91 (16), 51 (14), 65 (10), 39 (10)

(a) Instrument No. and make: 8, 11, 12, AEI-MS2; 1, 3, 5, 6, 9, AEI-MS9; 2, 10, AEI-MS12; 4, LKB 9000; 7, 60° SECTOR-D-F.M.S.
The second method uses the n strongest peaks of the spectrum in order of decreasing intensity and makes an allowance for the relative position of peaks with equal m/e. If they are not at the same position, the contribution to $P_2$ is reduced proportionately to the difference in position. The degree of matching is given by
where A is the number of agreements irrespective of order, and i and j are the positions in the respective sets of the kth pair of equal m/e values. The above methods take no account of possible mass discrimination effects due to type of instrument. The two following methods were designed to overcome these effects. The spectrum is divided into R equal ranges, each containing m mass units. Within each range the n most intense peaks are selected in order of decreasing intensity. In method 3 the number of agreements $A_r$ in m/e values, irrespective of order, is counted in each range r. The matching factor for each range is calculated as in method 1 and then averaged over the R ranges, to give

$$P_3 = \frac{1}{R}\sum_{r=1}^{R}\frac{A_r}{n} \qquad (3)$$
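The positional allowance of method 2 and the range-wise averaging of method 3 can be illustrated with a short Python sketch. Several choices here are assumptions made only for illustration: the function names, the dictionary representation, the linear penalty 1 - |i - j|/n (the text states only that the contribution is reduced in proportion to the difference in position), the range boundaries (m/e // m), averaging over occupied ranges only, and the divisor used when a range holds fewer than n peaks.

```python
from typing import Dict, List

Spectrum = Dict[int, float]  # m/e value -> intensity


def ranked(peaks: List[int], spectrum: Spectrum, n: int) -> List[int]:
    """The n most intense of the given m/e values, strongest first."""
    return sorted(peaks, key=lambda mz: (-spectrum[mz], mz))[:n]


def match_method_2(unknown: Spectrum, library: Spectrum, n: int) -> float:
    """Agreements weighted by an ASSUMED linear penalty 1 - |i - j| / n."""
    u = ranked(list(unknown), unknown, n)
    lib = ranked(list(library), library, n)
    total = 0.0
    for i, mz in enumerate(u):
        if mz in lib:
            j = lib.index(mz)              # position of the same m/e in the other set
            total += 1.0 - abs(i - j) / n
    return total / n


def by_range(spectrum: Spectrum, n: int, m: int) -> Dict[int, List[int]]:
    """Top n m/e values in each range of m mass units (ASSUMED bounds: m/e // m)."""
    groups: Dict[int, List[int]] = {}
    for mz in spectrum:
        groups.setdefault(mz // m, []).append(mz)
    return {r: ranked(peaks, spectrum, n) for r, peaks in groups.items()}


def match_method_3(unknown: Spectrum, library: Spectrum, n: int, m: int) -> float:
    """Average of A_r / n over the ranges occupied in either spectrum."""
    u, lib = by_range(unknown, n, m), by_range(library, n, m)
    ranges = set(u) | set(lib)
    if not ranges:
        return 0.0
    total = 0.0
    for r in ranges:
        u_r, lib_r = u.get(r, []), lib.get(r, [])
        agreements = len(set(u_r) & set(lib_r))        # A_r
        divisor = min(n, max(len(u_r), len(lib_r)))    # ASSUMED handling of sparse ranges
        total += agreements / divisor
    return total / len(ranges)


# Five strongest Cassione peaks from instruments 1 and 5 of Table I.
a = {135: 100, 192: 50, 149: 19, 119: 15, 43: 13}
b = {135: 100, 192: 49, 43: 38, 119: 23, 149: 22}
p2 = match_method_2(a, b, n=5)
p3 = match_method_3(a, b, n=2, m=20)
print(p2, p3)
```

With these five peaks, m/e 149 and 43 each sit two positions apart in the two spectra, so under the assumed penalty each contributes 0.6 instead of 1. Because every peak falls in a different 20-unit range, the range-wise score is unaffected by the change in order.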
Method 4 treats the n strongest peaks within each range in a similar manner to method 2 and the degree of matching is given by

(4)

The mass peaks obtained in methods 3 and 4 when n = 2 and m = 14 are similar to those in the “abbreviated” mass spectrum used in the matching routines in (5). For each method, the ten library compounds with the highest values of P are ranked in decreasing order of P and printed out with index number, molecular weight, molecular formula, and name. Only spectra of compounds in a selected molecular weight range are matched. Modifications are made to all four formulas to deal with the case where there is an insufficient number of peaks n in the library spectrum, unknown spectrum, or particular range of either.

Table I (continued)
Instrument
6    135 (100), 192 (51), 43 (33), 119 (29), 149 (22), 91 (21), 77 (20), 51 (14), 65 (13), 191 (12)
7    135 (100), 192 (67), 149 (20), 119 (20), 43 (13), 91 (11), 136 (10), 77 (8), 51 (5), 65 (5)
8    135 (100), 192 (47), 43 (39), 119 (25), 149 (21), 91 (19), 77 (18), 51 (15), 65 (12), 39 (10)
9    135 (100), 44 (79), 192 (50), 119 (28), 149 (27), 43 (25), 77 (19), 51 (16), 91 (14), 39 (12)
10   135 (100), 192 (56), 43 (26), 149 (21), 119 (19), 91 (14), 77 (13), 51 (11), 136 (9), 65 (9)
11   135 (100), 192 (54), 43 (25), 119 (23), 149 (21), 91 (15), 77 (14), 51 (10), 136 (9), 65 (8)
12   ..., 51 (20), 149 ...

Computer Program. The program was written originally for the IBM 7030 computer and coded 99% in the S2 dialect (10) of FORTRAN and 1% in STRAP machine language. FORTRAN IV versions for the IBM 7030 and the IBM 360 have since been developed. The latter takes advantage of Type 2314 disk storage units for the libraries. Enquiries regarding the availability of these programs should be made to the Data Centre.

Computer Times Required. Apart from the choice of method of file organization and access, computer search times are determined by several factors: the number of unknowns being matched at one time, the range of molecular weights being searched for each unknown, the number of unknowns that are being matched against the same library compound, and whether or not a filtering process is used to reduce the amount of detailed comparison. One filter which is being used at present is the criterion that the strongest peak in the unknown must be one of the six strongest peaks in the library spectrum before further comparison may be made. Use of this filter can reduce search times by a factor of 3. By using a direct access method of disk file handling on an IBM 360/50, computer search times can vary between 3 and 30 seconds for each unknown, depending on the particular combination of the above factors.

TESTING OF THE MATCHING ROUTINES

Effect of Instrument Variability. Initially, to examine the effects of different instruments and conditions, three compounds were run on twelve instruments of four different types. The compounds were Cassione [4-(3',4'-methylenedioxyphenyl)-butan-2-one], citronellol (3,7-dimethyl-2-octenol), and methyl 3-(methylthio)propionate. It would be impractical to reproduce all the spectra here. Table I,
(10) Mary U. Thomas, AWRE Report No. 059167, 91 pp (1968).
Table II. Mass Spectra of Farnesol
Instrument: 1, Bendix Time-of-flight; 2, AEI MS9; 3, AEI MS2; 4, CEC 21-108
m/e and I, instrument 1: 27 6.7 29 17.3 39 9.9 41 79.7 43 17.1 44 9.4 51 53 10.5 55 18.7 67 18.3 68 12.8 69 100.0 70 7.6 77 5.4 79 9.6 34.6 81 7.5 91 24.2 93 94 95 11.7 4.7 105 10.1 107 9.3 109 5.3 119 6.9 121 133 135 161 3.0 189 191 0.8 203 204 205 206 1.3 222
I, instruments 2 and 3: 9.2 9.3 11.2 9.8 11.4 12.4 55.0 65.5 26.3 14.4 1.7 2.7 2.6 9.6 11.2 14.7 23.8 11.6 18.1 12.2 8.1 100.0 100.0 1.2 6.9 3.0 11.1 7.2 21.0 21.8 23.1 5.6 15.5 20.5 61.2 4.0 10.1 8.3 8.7 3.6 11.0 8.0 20.2 7.7 10.8 5.0 15.7 5.9 9.8 16.8 13.2 2.9 11.2 1.0 2.6 1.1 2.3
I, instrument 4: 31.2 26.3 31.6 100.0 34.1 11.2 12.5 31.5 49.1 21.4 10.5 82.7 18.5 18.8 23.6 20.9 18.7 59.6 10.7 8.7 12.7 17.6 10.2 18.4 11.3 10.2 2.5 10.8 1.7 0.3
    0.8
    0.4
Table III. Typical Computer Output from Identification of Farnesol
Method 4

Mol wt   Formula    Compound                   Degree of match
222      C15H26O    Farnesol                   0.69
222      C15H26O    Driminol                   0.62
222      C15H26O    Cedrol                     0.58
222      C15H26O    Ledol                      0.56
222      C15H26O    Patchouli alcohol          0.49
220      C15H24O    Santalol                   0.47
222      C15H26O    Nerolidol                  0.41
222      C15H26O    Eudesmol                   0.44
204      C15H24     β-Selinene                 0.40
222      C15H26O    α-Caryophyllene alcohol    0.38
206      C15H26     Ledane                     0.38
showing the masses and intensities of the ten strongest peaks for each spectrum of Cassione, a relatively stable compound, illustrates the variation in “cracking pattern” observed. For example, m/e 43 ranges from the second strongest to the sixth strongest peak and varies in relative intensity from 12.8 to 50.6%. When this study began, a complete file of spectra was not available on magnetic tape, so for these trials a special library was formed from the spectra of as many compounds as possible with the same molecular weight as the compound of interest. The respective numbers of isobaric compounds were 19, 53, and 83. Each spectrum was in turn used as a library spectrum and the others were run as “unknowns.”
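The search procedure described in the preceding sections (a restriction to a selected molecular weight range, the strongest-peak filter, and a ranked list of the ten best values of P) can be sketched as follows. This is not the FORTRAN program itself: the names `LibraryEntry`, `passes_filter`, `shared_fraction`, and `search`, and the default window of 3 mass units, are hypothetical choices for illustration, and a simple method 1 style score stands in for whichever matching method is used.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

Spectrum = Dict[int, float]  # m/e value -> intensity


class LibraryEntry(NamedTuple):
    name: str
    mol_wt: int
    spectrum: Spectrum


def strongest_peaks(spectrum: Spectrum, n: int) -> List[int]:
    """m/e values of the n most intense peaks, strongest first."""
    return sorted(spectrum, key=lambda mz: (-spectrum[mz], mz))[:n]


def passes_filter(unknown: Spectrum, library: Spectrum, k: int = 6) -> bool:
    """The strongest unknown peak must be among the k strongest library peaks."""
    if not unknown:
        return False
    return strongest_peaks(unknown, 1)[0] in strongest_peaks(library, k)


def shared_fraction(unknown: Spectrum, library: Spectrum, n: int = 10) -> float:
    """Stand-in score: fraction of the n strongest m/e values that agree."""
    shared = set(strongest_peaks(unknown, n)) & set(strongest_peaks(library, n))
    return len(shared) / n


def search(
    unknown: Spectrum,
    unknown_mw: int,
    entries: List[LibraryEntry],
    score: Callable[[Spectrum, Spectrum], float] = shared_fraction,
    mw_window: int = 3,   # hypothetical default for the molecular weight range
    top: int = 10,
) -> List[Tuple[float, str]]:
    """Return up to `top` (P, name) pairs in decreasing order of P."""
    hits: List[Tuple[float, str]] = []
    for entry in entries:
        if abs(entry.mol_wt - unknown_mw) > mw_window:
            continue  # outside the selected molecular weight range
        if not passes_filter(unknown, entry.spectrum):
            continue  # rejected before any detailed comparison
        hits.append((score(unknown, entry.spectrum), entry.name))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:top]
```

The filter is applied before the scoring function, which is where the saving in search time would come from: entries that fail it never reach the detailed comparison.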
Table IV. Summary of Results from Identifications of Farnesol
Position of the Library Farnesol Spectrum in the Output List

                  Unknown(a)
Library       1     2     3     4     Av
Method 1
   1          -     2     1     6     3.0
   2          1     -     7     6     4.7
   3          8     3     -     6     5.7
   4          4     2     3     -     3.0
Method 2
   1          -     1     2     3     2.0
   2          1     -     4     3     2.7
   3          2     6     -     5     4.3
   4          4     2     3     -     3.0
Method 3
   1          -     1     1     1     1.0
   2          1     -     1     1     1.0
   3          4     1     -     1     2.0
   4          1     1     1     -     1.0
Method 4
   1          -     1     1     1     1.0
   2          1     -     1     1     1.0
   3          1     1     -     1     1.0
   4          2     1     1     -     1.3

(a) The unknown and library numbers 1-4 refer to spectra from instrument numbers 1-4 in Table II.
For methods 1 and 3, the correct compound was retrieved in the first position in 93% of the trials. For methods 2 and 4, every trial was successful.

Farnesol. To pursue further the examination of the variation between spectra, it was decided to consider the case of farnesol, where Reed (11) had drawn attention to the extreme variations in the duplicate “standard” spectra taken on different instruments (Table II). One farnesol spectrum was added to a library of 19 other sesquiterpene alcohols and 470 spectra of widely differing compound types, all having molecular weights in the range 200-240. Each of the other farnesols was then run against this library, using all four methods, with n = 15 for methods 1 and 2 and n = 3, m = 46 for methods 3 and 4. The values were chosen so that approximately the same total number of peaks were compared in each method. The process was repeated for the other three farnesol spectra. A typical result is shown in Table III. The results are summarized in Table IV, which shows the position of the “library” farnesol spectrum in the output list. For these spectra, which are visually quite different, methods 3 and 4 gave an excellent retrieval whereas the other two methods were markedly less successful.

Although no attempt was made to optimize these results, a few other cases were examined. For methods 1 and 2, a greatly improved performance was obtained, compared with Table IV, by using n = 5 or n = 6, whereas the results of n = 10 were similar. For methods 3 and 4, n = 2, m = 14 gave much the same results but n = 3, m = 20 was not so satisfactory. It appears that in particular cases, the performance of a method can be improved by optimization. Although spectra of very different compound types were being examined, in nearly every case the sesquiterpene alcohol spectra were clearly separated from the nonterpene spectra.

(11) R. I. Reed, in “Mass Spectrometry,” R. I. Reed, Ed., Academic Press, London and New York, 1965, p 401.

Terpenes. To further extend the study of terpene identification, a file of 250 terpenes was compiled from the standard collections and from the literature. This file was used to analyze an essential oil, geranium oil, by the use of a GC-MS combination (a Pye 104 attached to an AEI-MS12). Although previous results suggested that all four methods gave approximately equal success, method 4 alone was used in this part of the study as it simplified the manual extraction of data from the spectra. The three strongest m/e values in order of intensity were noted in every interval of 20 mass units (n = 3, m = 20), directly from the chart. Figure 1 shows the result of this analysis. Over 40 unknown spectra were matched against the computer file and in 30 cases the correct structure appeared in the output listing. In 24 of these, it occupied 1st (or 1st equal) position, 3 were 2nd, 2 were 3rd, and 1 was 4th. The identifications were confirmed by detailed visual comparison of the spectra, chromatographic retention data, and prior knowledge of the components of the oil. The remainder of the spectra were unidentified either because the standard spectra were not on the library file, or because only spectra of mixtures were obtained because of unresolved GC peaks. In the case of the unresolved peaks 39 and 44, spectra taken at different points during the elution of the peak were sufficiently characteristic for successful identification of the unresolved components. The reasons for the correct compound appearing at 2nd, 3rd, or 4th position, rather than 1st, were varied. They included the close spectral similarity of isomers, such as isomenthone preceding menthone in peak 27, normal spectral differences, or the presence of impurities. These impurities may be either in the library spectra or the unknown.
In the latter case, they would arise from incompletely resolved peaks. It is worth noting that none of the terpene spectra in the library file had been obtained on the spectrometer used in this GC-MS analysis. The majority of spectra on the file had also been obtained by batch handling methods. Thus, considering the many points of similarity between terpene spectra and the differences possible through the use of many different types of instruments, this is an excellent result.

Figure 1. Chromatogram and identification of components of geranium oil
GC conditions: Sample 2 µl. 50 ft × 4 mm i.d. column of 10% Carbowax 20 M on 80-100 mesh celite. Isothermal at 65 °C for 60 min, then programmed at 0.5°/min to 200 °C. Total chromatograph time, 7 hours

(12) American Petroleum Institute Research Project 44, Catalog of “Selected Mass Spectral Data.”
(13) Thermodynamics Research Center Data Project, Catalog of “Selected Mass Spectral Data” (formerly the Manufacturing Chemists Association Research Project).
(14) Uncertified Mass Spectra, Subcommittee IV, ASTM Committee E-14.
(15) Dow Chemical Co., Uncertified Mass Spectral Data, distributed through ASTM Committee E-14.
(16) MSDC Mass Spectral Data, Mass Spectrometry Data Centre, AWRE, Aldermaston.

Hydrocarbons. The Data Centre, in cooperation with Dr. Boettger, Jet Propulsion Laboratory, Professor Biemann, Massachusetts Institute of Technology, and Dr. Faust, Imperial Chemical Industries, Dyestuffs Division, assembled 8000 spectra (12-16) on magnetic tape. With these spectral data, the study was extended to a much wider range of compound type and molecular weight. Many compounds on the tape are represented by at least two spectra. In general these spectra have been produced on different types of instruments and under various conditions. They can thus be used to provide a realistic test of the matching methods. One hundred seventy-four hydrocarbons were selected. These cover a wide range of structural types. One spectrum of each compound was run as the unknown using methods 2 and 4, the former with n = 20 and the latter with n = 3, m = 20. Arbitrarily, each compound was matched with those compounds within ± 3 amu of the molecular weight of the unknown. Table V shows a case where positional isomers with closely similar mass spectra were retrieved together. Table VI shows an
example where duplicates were retrieved successfully, followed by branched-chain isomers. In many other cases, spectra were sufficiently similar to rank such isomers above the duplicate. Table VII summarizes these results for both methods. There is very little to choose between the two methods, both of which show a high degree of success. Several variations were tried, e.g., method 2, n = 10; method 4, n = 2, m = 14 and n = 6, m = 40, with no significant change in the results.

Table V. Computer Output for Identification of 1,3-Dimethylbenzene Dow 308
Method 4, n = 3, m = 20, 205 spectra in library, Mol wt 103-109

Ref No.     Mol wt   P      Compound               Formula
DOW 308     106      1.00   1,3-Dimethylbenzene    C8H10
DOW 310     106      0.96   1,4-Dimethylbenzene    C8H10
DOW 311     106      0.94   1,2-Dimethylbenzene    C8H10
API 254     106      0.88   1,3-Dimethylbenzene    C8H10
API 253     106      0.87   1,2-Dimethylbenzene    C8H10
API 178     106      0.85   1,2-Dimethylbenzene    C8H10
API 179     106      0.85   1,3-Dimethylbenzene    C8H10
API 180     106      0.85   1,4-Dimethylbenzene    C8H10
API 255     106      0.81   1,4-Dimethylbenzene    C8H10
MCA 4       106      0.81   1,3-Dimethylbenzene    C8H10

Table VI. Computer Output for Identification of n-Dodecane API 404
Method 2, n = 20, 197 spectra in library, Mol wt 167-173

Ref No.     Mol wt   P      Compound                         Formula
API 404     170      1.00   n-Dodecane                       C12H26
API 23      170      0.87   n-Dodecane                       C12H26
API 981     170      0.83   n-Dodecane                       C12H26
API 1028    170      0.82   n-Dodecane                       C12H26
API 1598    170      0.80   n-Dodecane                       C12H26
AST 2012    170      0.78   2-Methyl undecane                C12H26
AST 2013    170      0.75   4-Methyl undecane                C12H26
AST 2011    170      0.74   2,5-Dimethyldecane               C12H26
API 1944    170      0.72   2,5-Dimethyldecane               C12H26
API 405     170      0.64   2,2,4,6,6-Pentamethylheptane     C12H26

Table VII. Summary of Results for Identification of Hydrocarbons

Mol wt range                                                   56-150      151-200     201-250     251-506
No. of compounds tested                                        53          41          27          53
Average no. of library spectra matched for each unknown        270         175         85          40
Method                                                         2     4     2     4     2     4     2     4
Percentage with correct compound in 1st position               67    59    87    81    78    70    98    98
Percentage with correct compound or isomer in 1st position     98    98    100   96    100   100   100   100

Table VIII. Summary of Results for Identification of Nonhydrocarbons

Mol wt range                                                   78-130      131-190     191-350
No. of compounds tested                                        38          18          19
Average no. of library spectra matched for each unknown        250         280         120
Method                                                         2     4     2     4     2     4
Percentage with correct compound in 1st position               84    70    84    84    58    80
Percentage with correct compound or isomer in 1st position     95    84    100   100   68    84
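The first-position percentages in Table VII can be pooled over the four molecular weight ranges to give a single figure per method. The sketch below weights each percentage by the number of compounds tested in that range; this weighting is an assumption (no pooled figure for the hydrocarbons alone is reported), and because the tabulated percentages are rounded, the pooled values are approximate.

```python
from typing import Dict, List


def pooled_percentage(percentages: List[float], counts: List[int]) -> float:
    """Average of per-range percentages, weighted by compounds tested per range."""
    return sum(p * c for p, c in zip(percentages, counts)) / sum(counts)


# Table VII: compounds tested in the ranges 56-150, 151-200, 201-250, 251-506.
compounds_tested = [53, 41, 27, 53]

# Table VII: percentage with the correct compound in 1st position.
first_position: Dict[str, List[float]] = {
    "method 2": [67, 87, 78, 98],
    "method 4": [59, 81, 70, 98],
}

pooled = {
    method: round(pooled_percentage(values, compounds_tested), 1)
    for method, values in first_position.items()
}
print(pooled)
```

Under this weighting the 174 hydrocarbons give roughly 83% for method 2 and 78% for method 4, which supports the remark that there is very little to choose between the two methods.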
Nonhydrocarbons. A similar set of results was obtained for as many other compound types as possible for which a duplicate spectrum was available. Forty-two compounds contained oxygen, 6 compounds contained nitrogen, 17 compounds contained halogen, and 10 compounds contained sulfur. The overall results are summarized in Table VIII. The results are very similar to those for the hydrocarbons, but somewhat less successful. Some of the poorer results were accounted for either by the presence of a considerable number of impurity peaks in one of the spectra, or because the two spectra being compared were presented at widely different “discriminator” levels, so that the number of peaks present was very different. Again some of the possible variations were examined and found to give similar results.

CONCLUSIONS

These studies have shown that comparatively simple mathematical procedures can produce an excellent computer identification of low resolution mass spectra. No allocation of significant peaks is necessary with these methods. Over a wide range of compound types and molecular weight, and using a library of over 8000 spectra, the retrieval of duplicate compounds was 80%. Counting similar isomers as successes, the retrieval was 97%. It can be expected that where the correct compound is not contained on the computer file, but closely similar compounds are available, these will be retrieved.

Molecular weight restrictions have been used extensively in the above testing. Experience has shown that even where wide molecular weight ranges are used, successful retrieval can be achieved. However, when large files of data are being handled, considerable economies in computer time can result by restricting the number of compounds being matched as much as possible. If one is certain of the molecular weight of the unknown, then one need only try to match it against library spectra of the same molecular weight. There is the risk, however, that if the library file does not contain the desired compound, then indications as to the structure of the unknown which would be obtained by retrieval of compounds of similar structure, but different molecular weights, would not be obtained. The economics of any particular computer system will usually decide which are the overriding factors.

Variations on the basic methods used do not show very significant changes in the results, but it is still likely that in practical applications of the method, refinements can be added to improve the performance in particular areas of interest. It is possible that the weighting of the larger peaks or the peaks of highest mass to charge ratio may be advantageous. In many cases, it will be possible to utilize other chemical or spectroscopic information.

ACKNOWLEDGMENT

We are grateful to A. Brickstock and D. C. Maxwell for useful discussions.
RECEIVED for review January 26, 1970. Accepted July 13, 1970. The work was briefly presented at the 15th Annual Conference on Mass Spectrometry and Allied Topics, Denver, Colo., May 1967, and the 17th Conference, Dallas, Texas, May 1969. This study and the Mass Spectrometry Data Centre at Aldermaston are supported jointly by the Office for Scientific and Technical Information, Department of Education and Science and the Ministry of Technology.