Computer techniques for identifying low resolution mass spectra

Stanley L. Grotch
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif. 91103

A number of computer programs have been developed for identifying low resolution mass spectra through search of an extensive library file. One set of programs used as a disagreement criterion the sum of the absolute values of the differences in peak height levels when peak height was encoded to 2, 8, or $10^4$ levels at each nominal mass. Another program employed the maximum coincidence of the top N peaks. The programs were tested using 125 unknowns and the recognition performances were compared. The maximum coincidence criterion was significantly poorer in recognition performance than the other techniques, which increased in reliability as the number of levels increased. However, even the two-level system attained very high reliability. Since computer requirements and economic costs are likely to be minimal for this case, it might suffice for many applications.


In the past few years, a number of computer techniques for the identification of low resolution mass spectra have been described (1-7). This continued interest is spurred by instruments such as the gas chromatograph/mass spectrometer (GC/MS) which produce large numbers of spectra for interpretation. Since computers are already serving data handling (8) and control functions (9), spectral data are already in computer-readable format, greatly facilitating further detailed computer interpretation.

Generally, the two most desirable features of any computer identification scheme are high reliability of identification and low economic cost. In most situations, reliability of identification can be improved by increasing the complexity of the criterion used, which generally increases computer costs. For example, Knock et al. (7) found improved identification when considering not only agreements in the N most intense peaks irrespective of order, but also the ranking order in the agreement criterion. Since it has been found in many studies that even very simple criteria can produce excellent results, it is important for any potential user of these techniques to be aware of the simplest of these, as they may very well satisfy his reliability needs at minimum cost.

(1) S. Abrahamsson, G. Haggstrom, and E. Stenhagen, 14th Annual Conference on Mass Spectrometry and Allied Topics, Dallas, Texas, May 1966 [see also S. Abrahamsson, Sci. Tools, 14, 29 (1967)].
(2) C. Cone, P. Fennessey, R. Hites, N. Mancuso, and K. Biemann, 15th Annual Conference on Mass Spectrometry and Allied Topics, Denver, Colo., May 1967.
(3) R. A. Hites and K. Biemann, “Advances in Mass Spectrometry,” Vol. 4, E. Kendrick, Ed., Institute of Petroleum, London, 1968, p 37.
(4) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1464 (1968).
(5) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 293 (1967).
(6) S. L. Grotch, 18th Annual Conference on Mass Spectrometry and Allied Topics, San Francisco, Calif., June 1970.
(7) B. A. Knock, I. C. Smith, D. E. Wright, and R. G. Ridley, ANAL. CHEM., 42, 1516 (1970).
(8) C. C. Sweeley, B. D. Ray, W. I. Wood, J. F. Holland, and M. I. Krichevsky, ibid., p 1505.
(9) W. E. Reynolds, V. A. Bacon, J. C. Bridges, T. C. Coburn, B. Halpern, J. Lederberg, E. C. Levinthal, E. Steed, and R. B. Tucker, ibid., p 1122.

Workers in this field have found that very high reliability of identification (>90-95%) can be achieved through a file search with low resolution mass spectra. Much still remains to be done in optimizing search procedures for specific applications. For example, computer-related factors such as available memory size, peripherals (disk, magnetic tape), and batch processing vs. time sharing will exert a profound influence on the optimal scheme for a particular application. For this reason, it is important to have available alternative techniques which can be tailored to specific applications. It is also important to compare various schemes in detail to delineate more clearly their respective advantages and disadvantages.

In the present work several identification algorithms have been implemented as computer programs, which have been tested using low resolution mass spectral data obtained from five different sources (10-14). These spectra have been compared against a library of 6880 known spectra. This library is substantially the same as the one used by Knock et al. (7). Statistics regarding the library are presented in Table I. The results of these tests as well as the assets and liabilities of the matching techniques will be discussed.

Principles of Library Search. In the technique of library searching, an unknown spectrum vector $X$ is compared in turn against the $j$th individual library member $L_j$, generally over a specified number of channels $M$. Let $x_i$ and $l_{ij}$ represent the individual elements of the vectors $X$ and $L_j$. A criterion of agreement or disagreement $C_j$ is usually calculated as the linear function:

$$C_j = \sum_{i=1}^{M} F(x_i, l_{ij}) \qquad (1)$$

The functions $F$ represent the desired criterion for the individual vector elements $i$. The “best” spectral match is that library member yielding the minimum value of $C$ for a disagreement criterion (or the maximum value of $C$ for agreement). If, for example, a least-squares criterion is used with respect to peak height over channels $M_1$ to $M_2$:

$$C_j = \sum_{i=M_1}^{M_2} (x_i - l_{ij})^2 \qquad (2)$$
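As an illustration of Equations 1 and 2, the following minimal Python sketch scores an unknown against each library member with the least-squares criterion and returns the member with the smallest disagreement. The function names, the list-indexed-by-channel representation, and the two-member library are assumptions made for this example only; they are not taken from the programs described in this paper.

```python
from typing import Dict, Sequence, Tuple


def least_squares_disagreement(
    x: Sequence[float], l: Sequence[float], m1: int, m2: int
) -> float:
    """Equation 2: sum of squared peak-height differences over channels m1..m2."""
    return sum((x[i] - l[i]) ** 2 for i in range(m1, m2 + 1))


def best_match(
    unknown: Sequence[float],
    library: Dict[str, Sequence[float]],
    m1: int,
    m2: int,
) -> Tuple[str, float]:
    """Return the (name, C) pair of the library member minimizing C."""
    scores = {
        name: least_squares_disagreement(unknown, spectrum, m1, m2)
        for name, spectrum in library.items()
    }
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical data: list index = channel (nominal mass), value = peak height.
unknown = [0.0, 100.0, 20.0, 5.0]
library = {
    "A": [0.0, 100.0, 25.0, 0.0],
    "B": [0.0, 50.0, 100.0, 0.0],
}
result = best_match(unknown, library, 0, 3)
```

Because Equation 2 is a disagreement criterion, the sketch takes the minimum; an agreement criterion would take the maximum instead.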

In Equation 2, $x_i$ and $l_{ij}$ are the peak heights in channel $i$ (i.e., nominal mass $i$). In this case the “best” spectral fit yields the minimum value of $C$. It is not necessary that the range $M_1$ to $M_2$ be contiguous, although this is generally the case.

(10) R. M. Silverstein and G. C. Bassler, “Spectrometric Identification of Organic Compounds,” 2nd ed., Wiley, New York, N. Y., 1967.
(11) A. J. Baker, G. Eglinton, F. J. Preston, and T. Cairns, “More Spectroscopic Problems in Organic Chemistry,” Heyden & Son, Ltd., London, 1967.
(12) K. Biemann, Massachusetts Institute of Technology, Cambridge, Mass., personal communication, June 1970.
(13) H. Boettger, Jet Propulsion Laboratory, Pasadena, Calif., personal communication, Oct. 1970.
(14) Mass Spectrometry Data Centre, Aldermaston, England.

ANALYTICAL CHEMISTRY, VOL. 43, NO. 11, SEPTEMBER 1971

With Equation 1, it is also possible to consider a criterion such as the maximum number of agreements of the N most intense peaks in the spectrum. This criterion forms the basis of the manual table-lookup methods used with spectral tabulations (15, 16). Several computer versions of this technique have been reported by Knock et al. (7). This approach was also investigated in the present study. To recast Equation 1 for this criterion, let

$$x_i = 1 \quad \text{if channel } i \text{ is among the } N \text{ most intense peaks in the unknown} \qquad (3)$$

$$x_i = 0 \quad \text{otherwise} \qquad (4)$$

Similar conditions hold for the channels $l_{ij}$ in each library member $L_j$. The criterion of agreement in this case is simply:

$$C_j = \sum_{i=1}^{M} x_i l_{ij} \qquad (5)$$
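Equation 5 can be sketched in Python as follows. For 0/1 indicators the product $x_i l_{ij}$ equals the bitwise AND of the two bits, so the sum reduces to counting the set bits of the AND of two masks. The helper names and the peak heights are hypothetical and chosen only for illustration; this is not the packing scheme of the programs described here.

```python
from typing import Dict


def top_n_mask(spectrum: Dict[int, float], n: int) -> int:
    """Build an integer whose bit i is 1 when nominal mass i is among the
    n most intense peaks of the spectrum (Equations 3 and 4)."""
    top_masses = sorted(spectrum, key=spectrum.get, reverse=True)[:n]
    mask = 0
    for mass in top_masses:
        mask |= 1 << mass
    return mask


def coincidences(x_mask: int, l_mask: int) -> int:
    """Equation 5 for 0/1 indicators: AND the masks, then count the 1 bits."""
    return bin(x_mask & l_mask).count("1")


# Hypothetical peak heights (mass -> percent of base peak).
unknown = {43: 100.0, 55: 60.0, 41: 50.0, 70: 45.0, 39: 30.0, 42: 20.0}
member = {43: 100.0, 70: 70.0, 55: 50.0, 42: 40.0, 61: 30.0, 41: 10.0}

# N = 5 for both spectra; masses 43, 55, and 70 are common to both top-5 sets.
n_agree = coincidences(top_n_mask(unknown, 5), top_n_mask(member, 5))
```

Using one integer per spectrum means a single AND compares all channels at once, which is the kind of parallelism a word-packed implementation can exploit.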

Since Equation 5 is an agreement criterion, the “best” spectra are those maximizing $C_j$. It should be noted that the product criterion in Equation 5 is equivalent to the logical “AND” process. This fact is exploited in the computer program described later.

Algorithms Investigated. In the present study two simple criteria have been investigated. The first of these is that mentioned earlier (Equation 5): given the masses corresponding to the $N_1$ most intense peaks in an unknown spectrum, and the masses corresponding to the $N_2$ most intense peaks in the library ($N_1$ not necessarily equal to $N_2$), find those library members which maximize the number of coincidences in mass. By utilizing the techniques described below, extremely rapid search speeds can be achieved with this criterion. The second criterion is the minimization of the sum of the absolute differences in levels between the unknown and the library:

$$C_j = \sum_{i=1}^{M} |x_i - l_{ij}| \qquad (6)$$

For this second criterion, both the unknown and library peak heights are assumed known to only $K$ levels; i.e., in each of the $M$ channels considered, $x_i, l_{ij} = 0, 1, 2, 3, 4, \ldots, K-1$. Equivalently, the peak height at each mass is represented by $\log_2 K$ bits. In the present study, peak height was encoded to either 2, 8, or $10^4$ levels. [The last condition retains peak height to the maximum reported range in the library (0.01-100), $10^4$ levels.] The choice of the absolute difference criterion coupled with peak height encoded to a small number of bits results in several beneficial effects: very high search speeds can be achieved by packing many channels into each computer word and utilizing essentially parallel processing; computer memory requirements are reduced; and the quantity of data to be entered into the computer for both unknown and library is substantially reduced, significantly reducing data input/output time.

(15) A. Cornu and R. Massot, “Compilation of Mass Spectral Data,” Heyden and Son, Ltd., London, 1966.
(16) “Index of Mass Spectral Data, AMD 11,” American Society for Testing and Materials, Philadelphia, Pa., 1969.

Table I. Spectral Library Characteristics (6880 Spectra)

Molecular weight
    Minimum                               2
    Maximum                            1318
    Average                             169
Average number of peaks reported
    ≥0.01% base                        92.6
    ≥1% base                           40.4
No. of compounds containing only
    C, H                               1795
    C, H, O                            2461
    C, H, N                             133

Computer Program Design. In all of the computer techniques described here, an unknown spectrum is compared against a library of known spectra stored on magnetic tape. As the search proceeds, a record is maintained of the ten compounds in the library yielding the optimum value of the selected criterion. These “top 10” spectra are summarized at the conclusion of the search. All of the programs can accommodate up to 30 unknowns on each pass through the library.

The search programs have been written almost exclusively in Fortran IV and tested on an IBM 360/44 computer. (Only a few essential subroutines are coded in assembly language.) Each complete search package consists of approximately 500 Fortran statements. Since all programs follow the same search strategies, a common subroutine structure has been found to be highly valuable. With this structure it is a relatively simple matter to modify programs to investigate various search algorithms or other parameters. Each program is divided into the following five subprograms.

1. A main control program performing initialization of variables and the calling of the other subroutines.
2. A subroutine for reading unknown spectra (from either cards or tape) and encoding these data in a format compatible with the search algorithm and the spectral library.
3. A subroutine in which the actual library search is performed. For each unknown, after suitable prefiltering (see below), the criterion of disagreement is calculated by a comparison with each library member.
4. A subroutine which maintains a detailed record of relevant library information for the 10 compounds with the minimum value of the criterion of disagreement.
5. A subroutine summarizing at the end of the search the spectral characteristics of the “top 10” library members for each unknown (see Figure 1 for a typical example and the Appendix for further details).

Although all programs followed the above gross structure, in an attempt to optimize performance and to better investigate various parameters, each of the algorithms resulted in rather different spectral packing arrangements. This, in turn, necessitated different calculation schemes in the comparison subroutine. Some of the more important details of these procedures are discussed in the Appendix.

Unknown Data. In order to test and compare reliability of identification and program performance (speed, computer requirements), 125 low resolution mass spectra were considered as unknowns. These spectra were collected from five different sources (10-14) and were known to be different measurements from those in the library. Some characteristics of these unknowns are summarized in Table II.
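The search strategy described above (encode peak heights to K levels, compute the sum of absolute level differences for each library member, and keep a running record of the ten best members) might be sketched in Python as shown below. This is a minimal sketch under stated assumptions: the uniform quantization in `encode_levels` is an assumption made for illustration, since the actual encoding and packing are described in the Appendix, and all names and spectra are hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def encode_levels(
    heights: Dict[int, float], k: int, full_scale: float = 100.0
) -> Dict[int, int]:
    """Quantize peak heights (percent of base) to integer levels 0..k-1.

    ASSUMPTION: uniform quantization; the paper's own encoding may differ.
    """
    return {
        mass: min(k - 1, int(h / full_scale * k)) for mass, h in heights.items()
    }


def disagreement(x: Dict[int, int], l: Dict[int, int]) -> int:
    """Sum of absolute level differences; an absent channel counts as level 0."""
    return sum(abs(x.get(m, 0) - l.get(m, 0)) for m in set(x) | set(l))


def search(
    unknown: Dict[int, int],
    library: Iterable[Tuple[str, Dict[int, int]]],
    keep: int = 10,
) -> List[Tuple[int, str]]:
    """One pass through the library, keeping the `keep` smallest disagreements."""
    scored = ((disagreement(unknown, spec), name) for name, spec in library)
    return heapq.nsmallest(keep, scored)


# Hypothetical spectra (mass -> percent of base peak), encoded to K = 8 levels.
K = 8
unknown = encode_levels({43: 100.0, 70: 60.0, 55: 40.0}, K)
library = [
    ("a", encode_levels({43: 100.0, 70: 55.0, 55: 45.0}, K)),
    ("b", encode_levels({57: 100.0, 43: 80.0, 71: 50.0}, K)),
]
top = search(unknown, library)
```

Here `heapq.nsmallest` plays the role of the subroutine that maintains the “top 10” record; the library is consumed as a stream, in the spirit of a sequential tape pass.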


SEQ = 21   UNKNOWN ID = 630   ISO-AMYLACETATE (8630) C7.H14.O2.   MOLWT = 130
MOLWT RANGE SEARCHED = 69 TO 1000
PEAKS = 11   0/1 TRANSITION = 1.250 PCT TOTAL ION CURRENT   HI MASS PK = 87
FIVE HIGHEST INTENSITY PKS = 43 55 41 70 39
RESTRICTIONS ON SEARCH
    BASE PEAK OF UNKNOWN IN TOP FIVE OF LIBRARY
    BASE PEAK OF LIBRARY IN TOP FIVE PEAKS OF UNKNOWN
    MIN MOLWT LIB .GE. 0.8 * HIGHEST MASS PEAK UNKNOWN
INPUT MASSES .GT. TRANSITION = 39 41 42 43 55 57 61 69 70 73 87

NO.  NDIS   SEQ  MOLWT  CMPD NAME          MOLION  BASE  SECD  THRD  FOURTH  FIFTH  SIXTH
  1     7   550    130  ISOAMYL-ACETATE      130.   43.   70.   55.     42.    61.    41.
  2     9  3034    130  N-AMYL ACETATE       131.   43.   70.   42.     55.    27.    15.
  3     9  2049    130  N AMYL ACETATE       131.   43.   70.   42.     55.    27.    15.
  4    10  2088    158  N HEPTYL ACETATE     117.   43.   56.   70.     41.    55.    61.
  5    10  3700    158  N-HEPTYL ACETATE     117.   43.   56.   70.     41.    55.    61.
  6    12   551    130  N-AMYL-ACETATE       117.   43.   70.   61.     42.    55.    73.
  7    15  5726    198  NOR-TETRADECANE      205.   43.   57.   41.     71.    65.    55.
  8    15  5769    212  NOR-PENTADECANE      215.   43.   57.   41.     71.    85.    55.
  9    15  5790    226  NOR-HEXADECANE       229.   43.   57.   41.     71.    85.    55.
 10    16  5499    170  NOR-DODECANE         172.   43.   57.   41.     71.    85.    55.

SPECIFIC CMPDS DESIRED = 3
    3035  ISOAMYL ACETATE   16
    2046  ISOAMYLACETATE    16
     550  ISOAMYL-ACETATE    7

HISTOGRAM OF DISAGREEMENTS =
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
  0  0  0  0  0  0  0  1  0  2  2  0  1  0  0  3 11 19 14 19 30 38 45 30 27 36

 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 37 36 32 34 47 27 35 23 33 28 14 18 24 15  8  9  7  7 19 10  5  7  3  3  0

MU USED IN SEARCH = 3

Figure 1. Typical printout from search program
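The three restrictions listed in the printout of Figure 1 can be expressed as a simple predicate. The sketch below is illustrative only: the function name and argument layout are assumptions, peak lists are assumed to be ordered by decreasing intensity (so the first entry is the base peak), and the low molecular weight member is a made-up example. The other values are read from the printout.

```python
from typing import Sequence


def passes_prefilter(
    unknown_top5: Sequence[int],
    unknown_high_mass: int,
    lib_top5: Sequence[int],
    lib_molwt: float,
    fraction: float = 0.8,
) -> bool:
    """Apply the restrictions shown in the Figure 1 printout.

    Peak lists are assumed ordered by decreasing intensity, so element 0
    is the base peak. `fraction` is 0.8 in the printout.
    """
    unknown_base = unknown_top5[0]
    lib_base = lib_top5[0]
    return (
        unknown_base in lib_top5          # base peak of unknown in top five of library
        and lib_base in unknown_top5      # base peak of library in top five of unknown
        and lib_molwt >= fraction * unknown_high_mass  # minimum molecular weight
    )


# Values taken from the Figure 1 printout for the unknown (iso-amyl acetate).
unknown_top5 = [43, 55, 41, 70, 39]
unknown_high_mass = 87

# Library entry 550 from the printout (base and next four peaks, MOLWT 130).
keeps_isoamyl = passes_prefilter(unknown_top5, unknown_high_mass,
                                 [43, 70, 55, 42, 61], 130)

# Hypothetical low molecular weight member, rejected by the third restriction.
keeps_light = passes_prefilter(unknown_top5, unknown_high_mass,
                               [43, 58, 15, 27, 42], 58)
```

With a highest mass peak of 87 and a fraction of 0.8, the minimum molecular weight is 69.6, which agrees with the lower limit of 69 shown in the printout.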

Table II. Characteristics of the 125 Compounds Used as Unknowns

                                     Source reference
                              (10)   (11)   (12)   (13)   (14)
Total spectra used              28     25     20     34     18
Minimum molecular weight        73     58    112     57     68
Average molecular weight       117    125    158    104    111
Maximum molecular weight       176    210    256    197    234

Two preconditions were imposed on these data before selection. First, only spectra of compounds (or close isomers) known to be in the library were included. Second, the library spectrum corresponding to a given unknown had to have a significant peak (>20% base) at the mass of the unknown’s base peak. The second condition provides some assurance that obviously incorrect spectra are not considered.

Matching Results-General Procedure. For each type of encoding, each of the 125 unknowns was compared against the library of 6880 spectra. For each criterion of disagreement, two modes of searching were employed: unrestricted search, in which all members of the library were compared; and filtered search, in which only library members satisfying certain spectral preconditions were considered. The filtering preconditions used in this study were:

1. The mass corresponding to the maximum intensity peak (base peak) of the unknown must be among the 5 highest peaks of the library member.
2. The mass of the base peak of the library member must be among the 5 highest peaks of the unknown.
3. The molecular weight of the library member must be greater than an arbitrary specified fraction (generally, 0.7) of the mass of the highest unknown peak with an intensity greater than 0.5% of the total ion current.

Conditions 1 and 2 require that both the library and unknown spectra have significant peaks at the mass of the base peak. Knock et al. (7) used similar prefiltering conditions with the top 6 peaks. Hites and Biemann (3) required that the base peak in the unknown correspond to a peak of at least 25% intensity in the library, and vice versa. For the library used here, the distribution of the average % base peak corresponding to the Nth most intense peak is presented in Figure 2. The average height of the 5th highest peak is 26.6% base peak and the 6th is 22.1% base. On the average, for the library used here a threshold of 20% of the base peak would require about 6.3 masses to be stored for each library member.

Condition 3 removes lower molecular weight compounds from consideration. One potential danger of imposing this condition is that if significant higher molecular weight impurities are present, the prefiltering might restrict the search to too high a minimum molecular weight. However, it is not as stringent a requirement as that imposed by Knock et al. (7) who generally assumed that the molecular weight was known and searched in a narrow range (