Computer Methods in Analytical Mass Spectrometry
Development of Programs for Analysis of Low Resolution Mass Spectra
L. R. Crawford
Division of Chemical Physics, C.S.I.R.O., Melbourne, Victoria 3000, Australia
J. D. Morrison¹
Division of Physical Chemistry, La Trobe University, Bundoora, Victoria 3083, Australia
A computer program which analyzes low resolution mass spectra of organic compounds of molecular weight less than 200 is described. The program operates ab initio, that is, it does not require a reference library of mass spectra, except in its original compilation. The unknown mass spectrum is interrogated to yield successively the molecular mass if possible, the presence of functional groups, groups adjacent to the functional group, and finally the molecular skeleton. The conclusions are repeatedly checked for consistency. As details of the structure emerge, they are stored in a structure matrix. As output the program produces a conventional diagram of the molecular structure. Identification takes less than ten seconds. Obviously the program is not always successful but it does demonstrate the feasibility of carrying out in a general way the entire process of structure determination.
THE COMBINATION of mass spectrometry and gas chromatography can produce hundreds of mass spectra a day, and a trained chemist may take two or three weeks on the results of a single run to separate the common or simpler compounds from those he wants to investigate more fully. Computer search of a library can be used and has been shown to be a very valuable technique (1), but the compounds with which the user is concerned may not be in the library; also such a library needs constant attention to keep it up to date and down to a reasonable size. A very desirable addition is a computer program which embodies all the rules for determining a structure from a mass spectrum ab initio, yet is capable of being used in a computer of modest size. How far one can go in programming a computer to duplicate the performance of an organic mass spectrometrist is of course a very interesting question in itself, and some aspects of this problem have already been discussed.
¹ Present address, Department of Chemistry, University of Utah, Salt Lake City, Utah 84112
(1) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1464 (1968).
(2) A. M. Duffield, A. V. Robertson, C. Djerassi, B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, and J. Lederberg, J. Amer. Chem. Soc., 91, 2977 (1969).
(3) G. Schroll, A. M. Duffield, C. Djerassi, B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, and J. Lederberg, ibid., p 7440.
(4) M. Barber, P. Powers, P. Wallington, and M. J. Wolstenholme, Nature, 212, 784 (1966).
(5) K. Biemann, C. Cone, B. R. Webster, and G. P. Arsenault, J. Amer. Chem. Soc., 88, 5598 (1966).
Computer routines have been developed by several authors (2-5) for interpreting the mass spectra of specific classes of compounds. These routines are capable in some degree of determining the chemical class of an unknown (6, 7). A detailed account of some of the problems of this approach has been given by McLafferty and his colleagues (8).
In attempting to write a general purpose program, it is essential to have a scratch pad on which structural information can be stored and manipulated as required. A structural code was devised which gave a framework within which this could be done (9). This structure code, consisting of a nearest neighbor table, had the advantage that it at all times defined certain limitations in building up a structure. After each addition to the table a check could be made as to whether the structure was fully, although not necessarily correctly, specified.
This paper reports on an attempt to combine the various separate routines described above, so that the computer itself, on the basis of the mass spectrum alone, decides on the molecular mass, the molecular class, calls in the sub-routine available for that particular class, gathers the information together, checks it and any other information it can acquire for consistency, then draws out the structure. Failing this, it gives informative printout. In its present form the program can deal with molecules containing only the elements C, H, O, and N.
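The idea of a nearest neighbor table that limits what may be added and can be checked for completeness after each addition might be sketched as follows. The paper does not give the layout of the structure code (it is described in ref. 9), so the class, its method names, and the use of simple valence counting are all assumptions made for illustration only.

```python
# Hypothetical sketch: the table layout and names here are assumptions,
# not the structure code of ref. (9).
VALENCE = {"C": 4, "H": 1, "O": 2, "N": 3}  # usual valences of the supported elements


class NeighborTable:
    """Atoms with their bonded neighbors; valences limit what can be added."""

    def __init__(self):
        self.elements = []  # element symbol of atom i
        self.bonds = {}     # (i, j) with i < j -> bond order

    def add_atom(self, element):
        if element not in VALENCE:
            raise ValueError(f"unsupported element: {element}")
        self.elements.append(element)
        return len(self.elements) - 1

    def used_valence(self, i):
        return sum(order for pair, order in self.bonds.items() if i in pair)

    def add_bond(self, i, j, order=1):
        # Reject any addition that would exceed the valence of either atom.
        for atom in (i, j):
            if self.used_valence(atom) + order > VALENCE[self.elements[atom]]:
                raise ValueError(f"valence exceeded at atom {atom}")
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i):
        return sorted(a if b == i else b for a, b in self.bonds if i in (a, b))

    def fully_specified(self):
        # Every valence used up: complete, though not necessarily correct.
        return all(self.used_valence(i) == VALENCE[e]
                   for i, e in enumerate(self.elements))
```

In this sketch hydrogens are stored explicitly, so a skeleton such as C-O reports itself as not fully specified until the four hydrogens of methanol have been attached.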
DESCRIPTION OF THE METHOD
The program which has been developed is an attempt to systematize the approach of an organic mass spectrometrist. As such, it differs in some respects from the more systematic approach of Lederberg and his colleagues (10, 11). In the latter, at a relatively early stage in the analysis, the computer generates all possible structures, then proceeds to eliminate these systematically on the basis of the mass spectral information. This approach could be expected to be very successful with small structures, but the computing time required should increase rapidly with the molecular weight. [This has been modified by these authors in later work (12, 13) but still appears to play a very significant part in their method of structure determination.] The approach of the present method is very close to that of the human chemist, in that it is much less systematic, but may be better suited to larger structures.
It is desirable that any such program should be capable of continual modification and improvement. Whenever an incorrect identification takes place, it should be relatively simple to modify the weighting factors given to various logical decisions, in order to rectify this error, and hopefully avoid it in future runs. To obtain maximum flexibility in this way, the program was written in FORTRAN IV in the form of a relatively short main control program, which is able to call on a series of minor programs. The whole program is of approximately 26,000 words, small enough to allow it to be used on a small computer by chaining methods. The initial development of the program was carried out on a Control Data type CDC3200 computer. Later developments have run in a segmented form in a Digital Equipment type PDP9.
The main program reads in the mass spectrum, and preprocesses it, then calls into core the first sub-program. This returns to the main program information such as the molecular weight and main chemical class of the compound. On the basis of this information, the main program may call into core a second overlaying sub-program to analyze mass spectra belonging to a specific class of molecule. If this sub-program fails to identify the compound satisfactorily, or the main program decides otherwise, a third overlaying sub-program which interrogates other aspects of the mass spectrum in an attempt to deduce the structure may be called. The second and third sub-programs leave the results of this interrogation in the form of a nearest neighbor table accessible by the main program, which may then call into core the final sub-program to draw out the chemical structure.
(6) B. Pettersson and R. Ryhage, ANAL. CHEM., 39, 790 (1967).
(7) L. R. Crawford and J. D. Morrison, ibid., 40, 1469 (1968).
(8) R. Venkataraghavan, F. W. McLafferty, and G. E. Van Lear, Org. Mass Spectrom., 2, 1 (1969).
(9) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 41, 994 (1969).
(10) J. Lederberg, G. L. Sutherland, B. G. Buchanan, E. A. Feigenbaum, A. V. Robertson, A. M. Duffield, and C. Djerassi, J. Amer. Chem. Soc., 91, 2973 (1969).
(11) J. Lederberg and M. Wightman, ANAL. CHEM., 36, 2365 (1964).
(12) A. Buchs, A. M. Duffield, G. Schroll, C. Djerassi, A. B. Delfine, B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, and J. Lederberg, J. Amer. Chem. Soc., 92, 6831 (1970).
(13) A. Buchs, A. B. Delfine, A. M. Duffield, C. Djerassi, B. G. Buchanan, E. A. Feigenbaum, and J. Lederberg, Helv. Chim. Acta, 53, 1394 (1970).
ANALYTICAL CHEMISTRY, VOL. 43, NO. 13, NOVEMBER 1971
Main Program.
The mass spectrum, recorded as integral mass numbers and peak intensities on cards or magnetic tape, is read and the intensities are normalized. A low resolution largest peak plot is produced and the group analysis routines are called into core. A table of probabilities of the compound belonging to any one chemical group and the molecular weight is returned. If the second most probable group has probability less than 10%, then it is decided that the compound has only one functional group, and one of the sub-programs giving a deductive routine for that specific group is called into core. If the routine fails to analyze the compound, or the compound appears to contain more than one functional group, the general purpose sub-program is called. If any one of these succeeds, the main program calls the structure drawing sub-program, and on return from this is ready to process the next mass spectrum. The generalized flow chart for the whole program is given in Figure 1.
Figure 1. Generalized flowchart of program, showing relation of main program and routines
Group Analysis Routines. The control routine calls subroutines which form tables of probability of the compound belonging to the twelve chosen groups: Aromatic, Ester, Ether, Acid, Ketone, Aldehyde, Alkene, Alkane, Alcohol, Cycloalkane, Diene, Amine. These probability tables are based on the four largest peaks in the spectrum, the four most significant peaks, fourteen peak condensation average spectra, and group mass spectral hypersphere coordinates (7). The control routine weights the probability values and sums them. A somewhat similar approach to this has been employed by Pettersson and Ryhage (6).
Molecular Weight Routine. The highest value of m/e at which a significant peak (significant = background + 20%) is recorded, H, is taken initially to be an upper limit to the molecular weight. It is most often an isotope peak. The computer compares the peak height at this highest mass with the peak at one mass unit lower, and checks whether P(H) could be an isotope peak of P(H - 1). If the answer is positive, P(H - 1) is compared with P(H - 2), and so on until a negative answer is returned. This process allows the mass M of the molecular ion to be determined in many cases with fair certainty.
A number of consistency checks are then applied. First, the characteristics of the peak believed to be the parent ion are checked for consistency with the molecular class. A second test is for the presence of significant peaks at masses M - 3 and M - 13, where a significant peak is considered to be one greater than 15% of the assumed parent peak, or greater than the sum of the peaks at ±1 mass unit from it. Another test, based on the greater stability of the even-electron ions, is that if the molecular mass is even, the sum of the odd mass fragment intensities is greater than that of the even and vice versa. If evidence suggests that the molecular peak is not observed, printout occurs.
(14) K. Biemann and P. V. Fennessey, Chimia, 21 (6), 226 (1967).
Empirical Formula Routine. If high resolution mass spectra are available, the previous routine is much less necessary (14). However, it is possible to write a series of restrictive relations between the integral molecular mass and the possible atomic formulas, and when these are combined with other internal evidence in the mass spectrum, the possible atomic formulas are limited to a surprising degree. The value of M sets an absolute upper limit to the number of C atoms in the molecule, $q_C \le M/12$ (integer division). It also of course sets upper limits for every other kind of atom, but these limits are less useful. The relative heights of P(M) and P(M + 1) give a less precise lower limit to $q_C$, and may indicate the approximate number of oxygen and nitrogen atoms $q_O$ and $q_N$.
40
+ qh. + M13- -
qc
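The isotope-peak walk-down of the Molecular Weight Routine and the odd/even parity test can be sketched as below. The paper does not state how the program decides that P(H) "could be an isotope peak" of P(H - 1); the criterion used here (a ratio no larger than the 13C contribution of an all-carbon ion, with an arbitrary `slack` factor) is an assumption, as are the function names and the form of the significance threshold.

```python
C13_RATIO = 0.011  # approximate natural 13C/12C abundance ratio


def could_be_isotope_peak(mass, height, height_below, slack=1.5):
    """Assumed test: is P(mass) small enough to be a 13C satellite of P(mass - 1)?"""
    if height_below <= 0:
        return False
    max_carbons = (mass - 1) // 12  # most carbons the lower ion could hold
    return height / height_below <= slack * C13_RATIO * max_carbons


def molecular_mass(spectrum, threshold):
    """Walk down from the highest significant peak while it looks like an isotope peak.

    spectrum maps integral m/e to intensity; threshold is the significance level.
    """
    significant = [m for m, height in spectrum.items() if height > threshold]
    if not significant:
        return None
    h = max(significant)
    while could_be_isotope_peak(h, spectrum[h], spectrum.get(h - 1, 0.0)):
        h -= 1
    return h


def parity_consistent(spectrum, m):
    """Even M should give mostly odd-mass fragments, and vice versa."""
    odd = sum(i for mass, i in spectrum.items() if mass < m and mass % 2 == 1)
    even = sum(i for mass, i in spectrum.items() if mass < m and mass % 2 == 0)
    return odd > even if m % 2 == 0 else even > odd
```

For an illustrative spectrum such as `{43: 80.0, 57: 100.0, 85: 0.5, 86: 15.0, 87: 1.0}`, the peak at 87 is accepted as an isotope peak of 86, the peak at 86 is far too large to be an isotope peak of 85, and the walk stops at M = 86.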
In the group identification, alcohols and esters, alkanes and ketones, etc., are frequently confused, so a list of equivalent groups is referred to and the redundant classes are removed from the list of the three most probable classes. In this present program, the three most probable class identifications are checked against the molecular weight in the following way. On dividing the molecular weight by 14, the remainder is:
\[ S = 2 + 2q_O + q_N - 2R \pmod{14} \]
where $R$ is the number of rings and double bonds present.
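A remainder relation of this kind can be checked by deriving it from the integral atomic masses. The derivation below is a reconstruction rather than the authors' own working: it assumes that only C, H, O, and N are present (integral masses 12, 1, 16, and 14) and that $R$ is the usual count of rings plus double bonds, $R = q_C + 1 + (q_N - q_H)/2$, as in Equation 5.

```latex
\begin{align*}
M   &= 12q_C + q_H + 16q_O + 14q_N \\
q_H &= 2q_C + 2 + q_N - 2R \\
M   &= 14q_C + 16q_O + 15q_N + 2 - 2R \\
M   &\equiv 2 + 2q_O + q_N - 2R \pmod{14}
\end{align*}
```

As a check, a saturated alcohol ($q_O = 1$, $q_N = 0$, $R = 0$) gives a remainder of 4, which is the remainder for methanol at M = 32, and a ketone ($q_O = 1$, $R = 1$) gives 2, which is the remainder for acetone at M = 58.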
A list of $q_C$, $q_O$, $q_N$, and $R$ values (with $q_C$ a minimum) for all functional groups is maintained. Depending on what functional groups have been identified, an $S_c$ value can be calculated using this formula, e.g.

Group          $q_C$   $q_O$   $q_N$   $R$   $S_c$
Alkane           1       0       0      0      2
Alkene           1       0       0      1      0
Ketone           3       1       0      1      2
Aldehyde         1       1       0      1      2
Cycloalkane      3       0       0      1      0
Alcohol          1       1       0      0      4
If several groups are believed to be present, the $S_c$ value will be the sum of the separate values for each. If the $S_{obs}$ value obtained from the molecular weight is equal to this, it implies that either the $q_O$, $q_N$, and $R$ values for the molecule are the same as those for the functional group (or combination of groups) or excess $q_O$ or $q_N$ in the group minimum formula must be balanced by unsaturation. If ($S_{obs}$ - $S_c$) is zero, the combination is acceptable, and it is stored. If $S_{obs}$ is greater than $S_c$ and no hetero atoms are expected, then the combination is rejected.
This information is then used to calculate a list of possible molecular formulas for the compound. The maximum number of hydrogens in the formula is calculated (11) and then $q_C$ is decreased from its maximum value to its minimum as restricted by P(M + 1)/P(M) in steps of one, and $q_H$ is decreased in steps of two to zero, being reset to its maximum at each change in $q_C$. $q_O$ is increased by one from zero to the difference between the minimum and maximum $q_C$. $q_N$ is calculated from:
\[ q_N = \frac{M - 12q_C - q_H - 16q_O}{14} \qquad (4) \]
and the unsaturations from:
\[ R = q_C + 1 + \frac{q_N - q_H}{2} \qquad (5) \]
If $q_N$ is neither zero nor a positive integer, or twice ($R$ - 2) is greater than $q_C$, or $R$ is not integer, or any value of $q_C$, $q_O$, or $R$ is less than any corresponding value in the list of possible group combinations, the formula is rejected.
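The group check against the remainder of M/14 and the enumeration of candidate formulas can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN: the group values are as read from the table in this section, the "undecided" outcome covers cases the text does not spell out, and the hydrogen loop simply starts from all remaining mass instead of the maximum-hydrogen count of ref. (11). All function and parameter names are hypothetical.

```python
# (q_C, q_O, q_N, R, S_c) per group, as read from the table in this section.
GROUP_TABLE = {
    "Alkane": (1, 0, 0, 0, 2),
    "Alkene": (1, 0, 0, 1, 0),
    "Ketone": (3, 1, 0, 1, 2),
    "Aldehyde": (1, 1, 0, 1, 2),
    "Cycloalkane": (3, 0, 0, 1, 0),
    "Alcohol": (1, 1, 0, 0, 4),
}


def check_groups(m, groups):
    """Compare the observed remainder of m / 14 with the summed group values."""
    s_obs = m % 14
    s_c = sum(GROUP_TABLE[g][4] for g in groups)
    hetero = any(GROUP_TABLE[g][1] or GROUP_TABLE[g][2] for g in groups)
    if s_obs == s_c:
        return "accept"
    if s_obs > s_c and not hetero:
        return "reject"
    return "undecided"  # assumption: the text does not cover the other cases


def candidate_formulas(m, c_min=1, minima=(1, 0, 0)):
    """List (q_C, q_H, q_O, q_N, R) tuples consistent with integral mass m."""
    min_c, min_o, min_r = minima          # group minimum values of q_C, q_O, R
    c_max = m // 12                       # absolute upper limit on carbons
    found = []
    for q_c in range(c_max, c_min - 1, -1):       # maximum down to minimum
        for q_o in range(0, c_max - c_min + 1):   # zero up to (max - min) carbons
            left = m - 12 * q_c - 16 * q_o        # mass left for H and N
            for q_h in range(left, -1, -2):       # steps of two down to zero
                if (left - q_h) % 14:
                    continue                      # q_N would not be an integer
                q_n = (left - q_h) // 14
                twice_r = 2 * q_c + 2 + q_n - q_h
                if twice_r % 2:
                    continue                      # R would not be an integer
                r = twice_r // 2
                if 2 * (r - 2) > q_c:
                    continue
                if q_c < min_c or q_o < min_o or r < min_r:
                    continue
                found.append((q_c, q_h, q_o, q_n, r))
    return found
```

For M = 58 the list includes C4H10 (R = 0) and C3H6O (R = 1), and `check_groups(58, ["Ketone"])` accepts the ketone assignment while `check_groups(58, ["Alkene"])` rejects a pure alkene.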
Special Chemical Class Routines. Many classes of molecule have fairly well defined breakup patterns, often involving the production of rearrangement ions. The control program selects the appropriate specific deductive routine for each chemical class if available, and calls it into core. One of these routines, for alkyl benzenes, was based with very little modification on that described by S. Meyerson (15). Others for alkanes and for aliphatic esters were based on the descriptions given by Pettersson and Ryhage (6, 16). Others for amines, alcohols, and ethers (17), and aldehydes, ketones, acids, and esters (18), have been especially written. These routines (to the best of their ability) return the structure in the form of a nearest neighbor table, as well as informative printouts, and a true/false value for successful structure determination.
General Interrogation Routine. Satisfactory specific deductive routines have not yet been written for some chemical classes. Also, it has not yet proved possible to organize in such a systematic form much of the empirical information already known about breakup patterns. In interrogating a mass spectrum, two methods may be employed. In the first, the serial method used in the specific rearrangement routines, the interrogation consists of a sequence of questions each of which depends on the answer to the previous question. In the second, the parallel method, a number of quite independent questions are asked, and the results are weighted and summed. While apparently less efficient, this second method surprisingly appears to be at least as successful as the first, and is sometimes more so. When a mass spectrometrist has failed to deduce a structure by all the systematic methods, he is reduced to inspired guesswork. The present routine fulfils this role to a certain extent. It is based primarily on the use of diagnostic mass peaks in a way similar to that employed by McLafferty (19).
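The difference between the serial and parallel methods of interrogation can be illustrated with a small sketch. The questions, thresholds, and weights below are hypothetical and chosen only to show the two control structures; they are not the questions or weighting factors used by the program.

```python
def parallel_score(spectrum, questions):
    """Ask every question independently and add the weights of those answered yes."""
    return sum(weight for weight, question in questions if question(spectrum))


def serial_decision(spectrum):
    """Each question is asked only if the previous answer leads to it."""
    if spectrum.get(91, 0) >= 50:          # strong m/e 91?
        if spectrum.get(92, 0) > 0:        # then: is m/e 92 also present?
            return "alkyl benzene with rearrangement peak"
        return "alkyl benzene"
    return "unknown"


# Hypothetical question sets: (weight, question) pairs on a normalized spectrum.
ALKYL_BENZENE = [
    (5, lambda s: s.get(91, 0) >= 50),
    (3, lambda s: s.get(77, 0) > 0),
    (2, lambda s: s.get(65, 0) > 0),
]
ALKANE = [
    (5, lambda s: s.get(57, 0) >= 50),
    (5, lambda s: s.get(43, 0) >= 50),
]
```

In the parallel form a single wrong answer only lowers a score, whereas in the serial form it redirects every later question, which may be one reason the less systematic method holds up as well as it does.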
A list of 64 diagnostic masses is held in store. Each diagnostic mass has associated with it a table of peak characteristics, and structure addresses. The peak characteristics are a flag to indicate if the fragment to be considered is the ion, or the neutral part being formed at the same time, and whether the observed ion has the expected intensity, e.g., base peak, within a typical range, less than 4% of base peak, etc. The basic structural storage is a code matrix which is decoded by appropriate subroutines into an atom connection matrix describing the structure. When referring to structures in the interrogation routine, the address of the basic structural code matrix in store is used.
The mass spectrum is scanned, and each mass number is checked for diagnostic masses at the mass modulo 14 plus an integral number of 14 mass units up to the mass number. The molecular mass minus the mass number is similarly checked. The characteristics of each observed peak are compared with those associated with each structure. If they do not agree, or if the formula of the structure is not a sub-set of the parent formula, then it is ignored. A list of identified structures is created and maintained and if a new structure is found it is added to the list. If the structure has been noted previously, then a frequency of occurrence is recorded in descending order of fragment mass, see Figure 2.
Figure 2. Flowchart of diagnostic peak interrogation part 1
Figure 3. Flowchart of diagnostic peak interrogation part 2
(15) S. Meyerson, Appl. Spectrosc., 9, 120 (1955).
(16) B. Pettersson and R. Ryhage, Ark. Kemi, 26 (25), 293 (1967).
(17) J. F. O'Brien and J. D. Morrison, Org. Mass Spectrom., in press.
(18) J. D. Morrison, J. F. Smith, and J. Taranto, ibid., in press.
(19) F. W. McLafferty, "Mass Spectral Correlations," American Chemical Society, Washington, D. C., 1963.
The resulting list of structures is then used to compile a table of likely fragment combinations. Each molecular formula is considered in turn. The list of structures is searched for one which is a sub-set of the formula, then
searched again for one which is a sub-set of the remaining formula and so on until no more can be found. A probability value is calculated from the number of atoms remaining in the structure, the probabilities of the contributing structures and the probability of the molecular formula. This is repeated for each molecular formula, using different structures as the starting fragments. The resulting tables of formulas and structural combinations are sorted in order of probability value (Figure 3). This whole process may be summed up briefly by saying
Figure 4. Flowchart of general interrogative routine part 1
Figure 5. Flowchart of general interrogative routine part 2