tinely we use 10 to 100 times as much sample as for a normal low resolution scan. Two major advantages of the computerized over the manual method of acquiring metastable data are evident from our experience. First, an order of magnitude in convenience is gained by using the computer. Second, masses are determined with considerably higher accuracy. In a practical sense, the experiments involving smoothing and averaging cannot be run without a computer. The determination of the masses of weak metastable peaks cannot be performed manually on our mass spectrometer to better than ±1 amu, whereas with the computerized system much better determinations are routinely achieved.

A fully automated system for processing metastable ion data produced by the Barber-Elliott-Major defocusing technique has been described in the literature (10). In that system, all the metastable transitions for all ions in a spectrum can be detected. For our system, it would have been quite simple from the programming point of view to let the computer find the top of each normal ion, from M·+ downwards, and to take an ESA scan on each of them. However, the management of such large sets of data presents some problems and tends to dilute the attention of the mass spectrometrist. In our system, we have taken the approach that only ions of interest, selected by the operator, will be scanned. Each ion of interest is rapidly brought into focus using the magnetic field and the digital mass marker display. The total time for scanning and processing each spectrum is usually less than 2 min.

It was pointed out to us by Prof. Beynon that, as with low or high resolution mass spectra, the mass spectrometrist is usually selective in the choice of ions that he considers. These are either intense ions or structurally significant ions of lower abundance. The same should be true for metastable peaks if we want to use them on a routine basis.

LITERATURE CITED
(1) R. G. Cooks, J. H. Beynon, R. M. Caprioli, and G. R. Lester, “Metastable Ions”, Elsevier Scientific Publishing Co., Amsterdam, 1973.
(2) J. H. Beynon, R. G. Cooks, J. W. Amy, W. E. Baitinger, and T. Y. Ridley, Anal. Chem., 45, 1023A (1973).
(3) T. Wachs, P. F. Bente III, and F. W. McLafferty, Int. J. Mass Spectrom. Ion Phys., 9, 333 (1972).
(4) K. H. Maurer, C. Brunnee, G. Kappus, K. Habfast, U. Schroder, and P. Schulze, 19th Annual Conference on Mass Spectrometry, ASTM Committee E-14, Atlanta, Ga., May 1971.
(5) M. Barber and R. M. Elliott, 12th Annual Conference on Mass Spectrometry, ASTM Committee E-14, Montreal, Canada, June 1964.
(6) L. Baczynskyj and R. J. Wnuk, 22d Annual Conference on Mass Spectrometry and Allied Topics, Philadelphia, Pa., May 1974.
(7) L. Baczynskyj, J. F. Zieserl, M. D. Kenny, J. B. Aldrich, and D. J. Duchamp, 23d Annual Conference on Mass Spectrometry and Allied Topics, Houston, Texas, May 1975.
(8) See Chapter 2 of Reference 1.
(9) F. W. McLafferty, “Interpretation of Mass Spectra”, 2d ed., W. A. Benjamin, Inc., Reading, Mass., 1973, Chapter 7.
(10) J. E. Coutant and F. W. McLafferty, Int. J. Mass Spectrom. Ion Phys., 8, 323 (1972).

RECEIVED for review December 4, 1975. Accepted April 26, 1976.

Probability Based Matching System Using a Large Collection of Reference Mass Spectra

Gail M. Pesyna,¹ Rengachari Venkataraghavan, Henry E. Dayringer, and F. W. McLafferty*

Department of Chemistry, Cornell University, Ithaca, N.Y. 14853

The Probability Based Matching (PBM) system, which utilizes a reverse searching procedure and weighting of mass and abundance data, has been modified to permit matching of an unknown spectrum against a large data base not restricted to spectra taken under the same experimental conditions. The performance has been evaluated quantitatively using recall/reliability plots following procedures developed for information retrieval systems. Sensitivity to impurities and other errors in the reference spectra has been decreased by “flagging” up to four peaks so that they are ignored in the calculation. PBM has been tested with four classes of criteria for structural matching, showing that the great majority of incorrect matches are of structurally related compounds. The utility of PBM is particularly striking for mixtures, where it gives useful performance in the identification of components present at concentrations as low as 10%.

Research in the area of document and information retrieval has firmly established that system efficiency is increased by the proper weighting of the relative importance of items used for identifying each member of a library (1). Similar conclusions have been reached by those studying mass spectral retrieval systems (2-6): for example, Crawford and Morrison (3) showed that using a logarithmic scale of peak abundances gave a substantial improvement in retrieval performance, in part by reducing the importance of the base peak in the spectrum. A “Probability Based Matching” (PBM) system has recently been proposed (5) which also weights the m/e values of the peaks according to their “uniqueness”, applying this to a small reference library (less than 100 spectra) from one instrument using a reverse search. This paper describes the application of the PBM concept to a large data base (23 879 spectra) (7) and testing of the resulting system using a statistically large sample of unknown spectra. This system was designed in particular to emphasize high reliability in retrieval, as those unknowns for which only low confidence matches can be achieved usually must also be examined by a mass spectrometrist or an interpretive algorithm such as the “Self-Training Interpretive and Retrieval System” (STIRS) (8). The results show that weighting of the mass as well as the abundance values improves retrieval performance, and confirm the value of “reverse searching” for the spectra of mixtures (5, 6).

¹ Current address, Science and Technology Committee, U.S. House of Representatives, 2321 Rayburn Building, Washington, D.C. 20515.

ANALYTICAL CHEMISTRY, VOL. 48, NO. 9, AUGUST 1976

“Probability Based Matching” is based upon the “General Rule of Multiplication” of probability theory (9), which states that if n independent events occur with probabilities p1, p2, ..., pn, then the probability of all n of these events occurring is given by Equation 1.

$$\text{overall probability} = \prod_{i=1}^{n} p_i \tag{1}$$

Thus if peaks with masses m1 and m2 having intensities i1 and i2 occur in mass spectra with probabilities p1 and p2, the probability that both occur at random in an unknown spectrum is p1 × p2. If this product is small, it is much more likely that the presence of peaks m1 and m2 of intensities i1 and i2 is due to the identity of the unknown spectrum with that of the reference compound.

A low value of this probability provides a high confidence that an identification is correct, measured by a “confidence value” or “K value” (5). This measure, as well as all the individual probabilities, is expressed as the corresponding base two logarithm for convenience of calculation; inverse probabilities are also used to simplify the calculations and to produce a final result which is a direct measure of “confidence.” In this reverse search, a confidence value, K, equal to the sum of the individual Kj values, is computed for each reference spectrum matched against the unknown. Kj is calculated for each peak in the unknown whose intensity agrees within a predetermined range with that of the corresponding peak in the reference spectrum. Kj combines four terms (Equation 2),

$$K_j = U_j + A_j + W_j - D \tag{2}$$

where Uj is the contribution to the probability of the “uniqueness” of the m/e value of peak j; Aj is the contribution to the probability of the abundance value of the peak as it appears in the reference spectrum; Wj, the “window factor,” is a measure of the agreement required between the abundance of the peak in the reference and in the unknown; and D, the “dilution factor” for mixture spectra, is a measure of the overall reduction of peak intensities in the unknown due to the presence of other components (if the unknown spectrum is of a pure compound, D = 0).

A peak in the unknown which does not agree within the window tolerance is ignored in the cumulative calculation of confidence. If it is more intense than would be expected, it is termed “contaminated”. However, peaks of intensities less than the minimum allowed are treated differently than in the earlier PBM system (5). In that system, all the reference and the unknown spectra were recorded on the same instrument, and the background level was known for each unknown sample; therefore it was assumed that a peak in the unknown could not be of lower relative intensity if that reference compound was present in the unknown. In the present system, on the other hand, which uses a large data base of spectra from diverse sources, this could be true because of experimental variation or even impurities in the reference compound; thus a limited number of less intense peaks are “flagged” in the match to ignore this discrepancy.

The assumption that mass spectral peaks are independent events, which is essential to the rigorous application of the General Rule of Multiplication, is, of course, far from exact for many mass spectral peaks; for example, it is much more common to find m/e 41 in a spectrum containing an abundant m/e 57 peak. It would also be expected that the molecular ion and other high mass peaks would show less cross-correlation, and so these are given extra preference.
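As an illustration of why base two logarithms of inverse probabilities are convenient, the following minimal sketch shows that the product in Equation 1 becomes a simple sum, and evaluates Equation 2 for a single peak. The function names and the numerical values are hypothetical and chosen only for illustration; they are not taken from the PBM programs.

```python
import math

def confidence_from_probabilities(probs):
    """log2 of the inverse joint probability of independent events (Equation 1)."""
    # log2(1 / (p1 * p2 * ...)) equals the sum of log2(1 / p_i)
    return sum(math.log2(1.0 / p) for p in probs)

def peak_confidence(u, a, w, d):
    """Equation 2 for one matching peak: Kj = Uj + Aj + Wj - D."""
    return u + a + w - d

# Two peaks that occur at random in 1/16 and 1/8 of all spectra (made-up values)
k = confidence_from_probabilities([1 / 16, 1 / 8])
print(k)  # 4 + 3 = 7.0, i.e. a joint random probability of 1/128
```

A confidence of 7 here corresponds to a chance of 1 in 2^7 that both peaks appear at random, which is the sense in which a larger K value means a more trustworthy match.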

EXPERIMENTAL

A more complete rationale for the experimental procedure and details on program optimization are reported elsewhere (10).

The Data Base. The “Registry of Mass Spectral Data” (7), consisting of 18 806 spectra of different compounds and 5073 lesser-quality spectra of some of those compounds, was used in the creation of the PBM library. Although a large number of errors had been eliminated during the original preparation of the data base, checking of individual cases of poor PBM results showed that a significant number remained. The following metastable, multiply-charged, and impurity peaks are eliminated from each reference spectrum before it is condensed: all peaks having non-integral masses, peaks at m/e 18, 28, and 32 which may be due to water and air, and peaks found at masses higher than those in the molecular ion cluster, with the limit defined as the molecular weight + 3 + 2(No. of Cl atoms + No. of Br atoms) + ½(No. of S atoms + No. of Si atoms). If the compound does not contain elements other than H, C, N, O, F, Si, P, S, Cl, Br, and I, and if its molecular weight is greater than 50 amu, peaks due to the illogical neutral losses of 4-12 amu and 21-23 amu are also excluded, plus the loss of 18 amu if the compound does not contain oxygen, the loss of 19 amu if no oxygen or fluorine, the loss of 20 if no fluorine, and the losses of 13, 14, 24, and 25 amu if no chlorine or bromine.

The U value of each peak is obtained from the reference table (11). All peaks below mass 29 are arbitrarily assigned U values of 1, a value which is low enough to actively discriminate against the selection of these peaks but still permits them to be used if the spectrum contains very few peaks at higher m/e values. The peak abundance percentages have been divided into standard ranges assigned to specific A values (11). For the reference spectrum, the A value of each peak is determined by the range into which its abundance falls. Thirty-two of the amu values have abundance probability distributions significantly different from the standard distribution, so that special abundance ranges must be used for these A value determinations.

All peaks in the spectrum are ordered by decreasing U + A values; within each set of peaks having the same value of U + A, the peaks are ordered on the basis of decreasing m/e values. The 15 peaks at the top of this ordered list are checked for the presence of the base peak, for the most abundant isotopic peak in the molecular ion (M·+) cluster, and for the peak (or two peaks if M·+ is not present) corresponding to the neutral loss(es) of 18, 20, 27, 28, 30, 32, 34, 36, 42, 44, 46, 48, 56, 60, or 64 amu having the largest U + A value(s). If any of these three are not already included in the list, they are substituted for the peaks of lowest (U + A) value.
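The clean-up and ordering rules above can be sketched roughly as follows. This is an illustrative reading, not the original program: the function names are hypothetical, “no chlorine or bromine” is taken to mean that neither element is present, the upper mass limit of the molecular ion cluster is left out, and the substitution of the base peak, molecular ion, and neutral-loss peaks into the 15-peak list is not shown.

```python
# Elements for which the illogical-loss rules are applied (from the text)
ALLOWED = {"H", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}

def illogical_losses(elements, mol_wt):
    """Neutral losses (amu) whose peaks are excluded for this compound."""
    if not set(elements) <= ALLOWED or mol_wt <= 50:
        return set()
    losses = set(range(4, 13)) | {21, 22, 23}
    if "O" not in elements:
        losses.add(18)
    if "O" not in elements and "F" not in elements:
        losses.add(19)
    if "F" not in elements:
        losses.add(20)
    if "Cl" not in elements and "Br" not in elements:
        losses |= {13, 14, 24, 25}
    return losses

def clean_spectrum(peaks, elements, mol_wt):
    """peaks: dict of m/e -> abundance. Returns a filtered copy."""
    bad = {mol_wt - loss for loss in illogical_losses(elements, mol_wt)}
    return {mz: ab for mz, ab in peaks.items()
            if float(mz).is_integer()        # drop non-integral masses
            and mz not in (18, 28, 32)       # possible water and air peaks
            and mz not in bad}               # illogical neutral losses

def order_peaks(mz_values, ua):
    """Order m/e values by decreasing U + A, ties by decreasing m/e."""
    return sorted(mz_values, key=lambda mz: (-ua[mz], -mz))
```

For a hydrocarbon of molecular weight 100, for example, a peak at m/e 82 (loss of 18) would be dropped, while a peak at m/e 85 (loss of 15) would be kept.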
For each reference spectrum the serial number, lowest recorded mass in the spectrum, M·+, and the values of m/e, abundance, and U + A for the 15 selected peaks are packed into 32 computer words and stored in a file which occupies 1494 blocks of 512 16-bit words each of disk storage. The disk file structure is optimized to reduce access time.

The Search Algorithm. For each reference spectrum, the search algorithm begins by examining the unknown for the presence of the reference peaks from highest to lowest m/e values. If a peak is not present, it is flagged; if the number of missing peaks exceeds the number of allowed flagged peaks, the program proceeds to the next reference spectrum. If reference peak j is found in the unknown (j = 1, 2, ..., 15), the ratio, pj, of its abundance in the unknown to its abundance in the reference is calculated, and if pj is less than the specified “minimum percent component” (which for these studies was 10% for pure compounds and 1% for mixtures unless otherwise specified), peak j is flagged; pj values are calculated for all such reference peaks unless the maximum number of flagged peaks is exceeded. The smallest pj not associated with a flagged peak (pmin) is determined, and the confidence value Kj is calculated for this peak. pmin specifies the smallest percentage of this reference compound which could be present in the unknown sample and thus directly determines the dilution factor, D. The product of the abundance of each reference peak and pmin is the abundance expected for that peak in the unknown spectrum. The reference abundance also determines the window tolerance that is demanded of the match.
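The flagging step above, together with the window test and the accumulation of K described in the next paragraph, can be sketched for a single reference spectrum as follows. This is a simplified illustration under stated assumptions, not the original implementation: the function names are hypothetical, the dilution factor is assumed to be D = -log2(pmin) (the text says only that pmin determines D), the top of the ±x% window is assumed to be bottom × (1 + x)/(1 - x), and the later step of re-running the match after flagging the pmin peak is omitted.

```python
import math

def window_tolerance(abundance):
    """Fractional tolerance for a reference peak abundance given in percent."""
    if abundance >= 9.0:
        return 0.30
    if abundance >= 3.4:
        return 0.39
    if abundance >= 1.0:
        return 0.46
    if abundance >= 0.24:
        return 0.51
    return 0.71

def match_reference(ref_peaks, unknown, min_component=0.10, max_flags=3,
                    w=3.0, dilution=lambda p_min: -math.log2(p_min)):
    """ref_peaks: list of (m/e, abundance, U + A); unknown: dict m/e -> abundance.
    Returns (K, p_min, n_flagged), or None if too many peaks are flagged."""
    flagged = 0
    kept = []
    for mz, ref_ab, ua in sorted(ref_peaks, reverse=True):  # highest m/e first
        ratio = unknown.get(mz, 0.0) / ref_ab
        if ratio < min_component:            # absent or too weak: flag it
            flagged += 1
            if flagged > max_flags:
                return None
        else:
            kept.append((mz, ref_ab, ua, ratio))
    if not kept:
        return None
    p_min = min(ratio for _, _, _, ratio in kept)
    d = dilution(p_min)                      # assumed form of D (see lead-in)
    k = 0.0
    for mz, ref_ab, ua, _ in kept:
        x = window_tolerance(ref_ab)
        bottom = ref_ab * p_min              # expected abundance in the unknown
        top = bottom * (1 + x) / (1 - x)     # assumed reading of the window
        if unknown[mz] <= top:
            k += ua + w - d                  # Kj = Uj + Aj + W - D
        # otherwise the peak is "contaminated" and contributes Kj = 0
    k -= w                                   # the p_min peak always fits the window
    return k, p_min, flagged
```

With the defaults shown (10% minimum component, three flagged peaks), the call corresponds to the settings quoted for pure compounds; for mixtures the text uses 1% and two flagged peaks.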
For these studies a ±30% tolerance was permitted for peaks of abundances ≥9%, ±39% for 3.4-9% peaks, ±46% for 1-3.4% peaks, ±51% for 0.24-1% peaks, and ±71% for peaks less than 0.24%; this gives eight abundance ranges, so W = 3 (5). The expected abundance of the unknown peak is set at the bottom of the ±x% window and the top of the window is determined. If the actual abundance of the unknown peak falls within these limits, Kj is calculated for it from Uj + Aj (determined by the peak in the reference spectrum) - D + W, and added to the accumulated K value. If the abundance of the peak in the unknown spectrum is higher than the top of this window, it is termed “contaminated,” and Kj = 0. (From the definition of pmin, the abundance of the peak will never fall below the window.) After the entire set of reference peaks has been examined, one factor of W is subtracted from the K value because the peak which gave rise to pmin is guaranteed to fall within the window. The K value resulting from the match is compared with the threshold K value (an optional threshold, 25 being used in this study); if K is smaller than the threshold, it is not stored as one of the results. Otherwise the “percent contamination” is calculated, and if it does not exceed a specified maximum, the K value is stored. The “percent contamination”, which is only an estimate of the amount of sample components other than the reference compound in the unknown, is calculated using the ten peaks of highest U + A in the unknown with Equation 3,


Table I. Arbitrary Categories for Degree of Structural Match

Class of match | Relationship of reference and “unknown” structures | Example of ref. compounds matching cis-1,4-dimethylcyclohexane as the unknown
I | Identical compound or stereoisomer | trans-1,4-Dimethylcyclohexane
II | Class I or ring positional isomer | 1,3-Dimethylcyclohexane
III | Class II or a homologue | Diethylcyclohexane
IV | Class III or an isomer of a class III compound formed by moving only one carbon atom | Trimethylcyclopentane

Figure 1. PBM performance for unknown mass spectra of pure compounds (x-axis: recall; y-axis: reliability). Curves: LMWS, ΔK values, maximum percent contamination (MPC) = 20 and MPC = 70; HMWS, ΔK values, MPC = 20 and MPC = 70; HMWS, K values, MPC = 20.

Table II. Compounds Retrieved by PBM for a Mixture of 60% 3-Methoxyindazole, 30% Carbon Tetrachloride, and 10% tert-Butyl-3-ketobutyrate

Compound | K value
3-Methoxyindazoleᵃ (33%, 33%)ᵇ | 92+, 83+
Carbon tetrachlorideᵃ (66%, 66%) | 55, 41
tert-Butyl-3-ketobutyrateᵃ (96%) | 42+
1-Methyl-3-indazolone (48%) | 34*+, 25*+
Chloropicrin (66%) | 29**
4-Amino-1-methyl-1,2,3-benzotriazole (71%) | 26*+
3-Phenyleicosane (83%) | 26**

ᵃ Correct answer. ᵇ Value of “percent contamination” (%C) found by PBM; note that (1 - %C) is only an approximation of the actual concentration of the component present.

(3)

where h equals the number of the 10 unknown peaks which are absent in the reference spectrum, k equals the number of peaks present in both spectra but not falling within the window tolerance, and Ai and Bi equal the A values of peak i in the unknown and reference spectrum, respectively. Note that the calculated contamination will be 100% if none of the 10 unknown peaks are contained in the reference spectrum. If a maximum percent contamination less than 100% is specified, reference compounds of molecular weights less than the masses of any of the ten peaks are not examined. Unless otherwise specified, for this study the maximum percent contamination was set at 20% for pure compounds and 100% for mixtures.

If the number of flagged peaks allowed has not been reached, the peak which gave rise to pmin is flagged and dropped from consideration (in effect, removed from the reference spectrum), the next lowest value of pmin is substituted, and the matching algorithm is executed again. This determines a new K value for this reference spectrum which is stored in place of the previous value if it is higher. When no more flagged peaks are allowed, the next reference spectrum is examined. In the studies reported here, a maximum number of three flagged peaks was allowed for pure compounds and two for mixtures unless stated otherwise.

With each K value reported, a ΔK value is also calculated and displayed. The ΔK value is the difference between the K value found and the maximum value that could have been achieved by a perfect match (a peak in the unknown whose abundance falls within the required window for each peak in the reference spectrum).

Methods for Evaluating PBM Performance. A “low molecular weight set” (LMWS, mol wt 144-160 amu) and a “high molecular weight set” (HMWS, mol wt 232-312 amu) of unknown spectra were created for testing PBM’s performance on the spectra of both pure compounds and mixtures.
The sets for pure compounds were composed of 433 and 415 spectra (LMWS and HMWS, respectively), which are other spectra of compounds represented in the 18 806 spectra of the data base. These test spectra represent all those available in the “duplicate” portion (serial numbers 18 807-23 879) of the Wiley magnetic tape (7), except that a small number of spectra of impure and isotopically labeled compounds were excluded. A spectrum in the “Registry” file which is of one of these compounds was combined with two other such spectra in the ratio of 60:30:10 to create LMWS and HMWS sets of synthetic mixture spectra containing 102 and 80 members, respectively. Results from the examination of these unknowns with the best other retrieval systems are reported separately (10).

To analyze the results obtained by the retrieval system, two parameters taken from the field of document retrieval (1) are defined: the recall, or the proportion of all possible matches which are actually retrieved, and the reliability, or the proportion of the retrieved spectra for which the compounds actually match. [In the document retrieval field (1), the term “precision” is used instead of “reliability”, but “precision” has a different implication in analytical chemistry.] These recall-reliability pairs are computed at various retrieval levels, e.g., at particular K value levels. The trade-off between recall and reliability is evident when recall is plotted on the x-axis and reliability on the y-axis of a plot such as that in Figure 1; a change to the system which raises the plotted line (increased reliability) in at least some areas of recall is thus an improvement to the system. Such a two-dimensional evaluation of system performance has been found (1) to be much more valuable than a single “accuracy” value, as different system applications can require different trade-offs in recall and reliability. For example, for PBM we have attempted to maximize reliability (especially for recall values [...] 60% by the use of flagged peaks, is >90% with the more lenient restriction of 70% maximum percent contamination. Thus, use of such a higher value is recommended even for unknown spectra of pure compounds; this helps for “incompatible data” (vide supra) of unknown peaks whose abundance is greater than predicted in the same way that flagged peaks help for those whose abundance is too low.

Class of Matches. Adjusting the criteria for a structural

Figure 3. Effect of structural matching criteria on ΔK values for unknown LMWS spectra of pure compounds (x-axis: recall). Class of match: I, II, III, IV.

Figure 5. PBM performance for unknown LMWS spectra of mixtures (x-axis: recall). ΔK values and proportion of component present: 60%, 30%, 10%.

Table III. Compounds Retrieved by PBM for a Mixture of 60% 1-(2-Methylcyclohexyl)-3-phenylurea, 30% 1,2′-Binaphthyl, 10% O,O-Dimethyl-O-(4-nitro-m-tolyl) phosphorothioate

Compound | K value
1,2′-Binaphthylᵃ (51%, 55%)ᵇ | 89+, 59+
1,1′-Binaphthyl (51%, 69%) | 83*+, 39*+
O,O-Dimethyl-O-(4-nitro-m-tolyl)-phosphorothioateᵃ (84%, 88%, 93%) | 83+, 38+, 35*
1-(2-Methylcyclohexyl)-3-phenylureaᵃ (61%, 61%) | 73+, 57**
2,2′-Binaphthyl (70%, 69%)ᵇ | 44*+, 43+
α-Phenyldibenzofulvene (53%)ᵇ | 41+
3,4-Benzpyrene (89%, 89%) | 37**, 37**
γ-Terpinene (65%) | 39**

Figure 4. Effect of structural matching criteria and molecular ion information (K+) on ΔK values for unknown HMWS spectra of pure compounds. Class of match: I, I+, II, III, IV, IV+.

match (Table I) is equivalent to changing the relevancy decisions in a document retrieval system (1). Figures 3 and 4 clearly show that there is a very high probability that if a spectrum retrieved with a low ΔK value is not that of the compound identical to the unknown, it is actually that of a ring positional isomer, a homologue, or a compound whose structure differs only by the position of one carbon atom. In fact, for the higher K (or lower ΔK) values, most of the small proportion of remaining retrieved compounds that are not even in class IV are of related structure, such as dimethylhexadecanoic acid matched with octadecanoic acid. This behavior, which is also found for other retrieval systems (1, 3, 4, 6), shows that there are substantial cross-correlations of peak uniqueness values, as postulated in the explanation above of the reliability achieved vs. that predicted for a particular K value.
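The recall and reliability measures defined under Methods for Evaluating PBM Performance, which underlie Figures 1 and 3-5, can be computed along the following lines. This is a minimal sketch: the function name and the sample data are hypothetical, and each retrieved spectrum is assumed to have already been judged a match or mismatch under the chosen class of match (Table I).

```python
def recall_reliability(results, n_possible, thresholds):
    """results: list of (k_value, is_match) for every retrieved reference spectrum.
    n_possible: number of matches that could in principle be retrieved.
    Returns a list of (threshold, recall, reliability) tuples."""
    curve = []
    for t in thresholds:
        retrieved = [ok for k, ok in results if k >= t]
        if not retrieved:
            continue                      # nothing retrieved at this level
        n_matches = sum(retrieved)
        curve.append((t, n_matches / n_possible, n_matches / len(retrieved)))
    return curve

# Made-up retrieval results: (K value, whether the compound actually matches)
results = [(130, True), (130, False), (90, True), (60, True), (40, False)]
for level in recall_reliability(results, n_possible=4, thresholds=[130, 80, 30]):
    print(level)
```

Lowering the K threshold moves along the curve toward higher recall and, typically, lower reliability. For ΔK values the comparison would be reversed, since smaller ΔK indicates a better match. Relaxing the class of match only changes which retrievals are counted as matches.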


a,b See Table III.

The relative effects of changing data classes on the results for the LMWS and the HMWS are significantly different: the largest proportion of the class I mismatches for the LMWS are ring positional isomers (Figure 3), whereas in the HMWS the homologues of the unknowns are the most significant (Figure 4). For example, at a ΔK of 30 for the LMWS, two thirds of the class I mismatches are ring isomers, half of the remaining mismatches are homologues, and about one third of the remainder belong to class IV; for the HMWS at ΔK = 30, well over half of the mismatches are homologues, while nearly half of the remainder are isomers which can be formed by moving only one carbon atom. This differing importance of the structural classes for the LMWS and the HMWS appears to be mainly an artifact of the makeup of the reference file. For example, the LMWS contains numerous dimethylnaphthalenes and dimethylindoles, with the positions of the two methyl groups occurring in various permutations on the rings; the spectra of these isomers are very similar. The HMWS contains a number of homologous long-chain aliphatic hydrocarbons and their derivatives such as primary alcohols and esters. Although the LMWS also contains many spectra of homologues, the peak abundances of these spectra are much more sensitive to the addition of a methylene group; the fragmentation patterns of methyl acetate and methyl propionate are easily distinguishable, while those of methyl heptadecanoate and methyl octadecanoate are nearly identical except in the molecular ion region. This can account also for the much larger effect of class IV data on the HMWS than on the LMWS, as a single “misplaced” carbon atom will tend to have a much smaller effect.

Table IV. Compounds Retrieved by PBMᵃ for a Mixture of 90% Methyl n-Octadecanoate + 10% Methyl cis-9-Octadecenoate

Compound | Confidence value K | ΔK | Percent contamination | Percent component
Methyl n-octadecanoate,ᵇ C19H38O2 | 134+, 95, 92+, 90**+, 78+ | 0, 7, 10, 12, 24 | 2, 27, 17, 22, 24 | 90, 76, 67, 91, 53
Methyl behenate, C23H46O2 | 112*, 79, 61 | 0, 23, 41 | 27, 30, 66 | 63, 54, 23
Methyl 16-methylheptadecanoate, C19H38O2 | 103+ | 0 | 23 | 65
Methyl arachidate, C21H42O2 | 101, 77*, 73** | 1, 25, 29 | 30, 27, 46 | 54, 56, 27
Methyl heneicosanoate, C22H44O2 | 101** | 1 | 27 | 87
Methyl nonadecanoate, C20H40O2 | 95**, 65 | 7, 31 | 21, 50 | 78, 56
Methyl cis-9-octadecenoateᵇ | 85*+, 62**+ | 17, 40 | 84, 91 | 14, 10
Methyl 13,14-dideuteriooctadecanoate | 81** | 21 | 24 | 73
Methyl myristate, C15H30O2 | 73**, 71** | 29, 31 | 32, 34 | 100, 97
Methyl palmitate, C17H34O2 | 70** | 32 | 34 | 80
Methyl heptadecanoate, C18H36O2 | 66** | 36 | 36 | 61

ᵃ PBM specifications modified to include >15 peaks for reference compounds of molecular weight >170; 35 828 reference spectra searched. ᵇ Correct answer.

Molecular Ion Information (K+). The molecular ion provides additional information which is especially valuable for distinguishing between homologues, as seen for HMWS data in Figure 4. The increases in reliability obtained by examining those class I matches retrieved with a K+ value are nearly commensurate with the increases obtained by using class III matching criteria with K values; obviously, the molecular ion should be uniquely effective in distinguishing between homologues. The same effect is significant in the consideration of class IV data as well. Thus for a high molecular weight unknown (Figure 4) a ΔK+ value ≤ 40 provides a 95% confidence of at least a class IV match, while for a low molecular weight unknown, a ΔK+ ≤ 30, which is obtained for nearly 50% of all possible matches, provides >99% confidence of a class IV match (10). Note, however, that a high K+ value does not necessarily ensure that the reference compound has the same molecular weight as the unknown. The molecular ion of a lower molecular weight compound can occur as an odd-electron fragment ion in the spectrum of a higher molecular weight unknown; matching the molecular ion is a necessary but not sufficient condition to prove identical molecular weights. The LMWS data (10) indicate that the K+ values are of little benefit in distinguishing between ring position isomers (class II), as would be expected.

K vs. ΔK Values. The recall/reliability performances using K and ΔK values show appreciable differences. For the HMWS (Figure 1), the reliability achieved using K values is superior for recalls of 50-80%, while the opposite is true for the LMWS (10) for this recall range. Because the best ΔK value (zero) is the same for all reference spectra, at this value 12-15% of the possible matches are already retrieved; higher reliability can be achieved for the LMWS (10) using K values ≥ 100 (for K ≥ 110, no mismatches and 4% recall were found for the spectra studied). These reliability results at low recall levels are based on samples that are small statistically; for the HMWS (Figure 1), the decrease in reliability at the highest K values is probably an artifact of the small data set (1). Here the observed 50% reliability is due to the fact that one of the two spectra retrieved at K ≥ 130 is a mismatch, the spectrum of hexachlorofulvene retrieved for hexachlorobenzene as an unknown (actually a match by class IV criteria); the close similarity of these two spectra has been pointed out (13).

Reliability Value as the Criterion of Match. The K and ΔK values found for a particular selected reference compound can thus give substantially different levels of confidence based on the recall/reliability performance. Also (vide supra) the reliability found for a particular value of K or ΔK is substantially dependent on the molecular weight, number of flagged peaks, maximum percent contamination, class of match, and inclusion of the molecular ion. Based on these recall/reliability studies (10), we are at present modifying PBM to convert the various types of K and ΔK values found for each reference spectrum to a “predicted reliability” value which can be used in place of the K value for ranking the matches found, and which should provide a more direct measure of the confidence which the interpreter can place in the result with respect to each class of match.

Application to Spectra of Unknown Mixtures. The reliability achievable by PBM, not surprisingly, is reduced approximately in proportion to the concentration of the unknown compound in a mixture (Figure 5). However, for unknown mixtures, the reverse searching (5, 6) is of substantial value (10), PBM showing recall/reliability performance for components present in 30% concentration that is substantially above that of a forward-search system for 60% components. The PBM performance for 60% components appears to be superior to that shown earlier for the spectra of pure compounds (Figure 1). Although different data sets have been used, this is due mainly to the fact that the spectra used in making up the unknown mixture spectrum were not eliminated from the reference file. Relaxing the matching criteria improves the precision for mixture spectral retrieval in the same fashion as observed for the spectra of pure compounds (10).

Examples. Tables II and III show the compounds retrieved using PBM for spectra of “unknown” LMWS and HMWS mixtures.
The first spectrum was created by combining the spectra of 3-methoxyindazole, carbon tetrachloride, and tert-butyl-3-ketobutyrate in a 60:30:10 proportion. PBM has identified the major component, 3-methoxyindazole, with high confidence [ΔK+ values corresponding (10) to >95% precision]. Although the confidence associated with the 30% and 10% components is much lower, molecular ion information and no flagged peaks were utilized in retrieving the tert-butyl-3-ketobutyrate spectrum, so that the confidence of that match is much greater than the confidence in any of the incorrect retrievals, for all of which the use of flagged peaks was necessary for matching.
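A synthetic mixture spectrum of the kind used for these examples can be built by a weighted sum of the component spectra. The sketch below is illustrative only: the function name is hypothetical, each component spectrum is assumed to be normalized to a base peak of 100, and rescaling the sum so that its own base peak is 100 is an assumption not stated in the text.

```python
def mix_spectra(spectra, weights):
    """Weighted sum of spectra (dicts of m/e -> abundance),
    rescaled so that the most abundant peak of the mixture is 100."""
    mixed = {}
    for spectrum, weight in zip(spectra, weights):
        for mz, abundance in spectrum.items():
            mixed[mz] = mixed.get(mz, 0.0) + weight * abundance
    base = max(mixed.values())
    return {mz: 100.0 * ab / base for mz, ab in mixed.items()}

# Three made-up component spectra combined in a 60:30:10 ratio
a = {100: 100.0, 50: 50.0}
b = {50: 100.0}
c = {30: 100.0}
mixture = mix_spectra([a, b, c], [60, 30, 10])
```

In this toy case the peak at m/e 50 receives contributions from two components, which is the kind of “contaminated” peak that the reverse search is designed to tolerate.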

Table I11 presents the results of a mixture of the herbicide Siduron, 1-(2-methylcyclohexyl) -3-phenylurea (60%), 1,2’binaphthyl (30%), and the insecticide Sumthion, 0,O-dimethyl-0-(4-nitro-m-tolyl)phosphorothioate(10%).All three components are retrieved by PBM, and the other compounds selected are structurally similar. PBM Improvements. Inclusion of other data, such as GC retention indices ( 1 4 ) ,should improve PBM performance in specific applications. Retrieval was also tested on the same data set without weighting or reverse search using -30% and -100% more peaks for the LMWS and HMWS, respectively; although in the high reliability (>50%) range the PBM performance was clearly superior for the LMWS, PBM gave only ‘ an equivalent performance for the HMWS. Because of this, we have increased the number of peaks for the system presently used: for the molecular weight range beginning at 170 ’ amu, 16 peaks; 180, 17; 195, 18; 215, 19; 240, 20; 270,21; 305, 22; 350,23; 420, 24; 500, 25; and L600,26. Results using this modified PBM system with 35 828 reference spectra are shown in Table IV. For an unknown spectrum made by combining 90% methyl stearate and 10% methyl oleate, the 22 spectra retrieved with highest K values were either correct answers or closely related molecules; note that the correct compounds, but not homologues, have been retrieved with K + values. 
For our use, high performance at high (>50%) reliability levels was a prime objective for PBM, as an unknown spectrum for which a match of high confidence could not be obtained should also be interpreted by a mass spectrometrist or a computer interpretive system such as STIRS (8). If instead better PBM performance is desired at high recall values (Figure 1), this should be helped by “skewing” the unknown spectrum, either increasing or decreasing the observed peak abundances as a function of mass; this should compensate for instrumental mass discrimination or for changing sample concentration during spectral recording which has occurred for either the unknown or reference spectrum.

The PBM results thus confirm the advantages of both the reverse search strategy (5, 6) and the weighting of mass and abundance values of peaks (5) for matching unknown mass spectra. Reducing the number of peaks necessary to achieve relatively high reliability also yields a significant reduction in search time requirements; the recent increase in the Cornell data base to >41 000 spectra has made this especially valuable. Also, it should be possible to do such a PBM search in real time for GC/MS. For example, matching against a reference file of 1500 spectra during quadrupole MS data acquisition and reduction by a DEC PDP-8 computer (16K words core, 1.6M words disc storage) should require ~2 s for an unknown mass spectrum.
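
The “skewing” of an unknown spectrum suggested above can be illustrated with a minimal sketch. The text specifies only that abundances are increased or decreased as a function of mass; the linear form used here, the parameter `tilt`, and the function name `skew_spectrum` are assumptions made for illustration and are not the authors' procedure.

```python
def skew_spectrum(spectrum, tilt):
    """Scale abundances linearly with mass and renormalize to base peak 100.

    spectrum: dict mapping mass (amu) to abundance.
    tilt: fractional change applied at the highest mass relative to the
          lowest mass (assumed linear in between); positive values
          enhance high-mass peaks, negative values suppress them.
    """
    if not spectrum:
        return {}
    if tilt <= -1:
        raise ValueError("tilt must be greater than -1")
    low, high = min(spectrum), max(spectrum)
    span = (high - low) or 1  # avoid division by zero for a single peak
    skewed = {
        mass: abundance * (1 + tilt * (mass - low) / span)
        for mass, abundance in spectrum.items()
    }
    base = max(skewed.values())
    if base == 0:
        return skewed
    return {mass: 100.0 * a / base for mass, a in skewed.items()}


# Hypothetical unknown spectrum, skewed to favor high-mass peaks by 50%.
unknown = {50: 100.0, 100: 50.0, 150: 100.0}
compensated = skew_spectrum(unknown, tilt=0.5)
```

A search could then be repeated over a few tilt values and the best match retained, although whether that improves recall in practice would have to be tested against real data.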

ACKNOWLEDGMENT

We thank Gerard Salton, Robert Hertel, and Robert Villwock for valuable discussions.

LITERATURE CITED

(1) G. Salton, “Automatic Information Organization and Retrieval”, McGraw-Hill, New York, 1968.
(2) G. M. Pesyna and F. W. McLafferty in “Determination of Organic Structures by Physical Methods”, Vol. 6, F. C. Nachod, J. J. Zuckerman, and E. W. Randall, Ed., Academic Press, New York, 1976.
(3) L. R. Crawford and J. D. Morrison, Anal. Chem., 40, 1464 (1968).
(4) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
(5) F. W. McLafferty, R. H. Hertel, and R. D. Villwock, Org. Mass Spectrom., 9, 690 (1974).
(6) F. P. Abramson, Anal. Chem., 47, 45 (1975).
(7) E. Stenhagen, S. Abrahamsson, and F. W. McLafferty, “Registry of Mass Spectral Data” (magnetic tape file), Wiley-Interscience, New York, 1974.
(8) K.-S. Kwok, R. Venkataraghavan, and F. W. McLafferty, J. Am. Chem. Soc., 95, 4185 (1973).
(9) J. E. Freund, “Mathematical Statistics”, Prentice-Hall, Englewood Cliffs, N.J., 1962, p 46.
(10) G. M. Pesyna, Ph.D. Thesis, Cornell University, 1975.
(11) G. M. Pesyna, F. W. McLafferty, R. Venkataraghavan, and H. E. Dayringer, Anal. Chem., 47, 1161 (1975).
(12) F. W. McLafferty, “Interpretation of Mass Spectra”, Second ed., Benjamin Addison-Wesley, Reading, Mass., 1973, p 81.
(13) S. Meyerson and E. K. Fields, J. Chem. Soc. B, 1001 (1966).
(14) C. E. Costello, H. S. Hertz, T. Sakai, and K. Biemann, Clin. Chem., 20, 255 (1974).

RECEIVED for review September 4, 1975. Accepted April 21, 1976. We are grateful to the Environmental Protection Agency (grant R-801106) and the National Institutes of Health (GM 16609) for generous financial support, and to the National Science Foundation (grant MPS-74-19871) for partial funds for purchase of the computer.

Extraction of Mass Spectra Free of Background and Neighboring Component Contributions from Gas Chromatography/Mass Spectrometry Data

R. G. Dromey,’ Mark J. Stefik, Thomas C. Rindfleisch,’ and Alan M. Duffield2

Departments of Computer Science, Genetics, and Chemistry, Stanford University, Stanford, Calif. 94305

An effective, minicomputer-based method is described for systematically extracting resolved mass spectra of mixture components from GC/MS data. Using tabular peak models derived directly from the raw data, the spectra have column bleed background removed and are corrected for interference from neighboring elutants and peak saturation. Individual components are detected in the data by means of a pair of histograms which statistically characterize the positions of mass fragmentogram peak modes. These data-adaptive corrections avoid costly iterative numerical procedures and allow obtaining representative mass spectra from GC/MS data of complex mixtures on a routine basis. Using this approach, components that elute within less than two spectral scan times of each other can be detected and their mass spectra well resolved.

Present address, Research School of Chemistry, Australian National University, Canberra, A.C.T., Australia.

Present address, School of Physiology and Pharmacology, University of New South Wales, 2033, Australia.

With the increasing application of gas chromatography/mass spectrometry (GC/MS) systems to mixture component identification in biomedical research (1, 2) and other areas (3), it has become important to be able to systematically isolate and identify minor components in the complex mixtures being analyzed. Because of instrumentation limitations, the mass