Anal. Chem. 1991, 63, 546-550


Exact-Mass Probability Based Matching of High-Resolution Unknown Mass Spectra

Stanton Y. Loh¹ and Fred W. McLafferty*

Baker Chemistry Laboratory, Cornell University, Ithaca, New York 14853-1301

Unknown mass spectra measured with millimass accuracy can be matched against a comprehensive unit-mass-resolution data base of electron ionization spectra by utilizing its information on molecular elemental compositions and known correlations of common neutral species lost in ion dissociations. Adding this exact (E) mass capability to the probability based matching (PBM) algorithm provides substantial performance improvements. Using matching criteria that retrieve 80% of the correct answers, EPBM increases the reliability of retrieving a spectrum of the same structure from 23% to 39%; accepting structural differences to which mass spectrometry is insensitive (class IV matches), EPBM increases the reliability from 44% to 71%, halving the number of wrong answers. Similarly, for EPBM only 6% of best matches are incorrect (class IV) versus 10% by PBM.

INTRODUCTION

Using high-resolution mass spectrometry to measure ion masses with millimass accuracy greatly enhances the chemical specificity of the spectral information versus that from unit-mass accuracy instruments. Peaks representing ions such as C2H3O+, C2H5N+, and C3H7+ (m/z 43.0184, 43.0421, and 43.0547, respectively) can be easily distinguished, with the number of possible elemental compositions at a unit resolution peak rapidly increasing with mass. Although computer algorithms for matching against a mass spectral library (1-10) are sophisticated and widely used, none exploits such exact mass data despite its availability for many years on gas chromatography/mass spectrometry (GC/MS) instrumentation. A basic problem is that no commercially available data base contains high-resolution spectra. We describe here an extension of the probability based matching system (PBM) (2, 5-10) in which elemental compositions for the peaks in the unit resolution data base (12) are assigned or restricted according to the molecule's elemental formula and expected ion dissociations. The successful application of PBM requires weighting factors for the occurrence probability of such elemental composition assignments within the data base; these factors are derived from two different sources.

PBM, a "reverse search" system not requiring pure samples, is designed to predict the match reliability between two spectra from the probability that their degree of similarity would occur by chance (2, 8, 10). The match confidence (K) value is the sum of K_i values for each matching peak; K_i = (U_i + A_i - W - D), where U_i (uniqueness) and A_i (abundance) are the -log2 probabilities that a peak of such mass and abundance will occur in the reference file, W is the window tolerance for an abundance match, and D (dilution factor) represents the fraction of reference abundance used in the match (2).
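The summation of per-peak confidence values described above can be illustrated with a minimal sketch. It assumes the per-peak form K_i = U_i + A_i - W - D with all four terms already expressed in the same log2 units; the class name, function names, and numeric values are hypothetical and are not taken from the PBM program itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PeakMatch:
    """One reference peak found in the unknown (hypothetical container).

    All fields are assumed to be in log2 units, as for U and A in the text.
    """
    uniqueness: float  # U_i
    abundance: float   # A_i
    window: float      # W, window tolerance for the abundance match
    dilution: float    # D, dilution factor


def peak_confidence(peak: PeakMatch) -> float:
    """Assumed per-peak form: K_i = U_i + A_i - W - D."""
    return peak.uniqueness + peak.abundance - peak.window - peak.dilution


def match_confidence(peaks: Iterable[PeakMatch]) -> float:
    """K is the sum of the K_i values over all matching peaks."""
    return sum(peak_confidence(p) for p in peaks)


# Illustrative values only; they are not taken from any reference file.
example_peaks = [
    PeakMatch(uniqueness=6.0, abundance=2.0, window=1.0, dilution=0.5),
    PeakMatch(uniqueness=4.0, abundance=1.0, window=1.0, dilution=0.5),
]
example_k = match_confidence(example_peaks)
```

The adjustments for molecular ion matching, flagging, and quadratic scaling are deliberately left out of this sketch.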
These probabilities are further adjusted for molecular ion matching, the number of reference peaks not found ("flagged") (5), the abundance adjustments required ("quadratic scaling") (5), and


¹Present address: Palisade Corp., 31 Decker Rd, Newfield, NY 14867.

Table I. Neutral Losses Allowed in Assigning Fragment Ion Compositions

mass     formula
1        H
2        H2
3        H3
4        H4
14       N
15       CH3
16       CH4, H2N, O
17       CH5, H3N, HO
18       H2O
19       H3O, F
20       HF
26       C2H2, CN
27       C2H3, CHN
28       C2H4, N2, CO
29       C2H5, CH3N, CHO
30       C2H6, NO, CH2O
31       C2H7, CH3O
32       CH4O, S
33       CH5O, CH2F, HS
34       CH6O, H2S
35       CH4F, H3S, Cl
36       H4O2, HCl
38       H6O2
39       C3H3
40       C3H4, C2H2N
41       C3H5
42       C3H6, C2H2O
43       C3H7, CHNO, C2H3O
44-58    ...
67       C5H7, CH4ClO
68       C5H8, CH5ClO, C4H6N
79       C3H8Cl, C5H5N, Br
80       C3H9Cl
81       C6H9, C2H6ClO
82       C6H10, C5H8N, C2H7ClO
83-100   ...

the subtraction of retrieved reference spectra of possible impurities ("forward searching of mixture spectra") (6-8). A preliminary report of this work has appeared (9).

EXPERIMENTAL SECTION

Program development was done on a DEC MicroVAX II (2 Mbyte memory, VMS operating system) in VAX Fortran and VAX C. The reference file searched contained up to 27 (depending on molecular weight) of the most important (highest U + A values) peaks (2) of each spectrum given in ref 11, which contains 139 859 spectra of 118 144 different compounds and from which 3893 spectra of isotopically labeled compounds were excluded. Elemental composition restrictions were assigned for reference file

© 1991 American Chemical Society 0003-2700/91/0363-0546$02.50/0

ANALYTICAL CHEMISTRY, VOL. 63, NO. 6, MARCH 15, 1991

Table II. PBM Class 4 Matching Criteria^a

Figure 1. U_c values from (○) ref 16 and (asterisk) 118 144-compound data base for (lower curves) CnH2n-8•+, (middle curves) CnH2n-1O+, and (upper curves) CnH2n+1S+.

peaks formed by the loss of common neutral species such as CH3, H2O, and HCl; the common neutral losses (12) used are in Table I. The occurrence frequency F of each elemental composition c was used to calculate its probability P(c) at the ith m/z value from eq 1. Individual values (e.g., Figure 1) for U_c ≤ 7.6 were stored

in a look-up table. As a trivial example, at m/z 13 the only assignment found is CH; thus its U_c value is 0, indicative that determination of the m/z 13 elemental composition provides no gain in information.

The EPBM input for the unknown spectrum is a PBM format file (unit resolution m/z and abundance pairs) followed by the measured exact masses and their elemental composition assignments. EPBM performs a basic PBM search (2, 5-8) with some additions. If peaks corresponding to M•+ (molecular ion), (M - 1)+ (loss of hydrogen), or (M - 15)+ (loss of methyl) are present in the reference spectrum, the corresponding compositions are sought among those of the unknown by the flagging routine; a total of three missing peaks are allowed (5) before the match is cancelled. For each reference peak composition consistent with one of those of the unknown, the K value is incremented by the corresponding U_c value; if more than one composition is possible for the unknown or reference peak, a smaller U_c value obtained by adding the probabilities of the possible compositions is used.

EPBM performance was evaluated (13-15) by using the previous set (7) of 385 randomly selected reference spectra as unknowns. High-resolution unknowns were simulated by assigning to each peak the most probable composition consistent with the molecular formula. Each unknown spectrum was removed from the data base, but at least one spectrum of the corresponding compound remained. Retrieval was evaluated according to two matching classes (2, 6, 7): class I denotes an identical compound or stereoisomer, and class IV includes compounds whose structural differences usually have little effect on the mass spectrum; our previous definitions of acceptable differences have been made more specific (Table II). The reliability (proportion of retrieved spectra that are correct) is plotted as a function of recall (proportion of class I matches retrieved).
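The composition uniqueness values and the rule for peaks with several possible compositions can be sketched as follows. This is only an illustration under stated assumptions: eq 1 is taken to have the form P(c) = F(c) / (sum of F over all compositions at that unit mass), the uniqueness is taken as the -log2 of that probability, and the function names and the occurrence counts for m/z 43 are hypothetical. Only the 7.6 ceiling and the m/z 13 example come from the text.

```python
import math
from typing import Dict, Iterable

U_MAX = 7.6  # maximum composition uniqueness value, as stated in the text


def composition_probabilities(frequencies: Dict[str, float]) -> Dict[str, float]:
    """Assumed form of eq 1: P(c) = F(c) / sum of F at one unit mass."""
    total = sum(frequencies.values())
    return {c: f / total for c, f in frequencies.items()}


def uniqueness(probability: float) -> float:
    """Uniqueness as -log2 P(c), written as log2(1/P) and capped at U_MAX."""
    return min(math.log2(1.0 / probability), U_MAX)


def combined_uniqueness(probs: Dict[str, float], candidates: Iterable[str]) -> float:
    """Several possible compositions: add their probabilities, then convert.

    The summed probability is larger, so the resulting value is smaller
    than that of any single candidate composition.
    """
    return uniqueness(sum(probs[c] for c in candidates))


# m/z 13: CH is the only assignment found, so its probability is 1.
probs_13 = composition_probabilities({"CH": 250})
u_ch = uniqueness(probs_13["CH"])

# Hypothetical occurrence counts at m/z 43, for illustration only.
probs_43 = composition_probabilities({"C3H7": 600, "C2H3O": 300, "C2H5N": 100})
u_single = uniqueness(probs_43["C2H5N"])
u_ambiguous = combined_uniqueness(probs_43, ["C2H3O", "C2H5N"])
```

A look-up table like the one described could then be filled by applying `uniqueness` to every composition at every unit mass in the reference file.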
RESULTS AND DISCUSSION

Increased Peak Uniqueness from Exact-Mass Data. Matching the elemental composition, not just the mass, of an unknown peak increases the probability of a structural match. This increase should be reflected by the corresponding increase in the uniqueness of the peak in the reference file, U_c, evaluated from the occurrence probability P(c) of that composition c (eq 1) versus all compositions possible at that unit mass. The only tabulation of such elemental composition probabilities found was that in ref 16, derived from a rigorous interpretation of 32 830 Wiswesser-encoded spectra; however, only data to m/z 150 are included. Probability data were also obtained from the 118 144-compound file used here by restricting the peak elemental composition assignments by that of the compound and the common neutral losses of Table I. Fortunately,

Same compound, stereoisomer, tautomer, ring positional isomer (pyrazole is a match for imidazole), or homologue created by adding or subtracting (-CH2-)n, where n is ≤25% of the total number of carbons in the larger molecule, and a match formed by any one of the following operations on one of the molecules.
(1) Closing a ring by addition to a double or triple bond: cyclohexane is a match for 1-hexene but not for 2-hexene.
(2) Moving the position of a double or triple bond by not more than 25% of the number of carbon atoms in the molecule: CH2=C=CH(CH2)[?]CH3 is a match for CH2=CHCH2CH=CH(CH2)[?]CH3 or HC≡C(CH2)[?]CH3.
(3) Moving up to 25% of the carbon atoms (with any attached halogen atoms) within a nonaromatic hydrocarbon part of the molecule: C6H5(CH2)5C(CH3)2OH is a match for C6H5(CH2)8OH but not for C6H5(CH2)5CH(CH3)OCH3 or CH3C6H4(CH2)[?]CH(CH3)OH.
(4) Moving one halogen atom within a nonaromatic part of the molecule if the molecule contains only C, H, and halogen: Cl(CH2)3Cl is a match for CH3(CH2)2CHCl2.
(5) Removing H2O or HCl: CH3(CH2)[?]CH=CH2 is a match for CH3(CH2)[?]OH.
^a These criteria represent a more careful definition, but not an expansion, of those used previously (2, 5-7).

Figure 2. Actual versus predicted class 4 match reliability for retrieval of 385 unknown spectra for average U_c values of (A) 0.3, (B) 0.6, (C) 1.1, (D) 1.8, and (E) 3.5, and (F) that for all spectra after correcting the RL values according to the correlation lines of A-E.

the resulting U_c values agree surprisingly well with those of ref 16, as illustrated by the Figure 1 comparison of U_c values for three homologous ion series, CnH2n-8•+, CnH2n-1O+, and CnH2n+1S+. The CnH2n+1S+ series represents the worst agreement of 25 series compared; even here the increase of U_c with mass shows the same high slope, and the lower U_c values used will give only a few percent more conservative reliability predictions. This justifies the less rigorous use of Table I data for elemental composition assignment, as well as providing evidence that the composition distribution of the 32 830-compound data base is quite representative of more comprehensive files (8-10, 14, 17). Occurrence probabilities lower than that corresponding to U_c = 7.6 are of poor statistical significance; compositions of such probabilities are assigned the maximum U_c value of 7.6.

Predicted vs Actual Matching Reliabilities. To test the statistical validity of these U_c values within the PBM context, the resulting predicted values of matching reliability are compared with those actually found. Using reliability (RL) values from evaluating EPBM with the 385 randomly selected unknowns, the agreement was found to be relatively satisfying but dependent on the magnitudes of both the predicted RL and the average U_c values (Figure 2A-E, representing equivalent numbers of retrieved answers). As done for pre-


Table III. Incorrect Class 4 Matches at the 18% Recall Level (columns: PBM, EPBM)