
Anal. Chem. 1981, 53, 386-392

Prediction of Substructures from Unknown Mass Spectra by the Self-Training Interpretive and Retrieval System

Kevin S. Haraki,¹ Rengachari Venkataraghavan,¹ and Fred W. McLafferty*

Department of Chemistry, Baker Laboratory, Cornell University, Ithaca, New York 14853

Lists of 190 and 397 substructures have been selected from 7000 on the basis of the ability of the computer algorithm “self-training interpretive and retrieval system (STIRS)” to recognize them in unknown mass spectra. In comparison to the previous best list of 101 substructures, the new list of the best 190 shows only 20% as many incorrect identifications, with a 20% increase in recall. Statistics based on the performance of 899 “unknown” spectra selected at random are used to calculate and rank the reliability of the STIRS predictions of the presence of each of the 587 substructures. This list also represents an extensive evaluation of the capability of mass spectrometry for molecular structure elucidation.

¹ Present address: Lederle Laboratories, Pearl River, NY 10965.

The identification of unknown components in complex organic mixtures is a problem of increasing importance in many areas, such as clinical, forensic, and environmental chemistry. The major use of automated gas- or liquid-chromatography/mass spectrometry systems (1-3) for such analyses has led to the rapid development of computer systems for the identification of unknown mass spectra (4, 5). An appropriate reference file is usually searched first by using a retrieval algorithm; if a sufficiently good match is not found, an interpretive algorithm can be used to help identify substructures (6), molecular weight (7), or other features of the unknown. The only broadly applicable interpretive algorithm which is generally available is the self-training interpretive and retrieval system (STIRS) (8); substructure information is predicted by inspecting the best matching reference compounds found by using 16 classes of mass spectral data selected to be indicative of a variety of structural moieties. For a specific list of 179 substructures these predictions can be made automatically based on a random-drawing model (6). This paper describes improvements to STIRS which include a new list of 190 substructures for which recall has been improved by >20% and the occurrence of false positives has been reduced to ~20% of that of the previous list. These substructures, plus a secondary list of 397, were selected as the best from an original list of 6389 and their combinations found in a computer-aided spectra-structure correlation (9). These 587 substructures also provide the first comprehensive list of the types of molecular structure information best determined by mass spectrometry.

EXPERIMENTAL SECTION

A more extensive description of this work is available (10). All programs are written in IBM 370 assembler language or FORTRAN for execution on an IBM 370/168 computer running the Virtual Machine Facility/370-Conversational Monitor System. The data base of mass spectra of 25598 different compounds was taken from ref 11, selecting those compounds containing only the elements Br, C, Cl, F, H, I, N, O, P, S, and Si, and only those for which Chemical Abstracts connection-table descriptions of the structures were available (12). A list of 6389 substructures was manually selected as the most promising from a larger computer-generated list of substructures (9) found to correlate with specific mass spectral peaks of the data base (11). The presence of each substructure, and combinations of up to four of these, in each reference compound was determined (10) using its Chemical Abstracts connection table (12), with this information stored as a bit map with each reference spectrum. The basic STIRS program used has been described (6, 8, 10). Match factor values (MF) expressing the degree of match found for a particular data class are calculated with the new eq 1, in which

$$
\mathrm{MF} = \frac{\sum_{k=1}^{M}\left[(A_i + A_j)_k - \tfrac{1}{2}\left(|\Delta A_k| + |\Delta A_k - \overline{\Delta A}|\right)\right]}{\sum_{i=1}^{m} A_i + \sum_{j=1}^{n} A_j} \tag{1}
$$

the peak abundances are incorporated as A values (13): abundance 0.24-1.0%, A = 1; up to 3.4%, 2; up to 9%, 3; up to 19%, 4; up to 38%, 5; up to 73%, 6; up to 100%, 7. $\Delta A_k$ is the difference in A values for the kth match, $\overline{\Delta A}$ is the average of such differences, m and n are the number of m/z values in the data class for the unknown and reference, respectively, M is the number of m/z values in common, and $A_i$ and $A_j$ are the A values associated with the kth matching m/z value for the unknown and reference, respectively.

The capability of STIRS for predicting each substructure was evaluated using only the overall match factor, MF11.0, which represents a weighted combination of the match factor values from the individual data classes. For substructure evaluations, neutral-loss data class MF6B (14) was added to those used previously (a slightly improved performance also results from the further addition of MF5C and MF6C).

As test unknowns 899 spectra were selected from the reference file at random. The statistical validity of this sample was demonstrated in a separate test in which doubling this set size gave average results which were the same within ~1% (10). Only those substructures occurring at least 20 times in the reference file and twice in the unknown set of 899 compounds were tested; 484 substructures and 365 additional combinations of these fulfilled these requirements, so that a total of 849 substructures were actually tested. Evaluations used previous definitions (15) for values of recall (RC: the number of correct identifications, $I_c$, divided by the number of possible correct identifications, $P_c$), false positive (FP: the number of false identifications, $I_f$, divided by the number of possible false identifications, $P_f$), and reliability, RL (eq 2).

$$
\mathrm{RL} = \frac{I_c}{I_c + I_f} = \frac{P_c\,\mathrm{RC}}{P_c\,\mathrm{RC} + P_f\,\mathrm{FP}} \tag{2}
$$
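The A-value scale and the match factor of eq 1 can be illustrated with a minimal Python sketch. Several points here are assumptions rather than details given in the text: a spectrum is represented as a dictionary mapping m/z to percent abundance, peaks below 0.24% are given A = 0 and ignored, $\Delta A_k$ is taken as the unknown's A value minus the reference's, and the function names `a_value` and `match_factor` are hypothetical. The sketch covers one data class only and is not the STIRS implementation.

```python
from bisect import bisect_left

# Upper abundance bounds (% of base peak) for A = 1..7, as listed in the text.
A_UPPER_BOUNDS = [1.0, 3.4, 9.0, 19.0, 38.0, 73.0, 100.0]


def a_value(abundance_pct):
    """Map a percent abundance to its A value (1..7)."""
    if abundance_pct < 0.24:
        return 0  # assumption: peaks below the lowest listed bound are ignored
    return bisect_left(A_UPPER_BOUNDS, min(abundance_pct, 100.0)) + 1


def match_factor(unknown, reference):
    """Evaluate eq 1 for one data class; inputs map m/z -> percent abundance."""
    a_unk = {mz: a_value(p) for mz, p in unknown.items() if a_value(p) > 0}
    a_ref = {mz: a_value(p) for mz, p in reference.items() if a_value(p) > 0}

    # Denominator: sum of all A values of the unknown (m peaks) and reference (n peaks).
    denominator = sum(a_unk.values()) + sum(a_ref.values())
    common = sorted(a_unk.keys() & a_ref.keys())  # the M matching m/z values
    if not common or denominator == 0:
        return 0.0

    # Assumption: Delta A_k is (unknown A) - (reference A) for the kth match.
    deltas = [a_unk[mz] - a_ref[mz] for mz in common]
    mean_delta = sum(deltas) / len(deltas)

    numerator = sum(
        (a_unk[mz] + a_ref[mz]) - 0.5 * (abs(d) + abs(d - mean_delta))
        for mz, d in zip(common, deltas)
    )
    return numerator / denominator
```

With these assumptions, two identical spectra give a match factor of 1, and spectra with no m/z values in common give 0; any further scaling STIRS may apply to MF is not stated in the text and is not modeled here.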

The utility of a specific substructure for STIRS elucidation of an unknown’s molecular structure was measured by a descriptivity index, DI (eq 3),

$$
\mathrm{DI} = (N - 0.5E) \times \mathrm{OC} \times \mathrm{RC} \tag{3}
$$

where N is the number of nodes (number of non-H atoms) in the substructure, E is the number of free bonds (single bonds attached to only one atom), and OC is the fractional occurrence of that substructure in the STIRS reference file. For OC a substructure is not counted as occurring in a reference compound if the compound contains another listed substructure which includes the first substructure; for example, CH3- is not counted if its “descendant” CH3CO- is present. For the combination substructures the occurrence in the unknown file was used as OC without checking for larger substructures in the same compound.

The RC and FP values for each of the tested substructures were determined for predictions at the ≥99.8% confidence level (0.6%), and whether the substructure description was or was not modified to reduce the observed FP value (10); these had surprisingly little effect on curve shapes. Thus the relationships of Figure 1 and the data of Table I for each substructure can be used to derive its actual RC and FP values (and thus its RL value, eq 2) based on the proportion of the 15 best matching compounds found by STIRS which contain the substructure. Tables showing the resulting RL values corresponding to the number out of 15 for each of the 589 substructures have been prepared (10). For example, the presence of (CH3)3SiO- in 7 of the 15 best matching compounds predicts this substructure with 96% reliability, while the presence of methyl in 14 of 15 predicts CH3- with 95% reliability.
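As a worked check of the definitions of RC, FP, RL (eq 2), and DI (eq 3), the following Python sketch recomputes the trimethylsiloxy entry. The counts are as read from Table I and the discussion (54 of 54 correct identifications, 4 false identifications, 1265 unique occurrences among the 25598 reference compounds). The function names are hypothetical, and the sketch assumes that $P_f$ is the number of test unknowns lacking the substructure and that N = 5 and E = 1 for -O-Si(CH3)3.

```python
def recall(i_c, p_c):
    """RC: correct identifications over possible correct identifications."""
    return i_c / p_c


def false_positive(i_f, p_f):
    """FP: false identifications over possible false identifications."""
    return i_f / p_f


def reliability(i_c, i_f):
    """RL, first form of eq 2."""
    return i_c / (i_c + i_f)


def reliability_from_rates(p_c, rc, p_f, fp):
    """RL, second form of eq 2, written in terms of RC and FP."""
    return p_c * rc / (p_c * rc + p_f * fp)


def descriptivity_index(n_nodes, n_free_bonds, occurrence, rc):
    """DI, eq 3: (N - 0.5E) x OC x RC."""
    return (n_nodes - 0.5 * n_free_bonds) * occurrence * rc


N_UNKNOWNS = 899      # test spectra
N_REFERENCE = 25598   # reference compounds

# Trimethylsiloxy, -O-Si(CH3)3 (values as read from Table I).
p_c, i_c, i_f = 54, 54, 4
p_f = N_UNKNOWNS - p_c  # assumption: unknowns that lack the substructure

rc = recall(i_c, p_c)
fp = false_positive(i_f, p_f)
rl = reliability(i_c, i_f)
di = descriptivity_index(5, 1, 1265 / N_REFERENCE, rc)
```

Under these assumptions the values come out as RC = 100%, FP = 0.47%, RL = 93%, and 100 x DI = 22.2, which agrees with the figures quoted for this substructure.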

ANALYTICAL CHEMISTRY, VOL. 53, NO. 3, MARCH 1981

Table I. The Most Descriptive Substructures Predicted by STIRS substructurea monosubst phenyl, C,H,CH,(CH,),- or other alkyl -0-Si(CH,), aryl-OCH, -(CH,),-COaryl-CH,, could be arylCH- or arylFCH, C,H,-CH,-, could be other pheny1-C40-OCH, aryl-c1 CH,(CH,),-, or other alkyl steroid C ring, could be other fused polycyclic 17-sub steroid C t D rings, could be otherwise substituted -CO-0-Si(CH,),, could be -0-Si(CH,), aryl.CO. or aryl-COmethyl aryl-0-Si( CH,),, could be other -Si( CH,), aryl-0CH,-CO-0steroid B(de1ta-5) + C rings steroid B ring or other fused polycyclic CH,O-CO-CH,-, could be other -CH,O-CO-CC,H,: benzo, could be other fused aromatic aryl-OH C,H,-CO- or C,H,C-0-, C,H,-N,-, aryl-COphenylazo-: C,H,-N,or -C,H,-N,HO-C,H,-COstyryl: C,H,-CH=CHcinnamoyl: C,H,CH=CH-COaryl-CH,CH,-, could be other C,H,17-sub steroid D ring, could be other fused polycyclic aryl-"-, could be other ary1-N-C,H,-C,H,-CH,or fused or other diary1 C,H., could be other -all;Yl CH,-CO-OCH,CH,(CH,),-, could be other alkyl bridge .C.CHCH,CH,.CH., could be other fused polycyclic -C(CH,), aryl-CO-0- or other aryl-coaryl-CH,-, could be myl-CH, -CH,-CO-H C,H,-CH,-CO- or -C,H,-CH,-COC,H,O,: ring .O.(CHO-),., could be other sugar ary1.s. bridge .CH.C.CH. or other polycyclic alkyl C,H5-CH,0-

refb

occurrence unique

unknd

3757 1625 1584 1664 772 1846

2223 1625 1265 1343 772 1663

155 33 54 78 14

798

IC

81

112 29 54 44 13 39

798

30

22

2057 1035 2073

1366 931 808

63 40 44

1021

1021

I'f=

If

RC

FP

RL

10zdi

72

87 91 93 86 87 85

34.5 30.7 22.2 20.7 19.6 18.8

30

17 3 4 7 2 7

100 56 93 48

2.28 0.35 0.47 0.85 0.23 0.86

9

1

73

0.12

96

14.9

44 23 37

39

15 6 5

70 58 84

1.79 0.70 0.58

75 79 88

13.0 12.6 11.9

23

19

15

4

83

0.46

83

11.5

17

12

9

5

71

0.57

71

11.3

439

13

13

8

2

100

0.23

87

11.1

911 19791 252

911 10087 252

13 664 9

6 337 9

10

1

46

0.11

9

2 4

51 100

0.85 0.45

86 99 69

10.7 10.0 9.9

2515 1176

801

1041

103 26

58 17

8

0.80

11

8

1

56 65 73

0.11

88 71 89

9.7 9.3 8.9

439

31

7

88

1.01

566

566

5

4

7

3

80

0.34

57

8.8

813

813

25

15

10

5

60

0.57

75

8.6

668

668

28

18

18

4

64

0.46

82

8.4

1270 356

772 356

74 24

32 18

8

5

43 75

0.97 0.57

80

23

78

7.8 7.8

13

10

2

77

0.23

83

7.8

13 19 10

8

9

5 2 1

0.56 0.23 0.11

62 82

7

62 47 70

88

7.6 7.5 7.4

564

479

20

11

8

0

55

0.00

100

6.7

811

811

20

14

7

2

70

0.23

88

6.7

594

571

53

28

9

3

53

0.36

90

6.5

8

5

9

0

63

0.00

100

6.4

16

2

40

0.28

98

6.4

61 52

2711

195

78

401 2826

401 699

9 89

37

2 6

89 62

0.23 0.74

80

55

90

6.3 5.9

818

818

21

15

7

4

71

0.46

79

5.7

651 429

651 3 09

22

14

18

11

19

4 9

64 61

0.46 1.02

78 55

5.7 5.6

1844

552

86

40

11

4

47

0.49

91

5.5

2265

1482

92

35 6

8

2

38 55

0.99 0.23

81

11

75

5.5 5.3

8

248

248

10

9

6

0

90

0.00

100

5.2

351 1234

351 1234

6 13

4 9

19

5 2

67 69

0.56 0.23

44 82

5.0 5.0

7

6

1

86

0.11

86

5.0


Table I. (Continued)

substructure -CO-0-CH,-OCH, -CH,-C,Ii,-O-CH,CH,-C( CH,),, could be other alkyl -CH,CH,-CH( CH,),, could be other alkyl -CO-OCH,CH, Br-C,H,-

-co-0-

'o-HO-C,H,-CO-O-CH,-C,H,-OH .N(phenyl). or other ary1-N-CH,-C,H,-OCH, indolyl aryl-CO-, could be aryl-C-0- or aryl-N,-CO-CF, C,H,-CH,CH,NH- or -C,H,-CH,CH,NH-C,H,-SO,-CO-C,H.-OH or -c-o"-c~H,-oH, -N,C,H,-OH .C,H,.NH. -~H,-c,H,-N(cH,), -0-C,H,-N(CH,), -NH-CO-C,H,-OH O-CO-C,H,:O~H, -CH,-CH( CH,)-CH,-, could be other alkyl bridge .C.CH,CH,CH., could be other fused polycyclic aryl-N=, could be aryl-NH-NC,H,-CO-0-CH,- or C,H,-CO-0-CM,.CH.O.CH. -CO-OHf .CH.CH.O. -N(CH3)2

aryl-O-CH,aryl-CO-OH or other aryl-CO-0-

bridge .C.C.CH. -CH,-0-CH, C,H;: benzb (fused t o aromatic) .O.CH. .OS.( CH,);. -0-CH,CH,CH,CH, .CH(OHI.CH(OH1. . , ary 1-C OLH C,HN,: N-sub-imidazolo (fused to aromatic) aryl-CO-CH,-S-CH,aryl-S&H, .CH.CH( OH). -NH-CH,CH,-CO-CH,CH, ary1-N.CH,.CH.O. aryl-I

ref b

occurrence unique C

unknd

Iff

IC

If 11

RC

FP

RL

lO*DI

1.29 0.95 0.34 0.00

71 91 67 100

4.9 4.7 4.7 4.5

27 71 6

517

47 160 13 16

8

6

3 0

57 44 46 50

1101

650

22

13

11

0

59

0.00

100

4.5

548

422

5114

1031

15 7 149 5

6 4 9 2 1

264

0.68 0.45 1.20 0.22 0.11 0.34

60 60 90 67

264

9 6 79 4 5 6

67

4.5 4.3 4.3 4.2 4.2 4.1

1600 4704

733 1804

517

8

9

269

214

214

355

355

221

221

8

3

60 86 53 80 63 67

35

3 2 10

50 57 55

0.34 0.22 1.26

57 67 86

3.8 3.8 3.8

1 2

80 67

0.11 0.22

80 67

3.7 3.6

1 3

100

0.11

80

57

0.34

57

3.6 3.6

4 2

0.45 0.22 0.00 0.22 0.22 0.45 0.11

3.5 3.5 3.5 3.2 3.2 3.2 3.2

0.11

67 60 100 60 60 43 75 75

3.1

7 104

4 4 57

5 6

4 4

4 7

4 4 8 3 4 3 3 3 3 3

4

1 1

42 100 67 75 75 60 75 75

8

1311

7

5

83

357

357

19 3 6 4 4 5 4 4

lb86

601

38

25

20

8

66

0.93

76

3.1

254

190

24

18

9

4

75

0.46

82

3.1

6

3

4

50

0.45

43

3.0

6 10 6 3 2 8

70 41 64 69 42 44

0.69 1.19 0.70

0.34 0.23 0.91

79 70 82 70 71 56

3.0 2.9 2.8 2.7 2.6 2.6

3 4 2 8 4

73 56 75 42 100 65 75 44

0.34 0.46 0.23 1.00 0.47 0.69 0.34 0.45

84 79 82 84 67 71 67 64

2.5 2.3 2.3 2.0 2.0 2.0 1.9 1.9

38 50 67 67 47 88

0.91 0.22 0.23 0.23 0.23 0.22

53 67 86 75 80 78

1.5

28 62 71 48 60 50 82 42 86

0.34

63 65 56 67 60

718 901 928 483 297 190

718 727 752 399 247 190

33 56 44 16 12 23

23 23 28

529 927 120 3317 1729 574 259 245

248 432 120 1236 145 513 259 245

22 27 12 102 56 23

16

1184 269 192 215 117 98

1027 245 109 160 96 98

24

142 488 56 513 219 150 234 417 67

142 457 56 380 219 150 61 393 47

8

16 8 18

9 17

11 5 10 15

9 43 56 15 6 7 9 4 12 6 8

0

2 2 4

16

6

3

4 8

2 2 2 2 2

8

7

18

5 13 5 12 9 4 9

3 7 4 6 6

8

3 5

21 7 25 15 8 11

19 7

6

1 3

0.80

0.45 0.69 0.68 0.11 0.34 0.34 0.56

80

75 73 55

1.4 1.3 1.3 1.2 1.2 1.2 1.1 1.1 1.1 1.0 1.0 1.0

1.0 1.o


Table I (Continued) other substructures (DI x 5,17-disub steroid B + C t D rings (2.9), -CO-C,H,-Br (2.8), -OC,H,-N(CH,)(2.7), -0-C,H,-CO-H (2.7), C,H,N: indolyl, fused t o alicyclic (2.7), SO,-C,H,-N(CH,),, (2.6), C,H,: 1,l-disub4(or 7)-sub. indan (2.5), aryl-CF, (2.4), -C,H,-CO-0-C,H, (2.3), C,H,-CO-0-CH,CH,- or -C,H,-CO-0-CH,CH,(2.2), CH,-C,H,-CO-CH=CH- (2.2), HO-C,H,-CO-CH=CH- (2.2), -N-C,H,-0- (2.2), -CF,-CF, (2,1), dichloroC,H,-OCH,- (2.1), CH,-C,H,-CH,-CO(2.1), C,N,Cl,: triazene-CC1, (2.0), -N,-C,H,-CO- (2.0), -C,H,-CO-CH= CH- ( l a g ) ,HO-C,H,-CO-H (1.9), 8,14,17-trisub steroid B(de1ta-5) t D rings (1.9), aryl-CN, could be aryl-CN + 2H (1.8), C,H,-CH,CH,-CH(1.8), C1-C,H,-OCH, (1.8), -NH-C,H,-OCH, (1.8), -CO-C,H,-N= (1.8), C,H,O, or C,H,O,: .O.C(phenyl).O. (1.79,C1-C,H,-CH,- (1.q C,H,-CS- or -C,H,-CS- (1.6), Cl-C,H,-0- (1.6), C,H,OI: (1.3), -N(CH,)-C,H,-OCH, -(I-)benzo(-0-)- (fused to aromatic) (1.5), -NH-C,H,-0- (1.5), -CH,-CO-0-CH,CH, (1.1), -N,-C,H,-OCH, (1.1), -CH,-C,H,-CO-NH( l . O ) , -0-CH,CH,-0-CH,CH,( L O ) , C,H,: 5,lO-disub steroid (0.9), -N-C,H,-OCH, (0.9), -N;-C,H,-OH (0.9), B ring ( l . O ) , CH,C,H,-CH=CH- (0.9), -CH,-C=N-0-CH,N,C,H,-O- (0.9), -C(CH,),-0- (0.9), arylS0,- (OB),C,H,S: 2-subst. thiophene (OB), CH,-C,H,-N= (0.8), HOC&,-N= (0.8), C,H,O,: -0-benzo-CO- (fused to alicyclic) (OB), .O.C(CH,).O. (0.8), C,H,: bridge .C.CH.CH,.CH. (9.7), -C,H,-N,- (0.7), -CH,-( 2,2-dimethyl-l,3-dioxolane) (0.7), C,H,: -CH,-C=CH (0.7), -CO-CH(NH,)- (0.7), 1,3-dioxolo (fused to aromatic) (0.6), .CH,.CO.O. (0.6), aryl-CO-NH- or other aryl-CO- (0.6), 430-CH=CH-CH, (0.6), aryl-N(CH,)- (0.5), bridge .C.C.C. or other fused polycyclic (0.5), -PO(OCH,), (0.5), -CCl,- (0.5), -N-CH, (0.5), bridge .C.C(CH,),.CH. (0.5), -CH,-NH-CH(CH,), (0.5), aryl-N(CH,), (0.5), -NH-CH,- (0.4), .CH(OH).CH(OH). CH,. ( 0 4 , .S.CH:CH. (0.4), -PO( OCH,CH,), (0.4),.CO.C(CH,):CH. 
(0.4), 40-CH=CH-CO- (0.4), C,H,O: 2,5disub furan (0.4), C,H,N: 3,3-disub a-fused piperidine (0.4), -COS-CH,CH,- (0.3), .C.O.CH,. (0.3), -CH(NH,)CO-O-CHZCH, (0.3), -CH,-PO- (0.3), -CH=CH-CO-CH, (0.3), -S-CO-CH, (0.2), -N-CH,CH,CH, (0.2), -NH-NH, (0.2), 8-sub steroid B ring (delta-5) (0.2), C,H,N: 3,N-disub pyrrolo (fused to aromatic) (0.2), S - S - (0.2), iodo (0.1)

total DI: 721
average (RC, FP, RL): I O, 0.31, 79
occurrence-weighted average (RC, FP, RL): 58, 0.30, 85

a Periods between atoms indicate a ring bond; dashes are acyclic bonds.
b Number of compounds containing the substructure of 25 598 reference compounds.
c As in b, except no other listed substructure containing the substructure is present.
d Number of compounds containing the substructure in the 899 unknown compounds.
e An entry under If′ indicates the substructure description was modified by “or” or “could be”; If′ and If represent data before and after modification, respectively.
f Data for substructures of DI × 100 < 3.0 occurring < 7 times in the 899 unknowns and all of DI × 100 < 1.0 are restricted to the description and DI × 100 value.

[Figure 1: plot not reproduced; abscissa, observed false positive values, %.]

Figure 1. The relationship of observed false positive (FP) and recall (RC) values to theoretical FP values for Table I substructures: small circles, RC (top abscissa) as a % of its value at FP (theor) ≤0.2%; large circles, such RC values for the 71 substructures of >1% occurrence; small squares, observed FP values; large squares, observed FP values for the 29 substructures for which FP (theor) >0.8%.

Figure 2. The relationship of observed false positive values to recall values (abscissa: recall as a % of its value at FP (theor) ≤0.2%). See Figure 1 for symbol descriptions.

For an unknown spectrum STIRS now examines the 15 best matching compounds found by MF11.0 for the presence of these 589 substructures, ranking those indicated by the magnitude of the corresponding RL value. With this method the size of the substructure file obviously does not affect the prediction reliability for any individual substructure. STIRS results using the spectrum of n-propyl p-hydroxybenzoate as an unknown are shown in Table III. The most structural information results from the prediction of HO-C6H4-CO-O-; however, the prediction of HO-C6H4-CO- and -C6H4-CO-O- with much higher reliability values obviously provides important corroborative evidence. The correct STIRS prediction of molecular weight 180 (7) makes possible the assignment of molecular structure except for the position of ring substitution.
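The ranking step described above (count how many of the 15 best matching compounds contain each substructure, convert that count to a reliability, and order the predictions by RL) can be sketched in Python as follows. The per-substructure tables of RL versus count are available only in ref 10, so the lookup table here is a hypothetical stand-in holding just the two values quoted in the text; the function name `rank_predictions` and the data layout (one set of substructure labels per matched compound) are likewise assumptions.

```python
from collections import Counter


def rank_predictions(best_matches, rl_table, top_n=15):
    """Rank substructures by the RL value that corresponds to their count
    among the top_n best matching reference compounds.

    best_matches: list of sets of substructure labels, best match first.
    rl_table: {substructure: {count_out_of_top_n: RL in %}} (hypothetical layout).
    """
    counts = Counter(
        sub for compound in best_matches[:top_n] for sub in set(compound)
    )
    predictions = []
    for sub, count in counts.items():
        rl = rl_table.get(sub, {}).get(count)
        if rl is not None:  # skip substructures with no tabulated RL for this count
            predictions.append((sub, count, rl))
    return sorted(predictions, key=lambda p: p[2], reverse=True)


# Only these two RL values are quoted in the text; a real table would hold
# an entry for every count (1..15) of every listed substructure.
rl_table = {
    "(CH3)3SiO-": {7: 96},
    "CH3-": {14: 95},
}

# Illustrative top-15 list: 7 compounds with both groups, 7 with methyl only, 1 with neither.
best_matches = [{"CH3-", "(CH3)3SiO-"}] * 7 + [{"CH3-"}] * 7 + [set()]
ranked = rank_predictions(best_matches, rl_table)
```

Because each substructure is looked up independently, adding more substructures to the table does not change the RL assigned to any one of them, which is the property noted in the paragraph above.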

Structural Information Available from Mass Spectra. The substructures of Tables I and II have been selected from


Table 11. Other Substructures of Useful Performance with STIRSa 250 Single Substructures: ar-C,H,, C,H,: benzo (fused to a h ) , cyclohexyl, ar-CO-0-CH,, ar.O., ar.N., o-sub phenyl, ar-Br, ar-CH,-0-, .CO.CH,.CH,., -(CH,),-, .CH(OCH,).CH( OCH,),, ar.NH., -CH,CH,-CO-0-CH,-, -CH,CH,-CO-0-CH,, -CO-NH-CH-CO-0-, C,H,: 2-sub benzo (fused to alic), -CH(CH,),, bg .CH.CH,.CH., ar0-CH,CH,, -0-CH,-CH-0-CO-CH,, -NH-CO-CH,, ar-NH,, -NH-CO-CF,, bg .CH.C.CH,CH,.CH., C,H,N: .benzo.N., ar-NH-CO-, -NH-CH(C0-0-CH,)-, bg .CH.CH,.CH,.CH., -NH-CH-CO-NH-, .O.CH(ar).O., C,H,: 1sub benzo (fused to alic), ar-CH= CH-CO-, C,H,ON: -CO-1-pyrrolidinr, ctr-N=N-, -CH,-O-(CH2)3-, C,H,,N: N. sub piperidine, ar-CH(0H)-, .CH,.N.CH,.CH,, ar-CO-0-CH,CH,, -(CH,),-CO-0-CH,, -CH,CH,-CO-0-SI-, ar-SCH,-, .C).CH(CII,).CH,., .CH.C€I,.O., -CH(OH)-CH(OCH,)-, C,H,N: pyrido (fused t o ar), ar-S-, cyclopentyl, C,H,: A,C-difused benzo, C,H,: 2-sub benzo (fused to ar), .CH(OH).CH,., .CH,.CO.CH,., 1,2,3-trisub ph, .CO.CH,.CH( CH,)., ar-CH=CH-CO-0-, -C( CH,),-, ar-0-CH,CH,-, -CH,CH,-OH, -CO-CH,-NH-CO-, .CH( OCH,). 
CH,., -CO-CH,CH,CH,, -0-CO-( CH,),-, ar-CCl,, -0-CO-CH-Q-, ar-CH,CH,-NH-, ar-CH,-CO-, -CH,-CO-0(CH,),, .C.CH( OH)., 4-(CH,),-CH,, -CH,-N( CH,),, .C.C(OH)., -CF,-CO-NH-CH,-, -CH-CO-O-CH,CH,, -CH,Cl, ar-CS-, -CO-CH( CH,),, ar-CH.O., C,H,-NH-, C,H,-, C,H,: 1,a-disub benzo (fused to alic), .N., %(or 4-)pyridyl, .CH,.CH,,O.,.-CO-CH,-CH( CH,),, .CH,.C( OH)., -CH=CH-CO-0-CH,, =:C-N-( CH,),, 40-N-CH,-, C,H,N: A,Cdifused piperidino, ar-CH,-OH, -0-CO-CH,CH,-, 2-pyridyl, could be 3- or 4-, -CH,CH,-CO-CH,, -CH,-CH(NHCO-CH,)-, -CO-NH-(CH,),-, .CH(OH).CH(OH).CH(OH)., .CO.C(CH,),., ar.O.CH,., -CH,CH,-0-CH,, -CH( 0H)CH,-, -N(CH,)-CH,CH,-, -CO-CH(NH3)-CH,-, -CO-(CH,),-CH,, .NH.CH,.CH,., -CC1,, br .CH.N.CH., .O.CH( CH,), -NH-CH.CH( OCH,)., -NH-CO-CH,-, -N-CH,CH,-, -0-CH,CH,-0-, ar-CH,-CO-OH, 2,5-disub thiophene, -NCH,CH,, -CH,-O-CH,-, -CO-NH-CH,CH,-, 1-cyclohexenyl, -NH-CH( CH,),, -CH,-CO-CH,-, -SO,-, -0-COCH,CH,, -S-CH,CH,-, -CO-H, .CO.CH( CH,).CH,., -CH2-2-thiophene, ar-CO-CH=CH-, -CH-CO-OH, .CH,.CO.O., 4-(0r S-)pyridyl, .CH,.C=N-0-CH,, -0-CO-CH-NH-, -S-CH,CH,CH,, bg .C.CH,.CH.CH., -CH-N(CH,),, -CH=CN(CH,),, -CO-CH,CH,-, .C(CH,) (OH).CH,.CH,., -CH,-CO-OH, -S-CH,, CH,-0-CH(CH,)-, C,H,O,: 2-methyl2,5-disub-l,3-dioxolane, -NH-CI-I-CO-OH, -NH-CH,-CO-0-CH,, -S-(CH,),-, -CH,-CH, -C=N, C,H,N: 3-subA,B-difused piperidino, -S-CH(CH,)-, N-sub pyrrolidinyl, -CO-CH,-CO-0-, -S-CH,CH,, C,H,: 1,l-disub cyclopenteno (fused to ar), .CH,.N.CH,., -CO-NH-CH,-, ar-C=CH,, -CH( OH)-CH,, -CH,-CO-S-CH,CH,-, .CH.O.CH,., C,H,: 1,2-disub benzo (fused to ar), ar-N( CH,CH,)-, 2,3-pyridyl, -CH,-O-(CH,),-CH,, -C( CH,)( OH)-, -CO-NH-, bg .CH.CH,.CH,.CH,.CH., -CH,-.l-furyl, -0-CH.CO.O., .CH,.N.CH.CH,., -C-0-CH,-, ar.N:N., -NH-CO-0-CH,-, -CH,CH,-SHY ar-P-, bg .C.CH,.CH., .O.CH,.CH,.CH,., -CH,-C=NO-, C,H,N: 2-pyrido (fused to ar), 2,4,6-trisub s-triazine, -0-CO-NH-, bg .C.CH.CH.CH., -CH,-C=N, -NH-CH,-CH( OH)-, -CH=N-0-CH,-, bg .CH.CH.CH.CH., C,HN,: 4-sub pyrimido (fused 1.0 ar), -S-C-, bg .C.CH,.CH,.CH,.C., 
-CH,-CH(NH,)-, -CH-0-CH,-, .CO.CH( CH,)., .N(CH,).CO., -CO-CH,-CO-CH,, C,N,: 1-sub 1,2,3-triazolo (fused to ar), -0-CH,CH,-0-CH,-, -NH-CH(CH,CH,)-, .CH.N.CH,.CH,., C,HN: 1,3-disub pyrrolo (fused to ar), -CH,-CO-CH,, -CH,I, -CH,SH, C,H,N: N,2-disub pyrrolidinyl, C,H,O: 1-furyl, .C'H(CH,).S., -N-CH,-, .CH,.C=N-OH, -0-CO-CH,-NH-, -C=CH, -S-S-CH,-, -CO-CH,-CO-, -S-CH-CH,-, -S-( CH,),-, -N(CH,)-CH-CH,-, -CH,-NH-CH( CH,)-, -CH(NH,)-CO-0-, 4-CH-, -C(CH,),-OH, ar-CS-, C,H,: 1,4-disub benzo (fused to ar), ar-CH=CH,, C,H,: l,4-disub benzo (fused to alic), C,H: 1,2,3-trisub benzo (fused to alic), C,H: 1,2,4-trisub benzo (fused to alic), C,: 1,2,4-trisub benzo (fused to alic), C,: 1,2,3,4-tetrasub benzo (fused to alic), .S.CH,., bg .CH.CH.CH,.CH., bg .C.CH,.CH,.C., bg .CH.C.CH,.CH, -NH-NH,, -CH,-NH,, -CH,CH,-NH,, -N-CH(CH,),, -NH-C(CH,),-, .NH.CH(CH,)., ar-CH.NH., -NH-NH-, -0-CH,C:H,CH,, -CO-C=C-, -CO-S-CH,CH,, ar-CO-CH,, .CO.CH:CH., -NH-CO-H, -N(N=O)-, -N(CH,)(N=O)-, -NH-CO-NH-, -CH,CH,-CO-OH, -CH(CH,)-CO-O-CH,, 40-CO-, -CH=CH-CO-OH, .CH(OH).CO.

147 Combination Substructures: C,H,-CO- or -C,H,-CO-, C,H,-CH, or -C,H,-CH,-, C,H,-OCH,, C,H,-0C,H,-CH=CH-CO-- or C,H,-CH=CH-CO, -C,H,-Cl, -C,H,-OH, C,H,-CH=CH- or -C,H,-CH=CH-, -C,H,-NH-, C,H,-NH or -C,H,-NH-, -C,H,-0-, -C,H,-OCH,, C,H,-N- or -C,H,-N-, -C,H,( -CO)-, o-HO-C,H,-CO-, -O-C,H,-CO-, -CO-C,H,-OCH,, C,H,-N-, o-HO-C,H,-, CbH5-N2-, -(HO-)C,H,-, -((CH,),N-)benzo-(fused to ar), C,H,-N- or -C,H,-N-, -(-0-)benzo-(fused to ar), HO-C,H,-CO-0-, -C,H,-N(CH,),, -(-N(CH,)-)benzo- (fused to ar), C,H,-SO,- or -C,H,-SO,-, o-CH,O-C,H,-, -C,H,-CO-OCH,, CH,O-C,H,(-CO-H)-, C,H,-N( CH,)- or -C,H,N( CH;)-, -C,H,(-CO-OCH,)-, -C,H,-F, -C,H,-NH,, -CH,-CO-C,H,, -CO-C,H,-Cl, -CO-C,H,-NH,, -0-C,H,-CHOY -( -N-)benzo-, -0-C,H,-CH=CH-, C,H,-CO-CH,- or -C,H,-CO-CH,-, C,H,-CO-NH- or -C,H,-CO-NH-, 0-0C&-CO-, -NH-CH,CH,-C,H,-N( CH,),, 40-CH=CH-C,H,-OCH,, -CsH3( -CO-0-)-, -CH,-0-CO-C,H,-OH, -N( CH,)-C,H,-CH,CH,-NH-, -N( CH,)-C,H,-SO,-, -O-C,H,-CH=CH-CO-, -0-C,H,( -CO-OCH,)-, C,H,-NH-, -CH2-CsH4-N-, C,H,-CH,-CO-0- or -C,H,-CH,-CO-0-, 5-sub steroid B + C rings or other fused polycyclic, -O-C,H,.-CH,-CO-, -0-C,H,-SO,-, -(Cl-)C,H,-, -(-CO-)benzo- (fused to alic), C,H,-CO-NH-, CH,O-C,H,-NH-, -N-C,H,.-CH,CH,NH-, -N-C,H,-SO,-, CH,O-C,H,-N-, HO-CO-C,H,-, CH,O-C,H,( -OH)-, -O-C,H,( 4 0 - 0 ) - - , C1,-C,H,-, -(HO)benzo- (fused to alic), -N-C,H,-CO-, C,HS-NH-CO- or -C,H,-NH-CO-, -C,H,( -CO-OH)-, CI-C,H,-NH-, NC-C,H,-, H-CO-C,H,, CH,-C,H,( -OH)-, -0-C,H,( -CO-)-, -C,H,(-CO-0-)-, -(-NHCH,CH,-)benzo -N-C,H,-0-, C,H,-OCH,- or -C,H,-OCH,-, (fused to ar), -( -8O,-)benzo (fused to ar), pyridyl-CO-, -NH-C,H,-0-, -"-C,H ,( -C1)-, -0-C,H,( -OH)--, -C,H,( -CO-H)-, -0-C,H,( -Cl)-, -( CH,O)benzo- (fused to ar), 17-sub steroid B,(delta-5) + C +. 
D rings, C,H,-0-, I-CsH4-, -C,H,(-OCH,-)-, -(CH,O-)benzo( -subst)- (fused to ar), pyridyl-0-, C,H,-S- or -C,H,-S-, HO-C,H,-CO-0-CH,CH,-, CH,O-C,H,( -CO-OCH,)-, -0-C,H,-CH= CH-CO-0-, -(-N-)C,I-I,-, -0--C,H,(-CH,CO-OCH,)-, -(I-)benzo( -subst)- (fused to ar), C,H,-CH=CH-CO-0-, CH,O-C,H,-CO-( -0-)benNH-, CH,O-C,H,-NH-CO-, CH,O-C,H,(-CH,-CO-)-, CH,O-C,H,(-CO-0-)-, -(CH,O-CO-CH,-)C,H,-, zo( -subst)- (fused to aromatic), -( -0-)benzo- (fused to alic), H,N-C,H,-CO-0-, HO-C,H,(-CO-OH)-, -CO-C,H,CH=CH-, -0-C,H,-CO-NH-, -0.-C,H,-NH-CO-, CH,O-C,H,( -CO-)- unsub steroid C + D rings, C,H,-CH,CH,"-, C,H,-CH,-(=O-, C,H,-NH-CO-, C,H,-CO-0-, -( CH,O-)benzo( -CO-)- (fused to alic), pyridyl-CO-NH-, -NH-C,H,-CH=CH-, 14,17-disub steroid C + D rings, -CH,-pyridyl-CO-, 6,7-disub purine, C,H5-0-CH,-, Cl-C,H,-O-, HO-CH,-C,H,-, -C,H,( -CH,-CO-)-, 2-pyridyl-C0-, C,H,-CH=CH- or -C,H,-CH=CH-, CH,-C,H,( "-)-, CH,-C,H,( -N-)-, sub pyridyl-CO-, -0-C,H,( -CH,-)-, -( -N,-)benzo- (fused to ar), -( -CHOH-)benzo(fused to ar), -(CH,O-)benzo- (fused to alic), pyridyl-CH,-, sub pyridyl-CH,-, C,H,-P- or C,H,-P-, -( -N-)benzo(fused to ar), -( C1-)benzo- (fused to alic). or -C,H,-0-,

                             RC(b)   FP(b)   RL(b)   ΣDI
single substructures          26      0.40    41      1.9
combination substructures     22      0.10    54      4.3

a See Table I. The substructures of this list are ordered separately by their DI values, highest first. Abbreviations: alic, alicyclic; ar, aryl; bg, bridge; o, ortho; ph, phenyl; sub, substituted.
b Occurrence-weighted averages.


Table III. STIRS Substructure Predictions for n-Propyl p-Hydroxybenzoate

substructure      % reliability    substructure      % reliability
HO-C6H4-CO-       98               -CO-O-            84
-C6H4-CO-         98               HO-C6H4-CO-O-     77
HO-C6H4-          97               -O-CH2-           75
-C6H4-CO-O-       96

approximately 7000 preselected possibilities as those most readily identified by STIRS. Thus they are indicative of the type and specificity of structural information which can be deduced by mass spectrometry. Overall the results are consistent with well-known interpretive generalizations, but no quantitative data have been available previously.

Most aliphatic hydrocarbon substructures show relatively high false positive values, unless the specificity of their descriptions is substantially reduced. STIRS poorly distinguishes the n-hexyl substructure from isomeric alkyl moieties; of the 60 identifications of this substructure, 31 are false, but of these only three are compounds that do not contain other large alkyl groups. The four C4H9+ isomers were evaluated; not surprisingly, tert-butyl is the only isomer identified by STIRS with useful specificity, with a false positive value of 0.46% compared to 4.6%, 1.0%, and 1.1% for the n-, iso-, and sec-butyl isomers. The presence of heteroatoms greatly increases the probability of a low false positive value for a substructure without modification of its description.

As expected, the highest specificity is found for substructures giving the most unique mass spectral data. For trimethylsiloxy, STIRS shows an amazing 100% recall (54 of 54) with 93% reliability; the presence of silicon in organic compounds is unusual, as are the masses and abundances of this element’s isotopes. The methoxy group is identified with 91% reliability but only 44% recall; methyl ethers produce ions such as CnH2n+1O+ at relatively unique masses, but many functionalities influence mass spectral behavior more strongly, so that their presence causes the peaks due to methoxy to be insignificant. STIRS is able to identify the methoxy “descendant” substructure CH3-O-CO- with higher recall (70%) but lower reliability (75%); carbomethoxy can produce a number of additional fragmentation pathways helpful in characterization, but the combinations of resulting masses can also arise occasionally from similar functionalities such as CH3-CO-O-. The behavior of the aryl-OCH3 substructure is intermediate, with 56% recall and 86% reliability. This general designation of aromatic methoxy substructures was used because more specific descriptions gave poorer performance; p-CH3O-C6H4- gave 30% recall and 47% reliability. This was also true for most simple aromatic substituents, which gave substantially higher DI values for the “aryl-” substructure description than when combined with a more specific aromatic ring. This is in keeping with the well-known insensitivity of mass spectral data, such as the “low mass ion series”, to aromatic type and ring position.

Further Improvements. These substructure performance data are based on STIRS predictions utilizing the overall match factor MF11.0. Other individual and combination data classes are known to be more selective for the prediction of specific substructures, and similar studies are in progress to identify combinations most likely to give improved reliabilities. For example, for the “unknown” n-propyl p-hydroxybenzoate above, the neutral-loss data class 5B indicates the C3H7O- substructure with high reliability. The performance of the substructure list of Tables I and II can also serve as a base line against which to evaluate further improvements to STIRS. Such modifications to the secondary-neutral-loss data class are reported separately (7), and improvements to the low-mass-ion-series data class appear promising. Finally, a logical next step is the development of a computer-aided system to predict the complete molecular structure from the STIRS-predicted information on substructures, molecular weight, and elemental composition (7, 17).

ACKNOWLEDGMENT I. K. Mun, H. E. Dayringer, and B. L. Atwater provided valuable assistance and advice.

LITERATURE CITED

(1) Burlingame, A. L.; Baillie, T. A.; Derrick, P. J.; Chizhov, O. S. Anal. Chem. 1980, 52, 214R.
(2) Chapman, J. R. “Computers in Mass Spectrometry”; Academic Press: New York, 1978.
(3) McLafferty, F. W.; Knutti, R.; Venkataraghavan, R.; Arpino, P. J.; Dawkins, B. G. Anal. Chem. 1975, 47, 1503-1505.
(4) Pesyna, G. M.; McLafferty, F. W. “Determination of Organic Structures by Physical Methods”; Nachod, F. C., Zuckerman, J. J., Randall, E. W., Eds.; Academic Press: New York, 1976; pp 91-155.
(5) Henneberg, D. Adv. Mass Spectrom. 1980, 8, 1511.
(6) Dayringer, H. E.; Pesyna, G. M.; Venkataraghavan, R.; McLafferty, F. W. Org. Mass Spectrom. 1976, 11, 529-542.
(7) Mun, I. K.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem. 1981, 53, 179.
(8) Kwok, K.-S.; Venkataraghavan, R.; McLafferty, F. W. J. Am. Chem. Soc. 1973, 95, 4185.
(9) McLafferty, F. W.; Venkataraghavan, R. “Mass Spectral Correlations”, 2nd ed.; American Chemical Society: Washington, DC, 1981.
(10) Haraki, K. Ph.D. Thesis, Cornell University, May 1980.
(11) Stenhagen, E.; Abrahamsson, S.; McLafferty, F. W. “Registry of Mass Spectral Data” (extended version on magnetic tape); Wiley: New York, 1978.
(12) Brown, Carter N. Chemical Abstracts Service, Columbus, OH.
(13) Pesyna, G. M.; Venkataraghavan, R.; Dayringer, H. E.; McLafferty, F. W. Anal. Chem. 1976, 48, 1362-1368.
(14) Dayringer, H. E.; McLafferty, F. W.; Venkataraghavan, R. Org. Mass Spectrom. 1976, 11, 895-900.
(15) McLafferty, F. W. Anal. Chem. 1977, 49, 1441-1443.
(16) Mun, I. K.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem. 1977, 49, 1723-1726.
(17) McLafferty, F. W.; Atwater (Fell), B. L.; Haraki, K. S.; Hosokawa, K.; Mun, I. K.; Venkataraghavan, R. Adv. Mass Spectrom. 1980, 8, 1564-1567.

RECEIVED for review September 8, 1980. Accepted December 12, 1980. Support of this research by the National Institutes of Health (Grant GM16609) and the National Science Foundation (Grant CHE7910400) is gratefully acknowledged.