Computerized learning machines applied to chemical problems

Molecular Structure Parameters from Low Resolution Mass Spectrometry

P. C. Jurs,1 B. R. Kowalski,2 T. L. Isenhour,3 and C. N. Reilley

Departments of Chemistry, University of Washington, Seattle, Wash., and University of North Carolina, Chapel Hill, N. C.

The learning machine method of empirical data treatment has been applied to low resolution mass spectrometry for the purpose of elucidation of molecular structures. For a set of 387 hydrocarbon compounds and 243 carbon-hydrogen compounds containing oxygen and/or nitrogen, 43 and 65 binary decision surfaces, respectively, were trained and then tested for predictive ability. Use of these decision surfaces allows deductions regarding molecular structure to be made from low resolution mass spectra without recourse to theory.

BOTH LOW AND HIGH RESOLUTION mass spectra contain a wealth of information which may be used for the identification of chemical compounds and the elucidation of their structures. A variety of strategies have been applied to the identification and interpretation of mass spectra. Methods include matching recorded spectra with library spectra for identification (1-5) and element mapping (6-8). Computations to determine fragment formulas (elemental compositions) from the mass value using high resolution have been implemented (9); development of plausible isomers from the combination of the mass spectrum and molecular formula has also been explored (10). With the advent of on-line computerized data acquisition systems (11-13), it has become practical to collect and process large quantities of mass spectrometry data. The great activity in this area is indicated by a recent biannual review on mass spectrometry which includes 32 references on data processing (14).

1 Present address, Department of Chemistry, the Pennsylvania State University, University Park, Pa. 16802.
2 Present address, Shell Development Company, 1400 53rd Street, Emeryville, Calif. 94608.
3 Present address, Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514.

Previous work has shown that molecular formulas of standard spectra can be accurately determined by learning machine procedures (15). The learning machines used consisted of adaptive binary pattern classifiers which were trained using negative feedback so that their performance improved as their experience in classifying mass spectra increased. Using this completely empirical approach, weight vectors were found which could make the significant decisions and also required considerably less storage than the standard library spectra. Since such weight vectors are derived without recourse to theory, trends and relations may be established which might not otherwise be considered. Furthermore, once the weight vectors have been trained, it is feasible to utilize this scheme without recourse to a computer. Further studies of this type have shown that a trained weight vector may be successfully used to predict composition information from the spectrum of an unknown compound (16, 17).

This work describes the application of learning machine methods to the determination of structural parameters from low resolution mass spectra. Two classes of compounds, hydrocarbons (hereinafter referred to as the CH class) and compounds containing carbon and hydrogen plus oxygen and/or nitrogen (hereinafter referred to as the CHON class), were used to develop weight vectors for various structural parameters. Application of these weight vectors to the mass spectrum of a compound from the training set gives answers to a series of structural questions which always correctly define the molecular formula and which are often sufficient to determine the entire molecular structure. Application of the weight vectors to the spectrum of an unknown compound gives predictions of structural information, often with accuracy as high as 90%, and this information is helpful in determining the molecular formula and structure.

TRAINING AND TESTING WEIGHT VECTORS
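As a rough illustration of the adaptive binary pattern classifiers described above, the following Python sketch trains one weight vector so that the sign of its dot product with a spectrum gives the binary answer, and counts each correction as one feedback. The function names, the fixed-increment correction rule, the appended constant component, and the toy four-peak spectra are assumptions made for this illustration; the original Fortran IV programs are described in refs 15-17 and may differ in all of these details.

```python
def dot(w, x):
    """Dot product of a weight vector and a pattern vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_weight_vector(patterns, labels, max_feedbacks=2000):
    """Train one binary decision surface by error-correction feedback.

    patterns: list of intensity lists (one per spectrum)
    labels:   +1 (positive category) or -1 (negative category)
    Returns (weights, feedbacks_used, converged).
    """
    # Assumption: append a constant component so the decision surface
    # need not pass through the origin.
    xs = [list(p) + [1.0] for p in patterns]
    w = [0.0] * len(xs[0])
    feedbacks = 0
    while True:
        corrected = False
        for x, y in zip(xs, labels):
            if y * dot(w, x) <= 0:  # wrong (or undecided) answer
                if feedbacks >= max_feedbacks:
                    return w, feedbacks, False  # gave up, not fully trained
                # Assumed fixed-increment rule: push w toward the right side.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                feedbacks += 1
                corrected = True
        if not corrected:
            return w, feedbacks, True  # every training pattern classified


def classify(w, pattern):
    """Return +1 or -1 from the sign of the dot product."""
    return 1 if dot(w, list(pattern) + [1.0]) > 0 else -1


# Toy training set (hypothetical intensities, not real spectra).
spectra = [[9, 1, 0, 2], [8, 0, 1, 1], [1, 7, 2, 0], [0, 9, 1, 3]]
labels = [1, 1, -1, -1]
weights, n_feedbacks, converged = train_weight_vector(spectra, labels)
```

Spectra withheld from training can then be passed to `classify` to estimate predictive ability, which mirrors the training set and prediction set split used in this work.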

(1) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 293 (1967).
(2) B. Pettersson and R. Ryhage, ANAL. CHEM., 39, 790 (1967).
(3) “Computer Recording and Processing of Low Resolution Mass Spectra,” R. A. Hites and K. Biemann, in “Advances in Mass Spectrometry,” E. Kendrick, Ed., The Institute of Petroleum, London, England, 1968.
(4) S. Abrahamsson, S. Stallberg-Stenhagen, and E. Stenhagen, Biochem. J., 92, 2P (1964).
(5) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 41, 994 (1969).
(6) K. Biemann, P. Bommer, and D. M. Desiderio, Tetrahedron Lett., 26, 1725 (1964).
(7) R. Venkataraghavan, F. W. McLafferty, and J. W. Amy, ANAL. CHEM., 39, 178 (1967).
(8) D. D. Tunicliff and P. A. Wadsworth, ibid., 40, 1826 (1968).
(9) “Computation of Molecular Formulas for Mass Spectrometry,” Joshua Lederberg, Holden-Day, San Francisco, Calif., 1964.
(10) “Mechanization of Inductive Inference in Organic Chemistry,” by J. Lederberg and E. A. Feigenbaum, in “Formal Representation of Human Judgment,” B. Kleinmuntz, Ed., John Wiley, New York, N. Y., 1968.
(11) R. A. Hites and K. Biemann, ANAL. CHEM., 39, 965 (1967).
(12) Ibid., 40, 1217 (1968).
(13) A. L. Burlingame, D. H. Smith, and R. W. Olsen, ibid., p 13.
(14) R. W. Kiser and R. E. Sullivan, ibid., 40, 273R (1968).

(15) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, ANAL. CHEM., 41, 21 (1969).
(16) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., p 690.
(17) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ibid., p 695.

ANALYTICAL CHEMISTRY, VOL. 42, NO. 12, OCTOBER 1970

All computations were done on the University of Washington IBM 7040/7094 direct couple system using Fortran IV computer programs. General aspects of the learning machines and the computer programs involved have been described earlier (15-17). The overall set of mass spectra, taken from the American Petroleum Institute Research Project 44 tables, includes 630 low resolution mass spectra of compounds in the range C1-10, H1-24, O0-4, N0-2. Of these, 387 were CH compounds and 243 were CHON compounds.

For this work, the weight vectors are arranged in a parallel fashion rather than a branching tree as in a previous study (15). The parallel arrangement allows complex decisions to be made with acceptably high overall predictive ability, whereas binary decisions arranged in a branching tree must have extraordinarily high individual predictive abilities for the overall predictive ability to remain useful. Of course, for the parallel arrangement, all weight vectors must be used for each classification, which requires more computation than the branching tree method.

CH Class. From the 387 CH compounds, a subset of 200 was chosen randomly to serve as a training set for developing weight vectors, and the other 187 were used to test the predictive ability of the weight vectors. Table I shows the results of training and testing the 43 weight vectors developed for the determination of structural parameters of hydrocarbons. Each vector was trained to give a binary decision. In all cases except the carbon:hydrogen ratio, a positive answer indicated that the value was greater than the cutoff number, and a negative answer indicated that the value was less than or equal to the cutoff. For example, the first vector (carbon number 9) was trained to give a positive dot product with the mass spectrum of a compound containing ten carbons and a negative dot product with one containing nine or fewer carbons. The carbon:hydrogen ratio vectors were trained to give a positive result for that particular ratio and a negative result for any other. For example, n-hexane gives a positive dot product for carbon:hydrogen ratio 2n + 2 and a negative dot product for all other ratios.

The categories appearing in Table I are defined as follows. Methyl, ethyl, and n-propyl numbers are the number of each group which can be produced by a single bond rupture; e.g., 3-methylhexane has 3 methyls, 2 ethyls, and 1 n-propyl by this definition. The largest ring classification includes saturated, unsaturated, or aromatic rings. Branch point carbon number is the number of carbon atoms in the compound which are bonded directly to at least three other carbon atoms. For number of carbon-carbon double bonds, benzene has been classed as three. The “carbons w/o hydrogens” category refers to carbons which are not bonded to any hydrogens. The final three weight vectors detect presence or absence of benzene rings, acetylenic bonds, and vinyl structural features.

Table I. CH Class

                                    Training set     Spectra  Prediction set     Per cent
Category                 Cutoff   Negative Positive feedback Negative Positive prediction
Carbon number            9             163       37      227      154       33       89.3
                         8             121       79      167      113       74       92.5
                         7              80      120      185       77      110       93.6
                         6              53      147       99       44      143       94.1
                         5              30      170       71       21      166       97.9
                         4              15      185       42        7      180       97.3
Hydrogen number          20            196        4       53      182        5       97.3
                         18            168       32      170      154       33       95.7
                         16            143       57      202      132       55       97.3
                         14            110       90       58      100       87       94.1
                         12            138       72       51       55      132       95.2
                         10             47      153       59       34      153       96.8
                         8              28      172       31       19      168       96.8
                         6             191        9       34       10      177       97.3
Carbon:hydrogen ratio    2n + 2         44      156       25      143       44       96.8
                         2n             75      125       28      107       80       96.8
                         2n - 2         47      153       36      154       33       96.8
                         2n - 4          9      191       39      180        7       98.9
                         2n - 6         15      185       13      170       17       98.9
Methyl                   4               9      191      177      165       22       90.9
                         3              40      160      800      136       51       86.1
                         2              86      114      859       97       90       86.6
                         1             138       62      648       47      140       86.6
                         0             171       29      328       24      163       89.3
Ethyl                    1              34      166     1518      153       34       80.7
                         0              96      104    >2000       97       90       73.3
n-Propyl                 1             191        9      211      177       10       90.4
                         0             145       55    >2000      148       39       71.7
Largest ring             6             199        1       11      184        3       97.9
                         5             142       58      141      137       50       89.8
                         4             121       79      149      114       73       90.9
                         3             120       80      194      112       75       91.4
Branch point carbons     2             174       26      356      163       24       90.9
                         1             108       92    >2000       90       97       61.5
                         0              42      158     1486       36      151       87.2
Number of -C=C-          2              29      171       11      165       22       98.4
                         1             159       41      153      158       29       92.5
                         0              98      102    >2000       98       89       77.5
Carbons w/o hydrogens    1             167       33      432      156       31       83.4
                         0             111       89     1327       97       90       70.6
Benzene ring             0             179       21       37      168       19       96.8
-C≡C-                    0             184       16      163      174       13       93.6
Vinyl                    0             166       34    >2000      158       29       80.2

The third and fourth columns of Table I indicate the number of compounds of the training set which fell into each class. The fifth column gives the number of feedbacks which were necessary for convergence, with >2000 indicated for those that had not been completely trained after 2000 feedbacks. (As indicated in previous work, failure to train in some arbitrary


Table II. CHON Class

                                    Training set     Spectra  Prediction set     Per cent
Category                 Cutoff   Negative Positive feedback Negative Positive prediction
Carbon number            9             141        9       49       86        7       94.6
                         8             131       19      120       81       12       87.1
                         7             124       26      220       76       17       90.3
                         6             111       39      266       71       22       83.9
                         5              80       70      516       52       41       80.0
                         4              48      102      159       35       58       82.8
Hydrogen number          20            146        4       22       92        1       98.9
                         19            143        7       73       91        2       98.9
                         18            141        9       93       91        2       98.9
                         17            138       12       84       90        3       96.8
                         16            137       13       74       90        3       96.8
                         15            130       20      159       89        4       89.2
                         14            127       23      115       89        4       92.5
                         13            116       34      279       80       13       86.0
                         12            113       37      345       78       15       80.6
                         11             92       58      254       66       21       87.1
                         10             82       68      229       63       30       82.8
                         9              55       95      237       45       48       80.6
                         8              49      101      204       39       54       81.7
                         7              33      117       80       27       66       81.7
                         6              20      130       63       20       73       83.9
                         5              14      136       62       10       83       87.1
Oxygen number            2             145        5       46       85        8       94.6
                         1             102       48      963       55       38       76.3
                         0              43      107       88       26       67       93.5
Nitrogen number          1             146        4       55       87        6       93.5
                         0              99       51       67       63       30       91.4
Carbon:hydrogen ratio    2n + 3        138       12       26        4       89       97.8
                         2n + 2        100       50      213       28       65       87.1
                         2n + 1         87       63      223       32       61       81.7
                         2n             37      113      112       63       30       91.4
                         2n - 1         31      119      113       65       28       91.4
                         2n - 2         22      128       45       71       22       94.6
                         2n - 3         17      133       62       75       18       92.5
                         2n - 4         15      135       36       79       14       93.5
                         2n - 5          9      141       19       80       13       93.5
Largest clump            9             146        4       20       90        3       96.8
                         8             131       13       96       87        6       88.2
                         7             132       18      164       81       12       89.2
                         6             125       25      168       78       15       85.0
                         5             108       42      123       66       27       86.0
                         4              89       61      133       56       37       88.2
                         3              53       97      193       37       56       85.0
                         2              29      121       75       22       71       90.3
                         1              10      140       56        4       89       95.7
Number of clumps         2             136       14      187       78       15       90.3
                         1              87       63      544       55       38       75.3
Methyl                   2             119       31      279       80       13       80.6
                         1              58       92      935       43       50       18.5
                         0              23      127      340       11       82       85.0
Ethyl                    1             124       26      326       80       13       85.0
                         0              90       60      852       52       41       68.8
Largest ring             5             128       22      112       74       19       93.6
                         4             114       36      116       66       27       93.6
Number of -C=C-          2             140       10       23       79       14       94.6
                         1             132       18       29       73       20       94.6
                         0             123       27      145       66       27       88.2
Carbonyl                 0             106       44      852       61       32       73.1
Oxygen linkage           0              92       58      582       59       34       75.3
Ether                    0             111       39     1005       75       18       76.3
Heteroatom in ring       0             116       34      288       74       19       85.0
Amines                   0             112       38       83       73       20       92.5
Alcohol                  0             129       21      122       81       12       89.2
Aromatic                 0             138       12       16       76       17       94.6
Odd hydrogen number      0             104       46      194       69       24       86.0
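The cutoff columns of Tables I and II can be read as a bank of parallel questions of the form "is the value greater than this cutoff?", whose answers together bracket the value. The Python sketch below decodes such a set of answers and also labels a carbon:hydrogen ratio relative to 2n. The function names and the way contradictory answers are reported are assumptions made for this illustration and are not taken from the paper.

```python
def decode_cutoff_answers(answers):
    """Combine parallel 'greater than cutoff?' answers into bounds.

    answers: dict mapping cutoff -> True (positive) or False (negative)
    Returns (low, high), inclusive, with None meaning 'no bound from
    these answers', or None if the answers contradict each other.
    """
    low = None   # value must exceed every cutoff answered True
    high = None  # value must not exceed any cutoff answered False
    for cutoff, positive in answers.items():
        if positive:
            low = cutoff + 1 if low is None else max(low, cutoff + 1)
        else:
            high = cutoff if high is None else min(high, cutoff)
    if low is not None and high is not None and low > high:
        return None  # assumed convention: flag inconsistent answers
    return low, high


def ch_ratio_label(n_carbon, n_hydrogen):
    """Express the hydrogen count relative to 2n, e.g. '2n + 2'."""
    offset = n_hydrogen - 2 * n_carbon
    if offset == 0:
        return "2n"
    sign = "+" if offset > 0 else "-"
    return f"2n {sign} {abs(offset)}"


# Carbon number cutoffs 9 through 4 answered for a seven-carbon compound.
seven_carbons = {9: False, 8: False, 7: False, 6: True, 5: True, 4: True}
print(decode_cutoff_answers(seven_carbons))  # (7, 7)
print(ch_ratio_label(6, 14))                 # n-hexane: 2n + 2
```

Because every weight vector is consulted, a single wrong answer shows up either as a wider bracket or as a contradiction, rather than sending the classification down a wrong branch.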


Table III. CH Training Set

[Table of compounds 1-10 listing structural categories (including carbon:hydrogen ratio, largest ring, -C≡C-, and vinyl) alongside the actual molecular structure. Structural drawings omitted.]
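The methyl, ethyl, and n-propyl numbers and the branch point carbon number defined for the CH class can be computed directly from a carbon skeleton. The following Python sketch does this for acyclic skeletons only; the adjacency-list representation, the function names, and the restriction to acyclic compounds are assumptions made for this illustration.

```python
def carbons_on_side(adj, start, blocked):
    """Carbons left attached to `start` when the start-blocked bond breaks."""
    seen = {start, blocked}
    stack = [start]
    while stack:
        atom = stack.pop()
        for nbr in adj[atom]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    seen.discard(blocked)
    return seen


def single_rupture_counts(adj):
    """Count methyl, ethyl and n-propyl groups released by one bond rupture.

    adj: dict mapping each carbon to the list of carbons bonded to it
    (acyclic skeleton assumed, so each rupture gives two fragments).
    """
    counts = {"methyl": 0, "ethyl": 0, "n-propyl": 0}
    for a in adj:
        for b in adj[a]:
            # Fragment containing a; the other fragment is handled
            # when the same bond is visited from b.
            side = carbons_on_side(adj, a, b)
            if len(side) == 1:
                counts["methyl"] += 1
            elif len(side) == 2:
                counts["ethyl"] += 1
            elif len(side) == 3:
                # n-propyl only if the attachment carbon ends the chain
                # (attachment at the middle carbon would be isopropyl).
                if sum(1 for n in adj[a] if n in side) == 1:
                    counts["n-propyl"] += 1
    return counts


def branch_point_carbons(adj):
    """Carbons bonded directly to at least three other carbons."""
    return sum(1 for a in adj if len(adj[a]) >= 3)


# 3-Methylhexane: chain C1-C2-C3-C4-C5-C6 with C7 attached to C3.
methylhexane = {1: [2], 2: [1, 3], 3: [2, 4, 7], 4: [3, 5],
                5: [4, 6], 6: [5], 7: [3]}
print(single_rupture_counts(methylhexane))
print(branch_point_carbons(methylhexane))
```

For 3-methylhexane this reproduces the counts quoted in the category definitions (3 methyls, 2 ethyls, and 1 n-propyl), with one branch point carbon.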

number of feedbacks does not prove the given training set to be linearly inseparable, and, indeed, incompletely trained vectors may still have considerable recognition and predictive ability.) The sixth and seventh columns indicate the number of compounds in the prediction set which fell into each class. The final column gives the prediction success for the 187 compounds which were not part of the training set. It should be stressed that these compounds were treated by the machine in every way as complete unknowns. The only difference from real unknowns is that the results of these computations may be evaluated. Predictive ability ranged from 61.5% to 98.9% with an average of 90.3%. The prediction percentage can be used as a gauge of the credibility of an answer when produced for an unknown spectrum. Random guessing would give 50% success. It is seen that in the CH class considerable structural information may be derived with a high confidence level from a completely empirical calculation method.

CHON Class. From the 243 CHON compounds, a subset of 150 was chosen for training and the other 93 were used to test the predictive ability of the developed weight vectors. Table II shows the results of training the 65 weight vectors developed for various aspects of structure. The categories appearing in Table II are defined as follows. The clump number of a compound is the number of carbon atoms in the largest aggregate of carbon atoms bonded together. Methyl, ethyl, largest ring, and number of double bonds are as in the CH class. Carbonyl presence means that there is a carbon-oxygen double bond in the compound. Oxygen linkage means that two carbon atoms are bonded together by an oxygen bridge. Ether and the rest of the categories are defined in the conventional manner. In each case, the weight vector was trained to give a positive answer if the value was greater than the cutoff and a


Table IV. CH Prediction Set

[Table of compounds 1-10 with columns: carbon number, hydrogen number, carbon:hydrogen ratio, methyl, ethyl, propyl, largest ring, branch point carbons, number of -C=C-, carbon without hydrogen, benzene ring, triple bond, vinyl group, incorrect classifications, predicted molecular formula, actual molecular formula, actual molecular structure, and predicted molecular structure. Structural drawings omitted.]
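The clump categories of the CHON class depend only on which carbon atoms are bonded directly to one another. A minimal Python sketch follows, assuming a molecule is supplied as a list of element symbols for the heavy atoms plus a list of bonded index pairs; this representation and the function name `carbon_clumps` are hypothetical choices made for illustration, not the paper's own data format.

```python
def carbon_clumps(elements, bonds):
    """Sizes of the groups of mutually bonded carbon atoms, largest first.

    elements: element symbols of the heavy atoms, e.g. ["C", "C", "O", "C"]
    bonds:    pairs of indices into `elements` that are bonded
    """
    carbons = {i for i, symbol in enumerate(elements) if symbol == "C"}
    # Keep only carbon-carbon bonds; heteroatoms separate the clumps.
    adj = {i: set() for i in carbons}
    for a, b in bonds:
        if a in carbons and b in carbons:
            adj[a].add(b)
            adj[b].add(a)
    sizes = []
    unvisited = set(carbons)
    while unvisited:
        stack = [unvisited.pop()]
        size = 1
        while stack:
            atom = stack.pop()
            for nbr in adj[atom]:
                if nbr in unvisited:
                    unvisited.remove(nbr)
                    stack.append(nbr)
                    size += 1
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Ethyl methyl ether, CH3-CH2-O-CH3 (hydrogens left out).
ether_clumps = carbon_clumps(["C", "C", "O", "C"], [(0, 1), (1, 2), (2, 3)])
print(ether_clumps)  # [2, 1]
```

Here the largest clump is the first entry of the returned list and the number of clumps is its length, so ethyl methyl ether has a largest clump of 2 and 2 clumps in all.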