Computerized Learning Machines Applied to Chemical Problems: Interpretation of Infrared Spectrometry Data

B. R. Kowalski,1 P. C. Jurs,2 T. L. Isenhour3
Department of Chemistry, University of Washington, Seattle, Wash. 98105

C. N. Reilley
Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514

Learning machine methods with computer implementation have been applied to 4500 infrared spectra. A learning criterion is presented, and compounds are placed into chemical classes by weight vectors developed from a representative training set. The learning machine method is compared to a classification method based on comparing the spectrum to average spectra and is shown to be superior in each of the cases tested. Included are studies of training set size, even and uneven training set distributions, a no-decision criterion for detecting patterns difficult to classify, and a method of assigning confidence levels to predictions.

COMPUTER INTERPRETATION of infrared spectra for the identification of chemical functional groups and structural formulas has been based almost exclusively either on recognition of characteristic absorption bands or on matching unknown spectra with standard spectra from a library. A variety of computer systems have been developed to implement the search and compare method (1-3); however, this approach suffers from the necessarily large library of data which must be stored and from the searching times involved. A fast search method (~90 sec for 90,000 compounds) has been developed (3), but it requires a computer with a large disk storage. Learning machine theory as implemented with computer methods has proved successful in the empirical interpretation of low resolution mass spectrometry data (4-7). This paper describes the investigation of learning machine methods for the interpretation of four thousand five hundred infrared spectra and the classification of compounds into chemical classes.

A CRITERION FOR MEASURING LEARNING

To determine whether an empirical interpretation process is actually developing a meaningful relation between patterns and the categories into which they are being classified, a criterion for learning was developed based on simple probability theory.

1 Present address: Shell Development Company, 1400 53rd Street, Emeryville, Calif. 94608
2 Present address: Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802
3 Present address: Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514

(1) R. O. Crisler, ANAL. CHEM., 40, 246R (1968).
(2) D. N. Anderson and G. L. Covert, ibid., 39, 1288 (1967).
(3) D. S. Erley, ibid., 40, 894 (1968).
(4) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, ANAL. CHEM., 41, 21 (1969).
(5) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., 41, 690 (1969).
(6) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ibid., 41, 695 (1969).
(7) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., 41, 1949 (1969).

If a set of patterns is to be classified into two categories, and the set has an equal number of patterns in each category, then random guessing would give a 0.5 success rate. Furthermore, even if the distribution were not 50/50, random guessing would still produce a 0.5 success rate. If, however, the classification process employed the same distribution as the pattern set, but still in a randomized fashion, the success rate would be p² + q², where p is the probability of category 1 and q is the probability of category 2. (The probability of having a pattern in category 1 and guessing category 1 would be p²; likewise, the probability of having a pattern in category 2 and guessing category 2 would be q². Hence, the total probability of success would be p² + q².) Because, for a binary case,

p + q = 1    (1)

the success rate (r) for such a method would be

r = 1 − 2p + 2p²    (2)

On the other hand, if the more populous category (defined to be category 1 in this case) were guessed every time, the success rate would be just p. Table I shows that always guessing the more populous category yields a higher success rate than imposing the actual distribution upon the guessing sequence, except for the cases p = 0.5 and p = 1.0, where the two methods of guessing become equivalent. Hence, more right answers are obtained by always choosing the more frequent category. However, this method is meaningless because no question is answered in the process. Table I is used as a criterion for learning in this work, and a classification method that produces more right answers than just using the training set distribution is considered to have learned something about the relation between the patterns and the categories. Furthermore, a classification method that exceeds the success rate of always guessing the more populous category not only has learned something, but is useful in that it produces more right answers than any random or fixed guessing method.
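The arithmetic behind Table I is easy to verify. The short sketch below (ours, not part of the original paper) evaluates Equation 2 for the tabulated distributions and prints the result beside the majority-guess rate p:

```python
# Reproduces Table I: success rate of guessing with the imposed class
# distribution, r = p^2 + q^2 = 1 - 2p + 2p^2, versus always guessing
# the more populous category, which succeeds at rate p.

def imposed_distribution_rate(p: float) -> float:
    """Random guessing that matches the true distribution (p, 1 - p)."""
    q = 1.0 - p
    return p * p + q * q            # equivalently 1 - 2p + 2p^2

def majority_guess_rate(p: float) -> float:
    """Always guessing the more populous category (assumes p >= 0.5)."""
    return p

for i in range(11):
    p = 0.50 + 0.05 * i
    print(f"p = {p:.2f}   r = {imposed_distribution_rate(p):.3f}   "
          f"majority = {majority_guess_rate(p):.2f}")
```

Except at p = 0.5 and p = 1.0, the majority guess always wins, which is the point of the criterion.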

Table I. Success Rate for Guessing with Imposed Distribution

Fraction in     Fraction in     Success
category 1      category 2      rate, r
0.50            0.50            0.500
0.55            0.45            0.505
0.60            0.40            0.520
0.65            0.35            0.545
0.70            0.30            0.580
0.75            0.25            0.625
0.80            0.20            0.680
0.85            0.15            0.745
0.90            0.10            0.820
0.95            0.05            0.905
1.00            0.00            1.000


Table II. Classifications Using Sum Spectra

                             Number     Number     ----- Per cent correct -----
Chemical class               positive   negative   Positive   Negative   Total
Carboxylic acids             683        3817       80         73         74
Esters of carboxylic acids   551        3949       93         45         51
Linear amides                309        4191       90         68         69
Ketones                      421        4079       95         44         49
1° Alcohols                  329        4171       95         48         51
Phenols                      403        4097       80         72         73
1° Amines                    491        4009       91         57         61
Ethers and acetals           708        3792       70         71         71
Nitro, nitroso compounds     304        4196       86         82         82
Cyclic amides                251        4249       89         70         71
Nitriles, isonitriles        201        4299       87         65         66
Urea and derivatives         110        4390       97         59         60
Aldehydes                    130        4370       82         69         69
2° Alcohols                  261        4239       95         39         43
3° Alcohols                  103        4397       97         39         40
2° Amines                    386        4114       71         63         64
3° Amines                    711        3789       54         78         74
1° Amine salts               147        4353       82         76         76
Unsaturated hydrocarbons     115        4385       98         40         42

DATA AND COMPUTATIONS

The learning machine methods used in this study, and the computer programs necessary to implement them, have been thoroughly discussed in references (4) and (5). The data used in this study were made available on loan by the Sadtler Research Laboratories, Inc., Philadelphia, Pa., and are part of the same data used in the Sadtler IR Prism Retrieval System. From the 24,142 spectra of standard compounds, the first 4500 were selected that satisfied the requirements of no more than ten carbon atoms, four oxygen atoms, three nitrogen atoms, and no additional elements except hydrogen. These Sadtler spectra are recorded in 0.1 μ bands from 2.0 to 14.9 μ. This gives a total of 130 pattern dimensions, including the d + 1 term. The amplitude of each pattern component is assigned one of four values based on the intensity of the strongest absorption in that 0.1 μ band. For the largest peak in the spectrum, the amplitude is set at 3.0; for the largest peak in a 1.0 μ band, the amplitude is 2.0; for other peaks, the amplitude is 1.0; and for no peak in a given 0.1 μ band, the amplitude is set at 0.0.

Because the amplitude data were limited to only three nonzero values, and the majority of the values were zero, it was useful to compress the patterns to save computer storage and decrease computation time. Each pattern was, therefore, rewritten as a series of integers as follows:

n_1, p_1, p_2, p_3, ..., p_{n_1}, n_2, q_1, q_2, q_3, ..., q_{n_2}, n_3, r_1, r_2, r_3, ..., r_{n_3}

where n_1 is the number of peaks with amplitudes of 1.0 followed by the dimensions (or positions) of those peaks, n_2 is the number of peaks with amplitudes of 2.0 followed by the dimensions of those peaks, and n_3 is the number of peaks with amplitudes of 3.0 followed by the dimensions of those peaks. The resultant patterns required less than one third the storage of the uncompressed spectra. Furthermore, the dot product process used to form the scalars necessary for classification could be performed as follows. If W is the weight vector and Y is the pattern for which the dot product is to be formed, then

W·Y = w_1 y_1 + w_2 y_2 + w_3 y_3 + ... + w_{d+1} y_{d+1}    (3)

However, if Y contains only values of 0.0, 1.0, 2.0, and 3.0, the dot product may be computed by

W·Y = 1.0 (w_{p_1} + ... + w_{p_{n_1}}) + 2.0 (w_{q_1} + ... + w_{q_{n_2}}) + 3.0 (w_{r_1} + ... + w_{r_{n_3}})    (4)
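As a concrete illustration, the following sketch (ours; the original programs were written in FORTRAN IV, and all names here are hypothetical) carries out the compression just described and the shortcut dot product of Equation 4:

```python
# Spectra are assumed given as 130 amplitudes drawn from {0.0, 1.0, 2.0, 3.0}.

def compress(pattern):
    """Rewrite a spectrum as n1, p1..pn1, n2, q1..qn2, n3, r1..rn3:
    for each nonzero amplitude level, the count of peaks at that level
    followed by their dimensions (positions)."""
    out = []
    for level in (1.0, 2.0, 3.0):
        positions = [j for j, y in enumerate(pattern) if y == level]
        out.append(len(positions))
        out.extend(positions)
    return out

def fast_dot(weights, compressed):
    """W.Y by Equation 4: visit only the stored peak positions, scaling
    each block by its amplitude, instead of all 130 components."""
    total, i = 0.0, 0
    for level in (1.0, 2.0, 3.0):
        n = compressed[i]
        total += level * sum(weights[j] for j in compressed[i + 1 : i + 1 + n])
        i += 1 + n
    return total
```

Because only the stored peak positions are visited, the work per dot product scales with the number of peaks rather than with the 130 pattern dimensions, which is the source of the speedup reported below.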

For the patterns used, this method of computing dot products decreased computation time by roughly a factor of twenty. In all parts of this study, the patterns were classified on the basis of the chemical classes supplied with the Sadtler spectrum for each compound. The computers used in this study were the University of Washington Computer Center IBM 7040/7094 direct couple system and the Triangle Universities Computation Center IBM 360/75 teleprocessing with the University of North Carolina Computation Center IBM 360/40. All programming was done in FORTRAN IV.

RESULTS AND DISCUSSION

Normalized Sum Spectra. To estimate how well an individual might empirically classify infrared spectra on the basis of comparison to a large number of standard spectra for chemical class identification, sum spectra were composed from the set of 4500. This was accomplished by adding the individual components of all the spectra of compounds in the positive category for a given chemical class, normalizing the resultant spectrum, and then subtracting a similarly summed and normalized spectrum of all the spectra of compounds in the negative category. The resultant average spectrum contains positive peaks where they are a strong indication that the compound involved belongs to the chemical class, and negative peaks where the compound is not likely to belong to the class. Such average spectra were then used to classify the compounds from which they were developed. Table II shows the results of nineteen such computations. The per cent correct classifications range from 40 for 3° alcohols to 82 for nitro and nitroso compounds. The figures which show less than 50 per cent success (the rate that would be produced by completely random guessing) suggest that the noise level of this method is probably around ±10%, and the upper level suggests the rate of success an individual might attain by directly trying to interpret spectra using data in this form.
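A minimal sketch of this construction (ours, not the authors' code) follows; the paper does not state how the summed spectra were normalized or exactly how the average spectrum was applied as a classifier, so unit-length normalization and classification by the sign of the projection are our assumptions:

```python
import numpy as np

def sum_spectrum(positives: np.ndarray, negatives: np.ndarray) -> np.ndarray:
    """Difference of normalized class-sum spectra.  positives and
    negatives: arrays of shape (n_spectra, n_channels)."""
    pos = positives.sum(axis=0)
    neg = negatives.sum(axis=0)
    pos = pos / np.linalg.norm(pos)      # normalization method assumed
    neg = neg / np.linalg.norm(neg)
    return pos - neg                     # + peaks argue for the class, - against

def classify(spectrum: np.ndarray, template: np.ndarray) -> bool:
    """Assumed decision rule: positive projection onto the average
    spectrum places the compound in the chemical class."""
    return float(spectrum @ template) > 0.0
```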


Table III. Training with Uneven Populations

                             Positive members   Number     Per cent     Positive members   Total
Chemical class               of training set    fed back   prediction   overall            wrong
Carboxylic acids             64                 926        78.2         583                764
Esters of carboxylic acids   83                 437        75.3         480                514
Linear amides                27                 123        86.4         272                477
Ketones                      32                 255        83.2         374                587
1° Alcohols                  33                 251        87.9         307                423
Phenols                      60                 286        81.8         365                636
1° Amines                    50                 977        82.2         436                624
Ethers and acetals           78                 3429       78.5         630                752
Nitro, nitroso compounds     16                 64         91.3         268                306
Cyclic amides                7                  49         90.7         222                325
Nitriles, isonitriles        25                 166        93.8         162                217
Urea and derivatives         3                  45         94.3         103                199
Aldehydes                    21                 155        90.8         124                323
2° Alcohols                  31                 310        89.5         228                369
3° Alcohols                  20                 179        94.6         97                 189
2° Amines                    37                 407        82.8         334                604
3° Amines                    50                 661        77.1         613                803
1° Amine salts               6                  61         92.9         125                249
Unsaturated hydrocarbons     28                 112        93.0         115                246

This does not mean, however, that greater success could not be achieved by interpreting a continuous spectrum which had not been digitized and reduced, nor does it mean that identification cannot be accomplished by comparison of an unknown spectrum to a library of standard spectra one at a time, as most searching routines do. However, the object of this study is to determine whether classification may be accomplished by direct comparison of a spectrum to an empirically derived classification factor, which is much quicker than the arduous searching of a large library of spectra. Hence, Table II serves as an estimate of the upper level of success that an investigator might have using his own experience and intuition.

Learning Machine Method. Each of the nineteen chemical classes was next treated by a linear learning machine to develop weight vectors. Four thousand spectra were used in each case, with five hundred randomly selected as the training set and the other three thousand five hundred used as a prediction set. Table III shows the results of this approach. In every case the learning machine method shows a better prediction success after the investigation of five hundred spectra than the average spectrum method, even though the average spectrum was developed from the compounds on which it was tested. Furthermore, in every case, the learning machine completely converged for each training set of five hundred and thereby had 100% recognition for the patterns with which it was developed. However, in view of the learning criterion established above, it is seen from Table III that in almost every case the percentage in the negative category is greater than the percentage predictive success. This means that a greater number of right answers could be obtained by always guessing that the compound was not in the class in question. This again amounts to not answering the question at all. Hence, with such uneven categories there is question as to whether the learning machine approach is useful. For this reason, subsets of the entire set were chosen randomly to give training sets with equal numbers in each category.
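The training algorithm itself is detailed in references (4) and (5); the sketch below is our reconstruction of such an error-correction trainer, consistent with the description here: classify by the sign of W·Y and, on each mistake, apply a feedback correction, repeating until the whole training set is recognized. The particular correction increment (reflecting the scalar to the correct side) is an assumption:

```python
import numpy as np

def train_weight_vector(patterns, labels, init=1.0, max_feedbacks=20_000):
    """patterns: (n, d+1) array with the constant d+1 component included;
    labels: +1 if in the class, -1 if not; init: +1.0 or -1.0 starting
    value for every weight component, as in Table V."""
    w = np.full(patterns.shape[1], init)
    feedbacks = 0
    while feedbacks < max_feedbacks:
        errors = 0
        for y, s in zip(patterns, labels):
            scalar = w @ y
            if s * scalar <= 0:                  # wrong side of the hyperplane
                # Feedback: shift W so the new scalar is the negative of
                # the old one (the pattern lands as far on the correct
                # side as it was on the wrong side).
                c = -2.0 * scalar / (y @ y) if scalar != 0 else s / (y @ y)
                w = w + c * y
                feedbacks += 1
                errors += 1
        if errors == 0:                          # complete training: 100% recognition
            return w, feedbacks, True
    return w, feedbacks, False                   # e.g., the 500-pattern case in Table IV
```

The feedback counter here corresponds to what we take to be the "number fed back" reported in Tables III through V.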

Table IV. Training for Carboxylic Acids with Different Training Set Sizes

Training set size   Number of feedbacks   Prediction percentage
10                  5                     62
20                  10                    63
30                  10                    61
40                  15                    56
50                  22                    61
100                 36                    67
150                 103                   71
200                 167                   71
250                 199                   72
300                 341                   73
350                 398                   71
400                 597                   71
450                 1816                  68
500                 (a)                   71

(a) Complete training not obtained after 20,000 feedbacks (Recognition = 83.7%).

For such a set of three hundred compounds, one hundred and fifty of which contained nitro or nitroso groups, complete training occurred and a prediction of 80% was recorded for a different prediction set, also randomly chosen and also containing one hundred fifty compounds in each category. A random guessing method would give only 50% success, so it can be concluded that the classifier is indeed learning about the relation between a compound's infrared spectrum and its chemical structure. To determine the influence training set size has on the prediction success, the carboxylic acid group was treated with even training sets varying from 10 to 500 patterns. The results are shown in Table IV. While there are some statistical fluctuations in the data, little improvement is noted for training sets above one to two hundred. For this reason, training sets of three hundred were chosen for the remainder of the investigations. The nine chemical classes for which there were enough compounds to form balanced training sets of three hundred and balanced prediction sets of three hundred were used to test the learning machine method on infrared data. The results are given in Table V.


Table V. Training Using Even Categories

Training Set 300 (150/150); Prediction Set 300 (150/150)

                             Weight vector initiated positive   Weight vector initiated negative
                             Number of      Per cent            Number of      Per cent
Chemical class               feedbacks      prediction          feedbacks      prediction
Carboxylic acids             341            74                  291            74
Esters of carboxylic acids   304            76                  258            76
Linear amides                153            78                  140            80
Ketones                      411            73                  372            73
1° Alcohols                  1015           69                  708            69
Phenols                      205            76                  127            82
1° Amines                    841            71                  586            69
Ethers and acetals           2291           63                  1595           62
Nitro, nitroso compounds     157            81                  127            82

Table VI. Classes of Compounds Incorrectly Classified While Predicting for Carboxylic Acids

                             ------- Number wrong -------
Chemical class               With acid   Without acid   Total wrong
Esters of carboxylic acids   0           7              7
Linear amides                3           9              12
Aldehydes                    0           8              8
Ketones                      2           5              7
1° Alcohols                  3           2              5
Phenols                      7           5              12
1° Amines                    6           3              9
2° Amines                    6           3              9
3° Amines                    3           3              6
1° Amine salts               3           1              4
Ethers and acetals           5           9              14
Nitro, nitroso compounds     0           4              4

For each chemical class, two weight vectors were developed, one with all initial components set equal to +1.0 and the other with all initial components at −1.0. Complete training was accomplished in every case, and predictive abilities were all above random, with the highest around 80%. Furthermore, it should be realized that complete training means the weight vector is able to perfectly classify spectra of compounds which were in the training set. Hence, in these particular cases, a dot product calculation can answer any given question without resorting to the lengthy task of comparing a spectrum to each of the three hundred members of the library.

For one of the nine cases treated, the carboxylic acid class, the classes of compounds which were missed more than twice in the prediction set were tabulated as shown in Table VI. It is interesting to note that a number of the incorrectly classified compounds which did not contain carboxyl groups actually contained oxygen atoms in carbonyl and hydroxy groups. This implies that the learning machine method is depending heavily on the carbon-oxygen absorption modes as an indication of carboxyl presence, which is just what a theoretical interpretation would likely do.

As discussed in earlier work, weight vectors from different starting points may have different predictive abilities even though both were produced from the same training set, which had been perfectly classified. The hyperplanes represented by the weight vectors so developed act, in effect, to partially describe a no-decision region of the pattern space. That is, any pattern which falls inside this portion of the hyperspace was not represented in the training set and will not have as great a probability of correct classification as one falling outside the no-decision region. To verify this concept, the two different weight vectors developed for each chemical class in Table V were used to reclassify each member of the prediction set in the following manner. If both weight vectors gave the same classification, then the compound was placed in that category. However, when the weight vectors disagreed, the compound was not classified at all, but rather indicated as a compound for which the classification was outside the capability of the learning machine as presently structured. The results are shown in Table VII. It may be seen by comparing Table VII with Table V that in every case the prediction made by the second method was better than or equal to the prediction by a single weight vector. However, with the no-decision method fewer patterns are classified. For example, in the case of the nitro or nitroso class, 81% are correctly predicted by the positively initiated weight vector and 82% by the negatively initiated weight vector, while 87% of the predictions using the no-decision process are correct. However, 15.3% of the compounds fell into the no-decision region. It must be realized that random guessing of the 15.3% in the no-decision region would reduce the success to the 81-82% of the single vector method. In other words, it is seen (and was true in virtually every case tested) that no real gain is made by classifying patterns in the no-decision region. The logical extension of this approach would be to develop as many different weight vectors as convenient from different initial vectors and use a multiple-decision process in which the confidence in the result would be a function of the number of vectors that give agreeing classifications.
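A sketch of the no-decision rule, along with the multiple-vector extension suggested above (the details are ours):

```python
import numpy as np

def no_decision_predict(w_plus, w_minus, pattern):
    """Classify only when the positively and negatively initiated weight
    vectors agree; return None for the no-decision region."""
    a = np.sign(w_plus @ pattern)
    b = np.sign(w_minus @ pattern)
    return int(a) if a == b and a != 0 else None

def committee_predict(weight_vectors, pattern):
    """Suggested extension: confidence as the fraction of independently
    trained weight vectors giving the same classification."""
    votes = sum(int(np.sign(w @ pattern)) for w in weight_vectors)
    if votes == 0:
        return None, 0.0
    return (1 if votes > 0 else -1), abs(votes) / len(weight_vectors)
```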

Table VII. Results of Prediction Using Two Weight Vectors Simultaneously

Prediction Set 300 (150/150)

                             ---- Per cent performance ----
                             Predicted   Predicted              Per cent   Per cent   Per cent
Chemical class               positive    negative    Average    correct    wrong      not decided
Carboxylic acids             75.7        81.7        78.7       66.0       18.3       15.7
Esters of carboxylic acids   81.8        82.0        81.9       67.7       15.0       17.3
Linear amides                82.8        87.1        85.0       70.7       12.7       16.7
Ketones                      73.9        82.4        78.2       64.7       18.7       16.7
1° Alcohols                  73.9        71.1        72.5       62.0       23.7       14.3
Phenols                      76.5        84.1        80.3       68.0       17.3       14.7
1° Amines                    72.2        75.0        73.6       63.0       23.0       14.0
Ethers and acetals           63.5        64.9        64.2       57.3       32.0       10.7
Nitro and nitroso            88.5        85.5        87.0       73.7       11.0       15.3


Another available piece of information which might give some confidence to the classification of unknowns is the amplitude of the scalar produced by the dot product. While only the sign of the dot product is used in the binary decision maker, its size should still be some indication of how closely the pattern simulates the typical pattern for a given category. In other words, small scalars are indicative of patterns which lie close to the no-decision region and are difficult to classify correctly, while those with large scalars lie far on one side of the decision surface and are likely to be correctly classified. For the case of the carboxyl group classification, the numbers of right and wrong classifications were tabulated as a function of the scalar amplitude and are given in Table VIII. While the function is not smooth, probably because of the relatively small number of patterns in each scalar range, it is certainly evident that the confidence increases rapidly as the scalar moves away from the central zero value and approaches 100% for very large values.
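The tabulation behind Table VIII can be sketched as follows (ours; the bin width of 2 matches the table, the rest is assumption):

```python
import math

def confidence_table(scalars, correct, bin_width=2.0):
    """scalars: W.Y values for a prediction set; correct: parallel
    booleans.  Returns {bin lower edge: (right, wrong, per cent)}."""
    counts = {}
    for s, ok in zip(scalars, correct):
        edge = bin_width * math.floor(s / bin_width)
        right, wrong = counts.get(edge, (0, 0))
        counts[edge] = (right + ok, wrong + (not ok))
    return {e: (r, w, round(100.0 * r / (r + w)))
            for e, (r, w) in sorted(counts.items())}
```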

Table VIII. Confidence Intervals for Carboxylic Acid Prediction

Calculated   Number    Number      Total   Per cent
scalar       correct   incorrect           confidence
−16          1         0           1       100
−14          8         0           8       100
−12          11        1           12      92
−10          22        2           24      92
−8           21        7           28      75
−6           39        9           48      81
−4           41        13          54      76
−2           41        17          58      71
NEG          (37)      (29)        66      56
POS          48        24          72      67
+2           47        27          74      63
+4           43        11          54      80
+6           27        7           34      79
+8           26        5           31      84
+10          13        4           17      76
+12          11        2           13      85
+14          5         0           5       100
+16          2         0           2       100
+18          0         0           0       —
+20          0         0           0       —
+22          1         0           1       100

CONCLUSIONS

Learning machine methods employing computer implementation have been shown capable of classifying infrared spectra on the basis of the chemical classes of the compounds which produced them. While the success of prediction varies, complete training and recognition has been accomplished for sets of patterns as large as five hundred. In any data set as extensive as that used in this study, the error rate due to recording and transferring information from one form to another is likely to be more than negligible. Hence, training for larger sets may be limited by the fact that some wrong classifications exist among the standard set. (This is not meant to imply that large data sets are not useful, because the same error production process is going to occur in collecting unknowns to compare to the set.) However, weight vectors which are successfully developed for training sets of useful size may be used in place of library search methods for classifying compounds.

ACKNOWLEDGMENT

The authors wish to acknowledge Sadtler Research Laboratories, Inc. for its help and cooperation in providing the data used in this research.

RECEIVED for review July 1, 1969. Accepted August 14, 1969. Research supported by the National Science Foundation.
