Investigation of combined patterns from diverse analytical data using

Analytical Data Using Computerized Learning Machines. P. C. Jurs,1 B. R. Kowalski,2 and T. L. Isenhour8. Department of Chemistry, University of Washin...
amplitude of the scalar produced by the dot product. While only the sign of the dot product is used in the binary decision maker, its size should still give some indication of how closely the pattern simulates the typical pattern for a given category. In other words, small scalars are indicative of patterns which lie close to the no-decision region and are difficult to classify correctly, while those with large scalars lie far on one side of the decision surface and are likely to be correctly classified. For the case of the carboxyl group classification, the numbers of right and wrong classifications were tabulated as a function of the scalar amplitude and are given in Table VIII. While the function is not smooth, probably because of the relatively small number of patterns in each scalar range, it is certainly evident that the confidence increases rapidly as the scalar moves away from the central zero value and approaches 100% for very large values.
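The tabulation described above can be illustrated with a minimal sketch. It assumes category labels coded as +1 and -1 and a bin width of 2, which mirrors the scalar steps that appear in Table VIII; the function names and the toy data are hypothetical and are not taken from the original programs.

```python
import math
from collections import defaultdict


def dot(weights, pattern):
    """Scalar produced by the dot product of a weight vector and a pattern."""
    return sum(w * x for w, x in zip(weights, pattern))


def confidence_by_scalar(weights, patterns, labels, bin_width=2.0):
    """Tabulate right and wrong classifications as a function of the scalar.

    labels are +1 or -1 (assumed coding). Returns a dict mapping the lower
    edge of each scalar bin to (number correct, total, per cent correct).
    """
    counts = defaultdict(lambda: [0, 0])  # bin edge -> [correct, total]
    for pattern, label in zip(patterns, labels):
        s = dot(weights, pattern)
        predicted = 1 if s > 0 else -1  # only the sign decides the category
        edge = math.floor(s / bin_width) * bin_width
        counts[edge][1] += 1
        if predicted == label:
            counts[edge][0] += 1
    return {
        edge: (correct, total, 100.0 * correct / total)
        for edge, (correct, total) in sorted(counts.items())
    }


if __name__ == "__main__":
    w = [1.0, -1.0]
    pats = [[3.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 3.0]]
    labs = [1, -1, -1, -1]
    for edge, row in confidence_by_scalar(w, pats, labs).items():
        print(edge, row)
```

In this toy case the only misclassified pattern falls in the bin next to zero, which is the behavior the text attributes to patterns near the no-decision region.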

Table VIII. Confidence Intervals for Carboxylic Acid Prediction

Number correct

Number incorrect

- 16 - 14 - 12 - 10

1 8 11 22 21 39 41 41

0 0 1

-8 -6 -4 -2

NEG 0 $2 $4 $6 $8 $10 +12 +14 +16 +18 +20 $22

CONCLUSIONS

Learning machine methods employing computer implementation have been shown capable of classifying infrared spectra on the basis of the chemical classes of the compounds which produced them. While the success of prediction varies, complete training and recognition has been accomplished for sets of patterns as large as five hundred. In any data set as extensive as that used in this study, the error rate due to recording and transferring of information from one form to another is likely to be more than negligible. Hence training for larger sets may be limited by the fact that some wrong classifications exist among the standard set. (This is not meant to imply that large data sets aren't useful, because the same error production process is going to occur in collecting unknowns to compare to the set.) However, weight vectors which are successfully developed for usefully sized training sets may be used in place of library search methods for classifying compounds.


Calculated scalar

(37) 48 47 43 27 26 13 11 5 2 0 I

Total

2 7 9 13 17 POS (29) 24 27 11 7 5 4 2 0 0 0 0

Per cent confidence

12 24 28 48 54 58 66 72 74 54 34 31 17 I3 5 2

100 100

0 1

100


67 63 80 79 84 76 85

ACKNOWLEDGMENT

The authors wish to acknowledge Sadtler Research Laboratories, Inc. for its help and cooperation in providing the data used in this research.

RECEIVED for review July 1, 1969. Accepted August 14, 1969. Research supported by the National Science Foundation.

Investigation of Combined Patterns from Diverse Analytical Data Using Computerized Learning Machines

IO0 100 92 92 75 81 62 57

1 8


P. C. Jurs,1 B. R. Kowalski,2 and T. L. Isenhour3
Department of Chemistry, University of Washington, Seattle, Wash. 98105

C. N. Reilley
Department of Chemistry, University of North Carolina, Chapel Hill, N.C. 27514

The learning machine is applied to the interpretation of patterns produced by combining mass spectra, infrared spectra, and melting and boiling points. Through a generalized learning procedure using negative feedback, the machine evaluates which data from each source are most relevant to answering a given question. Parameter reduction methods are applied to reduce the number of input parameters and minimize the required data and storage of trained weight components.

EARLIER WORK has shown that learning machine theory combined with computer methods may be successfully applied to the interpretation of complex data from such sources as infrared spectroscopy and mass spectrometry (1-5). While these two techniques along with a number of other analytical methods for investigating chemical systems provide a variety of information relating to a given question, usually no single technique is sufficient to answer complex questions such as a chemical structure determination. The conclusions of most investigations rest, in fact, upon information derived from a variety of sources.

1 Present address, Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802
2 Present address, Shell Development Co., 1400 53rd Street, Emeryville, Calif. 94608
3 Present address, Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, ANAL. CHEM., 41, 21 (1969).
(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., 41, 690 (1969).
(3) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ibid., 41, 695 (1969).
(4) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, unpublished data, 1969.
(5) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ANAL. CHEM., 41, 1945 (1969).

ANALYTICAL CHEMISTRY, VOL. 41, NO. 14, DECEMBER 1969

Table I. Double Bond Presence

1. IR

No. of

Parameters

feedbacks

262 162 125

156 144 177 261

100

70 50

Parameters

feedbacks

262 162 125

1182 1024 1005 1056 1I76 1845 >5000

100

70 50

30 20 10

5 No. of

Parameters

feedbacks

262 162 125 100 70

141 149 145 133 121 128 >5000 >5000 >5000

50

30 20 10

>5000

5

Parameters 262 162 125 100

70 50

30 20 10

5

No. of feedbacks

87 83 90 88 121 129 468 1520 >5000 >5000

feedbacks

Recognition

X

79 83 82 77 62 47

1105 958 1529 1677 >5000 >5000 >5000 >5000

X X

50

69

3. Combined patterns Per cent Recognition prediction 86 84 87 85 85 85 79 82 61 61

X X

X

X X X

182

>5005 >5000 >5000

179

135 142

4. Combined patterns Per cent Recognition prediction X X X X X X

186

140 136 139

X X

176 129

87 88 89 89 91 90 89 92 84 64


87 88 85 84 81 81 69

X X

179 178 141 131

56

1361126 9916 92/33 88/12 6911 5010 3010 2010 1010

510

MS/IR

feedbacks

1361126 521110 24/101 11/89 1/69

105 82 82 87 93 169 146 >5000 >5000 >5000

0150

0130 0120 0110 015

5 . Combined patterns RecogPer cent nition prediction X X X X

X X X

174 164 137

89 88 89

90 89 89 92 88 75 62

MS/IR 1361126 76/86 61/64 46/54 30140 23/27 13/17 9/11 614 411

MS/IR/other 136112612 7617612 5517012 4415612 3113912 2512512 l5/15/2 121812 8/2/2 51111

Most sophisticated methods of computer data interpretation, however, are limited to data from a single source. For example, a calculation method designed to determine possible molecular formulas from mass spectrometry fragments as a guide to chemical structures cannot use information from an infrared spectrum or gas chromatographic investigation. This restriction usually exists because the interpretation method has been derived specifically from theoretical considerations of the type of data for which it is to be used. Pattern recognition using learning machine methods, on the other hand, is a completely empirical approach and can simultaneously use data from a variety of sources to derive decision processes characteristic of the overall features of the

Per cent prediction

MS/IR

No. of

78 80 83 80 81 78 69 52 52 55

6. Combined patterns Per cent Recognition prediction X X X X X X

2. MS No. of

prediction

168 136 129 129

No. of


Recognition

X X X

>5000 >5000 >5000 >5000

30 20 10 5

Per cent

patterns. The learning machine method cannot, of course, interpret patterns of a type not used to develop the decision process. That is, a machine developed using only mass spectrometry patterns could not interpret infrared spectra. However, because the decision process is developed in an empirical manner, the same generalized learning technique may be used with any type of data, including mixtures from unrelated sources. This work investigates the application of computerized learning machines to the classification of mixed patterns produced by the combination of mass spectra, infrared spectra, and melting and boiling points of chemical compounds.
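The learning procedure itself is described in References (1) and (2) and is not reproduced in this paper. As a rough illustration of what training by negative feedback on misclassified patterns can look like, the sketch below uses a generic error-correction rule in which the weight vector is moved so that the offending pattern lands on the correct side of the decision surface. The reflection rule, the appended constant component, the zero starting weights, and all names are assumptions made for illustration and should not be read as the authors' program.

```python
def dot(weights, pattern):
    """Scalar produced by the dot product of a weight vector and a pattern."""
    return sum(w * x for w, x in zip(weights, pattern))


def train_learning_machine(patterns, labels, max_feedbacks=5000):
    """Error-correction training of a binary pattern classifier.

    patterns: list of equal-length lists of numbers.
    labels:   +1 or -1 for each pattern (assumed coding).
    Returns (weights, feedbacks_used, converged).
    """
    # Assumption: a constant component is appended so the decision
    # surface need not pass through the origin of the original space.
    augmented = [list(p) + [1.0] for p in patterns]
    weights = [0.0] * len(augmented[0])
    feedbacks = 0
    while True:
        any_error = False
        for x, y in zip(augmented, labels):
            s = dot(weights, x)
            if s * y > 0:
                continue  # correctly classified, no feedback needed
            any_error = True
            # Move the weights so the scalar changes sign (s -> -s);
            # when s is exactly zero, place it on the correct side.
            target = -s if s != 0 else float(y)
            c = (target - s) / dot(x, x)
            weights = [w + c * xi for w, xi in zip(weights, x)]
            feedbacks += 1
            if feedbacks >= max_feedbacks:
                return weights, feedbacks, False
        if not any_error:
            return weights, feedbacks, True


def classify(weights, pattern):
    """Binary decision from the sign of the scalar."""
    return 1 if dot(weights, list(pattern) + [1.0]) > 0 else -1


if __name__ == "__main__":
    pats = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
    labs = [1, 1, -1, -1]
    w, n, ok = train_learning_machine(pats, labs)
    print(ok, n, [classify(w, p) for p in pats])
```

The cap of 5000 feedbacks is chosen only to echo the ">5000" entries in the tables; the function reports non-convergence once that limit is reached.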


DATA AND COMPUTATIONS

The learning machines and the computer programs which implement them have been thoroughly described in References (1) and (2). Particular attention should be called to the computer method of casting out pattern components as described in Reference (2). All computations were done on the University of North Carolina Computation Center IBM 360/75.

Patterns were taken from 291 compounds for which both the low resolution mass spectrum and the infrared spectrum were available. The mass spectra, which were from the API Research Project 44 Tables, had a total of 132 possible mass positions, and had amplitudes adjusted in a range from 10 to 100 by the square root process as described in References (1), (2), and (3). The infrared spectra were obtained on loan from the Sadtler Research Laboratories, Inc. and contained 130 possible absorption wavelengths, over a range from 2.0 to 14.9 µ as described in Reference (5). Four peak amplitudes are used in the IR data: (a) peak not present, normally assigned 0.0 intensity; (b) peak present in a 0.1 µ band, normally given an intensity of 1.0; (c) largest peak in a 1.0 µ band, normally given an intensity of 2.0; and (d) largest peak in the spectrum, given an intensity of 3.0. Thus, an IR spectrum consists of an ordered sequence of 130 numbers. Compounds had formulas from C1-10, H1-24, O0-4, Nw, and in each of the combined patterns of the mass spectrum and infrared spectrum there are 262 components. Of the 291 such patterns, 191 were selected randomly as a training set and the other 100 were used as a predicting set.

In combining data from different sources into a single pattern, the relative contributions of the two (or more) types of data grossly affect the learning machine's behavior. Methods of normalization of the magnitudes of pattern components, which would affect the relative contributions, have been discussed previously (2, 6, 7), although not with reference to combining data from unrelated sources.

RESULTS AND DISCUSSION

In order to evaluate the effect of combining data from diverse sources, compounds were classified on the presence or absence of one or more double bonds. Attempts to answer this question using only infrared patterns or mass spectrometry patterns met with limited success, as shown in the first two sections of Table I. In each case parameters are discarded from the 125 starting parameters by the previously described criterion (2). Prediction is around 82% for the infrared case and falls off rapidly after the number of parameters is below 50. The mass spectrometry gives prediction in the region of 87%, which slowly decreases until the number of parameters is reduced below 20, when it drops off rapidly toward random success. The mass spectrometry also fails to converge within the allotted number of feedbacks below 50 parameters.

(6) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1469 (1968).
(7) G. S. Sebestyen, "Decision-Making Processes in Pattern Recognition," The Macmillan Co., New York, N. Y., 1962.

Sections three, four, and five show the results of training with patterns formed by combining the mass spectrum and infrared spectrum. In section three, the patterns have been normalized such that the mass spectral intensities are much greater than the infrared components. The mass spectral peak intensities were on the range 10 to 100 and the IR intensities were set to 0, 1.0, 2.0, 3.0. Thus the contribution from the IR data to the overall length of any pattern vector is relatively small compared to that from the mass spectrum portion of the pattern. As expected, the results are almost identical to those of the mass spectra alone, and by the time the number of parameters is 50 there remain only components of the mass spectral patterns.

In section four, the results of normalizing the patterns so that the infrared data predominate are shown. Here the IR intensities are set to 0, 1000, 2000, 3000, so most of the pattern vectors' length is due to the IR contribution. As expected, the results are quite like those of the independent infrared spectra, and when only 50 parameters are left they are all from infrared patterns.

Section five shows the results of normalizing the patterns so that each data source contributes equally to the total amplitude of the pattern set. That is, the sum of all the mass spectral peak intensities over the entire data set is equal to the sum of all the IR peak intensities over the entire data set. In this case the prediction starts out around 90% and remains very high until the number of pattern components is reduced below 20. Even with only 10 components left, prediction is 75% and recognition is 82%, showing better than random behavior. It is also interesting to note that components of both the mass spectra and the infrared spectra have been retained throughout the parameter reduction process.

Section six of Table I shows the further improvement gained by adding two more data points to the patterns, the melting point and boiling point for each compound. (Note that in every case where the boiling and melting points are included, the total number of parameters is 2 more than in the combined MS/IR patterns or the uncombined patterns.) In comparison to section five, no particular improvement is noted down to thirty parameters, both the predictive abilities and convergence rates being about the same.
However, with addition of the boiling and melting points, the learning machine still converges within the limit with only twenty parameters and also retains approximately a 90% predictive ability at that level. Furthermore, the prediction at 10 parameters is still notably higher than in the other cases. The decision process used to reduce the number of parameters contributing little to the classifications retains the melting and boiling point information almost until the end of the calculation.

Hence, the comparative study of the independent parameters vs. the combined patterns allows decisions to be made as to which is the better method of answering a given question. Furthermore, the parameter reduction process allows the important data to be retained, while notably reducing the amount of information necessary to answer a given question. For example, in this particular case as indicated by the last section of Table I, 20 pieces of data retained for each compound will allow the unique identification of that compound from the set of 191 in the training set. This has particularly useful potential applications in analytical situations where the set of possible compounds is well known, and it is necessary to frequently identify one among that set. In addition, the limited predictive ability should not be interpreted as a real upper limit for the technique. As shown in previous work, particularly that of Reference (2), predictive ability is strongly a function of training set size, and often prediction has been found as high as 98% for difficult structure problems.

The evaluation of the combined method of data reduction is, in itself, a good way to determine whether one technique is sufficient unto itself. Table II shows such an evaluation. It is seen from sections 1 and 2 that ethyl determination is a difficult question using either single approach. The mass spectrometry has notably better predictive success even though its convergence rate is slower. (Indeed, the predictive ability on the infrared seems to be no better than random.) The


Table II. Ethyl Presence

1. IR

Parameters   No. of feedbacks   Recognition   Per cent prediction
125          433                X             54
100          408                X             56
70           1325               X             52
50           >5000              156           60
30           >5000              113           51
20           >5000              110           53
10           >5000              110           58
5            >5000              110           62

2. MS

Parameters   No. of feedbacks   Recognition   Per cent prediction
125          2782               X             69
100          3002               X             70
70           3949               X             74
50           >5000              177           74
30           >5000              166           79
20           >5000              156           70
10           >5000              145           67
5            >5000              107           43

3. Combined Patterns

Parameters   No. of feedbacks   Recognition   Per cent prediction
262          294                X             70
162          282                X             64
125          307                X             63
100          291                X             65
70           406                X             72
50           1834               X             67
30           >5000              162           71
20           >5000              163           68
10           >5000              140           51
5            >5000              126           56

4. Combined Patterns

Parameters   No. of feedbacks   Recognition   Per cent prediction
262          272                X             69
162          318                X             66
125          378                X             73
100          424                X             74
70           557                X             69
50           4040               X             69
30           >5000              162           65
20           >5000              159           70
10           >5000              148           52
5            >5000              144           65

Table III. Vinyl Presence

1. IR

No. of

Parameters 262 162 125 100

70 50 30 20 10

5

feedbacks

Recognition

89 90 72 74 84 93 >5000 >5000

X

2. MS

Per cent prediction 87 85 84

X X X X X

81 81

84 83

178 175

81

No. of

feedbacks

Recognition

1601 1604 1750 >5000 >5000 >5000 >5000 >5000

X X X

Parameters

feedbacks

Recognition

Per cent prediction

262 162 125

104 105 104

X X

X

77 79 82

100

98

X X

81 80

X X

76 82 78 70 62

70 50 30 20 10

5

120 93 178 >5000 >5000 >5000

184 170 175

combined infrared and mass spectral patterns show no better prediction than the mass spectra alone, although there is some improvement in convergence rate. From section three it appears that the better aspects of the two methods have been combined. Furthermore, the combined patterns converge with only 50 parameters left, when the minimum for each single technique was seventy parameters. Finally, in section four it is seen that the addition of boiling points and melting points made virtually no improvement. This is to be expected because the phase transition temperatures of organic compounds should not be particularly dependent on terminal ethyl groups.


82 82

78 83 77 71 85 80

4. Combined patterns

3. Combined patterns No. of

186 173 174 174 173

Per cent prediction

No. of

feedbacks 98 93 74 77

Recognition X

X

Per cent prediction 83 84 80 80

82

X X X

72

X

80

118 91

X X

79 79

172 170

71

>5000 >5000

79

82

In Table III, which deals with the determination of vinyl group presence, it is seen that infrared is superior to mass spectrometry in both prediction and convergence rate. In fact, infrared is able to converge with only 20 parameters. In this case, the combination of the patterns seems to give about the average convergence rate while lowering the predictive ability somewhat. Apparently, forcing the consideration of the mass spectrometry data obscures the infrared information. The addition of boiling and melting point information returns the situation very close to that of the infrared alone. Hence, from such a study as summarized in Table III it is seen that a single technique is superior to combining techniques, at least for this method of data interpretation.

This is not meant to imply that the addition of the mass spectral patterns to the infrared patterns in some way causes the loss of information. Surely only an increase in information content could actually occur. However, in the hyperspatial approach of the learning machine method here employed, the combination apparently made the data more difficult to interpret by adding dimensions which were not as discriminating.

In much analytical instrumentation, such as infrared and mass spectroscopy, it is often easier to obtain accurate information on the resolved property in the measurement (e.g., wavelength in the former case and mass value in the latter) than on the intensity of the measurement. For this reason it is useful to be able to evaluate how important intensity information really is. Table IV shows such an evaluation for the double bond determination which was summarized in Table I. In this case, all components of both the mass spectrum and the infrared spectrum were reduced to intensity values of 1.0 where detectable peaks occurred, and zero where they did not. These binary spectra were then trained in the same fashion as before. The results of sections one, two, and three of Table IV, when compared to sections one, two, and five of Table I, show that apparently the peak, no-peak information is just as good as the intensity resolved information for this question. In other words, information sufficient to answer the question of double bond presence is present in peak locations for both IR and mass spectrometry. Hence, in this particular case, it would only be necessary to have the peak positions to train and predict on the question of double bond presence.

Table IV. Double Bond Presence with Binary Spectra

1. IR

Parameters   No. of feedbacks   Recognition   Per cent prediction
125          183                X             81
100          173                X             83
70           202                X             79
50           193                X             79
30           >5000              172           77
20           >5000              165           65
10           >5000              144           56
5            >5000              142           61

2. MS

Parameters   No. of feedbacks   Recognition   Per cent prediction
125          454                X             84
100          486                X             85
70           373                X             84
50           494                X             85
30           >5000              175           81
20           >5000              161           75
10           >5000              157           69
5            >5000              155           8.5

3. Combined patterns

Parameters   No. of feedbacks   Recognition   Per cent prediction
262          124                X             92
162          110                X             89
125          122                X             90
100          119                X             89
70           99                 X             88
50           105                X             89
30           234                X             89
20           >5000              178           85
10           >5000              161           54
5            >5000              158           72

In conclusion, the learning machine method as implemented with high speed digital computer programs is capable of evaluating which of several individual techniques, or some combination thereof, is capable of answering a given question without regard to the actual nature of the measurement processes. Furthermore, such an approach allows the evaluation of which data from a set of patterns are necessary to answer a given question.
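The reduction to binary spectra used for Table IV amounts to a one-line transformation of each pattern. The sketch below assumes that a "detectable peak" is any component above a threshold, taken here as zero; the threshold parameter and the function name are hypothetical, since the text does not state how detectability was decided.

```python
def to_binary_spectrum(pattern, threshold=0.0):
    """Replace every detectable peak by 1.0 and everything else by 0.0.

    threshold is an assumed detectability cutoff; components strictly
    above it count as peaks.
    """
    return [1.0 if value > threshold else 0.0 for value in pattern]


if __name__ == "__main__":
    ms_fragment = [0.0, 10.0, 55.5, 0.0, 100.0]  # toy MS intensities
    ir_fragment = [0.0, 1.0, 3.0, 2.0, 0.0]      # toy IR codes
    print(to_binary_spectrum(ms_fragment + ir_fragment))
```

Because every retained component becomes 1.0, the relative scaling between mass spectral and infrared components no longer enters, and only peak positions remain as information.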

RECEIVED for review July 1, 1969. Accepted September 29, 1969. The financial support of the National Science Foundation is gratefully acknowledged.
