Machine Intelligence Applied to Chemical Systems: Prediction and Reliability Improvement in Classification of Low Resolution Mass Spectrometry Data

Peter C. Jurs

Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802

Several techniques for improving the prediction and reliability of binary pattern classifiers are investigated. The data used are low resolution mass spectra of small organic molecules containing C, H, O, and N atoms. Aspects of pattern classification discussed include transformation of the original data, investigation of the effects of thresholds, and the effects of using layered systems of threshold logic units. A predictive ability as high as 98% on complete unknowns can be obtained by incorporating all these features into one classification system.
PREVIOUS STUDIES have shown that adaptive binary pattern classifiers are capable of learning to usefully classify chemical data derived from low resolution mass spectrometry, infrared spectrometry, and from diverse sources (1). Four important attributes of pattern classifiers have been pointed out and studied: recognition ability, reliability, convergence rate, and predictive ability. This paper will discuss some methods whereby the reliability and predictive ability of binary pattern classifiers can be enhanced without undue slowing of the convergence rate.

Data Set. The first and most obvious way to improve the results of any data interpretation technique is to use the best set of data which is available. Previous work dealing with pattern classifiers working on chemical systems used a set of mass spectrometry data which has been previously described (1). The present work uses another data set taken from a collection of mass spectra purchased on magnetic tape from the Mass Spectrometry Data Center, Atomic Weapons Research Establishment, United Kingdom Atomic Energy Authority, Aldermaston, Berkshire. A portion of that tape contains 2261 American Petroleum Institute Research Project 44 spectra. These spectra have been digitized with normalized intensities ranging from 99.99 to 0.01 in each spectrum. Throughout this work spectra corresponding to compounds containing only C, H, O, and N atoms were used. In the data selected from the tape for actual tests, there are about 50 to 70 such peaks per spectrum. There are 132 m/e positions in which ten or more peaks appear throughout the data set used. In making actual computer runs, an input subroutine is employed to read spectra from a tape, select the ones to be used according to preset criteria, e.g., carbon number, and set up the data for the remainder of the program to use. Normally, 600 spectra are thus input, with a training set of 300 and a prediction set of 300.

Data Selection and Transformation. Because of the large quantity of data available, it was possible to select data for inclusion in each problem so as to acquire as homogeneous a data set as possible. In most of the work reported here, the spectra were selected so that they correspond to compounds with three to 10 carbon atoms. (Tests using data sets with spectra of compounds with three to 20 carbon atoms
showed that the results obtained in this investigation were not artifacts of the data but worked for the less homogeneous data set also.) Six hundred spectra meeting the criterion of three to 10 carbon atoms were used for each computer run; they were divided evenly between the training and prediction sets. The data set of 600 spectra most often used contained 35,550 peaks spread over 132 m/e positions.

Previous pattern classification research has shown that transformations of the intensities of the spectral peaks markedly help the pattern classifiers to find good decision surfaces (2). A discussion of the information present in mass spectra from an information theory point of view has been given by Grotch (3). Table I shows the results of an investigation of several different transformations. The investigation shown in Table I used spectra from the purchased collection as described above. For each of the four transformations, a feature selection program was run. The method of feature selection employed has been described in detail previously (1). The main points of the feature selection routine are that for each stage in the routine, two weight vectors (initialized with all +1's and all -1's, respectively) are trained, and then the signs of the weight vector components are compared. Only the m/e positions corresponding to weight vector components with the same sign are retained for the next stage. The cycle is repeated until no more ambiguous m/e positions can be located, and the program then terminates. Each row of Table I shows the results of running this program for the same data, having undergone the transformation listed in column one. The original data have a dynamic range within each spectrum of 10^4 (0.01 to 99.99); the second column gives the dynamic range after the transformation. The third column gives the number of feedbacks necessary for convergence to 100% recognition for the first stage of the feature selection program (when there are 132 m/e positions). The fourth column gives the predictive abilities of the two pattern classifiers and the fifth gives the average per cent prediction for this first stage. The logarithmic transformation yields the best predictive ability for the first step of the program. Column six shows the number of ambiguous m/e positions found during this first stage; a narrower dynamic range causes the pattern classifier to find fewer ambiguous m/e positions. The program repeats the cycle mentioned above many times for each transformation, and the number of m/e positions remaining after all ambiguous positions have been discarded is shown in column seven. Again, narrower dynamic range reduces the number of ambiguous m/e positions.

(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ANAL. CHEM., 41, 690 (1969).
(3) S. L. Grotch, "A Computer Program for the Identification of Low Resolution Mass Spectra," 18th Conference on Mass Spectrometry and Allied Topics, San Francisco, Calif., June 1970.
(1) P. C. Jurs, ANAL. CHEM., 42, 1633 (1970).
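The four transformations of Table I are simple enough to state in code. The following sketch is illustrative only (the example intensity values and all names are hypothetical, and NumPy is assumed); it shows how each transformation compresses the 10^4 dynamic range of the tape intensities:

import numpy as np

# Normalized peak intensities as stored on the tape: 0.01 to 99.99,
# a dynamic range of 10^4 within each spectrum.
intensities = np.array([0.01, 0.5, 7.3, 42.0, 99.99])

# The four transformations investigated in Table I; each narrows the
# dynamic range of the data presented to the pattern classifier.
transforms = {
    "square root":  np.sqrt(intensities),   # 0.1   to 10.0 (range 10^2)
    "fourth root":  intensities ** 0.25,    # 0.316 to 3.16 (range 10)
    "log":          np.log10(intensities),  # -2.0  to 2.0  (range 4)
    "zeroth power": intensities ** 0,       # every peak present becomes 1
}

for name, values in transforms.items():
    print(f"{name:>12}: {np.round(values, 3)}")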
Table I. Effects of Transformations on Properties of Binary Pattern Classifiers

Transformation   Dynamic   Feedbacks,       % prediction,   Av %         m/e positions   m/e positions   Overall %
                 range     +1 WV/-1 WV(a)   +WV/-WV         prediction   discarded       final           prediction
Square root      100       123/114          93.7/96.3       95.0         66              43              94.4
Fourth root      10        101/107          94.7/94.3       94.5         49              44              94.6
log              4         141/102          95.3/95.3       95.3         44              57              95.5
Zeroth power     1         235/229          93.7/94.0       93.8         23              83              94.0

(a) WV = weight vector.

Table II. Comparison of Properties of Four Pattern Classification Methods: Oxygen Presence-Absence Determination (132 m/e positions)

                                                    Feedbacks,    % prediction,   Av %         % recognition
Method                           Randomization      +1 WV/-1 WV   +1 WV/-1 WV     prediction   sigma = 2%   sigma = 5%
Linear machine                   1                  135/118       96.3/94.0       95.3         98.5         96.4
                                 2                  135/127       95.3/94.0       94.7         ...          ...
                                 3                  92/172        93.3/94.0       93.6         ...          ...
Linear machine with threshold,   1                  239/184       96.0/96.0       96.0         99.6         98.6
  A = 50                         2                  286/211       96.3/95.7       95.8         ...          ...
                                 3                  157/179       95.7/94.3       95.0         ...          ...
Committee machine                1                  180           ...             96.0         95.0         90.7
                                 2                  153           ...             94.0         ...          ...
                                 3                  95            ...             91.7         ...          ...
Committee machine with           1                  375           ...             98.0         100.0        99.8
  threshold, A = 50              2                  297           ...             96.3         ...          ...
                                 3                  398           ...             95.7         ...          ...

                 +      -      Total
Training set     121    179    300
Prediction set   126    174    300

Training and prediction set populations (+/-) for the three randomizations:
Randomization    Training set    Prediction set
1                84/216          89/211
2                92/208          81/219
3                78/222          95/205
The final column gives the predictive ability exhibited by each pattern classifier averaged over all stages of the feature selection process. Once again the logarithmic transformation exhibits the highest predictive ability. This result is consistent with the result obtained by applying information theory to the transgeneration problem. Throughout the remainder of this study, the logarithmic transformation was employed. After the transformation, the magnitude of the peaks was adjusted with constant multipliers for purely computational reasons, and the average peak intensity thus becomes approximately 13.

Dead Zone Training. The binary pattern classifiers used in this research and previous studies are known as threshold logic units, TLU's. The data are represented by d-dimensional vectors. They are classified by forming the dot product between a particular pattern vector and the weight vector, a procedure which gives a scalar, s. The pattern is classified into one category or the other according to whether the scalar is positive or negative; i.e., the scalar is compared to the threshold level of zero. A threshold level other than zero, A, can also be used. Then, when a dot product is formed between the pattern
vector and the weight vector, the scalar is compared to A. If s > A then the pattern is said to belong to one category; if s < -A then it is said to belong to the other category; and if -A < s < A, then the pattern is not classified. The region between -A and A is known as the dead zone. This classification method can be thought of in terms of two hyperplanes, where patterns are classed in one category if they fall on one side of the two planes, are classed in the other category if they fall on the other side of the two planes, and are not classified if they fall between the hyperplanes.

Training is performed using the threshold much as if A = 0. Feedback is applied only to correct errors, but now errors include both incorrect classification and failure to make a classification. The new weight vector is calculated from the old one at step i in training by the equation

W(i+1) = W(i) +/- cY    (1)
where W(i+1) is the new weight vector, W(i) is the old one, Y is the pattern vector being classified, c is the correction increment, and the sign is chosen according to the sense of the error.
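As a concrete illustration, classification with a dead zone reduces to a single dot product compared against +A and -A. The sketch below is a minimal rendering of this description (the names are hypothetical and NumPy is assumed), returning +1 or -1 for the two categories and 0 when no classification is made:

import numpy as np

def classify(W: np.ndarray, Y: np.ndarray, A: float = 0.0) -> int:
    """Threshold logic unit with a dead zone from -A to A."""
    s = float(np.dot(W, Y))  # scalar s = Y . W
    if s > A:
        return 1             # first category
    if s < -A:
        return -1            # second category
    return 0                 # inside the dead zone: not classified

# With A = 0 this reduces to the ordinary TLU that compares s to zero.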
Table III. Properties of Binary Pattern Classifiers as a Function of Threshold, A

      Feedbacks,    Average     % prediction,   Av %         % recognition              m/e positions
A     +1 WV/-1 WV   feedbacks   +1 WV/-1 WV     prediction   sigma = 2%   sigma = 5%    discarded
0     135/118       126         96.3/94.0       95.3         98.5         96.4          41
25    156/150       153         95.3/96.0       95.7         99.3         97.5          32
50    239/184       206         96.0/96.0       96.0         99.6         98.6          27
75    275/203       239         96.0/96.3       96.1         99.7         98.9          22
Table IV. Enhanced Predictive Ability of Binary Pattern Classifiers by Using Threshold Prediction Process

                      +1 weight vector              -1 weight vector
      % prediction,   % prediction     Not          % prediction     Not          Av %
A     A = 0           with threshold   attempted    with threshold   attempted    prediction
25    95.7            96.3             3            96.3             4            96.3
50    96.0            97.0             5            96.3             7            96.7
75    96.1            97.6             8            96.3             9            96.9
Table V. Feature Selection Using Threshold Binary Pattern Classifier, A = 50

m/e         Feedbacks,     Av % prediction
positions   +1 WV/-1 WV    A = 0     A = 50
132         239/184        96.0      96.7
105         216/109        95.7      96.8
97          247/99         95.9      96.8
94          196/127        95.7      96.9
91          239/138        96.0      96.7
90          214/129        96.0      97.1
89          207/130        96.0      96.9
The correction increment is calculated according to the equation

c = 2(+/-A - s) / (Y . Y)    (2)

where Y . Y is related to the length of the pattern vector Y, s is the scalar result which led to the incorrect classification (s = Y . W), and the sign is chosen according to the sense of the error. This feedback method shifts the decision surface so that after feedback the pattern point is as far on the correct side of the decision surface as it was on the incorrect side before feedback.
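A single feedback step under Equations 1 and 2 might be sketched as follows; this is a reconstruction consistent with the reflection property just described, not a transcription of the original program (the routine name and the `desired` argument, +1 or -1 for the known category, are hypothetical):

import numpy as np

def feedback(W: np.ndarray, Y: np.ndarray, desired: int, A: float = 0.0) -> np.ndarray:
    """One error-correction step, W_new = W + cY, applied when the training
    pattern is misclassified or left inside the dead zone."""
    s = float(np.dot(W, Y))
    if desired * s > A:
        return W  # already correct and outside the dead zone: no feedback
    # c = 2(+/-A - s)/(Y.Y); afterward the pattern point lies as far on the
    # correct side of its threshold hyperplane as it was on the wrong side.
    c = 2.0 * (desired * A - s) / float(np.dot(Y, Y))
    return W + c * Y

Training consists of repeated passes of such steps over the training set until every member is classified correctly, i.e., until 100% recognition is reached.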
With assiduous choice of A, one would expect predictive ability to rise as compared to the zero threshold case, because one is essentially more narrowly defining the hypervolume within which the hyperplane can fall and still be a "good" decision surface. One would also expect an increase in reliability with a nonzero threshold. Table II shows the results of training binary pattern classifiers with and without the threshold. For each test, 300 spectra were randomly selected from the data set; three different randomizations were used, resulting in three different training and prediction set populations, as shown at the bottom of Table II. The TLU's with A = 50 require more feedbacks to train to 100% recognition in each case, and they display higher predictive ability in each case. These TLU's with A = 50 are able to correctly classify 95.0, 95.8, and 96.0% of the complete unknowns comprising the prediction set.

The figures given for per cent recognition were obtained as follows. After training to complete recognition, the members of the training set are again fed to the pattern classifier one at a time, but they are varied by a randomly generated Gaussian vector before classification. The sigma figure refers to the relative standard deviation of the imposed Gaussian vector. For example, sigma = 5% means that approximately one third of the components in each pattern vector are varied by more than +/-5% from their original values. The variations are imposed after the logarithmic transformation process, so the occurrence of errors is much more frequent than it would be in an actual laboratory situation, where the errors would be introduced before any transformation. It is seen in Table II that the percentage recognition rises when the pattern classifier is trained with the nonzero threshold.
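The imposed-error recognition test can be sketched as follows (hypothetical names; NumPy assumed). Multiplying each component by 1 plus a normal deviate of relative standard deviation sigma leaves roughly one third of the components altered by more than +/-sigma, since about 32% of a normal distribution lies beyond one standard deviation:

import numpy as np

rng = np.random.default_rng(seed=1)

def perturb(Y: np.ndarray, rel_sigma: float) -> np.ndarray:
    """Impose a random Gaussian error of relative standard deviation
    rel_sigma (e.g. 0.05 for sigma = 5%) on every component."""
    return Y * (1.0 + rng.normal(0.0, rel_sigma, size=Y.shape))

def pct_recognition(W, patterns, categories, rel_sigma):
    """Re-feed the training set with imposed errors and report the
    percentage still classified into the correct category."""
    correct = sum(
        1 for Y, desired in zip(patterns, categories)
        if desired * float(np.dot(W, perturb(Y, rel_sigma))) > 0
    )
    return 100.0 * correct / len(patterns)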
Table III shows the results of investigating the properties of binary pattern classifiers as a function of threshold size. The number of feedbacks required to correctly classify all the patterns in the training set rises as a function of A, as would be expected. The predictive ability and reliability both rise also. The last column gives the number of ambiguous m/e positions found out of the 132 used in the training. As A increases and the volume in which the final decision surface can fall is more narrowly bounded, the number of m/e positions used by the pattern classifier rises.

When training is done with a threshold, then the threshold can be employed in the prediction process. Table IV shows the results of such an investigation. Column 2 shows the results obtained by using the trained weight vectors to classify the members of the prediction set with a threshold of zero. Then the members of the prediction set were classified using the indicated values for A, that is, not classifying those patterns for which the scalar obtained fell between -A and A. The per cent prediction figures given are the percentage of classifications attempted which were correct. It is seen that in every case the per cent prediction with threshold is higher than that obtained without the threshold. Evidently this is a superior way to perform prediction of unknowns.

Table V shows that the method of feature selection previously described works well when a threshold is imposed on the binary pattern classifier. With A = 50, the pattern classifier was able to reduce the number of m/e positions from 132 to 89 while maintaining the same predictive ability. In each case, the predictive ability with A = 50 is higher than with A = 0.
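One cycle of that feature selection routine, with or without a threshold, can be sketched as below. The `train` argument stands for any routine that trains a weight vector to 100% recognition (such as repeated application of the feedback step sketched earlier); all names are hypothetical:

import numpy as np

def one_selection_cycle(patterns, categories, train):
    """Train two weight vectors from opposite initializations (all +1's
    and all -1's) and keep only the m/e positions whose trained weight
    components agree in sign; the rest are ambiguous and are discarded.
    Repeating the cycle on the reduced patterns until no ambiguous
    positions remain gives the final feature list."""
    d = patterns.shape[1]
    w_plus = train(np.ones(d), patterns, categories)
    w_minus = train(-np.ones(d), patterns, categories)
    keep = np.sign(w_plus) == np.sign(w_minus)  # boolean mask, True = retained
    return patterns[:, keep], keep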
Table VI. Properties of Layered Pattern Classifiers as a Function of Threshold, A

                   %            % recognition              % prediction,
A     Feedbacks    prediction   sigma = 2%   sigma = 5%    sigma = 2%      % fall
0     180          96.0         95.0         90.7          92.8            3.3
25    397          97.0         99.8         98.5          96.7            0.3
50    375          98.0         100.0        99.8          97.7            0.3
75    359          97.3         99.8         100.0         96.5            0.8

Table VII. Training Set Representative Subset Selection

        Training    Feedbacks,     Spectra     %            % prediction,   No. not      % of all spectra
Trial   set size    +1 WV/-1 WV    fed back    prediction   A = 50          classified   classified correctly
I       300(a)      239/184        80          96.0/96.0    97.0            5            ...
II      300(b)      196/206        82          94.3/95.7    94.2            5            ...
III     162(c)      622/595        118         ...          95.1            7            98.3
IV      172(d)      1283/1104      126         ...          97.3            11           100.0

(a) Odd indexed members of data set. (b) Even indexed members of data set. (c) Spectra fed back in trials I and II. (d) Training set III plus those predicted incorrectly in trial III.
Committee Machine. The work discussed above utilizes a single threshold logic unit to classify patterns. A slight generalization involves using two levels of TLU's (4). In this method the pattern to be classified is presented simultaneously to three (or five, or seven, etc.) TLU's which operate in the normal manner. However, the outputs of the first layer of TLU's go to a second layer TLU which follows a majority rule law. Thus, the whole committee of classifiers sends its output to the vote-taker, which classifies the original pattern into the category agreed upon by the majority of the first level TLU's. A useful and simple training method is to make only the minimum necessary error correction feedback at each step in the training. Thus, when an incorrect classification is made during training, the weights of the TLU's that missed being correct by the smallest amount are altered using Equations 1 and 2 with A = 0 substituted. Only enough TLU's are changed to ensure correct classification after the feedback. Table II shows the results of training a committee machine with three members with the same three sets of data used by the other pattern classification implementations. Convergence to 100% recognition is seen to be fast, and the predictive ability is high. However, this committee machine is evidently quite sensitive to variations in the training set members, as seen from the per cent recognition figures.

Committee Machine with Dead Zone. A pattern classification system combining both features discussed above is a committee machine where each member has a nonzero threshold. Training is as before: the scalar of each member of the committee must be outside the dead zone (-A to A) to register any classification, and the vote-taker registers an overall classification only when a majority of the committee members agrees on a classification. One would expect this to be a somewhat more powerful pattern classification system than the simpler systems. Table II shows the results of training a committee of three TLU's with A = 50 and with the same three sets of data as the other implementations in the table. The number of feedbacks is seen to rise, although it remains at a reasonable level. However, the per cent prediction is quite a bit better. This pattern classifier could correctly classify 95.7, 96.3, and 98.0% of the members of the prediction set, which are complete unknowns. The percentage recognition is seen to be very high also; this implementation is nearly impervious to errors of this size in the patterns.

(4) N. J. Nilsson, "Learning Machines," McGraw-Hill, New York, N. Y., 1965.
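The two-layer arrangement can be sketched as follows (hypothetical names; NumPy assumed). Each first-layer TLU votes +1, -1, or 0 when its scalar falls in its dead zone, and the vote-taker registers a classification only when a majority of the members agrees:

import numpy as np

def committee_classify(members, Y, A=0.0):
    """First layer: each TLU forms s = Y . W and votes +1, -1, or 0.
    Second layer (vote-taker): majority rule over the members' votes;
    returns 0 when no majority agrees on a classification."""
    votes = []
    for W in members:
        s = float(np.dot(W, Y))
        votes.append(1 if s > A else (-1 if s < -A else 0))
    majority = len(members) // 2 + 1  # e.g. 2 of 3, 3 of 5
    if votes.count(1) >= majority:
        return 1
    if votes.count(-1) >= majority:
        return -1
    return 0  # committee did not reach a majority: not classified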
A study was made of some of the properties of committee machines as a function of threshold size. The results are shown in Table VI. The number of feedbacks necessary for convergence jumps dramatically for A = 25 and then remains at approximately the same level. The predictive ability goes through a maximum at A = 50. The per cent recognition is very high for all nonzero thresholds. The sixth column gives the results of testing the predictive ability with random errors imposed on the prediction set members, and the last column gives the per cent drop in predictive ability with these errors compared to the predictive ability on the prediction set members themselves. Thus, a committee of machines, each operating with a nonzero threshold, is seen to be superior in terms of predictive ability and reliability.

Data Pool Characteristics. For routine use of pattern classifiers, the goal would be to have at all times the best possible weight vectors for classification purposes. When new compounds were run on the mass spectrometer, they would be classified with probability of success approximated by the predictive ability. However, if these new spectra were then to be included in the (expanded) data pool, then it would be necessary to retrain the weight vectors, which requires access to the entire previous data pool. Because storage of the entire data pool may be inconvenient (or impossible), it was decided to see whether the entire data pool could be represented by a subset of its members. This was found to be so in a series of tests summarized in Table VII. In trial I, the odd indexed members of the data set were used in the training set with the results shown. A tabulation was kept of the members of this training set which were fed back at some time during training; 80 of the 300 spectra participated in the training process. In trial II, the training set was composed of the even indexed members of the data set; 82 spectra participated in training. In trial III, the 80 spectra from trial I and the 82 spectra from trial II were used as a training set; then 10 spectra were missed out of the 438 remaining spectra. These 10 spectra were incorporated into the training set to make a training set of 172 members. In trial IV training was accomplished with these 172 spectra, of which 126 participated in training. The resulting weight vector could then classify not only the 172 members of the
training set but the other 428 spectra also. This is not called predictive ability because of the selection process employed in the training. However, the 126 spectra which participated in training are completely representative of the overall data set with regard to the question of oxygen presence-absence. By going through a feature selection routine identical to those described above, the number of m/e positions used in these 126 spectra can be reduced to 104, and the number of spectra used to 114. Thus, the entire data set of 600 spectra of 132 m/e positions (600 x 132 = 79,200) can be completely represented by 114 spectra of 104 peaks (114 x 104 = 11,856), a substantial savings.
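The bookkeeping behind Table VII is simply a record of which training spectra ever received feedback. A sketch of the idea follows (hypothetical names, building on the feedback rule sketched earlier); the spectra it collects can be stored in place of the full data pool for later retraining:

import numpy as np

def train_and_track(W, patterns, categories, A=50.0, max_passes=500):
    """Train to 100% recognition, recording the indices of the spectra
    that participated in training (i.e., were fed back at least once).
    Those spectra form a compact representative subset of the data pool."""
    participants = set()
    converged = False
    while not converged and max_passes > 0:
        converged, max_passes = True, max_passes - 1
        for i, (Y, desired) in enumerate(zip(patterns, categories)):
            s = float(np.dot(W, Y))
            if desired * s <= A:  # misclassified or inside the dead zone
                c = 2.0 * (desired * A - s) / float(np.dot(Y, Y))
                W = W + c * Y
                participants.add(i)
                converged = False
    return W, sorted(participants)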
RECEIVED for review June 26, 1970. Accepted October 19, 1970.
Double Beam Photon Counting Photometer with Dead Time Compensation

K. C. Ash and E. H. Piepmeier

Department of Chemistry, Oregon State University, Corvallis, Ore. 97331
A low cost double beam photon counting photometer has been built with integrated circuits. The improved regulation of the double beam instrument is compared to single beam operation of one of its beams, and the limitations of double beam compensation are discussed. The counting precision due to shot noise for the double beam instrument compared to that of a similar single beam instrument is found to be worse by a factor of