NBS bovine liver, NBS orchard leaves, human and dog hair, NBS coal, and fly ash. In addition, coal, fly ash, slag, and the scrubber solutions used to collect gaseous selenium formed by the combustion of coal at the Allen Steam Plant in Memphis, Tenn., were analyzed.
Accuracy and Precision. As shown in Table V, the relative error for the selenium determination varied from 0 to 17.5% with an average of 3.7%. It should be noted that the error values in Table V assume the validity of the values obtained by the independent methods used or the certified values supplied. Regarding precision, the relative standard deviation averaged 4.7% and ranged from 2.2% to 9.5%.
Received for review May 3, 1974. Accepted July 24, 1974. Oak Ridge National Laboratory is operated for the U.S. Atomic Energy Commission by Union Carbide Corporation under Contract No. W-7405-eng-26. Research supported by the National Science Foundation-RANN (Environmental Aspects of Trace Contaminants) Program under NSF Interagency Agreement No. 389 with the U.S. Atomic Energy Commission.
Interpretation of Infrared Spectra Using Pattern Recognition Techniques

R. W. Liddell III and
P. C. Jurs
Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802
A pattern recognition technique using adaptive binary pattern classifiers has been utilized for the classification of infrared spectra into chemical classes. Four different methods are used to generate classifiers, and the predictive abilities on unknown infrared spectra are compared. A new procedure for supervised training, called the fractional correction method, is shown to give superior results for this data set. Using fractional correction training and an offset intensity scale for the infrared absorptions, a new feature selection procedure requiring only one weight vector training is implemented. It is applied to the infrared data set with marked decrease in the number of absorptions necessary for training and prediction. Weight vector maps for three chemical classes are presented.
Computerized learning machines utilizing adaptive binary pattern classifiers have been applied previously to the interpretation of infrared spectra (1-3). The present paper reports several new developments which lead to improved performance. A modification of the error-correction feedback training procedure is described, and it is compared to several other training methods. An improved method for the preprocessing of the raw infrared spectral data is described. In addition, a new method of feature selection has been developed for which only one weight vector need be trained.

DATA SET

The data for this study were taken from the Sadtler standard infrared spectra. Two hundred twelve spectra were taken for small organic molecules in the range C3-10H0-22O0-3N0-2. Each spectrum was divided into 0.1-micron intervals, thereby generating 131 X-Y values. The transmittance in each interval was converted into absorbance and put onto a zero to nine scale for convenience as previously described (2). The data were then renormalized so that the sum of the intensities of each infrared spectrum was equal to an arbitrary constant. All programs were in FORTRAN IV and were run on the Penn State University Computation Center IBM 370/168 or the Department of Chemistry MODCOMP II/25.

(1) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1945 (1969).
(2) R. W. Liddell III and P. C. Jurs, Appl. Spectrosc., 27, 371 (1973).
(3) D. R. Preuss and P. C. Jurs, Anal. Chem., 46, 520 (1974).

BINARY PATTERN CLASSIFIERS

Binary pattern classifiers (BPC) have been described in detail in earlier publications (e.g., 4). A digitized infrared spectrum is represented as a 132-dimensional vector,
X_i = (x_1, x_2, ..., x_131, x_132)    (1)

in which x_j is equal to the normalized intensity of each of the 131 intervals across the 2.0- to 15.0-micron range and i denotes that this pattern is the ith member of the data set. An extra dimension, arbitrarily chosen as unity, is added so the decision surface can pass through the origin. A linear decision surface (hyperplane) is represented by its normal vector,

W = (w_1, w_2, ..., w_131, w_132)    (2)

such that the related pattern points of class 1 will all cluster on one side of the decision surface and can therefore be separated from the unrelated spectra of class 2 on the other side. A set of pattern points that can be separated into two classes by a linear decision surface is called linearly separable. Linear separability has been demonstrated for all the categories defined by the functional groups that have been studied in this work. To classify a particular spectrum with respect to a particular decision surface, the dot product between W and X_i is taken, s = W · X_i. The value of the scalar, s, is compared to the deadzone cutoff, Z, being used. If s > Z, the point falls on one side of the decision surface; if s < -Z, the point falls on the other side; if -Z < s < Z, the point falls within the deadzone and is not classified.
Several variations of the error correction feedback procedure have been used for training in the work reported here. In each case, the weight vector is initialized with the first 131 components equal to zero and the 132nd component equal to a large negative constant. Then the members of the training set of pattern vectors, for which the correct classifications are known, are presented for classification to the BPC being trained. The error correction training proceeds in one of the following ways.

(4) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
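The deadzone classification rule above can be expressed in a few lines. The following is our own illustrative sketch (the paper's programs were in FORTRAN IV); the weight and spectrum values are made-up placeholders, and `make_pattern` and `classify` are hypothetical helper names:

```python
import numpy as np

N_INTERVALS = 131          # 0.1-micron intervals from 2.0 to 15.0 microns
Z = 2000.0                 # deadzone threshold used in the paper's trainings

def make_pattern(intensities):
    """Append the extra 132nd component (unity) to a 131-point spectrum."""
    x = np.asarray(intensities, dtype=float)
    assert x.size == N_INTERVALS
    return np.append(x, 1.0)

def classify(w, x, z=Z):
    """Dot product s = W . X_i, compared against the deadzone cutoff Z."""
    s = float(np.dot(w, x))
    if s > z:
        return 1           # one side of the decision surface (class 1)
    if s < -z:
        return 2           # the other side (class 2)
    return 0               # inside the deadzone: not classified

# Placeholder data: a random weight vector and a random 0-9 digitized spectrum.
rng = np.random.default_rng(0)
w = rng.normal(size=N_INTERVALS + 1)
x = make_pattern(rng.integers(0, 10, N_INTERVALS))
print(classify(w, x))
```

The unity 132nd component lets the hyperplane's offset be carried as an ordinary weight component, so training never has to treat the threshold as a special case.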
ANALYTICAL CHEMISTRY, VOL. 46, NO. 14, DECEMBER 1974
The first training method (Basic BPC) takes one pattern at a time for classification. When a pattern vector is misclassified or not classified, the weight vector is improved as follows:

W' = W + c_i X_i    (3)

where

c_i = 2(±Z - s_i)/(X_i · X_i)    (4)

and W is the old vector, W' is the improved vector, Z is the deadzone value, and the sign to be used depends on the sense of the error. This procedure guarantees that the new weight vector W' will correctly classify the particular pattern just misclassified. This procedure is followed until all members of the training set are correctly classified.
The second method (BPC with normalization) is very similar to the first except that the weight vector is normalized to unit length after each feedback. The ratio of the deadzone, Z, to the length of the weight vector, t = Z/|W|, is calculated after all training set members are correctly classified. This ratio is then increased and the weight vector is retrained. This procedure is repeated until a value for t is reached where any further increase in t causes the two classes to be inseparable. Forcing the weight vector length to remain constant forces the training algorithm to work on the relative orientation of the decision surface. The trivial possibility of training with twice as large a deadzone by using twice as long a weight vector is eliminated. The final hyperplane decision surface is oriented so that the perpendicular distances from it to the nearest pattern points of the two classes are maximized.
The third method (fractional correction) involves a modification of the basic error correction feedback method. Fractional correction training proceeds by splitting the weight vector being trained into two components named W_p and W_n, where p denotes the positive class and n denotes the negative class. For classifying patterns, the combined weight vector W_t is used, where

W_t = W_p + W_n    (5)

Therefore

s = W_t · X_i    (6)

and the classification is made on the basis of the magnitude and sign of s, as usual. During training, all the members of the training set are classified without any feedback alterations of the weight vector; during these classifications, two summations are formed as follows:

S_p = Σ (i = 1 to n_p) s_i    (7)

S_n = Σ (i = 1 to n_n) s_i    (8)

in which the sum S_p is accumulated over those patterns (n_p in number) which are members of the positive class but were misclassified, and S_n is accumulated over the patterns of the negative class which were misclassified. Then the feedback equations become:

W_p' = W_p + c_i f_i X_i    (9)

in which

f_i = (1.5)(s_i - Z)/S_p and c_i = -2(s_i - Z)/(X_i · X_i)    (10)

and the feedback is performed on W_p a total of n_p times. Similarly, for the members of the negative class which were misclassified:

W_n' = W_n + c_i f_i X_i    (11)

in which

f_i = (1.5)(s_i - Z)/S_n    (12)

and c_i is as above. The constant 1.5 is an adjustable parameter used to decrease the computer time needed to train. This procedure is repeated until all the training members are correctly classified. The fractional correction training method causes many pattern vectors to contribute to the decision surface improvement during each feedback instead of just one, as in the basic method. Thus, a more representative sample of the training set is obtained. This factor contributes greatly to the validity of the final weight vector.
A basic problem of the BPC programs is that the first several feedbacks have too large an effect on the final weight vector. For the basic training method, large corrections are made during the first few feedbacks, and smaller and smaller corrections are made as the weight vector converges on the final solution. This causes very large changes in the weight vector to be due merely to the identity of the spectra missed early in training. Even when the normalized BPC is used, this effect is still present, although to a smaller degree. Thus, fractional correction training yields a weight vector less sensitive than the BPC with normalization to the order of spectra in the training set and also to the points closest to the separating decision surface.
The last procedure used was MAX, as described in a previous publication (2). This method finds the positive class spectrum that has the least positive scalar quantity; similarly, it finds the most positive scalar value from the negative class. It then finds the two features that will improve the scalar values for these two spectra the most. This procedure continues until the algorithm can no longer find a feature that will not adversely affect the remaining scalar values. If the initial value of the weight vector is zero, there is no way to determine which spectrum's scalar value is the worst, since all the scalars are zero. Therefore, an initial weight vector must be used, calculated by Equations 13 through 15, in which the summation of Equation 13 is done over the members of the positive class, the summation of Equation 14 is done over the members of the negative class, and the entire procedure is repeated for each descriptor, i.e., j = 1, 2, ..., 131. All weight components that have negative values were initialized to zero.
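The fractional-correction loop of Equations 5 through 12 can be sketched as follows. This is our own reading, not the authors' FORTRAN: the ±Z sign handling follows the convention of Equation 4, the absolute value on f and the S = 0 fallback for the first pass (when every scalar is zero) are our additions to keep the sketch well defined, and the toy patterns are invented:

```python
import numpy as np

def fractional_correction_train(X, y, Z=0.5, max_passes=200):
    """Sketch of fractional correction training (Eq 5-12, our reading).

    X: (n, d) array of augmented pattern vectors; y: +1/-1 class labels.
    """
    n, d = X.shape
    w_p = np.zeros(d)
    w_n = np.zeros(d)
    for _ in range(max_passes):
        w_t = w_p + w_n                      # Eq 5: combined weight vector
        s = X @ w_t                          # Eq 6: scalars, no feedback yet
        miss_p = [i for i in range(n) if y[i] > 0 and s[i] <= Z]
        miss_n = [i for i in range(n) if y[i] < 0 and s[i] >= -Z]
        if not miss_p and not miss_n:
            break                            # all training members classified
        S_p = sum(s[i] for i in miss_p)      # Eq 7
        S_n = sum(s[i] for i in miss_n)      # Eq 8
        for i in miss_p:                     # Eq 9-10: n_p feedbacks on w_p
            c = 2.0 * (Z - s[i]) / float(X[i] @ X[i])
            f = abs(1.5 * (s[i] - Z) / S_p) if S_p != 0 else 1.0
            w_p = w_p + c * f * X[i]
        for i in miss_n:                     # Eq 11-12: n_n feedbacks on w_n
            c = 2.0 * (-Z - s[i]) / float(X[i] @ X[i])
            f = abs(1.5 * (s[i] + Z) / S_n) if S_n != 0 else 1.0
            w_n = w_n + c * f * X[i]
    return w_p + w_n

# Toy, linearly separable "spectra"; the last component is the unity dimension.
X = np.array([[5., 0., 1.], [4., 1., 1.], [0., 5., 1.], [1., 4., 1.]])
y = np.array([1, 1, -1, -1])
w = fractional_correction_train(X, y)
print(X @ w)
```

Because every misclassified pattern in a pass contributes its own weighted feedback, no single early mistake dominates the final vector, which is the property the text attributes to this method.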
RESULTS AND DISCUSSION

Table I shows the predictive abilities obtained for seven chemical classes when the infrared spectra with all 131 descriptors were used. A single training set of 126 randomly chosen patterns was used for training of all the binary pattern classifiers. In each case, the BPC was trained with one of the previously described methods, and the predictive ability was tested by classifying the remaining 86 infrared spectra. For the BPC with normalization, training was done with a nonzero deadzone, but prediction was done with Z = 0. Thus, all 86 members of the prediction set were classified for all four methods. For the tests reported in Table I, the training set is approximately as large as the number of descriptors per pattern. While this is undesirable, it does not invalidate the results obtained. The weight vector being trained is constrained to be a weighted linear summation of the patterns in the training set. Since the patterns of the training set are not necessarily independent of one another, the 132-dimensional space is not spanned, and not all solutions are accessible. Thus, the final solution weight vector is meaningful even though there are more dimensions than patterns. This behavior has been observed in previous work with infrared spectra (1-3) and mass spectra (e.g., 5). In addition, the results of the trainings are used only in feature selection, which remedies the problem of the ratio of descriptors to the number of patterns.

Table I. Predictive Abilities of Four BPC Classifiers Using 131 Peaks

Functional group    Basic BPC    BPC with normalization    Fractional correction    MAX        Descriptors retained by MAX
Alcohols            98.8         97.7                      97.7                     98.8       58
Benzene             90.7         90.7                      93.0                     84.9       58
Carbonyls           100.0(a)     100.0                     100.0                    100.0(b,c) 54
Carboxylic acids    96.5         97.7                      100.0                    97.7       76
Esters              97.7         98.8                      98.8                     97.7       45
Ethers              98.8         97.7(c)                   98.8                     100.0(c)   57
Ketones             97.7         96.5(b)                   96.5                     95.4       62

(a) Prediction of least populous class low. (b) Did not classify all training set correctly. (c) Prediction invalid.

The results reported for the basic BPC were obtained with the weight vector initialized as follows: w_j = 0, j = 1, 2, ..., 131, and w_132 = -2100. The deadzone parameter, or threshold, Z, was set to 2000. The same procedure was used for the BPC with normalization. The basic BPC program and the BPC with normalization show very similar predictive abilities. Normalization of the weight vector produces no overall improvement in prediction, and for ethers the predictive ability is invalid because of the extremely low predictive ability for the less populous positive class. For the basic BPC, the predictive ability of the positive Benzene class is low, and therefore the validity of the weight vector is in question. The prediction shown for the fractional correction method is very similar to that for the other methods for most classes. However, there is marked improvement in the Benzene and Carboxylic acid classes. Of the four training routines shown in Table I, this method shows the most promise.
In the case of MAX, the weight vector is initialized in accordance with Equations 13 through 15. The 132nd dimension is set so that the worst scalar quantities in the negative and positive classes are equidistant from the decision surface. The 131 features are input to MAX, and the number which survive are shown in the last column of Table I. The results of MAX are comparable to those of the other methods shown whenever linear separability was obtained. Because of its inability to separate several classes, it is probably the poorest of the four training routines.

Table II. Predictive Abilities of Fractional Correction BPC Using Shifted Intensity Scale

Functional group    Peaks used    Percent prediction    Peaks used    Percent prediction
Alcohol             131           98.8                  38            98.8
Benzene             131           94.2                  22            89.5
Carbonyl            131           100.0                 32            100.0
Carboxylic acid     131           100.0                 59            100.0
Ester               131           97.7                  29            97.7
Ether               131           97.7                  26            97.7
Ketones             131           97.7                  33            95.4

(5) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 690 (1969).

The left half of Table II shows the results obtained with the fractional correction method working with the infrared spectra after they had been subjected to a further preprocessing step. The values of the descriptors, which were originally on the interval zero to nine, were shifted to the interval -1.75 to 7.25. This allows an infrared spectral region which does not contain absorption, and is therefore represented by a zero, to contribute negatively to the scalar used
for classification. Thus, a BPC can use the lack of absorptions for classification just as the presence of absorptions is normally used. Column three of Table II shows the predictive abilities obtained with this offset intensity scale. The predictive abilities are higher here than those of Table I for three cases, equal for two cases (both 100%), and lower here for two cases.
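The offset preprocessing amounts to one subtraction per descriptor; a minimal sketch (our illustration, with a hypothetical helper name):

```python
import numpy as np

def offset_scale(spectrum_0_to_9):
    """Shift digitized 0-9 intensities to the -1.75 to 7.25 interval.

    Intervals with no absorption (0 -> -1.75) then pull the dot product
    down, instead of contributing nothing as they would on the 0-9 scale.
    """
    return np.asarray(spectrum_0_to_9, dtype=float) - 1.75

print(offset_scale([0, 9, 3]))   # zeros become -1.75, nines become 7.25
```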
FEATURE SELECTION

The general goal of feature selection is to cut down the number of descriptors per pattern without decreasing the separability of the clusters. This has been done previously using weight-sign feature selection, which involves independently training two weight vectors with the same training set. Independent training is done by initializing the two weight vectors differently, by shuffling the sequence of the patterns in the training set, etc. Then the individual components of the two trained weight vectors are compared pairwise. Two cases can arise: if the two weight vector components agree in sign, the corresponding descriptor is retained; if they disagree in sign, the corresponding descriptor is discarded. The entire procedure can be repeated until no further descriptors can be discarded. Under some circumstances, the weight-sign feature selection method can be difficult to control. A new method was therefore developed for which only one weight vector is trained.
The new feature selection method utilizes the fractional correction training procedure summarized by Equations 5 through 12 and the offset intensity scale described above. When training is done in this way and the components of the final, trained weight vector are inspected, four possible cases can arise. These are shown in Figure 1. For case I, w_p is negative, which says that there is no substantial absorption in that interval for the positive class. (c_i is always positive for patterns in the positive class and always negative for patterns in the negative class.) Also for case I, w_n is negative, which shows a substantial absorption in that interval for the negative class. This combination can be important, and it is a valid criterion for the usefulness of an interval when two closely related classes are being differentiated, e.g., ketones from aldehydes.
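The weight-sign comparison described above can be sketched in a few lines (our illustration with made-up weight values; `weight_sign_select` is a hypothetical helper, not from the paper):

```python
import numpy as np

def weight_sign_select(w1, w2):
    """Return indices of descriptors whose two independently trained
    weight components agree in sign; sign disagreements (and zeros)
    mark descriptors to be discarded."""
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    return [j for j in range(len(w1)) if w1[j] * w2[j] > 0]

# Two hypothetical trained weight vectors over four descriptors:
keep = weight_sign_select([0.8, -0.2, 1.1, -0.5], [0.3, 0.4, 0.9, -0.1])
print(keep)
```

In practice the surviving descriptors would be used to retrain, and the comparison repeated until the retained set stops shrinking.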
Table III. Reduced Number of Peaks Using Only Feature Selection Combinations IIIB and IV

Functional group(a)    Peaks used    Percent prediction
Alcohol                9             98.8
Benzene                4             did not train
Carbonyl               8             98.8
Carboxylic acid        21            98.8

(a) Benzene, ether, ester, and ketone would not train after the first feature selection.

Figure 1. The four possible cases (I, II, IIIA, IIIB, IV) arising during feature selection
Figure 2. Carbonyl weight vector map (wavelength, microns, 4.0-12.0)
Figure 3. Carboxylic acid weight vector map (wavelength, microns, 4.0-12.0)
Figure 4. Alcohol weight vector map (wavelength, microns, 4.0-12.0)

Case II denies a peak for both the negative and positive members and therefore shows the interval to be an unimportant feature. This combination is always eliminated from further consideration during feature selection. The third grouping is split into two parts. When w_t is negative (case IIIA), both groups have a substantial absorption, but the negative class members have a stronger or more frequently occurring absorption at that interval. If w_t is very negative, the feature will be very similar to case I; if only slightly negative, either it is a common feature of most spectra (such as the aliphatic peaks) or it may, with further training on a reduced number of features, fall into category IIIB. In the IIIB classification, the positive class members have a stronger or more frequently occurring absorption than the negative class members. This grouping is the second most important in the decision mechanism for all classifications, especially when there is little or no unique behavior for the differentiation of one particular class from another. The most important combination is IV. This shows a strong absorption for the positive class and the lack of one for the negative
class. Unique behavior of one class compared to another will show up in this combination, such as the C=O stretch, which differentiates carbonyl from non-carbonyl compounds.
The feature selection in Table II was done by eliminating only the features that fell in case II. In all cases, the feature selection was repeated until no further intervals could be discarded. As can be seen, the feature selection is quite good, with very little change in the predictive ability. The prediction of Benzene is fairly low, due mainly to the large
number of peaks that were eliminated in the first pass. This is caused by benzene having very broad, low-intensity absorptions.
In Table III, feature selection was done by eliminating features having all combinations of weight vector components except IIIB and IV. Most classes would not train before the feature selection routine had completely reduced the features to a minimum. While this second method reduces the features to a bare minimum, only those classes with unique behavior will continue to train, making this method of elimination inapplicable to most classes of compounds.
The advantages of this feature selection method over weight-sign feature selection are twofold. First, it allows the weight vector to be initialized to zero, thus removing any bias caused by the weight vector initialization step. Second, with this feature selection method only one weight vector need be computed, so the computer time needed to train for a class is decreased by a factor of two.
Figures 2 through 4 show weight vector plots of the three functional groups that would train after feature selection was performed as in Table III. The abscissas are wavelength in microns and the ordinates are scaled arbitrarily. In the case of carbonyls and carboxylic acids, this feature selection was continued until no more features could be eliminated. By using the restriction that all features must be in classes IIIB or IV, all weight components must have a positive value, as is shown in Figures 2 and 3. This shows that there exists a large amount of unique or semi-unique behavior in these two classes. For alcohols, much less of this behavior is evident. To separate the alcohols from the other compounds, it is necessary to have negative components of the weight vector.
Literature infrared spectral tables (6, 7) show a great deal of similarity with the IR weight vector component maps. In the case of carbonyls, the tables show sharp peaks in the 5.8-6.1 μ and 7.6-9.1 μ ranges. With carboxylic acids, there are also close similarities between the weight components and the IR tables. Silverstein and Bassler show sharp peaks at 3.0-3.6, 5.8, and 7.7-8.0 μ and medium peaks at 3.7-3.9, 7.0, and 7.7-8.0 μ. The CRC tables show sharp peaks at 5.8-6.1 and 7.6-8.4 μ and moderate peaks at 3.1-3.4, 6.9-7.4, and 10.4-11.7 μ. With alcohols, the correlation is not as good at first glance. Since feature selection has been performed several times in reducing the spectra to only nine peaks, any feature remaining will have a high degree of correlation with the OH functional group. The IR tables list alcohols as having sharp peaks in the 2.8-3.1 and 8.2-10.0 μ ranges. All the remaining nine peaks are found in these ranges.

(6) R. M. Silverstein and G. C. Bassler, "Spectrometric Identification of Organic Compounds," Wiley, London, 1967, p 73.
(7) "Handbook of Chemistry and Physics," Robert C. Weast, Ed., The Chemical Rubber Co., Cleveland, Ohio, 1969, p F169.
CONCLUSIONS A new error correction feedback training method, called the fractional correction method, has been developed and tested. The classifiers generated with the fractional correction method gave predictive abilities from 93 to 100% on complete unknowns. The average predictive ability for the six chemical classes investigated was higher than the predictive abilities attained by any of the classifiers generated by other methods. When the fractional correction training method was combined with an offset intensity scale for the IR absorptions, a new feature selection procedure could be used. Complete training and high predictive abilities were obtained for patterns containing as few as 20% of the original features. A second new feature selection approach attempts to identify which features can be expected to be useful for classification by breaking down the weight vectors into positive and negative components. Further studies are being pursued to investigate more fully the methods reported and to apply them to other sets of data.
Received for review June 20, 1974. Accepted August 14, 1974. The financial support of the National Science Foundation is gratefully acknowledged.
Bayesian Approach to Resolution with Comparisons to Conventional Resolution Techniques

P. C. Kelly¹ and Gary Horlick

Department of Chemistry, University of Alberta, Edmonton, Alberta, Canada T6G 2G2
A Bayesian approach to resolution is presented in which a direct calculation is made of the posterior probability that the height of a second peak in a doublet is greater than zero. Since it makes full use of all available information, the Bayesian method should be an effective method for resolving peaks. This conclusion is borne out in a comparison of the Bayesian method to deconvolution, second derivative, cross-correlation, and least-squares methods for the resolution of noisy Lorentzian doublets with various relative heights and separations. In addition, the meaning of results obtained by both least-squares and Bayesian methods in a low signal-to-noise ratio situation is discussed and illustrated.

¹ Present address: Research Centre, Canada Packers, 2211 St. Clair Avenue West, Toronto, Ontario, Canada M6N 1K4.
The term resolution refers to the decomposition of a system into its constituent parts. The problem of resolution is frequently encountered by analytical chemists, both instrumentally during acquisition of analytical data and interpretively while processing analytical data. This paper is concerned with resolution at the processing step, where the system is some type of recorded spectrum and the components are peaks.
Resolution methods rely on the generation of a resolution statistic. For example, it is readily apparent that at least two peaks are present when a valley can be seen between two peak maxima. A suitable statistic might be the depth of the valley. Many of the common approaches to resolution involve techniques such as deconvolution or second-differentiation which accentuate the depth of the valley between two peaks.
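The valley-depth statistic mentioned above can be sketched directly for a noiseless trace (our own illustration, not the authors' Bayesian calculation; `valley_depth` is a hypothetical helper):

```python
def valley_depth(trace):
    """Depth of the valley between the two largest local maxima,
    measured below the lower of the two peak heights."""
    # local maxima: interior points higher than both neighbors
    maxima = [i for i in range(1, len(trace) - 1)
              if trace[i] > trace[i - 1] and trace[i] > trace[i + 1]]
    if len(maxima) < 2:
        return 0.0                 # no resolvable doublet
    maxima.sort(key=lambda i: trace[i], reverse=True)
    i, j = sorted(maxima[:2])      # positions of the two tallest peaks
    valley = min(trace[i:j + 1])   # lowest point between them
    return min(trace[i], trace[j]) - valley

doublet = [0, 2, 5, 2, 1, 2, 4, 2, 0]
print(valley_depth(doublet))
```

On noisy data such a statistic degrades quickly, which motivates the valley-accentuating and probabilistic methods the paper goes on to compare.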