Application of Pattern Recognition Techniques to the Interpretation of Severely Overlapped Voltammetric Data: Theoretical Studies

Q. V. Thomas¹ and S. P. Perone*

Purdue University, Department of Chemistry, West Lafayette, Indiana 47907
Pattern recognition methods were used here to determine the multiplicity of Stationary Electrode Polarographic (SEP) curves. While earlier work in this area considered only reversible SEP data, here an extended data base was used which included the irreversible and quasi-reversible cases in addition to the reversible case. Using this more general data base, evaluations were made for several feature definition, selection, and pattern classification methods, each of which was based on either the linear learning machine (LLM) or the k-Nearest Neighbor (kNN) classification rule. These evaluations led to the selection of a method which could be used to classify synthetic, noise-free singlet and severely-overlapped doublet SEP data with an accuracy of 94%. This method was then extended to the classification of alternate data bases consisting of noisy synthetic SEP data, and of reversible synthetic SEP data outside the limits of the training set. Conclusions derived from these studies can be applied directly to classification of real SEP data.
There are many examples of analytical measurements which can yield multiple-component information under the guise of a single response peak. In these cases the single peak is actually a summation of two or more components which are so closely overlapped that visual or mathematical detection of the multiplicity is virtually impossible. When the response peak results from a known number of components, various deconvolution methods (1-3) may be applied to determine the relative contribution of each. If, however, the number is not known a priori, it becomes desirable to use an independent method to determine the peak multiplicity before deconvolution is attempted. In this work we have evaluated the use of pattern recognition techniques for the detection of singlets and doublets in stationary electrode polarography (SEP). Previous work in this laboratory (4, 5) has demonstrated that either the linear learning machine (LLM) (6-8) or k-Nearest Neighbor (kNN) (9, 10) approach may be used to classify synthetic doublet and singlet reversible SEP data with about 90% to 95% accuracy. In those studies the SEP waveforms were not complicated by electron transfer kinetics, so any shape differences between singlets and doublets arose primarily from the actual peak multiplicity. In the more general case of real voltammetric data, electron transfer kinetics and other factors affect the shape of waveforms, and these complications must be considered when generating a data base for the classification of real patterns. In this paper we report the preparation of an extensive set of synthetic SEP data (the QUIRDS data set) which includes Quasi-reversible, Irreversible, and Reversible Doublet/Singlet patterns.

¹Current address: Finnigan, 845 West Maude Avenue, Sunnyvale, Calif. 94086.
This data base provided a convenient means of exploring questions crucial to the classification of real doublet/singlet SEP data, such as which data features are most meaningful and which classification technique is most appropriate. The results of these studies are reported here, while applications to real voltammetric data are reported elsewhere (11).
General Considerations for Pattern Classification. When selecting a pattern recognition method to solve a particular problem in chemical analysis, the analyst must be aware of the unique characteristics of available methods. The LLM, for example, assumes that the patterns under study are linearly separable. This results in the training of weight vectors which converge to a decision surface that classifies the data 100% correctly. Lack of linear separability prohibits convergence and lowers the prediction percentage. Training weight vectors is an iterative process and is therefore time consuming for large data sets. Prediction by weight vectors, on the other hand, is a very quick process requiring only one calculation for each pattern classified. The kNN algorithm makes no assumption of linear class separability and can be applied to multiclass problems. There is no training process as such with kNN, but prediction requires distance calculations over the entire training set, a process requiring substantially more time than LLM predictions.

The most crucial aspect of pattern recognition is the intelligent definition of pattern descriptors or features (feature definition). Features must be defined so as to effect class separation on the basis of real class differences and not on the basis of trivial information. This is often a highly intuitive step, but it must be based on the analyst's best understanding of his data.

The general procedure for evaluating these various factors is described below. This procedure was followed for each of two independent feature sets in order to determine which classification method and features were best applied to our SEP data base. The first step, data base generation, involves the process of collecting the raw analog or digital data which represent patterns of each of the classes to be considered. These data may result from actual physical measurements or may be derived from theoretical considerations. Subsets of this data base may be selected for use as training or prediction sets during the later processes of feature evaluation and selection. Training sets should include as wide a representation as possible of those patterns which may be found in a prediction set. In a case where only small numbers of real patterns are available, this may become a limiting factor in classification accuracy.

The feature extraction process simply involves the calculation of values for each feature from the raw data. This is also the point at which various signal and feature preprocessing methods may easily be applied. These may include digital smoothing techniques (12-14), numerical transformations (15, 16), and autoscaling (17).
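To make the operational contrast between the LLM and kNN concrete, the sketch below implements a perceptron-style weight-vector trainer and a kNN predictor. This is an illustrative Python/NumPy rendering, not the FORTRAN IV programs used in this work; the array layout, the +1/-1 class labels, and the error-correction update are assumptions of the sketch.

```python
import numpy as np

def train_llm(X, y, max_iter=100):
    """Linear learning machine: iteratively correct a weight vector until it
    classifies every training pattern (converges only for linearly separable data)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])      # augment patterns with a bias term
    w = np.ones(Xa.shape[1])
    for _ in range(max_iter):
        errors = 0
        for xi, yi in zip(Xa, y):                  # y holds +1 (singlet) or -1 (doublet)
            if yi * np.dot(w, xi) <= 0:            # misclassified pattern
                w += yi * xi                       # error-correction feedback
                errors += 1
        if errors == 0:                            # 100% correct on the training set
            break
    return w

def predict_llm(w, X):
    """Prediction by a trained weight vector: one dot product per pattern."""
    return np.sign(np.hstack([X, np.ones((len(X), 1))]) @ w)

def predict_knn(X_train, y_train, X_test, k=3):
    """kNN rule: no training, but each prediction requires distances
    to the entire training set."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances
        preds.append(np.sign(y_train[np.argsort(d)[:k]].sum()))  # majority vote, odd k
    return np.array(preds)
```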
Feature selection (sometimes referred to as feature elimination) is a process used to eliminate those features which actually supply little or no useful information to the classification routine. The resulting decrease in dimensionality usually provides for higher classification accuracy while requiring less computation time. If feature elimination causes only a decrease in classification results, the methods used for feature definition and extraction should be reexamined, as well as the classification method itself.

Computer Hardware/Software. All calculations were programmed in FORTRAN IV on a Hewlett-Packard 2100S computer, with 32K words of core memory and floating point hardware, operating under the H-P DOS-M executive. Peripherals include an HP-7900 disk drive with 5 Mbytes of storage, a Centronics 306 printer, a Calcomp 565 digital plotter, a Teletype, a paper tape reader and punch, and a Tektronix 603 storage monitor.

Generation of the Data Base. There are several factors which influence the shape of a multiple-component SEP curve. If the effects of coupled chemical reactions and surface complications like adsorption or precipitation are avoided, then the primary shape factors include the degree of reversibility, the number of electrons involved in the reduction step, and the concentration and peak potential of each component. Considering only the doublet case, the size of a representative data base which includes various combinations of all of these becomes quite large. Experimental generation of this type of all-inclusive data base would be impractical. For this reason, we chose to generate our data base synthetically from SEP theory (18, 19). Hopefully, the synthetic data base could then be used as a training set for classifying real experimental doublet/singlet data, or at least for identifying useful features for the classification of real data.

There are several advantages to this approach besides the practicality of synthesis. Because each of the above shape factors can be varied at random, a data base can be prepared which is thoroughly representative of the range of waveforms that might be expected in real data. A second advantage is that a synthetic data base is free from instrumental artifacts, noise, and bias. The presence of these distortions in real data might mask intrinsic relationships important for classification; such relationships may become clear only in their absence. A synthetic data base also provides a convenient source of information for the feature selection process. Because all of the synthesis parameters are available, correlations between features and parameters are easily studied. This can be useful for eliminating trivial features, a process far more difficult with real data. Finally, pattern recognition studies with a synthetic data base determine, at least, whether or not the proposed classification objective is feasible. That is, if the idealized synthetic patterns are not classifiable, it is unlikely that classification of real patterns can be achieved.
THEORY

Equations 1 and 2 describe the SEP current for the reversible and irreversible (18) cases, respectively, where values of χ(at) and χ(bt) are available as tabulated functions of n(E - E1/2) or αn(E - E1/2), respectively (n = number of electrons, E = electrode potential, E1/2 = half-wave potential, α = transfer coefficient):

i_rev = nFAC_O*(πDa)^(1/2) χ(at)   (1)

where a = nFV/RT, and

i_irr = nFAC_O*(πDb)^(1/2) χ(bt)   (2)

where b = αnFV/RT, V = voltage scan rate, F = the faraday, A = electrode area, C_O* = bulk concentration, D = diffusion coefficient, and the other symbols have their usual significance.
Table I. Parameter Ranges for Synthetic SEP Doublet/Singlet Data Base (QUIRDS)

ψ               0.01, 0.02, 0.05, 0.10, 0.20, 0.50, 1.0, 2.0, 5.0, 10.0
α               0.3, 0.4, 0.5, 0.6, 0.7
n               0.9, 1.0, 1.1; 1.8, ..., 2.2; 2.7, ..., 3.3 (a)
Ep1 - Ep2       4, 6, 8, 10 mV
C1*/C2* (b)     10:1 to 1:10

(a) "..." represents increments of 0.1. (b) C2* = 0 for singlets; all other values chosen at random.
Table II. Typical SEP Features (a)

(ΔE at 75% ip)/(ΔE at 80% ip)
(ΔE at 75% ip)/(ΔE at 80% i′p)
(Area above 75% ip)/(ΔE at 75% ip)
(Area above 75% ip)/(Area above 80% ip)
Efunction at 70% ip
Efunction at 80% i′p
Ifunction at ΔEip = 16 mV
i′p1/i′p2
i″p1/(i″p2·ip1) at Ep ± 10%

(a) See Ref. 4 and 22 for complete definition of all features and terms. Efunction = (E2 - Ep)(Ep - E1)/((E2 - Ep) - (Ep - E1))²; Ifunction = i2·i1/(i2 - i1)²; ip = peak current, i′p = 1st derivative peak current, i″p = 2nd derivative peak current.
A more general expression, which includes these two limiting cases as well as the quasi-reversible case, can be obtained from Equation 2 by replacing χ(bt) with an expression, derived by Nicholson (19), which may be tabulated as a function of n(E - E1/2) for various values of the reversibility factor ψ and of α. For values of ψ greater than 7.0, the function describes totally reversible behavior and is independent of α. For ψ less than 0.01, the function represents totally irreversible behavior, with intermediate values pertaining to quasi-reversible cases. To generate the synthetic data base used in this work, χ(ψ,α) functions were calculated using 10 values of ψ, ranging from totally irreversible to reversible, and 5 values of α. Using Equation 3, in which the total current is the point-by-point sum of the two component currents,

i(E) = i1(E) + i2(E)   (3)

1200 singlets and 2400 doublets were synthesized
by randomly selecting values for ψ, α, n, and Ep1 - Ep2 from the values listed in Table I. Concentration ratios were assigned by selecting a random number between 1.0 and 10.0 for each component of the curve. For singlet curves, C2* was set equal to zero. Each synthetic curve is composed of 250 points calculated at a resolution of 2 mV per point, with the peak falling between data points 150 and 154. The data base of 3600 patterns (QUIRDS) was then scrambled to obtain a random mixture of singlets and doublets throughout. This step was taken to prevent pattern order from exerting a strong influence on weight vectors calculated by the LLM. From this scrambled data base, a training set of 1000 patterns was selected, as well as several prediction sets of 200-300 patterns. In each case the ratio of singlets to doublets was very close to the 1:2 ratio expected, indicating that thorough mixing was obtained.
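The sampling scheme can be illustrated with a short sketch that assembles one synthetic doublet from the Table I ranges. This is a Python/NumPy illustration, not the original synthesis program; `chi_function` is a hypothetical helper standing in for the tabulated χ(ψ,α) current functions.

```python
import numpy as np

rng = np.random.default_rng()

PSI = [0.01, 0.02, 0.05, 0.10, 0.20, 0.50, 1.0, 2.0, 5.0, 10.0]
ALPHA = [0.3, 0.4, 0.5, 0.6, 0.7]
N = [0.9, 1.0, 1.1] + [round(1.8 + 0.1 * j, 1) for j in range(5)] \
    + [round(2.7 + 0.1 * j, 1) for j in range(7)]
DELTA_EP_MV = [4, 6, 8, 10]

def synthesize_doublet(chi_function, n_points=250, de=0.002):
    """Assemble one doublet as the sum of two component curves (Equation 3).
    chi_function(psi, alpha, n, e) is a hypothetical lookup returning the
    tabulated current function on the potential axis e."""
    e = np.arange(n_points) * de                    # 2 mV per point
    psi1, psi2 = rng.choice(PSI, 2)
    a1, a2 = rng.choice(ALPHA, 2)
    n1, n2 = rng.choice(N, 2)
    sep = rng.choice(DELTA_EP_MV) * 1e-3            # peak separation, V
    c1, c2 = rng.uniform(1.0, 10.0, 2)              # concentrations: ratio 10:1 to 1:10
    # for a singlet curve, set c2 = 0
    i = c1 * chi_function(psi1, a1, n1, e) + c2 * chi_function(psi2, a2, n2, e - sep)
    return i / i.max()                              # peak current normalized to 1.0
```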
Feature Definition. We have defined two feature sets of 62 features each, one set based on theoretical considerations of SEP shape factors and the other based upon the discrete Fourier transform (20, 21) of the raw SEP curve. A list of typical SEP-based features is found in Table II; a complete list may be found in Ref. 22. Thirty-four features were defined as ratios which would eliminate any dependence on shape factors such as α or n. This type of ratio was shown to be an important feature in earlier studies (4, 5). Another 28 features were selected from the list of arbitrarily-defined features which Sybrandt (4) found to be the most useful for classification of reversible SEP data by the LLM. Note that in both cases there are several features which are derived from the first and second derivatives of the SEP curve. Significant noise on the zero-derivative curve could result in distortion of these features.

An alternative method for describing the shape of a waveform is by a series expansion of orthogonal functions. The Fast Fourier Transform (FFT) provides a convenient method of obtaining such an expansion in terms of sine and cosine functions. The low frequency components of the transform will contain most of the fundamental shape information, while the high frequency components may be regarded as noise information. Using the low frequency components as features allows the retention of shape details while discriminating against background noise. For this work, 31 real and 30 imaginary coefficients corresponding to the lowest frequency components of the discrete Fourier transformation were retained as features. (The first imaginary coefficient was not used because it is always zero.)

Figure 1. Translation-rotation of SEP curves for Fourier transformation. (a) Original curve. (b) After translation-rotation.

Feature Extraction. A minimum amount of signal preprocessing was performed on the raw SEP data in order to maintain curve fidelity as much as possible. All curves were normalized to a peak current of 1.0 prior to extraction of both the SEP and FFT features. During extraction of the SEP features, derivatives were calculated using an 11-point cubic convolution function (12). All SEP features were calculated directly from the raw curve and these derivatives. Before extraction of the FFT features, each curve was shifted to place Ep at data point number 96 and then truncated to 128 points. These two operations were carried out to ensure that all Fourier transformations were performed on the same potential window. Failure to do this would invalidate any comparisons made between FFT features of different patterns. Several Ep positions were tried (e.g., 32, 64, 96, 128), but the best overall classification results were obtained when Ep was located at point 96. Finally, each peak-adjusted curve was translated and "rotated" about the origin to bring the initial and final data points to a value of zero (Figure 1). (This is not a true rotation, but rather involves a point-by-point subtraction of the line connecting the initial and final data points (23).) This has been shown to be a most effective means of maintaining the fidelity of electrochemical data during Fourier transformations (23).
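The peak adjustment, truncation, translation-rotation, and coefficient selection steps can be sketched as follows. This is an illustrative Python/NumPy rendering under the parameters given above (a 128-point window, the peak at point 96, and 31 real plus 30 imaginary coefficients), not the authors' implementation.

```python
import numpy as np

def translate_rotate(curve):
    """Subtract, point by point, the straight line joining the first and last
    points, bringing both end points to zero (Figure 1b)."""
    n = len(curve)
    baseline = curve[0] + (curve[-1] - curve[0]) * np.arange(n) / (n - 1)
    return curve - baseline

def fft_features(curve, peak_index, window=128, peak_at=96, n_real=31, n_imag=30):
    """Extract low-frequency Fourier coefficients as shape features."""
    start = peak_index - peak_at                 # place the peak at the same point
    segment = translate_rotate(curve[start:start + window])
    coeffs = np.fft.fft(segment)
    real = coeffs.real[:n_real]
    imag = coeffs.imag[1:n_imag + 1]             # first imaginary coefficient is always 0
    return np.concatenate([real, imag])          # 61 features
```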
After the translation-rotation step, each curve was converted to real values and Fourier transformed. The FFT routine returned 128 imaginary and 128 real coefficients, from which 61 features were selected as described earlier. Both feature sets were autoscaled to give each feature unit variance and zero mean. No weighting of features was performed.

Generation of Noisy Synthetic Data. In order to study the effects of noise on our two feature sets, a set of noisy patterns was prepared by summing a real noise signal with the raw curves of the QUIRDS prediction set. This noise signal was obtained from a computer-automated electroanalytical instrument which has been described elsewhere (24). The noise was scaled to achieve a S/N ratio of about 100. In addition, the noise data array was rotated a random number of points before being added to each raw synthetic curve. This was done to prevent an artificial bias from being added to the data, which might result if all synthetic data had exactly the same noise pattern. Both feature sets were extracted from these noisy data and autoscaled as before.

Two signal preprocessing methods were used to smooth the noisy QUIRDS data prior to feature extraction: a Savitzky-Golay 15-point cubic convolution (12) and a rectangular Fourier filter function (23). The Fourier filter function was applied to the noisy QUIRDS prediction set as follows. Each raw curve of 250 points was translated and rotated about the origin as in Figure 1. The data were expanded to 256 points by adding zeros and then Fourier transformed. The Fourier coefficients corresponding to the 208 highest real and imaginary frequency components were then set to zero. This cut-off was found to maintain smooth data fidelity better than any other cut-off tried. Finally, all 256 Fourier coefficients were inverse transformed, rotated, and translated to obtain a smoothed curve. The features were then extracted following peak adjustment and truncation as described earlier.

FFT features should show good noise immunity, since most noise transforms to the high end of the frequency spectrum. However, noise can create difficulties in locating true peak potentials. Noise can cause slight shifts (1 to 2 data points) of the apparent peak location and thus change the potential window viewed by the FFT. This can then result in erroneous FFT feature values. Fourier filter smoothing prior to peak adjustment helps to minimize the errors associated with locating Ep by making the apparent peak location and the true peak location coincide more exactly.

The Savitzky-Golay smooth is a weighted moving average method which utilizes information on both sides of a data point to calculate its smoothed value. A 15-point smooth was required to achieve a S/N level comparable to that obtained with the Fourier filter function. This is a rather severe processing step which can result in peak distortions.
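A sketch of the rectangular Fourier filter described above follows, again as an illustrative Python/NumPy rendering rather than the original program; the 15-point Savitzky-Golay alternative is available in SciPy as scipy.signal.savgol_filter.

```python
import numpy as np

def fourier_filter(curve, keep=24):
    """Rectangular Fourier filter smoothing: transform, zero the coefficients of
    the 208 highest frequency components (keeping the 48 lowest: 24 frequencies
    plus their conjugates), and inverse transform."""
    y = translate_rotate(curve)                     # from the preceding sketch
    padded = np.append(y, np.zeros(256 - len(y)))   # expand 250 -> 256 points
    c = np.fft.fft(padded)
    c[keep:-keep] = 0.0                             # rectangular cut-off: zeros 208 coefficients
    smooth = np.fft.ifft(c).real[:len(curve)]
    return smooth + (curve - y)                     # undo the translation-rotation

# alternative 15-point cubic convolution smooth (12):
# from scipy.signal import savgol_filter
# smooth = savgol_filter(curve, window_length=15, polyorder=3)
```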
Feature Selection. The goal of feature selection is to find an optimum set of features which will accurately classify a set of test patterns (prediction set). We have evaluated the ability of three feature selection methods to achieve this goal; two are based on the LLM and one on the kNN approach. Each method was applied to both feature sets (SEP and FFT) to determine which combination would yield the best results.

Prior to these studies, a correlation coefficient matrix was generated for both feature sets obtained from the QUIRDS data. Eigenanalysis (25) of these matrices provides the number of independent factors inherent in the data, and this gives an indication of the information content of each feature set. For the SEP features, the first 5 eigenvectors account for 87% of the total variance, where the respective eigenvalues are 11.7, 7.06, 3.41, 2.50, and 1.37. For the FFT features, the first 4 eigenvectors account for 97% of the total variance, where the respective eigenvalues are 18.9, 5.56, 3.61, and 1.08.
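The eigenanalysis itself is a small computation; a minimal sketch, assuming plain NumPy, is:

```python
import numpy as np

def variance_explained(features):
    """Eigenanalysis of the feature correlation matrix: a few large eigenvalues
    indicate that few independent factors underlie the feature set."""
    r = np.corrcoef(features, rowvar=False)      # features: (n_patterns, n_features)
    eigvals = np.linalg.eigvalsh(r)[::-1]        # eigenvalues in descending order
    return eigvals, np.cumsum(eigvals) / eigvals.sum()
```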
Table III. Feature Selection and Classification Results for QUIRDS Data Base

                          --------- SEP features ---------   --------- FFT features ---------
                          No.        % Acc       % Acc       No.        % Acc       % Acc
Selection method          retained   Singlets    Doublets    retained   Singlets    Doublets
kNN (IFS)                 3 (a)      98.6        93.0        9 (b)      97.4        91.8
LLM (Variance)            54         73.6        68.7        30         11.5        94.3
LLM (Sign-magnitude)      23         54.4        75.3        27         32.0        79.5

(a) Features No. 1, 20, 25; Ref. 22. (b) Features No. 1, 3, 8, 11, 17, 27, 28, 29, 30; Ref. 22.
During feature selection, a prediction set should be used which is as representative as possible of the final set of unknown data to be classified. This will ensure that the features selected are most likely to classify the unknown patterns correctly. Failure to do so will cause the selection of a set of features which may optimize the classification of a given prediction set but fail to classify the patterns of real interest.

Sign-Magnitude Method. This approach is based on the LLM and has been described elsewhere (4). Briefly, features are removed from the feature set as a result of two conditions. If, after convergence (or interruption of training if convergence does not occur) of two simultaneously-trained, oppositely-initialized weight vectors, the weights corresponding to a given feature differ in sign, that feature is removed from the feature set. The remaining weight vectors are then reinitialized and the cycle is repeated. When no further features are removed by this criterion, those features whose weights have the smallest values are eliminated. The entire process continues until linear separability is lost or, in the case of nonseparable data, prediction falls below some arbitrary limit.

Variance Selection Method. The variance selection method (26) is based on the LLM and begins by training a series of weight vectors, each of which has been initialized differently. Features are then ranked in importance by the magnitude of the relative standard deviations of their corresponding weights. A set of suitable features is then selected corresponding to those weights which have the smallest standard deviations. This set is then broken down into smaller sets until separability is lost, with the last successful set being retained as the important features. Variance selection theory assumes the data are linearly separable and cannot necessarily be applied otherwise.

The One-Dimensional kNN Method. With this method (5), prediction set patterns are classified by the kNN rule, one feature at a time, rather than starting with all 62 at once as in the previous two methods. The m features scoring the highest accuracy are then retained for multidimensional analysis. The value of m is chosen somewhat arbitrarily and typically includes 10% to 20% of the original features. Single-feature histograms and feature correlations are then used to aid rejection of overlapping or redundant features from the set chosen above. The remaining features are then combined in groups of two or more for multidimensional analysis.

Following initial feature elimination with the one-dimensional method, an Iterative Feature Selection (IFS) approach, in which one feature at a time is removed from the feature set, was applied to the remaining features. If its removal results in a loss of classification accuracy, the feature is retained. If its removal results in an increase in classification accuracy, it is eliminated from the feature set. This process is repeated until each feature has been tested, completing one iteration. Additional iterations are performed, using those features retained during the previous iteration, until no further features are eliminated. The order in which features are tested is arbitrary. In this work it was decided that those features which are most likely to be affected by noise distortions should be tested first. While they should be eliminated whenever tested, removal of suspect features at the beginning of the first iteration will speed up the overall feature selection process.
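The IFS loop can be summarized in a few lines. The sketch below assumes an accuracy(subset) function that scores kNN classification of the prediction set using only the listed features; treating a removal that leaves accuracy unchanged as acceptable is an assumption where the text is silent.

```python
def iterative_feature_selection(feature_ids, accuracy):
    """Iterative Feature Selection: test features one at a time (noise-suspect
    features ordered first), dropping any whose removal does not hurt accuracy,
    and repeat until a full iteration removes nothing."""
    kept = list(feature_ids)
    changed = True
    while changed:
        changed = False
        for f in list(kept):
            trial = [g for g in kept if g != f]
            if accuracy(trial) >= accuracy(kept):   # removal helps or is neutral
                kept = trial
                changed = True
    return kept
```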
Condensed kNN Algorithm. One of the primary disadvantages of the kNN algorithm is the amount of time required to classify data when the training set is composed of a large number of patterns. Classification time can be decreased significantly if the training set can be reduced to a smaller number of patterns which lie on or near the decision surface between pattern classes. Generally, the information contained in patterns lying far from the decision surface is also contained in patterns near it; because such patterns represent essentially redundant information, they can be eliminated from the training set without a loss in classification ability. The condensed kNN algorithm (9) is a method utilizing the kNN decision rule which can be used to reduce pattern redundancy in the training set. The method proceeds as follows. The first training set pattern is placed in an empty set of condensed patterns. Each of the remaining training set patterns is then classified against the condensed set patterns by the kNN rule. If the pattern is classified incorrectly, it is placed in the condensed set; if it is classified correctly, the pattern remains in the original training set. Once all training set patterns have been classified, the process is repeated using the new condensed set and the patterns remaining in the original training set. When no further transfers to the condensed set occur, the process is complete. This usually occurs within four cycles.
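A sketch of this procedure (illustrative Python/NumPy, with +1/-1 class labels assumed) follows; it mirrors the transfer rule described above.

```python
import numpy as np

def condense_training_set(X, y, k=1):
    """Condensed kNN (9): retain only the patterns needed near the decision
    surface. Any training pattern misclassified by the current condensed set
    is transferred into it; passes repeat until no transfers occur."""
    condensed = [0]                                   # seed with the first pattern
    remaining = list(range(1, len(X)))
    transferred = True
    while transferred:                                # usually completes within four cycles
        transferred = False
        for i in list(remaining):
            d = np.linalg.norm(X[condensed] - X[i], axis=1)
            vote = y[np.array(condensed)][np.argsort(d)[:k]].sum()
            if np.sign(vote) != y[i]:                 # misclassified: transfer it
                condensed.append(i)
                remaining.remove(i)
                transferred = True
    return np.array(condensed)
```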
RESULTS

The results of the feature selection study for the QUIRDS data base are summarized in Table III. These results suggest that a kNN approach is the most appropriate selection and classification method for both feature sets. Not only is the classification accuracy for both classes the highest of the three methods, but the number of features retained is also the smallest. In the case of variance selection with SEP features, virtually no feature selection occurred. While fewer FFT features were selected by the variance method, the final weight vector essentially classified all the patterns as doublets. The sign-magnitude approach likewise did not select a useful set of features for linearly separating the two pattern classes in either feature space. The fact that neither learning machine method converged to 100% classification accuracy shows that these data are not linearly separable in either the FFT or the SEP feature space, which precludes the use of these methods as effective pattern classifiers for the QUIRDS data set.
Figure 2. Nonlinear map, QUIRDS prediction set; kNN-selected FFT features. (1) = singlet; (2) = doublet
Figure 3. Nonlinear map, QUIRDS prediction set; kNN-selected SEP features. (1) = singlet; (2) = doublet

Visual inspection of the QUIRDS prediction set allows further interpretation of these results. Figures 2 and 3 present information in the form of nonlinear mappings (27) (NLM) of both feature sets. NLM allows d-dimensional data to be plotted in two dimensions while retaining as nearly as possible the d-dimensional pattern distances. These plots were obtained by mapping the kNN features selected in Table III for 100 patterns selected at random from the prediction sets. Plots were restricted to 100 patterns because of core memory limitations.

In the mapping of the kNN-selected SEP features (Figure 3), singlets are found as two tightly clustered groups, the centers of which are reversible and irreversible patterns. Quasi-reversible singlets are found between and surrounding these two clusters. It is this high local density which allows the singlets to be classified correctly by kNN even in the presence of overlapping doublets. Any factor, such as noise, which tended to reduce singlet density could be expected to cause a significant reduction in kNN classification accuracy. The mapping of kNN-selected FFT features (Figure 2) shows an entirely different spatial distribution, where singlets tend to lie on a curve surrounded by doublets. As in the case of the SEP features, reduction of the singlet density along this curve by noise or other distortions could be expected to reduce classification accuracy significantly.
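A nonlinear map of this kind can be sketched as a gradient descent that matches 2-D interpoint distances to the original d-dimensional ones. The following is a minimal Sammon-type illustration in Python/NumPy, not the subroutine used to produce Figures 2-5.

```python
import numpy as np

def nonlinear_map(X, n_iter=300, lr=0.05, seed=0):
    """Place d-dimensional patterns in two dimensions so that pairwise 2-D
    distances approximate the d-dimensional distances (squared-error descent)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # target distances
    np.fill_diagonal(D, 1.0)               # dummy value; diagonal terms vanish below
    Y = rng.normal(size=(n, 2))            # random 2-D starting configuration
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d, 1.0)           # avoid division by zero
        w = (d - D) / d                    # per-pair distance errors (0 on diagonal)
        Y -= lr * (w[:, :, None] * diff).sum(axis=1) / n
    return Y
```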
Classification of Noisy Data. It was suggested above that noise on raw data curves could cause serious problems in feature extraction and pattern classification, especially where features are derived from derivatives of the curves. If a smooth synthetic training set is to be used for classification of real data, there must be a method of dealing with patterns which have a finite signal-to-noise (S/N) ratio, that is, to select features which have an inherent immunity to noise distortions. Thus, a set of noisy synthetic patterns (noisy QUIRDS, NQUIRDS) was generated as described in the Theory section above.

Table IV. Effect of Noise on kNN Classification of QUIRDS Data (a)

                                       Classification accuracy, %
                                   -- 3 SEP features --   -- 9 FFT features --
Prediction set                     Singlet    Doublet     Singlet    Doublet
Smooth QUIRDS (from Table III)     98.6       93.0        97.4       91.8
Noisy QUIRDS (NQUIRDS)             22.2       88.9        34.6       64.8
Savitzky-Golay-smoothed NQUIRDS    33.3       78.7        35.9       80.3
Fourier-filter-smoothed NQUIRDS    40.2       90.6        44.9       75.4

(a) Smooth QUIRDS used as training set.

Table V. Classification of QUIRDS Prediction Sets by Noise-Optimized Features

                                       Classification accuracy, %
                                   -- 4 SEP features (a) --   -- 12 FFT features (b) --
Prediction set                     Singlet    Doublet         Singlet    Doublet
Smooth QUIRDS                      99.1       86.0            85.9       84.4
Noisy QUIRDS (NQUIRDS)             89.9       57.1            64.1       85.3
Savitzky-Golay-smoothed NQUIRDS    7.4        93.2            57.7       68.0
Fourier-filter-smoothed NQUIRDS    80.8       89.4            82.1       90.9

(a) Features No. 2, 35, 37, 43; Ref. 22. (b) Features No. 1, 6, 7, 9, 10, 12, 13, 16, 20, 21, 22, 23; Ref. 22.

The kNN classification results using the smooth QUIRDS training set and both smooth and noisy QUIRDS prediction sets are shown in Table IV. The features used were the same as those used to compile Table III. The data show that the presence of noise causes a significant loss of classification accuracy for both the SEP and FFT features used. Table IV also lists the results obtained for the classification of the smoothed NQUIRDS prediction set, using both Savitzky-Golay and Fourier filter smoothing. Both smoothed data sets show slight improvements in classification accuracies for the SEP and FFT features over the unsmoothed noisy data. However, none of the results begin to approach those obtained for the smooth QUIRDS data.

Recalling the importance of choosing the correct prediction set for feature selection, the results described above were not unexpected. For this reason, kNN (IFS) feature selection was performed again using the Fourier-filtered noisy QUIRDS data as the prediction set. The new "noise-optimized" sets of SEP and FFT features were then applied to each QUIRDS prediction set to obtain the results shown in Table V. There was a significant improvement in the classification of both the noisy and Fourier-filtered data sets using these noise-selected features. This underscores the importance of selecting features to optimize prediction set classification. In addition, the higher results of row 4 compared to row 2 can be attributed to more reliable location of peak potentials prior to peak adjustment, made possible by Fourier filtering. The smooth QUIRDS prediction set was classified about as well as the Fourier-smoothed noisy data. However, the Savitzky-Golay smoothed data showed essentially no improvement in classification with the noise-selected features. This is an indication that the poor classification is due to factors other than noise alone (e.g., smoothing distortions).

Because the data show no clear advantage to using either the SEP or FFT features, a mixed feature set was created which included the best noise-optimized features of each type.
Table VI. Identification of Noise-Optimized Features

SEP feature No.    Definition (a)
2                  (E2 at 70% ip - Ep)/(Ep - E1 at 70% ip)
35                 Efunction at 70% ip
37                 Efunction at 80% ip
43                 (E1/2 - Ep)/(E1/4 - E3/4)

FFT feature No.    1, 6, 7, 9, 10, 12, 13, 16, 20, 21, 22, 23

(a) See Reference 22 for a complete description of all terms.

The purpose of this was to determine whether combinations of SEP and FFT features would be more effective than each individually. (Note that these are not linear or algebraic combinations of features, but new groupings of the feature sets.) An optimum set of nine mixed features was found, but they provided no enhancement beyond the results of Table V.

Summarizing the results of Tables III through V, we see that kNN classification, using appropriately selected SEP or FFT features, is the method of choice for classification of synthetic singlet/doublet SEP data. Not only does this approach yield the best classification results for both smooth and noisy data, but it also requires fewer features than either LLM approach.

Noise-Optimized Feature Lists. Table VI identifies those features which were found to be the most important for the classification of noisy QUIRDS data. Each of the four noise-optimized SEP features listed in Table VI is a function concerned with ratios of potential differences. This general class of feature has previously been shown to be most useful for classification of smooth SEP data (4, 5). Their selection here demonstrates the additional ability to discriminate against distortions caused by noise. There is no simple way to describe the physical significance of the optimized FFT features. Each feature is the magnitude of a sinusoid whose frequency is proportional to the feature number, with units of volts⁻¹ (this corresponds to Hz for the time domain transformation). The feature numbers listed in Table VI represent fairly low frequency information, as might be expected considering the shape of SEP data (Figure 1). These features were selected because of their particular ability to represent singlet/doublet shape information faithfully while discriminating against noise distortions.

Classification of Independent Data Sets. While the previous results demonstrate that it is possible to accurately classify singlet/doublet SEP data which fall within the QUIRDS data base, it is also important to know how classification is affected by the presence of patterns which fall outside the limits of the training set. To evaluate this, an additional prediction set was generated, based solely on reversible synthetic SEP data, which contained patterns inside and outside the QUIRDS limits. Equation 3 was again used to synthesize the SEP data. In this case, the chi function corresponding to totally reversible behavior was used for all curves. Doublet concentration ratios were allowed to vary between 20:1 and 1:20, limits twice those of the training set. Curves synthesized near those limits could be expected to show more singlet character than doublets of the training set. Values for n were chosen randomly between 0.9 and 3.3 with no minimum increments. Peak separations for doublets varied between 6 and 12 mV. A subset of reversible synthetic SEP curves was selected from the QUIRDS data set to act as a reference prediction set for comparison of classification results. Classification was performed by the kNN method using the best multidimensional feature sets (Table V) determined by the IFS feature selection process, and the results are given in Table VII.
There are two major observations which can be made concerning the results found in Table VII.
Table VII. Classification of Test SEP Data Lying Outside the QUIRDS Training Set Limits (a)

                                      Classification accuracy, %
              Parameters          -- FFT features (b) --   -- SEP features (b) --
              n       C1/C2       Ref (c)    Test          Ref (c)    Test
Singlets      WL (d)              99         100           100        79
              OL                             41                       75
Doublets      WL      WL          80         78            87         78
              OL      WL                     77                       80
              WL      OL                     45                       78
              OL      OL                     65                       74

(a) See Table I for limits. (b) Features used were the "noise-optimized" sets from Table V. (c) Reference data composed of a reversible subset of the QUIRDS data base not found in the training set. (d) WL = within limits, OL = outside limits.

The first of these observations concerns the classification of singlets and doublets exceeding any of the training set limits; the second relates to the classification of test set patterns falling within the training set limits. Of the two feature sets used, the SEP features appear to have the advantage in classification of patterns outside the QUIRDS limits. The average accuracy obtained for all singlets and doublets outside the limits was 76% for the SEP features and only 61% for the FFT features. From these results it can be concluded that FFT features are more sensitive than SEP features to those shape factors which make patterns fall within certain regions of feature space (e.g., the concentration ratio for doublets and n for singlets). Apparently, singlet patterns are affected most by these factors, as the FFT features show a strong bias toward their classification as doublets. Reference to Figure 2 suggests why this may occur. Singlet patterns tend to be distributed curvilinearly in FFT feature space, with doublets scattered around them. Any singlets which are forced slightly away from the singlet region by various shape factors are likely to be misclassified. On the other hand, doublets, because of their wide distribution in feature space, should be less affected by these slight changes in position. This conclusion is consistent with the results obtained here and is supported further by the results obtained in the noise studies (Table IV).

The second major observation from Table VII is the difference in the classification results for the test and reference sets for WL (within-limit) singlet/doublet patterns using SEP features. One would expect the results to be similar for all WL data sets, as observed for the FFT features. Inconsistent results like these for SEP features were obtained for several different subsets of the test data, indicating some unreliability associated with their use. FFT features, on the other hand, maintained consistently good results for WL patterns. While these results do not specifically indicate that one type of feature is more useful than the other for classification of SEP data, factors such as consistency and overall higher classification results for QUIRDS prediction sets tend to support the use of FFT features over SEP features for most classification problems. Of even more importance is the fact that the FFT feature extraction approach is more generally applicable to analytical waveforms than the SEP approach. The SEP approach would require a complete redefinition of features to take advantage of the particular theoretical relationships governing a given analytical response, while the FFT method could be used directly.

Condensation of the Training Set. The condensed kNN algorithm described in the Theory section was applied to the QUIRDS training set. Figures 4 and 5 show the results of
condensation on the QUIRDS training set using the best noise-optimized FFT features (from Table V). Figure 4 is an NLM of 100 representative QUIRDS training patterns. Note the relative similarity between the two class distributions compared to those of Figure 5, a map containing 100 of the 334 patterns retained by the condensation procedure. The main effect of condensation has been to lower pattern density while retaining those patterns necessary for good classification.

Figure 4. Nonlinear map, QUIRDS training set; noise-optimized FFT features. (1) = singlet; (2) = doublet

Figure 5. Nonlinear map, condensed QUIRDS training set; noise-optimized FFT features. (1) = singlet; (2) = doublet

Table VIII. kNN Classification of QUIRDS Prediction Sets by a Condensed Training Set (a)

                                   Classification accuracy, %
Prediction set                     Singlet    Doublet
Smooth QUIRDS                      87.2       86.1
Noisy QUIRDS (NQUIRDS)             58.9       79.5
Savitzky-Golay-smoothed NQUIRDS    52.6       70.5
Fourier-filter-smoothed NQUIRDS    83.3       87.7

(a) Feature set used is the noise-optimized FFT set (see Table VI).

Table VIII lists the results obtained by applying the condensed training set to the classification of various prediction sets. Comparison of these results with those of Table V shows little change in the overall classification accuracies. However, by using one-third the number of training patterns, these results were obtained approximately 3 to 4 times faster than with the complete training set.

CONCLUSIONS

There are several observations, based on this work with synthetic voltammetric data, which significantly affect the approach to be taken in applying pattern recognition to the classification of real voltammetric data. The first of these is that the classification of singlets and severely overlapped doublets appears, at least, to be feasible. Second, it appears that the kNN classification algorithm is the most appropriate. Finally, FFT features seem to allow the most reliable and consistent classification of voltammetric doublets and singlets. The conclusions from these studies have been applied to the classification of real voltammetric data reported elsewhere (11).

ACKNOWLEDGMENT

The authors thank Lilly Lai for her help in computing theoretical SEP data and Phil Gaarenstroom for development of the nonlinear-mapping subroutine.

LITERATURE CITED
(1) R. W. Dwyer, Jr., Anal. Chem., 45, 1380 (1973).
(2) D. W. Kirmse and A. W. Westerberg, Anal. Chem., 43, 1035 (1971).
(3) K. Yamaoka and T. Nakagawa, Anal. Chem., 47, 2050 (1975).
(4) L. B. Sybrandt and S. P. Perone, Anal. Chem., 44, 2331 (1972).
(5) M. A. Pichler and S. P. Perone, Anal. Chem., 46, 1790 (1974).
(6) N. J. Nilsson, "Learning Machines", McGraw-Hill, New York, N.Y., 1965.
(7) P. C. Jurs and T. L. Isenhour, "Chemical Applications of Pattern Recognition", Wiley-Interscience, New York, N.Y., 1975.
(8) K. S. Fu, "Sequential Methods in Pattern Recognition and Machine Learning", Academic Press, New York, N.Y., 1968.
(9) K. Fukunaga, "Introduction to Statistical Pattern Recognition", Academic Press, New York, N.Y., 1972.
(10) B. R. Kowalski and C. F. Bender, Anal. Chem., 44, 1405 (1972).
(11) Q. V. Thomas, R. A. DePalma, and S. P. Perone, Anal. Chem., following paper in this issue.
(12) A. Savitzky and M. J. E. Golay, Anal. Chem., 36, 1627 (1964).
(13) D. E. Aspnes, Anal. Chem., 47, 1181 (1975).
(14) G. Horlick, Anal. Chem., 44, 943 (1972).
(15) K. G. Beauchamp, "Signal Processing", George Allen and Unwin Ltd., London, 1973.
(16) B. R. Kowalski and C. F. Bender, Anal. Chem., 45, 2234 (1973).
(17) B. R. Kowalski and C. F. Bender, J. Am. Chem. Soc., 94, 5632 (1972).
(18) R. S. Nicholson and I. Shain, Anal. Chem., 36, 706 (1964).
(19) R. S. Nicholson, Anal. Chem., 37, 1351 (1965).
(20) J. W. Cooley and J. W. Tukey, Math. Comput., 19, 297 (1965).
(21) G. D. Bergland, IEEE Spectrum, 6, 41 (1969).
(22) Q. V. Thomas, Ph.D. Thesis, Purdue University, 1976.
(23) J. W. Hayes, D. E. Glover, and D. E. Smith, Anal. Chem., 45, 277 (1973).
(24) Q. V. Thomas, L. Kryger, and S. P. Perone, Anal. Chem., 48, 761 (1976).
(25) R. J. Rummel, "Applied Factor Analysis", Northwestern University Press, Evanston, Ill., 1970.
(26) G. S. Zander, A. J. Stuper, and P. C. Jurs, Anal. Chem., 47, 1085 (1975).
(27) B. R. Kowalski and C. F. Bender, J. Am. Chem. Soc., 95, 686 (1973).
RECEIVED for review November 29, 1976. Accepted May 23, 1977. This work has been supported by the Office of Naval Research and by the National Science Foundation, Grant No. MPS74-12762.