ANALYTICAL CHEMISTRY, VOL. 51, NO. 7, JUNE 1979 • 825
On-Line Pattern Recognition of Voltammetric Data: Peak Multiplicity Classification
R. A. DePalma¹ and S. P. Perone*
Purdue University, Department of Chemistry, West Lafayette, Indiana 47907
An on-line procedure for the classification of peak multiplicity in stationary electrode polarography has been demonstrated. Real voltammograms are compared with a composite training set of real and theoretical data. The k-Nearest Neighbor pattern recognition classification technique was used. Seven different metal ions were classified as singlets nearly 100% correctly, with an average accuracy of greater than 95%; severely overlapped doublets were also detected by this on-line procedure.
Mixture analysis by electrochemical techniques suffers from the inability to resolve and quantitate closely overlapped systems. A necessary prerequisite to quantitation is the ability to identify an electrochemical waveform as consisting of one or two components. Recently, a method designed to overcome this problem in stationary electrode polarography (SEP) was demonstrated (1, 2). Stationary electrode polarography was used as the electrochemical tool because the behavior of the waveform is well understood for a wide variety of electrode processes (3, 4). In that method (1), theoretical voltammograms were created within a specified range of reversibility, transfer coefficient, α, and electron transfer number, n. Single component voltammograms and doublet voltammograms with overlaps less than 10 mV and peak height ratios from 10:1 to 1:10 were created. The use of pattern recognition allowed the identification of a voltammogram as a singlet or doublet under these conditions. This work was extended to real electrochemical systems with slightly lower accuracy (2). In that work, the voltammogram was represented by its Fourier transform, and the k-Nearest Neighbor (kNN) method was used for classification.
In this work, we report the development of an on-line classification procedure for the detection of peak multiplicity in SEP with overlaps less than 10 mV and peak height ratios from 1:1 to 1:5. This procedure was evaluated by using seven different singlet systems and a doublet system at two overlap conditions and four peak height ratios. In the previous classification procedure (2), a training set which contained only real data was required to achieve the greatest accuracy. The use of synthetic data in the training set, as applied here, has the advantages of precise control of wave shape parameters, freedom from noise, access to an unlimited curve set, and extreme resolution.
By creating synthetic data with a wide range of characteristics, it is possible to represent a large fraction of the real data which is anticipated. This latitude frees the classification procedure from a dependence on previously acquired real data for maximum accuracy. By applying this classification procedure on-line, the obvious benefit of quicker analysis time is achieved as well as the ability to detect errors in the instrumentation or complications in the electrochemistry before the entire set of data is analyzed. This ability permits instrumental corrections to be made without needlessly obtaining poor quality data, and it allows alternate characteristics of the electrochemistry to be recognized and pursued while the experiment is still active.
¹Present address: The Procter & Gamble Company, Winton Hill Technical Center, 6071 Center Hill Rd., Cincinnati, Ohio 45224.
0003-2700/79/0351-0825$01.00/0
EXPERIMENTAL
Electrochemical. Stationary electrode polarograms of seven different metal ions were obtained under a variety of conditions. The metal ions were chosen because they represent the range of conditions where the classification procedure is effective. One-, two-, and three-electron transfers are represented, as well as differing reversibilities, differing values of the transfer coefficient, α, and species whose reduced form is soluble in solution or in mercury. All solutions were prepared with the use of reagent grade chemicals and distilled, de-ionized water. Each was thoroughly deoxygenated prior to analysis by purging with purified, solvent-saturated nitrogen (5) for 20 min. The SEP data were obtained on the computer-controlled instrument described earlier (2, 6, 7). The working electrode was a digitally controlled DME which could supply mercury drops with computer-selected lifetimes from 1 to 10 s. Using this system, voltammograms could be ensemble-averaged by performing potential scans at the end of precisely reproducible drop lifetimes. The potentiostat was modified to include the use of negative resistance IR compensation (8, 9) so that the effect of the uncompensated resistance would be minimized. Data curves of 250 points were obtained for each electroactive species at scan rates of 1.00, 2.00, and 4.00 V/s and a data resolution of 2.0 mV/point. Voltammograms which contained obvious distortions such as excessive noise, discontinuities, or ADC overflows were discarded.
Lab Computer. The instrumentation computer used was a Hewlett-Packard 2115A with 8K words of core memory. Peripherals include a paper tape reader and punch, a Tektronix 601 storage display monitor, and a Teletype. Data acquisition and hardware control subroutines were written in HP assembly language and called by main programs written in BASIC.
Pattern Recognition Computer. The pattern recognition processor is a Hewlett-Packard 2100S minicomputer with 32K words of core memory.
Peripherals include a 5-Mbyte moving head disc drive (HP-7900), paper tape reader and punch, a Tektronix 603 storage display monitor and a 4012 graphics display terminal, a Centronics 306 serial printer, a Calcomp 565 digital plotter, and a Teletype. All pattern recognition programs were written in FORTRAN IV and operated under the HP DOS-M executive.
Computer-Computer Interface. Data are transmitted from the laboratory computer to the pattern recognition computer by a bidirectional, high-speed, 16-bit parallel interface. The interface consists of two Hewlett-Packard 12930 universal interface cards and cable. Data are transmitted at rates between 10 and 20 kHz under program control.
The Training Set. The training set used for the peak multiplicity classification consists of the PRED1 experimental data set of Reference 2 and a set of synthetic singlets. The PRED1 data set consists of 77 singlets from seven different metal ions. Voltammograms were obtained at different scan rates, starting potentials, and voltage resolutions. All curves were linearly interpolated to a data point resolution of 2 mV/pt. PRED1 also contains 120 doublet voltammograms formed by linear combination of real singlet voltammograms. This method of doublet construction assumes no interactions between the singlet species. The doublets have overlaps less than 10 mV and peak height ratios between 1:1 and 1:10. The synthetic data in the training set were generated using the numerical solution (3) of Nicholson. Synthetic curves were obtained for 11 ψ values, 7 α values, and 3 n ranges as given in Table I. ψ is a measure of the reversibility of an electrode process (4). It is proportional to the heterogeneous rate constant for the electron transfer, k_s, and inversely proportional to the square root of scan rate. For each combination of ψ, α, and n range, ten curves were generated with the use of a random number generator to vary n within the range. A total of 2310 synthetic curves were generated.
© 1979 American Chemical Society
Figure 1. Real singlet systems. (1) Cd²⁺, (2) Pb²⁺, (3) Tl⁺, (4) 4 × 10⁻⁴ M Fe³⁺. Curves are uncorrected for background. The line at the bottom of each curve is a reference line, not base-line current. (Axes: current vs. voltage.)
Classification Procedure. For this analysis, each voltammogram received identical numerical treatment so that only the shape of the voltammogram would influence the classification, not the magnitude or peak potential. The synthetic data were processed by first scaling each voltammogram so that the background current is 0 and the peak current is 1.0, then selecting 96 points before and including the peak and 32 points after the peak, pseudo-rotating (10) this selected portion of the voltammogram, and finally taking the Fourier transform of the rotated curve with SUBROUTINE FORT (11). The Fourier coefficients corresponding to each voltammogram were placed in a disc file and then autoscaled (12). The average value and standard deviation for each feature were saved for later use. This procedure of feature extraction eliminates peak magnitude and peak position information. Real data voltammograms were processed by an identical procedure except that (1) the curves were background corrected by blank subtraction, and (2) after the voltammogram was scaled to 1.0, it was smoothed by application of a Fourier transform smooth (13). The real data, both PRED1 and the data reported here, were autoscaled with the statistics (average and standard deviation for each feature) from the synthetic singlet data.
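The feature-extraction steps described above can be sketched in Python with NumPy. This is an illustrative sketch, not the authors' FORTRAN IV code: `np.fft.rfft` stands in for SUBROUTINE FORT, the endpoint-line subtraction is only an assumed stand-in for the pseudo-rotation of ref 10, the first scan point is assumed to approximate the background, the use of coefficient magnitudes as features is an assumption, and the Gaussian demo curves and all function names are hypothetical.

```python
import numpy as np


def preprocess(current, n_before=96, n_after=32):
    """Reduce one voltammogram to shape-only Fourier magnitudes (a sketch)."""
    y = np.asarray(current, dtype=float)
    # Scale so the background is 0 and the peak is 1.0.
    # Assumption: the first point of the scan approximates the background.
    y = (y - y[0]) / (y.max() - y[0])
    peak = int(np.argmax(y))
    start, stop = peak - n_before + 1, peak + n_after + 1
    if start < 0 or stop > y.size:
        raise ValueError("peak is too close to the edge of the scan")
    # 96 points up to and including the peak, then 32 points after it.
    window = y[start:stop]
    # Assumed stand-in for pseudo-rotation (10): subtract the straight line
    # joining the first and last points so both ends of the window are zero.
    rotated = window - np.linspace(window[0], window[-1], window.size)
    # Assumption: magnitudes of the real-input FFT serve as the features.
    return np.abs(np.fft.rfft(rotated))


def autoscale_stats(features):
    """Per-feature mean and standard deviation of a reference set."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant features
    return mean, std


def autoscale(features, mean, std):
    """Autoscale with statistics saved from the reference (synthetic) set."""
    return (features - mean) / std


# Hypothetical demo curves: Gaussian peaks of varying width, 250 points each.
x = np.arange(250)
synthetic = np.array([preprocess(np.exp(-0.5 * ((x - 150) / w) ** 2))
                      for w in (15.0, 20.0, 25.0)])
mean, std = autoscale_stats(synthetic)
scaled = autoscale(synthetic, mean, std)
```

Because each curve is scaled and windowed relative to its own peak, multiplying a voltammogram by a constant, adding an offset, or shifting the peak along the potential axis leaves the features essentially unchanged, which is the property the text attributes to this feature extraction.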
Real data classified by this method were obtained under digital control with the laboratory computer system. The data were displayed on the storage display monitor and, if acceptable, were transferred via the computer-computer interface to the pattern recognition computer, which represents the voltammogram in the Fourier domain. The classification algorithm is the k-Nearest Neighbor (14) method, which uses the three most significant FFT features determined previously for singlet/doublet identification (2). For these three features, the Euclidean distance is calculated between the pattern in question and each of the 2507 patterns in the training set. The voltammogram is classified according to the type of its nearest neighbor.
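A minimal Python sketch of the nearest-neighbor step follows. The toy three-feature training set and the function name are hypothetical; only the Euclidean distance, the single nearest neighbor, and the training-set counts (2310 synthetic singlets, 77 PRED1 singlets, 120 PRED1 doublets) come from the text.

```python
import numpy as np

# Training-set size from the text:
# psi values x alpha values x n ranges x curves per combination.
N_SYNTHETIC = 11 * 7 * 3 * 10
# Synthetic singlets + PRED1 singlets + PRED1 doublets.
N_TRAINING = N_SYNTHETIC + 77 + 120


def nearest_neighbor_class(pattern, train_features, train_labels):
    """Return the label of the training pattern closest in Euclidean distance."""
    distances = np.sqrt(((train_features - pattern) ** 2).sum(axis=1))
    return train_labels[int(np.argmin(distances))]


# Hypothetical toy training set in a three-feature space.
train_features = np.array([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.0],
                           [1.0, 1.0, 1.0],
                           [1.2, 0.9, 1.1]])
train_labels = np.array(["singlet", "singlet", "doublet", "doublet"])

unknown = np.array([0.9, 1.0, 1.0])
label = nearest_neighbor_class(unknown, train_features, train_labels)
```

With a single nearest neighbor, the result depends entirely on which training patterns happen to lie close to the unknown, so the density of each class in that region of feature space matters as much as the overall class separation.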
RESULTS AND DISCUSSION
An examination of Figures 1 and 2 reveals the severity of the singlet/doublet identification problem. For the doublets, no double peaks or shoulders are present to indicate the presence of two components. To solve this classification problem, the method of Thomas, DePalma, and Perone (2) was chosen. To implement this method on-line, a computer network was required because of the vast number of calculations and the mass storage requirements, which could not be met with a dedicated laboratory computer. These conditions of storage and speed of calculation were best met by utilizing a disc-based operating system with the classification algorithm written in FORTRAN IV. To use the resources of the operating system effectively, it is best to control the experiment and acquire the voltammogram with a smaller, dedicated minicomputer and then transfer the data to the operating system. The computer-computer interface allows this rapid data transfer. The turn-around time required for kNN classification is typically 12 s. The rapid results provided by the on-line classification procedure allow the additional benefit of immediate data interpretation. Instrumental abnormalities can quickly be detected and corrected, and unanticipated chemical behavior can be recognized and investigated.
Figure 2. Real doublet systems, Pb²⁺/Tl⁺ in 0.1 M citric acid, pH 5.5. (1) 1.6 × 10⁻⁴ M Tl⁺/4 × 10^(?) M Pb²⁺, (2) 1.2 × 10^(?) M Tl⁺/1 × 10^(?) M Pb²⁺, (3) 1 × 10^(?) M Tl⁺/5 × 10^(?) M Pb²⁺, (4) 1 × 10^(?) M Tl⁺/8 × 10^(?) M Pb²⁺. Curves are uncorrected for background. The line at the bottom of each curve is a reference line, not base-line current. (Axes: current vs. voltage.)
Figure 3. Feature plot of real singlet data from the PRED1 data set. (1) Pb(II), (2) Cd(II), (3) Cu(II), (4) Tl(I), (5) Sb(III), (6) Ni(II)-NO₃⁻, (7) Ni(II)-SCN⁻, (8) Co(II)
The classification procedure used here was known to be more effective with real data as the training set than with synthetic data (2). A feature plot of two of the three features
Figure 4. Feature plot of real singlets and real doublets from the PRED1 data set. S = singlet, D = doublet
Figure 5. Feature plot of 200 randomly selected synthetic singlets
Figure 6. Feature plot of synthetic singlet data and real doublet data taken on-line. S = synthetic singlet; (1) 1:1 peak height ratio, Tl⁺/Pb²⁺ doublet, pH 4.8; (3) 3:1 peak height ratio, Tl⁺/Pb²⁺ doublet, pH 4.8; (5) 5:1 peak height ratio, Tl⁺/Pb²⁺ doublet, pH 4.8
Table I. Synthesis Parameters for Synthetic Data
ψ values: 20.0, 10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01
α values: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
n ranges: 0.9-1.1, 1.8-2.2, 2.7-3.3
for classification reveals one of the potential problems of the exclusive use of real data in the training set. In Figure 3, the real singlets cluster according to the metal ion. The singlet data lie on a curved surface as shown in Figure 4, with the doublets lying to the right of this surface. This explains the ability of the earlier procedure (2) to be effective with real data. Any specific metal ion voltammogram is simply being placed in its appropriate cluster, and the incorrectly classified data are probably at the edge of the cluster. If this procedure is used to classify a singlet voltammogram for a species which is not already represented in the training set, it may fail because the pattern does not fall into an existing cluster. To increase the usefulness of this method, a training set which is a composite of real and synthetic data was chosen. The synthetic data were prepared so that an even distribution of properties was obtained as listed in Table I. Figure 5 shows how this synthetic data set defines more completely the singlet surface. (Figures 5, 6, and 7 represent the same distribution
Figure 7. Feature plot of synthetic singlet data and real doublet data taken on-line. S = synthetic singlet; (1) 1:1 peak height ratio, Tl⁺/Pb²⁺ doublet, pH 4.8; (3) 1:3 peak height ratio, Tl⁺/Pb²⁺ doublet, pH 4.8
Table II. On-Line Singlet Classification
species               electrolyte                        voltammograms   accuracy, %
3 × 10⁻⁴ M Cd²⁺       0.1 M Na₂SO₄                       10              100
4 × 10^(?) M Tl⁺      0.1 M KNO₃                         9               100
3 × 10⁻⁴ M Pb²⁺       0.1 M KNO₃                         9               100
3 × 10⁻⁴ M Co²⁺       0.1 M KCl                          9               89
2 × 10⁻⁴ M In³⁺       0.1 M KCl                          10              100
2 × 10^(?) M Eu³⁺     0.1 M KCl                          9               89
2 × 10^(?) M Fe³⁺     1.0 M K₂C₂O₄ + 0.05 M H₂C₂O₄       10              80
4 × 10⁻⁴ M Fe³⁺       1.0 M K₂C₂O₄ + 0.05 M H₂C₂O₄       9               100
as Figures 3 and 4, except that the scale of the axes is changed.)
Real Singlet Data. The voltammograms of seven metal ions were obtained and classified by the procedure described in the Experimental section. The results are given in Table II. The first four species have voltammograms already represented in the PRED1 section of the training set, and
Table III. On-Line Doublet Classification
Conditions: (A) pH 4.8, 2-mV separation at 1.0 V/s, 10-mV separation at 4.0 V/s; (B) pH 5.5, 10-mV separation at 1.0 V/s, 20-mV separation at 4.0 V/s
doublet                                                     (A)          (B)
1.6 × 10⁻⁴ M Tl⁺, 4 × 10^(?) M Pb²⁺, 1:1 peak height        100% (9)ᵃ    100% (9)
1.2 × 10^(?) M Tl⁺, 1 × 10^(?) M Pb²⁺, 3:1 peak height      78% (9)      89% (9)
1 × 10⁻⁴ M Tl⁺, 5 × 10^(?) M Pb²⁺, 5:1 peak height          0% (18)      33% (18)
1 × 10⁻⁴ M Tl⁺, 8 × 10^(?) M Pb²⁺, 1:3 peak height          0% (9)       0% (9)
4 × 10⁻⁴ M Tl⁺, 2 × 10^(?) M Pb²⁺, 5:1 peak height          33% (9)      33% (9)
ᵃ Number in parentheses is number of voltammograms taken under those conditions. Electrolyte: 0.1 M citric acid.
these are classified very well. For each species, the number of voltammograms reflects the use of the three different scan rates as well as a differing number of ensembles for each curve. The number of ensembles was varied between 1 and 5 to demonstrate the effect of the signal-to-noise (S/N) ratio of the voltammogram on its classification. At these metal ion concentrations, the S/N ratio has little effect. The next three species do not have voltammograms represented in the PRED1 section of the training set, and these voltammograms demonstrate the classification procedure's ability to correctly classify previously unseen data. The 2 × 10^(?) M Fe³⁺ data were classified the poorest, and this is attributed to the low S/N ratio. The Fe³⁺ data were retaken at 4 × 10⁻⁴ M and were classified completely correctly (Table II).
Real Doublet Data. Lead(II) and thallium(I) in 0.1 M citric acid were chosen to evaluate the classification procedure's performance with real doublet data. Voltammograms for individual Pb²⁺ and Tl⁺ ions were obtained as a function of pH to document the peak potential separation. Also, at equal metal ion concentrations at the pH of interest, the background-corrected peak current due to Pb²⁺ is four times the current due to Tl⁺. Since the peak height ratios are of interest here, the concentrations were adjusted accordingly. In addition to pH, scan rate also affects peak position, while concentration was found to have little effect. To demonstrate classification accuracy, two pH values for the doublets were selected. At pH 5.5, the peak potential separations are less severe than at pH 4.8. The peak height ratio of the two components was also varied. Under these conditions, thallium is the preceding component in the reduction.
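The concentration adjustment can be illustrated with a short calculation. Assuming that each peak current is proportional to concentration (an assumption made here for illustration), and using the fourfold larger Pb²⁺ response stated above, a Tl⁺/Pb²⁺ peak height ratio of a:b requires a Tl⁺/Pb²⁺ concentration ratio of 4a/b. The function name is hypothetical.

```python
# Pb2+ peak current relative to Tl+ at equal concentration (from the text).
PB_TO_TL_SENSITIVITY = 4.0


def tl_to_pb_concentration_ratio(tl_height, pb_height,
                                 sensitivity=PB_TO_TL_SENSITIVITY):
    """Tl+/Pb2+ concentration ratio giving the requested peak height ratio.

    Assumes peak current is proportional to concentration for both ions.
    """
    return sensitivity * tl_height / pb_height


# Peak height ratios (Tl+ : Pb2+) examined in this work.
targets = {"1:1": (1, 1), "3:1": (3, 1), "5:1": (5, 1), "1:3": (1, 3)}
ratios = {name: tl_to_pb_concentration_ratio(tl, pb)
          for name, (tl, pb) in targets.items()}
```

Under this assumption, a doublet with equal peak heights needs four times as much Tl⁺ as Pb²⁺, and a 5:1 Tl⁺/Pb²⁺ doublet needs twenty times as much.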
For equal contributions of Pb²⁺ and Tl⁺ to the doublet voltammogram, the classification accuracy is excellent at both peak potential separations as shown in row 1 of Table III. As the contribution of lead to the doublet is decreased, the classification accuracy falls until, at a 5:1 Tl⁺/Pb²⁺ peak height ratio, the results are not usable. For doublets where the lead is the predominant component, the accuracy is also poor. The 5:1 Tl⁺/Pb²⁺ doublets were retaken at a higher concentration of electroactive species (row 5 of Table III) to demonstrate that the ratio of electroactive species, not the magnitude, influences the classification performance.
An examination of the feature plots for these data is helpful for understanding classification performance. In Figure 6, the voltammograms with equal component contributions are clearly doublets since they lie the farthest from the singlet distribution. As the contribution of the lead is decreased, the doublet moves towards the singlet surface. The 5:1 Tl⁺/Pb²⁺ doublets are not on the singlet surface, but are misclassified because of the low density of doublets in the training set in that region of the pattern space rather than some inherent lack of doublet behavior. The same trend is seen for the 1:3 Tl⁺/Pb²⁺ doublets. These voltammograms are clearly doublets from the feature plot in Figure 7, but are misclassified by the kNN algorithm because of a lack of doublets near these patterns in the training set. This can be seen by comparing Figures 4 and 7. It is important to note, however, that correct classification can be made of all doublets in this study by visual inspection of the feature space plots.
CONCLUSION
An on-line method of classifying stationary electrode polarograms with respect to peak multiplicity has been implemented and evaluated. This classification procedure has been shown to classify real singlets with n = 1, 2, or 3, and where the reduced form of the electroactive species forms an amalgam or is soluble in the electrolyte. This procedure has also been tested for real doublets with peak potential overlaps less than 10 mV. Patterns where the procedure failed were examined with feature plots and could then be correctly classified. Failures of the kNN procedure with doublet data are attributed to training set limitations.
These limitations are difficult to overcome simply by the addition of more doublet data into the training set, as doublet space is too diffuse. However, because visual inspection of feature plots of the data can lead to correct classification, an improved classification procedure might involve some type of distance measure. The singlet data lie on a fairly tight surface in feature space, and a measure of the distance to this surface could be used to classify an unknown voltammogram. This classification procedure might be advantageous for this type of problem, but is much more difficult computationally.
The 12-s response of the on-line procedure reported here has several advantages. Instrumental problems can be quickly recognized and corrected, which prevents large amounts of inferior data from being collected. Also, unanticipated chemical behavior can be examined and alternate experiments performed while the solution is still in the electrochemical cell.
LITERATURE CITED
(1) Q. V. Thomas and S. P. Perone, Anal. Chem., 49, 1369 (1977).
(2) Q. V. Thomas, R. A. DePalma, and S. P. Perone, Anal. Chem., 49, 1376 (1977).
(3) R. S. Nicholson and I. Shain, Anal. Chem., 36, 706 (1964).
(4) R. S. Nicholson, Anal. Chem., 37, 1351 (1965).
(5) L. Meites, Anal. Chim. Acta, 18, 364 (1958).
(6) Q. V. Thomas, Ph.D. Thesis, Purdue University, West Lafayette, Ind., 1976.
(7) Q. V. Thomas, L. Kryger, and S. P. Perone, Anal. Chem., 48, 761 (1976).
(8) C. Lamy and C. C. Herrmann, J. Electroanal. Chem., 59, 113 (1975).
(9) C. Gabrielli, M. Ksouri, and R. Wiart, Electrochim. Acta, 22, 255 (1977).
(10) J. W. Hayes, D. E. Glover, D. E. Smith, and M. W. Overton, Anal. Chem., 45, 277 (1973).
(11) SUBROUTINE FORT from the Purdue University Computing Center, West Lafayette, Ind.
(12) B. R. Kowalski and C. F. Bender, J. Am. Chem. Soc., 94, 5632 (1972).
(13) G. Horlick, Anal. Chem., 44, 943 (1972).
(14) T. M. Cover and P. E. Hart, IEEE Trans. Inform. Theory, IT-13, 21 (1967).
RECEIVED for review September 14, 1978. Accepted February 22, 1979.