Anal. Chem. 2006, 78, 8430-8437
Extraction of Parameters and Their Error Distributions from Cyclic Voltammograms Using Bootstrap Resampling Enhanced by Solution Maps: Computational Study Lesław K. Bieniasz†
Institute of Physical Chemistry of the Polish Academy of Sciences, Department of Electrochemical Oxidation of Gaseous Fuels, ul. Zagrody 13, 30-318 Cracow, Poland Herschel Rabitz*
Department of Chemistry, Princeton University, Princeton, New Jersey 08544
In a recent work,1 the solution mapping technique based on high-dimensional model representation (HDMR)2-4 was proposed as a means for rapidly calculating cyclic voltammetric (CV) transients.5 The HDMR map-based computation of cyclic voltam-
mograms can be orders of magnitude faster than typical digital simulation methods.6 By using HDMR maps, such tasks as the visualization and visual exploration of the effects of the model parameters on the voltammograms or the nonlinear regression of CV data can be performed almost instantaneously on modern personal computers. Therefore, rapid, real-time, on-line theoretical data analysis has been suggested1 as one potential application of the HDMR maps in CV. The high speed of the HDMR map-based computations can also be useful in a traditional off-line data analysis mode, when the time available for the analysis is hours rather than seconds. There exists a number of robust but computationally very expensive methods of data analysis, which until now were not feasible in CV. Examples of these methods include the following: resampling techniques such as bootstrap or jackknife,7 pointwise ANOVA and automatic classification,8 or probing of model parameter spaces by global search statistics.9 The application of high-speed solution maps has a potential to change this situation, allowing electrochemists to employ such methods with reasonable amounts of computational effort. This concept is examined in the present study, with the focus on determining statistical errors in parameter estimation from cyclic voltammograms. Parameter estimation from cyclic voltammograms is a wide subject, with a variety of methods, theoretical and experimental situations, and unsolved problems. For the purpose of the present work, we concentrate on the currently often employed10 method of nonlinear regression,11 which involves a minimization of a selected objective function expressing a difference between the model predictions and the experimental data. Typically, the
* Corresponding author. E-mail:
[email protected]. † Tel./fax.: (+48 12) 266 03 41. E-mail:
[email protected], http:// www.cyf-kr.edu.pl/∼nbbienia. (1) Bieniasz, L. K.; Rabitz, H. Anal. Chem. 2006, 78, 1807. (2) Li, G.; Rosenthal, C.; Rabitz, H. J. Phys. Chem. A 2001, 33, 7765. (3) Rabitz, H.; Alis¸ , O ¨ . F.; Shorter, J.; Shim, K. Comput. Phys. Commun. 1999, 117, 11. (4) Rabitz, H.; Alis¸ , O ¨ . F. J. Math. Chem. 1999, 25, 197. (5) See, e.g.: Speiser B. In Encyclopedia of Electrochemistry; Bard A. J., Stratmann M., Unwin P. R., Eds.; Wiley: New York, 2003; Vol. 3, Chapter 2.1, p 81.
(6) Britz, D. Digital Simulation in Electrochemistry, 3rd ed.; Springer: Berlin, 2005. (7) Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans; SIAM: Philadelphia, 1982. (8) Meites, L. CRC Crit. Rev. Anal. Chem. 1979, 8, 1. (9) Shenvi, N.; Geremia, J. M.; Rabitz, H. J. Phys. Chem. A 2002, 106, 12315. (10) Bieniasz, L. K.; Speiser, B. J. Electroanal. Chem. 1998, 458, 209, and references cited therein. (11) See, e.g.: Draper, N. R.; Smith, H. Applied Regression Analysis; Wiley: New York, 1981.
The conventional determination of model parameter errors in least-squares regression of experimental cyclic voltammetric data assumes validity of local approximations (e.g., linearization) in the parameter space and normal distributions of the data and parameter errors. Such assumptions may not always be satisfied in practice. Bootstrap resampling techniques present a more universally applicable approach to error estimation, which until now has not been used in cyclic voltammetric studies, owing to the high costs of the required voltammogram simulations. We demonstrate that the burden of computing voltammograms can be significantly reduced by the use of high-dimensional model representation (HDMR) solution mapping techniques, thereby making it feasible to apply the bootstrap data analysis in cyclic voltammetry. We perform computational experiments with bootstrap resampling, enhanced by HDMR maps, for a typical cyclic voltammetric model (i.e., the Eqrev Cirr Eqrev reaction mechanism at a planar macroelectrode under semi-infinite, pure diffusion transport conditions). The experiments reveal that the bootstrap distributions of the estimated parameters provide a satisfactory quantification of the parameter errors and can also be used for detecting statistical correlations of the parameters.
8430 Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
10.1021/ac061167z CCC: $33.50
© 2006 American Chemical Society Published on Web 11/09/2006
objective function is a sum of squared deviations between theoretical voltammograms and experimental transients available as a set of discrete pairs of current-time (or current-potential) values. Other choices, like the sum of absolute deviations,12 have also seen limited application in electrochemistry.13 Either singleresponse regression (for a single voltammogram) or simultaneous multiresponse regression (for an ensemble of voltammograms corresponding to different experimental conditions) can be performed in this way. The minimization is usually accomplished either by gradient-based methods, such as the Gauss-Newton14 or Levenberg-Marquardt15 algorithms, or by gradient-free methods such as the simplex algorithm.15-17 These methods are fast, but they have restricted, local convergence properties. Alternatively, global search methods employing genetic algorithms can be used18 at a higher computational cost. We emphasize that inherent in the above methodology is the assumption that the sole cause of experimental data errors is the random noise contaminating measured electric current values (cf. numerous discussions cited in Bieniasz and Speiser,10 or the very recent analysis19). Other possible situations, such as the estimation of parameters from an ensemble of noiseless voltammograms subject to other sorts of random errors may require a specifically dedicated, and not yet available methodology, and will not be discussed here. Finding the best-fit parameters by nonlinear regression is often feasible, except for extreme cases such as strongly overlapping peaks, which require special treatments. The situation is less satisfactory with regard to the evaluation of the statistical errors or confidence intervals for the extracted parameter values. The conventional approach to error estimation is based on a number of simplifying assumptions. First, it is believed that both the data and parameter errors are subject to normal distributions. Second, a local linearization of the relation between the data and parameter errors is introduced to determine the variance-covariance matrix of the parameters in terms of the local parametric sensitivity coefficients of the voltammograms evaluated at the best-fit parameter vector.10 The sensitivity coefficients are obtained either as a byproduct of the minimization algorithm (if gradient techniques are used) or can be simulated separately.10 For the simplex method, special methods of error estimation exist,20,21 which involve similar simplifying assumptions. Hence, in the conventional approach, the confidence domains are ellipsoids in the parameter space, determined by the variance-covariance matrix. Although the above approach often works well in practice, it would be useful to have a more universally applicable method of determining the statistical errors of the estimated parameters. The data errors need not necessarily have a normal distribution. Also, complex situations typical for chemical sciences may cause that, owing to the (12) See, e.g.: Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes, The Art of Scientific Computing, 2nd ed.; Cambridge University Press: New York, 1992. (13) Alden, J. A.; Compton, R. G. Electroanalysis 1998, 10, 207. (14) See, e.g.: http://www.digielch.de. (15) See, e.g.: Lavagnini, I.; Pastore, P.; Magno, F. Anal. Chim. Acta 1989, 223, 193. (16) See, e.g.: Gosser, D. K. Jr. Cyclic Voltammetry, Simulation and Analysis of Reaction Mechanisms; VCH: New York, 1993. (17) Bieniasz, L. K. Comput. Chem. 1997, 21, 1. (18) See, e.g.: Kasprzyk, G. P.; Jaskuła, M. J. Electroanal. Chem. 2004, 567, 39. (19) Tourwe´, E.; Pintelon, R.; Hubin, A. J. Electroanal. Chem. 2006, 594, 50. (20) O’Dea, J.; Osteryoung, J.; Lane, T. J. Phys. Chem. 1986, 90, 2761. (21) Phillips, G. R.; Eyring, E. M. Anal. Chem. 1988, 60, 738.
nonlinear dependencies of model responses on the parameters, the parameter error distribution will deviate from the normal form,22 even if the data errors are normal. The most general way of obtaining reliable confidence intervals would be to determine the actual distributions of the parameter errors. Ideally, a large population of experimental voltammograms obtained in a series of replicate experiments could be used for this purpose. Unfortunately, this is not practical. Therefore, other methods of statistical inference have to be sought. The bootstrap resampling techniques7,23-25 offer one such possibility. However, bootstrap is a computationally expensive data analysis method. Therefore, in the present work we examine the utility of HDMR maps for the practical execution of the bootstrap analysis for estimating the errors of parameters extracted from CV data. The possibility of using bootstrap in connection with the parameter estimation from transient electrochemical responses was suggested several years ago by Bieniasz and Speiser.10 However, the only electrochemical application of the bootstrap methods, that we are aware of, is that of Pais et al.,26 who extracted parameter estimates from polarographic responses. Under such circumstances, the usefulness of the bootstrap for data analysis in CV deserves examination on its own. For these reasons, we follow the methodology of computer-aided modeling27 and computational electrochemistry28 and choose to examine the bootstrap method by means of computational experiments with synthetic experimental CV data. The synthetic data are obtained by adding controlled pseudorandom noise to simulated voltammograms corresponding to known parameter values. This allows us to compare all parameter estimates and their bootstrap distributions with the true values and true distributions, which is not possible while using real experimental data. We select single-response regression for the discussion and tests, because it presents a bigger challenge for bootstrap than the more recent and powerful multiresponse regression, owing to the fact that the information content of a single voltammogram is smaller than that of a whole ensemble. Single-response regression is also simpler algorithmically. Consequently, the tests are tough but relatively easy to perform and interpret. Besides, singleresponse regression remains of continued interest for experimental practice (see, for example, numerous examples cited in Bieniasz and Speiser,10 or more recent illustrative examples19,29,30). Our choice does not mean that bootstrap cannot be applied to multiresponse regression; the relevant extension of the algorithm is straightforward and will be indicated. (22) See, e.g.: Meinrath, G. Chemom. Intell. Lab. Syst. 2000, 51, 175. (23) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall: New York, 1993. (24) Davison, A. C.; Hinkley, D. V. Bootstrap Methods and their Applications; Cambridge University Press: Cambridge, 1999. (25) Chernick, M. R. Bootstrap Methods, A Practitioner’s Guide; Wiley: New York, 1999. (26) Pais, A. A. C. C.; Pereira, J. L. G. C.; Redinha, J. S. Comput. Chem. 2000, 24, 533. (27) Mocak, J. In Encyclopedia of Analytical Science; Worsfold P. J., Townshend A., Poole C. F., Eds.; Elsevier: Oxford, 2005; p 208. (28) Bieniasz, L. K. In Modern Aspects of Electrochemistry; Conway B. E., White R. E., Eds.; Kluwer/Plenum: New York, 2002; Vol. 35, p 135, and references cited therein. (29) Sanecki, P.; Amatore, C.; Skitał, P. J. Electroanal. Chem. 2003, 546, 109. (30) Rees, N. V.; Klymenko, O. V.; Coles, B. A.; Compton, R. G. J. Electroanal. Chem. 2003, 557, 99.
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
8431
BOOTSTRAP RESAMPLING IN NONLINEAR REGRESSION The bootstrap procedures relevant to regression problems have been amply described in a number of books.7,23-25 Instructive examples of chemical applications have also been published.22,31,32 Below we summarize the basic concepts utilized in the present work. Let f (t, p) denote the model response as a (generally nonlinear) function of the independent variable t and parameter vector p. In CV, f (t, p) would represent the electric current as a function of time t. An experimental response can be regarded as a sequence of pairs (ti, fi) for i ) 1, ..., M, such that
fi ) f (ti, p) + i
(1)
where fi is a discrete measured response value determined at ti and i is random measurement error subject to an unknown distribution. In the absence of detailed knowledge about the physical mechanism driving the errors, it can be assumed (as often done in electrochemistry) that all i values, i ) 1, ..., M, are statistically independent and have a common distribution with a zero mean and the same standard deviation, independently of ti. This assumption meets the requirements of the bootstrap techniques described below. The least-squares regression of the data fi, i ) 1, ..., M, leads to a parameter vector estimate pest and a corresponding best-fit model response f (t, pest). The differences
ri ) fi - f (ti, pest)
(2)
are the regression residuals. The bootstrap resampling technique occurs in two variants.23-25 Variant 1 consists in resampling the observations fi. Given a single experimental data set (ti, fi) for i ) 1, ..., M, one produces a number B of replicates of this set. Each replicate contains elements randomly selected (with replacement) from the original set of pairs (ti, fi), assuming an identical probability for selecting every pair in a single try. The “replacement” means that, after choosing a pair from the original data set, it is returned back and can be chosen again in a next try. This implies that some of the original pairs may occur more than once within a replicate. Some other pairs may not occur at all. If a given pair is chosen n times for a replicate, then it is counted as n elements. The total number of elements in a replicate is kept equal to M, as it was in the original experimental data set. Having obtained B replicates of the experimental curve, one performs the least-squares regression for each of the replicates. In this way, a population of B best-fit parameter vector estimates is obtained. If B is sufficiently large, the population can be used to construct an empirical distribution of the parameter estimates, which is subject to further analysis (see below). Variant 2 consists in resampling the residuals ri. Given the best-fit parameter vector for the original experimental curve, one calculates the residuals according to eq 2. The distribution of the residuals approximates the distribution of the errors i. To make this resemblance even more perfect, the residuals can be corrected24 for any remaining nonzero mean value (31) Caceci, M. S. Anal. Chem. 1989, 61, 2324. (32) Sohn, R. A.; Menke, W. Geochem. Geophys. Geosyst. 2002, 3.
8432
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
by the transformation M
r i r ri - (
∑r )/M j
(3)
j)1
Still further corrections of the residuals can be applied24 at the expense of greater complexity of the algorithm. A number B of bootstrap replicates of the set (ti, ri) is created analogous to replicating the pairs (ti, fi) in variant 1. Each replicate set of the residuals is added to the best-fit response f(t, pest) to obtain B replicates of the original experimental data curve. The regression is performed for each replicate, and the population of the best-fit parameter estimates obtained in this way is used to construct an empirical distribution, as in variant 1. Readers are referred to refs 22-25, 31, and 32 and further references cited therein, for discussions of the two bootstrap variants, the connections between them, and relative merits of using them. The interpretation of the bootstrap distributions is based on the rule known as the bootstrap principle,25 which for the regression problems states that for sufficiently large M the distribution of M1/2(prep - pest) approximates the distribution of M1/2(pest - p), where prep denotes the best-fit parameter vector for a bootstrap replicate. Freedman33 applied this rule to linear regression. Nonlinear regression is assumed to obey this asymptotic property as well. The rule implies that the bootstrap distribution of prep should be concentrated around the best-fit parameter values pest, whereas the distribution of pest (that we shall call the true distribution) should be concentrated around the true parameters p. Such a shift of the bootstrap distributions of prep, relative to the true distribution of pest, is expected from the construction of the bootstrap replicates, especially in variant 2, where the replicates are obtained by adding noise similar to the true data errors to the best-fit model curves (rather than to the model curves corresponding to true parameter values). According to the bootstrap principle, the bootstrap distributions can be used for evaluating the statistical errors of the estimated parameters. For this purpose, the conventional variance-covariance estimators applied to the population of bootstrap replicates of estimated parameters present an obvious option, which however is not necessarily adequate, especially if the deviations of the parameter distribution from normality are large (for example, if the distribution is not symmetric). A better approach adopted here is the percentile method that relies on sorting and counting of the bootstrap replicates for the parameters. The method yields confidence intervals for each parameter separately.23-25,32 Given a percentile value β of the distribution of a particular parameter p, we find among the observed parameter estimates the separating values plow(β) and phigh(β), such that a fraction β of (the totality of) the sorted estimates is less than or equal plow(β), and a fraction β of the sorted estimates is greater than or equal phigh(β). The interval (plow(β), phigh(β)) approximates the 1-2β confidence interval. More sophisticated methods have also been developed and can be employed.22-25,33,34 Bootstrap does not require the normality of the distribution of either the data errors i or the parameter estimates, which (33) Freedman, D. A. Ann. Stat. 1981, 9, 1218. (34) Perriott, L. L. Bootstrap in Nonlinear Regression. Ph.D. Thesis, Queen’s University, Kingston, Canada, 1992.
Table 1. Typical Characteristics of the Histograms of the True and Bootstrap Single-Parameter Distributions Obtained by Computational Experiments. characteristic true parameters best-fit parametersb mean standard deviation plow(0.99) phigh(0.99)
distributiona
log k01
R1
E01
log k
log k02
R2
E02
0, 1, 2 1, 2 0 1 2 0 1 2 0 1 2 0 1 2
-0.3 -0.293 -0.298 -0.294 -0.294 0.015 0.014 0.015 -0.338 -0.330 -0.331 -0.260 -0.257 -0.258
0.46 0.4633 0.4608 0.4633 0.4633 0.0068 0.0063 0.0067 0.4422 0.4468 0.4462 0.4778 0.4797 0.4799
-7 -7.025 -7.004 -7.025 -7.024 0.026 0.031 0.032 -7.088 -7.102 -7.105 -6.916 -6.948 -6.940
-0.9 -0.8991 -0.8998 -0.8990 -0.8990 0.0035 0.0028 0.0027 -0.9067 -0.9063 -0.9057 -0.8923 -0.8919 -0.8919
-0.4 -0.4000 -0.3990 -0.4001 -0.4000 0.0028 0.0060 0.0058 -0.4146 -0.4160 -0.4145 -0.3837 -0.3844 -0.3850
0.54 0.5309 0.5395 0.5308 0.5308 0.0059 0.0057 0.0055 0.5250 0.5163 0.5164 0.5541 0.5453 0.5446
-15 -15.0094 -14.9997 -15.0093 -15.0092 0.0058 0.0142 0.0138 -15.0365 -15.0465 -15.0449 -14.9639 -14.9732 -14.9723
a Notation: 0 ) true distribution, 1 ) bootstrap variant 1, 2 ) bootstrap variant 2. b The estimates refer to the original synthetic data curve used for bootstrap resampling.
makes it a powerful technique. Furthermore, bootstrap is not restricted to least-squares regression. It can also be used in combination with other parameter estimation methods like, for example, the aforementioned robust regression by minimizing the sum of absolute deviations,12,13 which is most adequate when the distribution of data errors is a two-sided exponential.12 The large cost of bootstrap, as previously indicated, results from the need to perform a large number B of additional regressions, apart from the single regression needed to obtain the parameter estimates from the original experimental response. Such additional regressions are not needed in the conventional approach to parameter error estimation, previously described. The extension of the above algorithm to multiresponse regression requires a separate replication of each individual response in the ensemble, according to the procedure described. Each response (or its replicate) may be characterized by a different standard deviation. Regression should then be performed simultaneously for the ensemble of original responses, and for each corresponding ensemble of their replicates.
on seven dimensionless parameters: two conditional rate constants k01 and k02, two charge-transfer coefficients R1 and R2, two conditional (that is, formal35) potentials E01 and E02, and the homogeneous rate constant k. The E01 and E02 are defined here as differences between the dimensional conditional potentials and the starting potential of the CV scan, normalized by RT/F, where R, T, and F have their usual meaning. The dimensionless electrode potential polarization waveform relevant to CV is
EXAMPLE CV MODEL The performance of the HDMR map-based bootstrap techniques is most comprehensively tested on multiparameter CV models. We therefore chose the most complex example out of those for which HDMR maps have been previously obtained,1 namely, the model of CV for the Eqrev Cirr Eqrev mechanism, denoted as example 5 in Bieniasz and Rabitz:1
COMPUTATIONAL DETAILS All computational details referring to the programs for the simulation of the CV curves and the HDMR-based nonlinear leastsquares regression are similar to what was previously reported;1 however, a faster computer was used. The programs were written in C++ using object-oriented programming and compiled using Borland C++ Builder 6.0. The programs were run under MS Windows XP Professional on an IBM-compatible PC with an Intel Pentium D processor operating at 3 GHz and using 2 GB of operational memory. All calculations were done in 32-bit console mode. The parameter estimations were performed using dimensionless voltammograms. Gaussian noise having a standard deviation of 0.02(max |ISIM| - min |ISIM|) was added to the simulated voltammograms to obtain synthetic experimental transients that mimic the typical level of random errors in CV experiments. In this formula, ISIM denotes the simulated current function I(t), and
A + e- h B BfC C + e- h D
(k01, R1, E01) (k) (k02, R2, E02)
(4) (5) (6)
The reactions are assumed to take place at a planar macroelectrode under semi-infinite, pure diffusive transport conditions. Mechanism 4-6 involves species A, B, C, and D, for which we assume identical diffusion coefficients. Detailed model equations have been provided in dimensionless form.1 The model depends
E(t) ) - S(t, ts)
(7)
where
{
t for t < ts S(t, ts) ) 2 t - t for t g ts s
}
(8)
with ts denoting the time at which the direction of the potential scan is switched.
(35) Parsons, R. Pure Appl. Chem. 1974, 37, 499. (36) Caceci, M. S.; Cacheris, W. P. Byte 1984, 9, 340.
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
8433
algorithm has intermediate efficiency compared to faster gradientbased methods15 and slower genetic algorithms.18
Figure 1. Synthetic data curve for mechanism 4-6, used for computer experimentation with bootstrap methods (dots), together with the corresponding best-fit voltammogram (solid line). Every tenth data point is shown. The true and best-fit parameter values are provided in Table 1.
the maximum and minimum operators are taken over the entire time interval of t ∈ [0, 64], assumed in all the calculations. There is experimental evidence29 confirming the validity of the Gaussian noise assumption as the sole source of data errors in the example considered, in a typical experimental setup. This gives us confidence that the synthetic voltammograms imitate reliably the real data, so that our tests are realistic. The electrode potential switching time ts ) 32 was assumed in eqs 7 and 8. Each synthetic curve consisted of M ) 2001 data pairs, uniformly spread over the entire time interval. As the previously described HDMR map provided a smaller number of M ) 201 current function values,1 temporal interpolation between the map-provided values was necessary for the calculation of the sum of squares. Simple linear interpolation was found satisfactory. Cubic Lagrangian interpolation was also tried, resulting in longer voltammogram computation times, but the final sums of squares obtained did not differ visibly from the case of linear interpolation. The least-squares minimization was accomplished by the simplex algorithm described by Caceci and Cacheris.36 This
RESULTS AND DISCUSSION In a series of preliminary computational experiments, the HDMR map created in our previous work1 was used to determine histograms of the single-parameter and two-parameter distributions by both variants 1 and 2 of the bootstrap technique, using B ) 5000 replicates in each case. This number of replicates was found necessary in order to obtain stable values of the distribution mean, variance, and confidence intervals. In addition, analogous histograms of the true distribution of the parameter estimates were determined, for a population of 5000 different synthetic voltammograms corresponding to the same choice of true parameters. The experiments were concentrated on parameter values that satisfy the following conditions: (a) two weakly overlapping pairs of (forward and reverse) CV peaks are observed; (b) all parameters have a roughly comparable effect on the voltammograms; (c) the probability of the fit resulting in parameter estimates from outside the range covered by the HDMR map available1 is low; (d) the statistical distributions of the parameters exhibit interesting properties, e.g., correlations. As the determination of a single histogram required a mapbased computation of several millions of voltammograms, it would be computationally too expensive to verify the correctness of the histograms by means of analogous calculations employing conventional finite-difference simulations in the place of map-based computations, in the way previously adopted.1 Therefore, more accurate reference histograms have been obtained by using additionally created, highly accurate maps based on single HDMR expansions applicable locally in the immediate neighborhoods of the most probable parameter estimates. Creation of a new, globally applicable, and highly accurate map based on multiple HDMR expansions was not undertaken, because automatic optimal algorithms for map creation still have to be developed.1 The histograms obtained using the map from ref 1, and the local
Table 2. Correlation Coefficients G for Parameter Pairs, Obtained from the Multiparameter Distributionsa pair log
k0
1
- R1
log k01 - E01 log k01 - log k log k01 - log k02 log k01 - R2 log k01 - E02 R1 - E01
distribution
F
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
-0.804 -0.827 -0.813 -0.947 -0.946 -0.951 -0.078 -0.086 -0.093 0.075 0.115 0.079 -0.125 -0.126 -0.123 -0.091 -0.040 -0.069 0.721 0.744 0.736
pair R1 - log k R1 - log k02 R 1 - R2 R1 - E02 E01 - log k E01 - log k02 E01 - R2
distribution 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
F 0.139 0.123 0.142 -0.023 -0.061 -0.038 0.083 0.078 0.078 0.073 0.043 0.062 0.102 0.109 0.114 -0.071 -0.110 -0.072 0.112 0.120 0.116
pair E0
1
-
E0
2
log k - log k02 log k - R2 log k - E02 log k02 - R2 log k02 - E02 R2 - E02
distribution
F
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
0.122 0.084 0.103 -0.287 -0.299 -0.299 -0.083 -0.041 -0.080 -0.010 -0.031 0.019 0.104 0.066 0.060 0.195 0.193 0.239 -0.242 -0.280 -0.288
a The notation and true parameter values are as in Table 1. The most correlated and least correlated pairs identified by bootstrap variant 2 are in boldface type.
8434
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
Figure 2. Histograms of single-parameter distributions for parameters log k01 (a), R1 (b), E01 (c), log k (d), log k02 (e), R2 (f), and E02 (g). The true parameters are as in Table 1. Notation: (×) true distribution, (O) bootstrap variant 1, (b) bootstrap variant 2. Dashed curves denote Gaussian distributions having the same mean and standard deviation as was obtained from the true distributions.
reference maps created here, were found to be generally quite similar, but in some instances discrepancies were observed, due to the limited accuracy of the map from ref 1. The discrepancies took the form of small systematic deviations of the distribution mean values from the correct values, or artifacts of the histograms, caused either by the map discontinuities at the boundaries of the parameter subdomains or by the nonsmooth dependencies on the parameters within subdomains (piecewise linear and bilinear interpolations were used1). For these reasons, the results obtained
by using the locally applicable accurate reference maps are more conclusive for the purpose of the present study, and only they will be shown. The results presented below in Tables 1 and 2 and Figures 1-3 correspond to one interesting parameter combination allowing us to demonstrate the capabilities of the bootstrap method. The local map used to obtain these results was created for the parameter domain: log k01 ∈ [-0.4, -0.2], R1 ∈ [0.42, 0.5], E01 ∈ [-7.2, -6.8], log k ∈ [-0.92, -0.88], log k02 ∈ [-0.43, -0.37], R2 Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
8435
Figure 3. Histogram and its contour plot of the two-parameter distribution for the most correlated (a) and least correlated (b) parameter pair from Table 2, according to bootstrap variant 2. Contour levels are multiples of 100 (a) and 500 (b).
∈ [0.5, 0.56], and E02 ∈ [-15.08, -14.94]. The number of interpolation nodes was N ) 11 in each parameter interval, and the HDMR reference cut point was placed in the middle of the domain. The largest error of the map was 0.000628 (max |ISIM| min |ISIM|) and 0.000578 (max |ISIM| - min |ISIM|), as determined by tests 1 and 2, respectively.1 Thus, the local map was ∼2 orders of magnitude more accurate than the map developed in ref 1, for which the corresponding errors were 0.0298 (max |ISIM| - min |ISIM|) and 0.0148 (max |ISIM| - min |ISIM|). The computational time needed for obtaining each of the histograms summarized in Tables 1 and 2 was about 1.2-1.6 h. The calculations in each case involved a HDMR map-based computation of ∼2.2 × 106 voltammograms, when the simplex iterations were started from the true parameter values. Analogous determinations using the map from ref 1 and different starting values were more expensive and took ∼2.6-3.1 h, involving a computation of over 5 × 106 voltammograms. Although these computational costs are large, they are at an acceptable level enabled by the use of HDMR maps. In contrast, in accordance with our previous evaluation of the voltammogram computation times,1 analogous calculations performed by means of averagely efficient contemporary digital simulation methods6 would be ∼180 times more time-consuming, requiring ∼9-23 days just for obtaining one distribution, which would clearly be outside of reasonable limits. As can be seen from Table 1 and Figure 2, the two bootstrap variants provide almost identical single-parameter distributions, although the distributions for variant 2 appear a bit smoother and have a higher fidelity. In agreement with the bootstrap principle previously mentioned, the bootstrap distributions generally do not coincide with the true distributions: they are shifted along the parameter axes and have different mean values, which agree well 8436
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
with the best-fit parameter values. However, all three distributions have similar shapes and comparable variances. Therefore, the bootstrap distributions provide a satisfactory quantification of the statistical errors of the best-fit parameter values. For the results of Table 1, the true parameter values always lie within the ranges predicted by the bootstrap confidence intervals for β ) 0.99. Figure 2 demonstrates also that the single-parameter distributions are very close to normal. We stress that these normal parameter error distributions are inferred by bootstrap, not assumed by the procedure. This confirms the reliability of bootstrap in producing the expected parameter distributions in the situation considered, when the data and parameter errors are relatively small, and when normal data distribution is assumed. In order to obtain non-normal parameter error distributions at the present assumption of normal data errors, the parameter errors would have to be so large that the local linearity of the best-fit response as a function of parameters would no longer be a good approximation. We have not observed such large errors and departures from normal distributions in the present tests, but in general they are conceivable (see, e.g., the asymmetric contour plots of the sum of squares reported by Rees et al.30). Table 2 reveals that the correlation coefficients for parameter pairs, obtained from the population of bootstrap replicates, are qualitatively consistent with the estimates of the true correlation coefficients, although the number B of replicates would probably have to be still larger to obtain better quantitative agreement. We can see, for example, that parameters log k01, R1, and E01 of the first (faster) charge-transfer step (4) are consistently identified as rather strongly correlated, whereas parameters log k02, R2, and E02 of the second (slower) charge transfer (6) are less correlated. The presence or absence of the correlations is also seen upon inspection of the plots of two-parameter histograms. Representative histograms of this kind are depicted in Figure 3. Knowledge of parameter correlations is just as important as the knowledge of individual parameter errors,10 because in the case of significant correlations, the physical roles of the parameters are not independent. The particular correlations between log k01, R1, and E01, observed in Figure 3a and Table 2, appear to result from an asymmetry in the amount of information about the reduction and oxidation currents of reaction 4, contained in the experimental voltammogram of Figure 1. Assuming, in agreement with the shape of the voltammogram, that relatively little information about the oxidation current is available, the experimental data are dominated by the information about the values of the expression k01 exp{-R1[E(t) - E01]}, resulting from the Butler-Volmer equation for the cathodic current (cf. Table 1 in Bieniasz and Rabitz1). This implies an interplay between parameters k01 and E01, such that the minimum of the sum of squares occurs for a continuum of the parameter values satisfying k01 exp(R1 E01) ) const. Thus, an increase of E01 is correlated with the decrease of log k01, as is seen in Figure 3a. Correlations involving R1 are harder to explain, because the values of E(t), most effective in providing the information about the current of reaction 4, additionally play a role. CONCLUSIONS The results obtained in this study indicate that the bootstrap resampling technique presents a reliable approach to determining statistical distributions of the errors associated with parameters
estimated by nonlinear regression of CV data. The method does not assume normality of either the data errors or the parameter estimates. Therefore, bootstrap is more universally applicable than the conventional approach to parameter error estimation, which relies on local approximations in the parameter space and the assumption of normal distributions. The employment of fast HDMR solution maps representing theoretical cyclic voltammograms substantially reduces the considerable computational cost of the bootstrap technique, bringing it down to a level that should be acceptable for off-line CV data analysis in most laboratories. However, it is important that the maps be sufficiently accurate, which calls for developing efficient and automatic algorithms of generating such maps.
ACKNOWLEDGMENT The major part of this work has been done during L.K.B.’s visit at Princeton University, in May-August 2005. The computer code used in this and the previous1 studies is available from L.K.B. on request. The authors acknowledge support from the U.S. National Science Foundation, the Army Research Office, and the Environmental Protection Agency. L.K.B. acknowledges useful correspondence with Prof. A. C. Davison, Prof. H. Dette, Prof. B. Efron, and Prof. R. A. Sohn.
Received for review June 28, 2006. Accepted October 2, 2006. AC061167Z
Analytical Chemistry, Vol. 78, No. 24, December 15, 2006
8437