Ind. Eng. Chem. Res. 2004, 43, 1395-1404
Robustness of Nonlinear Regression Methods under Uncertainty: Applications in Chemical Kinetics Models

German Vidaurre, Victor R. Vasquez,* and Wallace B. Whiting

Chemical Engineering Department, University of Nevada-Reno, Reno, Nevada 89557
In this work, we study the robustness of nonlinear regression methods under uncertainty for parameter estimation for chemical kinetics models. We used Monte Carlo simulation to study the influence of two main types of uncertainty, namely, random errors and incomplete experimental data sets. The regression methods analyzed were least-squares minimization (LSM), maximum likelihood (ML), and a method that automatically reweights the objective function during the course of the optimization called IVEM (inside-variance estimation method). Although this work represents a preliminary attempt toward understanding the effects of uncertainty in nonlinear regression, the results from the analysis of the case studies indicate that the performance of the regression procedures can be highly sensitive to uncertainty due to random errors and incomplete data sets. The results also suggest that traditional methods of assigning weights a priori to regression functions can affect the performance of the regression unless these weights correspond to a careful characterization of the residual statistics of the regression problem. In the case in which there is no prior knowledge of these weights (particularly in maximum likelihood regression), we suggest that they be characterized, in a preliminary way, by performing a least-squares minimization regression first or by using a method that automatically estimates these weights during the course of the regression (IVEM). We believe that the performance of regression techniques under uncertainty requires attention before a regression method is chosen or the parameters obtained are deemed valid.

1. Introduction

Many different regression strategies can be used to fit experimental data by optimizing the parameters of linear and nonlinear models. Of particular interest are approaches that are sufficiently accurate to represent physical or chemical phenomena as well as efficient from a computational standpoint.
For chemical engineering applications (particularly in the areas of thermodynamics and reaction engineering), parameter regression or estimation is an important issue. Although generalized linear models play an important role in many applications, the focus of the present work is on nonlinear models, which usually present more challenges and yet are very common in thermodynamics and reaction kinetics. The most common nonlinear regression approaches used for the past 20 years are the least-squares minimization (LSM) and maximum likelihood (ML) methods. Even though these two approaches have been very successful in solving important problems, they present limitations, in particular, in the choice of appropriate weighting factors and in the definition of the objective function so that the values of the regression parameters are properly optimized. Additionally, little is known about the performance of these methods under uncertainty. Traditionally, the weighting factors are chosen using dispersion characterization parameters of experimental measurements such as the standard deviation. Often, the general definition of the objective function is specific to the nature of the problem being analyzed. For instance, Sørensen et al.1 and Whiting et al.2 present a summary of objective function definitions for the parameter regression of models used in the prediction of liquid-liquid equilibria that reflects the difficulties in choosing appropriate regression objective functions for complex regression problems.

In this work, we study the impact that the definition of the regression objective function has on parameter estimation under uncertainty in the experimental data. We compare a method that automatically reweights the objective function during the course of the optimization3 against more traditional methods such as the LSM and ML regression methods. Two case studies are presented. The first consists of finding the parameters of a kinetic model used to describe the process of converting methanol into different hydrocarbons,4-7 and the second is a kinetic model for the isomerization of α-pinene into dipentene and allo-ocimene.8-10

The remainder of the paper is organized as follows: Section 2 summarizes some of the basic ideas behind parameter estimation. Section 3 presents a basic description of the regression method introduced by Vasquez and Whiting3 and is followed by the case studies in sections 4 and 5. Section 6 provides concluding remarks and describes future work.

* To whom correspondence should be addressed. E-mail: [email protected]. Tel.: (775) 784-6060. Fax: (775) 784-4764.

2. Parameter Estimation

The estimation of parameters in kinetic models from time series data and other data types is essential for the design, optimization, and control of many chemical systems. The models that describe the reaction kinetics
10.1021/ie0304762 CCC: $27.50 © 2004 American Chemical Society Published on Web 02/20/2004
usually take the form of a set of differential algebraic equations. The statistics and model formulation for such problems have been widely studied, but the regression approach is still a matter of discussion in engineering applications. The accurate prediction of specific kinetic rates is a crucial step in reactor design and in other areas such as the prediction of thermodynamic properties and the design of separation processes. For modeling purposes, the following steps are typical: (a) experimental data measurement, (b) model development, and (c) model parameter estimation.

The two main sources of uncertainty for the parameter values are (a) the method used for obtaining the parameters and (b) the experimental data. A third type of error that can be important is modeling error, which quantifies the inability of the chosen model and parametrization to describe the physical system under study perfectly. In this work, it is assumed that there is no modeling error. The uncertainty associated with experimental data is usually described in terms of random and systematic errors. Often, the quantification of random errors is reduced to the apparatus precision. Systematic uncertainties are conceptually well-defined, but they are traditionally ignored in regression.3 Such uncertainties are mainly caused by calibration errors and changing environmental conditions, among others.

In addition to the typical problems encountered with regression objective functions (e.g., multiple local minima), even if the optimization method guarantees that the global minimum is found, the parameters obtained are still not optimal if the objective function is not appropriate. To achieve an accurate prediction, all of the effects of the aforementioned errors have to be minimized or taken into account.
The most common approach used to define regression objective functions is to minimize the sum of the squared differences between the experimental and predicted values (LSM) or to maximize the probability that the residual statistics of the model predictions correspond to the probability distribution of the uncertainty present in the experimental data (ML). Starting with a data set of n observations for m dependent variables denoted as (x_ij, y_ij), where i = 1, 2, ..., n, and j = 1, 2, ..., m, and where the array x denotes a vector of independent variables, the first approach (LSM) consists of finding the values of the model parameters θ such that the following function is minimized
Σ_j Σ_i [y_ij − f_j(x_ij, θ)]²  (1)

where ε_ij represents the error of the model prediction

y_ij = f_j(x_ij, θ) + ε_ij  (2)

Table 1. Regressed Parameter Sets for the Kinetic Model (Eq 7) Describing the Methanol to Hydrocarbon Conversion Process

θ̂_i    true θ*_i   IVEM^a   ML^a    LSM^a   IVEM^b   ML^b    LSM^b
θ_1     2.69        3.145    3.211   3.182   3.174    3.189   3.179
θ_2     0.50        0.373    0.848   0.712   0.041    0.269   0.440
θ_3     3.02        3.457    3.052   3.166   3.783    3.596   3.500
θ_4     0.50        0.232    0.054   0.006   0.698    0.594   0.492
θ_5     0.50        1.628    1.980   1.944   3.990    4.819   4.634

^a Entire data set used in the regressions. ^b Only the first half of the data set used in the regressions.

Figure 1. Typical synthetic data set generated by eq 10 and used to regress the parameters of the model given by eq 7. The curves correspond to the predictions of the parameters given by Maria.4 Plotted are the profiles of the species A (△), C (○), and P (∗).

Assuming that the joint probability distribution of ε_ij is known, the maximum likelihood estimate of θ is obtained by maximizing the likelihood function that best reproduces the joint probability distribution of ε_ij. Suppose that the ε_ij are independent and identically distributed for a fixed dependent variable j, with a density function σ_j⁻¹ g_j(ε_ij/σ_j), so that g_j is the standardized error probability distribution of unit variance. Then, the likelihood function is defined as

p(y_ij | θ, σ_j²) = Π_{j=1}^{m} Π_{i=1}^{n} [σ_j⁻¹ g_j((y_ij − f_j(x_ij, θ))/σ_j)]  (3)

If the ε_ij's are independent and identically normally distributed for a fixed dependent variable j, then the function g_j is given by

g_j(x_ij, y_ij) = (1/(σ_j √(2π))) exp{−[y_ij − f_j(x_ij, θ)]²/(2σ_j²)}  (4)

and the logarithm of eq 3 becomes

S(θ) = −log p(y_ij | θ, σ_j²) = (mn/2) log 2π + (n/2) log Π_{j=1}^{m} σ_j² + (1/2) Σ_{j=1}^{m} Σ_{i=1}^{n} [(y_ij − f_j(x_ij, θ))/σ_j]²  (5)
In this likelihood function, σ_j² is the variance, or error term, which cannot be estimated from the experimental data, given that it is a function of the predicted residuals. Herein lies the difference between traditional approaches and the method proposed by Vasquez and Whiting.3 For nonlinear regression, traditional approaches assume a priori estimation of the variance σ_j² based on experimental precision. However, such an approach is internally inconsistent because this estimate is not equal to the final values obtained from the residuals at the end of the regression procedure.
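The role of the a priori weights is easier to see in code. The sketch below evaluates the negative log-likelihood of eq 5 with the variances supplied as fixed values, which is the traditional practice just described. It is an illustration only, not the authors' code; the `model` callable and the array shapes are assumptions made for the example.

```python
import numpy as np

def neg_log_likelihood(theta, x, y, model, sigma):
    """Negative log-likelihood of eq 5 with fixed, a priori sigma_j.

    y and model(x, theta) have shape (n, m): n observations of m
    dependent variables; sigma has shape (m,).
    """
    resid = y - model(x, theta)                 # y_ij - f_j(x_ij, theta)
    n, m = y.shape
    return (0.5 * m * n * np.log(2.0 * np.pi)
            + n * np.sum(np.log(sigma))         # = (n/2) log prod_j sigma_j^2
            + 0.5 * np.sum((resid / sigma) ** 2))
```

Because `sigma` is fixed, nothing forces it to match the residual statistics at the optimum, which is precisely the internal inconsistency noted above.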
Figure 2. Performance comparison of the mean errors and standard deviations in the predictions of the concentrations of the species A, C, and P of the kinetic model given by eqs 7-9, respectively. The comparison is for the regression methods LSM (◇), ML (□), and IVEM (△) using complete data sets (no simulation of missing data; see the methanol-to-hydrocarbons case study for details).
Figure 3. Performance comparison of the mean errors and standard deviations in the predictions of the concentrations of the species A, C, and P of the kinetic model given by eqs 7-9, respectively. The comparison is for the regression methods LSM (◇), ML (□), and IVEM (△) using incomplete data sets (simulation of missing data; see the methanol-to-hydrocarbons case study for details).
The method of Vasquez and Whiting,3 called IVEM (inside-variance estimation method), consists of using eq 5 but, instead of using a priori values of the variances σ_j, obtaining these values at each iteration of the optimization procedure from the residual distribution of the estimates. Thus, at the end of the optimization process, the statistical properties of the residual distribution of the estimates are consistent with the variance used in the maximum likelihood function, guaranteeing the most likely values of the parameters, assuming that there is no modeling error and, therefore, that the residuals are Gaussian. For the case in which there is significant modeling error, the form of the function g_j in eq 3 is no longer Gaussian. One should, in principle, know a valid form of g_j(ε_ij/σ_j) to define the likelihood function. However, eq 5 coupled with the IVEM approach still presents a significant advantage over traditional regression approaches in the sense that the objective function is always reweighted according to the estimates of the residuals.
3. IVEM Algorithm

Here, we describe the basic algorithm of the IVEM regression methodology (for more details, see Vasquez and Whiting3). The algorithm starts with an appropriate initial guess for the parameters. A good estimate is essential for rapid convergence of the algorithm. In general, the IVEM algorithm exhibits slower convergence than more traditional regression methods. The algorithm is summarized as follows:
1. Initialize variables.
2. Input experimental data.
3. Input initial guess θ.
4. Compute estimates of the dependent variables using the current value of θ.
5. Compute the sum of the squares of the errors, where the error is defined as the difference between the predicted and experimental values for each dependent variable.
6. Compute the variance of the errors for each dependent variable.
7. Evaluate the objective function as defined by eq 5.
8. If the objective function has reached a minimum, store the current value of θ. If not, change θ according to the optimization procedure being used and go to step 4.
9. Repeat steps 4-8 until the minimum of the objective function is reached.

4. Illustrative Case Study: Conversion of Methanol to Hydrocarbons

For the conversion of methanol to hydrocarbons, many kinetic models have been proposed. The model analyzed here (see Maria,4 Chang,6 and Anthony5
Figure 4. Reaction diagram for the thermal isomerization of α-pinene at 189.5 °C.9

Table 2. Experimental Data of Percentage Concentrations versus Time for the Isomerization of α-Pinene at 189.5 °C^a

time (min)   y1 (α-pinene)   y2 (dipentene)   y3 (allo-ocimene)   y4 (pyronene)   y5 (dimer)
1230         88.35           7.3              2.3                 0.4             1.75
3060         76.40           15.6             4.5                 0.7             2.80
4920         65.10           23.1             5.3                 1.1             5.80
7800         50.40           32.9             6.0                 1.5             9.30
10680        37.50           42.7             6.0                 1.9             12.00
15030        25.90           49.1             5.9                 2.2             17.00
22620        14.00           57.4             5.1                 2.6             21.00
36420        4.50            63.1             3.0                 2.9             25.70

^a Data taken from Box et al.9
for details) is as follows

A --(k1)--> B  (6a)
A + B --(k2)--> C  (6b)
C + B --(k3)--> P  (6c)
A --(k4)--> C  (6d)
A --(k5)--> P  (6e)
A + B --(k6)--> P  (6f)

where A denotes oxygenates (without water); B denotes carbenes (CH2); C denotes olefins; and P denotes paraffins, aromatics, and other products. The constants k1-k6 are the specific rates of reaction and the parameters to be regressed. Maria4 assumed only first-order reactions and a constant carbene concentration for isothermal reaction conditions. Considering the quasi-stationary hypothesis5 and eliminating B, it is possible to obtain the model

dA/dt = −[2k1 + k4 + k5 + k1C/((k23 + k63)A + C)]A  (7)

dC/dt = k1A(k23A − C)/((k23 + k63)A + C) + k4A  (8)

dP/dt = k1A(k63A + C)/((k23 + k63)A + C) + k5A  (9)
where A and C denote the concentrations of components A and C, respectively; k23 = k2/k3; and k63 = k6/k3. Note that, in this arrangement, there are only five regression parameters instead of six. Regressions for obtaining the model parameters (k1, k4, k6, k23, k63) were performed using three different approaches: least-squares minimization (LSM), maximum likelihood (ML), and the inside-variance estimation method (IVEM).

Figure 5. Experimental and predicted data for the chemical kinetics of the α-pinene isomerization case study. The predicted curves were obtained using the parameters reported by Box et al.9 (See section 3 for more details.)

Table 3. Random Error Levels Used in Synthetic Data Generation for the α-Pinene System^a

run   cv1    cv2    cv3    cv4    cv5    range (min)
1     0.00   0.00   0.00   0.00   0.00   0-15 000
2     0.10   0.00   0.00   0.00   0.00   0-15 000
3     0.00   0.10   0.00   0.00   0.00   0-15 000
4     0.00   0.00   0.10   0.00   0.00   0-15 000
5     0.00   0.00   0.00   0.10   0.00   0-15 000
6     0.00   0.00   0.00   0.00   0.10   0-15 000
7     0.10   0.05   0.05   0.05   0.05   0-15 000
8     0.05   0.10   0.05   0.05   0.05   0-15 000
9     0.05   0.05   0.10   0.05   0.05   0-15 000
10    0.05   0.05   0.05   0.10   0.05   0-15 000
11    0.05   0.05   0.05   0.05   0.10   0-15 000
12    0.10   0.10   0.10   0.10   0.10   0-15 000
13    0.20   0.10   0.10   0.10   0.10   0-15 000
14    0.20   0.20   0.10   0.10   0.10   0-15 000
15    0.20   0.20   0.20   0.10   0.10   0-15 000
16    0.20   0.20   0.20   0.20   0.10   0-15 000
17    0.20   0.20   0.20   0.20   0.20   0-15 000
18    0.10   0.10   0.10   0.10   0.10   0-10 000
19    0.10   0.10   0.10   0.10   0.10   0-20 000
20    0.10   0.10   0.10   0.10   0.10   0-30 000
21    0.10   0.10   0.10   0.10   0.10   0-40 000

^a All runs were performed using 15 data points in the indicated range.

The main objective of this work was to study the performance of these regression methods under uncertainty. Therefore, the methodology in this case study consisted of taking the set of parameters that gives the best fit of the experimental data as the true parameters (notice that this is an artificial construct to facilitate the stochastic analysis of this problem). Then, using these "true" parameters, one can generate predictions of the measured quantities (y_ij) and add Gaussian noise to simulate the effect of uncertainty in the experimental data. Many synthetic data sets can be generated in this way with different levels of uncertainty, which can then be used to regress the parameters of the model (θ̂) and compare the results against the true set θ*. The true parameter set used in this example is reported by Maria,4 θ* = (2.69, 0.5, 3.02, 0.5, 0.5), with all of the values in appropriate dimensions according to the rate equations. The Gaussian error is added as follows
y_ij = y_ij^true + r × cv_j  (10)
Table 4. AARD Results for the Three Regression Methods Investigated, Namely, LSM, ML, and IVEM^a

              LSM                                       ML                                        IVEM
run   AARD1  AARD2  AARD3  AARD4  AARD5      AARD1  AARD2  AARD3  AARD4  AARD5      AARD1  AARD2  AARD3  AARD4  AARD5
1     0.000  0.000  0.000  0.000  0.000      0.000  0.000  0.000  0.000  0.000      0.169  0.338  0.027  0.720  0.773
2     0.014  0.054  0.205  0.016  0.033      0.015  0.043  0.130  0.018  0.087      0.000  0.000  0.024  0.075  0.211
3     0.021  0.037  0.182  0.014  0.027      0.020  0.028  0.116  0.011  0.037      0.000  0.000  0.000  0.001  0.000
4     0.000  0.000  0.000  0.000  0.000      0.000  0.000  0.000  0.000  0.000      0.000  0.000  0.000  0.001  0.000
5     0.000  0.001  0.007  0.001  0.005      0.000  0.001  0.012  0.001  0.010      0.169  0.338  0.042  0.622  0.466
6     0.000  0.003  0.103  0.107  0.302      0.001  0.004  0.063  0.033  0.148      0.169  0.338  0.296  0.712  0.968
7     0.018  0.550  0.218  0.034  0.129      0.019  0.044  0.138  0.027  0.118      0.016  0.010  0.014  0.018  0.053
8     0.027  0.052  0.240  0.035  0.140      0.026  0.040  0.153  0.022  0.089      0.021  0.012  0.015  0.017  0.048
9     0.014  0.036  0.170  0.032  0.120      0.013  0.029  0.109  0.026  0.100      0.015  0.010  0.015  0.028  0.110
10    0.012  0.035  0.169  0.035  0.150      0.012  0.027  0.107  0.023  0.097      0.013  0.010  0.025  0.015  0.050
11    0.012  0.035  0.195  0.067  0.284      0.012  0.028  0.122  0.034  0.138      0.014  0.019  0.017  0.030  0.063
12    0.026  0.069  0.305  0.068  0.276      0.026  0.055  0.195  0.040  0.159      0.030  0.018  0.029  0.036  0.101
13    0.037  0.109  0.435  0.419  0.371      0.038  0.088  0.274  0.052  0.219      0.034  0.024  0.028  0.036  0.104
14    0.048  0.118  0.479  0.098  0.397      0.046  0.097  0.331  0.056  0.234      0.051  0.024  0.031  0.035  0.115
15    0.052  0.144  0.620  0.118  0.319      0.051  0.113  0.364  0.068  0.259      0.056  0.024  0.031  0.047  0.182
16    0.052  0.129  0.497  0.078  0.311      0.051  0.101  0.330  0.061  0.230      0.059  0.025  0.057  0.048  0.190
17    0.050  0.151  0.630  0.129  0.506      0.051  0.120  0.416  0.093  0.351      0.058  0.039  0.054  0.081  0.252
18    0.027  0.086  0.428  0.074  0.525      0.027  0.066  0.268  0.060  0.340      0.027  0.016  0.032  0.053  0.314
19    0.026  0.066  0.248  0.061  0.195      0.025  0.053  0.166  0.043  0.129      0.027  0.026  0.029  0.035  0.064
20    0.027  0.057  0.165  0.062  0.119      0.027  0.049  0.122  0.044  0.094      0.030  0.033  0.025  0.032  0.042
21    0.029  0.057  0.129  0.074  0.100      0.029  0.049  0.102  0.044  0.075      0.027  0.046  0.026  0.041  0.041

^a Descriptions of the error levels for the runs are given in Table 3.
Figure 6. LSM predictions (100 shown) for the reactant and product concentrations of the species involved in the isomerization of α-pinene. The predictions are for the whole time range reported by Box et al.,9 but the regressions were performed using the conditions reported for run 11 in Table 3.
where r is a Gaussian random deviate with mean 0 and variance 1, cv_j is the assumed maximum level of error for the dependent variable j, y_ij^true represents the values predicted by the true parameters θ*, and y_ij corresponds to the simulated experimental data. A maximum cv value of 0.1 was used for this case study. A typical synthetic data set is presented in Figure 1 together with the true data (continuous lines).
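The synthetic-data procedure can be made concrete with a short sketch that integrates eqs 7-9 and perturbs the trajectories per eq 10. It is an illustration only, not the authors' code: the initial condition (A = 1, C = P = 0), the time grid, and the mapping of θ* onto the five rate constants that appear in eqs 7-9 (taken here in the order k1, k4, k5, k23, k63) are assumptions made for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, s, k1, k4, k5, k23, k63):
    """Right-hand sides of eqs 7-9; s = (A, C, P)."""
    A, C, P = s
    d = (k23 + k63) * A + C                       # common denominator
    dA = -(2.0 * k1 + k4 + k5 + k1 * C / d) * A   # eq 7
    dC = k1 * A * (k23 * A - C) / d + k4 * A      # eq 8
    dP = k1 * A * (k63 * A + C) / d + k5 * A      # eq 9
    return [dA, dC, dP]

def synthetic_dataset(theta, t_eval, cv, rng, y0=(1.0, 0.0, 0.0)):
    """One synthetic data set per eq 10: y_ij = y_ij^true + r*cv_j."""
    sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), y0, t_eval=t_eval,
                    args=tuple(theta), rtol=1e-8, atol=1e-10)
    y_true = sol.y.T                              # columns: A, C, P
    return y_true + rng.standard_normal(y_true.shape) * cv
```

Regressing θ̂ from such a data set then amounts to minimizing one of the three objective functions over repeated calls to the same integrator.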
Figure 7. ML predictions (100 shown) for the reactant and product concentrations of the species involved in the isomerization of α-pinene. The predictions are for the whole time range reported by Box et al.,9 but the regressions were performed using the conditions reported for run 11 in Table 3.
Two cases were considered: (a) use of 100 data sets of 20 points each generated with eq 10 and (b) use of the first half of the synthetic data of case a. The reason for using the first half in case b was to study the effect of having incomplete experimental data. This is a common situation that merits attention regarding the capability of a regression method to obtain reasonable parameters under this type of uncertain condition. Although having only the first half of the data set might not be common, it is still an interesting extreme scenario with which to test regression methods. The three objective function definitions described in section 2, namely, LSM, ML, and IVEM, were used to regress the parameters. For the maximum likelihood method, the objective function was the sum of the squares of the differences between the measured data and the model prediction for all variables, weighted by the estimated standard deviation of the difference between the experimental data and the data predicted by the least-squares minimization method. The corresponding weights in this case are σ_A = 0.00121, σ_C = 0.00167, and σ_P = 0.0119. Notice that, by weighting the objective function in this way, a better approximation of the residual statistics is already obtained than would be the case using measurement equipment precision statistics. The optimization procedure used was a successive unidirectional minimization method using the modified
direction set method of Powell,11 with the parameters reported by Maria4 as initial guesses. The parameter sets regressed by LSM, ML, and IVEM are presented in Table 1 for both cases of using the complete synthetic data set in the regressions and using only the first half of the data set. Figure 2 summarizes the general statistical results of this regression for the case of using the complete data set. The mean error in Figure 2 is defined as the mean of the difference between the concentrations of species A, C, and P predicted using the regressed parameters and the predictions given by the true parameters θ*; in other words, diff A = A(θ̂) − A(θ*), diff C = C(θ̂) − C(θ*), and diff P = P(θ̂) − P(θ*), respectively (the predicted values of the compositions are obtained using eqs 7-9). From this figure, we observe that the three regression methods perform reasonably well and give similar results for the parameters. In this case, the three methods perform well under the effect of random Gaussian uncertainty only. On the other hand, Figure 3 presents the equivalent of Figure 2 but for the case of regressing the parameters using the first half of the data set only. It is clear that the LSM method does not perform as well as the other two (i.e., ML and IVEM), which suggests that ML and IVEM tend to be more robust regression methods under this type of uncertain condition (this is also observed in the next case study). However, the fact that both ML and IVEM produce similar results is not surprising if we
Figure 8. IVEM predictions (100 shown) for the reactant and product concentrations of the species involved in the isomerization of α-pinene. The predictions are for the whole time range reported by Box et al.,9 but the regressions were performed using the conditions reported for run 11 in Table 3.
recall that the weights used for the ML regression function were obtained from the residual statistics of the LSM predictions with respect to the predictions of the true parameters. This also indicates the importance of having good residual statistics if one wants to use ML methods with a priori weights. Traditional practice involves the use of weights obtained from the precision of the measurements, in this case, for the concentrations of the species A, C, and P. Now, it is not uncommon to find that the precision statistics can be about the same for concentration measurements (depending on the experimental methods used), which will cause the weights (standard deviations of the measurement error) to be equal, thus transforming the ML regression function into an LSM regression case. If the precision statistics are different for the concentrations of the different species, then the ML regression function will have different weights (σj), but there is no guarantee that these σj’s will correspond to the final statistics of the residuals. There are some important differences between Figures 2 and 3 that are worth mentioning. For the ML method, the error present is about 15.8 times larger than that present when using the entire synthetic data set in the regression. For the inside-variance estimation method, this error is almost the same whether using the entire set or just the first half of the synthetic data set. Reducing the range of the synthetic data set even
more produces a substantial increase in the error statistics of the three regression methods. However, as the range used increases, the effectiveness of the IVEM method, indicated by low values of the variances, improves faster than the effectiveness of the other two methods. In general, it is observed that the errors in the data predicted using the parameter sets obtained from LSM tend to be the largest.

5. Illustrative Case Study: α-Pinene Isomerization

Box et al.9 present an interesting example of a complex chemical reaction for the thermal isomerization of α-pinene at 189.5 °C. The chemical reaction is the isomerization of α-pinene (y1) to dipentene (y2) and allo-ocimene (y3). The latter then reacts to yield α- and β-pyronene (y4) and a dimer (y5). The unknown parameters θ_r (r = 1-5) are the rate constants of the reactions. The reaction scheme is presented in Figure 4. This process was studied experimentally by Fuguitt and Hawkins,12 who reported the concentrations of the reactant and the five products (see Table 2). If the chemical reaction orders are known, then mathematical models can be derived to model the concentrations of the five chemical species as a function of time. Hunter and McGregor9 assumed first-order kinetics and derived the following equations for the five products
Figure 9. AARDs for each of the parameters regressed in the chemical kinetics model of the α-pinene isomerization case study. The results are presented for different uncertainty levels. Case a corresponds to the conditions of run 8, case b to run 11, case c to run 13, and case d to run 17. (See Table 3 for the description of the uncertainty levels.)
E(y_i1) = y_01 e^(−φt_i)  (11)

E(y_i2) = (θ_1 y_01/φ)(1 − e^(−φt_i))  (12)

E(y_i3) = C_1 e^(−φt_i) + C_2 e^(βt_i) + C_3 e^(γt_i)  (13)

E(y_i4) = θ_3[(C_1/φ)(1 − e^(−φt_i)) + (C_2/β)(e^(βt_i) − 1) + (C_3/γ)(e^(γt_i) − 1)]  (14)

E(y_i5) = θ_4[(C_1/(θ_5 − φ))e^(−φt_i) + (C_2/(θ_5 + β))e^(βt_i) + (C_3/(θ_5 + γ))e^(γt_i)]  (15)

where y_01 is the value of y_1 at t = 0 and

α = θ_3 + θ_4 + θ_5
β = (1/2)(−α + √(α² − 4θ_3θ_5))
γ = (1/2)(−α − √(α² − 4θ_3θ_5))
φ = θ_1 + θ_2
C_1 = θ_2 y_01 (θ_5 − φ)/((φ + β)(φ + γ))
C_2 = θ_2 y_01 (θ_5 + β)/((φ + β)(β − γ))
C_3 = θ_2 y_01 (θ_5 + γ)/((φ + γ)(γ − β))

The fitting results obtained by Box et al.9 are presented in Figure 5, together with the experimental data. In this work, the parameter set reported by Box et al.9 has been assumed to represent the true parameters, θ* = (5.93 × 10⁻⁵, 2.96 × 10⁻⁵, 2.05 × 10⁻⁵, 2.75 × 10⁻⁴, 4.00 × 10⁻⁵), with all values expressed in min⁻¹. The methodology followed is the same as in the previous case study, with minor variations. Synthetic data sets were generated by adding Gaussian random error at different levels to what we called the true predictions of the model according to the following equation
y_ij = y_ij^true(1 + r × cv_j)  (16)

where r is a Gaussian random number with mean 0 and variance 1 and cv_j is the assumed coefficient of variation for dependent variable j. The range of cv_j used was from 0 to 0.2. Notice that the error introduced by eq 16 is different from that of eq 10 in the previous case study. Equation
Figure 10. AARDs for each of the parameters regressed in the chemical kinetics model of the α-pinene isomerization case study. The results are presented for different data ranges used in the regression (effect of incomplete data sets). The regression data were chosen in the following time ranges: (a) 0-15 000, (b) 0-20 000, (c) 0-30 000, and (d) 0-40 000, all in minutes. The conditions used for the uncertainty levels are those of run 12. (See Table 3 for the description of the uncertainty levels.)
16 simulates an error that varies proportionally with the magnitude of the variable being considered. The reason for this change with respect to eq 10 is to simulate situations in which experimental errors vary according to the magnitude of the measurements. Using the so-called true parameters, points were generated in the time range of 0-15 000 min (the equivalent of the solid lines in Figure 5), and then 15 of them (equally spaced) were chosen to represent y^true in eq 16. Other time ranges studied in a similar manner were 0-20 000, 0-30 000, and 0-40 000 min. As before, the idea was to evaluate the robustness of the regression techniques using incomplete data sets and under the effect of experimental random Gaussian noise. To compare the aforementioned methods numerically, a goodness criterion was defined as follows

AARD_i = (1/100) Σ_{k=1}^{100} |θ̂_ik − θ*_i| / θ*_i  (17)

where θ̂_ik is the estimate of the ith true parameter θ*_i for the kth simulation run. Note that AARD_i is not any of the objective functions used in this work. It is, however, a traditional average absolute relative deviation (AARD) measuring the goodness of fit. AARD values were calculated to characterize the simulation runs at
the different error levels (cv_j) and for the different data sets (y_ij^true) used. The data subsets used in this work were chosen arbitrarily (simulation of incomplete experimental data); therefore, many other combinations would be interesting as well. Depending on the degree of incompleteness of the data set, there will be cases in which all methods fail to obtain reasonable values of θ̂ (when compared against θ*), as well as cases in which the best parameters are obtained depending on the ability of the regression approach to handle the simulated error imposed. Different levels of error, as well as different combinations of these levels, were used to analyze the performance of the regression approaches. Table 3 describes the different error levels used in eq 16 for the uncertain variables y_j. The AARD results are presented in Table 4. The interpretation of these results is better described by analyzing a specific run. Results corresponding to run 11 (see the conditions for this run in Table 3) are shown in Figures 6-8. We can see the variation of the different fits obtained by the regression methods under study. In general, the IVEM tends to be more robust than the other two, and as in the first case study, the LSM method seems to be the least robust. This type of analysis can be done for the other conditions, but in general, the same type of result is observed.
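Computing eq 17 from a batch of regressed parameter sets is straightforward; the following is a minimal sketch in which the array shapes (one row of θ̂ per simulation run) are assumptions made for the example.

```python
import numpy as np

def aard(theta_hat, theta_true):
    """Average absolute relative deviation (eq 17) for each parameter,
    averaged over simulation runs.

    theta_hat: shape (n_runs, n_params); theta_true: shape (n_params,).
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta_true = np.asarray(theta_true, dtype=float)
    # |theta_hat_ik - theta*_i| / |theta*_i|, averaged over the k runs
    return np.mean(np.abs(theta_hat - theta_true) / np.abs(theta_true), axis=0)

# e.g., two runs for two parameters:
# aard([[1.1, 2.0], [0.9, 4.0]], [1.0, 2.0]) -> [0.1, 0.5]
```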
Another way to analyze these results is to plot the AARD values on spider diagrams for the different error levels. These diagrams better summarize the performance of the regression on the predicted variables. For instance, Figure 9 shows such results for four different cases of error levels. The error levels used correspond to the conditions of runs 8, 11, 13, and 17 (see Table 3) for panels a-d of Figure 9, respectively. In all cases, the IVEM approach is substantially better (from the robustness standpoint) than the other two. Also notice that the LSM tends to diverge substantially faster with increasing error level. On the other hand, Figure 10 presents similar results but for the case of using different ranges for the data sets in the regression. In general, the IVEM approach is more robust under these circumstances as well, but the other two methods improve their performance as the range of the data set increases (from a to d). The improvement of the ML and LSM methods with increasing range of the data sets is not surprising, but it seems that, even for the case of using the whole range, IVEM is a more robust procedure. These results underscore the importance of having good statistics for the residuals in regression problems with uncertain experimental data.

6. Concluding Remarks

The robustness characteristics of traditional regression techniques such as LSM and ML and of a more recent approach (i.e., IVEM) were studied using two case studies in the field of chemical kinetics. From the analysis of the results, it is observed that the performance of traditional regression methods can be sensitive to uncertainty in experimental data. This uncertainty might include, but is not limited to, random and systematic errors, incomplete data sets, and modeling errors; in particular, incomplete data sets seem to have significant effects on the performance of traditional regression methods.
Maximum likelihood regression functions are usually weighted by the standard deviations of the measurement errors, based on equipment or experimental-technique statistics. Depending on the circumstances (uncertain conditions), this is not always the best strategy, as the examples studied show. Although a priori knowledge of the statistics of the residuals can be difficult to obtain, characterizing them with a simpler regression approach such as LSM and then using the results to define the weights in a more sophisticated approach such as ML proves to be significantly more robust than using arbitrary fixed weights in the regression objective function. The alternative procedure proposed by Vasquez and Whiting3 (IVEM), which automatically reweights the likelihood function during the course of the optimization, seems to be more robust under uncertain conditions. This method has the advantage of avoiding the preliminary analysis (a preliminary regression using LSM or ML) otherwise needed to characterize the residuals, and it handles uncertainty better than the other two techniques studied. A drawback of IVEM is its computational convergence, which seems to be slower than that of more traditional methods. This behavior can be improved by providing good initial guesses for the parameters
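The two-stage strategy described above (an unweighted LSM pass to characterize the residuals, followed by a variance-weighted refit) can be sketched as follows. This is an illustrative example only: the two-response first-order model A → B, the noise levels, and the coarse grid search (standing in for a proper optimizer) are all our own assumptions, not the paper's case studies.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(k, t):
    # Hypothetical first-order kinetics A -> B with cA(0) = 1:
    # responses are the concentrations of A and B.
    cA = np.exp(-k * t)
    return np.column_stack([cA, 1.0 - cA])

t = np.linspace(0.0, 5.0, 25)
k_true = 0.8
noise = np.array([0.01, 0.10])   # response B is much noisier than A
y = model(k_true, t) + rng.normal(0.0, noise, size=(t.size, 2))

def fit(weights):
    # Minimize the weighted sum of squared residuals over a fine grid
    # of the rate constant k (a stand-in for a real optimizer).
    ks = np.linspace(0.01, 3.0, 3000)
    sse = [np.sum(weights * (model(k, t) - y) ** 2) for k in ks]
    return ks[int(np.argmin(sse))]

# Stage 1: plain LSM (unit weights) to characterize the residuals.
k_lsm = fit(np.ones(2))
resid = model(k_lsm, t) - y
var = resid.var(axis=0)          # per-response residual variance

# Stage 2: ML-style refit with weights 1/sigma_i^2 estimated in stage 1.
k_ml = fit(1.0 / var)
print(k_lsm, k_ml)
```

The weighted refit down-weights the noisy response, so the parameter estimate is driven mostly by the precise one; IVEM goes one step further by updating these variance estimates inside the optimization loop itself rather than in a separate preliminary pass.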
being regressed.

Although this work represents a preliminary attempt toward understanding the effects of uncertainty in nonlinear regression, we believe that the performance of regression techniques under uncertainty requires attention before a regression technique is chosen or the parameters obtained are deemed valid. Future work includes the analysis of other objective-function formulations and of different applications. Additionally, assurance tests should be performed to guarantee that the global optimum is found for each of the stochastic scenarios. Although such tests were not performed in this work, the three regression methods (LSM, ML, and IVEM) were tested using many stochastic scenarios for two case studies, and the general trends of the results are characteristic of the regression methods. Therefore, we believe that the presence of some scenarios with parameters corresponding to local minima will not drastically change the trends observed.

Note Added after ASAP Posting

This article was released ASAP on February 20, 2004. The order of the authors has been changed, and the new version was posted on February 24, 2004.

Literature Cited

(1) Sørensen, J. M.; Arlt, W. Liquid-Liquid Equilibrium Data Collection; Chemistry Data Series; DECHEMA: Frankfurt/Main, Germany, 1980; Vol. 5.
(2) Whiting, W. B.; Vasquez, V. R.; Meerschaert, M. M. Techniques for Assessing the Effects of Uncertainties in Thermodynamic Models and Data. Fluid Phase Equilib. 1999, 158-160, 627-641.
(3) Vasquez, V. R.; Whiting, W. B. Regression of Binary Interaction Parameters for Thermodynamic Models Using an Inside-Variance Estimation Method (IVEM). Fluid Phase Equilib. 2000, 170, 235-253.
(4) Maria, G. An Adaptive Strategy for Solving Kinetic Model Concomitant Estimation-Reduction Problems. Can. J. Chem. Eng. 1989, 67, 825-832.
(5) Anthony, R. G. A Kinetic Model for Methanol Conversion to Hydrocarbons. Chem. Eng. Sci. 1981, 36, 789.
(6) Chang, C. D.
A Kinetic Model for Methanol Conversion to Hydrocarbons. Chem. Eng. Sci. 1980, 35, 619-622.
(7) Esposito, W. R.; Floudas, C. A. Global Optimization for the Parameter Estimation of Differential-Algebraic Systems. Ind. Eng. Chem. Res. 2000, 39, 1291-1310.
(8) Bates, D.; Watts, D. Nonlinear Regression Analysis and Its Applications; John Wiley and Sons: New York, 1988.
(9) Box, G. E.; Hunter, W. G.; MacGregor, J. F.; Erjavec, J. Some Problems Associated with the Analysis of Multiresponse Data. Technometrics 1973, 15 (1), 33-51.
(10) Tjoa, I.; Biegler, L. Simultaneous Solution and Optimization Strategies for Parameter Estimation of Differential-Algebraic Equation Systems. Ind. Eng. Chem. Res. 1991, 30, 376-385.
(11) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes, 2nd ed.; Cambridge University Press: New York, 1994.
(12) Fuguitt, R. E.; Hawkins, J. E. Rate of the Thermal Isomerization of α-Pinene in the Liquid Phase. J. Am. Chem. Soc. 1947, 69, 319.
Received for review June 5, 2003
Revised manuscript received September 29, 2003
Accepted January 30, 2004

IE0304762