Bayesian Framework for Building Kinetic Models of Catalytic Systems


Ind. Eng. Chem. Res. 2009, 48, 4768–4790

Shuo-Huan Hsu, Stephen D. Stamatis, James M. Caruthers, W. Nicholas Delgass,* and Venkat Venkatasubramanian
Center for Catalyst Design, School of Chemical Engineering, Purdue University, West Lafayette, Indiana 47907

Gary E. Blau, Mike Lasinski, and Seza Orcun
E-enterprise Center, Discovery Park, Purdue University, West Lafayette, Indiana 47907

Recent advances in statistical procedures, coupled with the availability of high performance computational resources and the large mass of data generated from high throughput screening, have enabled a new paradigm for building mathematical models of the kinetic behavior of catalytic reactions. A Bayesian approach is used to formulate the model building problem, estimate model parameters by Monte Carlo based methods, discriminate rival models, and design new experiments to improve the discrimination and fidelity of the parameter estimates. The methodology is illustrated with a typical model building problem involving three proposed Langmuir-Hinshelwood rate expressions. The Bayesian approach gives improved discrimination of the three models and higher quality model parameters for the best model selected, as compared to traditional methods that employ linearized statistical tools. This paper describes the methodology and its capabilities in sufficient detail to allow kinetic model builders to evaluate and implement its improved model discrimination and parameter estimation features.

1. Introduction

A validated kinetic model is a critical and well-recognized tool for understanding the fundamental behavior of catalytic systems. Caruthers et al.1 and others2-6 have suggested that the fundamental understanding captured by the model can also be used to guide catalyst design, where a microkinetic7 analysis is a key component of any fundamental model that intends to predict catalytic performance. While the building of such kinetic models has seen many advances,8-40 the complexity of real reaction systems can still overwhelm the capabilities of even the most recent optimization-based model building procedures. The difficulty in discriminating rival models and in determining kinetic parameters, even from designed experimental data, is exacerbated by kinetic complexity as well as by the fact that all experimental data contain error. Most approaches assume that the proposed model is true and that the data have a certain error structure, e.g., constant error or constant percentage error over the entire experimental range. Linear or nonlinear regression techniques are used to generate point estimates of the parameters. The uncertainty of the parameters is subsequently computed under the assumption that the model is linear in the neighborhood of these point estimates. There are several limitations with this approach. First, the model may be wrong, and thus the parameter estimates are corrupted by model bias. Even when the model is adequate, linearization of nonlinear models can lead to spurious confidence and joint confidence regions.41 An additional complication is the potential for generating multiple local optimal parameter sets depending on the starting "guesses" of the parameters supplied to the regression algorithm. The bottom line is that the existing point estimation methods frequently give the wrong parameters for the wrong model. In addition, they do not take advantage of the considerable a priori knowledge about the reaction system to design the experimental program.

* To whom correspondence should be addressed. E-mail: delgass@ecn.purdue.edu.

In contrast to traditional point estimate methods, Bayesian approaches make full use of the prior knowledge of the experimenter, do not require linearization of nonlinear models, and naturally develop probability distributions of the parameter estimates that are a more informative description of parameter values than point estimates. When combined with effective modeling of experimental error, Bayesian methods are ideally suited to kinetic modeling. The value of Bayesian approaches has been known for some time,11,12,14,16,18,22,42,43 but they require significant computational resources. However, recent advances in computational power have made their implementation viable.41 The purpose of this paper is to introduce the application of Bayesian methods for the development of microkinetic models. The emphasis is on identification of the most appropriate description of the reaction chemistry, i.e., the sequence of elementary steps, and on obtaining the best quantification of the associated kinetic parameters. While this level of detail may not be essential for some levels of reactor design, it is critical for catalyst design, where the objective of the research is to connect the various kinetic parameters in a microkinetic model with the molecular structure of the catalyst. In order to illustrate the principles and capabilities of the Bayesian approach and its differences from currently practiced point estimate approaches, we will analyze simulated sets of experimental data generated from known reaction equations. In an attempt to separate the capabilities of the method from the mathematical details of its execution, we will first present the modeling framework, introduce the models to be discriminated, and summarize the results. Then, when the advantages of the Bayesian approach are clear, we will describe the details of its implementation. Its ability to treat nonlinearity without approximation, its inherent recognition of error, and the effect these have on the probability distributions of the kinetic parameters are the reasons it can differentiate models and improve confidence intervals for the parameters. The price for these advantages is a higher, but manageable, computational load. Our intent is to make the computational requirements clear, but to emphasize the benefits in the analysis of catalytic kinetics that are the return on the investment in the mathematics.



Figure 1. Model building procedure.

2. Framework for Mathematical Model Building

Our approach will follow that of Reilly and Blau,20 illustrated in Figure 1, where sequential design, experimentation, and analysis of experimental data can be holistically integrated to provide an efficient and accurate approach for model discrimination and parameter estimation of chemical reaction systems. In summary, the first step in the model building process is model discrimination, where several physically meaningful models are postulated by the scientist to describe the reaction system being investigated. Then sets of experiments are designed and carried out in the laboratory to discriminate among these candidate models. The abilities of the various models to match these sets of data for various model parameter values are compared until the best one is found. If none of the models is deemed adequate, additional ones are postulated, and the sequential experimentation/analysis process is continued until a suitable one is found. Once an adequate model has been selected, the next step in the model building process is parameter estimation, where the quality of the parameter estimates is quantified by point estimates and their confidence regions. We note that there is a tendency among model builders and scientists to attribute meaning to these estimates before an adequate model is obtained. Clearly, parameter estimates for invalid models are meaningless. Consequently, the quality of parameter estimates can only be determined after an adequate model has been identified. Then, if the confidence regions for the parameters of that model are unacceptable, as quantified by their shape and size, additional experiments may be needed to improve their quality. The process of designing experiments, followed by the determination of parametric uncertainty, followed by additional experimentation is continued until parameters of acceptable quality are realized. Only after this process has been completed can the model be used for reactor design and analysis. In the special case of catalyst design, where the model parameters are paramount, they may now be used as the response variables in another model building exercise that relates these parameters to the structural and chemical descriptors of the catalyst.

The model building process (discrimination, validation, and parameter estimation) as described above is conceptually well-known, although sometimes applied incorrectly because the statistical and mathematical sophistication required is not fully appreciated. What is new in the Bayesian approach presented here is the simplicity of the approach and the natural way it

follows the logic of the sequential model building paradigm. Unfortunately, this simplicity is achieved at the expense of a computational burden that only recently has become manageable with the user-friendly computer software and hardware available on the desktop of a typical reaction modeling specialist. For complex models, i.e., large numbers of reactions and parameters, considerable skill with the mathematical aspects of the method is still necessary, as well as recourse to a server or cluster of servers to achieve the necessary computing power. The strength of the Bayesian approach is its ability to identify an adequate model with realistic parameter estimates by incorporating the prior knowledge of the scientist/experimentalist. Using this knowledge is anathema to statisticians, who correctly point out that we are "biasing" the results and not letting the data "speak for itself." This is precisely why these methods are being embraced by the engineering and scientific community, who do not want their expertise silenced by the vagaries of experimental analysis. In traditional approaches, the main value of the experience of the investigator is in the translation of that experience into initial guesses for the parameters. In the Bayesian approach, the belief of the experimentalist in parameter ranges and the plausibility of candidate models are specifically acknowledged in the implementation. It is noteworthy that as the amount of well-designed experimental data increases, the influence of the initial beliefs diminishes.

2.1. Example Problem. Before proceeding with the mathematical and statistical development of the Bayesian approach, we will first define a simple model reaction system A + B → C + D over a catalyst in a differential reactor. We assume that A, B, and C reversibly adsorb on the surface of the catalyst, D is not adsorbed, and the reaction is irreversible under the operating conditions studied. To represent typical laboratory conditions, we use a feed stream V0 = 100 cm³/min fed to a tubular reactor which contains w = 0.1 g of catalyst. The stream is composed of various input concentrations of A, B, and C represented as partial pressures PA0, PB0, and PC0 selected from the following ranges:

0.5 atm ≤ PA0 ≤ 1.5 atm
0.5 atm ≤ PB0 ≤ 1.5 atm
0 atm ≤ PC0 ≤ 0.2 atm

The temperature of the system can be changed over the range 630 K ≤ T ≤ 670 K.


The temperature range is a bit narrow, but it is representative of systems where complex reaction networks impose selectivity constraints. The outlet concentrations PA, PB, and PC in atmospheres are measured. From these values, the molar conversion, x, of the reaction is calculated and used to determine the rate of reaction from the design equation for a differential plug flow reactor:

$$ r = \frac{x V_0 C_{A0}}{w} = \frac{x V_0 P_{A0}}{w R T} \quad \text{(gmol/(min kg catalyst))} \tag{1} $$

where CA0 is the initial concentration of A in gram moles per cubic centimeter and R is the ideal gas law constant. The following three Langmuir-Hinshelwood models are postulated to describe the reaction.

Model 1 (A, B, and C are adsorbed):

$$ A + * \overset{K_1}{\rightleftharpoons} A^* \qquad B + * \overset{K_2}{\rightleftharpoons} B^* \qquad C + * \overset{K_3}{\rightleftharpoons} C^* \qquad A^* + B^* \overset{k_4}{\longrightarrow} C^* + D $$

$$ r = \frac{k_4 K_1 K_2 P_A P_B}{(1 + K_1 P_A + K_2 P_B + K_3 P_C)^2} \tag{2} $$

Model 2 (A and C are adsorbed):

$$ A + * \overset{K_1}{\rightleftharpoons} A^* \qquad C + * \overset{K_3}{\rightleftharpoons} C^* \qquad A^* + B \overset{k_4}{\longrightarrow} C^* + D $$

$$ r = \frac{k_4 K_1 P_A P_B}{1 + K_1 P_A + K_3 P_C} \tag{3} $$

Model 3 (B and C are adsorbed):

$$ B + * \overset{K_2}{\rightleftharpoons} B^* \qquad C + * \overset{K_3}{\rightleftharpoons} C^* \qquad A + B^* \overset{k_4}{\longrightarrow} C^* + D $$

$$ r = \frac{k_4 K_2 P_A P_B}{1 + K_2 P_B + K_3 P_C} \tag{4} $$

where S* denotes a species S adsorbed on the surface, k4 is the kinetic constant of the rate determining step, and the Ki are the equilibrium constants with proper units. The steps shown with double-ended arrows are assumed to be in quasi-equilibrium. It is known that one of the models is true because it is the one used to generate the simulated data. For simplicity, no additional models are considered. The temperature dependence of the parameters is assumed to follow a normal Arrhenius relationship:27

$$ k_4 = k_{40} \exp\!\left[-\frac{E_a}{R}\left(\frac{1}{T} - \frac{1}{T_0}\right)\right] \ \text{(gmol/(min kg catalyst))}, \qquad K_i = K_{i0} \exp\!\left[-\frac{\Delta H_i}{R}\left(\frac{1}{T} - \frac{1}{T_0}\right)\right] \ \text{(1/atm)}, \quad i = 1, 2, 3 \tag{5} $$

where k40 is the rate constant at the reference temperature T0 = 650 K, which is the middle of the experimental range, and Ea is the activation energy. The Ki0 are reference equilibrium constants at T0, and the ΔHi are the heats of adsorption for the corresponding equilibrium constants Ki. The challenge is to determine the most suitable model and obtain high quality parameter estimates from the simulated experimental data. Reaction rate data for the A + B → C + D example problem were generated using model 2, defined by eq 3, for experimental factors T, PA0, PB0, and PC0, with K10 = 1 (1/atm), K30 = 20 (1/atm), k40 = 3 gmol/(min kg catalyst), ΔH1/R = -17.5 × 10³ K, ΔH3/R = -20 × 10³ K, and Ea/R = 11 × 10³ K. The values were chosen to be representative, and the heats of adsorption were

adjusted to produce significant surface coverages. Experimental error was added to the rate r predicted by eq 3 and was assumed to be normally distributed with zero mean and a variance of 0.002 (gmol/(min kg catalyst))², i.e., a standard deviation of 4.5%. An intuitive experimental program used by many catalysis researchers is to change one experimental factor at a time, keeping all other factors at their nominal values. We will designate such an experimental program as the "one-variable-at-a-time" approach. The reaction rate r was simulated for the one-variable-at-a-time design with {T, PA0, PB0, PC0}. The resulting one-at-a-time data, shown in Table 1, constitute the initial set of experimental data D, with 33 points, used for analysis. It is assumed that the error in the measured values of r is constant over the entire experimental region. This assumption is rarely valid, i.e., errors tend to be related to the magnitude of the r values. However, we will assume constant error values in our example to keep the focus strictly on the Bayesian approach. We note that since the data were generated directly from the rate expression, changes in the rate determining step or most abundant surface intermediate over the data range are not possible in this example.

2.2. Model Building Formalism. The general problem, for which the above three models are specific cases, is to postulate and then select the best model M* from a set P = {M1, M2, ..., MP} of models using experimental data collected in a batch, continuously stirred tank reactor (CSTR), or plug flow reactor. For simplicity, we will restrict ourselves to modeling data obtained from CSTRs or differential reactors where the rate of reaction is measured or calculated directly. In a recent paper by Blau et al.,41 the more general problem of dealing with kinetic models consisting of differential equations used to characterize concentration versus time or space time data from batch and integral reactors is discussed. For the general case considered here, the models are related to the N experimental data points by

$$ M_k: \; r_i = f_k(\theta_k, u_i) + \varepsilon_i(\phi), \qquad i = 1, \ldots, N, \; k = 1, \ldots, P \tag{6} $$

where ri is the rate of the reaction for the ith experimental condition, fk(θk, ui) is the kth model with a Qk-dimensional vector of parameters θk, and ui is an R-dimensional vector of experimental conditions. The experimental error is described by the error model εi(φ), which will be discussed shortly. The data set is formally represented by D = {(ui, ri) | i = 1, 2, ..., N}. In our example problem, P = 3, and the parameters to be estimated are

Model 1: θ1 = (k40, Ea, K10, ΔH1, K20, ΔH2, K30, ΔH3)
Model 2: θ2 = (k40, Ea, K10, ΔH1, K30, ΔH3)
Model 3: θ3 = (k40, Ea, K20, ΔH2, K30, ΔH3)

Note that models 2 and 3 have two fewer parameters than model 1. The R = 4 dimensional vector of experimental conditions for the ith measurement is ui = (Ti, PA0i, PB0i, PC0i). The rate of reaction ri is estimated from the conversion xi and initial concentrations via eq 1. The conversion is obtained from the measured input and output concentrations in the differential reactor; hence, the estimated rate ri has variability or error associated with it. This error is represented by the error model εi(φ) in eq 6, which is a joint probability density function for the errors associated with the reaction rate ri at experimental conditions ui. φ is a V-dimensional vector of statistical error model parameters.
It is common to assume that the errors are unbiased so that the mean of the error εi(φ) is zero, while the parameters φ characterize the variability in the measured value.
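To make the preceding setup concrete, the following minimal Python sketch (ours, not code from the paper) simulates rate data from model 2 using the temperature dependence of eq 5 and the additive, unbiased Gaussian error model just described; all function names are our own, and the parameter values are those quoted above.

```python
# Sketch: simulate the A + B -> C + D example data with model 2 (eq 3).
import numpy as np

T0 = 650.0  # reference temperature, K
# "True" parameters used to generate the data (section 2.1)
k40, K10, K30 = 3.0, 1.0, 20.0             # gmol/(min kg cat), 1/atm, 1/atm
Ea_R, dH1_R, dH3_R = 11e3, -17.5e3, -20e3  # K

def rereference(p0, e_over_R, T):
    """Eq 5: value of an Arrhenius/van't Hoff constant at T, referenced to T0."""
    return p0 * np.exp(-e_over_R * (1.0 / T - 1.0 / T0))

def rate_model2(T, PA, PB, PC):
    """Eq 3: r = k4 K1 PA PB / (1 + K1 PA + K3 PC)."""
    k4 = rereference(k40, Ea_R, T)
    K1 = rereference(K10, dH1_R, T)
    K3 = rereference(K30, dH3_R, T)
    return k4 * K1 * PA * PB / (1.0 + K1 * PA + K3 * PC)

rng = np.random.default_rng(0)

def simulate_rate(T, PA, PB, PC, sigma2=0.002):
    """Add N(0, sigma2) measurement error, the epsilon_i(phi) of eq 6."""
    r = rate_model2(T, PA, PB, PC)
    return r + rng.normal(0.0, np.sqrt(sigma2), size=np.shape(r))
```

For example, `simulate_rate(650.0, 1.0, 1.0, 0.1)` reproduces the kind of center-point measurement that is repeated several times in Table 1.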


Table 1. Data Set D: Experimental Design Based on One-Variable-at-a-Time Approach

expr no.  T (K)  PA0 (atm)  PB0 (atm)  PC0 (atm)  PA (atm)  PB (atm)  PC (atm)  rate (gmol/(min kg catalyst))
1         650    1          1          0.1        0.989     0.989     0.111     0.657
2         640    1          1          0.1        0.992     0.992     0.108     0.596
3         630    1          1          0.1        0.994     0.994     0.106     0.474
4         640    1          1          0.1        0.992     0.992     0.108     0.652
5         650    1          1          0.1        0.989     0.989     0.111     0.683
6         660    1          1          0.1        0.986     0.986     0.114     0.841
7         670    1          1          0.1        0.982     0.982     0.118     0.951
8         660    1          1          0.1        0.986     0.986     0.114     0.894
9         650    1          1          0.1        0.989     0.989     0.111     0.701
10        650    1.25       1          0.1        1.239     0.989     0.111     0.848
11        650    1.5        1          0.1        1.488     0.988     0.112     1.044
12        650    1.25       1          0.1        1.239     0.989     0.111     0.791
13        650    1          1          0.1        0.989     0.989     0.111     0.701
14        650    0.75       1          0.1        0.740     0.990     0.110     0.536
15        650    0.5        1          0.1        0.491     0.991     0.109     0.404
16        650    0.75       1          0.1        0.740     0.990     0.110     0.593
17        650    1          1          0.1        0.989     0.989     0.111     0.641
18        650    1          1.25       0.1        0.987     1.237     0.113     0.887
19        650    1          1.5        0.1        0.984     1.484     0.116     1.003
20        650    1          1.25       0.1        0.987     1.237     0.113     0.952
21        650    1          1          0.1        0.989     0.989     0.111     0.615
22        650    1          0.75       0.1        0.992     0.742     0.108     0.502
23        650    1          0.5        0.1        0.995     0.495     0.105     0.290
24        650    1          0.75       0.1        0.992     0.742     0.108     0.512
25        650    1          1          0.1        0.989     0.989     0.111     0.669
26        650    1          1          0.05       0.989     0.989     0.061     0.893
27        650    1          1          0          0.988     0.988     0.012     1.293
28        650    1          1          0.05       0.989     0.989     0.061     0.921
29        650    1          1          0.1        0.989     0.989     0.111     0.646
30        650    1          1          0.15       0.990     0.990     0.160     0.526
31        650    1          1          0.2        0.991     0.991     0.209     0.495
32        650    1          1          0.15       0.990     0.990     0.160     0.559
33        650    1          1          0.1        0.989     0.989     0.111     0.748

The proper identification and application of the error model is a step often overlooked in the modeling process. However, the determination of the error model is every bit as important as the determination of the kinetic model, because selecting an acceptable kinetic model and estimating the kinetic model parameters depend critically on the quality of the experimental data as quantified by the error model. It should be pointed out that eq 6 assumes that all kinetic models adequately describe the data. However, all models, including even the "best model" M*, are only an approximation of reality and are probably biased. Thus, the experimental error εi(φ) may be confused or "confounded" with kinetic modeling error. In what follows, each model will first be assumed to be true, where deviations between the model predictions and the data are unbiased estimates of experimental error. Given a set of data, the first step in the model building problem is to use this data to discriminate between the proposed models. The conventional way this is accomplished is to find the best, in some statistical sense, point estimate of the parameters for each model candidate and then compare the models using these best estimates. Generating such estimates is a challenging task, since the kinetic model equations are nonlinear in the parameters, requiring the use of nonlinear regression techniques which may lead to local or false optima that represent incorrect parameter estimates. Even when the best or global optimum is obtained, comparing models at their best single set of parameters can sometimes give very misleading results, playing havoc with the model building process.19 We demonstrate in section 3 below the pitfalls of using point estimates for the simple kinetic problem described in section 2.1, before developing an alternative Bayesian approach in section 4.

3. Evaluation of Models by Nonlinear Regression Analysis

Nonlinear regression analysis is widely used for estimating the parameters of kinetic models. Here, different values of the parameters are selected to minimize the sum of squares of the differences, or residuals, of one of the models defined by eq 6, assuming that it is correct, using an iterative search procedure (e.g., the Levenberg-Marquardt optimization method44). The parameter values in the set that minimizes this sum of squares are called the least-squares parameter estimates. If repeat experiments are available for each data point, then the method automatically accounts for experimental error. It has been generally agreed by practitioners in this field that the only assurance of achieving the "optimal" point estimate solution is to supply a starting guess sufficiently close to the optimal solution.19 Even then there is the possibility of stopping short of the best value or indeed finding a multiplicity of solutions, because (1) the kinetic model may be incorrect or (2) the necessary information to provide meaningful parameter estimates may not be available due to poorly designed experiments. To demonstrate the real plausibility of obtaining false optima with the nondesigned but conventional one-at-a-time approach to experimentation, random initial estimates were supplied to the nonlinear regression program. Different sets of parameter estimates giving the same sum of squared error (SSE) are reported in Table 2. At first glance, these results are somewhat surprising, not only in that there are multiple minima, but also in that they have essentially the same residual sums of squares. This implies that the different sets of parameter values, even the wrong ones, would allow the models to fit the data equally well. Regardless of the statistical criteria used, it would be impossible to distinguish or discriminate the models, yet we know that the data was generated by one of the models.
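The multiple local optima of Table 2 are easy to reproduce. The sketch below (ours, not the authors' code) runs a multistart nonlinear least-squares fit of model 1, drawing random starting guesses over wide parameter ranges; the arrays `T`, `PA0`, `PB0`, `PC0`, and `r_obs` are assumed to hold the columns of Table 1, e.g., as generated with the simulation sketch above.

```python
# Sketch: multistart least squares on model 1, exposing multiple local minima.
import numpy as np
from scipy.optimize import least_squares

T0 = 650.0

def rate_model1(theta, T, PA, PB, PC):
    """Eq 2 with the eq 5 temperature dependence; theta holds log constants."""
    lnk40, Ea_R, lnK10, dH1_R, lnK20, dH2_R, lnK30, dH3_R = theta
    val = lambda lnp0, e: np.exp(lnp0 - e * (1.0 / T - 1.0 / T0))
    k4, K1, K2, K3 = (val(lnk40, Ea_R), val(lnK10, dH1_R),
                      val(lnK20, dH2_R), val(lnK30, dH3_R))
    return k4 * K1 * K2 * PA * PB / (1.0 + K1 * PA + K2 * PB + K3 * PC) ** 2

rng = np.random.default_rng(1)
lb = [-10, 5e3, -10, -50e3, -10, -30e3, -10, -30e3]   # wide ranges (E, dH in K)
ub = [ 10, 40e3,  10, -10e3,  10, -10e3,  10, -10e3]
fits = []
for _ in range(20):                                    # 20 random starting guesses
    theta0 = rng.uniform(lb, ub)
    sol = least_squares(lambda th: rate_model1(th, T, PA0, PB0, PC0) - r_obs, theta0)
    fits.append((np.round(sol.x, 2), round(float(np.sum(sol.fun ** 2)), 4)))
# Distinct parameter sets in `fits` with (near-)identical SSE mirror Table 2.
```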


Table 2. Parameter Estimates from Nonlinear Regression

Local Optimal Parameter Sets for Model 1

ln(k40)  Ea/R (10³ K)  ln(K10)  ΔH1/R (10³ K)  ln(K20)  ΔH2/R (10³ K)  ln(K30)  ΔH3/R (10³ K)  SSE
25.42    34.04         -2.46    0.00           -22.45   -26.34         1.39     -0.80          0.0585
15.61    29.28         -2.46    0.00           -12.64   -21.54         1.39     -0.75          0.0585
25.18    17.26         -2.46    0.00           -22.21   -9.55          1.39     -0.80          0.0585
13.79    27.69         -2.46    0.00           -10.82   -20.17         1.39     -1.11          0.0585

Local Optimal Parameter Sets for Model 2

ln(k40)  Ea/R (10³ K)  ln(K10)  ΔH1/R (10³ K)  ln(K30)  ΔH3/R (10³ K)  SSE
1.85     7.97          -1.20    -1.71          2.53     -3.25          0.0558
1.84     11.09         -1.18    -5.57          2.54     -3.88          0.0558
1.85     7.57          -1.20    -0.69          2.53     -2.27          0.0558
1.97     8.03          -1.35    0.00           2.50     -0.24          0.0558

Local Optimal Parameter Sets for Model 3

ln(k40)  Ea/R (10³ K)  ln(K20)  ΔH2/R (10³ K)  ln(K30)  ΔH3/R (10³ K)  SSE
45.38    30.26         -44.99   -22.58         2.28     -1.01          0.0600
41.46    34.66         -41.06   -26.97         2.28     -1.01          0.0600
28.93    40.00         -28.53   -32.42         2.28     -1.21          0.0600
29.62    8.02          -29.22   0.00           2.28     -0.38          0.0600

The solution to this conundrum is to note that only some of the parameter values are different for the different minima. Let us look at the two models that were not used to generate the data in Table 2, namely, models 1 and 3. For model 1, ln(K30) and ln(K10) are the same for the four different local optima, while some of the other parameters change by orders of magnitude. Similarly, ln(K30) is constant while the other parameters take on various values for model 3. Even for model 2, ln(K30) is well-defined while the other parameters change, but to a much lesser degree than with models 1 and 3. This implies that the data give a great deal of information about the adsorption of C on the surface of the catalyst but little else; changing the amount of C in the feed swamped the effect of changes in the other experimental conditions. In fact, the first nine data points, which account for the temperature changes, are insufficient to provide meaningful estimates for the activation energy and adsorption coefficients defined by eq 5. This is the consequence of (1) a poorly designed experiment, i.e., the experimental conditions are changed individually (one-at-a-time approach), and (2) the use of a nonlinear regression approach in which the uncertainties in the parameter estimates are not recognized. Some nonlinear parameter estimation programs attempt to estimate this uncertainty, but they fail badly because they do not properly account for the nonlinearities of the model about the optimal point estimates. As we will show, the Bayesian approach properly accommodates the uncertainty in parameter estimation, but there is no solution for a poorly designed experiment. Before leaving this section, it is worth commenting on experimental error. Examination of Table 1 shows that the center point of the design is repeated five times and some of the points are repeated twice. These repeat points may be used to estimate experimental error at different points in the operating region. Seldom is any attempt made to model this experimental error, and repeat points are simply used to weight the data differently. We show how to do this in Appendix A. The greater the size of the operating region, the better the ability to discriminate models and improve parameter estimates, since the results are not confounded by experimental error and the different models have more operating space in which to show their divergence from one another. Consequently, the extremes of the experimental conditions are key points in any experimental program, despite the challenges of running the equipment under such conditions.

4. Bayesian Approach to Parameter Estimation

Bayesian methods have been suggested for building models of reaction systems for over thirty years.20,45 However, they have not been adopted by the catalyst model building community because of the computational challenges associated with using them properly. Fortunately, the power and cost effectiveness of high speed computation are making it possible for the researcher to exploit this important modeling tool. The Bayesian approach is fundamentally different from regression methods. Instead of finding single optimal point estimates of the model parameters to describe a given set of data, it uses methods that identify "regions" where one believes the true values of the parameters lie. Rather than review the details of the Bayesian approach reported in the literature,47 it will be sufficient to define the salient terms necessary to allow presentation of the results for our sample problem. The first quantity that must be defined is p(εi), which is the joint probability density function for the errors in the data, D, for all N experiments. This term acknowledges the existence of error in the data and accounts for it explicitly through an error model function εi(φ) and the associated error model parameter vector φ. The second quantity is also a probability distribution, called the likelihood function, L(D|(θk, φ)), which is the "likelihood" that model k with its kinetic parameters θk generated the data D. This function is derived in Appendix A for a general model building system. Figure 2 gives some insight into this interpretation. It consists of a simple plot of a set of five (yi, xi) data points shown as solid dots on the graph, where y is the response variable and x is the independent variable. At any value of xi, many different values of yi are experimentally plausible. The probability distribution for these points is sketched as a probability distribution p(yi) shown on the graph. Note that these probability distributions can change from point to point on the graph. Several models could be used to "generate" the five data points. When a model is used to predict the response y for a given value of x, this prediction is not a point but a probability distribution. Consequently, we can use this probability distribution to measure the probability that the model generated the measured response value yi. For example, the simplest model for the data of Figure 2 would be a straight line of the form y = θ1 + θ2x. For any set of values of θ̂1 and θ̂2, a straight line could be drawn and the "probability assessed" that it generated the observed five points by comparing the observed and model


predicted values, assuming the model is correct and the error model is operative. This probability is called the likelihood. The dotted straight line in Figure 2 is one such instance. By finding the set of parameter values θ1* and θ2* which maximize this probability, we have point estimates called the maximum likelihood estimates. In this figure, we have shown the location of the simulated data points as dashed vertical lines on the probability distributions. A similar analysis can be performed for any additional model. In Figure 2, for example, the mean predicted values with maximum likelihood estimators for a simple curvilinear model with an additional parameter are shown as the solid line. It is clear that the curvilinear model is more apt to generate the five data points than the straight line, since the fit of the data to the calculated values is better. That is, the maximum likelihood value of generating the data is higher. The mean values predicted by the curvilinear model are shown as solid vertical lines on the probability distributions. Note that these values are closer to the mean values of the distributions than those for the linear model. In the general case, the likelihood point estimate θk*, which maximizes the likelihood function for known values of the error model parameters φ, can be obtained by the nonlinear search approach described in section 3. There usually is some knowledge of the values of θk and φ a priori (i.e., before the data is collected), e.g., the rate constants are in a specific range, the activation energy must be greater than a minimum value, the equilibrium constant is greater than one, etc. This knowledge is captured in a third probability distribution called the prior distribution p(θk, φ). For example, if there is only one rate constant k1 in the model, i.e., θ1 = {k1}, and if it is known a priori that k1,min < k1 < k1,max, then

$$ p(\theta_1) = \begin{cases} 0 & k_1 < k_{1,\min} \\[4pt] \dfrac{1}{k_{1,\max} - k_{1,\min}} & k_{1,\min} \le k_1 \le k_{1,\max} \\[4pt] 0 & k_1 > k_{1,\max} \end{cases} $$
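As a concrete illustration (ours, and only a sketch under the stated assumptions of a uniform box prior and unbiased Gaussian errors), the ingredients that Bayes' theorem combines can be coded directly:

```python
# Sketch: log prior, log likelihood, and unnormalized log posterior.
import numpy as np

def log_prior(theta, lb, ub):
    """Uniform box prior: constant inside the bounds, zero (log = -inf) outside."""
    theta, lb, ub = np.asarray(theta), np.asarray(lb), np.asarray(ub)
    if np.any(theta < lb) or np.any(theta > ub):
        return -np.inf
    return -np.sum(np.log(ub - lb))

def log_likelihood(theta, sigma2, model, u, r_obs):
    """ln L(D|theta, phi) for i.i.d. N(0, sigma2) errors; sigma2 plays the role of phi."""
    resid = r_obs - model(theta, u)
    n = resid.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid**2) / sigma2

def log_posterior(theta, sigma2, model, u, r_obs, lb, ub):
    """Numerator of the Bayes relation developed below (eq 8); the denominator
    is a constant once the data D are fixed."""
    lp = log_prior(theta, lb, ub)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, sigma2, model, u, r_obs)
```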

Table 3 specifies a set of uniform prior probability distributions for the parameters of the three models in our example problem from section 2.1. Note that UB stands for upper bound and LB stands for lower bound. We are now ready to apply Bayes' theorem for parameter estimation. The Bayesian approach is a natural way of thinking for scientists and engineers, since it mimics the scientific method. In the parameter estimation application, the model builder postulates that the parameters follow some prior distribution. A new data set D is then collected. This data set is then used to confirm or modify the prior belief about the parameters.

Figure 2. Schematic of likelihood function L(D|(θk,φ)).

Table 3. Bounds of the Uniform Prior Probability Distributions of the Parameters in the Three Proposed Models

                 model 1     model 2     model 3
parameter        LB    UB    LB    UB    LB    UB
ln(k40)         -10    10   -10    10   -10    10
Ea/R (10³ K)      5    40     5    40     5    40
ln(K10)         -10    10   -10    10     -     -
ΔH1/R (10³ K)   -50   -10   -50   -10     -     -
ln(K20)         -10    10     -     -   -10    10
ΔH2/R (10³ K)   -30   -10     -     -   -30   -10
ln(K30)         -10    10   -10    10   -10    30
ΔH3/R (10³ K)   -30   -10   -30   -10   -30   -10

Specifically, it quantifies the improvement in our knowledge about the parameters of the model by weighting the prior information with the new data D generated by the experimental program. This new knowledge about the parameters is captured by another probability distribution called the "posterior" distribution, reflecting the fact that it represents our belief in the parameters "after" the data have been generated. Formally, Bayes' theorem states that this posterior distribution p((θk, φ)|D) is related to the product of the prior distribution and the likelihood function by the proportionality

$$ p((\theta_k, \phi)|D) \propto L(D|(\theta_k, \phi))\, p(\theta_k, \phi) \tag{7} $$

This proportionality can be turned into an equation, so that the posterior distribution can be calculated directly, by normalizing the probability distribution by its integral over the allowable parameter space:

$$ p((\theta_k, \phi)|D) = \frac{L(D|(\theta_k, \phi))\, p(\theta_k, \phi)}{\displaystyle\int_{\theta_k}\!\int_{\phi} L(D|(\theta_k, \phi))\, p(\theta_k, \phi)\, \mathrm{d}\phi\, \mathrm{d}\theta_k} \tag{8} $$

The Bayesian approach directly incorporates both knowledge about the error in the data (either via repeats or via an error model which is an inherent part of the likelihood function; see Appendix A) as well as prior information about the model parameters. It produces not a single value or point estimate for a parameter, as in regression analysis, but rather a joint probability distribution for the parameter set, which defines confidence regions that can be used to assess the quality of the parameter estimates, including their interactions. Another advantage of the Bayesian approach is that the parameters φi of the error model can be determined simultaneously with the kinetic model parameters once the error model has been specified. However, with high throughput experimentation, repeat experiments at different experimental conditions should be included in D so that the error model parameters can be estimated "before" the kinetic model parameters, reducing the computational burden and validating the form of the error model. Before illustrating the advantages of the Bayesian approach over the point estimate regression method, it is necessary to point out the major limitation of the former and the reason it has not been generally applied despite the obvious limitations of the latter. It is a nontrivial exercise to integrate the denominator of eq 8. The conventional approach is to use Monte Carlo sampling procedures, which will converge to the value of the definite integral if a sufficiently large number of parameter values is sampled. The problem is the dimensionality of the parameter space, which is equal to the number of parameters in the kth model (Qk) plus the number of parameters in the error model (V). For small models, Qk + V ≤ 10, this is computationally feasible for a well-defined prior distribution (e.g., uniform or normal). However, for larger model systems, simple Monte Carlo is too slow.


Figure 3. Marginal posterior probability density functions of the parameters in the different models, given data set D: (dashed line) θj,max; (dash-dotted line) MPPDE. The units of Ea/R, ΔH1/R, ΔH2/R, and ΔH3/R are 10³ K.

Fortunately, the pioneering work of Metropolis et al.46 and its recent rediscovery by Gilks47 has led to the implementation of Markov chain Monte Carlo (MCMC) sampling procedures, which guide the selection of sampling points by conditioning each selection on the current value of the point available. The description of this sampling procedure has been relegated to Appendix B so as not to distract the experimentalist from application issues. However, it should be pointed out that it is conceptually different from a simple Monte Carlo procedure. Rather than actually evaluating the integral in the denominator of eq 8, the MCMC scheme converges to a process which samples from the posterior probability distribution. These samples can be used to calculate statistical moments of the posterior distribution, such as mean values, as well as to determine "shapes" of the posterior distribution from which joint confidence regions for the parameter values can be calculated. One particularly useful aid to the modeler/experimentalist in visualizing what is happening is the marginal distribution for a single parameter. It can be readily extracted from the posterior distribution p((θk, φ)|D) by integrating out the dependence on all the other parameters as follows:

$$ p(\theta_{k,j}) = \int_{\theta'_{k,j}} \int_{\phi} p(\{\theta_k, \phi\}|D)\, \mathrm{d}\theta'_{k,j}\, \mathrm{d}\phi, \qquad \theta'_{k,j} = \{\theta_{k,l} \,|\, l \ne j\} \tag{9} $$

$$ p(\phi_j) = \int_{\theta_k} \int_{\phi'_j} p(\{\theta_k, \phi\}|D)\, \mathrm{d}\theta_k\, \mathrm{d}\phi'_j, \qquad \phi'_j = \{\phi_l \,|\, l \ne j\} \tag{10} $$
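A minimal random-walk Metropolis sampler (our sketch, not the algorithm of Appendix B) shows how the posterior of eq 8 can be sampled without ever evaluating its denominator, and how the marginals of eqs 9 and 10 then reduce to histograms of single chain coordinates:

```python
# Sketch: random-walk Metropolis sampling of an unnormalized log posterior.
import numpy as np

def metropolis(log_post, x0, step, n_samples, seed=2):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_samples, x.size))
    for i in range(n_samples):
        prop = x + step * rng.normal(size=x.size)   # symmetric proposal
        lp_prop = log_post(prop)
        # Accept with probability min(1, posterior ratio); only ratios are needed,
        # so the normalizing integral of eq 8 never has to be computed.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        chain[i] = x
    return chain

# After discarding burn-in, the marginal density of parameter j (eq 9) is just a
# histogram of that coordinate:
# hist, edges = np.histogram(chain[5000:, j], bins=50, density=True)
```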

It must be remembered that integrating out the effects of all other parameters hides any correlation with those parameters. We are now ready to demonstrate these advantages of the Bayesian approach by estimating the parameters of the three models defining the simple A + B → C + D example introduced in section 2.1. The marginal probability distributions defined by eqs 9 and 10 for the various parameters of all three models are shown in Figure 3. Some of the marginal distributions, like those for K30 and σ², are symmetrical with a single well-defined peak; however, other distributions, like those for ΔH2 and ΔH3, are highly skewed. The traditional approach misses all of this information, since it assumes that the distributions of the parameters are Gaussian, with the maximum likelihood estimates being the mean values.

Figure 4. 100(1 - α)% HPD confidence interval for a parameter.

The skewed nature of the parameter estimates demonstrates the need for the Bayesian approach and should be no surprise considering the inherently nonlinear features of chemical kinetics. Figure 3 also shows the maximum posterior probability density estimate (MPPDE), which is the value of that particular parameter at the maximum probability density value in the (θk, φ) hyperspace. This is not necessarily the mean of the marginal distribution, but it represents the most likely value for the parameter if a single point estimate needs to be selected. Specification of confidence limits for these non-Gaussian marginal distributions is more involved. There exist an infinite number of confidence limits that can be placed on the parameters. A commonly used confidence limit, called the highest probability density (HPD) confidence limit, is the smallest of all confidence regions that enclose the desired percentage of the parameter space. This is shown schematically in Figure 4. Note that (i) the expectation of p(θk,j) or p(φj) is not necessarily at the peak of the p(θk,j) or p(φj) distribution and (ii) the upper and lower bounds, θk,j^(u) and θk,j^(l), are not symmetric around either the peak or the expectation of the distribution. The expectation of p(x) is calculated by the average value of all the samples of x.


Table 4. Estimated Parameters and Corresponding 95% Confidence Intervals for the Model Candidates, Data Set D and Uniform Prior Probability Distribution

model 1
parameter        MPPDE   θj,max  95% LB  95% UB
ln(k40)          8.98    9.94    5.93    9.94
Ea/R (10³ K)     32.4    29.0    25.2    39.8
ln(K10)          -2.75   -2.38   -5.46   -1.31
ΔH1/R (10³ K)    -11.0   -12.6   -25.5   -10.2
ln(K20)          -5.75   -5.13   -7.34   -2.66
ΔH2/R (10³ K)    -19.9   -17.0   -29.8   -10.2
ln(K30)          1.37    1.36    1.19    1.56
ΔH3/R (10³ K)    -10.1   -14.5   -21.1   -10.2
σ² × 10³         1.84    2.42    1.47    4.48
ln[E{L(M1|D)}]   53.63
Pr(M1|D)         0.135

model 2
parameter        MPPDE   θj,max  95% LB  95% UB
ln(k40)          1.74    1.81    0.69    9.04
Ea/R (10³ K)     12.5    19.7    9.4     27.5
ln(K10)          -1.05   -1.57   -8.82   0.70
ΔH1/R (10³ K)    -10.7   -18.7   -26.4   -10.0
ln(K30)          2.57    2.30    2.00    3.06
ΔH3/R (10³ K)    -10.1   -10.3   -23.0   -10.0
σ² × 10³         1.67    2.13    1.31    4.12
ln[E{L(M2|D)}]   55.47
Pr(M2|D)         0.848

model 3
parameter        MPPDE   θj,max  95% LB  95% UB
ln(k40)          8.03    4.48    2.43    10.0
Ea/R (10³ K)     20.9    22.9    19.9    26.1
ln(K20)          -7.65   -4.12   -9.76   -1.06
ΔH2/R (10³ K)    -22.3   -27.6   -30.0   -21.3
ln(K30)          2.25    2.36    1.54    2.95
ΔH3/R (10³ K)    -20.3   -20.8   -29.2   -19.3
σ² × 10³         2.14    5.79    1.02    90.0
ln[E{L(M3|D)}]   51.60
Pr(M3|D)         0.018

The 100(1 - α)% highest probability density confidence limit is defined by

$$ \Pr(\theta_{k,j}^{(l)} \le \theta_{k,j} \le \theta_{k,j}^{(u)}) = \int_{\Omega_{1-\alpha}} p(\theta_{k,j})\, \mathrm{d}\theta_{k,j} = 1 - \alpha \tag{11} $$

$$ \Pr(\phi_j^{(l)} \le \phi_j \le \phi_j^{(u)}) = \int_{\Omega_{1-\alpha}} p(\phi_j)\, \mathrm{d}\phi_j = 1 - \alpha \tag{12} $$

where Ω1-α is the HPD region, in which the probability density of any point is larger than that of any point in the complement of Ω1-α. Once α is specified, the upper and lower values of the parameters are found by simply examining points from the marginal distribution p(θk,j) or p(φj). If α is sufficiently small and the distribution is unimodal, the HPD confidence limits of eq 11 will contain various point estimates, such as the expected value of p(θk,j) and θk,j,max, i.e., the value of θk,j at the peak in the marginal probability distribution. By definition, the expected value of θk,j is the first moment of the marginal distribution

$$ E(\theta_{k,j}) = \int \theta_{k,j}\, p(\theta_{k,j})\, \mathrm{d}\theta_{k,j} \approx \sum_{\theta_{k,j}} \theta_{k,j}\, p(\theta_{k,j}) \tag{13} $$
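From MCMC draws of a single parameter, the HPD interval of eqs 11 and 12 is, for a unimodal marginal, the shortest interval containing a fraction 1 - α of the samples; the following sketch (ours) computes it directly:

```python
# Sketch: 100(1 - alpha)% HPD interval from posterior samples of one parameter.
import numpy as np

def hpd_interval(samples, alpha=0.05):
    s = np.sort(np.asarray(samples, dtype=float))
    n = s.size
    m = int(np.floor((1.0 - alpha) * n))   # number of samples inside the interval
    widths = s[m:] - s[:n - m]             # widths of all intervals holding mass 1-alpha
    j = int(np.argmin(widths))             # the shortest such interval is the HPD
    return s[j], s[j + m]
```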

It is quite conceivable that the three point estimates we have defined, namely, MPPDE, E(θk,j), and θk,j,max, will be significantly different. Which one should be used and reported as "the" point estimate? The answer is ambiguous and is being argued among statisticians and scientists alike. Because of this controversy, it would be preferable to simply report the HPD confidence limits without specifying some arbitrary point estimate, although this would be a radical departure from current practice. The MPPDE and the upper and lower bounds at 95% confidence are given in Table 4 for the three models using the nondesigned one-variable-at-a-time data given in Table 1. These results are quite different from the nonlinear regression analysis shown in Table 2. The HPD confidence region is quite large for a number of parameters, and the θk,j,max and MPPDE of the posterior probability distribution are different. As before, the poorly designed data set accounts for much of the uncertainty in the activation energies. However, we shall not agonize over any interpretation of the model parameters until discrimination between the various candidate models has been completed and an adequate model selected. Since the posterior distribution, which is calculated in the foregoing by Bayes' theorem, is biased or influenced by the prior distribution, it is natural to question the impact of the form and quality of this distribution. In Appendix C, we address this question for the simple problem by quantifying the effects of different prior distributions on the posterior distributions for the same data set, D.

5. Model Discrimination and Validation

Now that Monte Carlo based Bayesian methods have been defined to generate confidence regions where we believe the

parameters are located for each of the candidate models, the next task is to determine which one of the candidate models best describes the data, i.e., model discrimination. Model discrimination is possible once the posterior probability distribution for the parameters p({θk, φ}|D) is available for each model or, equivalently, once it is possible to sample from the posterior probability distribution. Once again, Bayes' theorem can be applied. The first step is to assign discrete prior probabilities Pr(M1), Pr(M2), ..., Pr(MP) to each of the models such that Σk Pr(Mk) = 1. These prior probabilities can be based on subjective opinions of the experimentalist/scientist, literature data, or simple exploratory analysis of any available data. It is important to select these prior probabilities "before" the actual experimental program, such as the data shown in Table 1, is conducted. Once this data set D becomes available, the posterior probability of the models Pr(Mk|D) is determined by applying Bayes' theorem, this time on the models themselves, using a different likelihood function, L(D|Mk):

$$ \Pr(M_k|D) \propto L(D|M_k)\, \Pr(M_k) \tag{14} $$

Here the likelihood function L(D|Mk) is interpreted as the likelihood of the model generating the data set. Since Σk Pr(Mk|D) = 1, the proportionality can be turned into an equation by the use of another normalization factor, so that

$$ \Pr(M_k|D) = \frac{L(D|M_k)\, \Pr(M_k)}{\sum_k L(D|M_k)\, \Pr(M_k)} \tag{15} $$

Once the posterior probability is known, the preference for the different models is quantified directly. However, what is L(D|Mk)? It is simply E(L(D|{θk, φ})), i.e., the expected value of the likelihood function for model k, where the dependence on the (θk, φ) model parameters has been integrated out so that the relationship is valid over all of parameter space. Specifically, the expected value of the likelihood function over the posterior probability distribution of the parameters is given by

$$ L(D|M_k) = E(L(D|\{\theta_k, \phi\})) = \int_{\theta_k} \int_{\phi} L(D|\{\theta_k, \phi\})\, p(\{\theta_k, \phi\}|D)\, \mathrm{d}\phi\, \mathrm{d}\theta_k \tag{16} $$
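In code, eq 16 is a Monte Carlo average of the likelihood over the posterior draws already produced for each model, and eq 15 is then a normalization. The sketch below (ours) works in log space to avoid underflow, with `chains[k]` the posterior samples and `logliks[k]` the log-likelihood function for model k (both hypothetical names):

```python
# Sketch: posterior model probabilities via the expected likelihood (eqs 15-16).
import numpy as np
from scipy.special import logsumexp

def log_expected_likelihood(chain, loglik):
    """ln E[L(D|{theta, phi})], the expectation taken over posterior samples (eq 16)."""
    ll = np.array([loglik(sample) for sample in chain])
    return logsumexp(ll) - np.log(len(ll))

def model_probabilities(chains, logliks, prior_probs):
    """Eq 15: Pr(Mk|D) for each model, given discrete model priors Pr(Mk)."""
    log_post = np.array([log_expected_likelihood(c, l) + np.log(p)
                         for c, l, p in zip(chains, logliks, prior_probs)])
    return np.exp(log_post - logsumexp(log_post))

# e.g. model_probabilities(chains, logliks, [1/3, 1/3, 1/3]) for the three candidates
```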

This expected value of L(D|Mk) is computed via sampling from the posterior probability distribution for the parameters determined earlier using the MCMC process discussed in Appendix B. Thus, only minimal additional computational effort is required. We emphasize that the Bayesian approach to model discrimination is fundamentally different from the more traditional single point regression based approach. In the regression-based approach, the likelihood value for each model is determined at a single point, the maximum likelihood estimate for the model,


and error model parameters. The models are compared in a pairwise fashion using the ratio of the likelihood of each model versus the model with the highest likelihood. This likelihood ratio method20 may be wildly misleading, as seen earlier, because it suggests preference for one model over the other but only at a single point for each model. Intuitively, the model that is preferred over the others is selected as the best candidate. It is then compared with the experimental error using a conventional lack-of-fit F-test to see if it is adequate to describe the data. This entire procedure is mentioned in passing only. It is flawed for the same reason alluded to earlier: it compares the models at one point only, and in the case of the data set from Table 1, multiple parameter sets can generate the same values of the likelihood ratios (see Table 2), leading to the question of which point estimate set should be utilized. Alternatively, the Bayesian approach compares models by sampling from the entire prior distribution hyperspace in the determination of the likelihood function and the associated posterior model probability Pr(Mk|D). We are now ready to revisit the A + B → C + D example to perform model discrimination, but still using the nondesigned one-variable-at-a-time data set of Table 1. Assuming a uniform prior distribution for the three models, Pr(M1) = Pr(M2) = Pr(M3) = 1/3, Pr(Mk|D) was determined via eq 15, using the procedure described above, and the results are shown in Table 4. Note that for model 3, Pr(M3|D) = 0.018, which suggests there is a very low probability of this being the model that generated the data of Table 1. Although model 2 is strongly preferred to model 1, it is not possible to reject the latter with the existing data. It is interesting that even with this nondesigned data set the analysis shows a strong preference for model 2 over model 1. This is definitely not the case for the traditional approach, where all three models fit the data equally well; see the SSE values in Table 2. The discrete posterior probabilities for the model candidates Pr(Mk|D) only show the preference among the candidate models. However, they do not indicate whether any of the models, even the one with the highest probability, adequately describes the data. If the entire experiment has been replicated, the conventional approach to this question is the lack-of-fit test, comparing the experimental error and the prediction error based on the "best" point estimate of the parameters. If replicates are not available, Bayesian methods can still measure the adequacy of a model, since an error model has been assumed and estimates of the error model parameters are available. Since the model parameters are represented by the joint posterior probability distribution p({θk, φ}|D), the model prediction at any point uj will be a probability distribution as well. By sampling from this distribution with the MCMC process, the distribution for the predicted rate of reaction r̂k,j for model k for the jth experiment, including experimental error, is determined using eq 6. Recall that ε(φ) is the error model and is assumed to follow a normal distribution with zero mean and variance σ²(φ). A schematic illustrating how to compute the confidence that a model could generate a specific observed rate is shown in Figure 5. The probability density for the predicted rate r̂ at a particular set of conditions is represented by the curve.
We quantify our confidence in terms of the probability that the model could generate the observed rate r′ by taking one minus the area under the density that is marked off from the observed rate to the rate with equal density on the other side of the mode. If one imagines an experimental rate occurring that is equal to the mode, the confidence that the model could generate it would be equal to one; in contrast, for a point far into a tail of the density, the confidence would approach zero. A bar chart of the confidence values for all of the experiments given in Table 1 for models 1 and 2 is given in Figure 6, where low values mean a low probability that the model was able to generate that specific experimental point.

Figure 5. Schematic illustration of how to compute the confidence that at a given set of conditions the experimentally observed rate r′ could be generated by the model. The confidence is quantified as the probability whose numeric value is equal to the area of the unshaded region.
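A sketch of this calculation (ours): given posterior predictive samples of the rate at the conditions of interest (model prediction plus simulated measurement error), the confidence is the predictive mass lying at densities no higher than the density at the observed rate, so it is near one at the mode and tends to zero deep in a tail.

```python
# Sketch: confidence that a model could generate an observed rate (Figure 5 idea).
import numpy as np
from scipy.stats import gaussian_kde

def adequacy_confidence(pred_samples, r_observed):
    kde = gaussian_kde(pred_samples)     # smooth estimate of the predictive density
    d_obs = kde([r_observed])[0]         # density at the observed rate
    d_samples = kde(pred_samples)        # density at each predictive sample
    # Fraction of predictive mass whose density does not exceed that at r_observed:
    return float(np.mean(d_samples <= d_obs))
```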

For all the experimental measurements, 1 - α > 0.05, so both models 1 and 2 adequately describe the data in Table 1. Note that α is a positive number less than 1 and that 100(1 - α) is the percent confidence limit. By using the probabilities of generating the individual data points, an overall probability of the ability of the model to generate all the data can be used as a quantitative index of model adequacy. Since the individual data points are assumed to be statistically independent, the overall index of model adequacy for the mth model, pm(D), is calculated by multiplying together the probabilities of generating each data point. For example, pm(D) for models 1 and 2, according to Figure 6, are p1(D) = 1.59 × 10⁻¹¹ and p2(D) = 1.85 × 10⁻¹¹, respectively. If sufficient replicate data are available, the probability of a specific data point r can be determined using the same methodology as described in the previous paragraph, and the overall probability pd(D) for a specific data set D is just the product of the probabilities of the individual p{r(uj)}. Theoretically, pm(D) must be less than pd(D), but if pm(D) is approximately pd(D), we can conclude that there is no significant lack of fit for the model at that point. We will not implement this procedure in this paper, since there are insufficient replicate points in the one-at-a-time data set given in Table 1. However, if significant replicated data are available via high throughput experimentation, this approach has value. With the data set D, neither model 1 nor model 2 has a lack of fit at the 95% confidence level, since all points can be generated with 5% or higher probability. If the preceding analysis does not identify an adequate model, a conventional residual analysis may be used to identify specific model limitations and suggest possible modifications. Using the probability distribution of the model prediction r̂i, as determined via the procedure described above, calculate (i) the expected value of the predicted rate of reaction, E{r̂i} = r̄i, and (ii) the residual ei = ri - r̄i for each of the data points. Then two types of residual plots should be prepared: (i) ei versus r̄i, to see if there is a pattern with the magnitude of the measured value, e.g., the deviation of the model from the data is larger for larger/smaller values of the rate, and (ii) ei versus the various experimental conditions, e.g., temperature, feed composition, etc., to determine if a pattern emerges which will suggest modifications to the model. When new models are postulated, they should then be fit to the data set D. This process of successively adding new improved models and deleting inadequate ones should continue until one or more candidate models emerge that have a high probability of generating the data.


Figure 6. Probability for generating the experimental data D by models 1 and 2.

If there are several candidate models which cannot be distinguished using the posterior probabilities Pr(Mk|D) for the given data set, it is then necessary to design additional experiments to discriminate the rival models. It is important to point out that before an adequate model is found, no effort should be placed on interpretation of the physical nature or magnitude of the individual parameter estimates themselves. The parameter estimates are biased when the models are inadequate and/or the data are inadequate to discriminate between different candidate models.

6. Design of Experiments for Model Discrimination

If two or more models emerge as candidates for describing the data in D, it is necessary to conduct additional experiments to discriminate between them. The basic concept of design of experiments for model discrimination is to locate new experiments in a region which has not already been investigated and which predicts the greatest expected difference between the candidate models based on D. In the traditional point estimate approach, the criterion that has been used14,18,20 is to locate experimental conditions, u*, at a point that maximizes the absolute differences in the predicted values for each pair of two or more models. The traditional criterion is formally stated as

$$ \max_{u^*} \; |\hat{f}_k(\theta_k^*, u) - \hat{f}_m(\theta_m^*, u)|, \quad k \ne m \ \text{for all } k, m, \quad \text{over the region } u_{\min} \le u \le u_{\max} \tag{17} $$

where f̂k(θk*, u) is the predicted rate of reaction r at u using model k at some point estimate of the model parameters θk*, such as the maximum likelihood estimators.48-51 For the

example problem, consider the range of operating conditions u:

630 ≤ T (K) ≤ 670
0.5 ≤ PA0 (atm) ≤ 1.5
0.5 ≤ PB0 (atm) ≤ 1.5
0.0 ≤ PC0 (atm) ≤ 0.2

in which the upper and lower bounds correspond to the extremes of the one-at-a-time experiment of Table 1. A series of 3⁴ = 81 experiments was run over the region, selecting all combinations of 3 values of each operating condition: the lowest point, the center point, and the highest point. Individuals familiar with statistical design of experiments will recognize this as a full factorial experiment for four factors at three levels. Predicted values for these experimental runs were calculated using the 4 × 4 = 16 sets of "optimal" parameter point estimates from Table 2. The interested user can readily produce this grid of experiments with an associated set of predicted values with a simple spreadsheet, and it will not be reproduced here. The results are quite instructive for demonstrating the pitfalls both of using one-at-a-time experimentation and of using point estimates. First, because of the uncertainty in the parameter estimates, the differences between the model predictions are small regardless of the location of u over the above region. In fact, the maximum absolute difference between the predicted values given by eq 17 for models 1 and 2 for all 16 combinations of local point estimates is only 6%. Second, the location of the maximum difference is highly dependent on the actual set of point estimates used. The candidate locations include the upper and lower bounds on T and PC0, the upper bound on PB0, and any value of PA0. The most frequent value of u* is (T = 670 K, PA0 = 1.5 atm, PB0 = 1.5 atm, PC0 = 0 atm), at which the average difference between the models is predicted to be about 3%.
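The 81-point grid itself is trivial to enumerate; a sketch (ours), with `r1_hat` and `r2_hat` standing for hypothetical fitted point-estimate rate functions of models 1 and 2:

```python
# Sketch: 3-level full factorial over (T, PA0, PB0, PC0) and the eq 17 criterion.
import itertools

levels = [(630.0, 650.0, 670.0),   # T (K)
          (0.5, 1.0, 1.5),         # PA0 (atm)
          (0.5, 1.0, 1.5),         # PB0 (atm)
          (0.0, 0.1, 0.2)]         # PC0 (atm)
grid = list(itertools.product(*levels))        # 3**4 = 81 candidate conditions u

# u_star = max(grid, key=lambda u: abs(r1_hat(*u) - r2_hat(*u)))   # eq 17
```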


Table 5. Optimal Parameters for Model Candidates Determined by Nonlinear Optimization, Including the Additional Experiment Designed by Traditional Techniques

parameter        model 1   model 2
ln(k40)          70.27     46.61
Ea/R (10³ K)     11.70     11.08
ln(K10)          -23.76    -46.20
ΔH1/R (10³ K)    -0.60     -0.63
ln(K20)          -46.14    -
ΔH2/R (10³ K)    -0.79     -
ln(K30)          1.34      2.30
ΔH3/R (10³ K)    0.00      0.00
SSE              0.0849    0.0861

What is the "true" value at this point? The true value from the simulator used to generate the data in Table 1 with model 2, ignoring experimental error, is 2.470 gmol/(min kg catalyst), while the average rates predicted at the new u* by the two model candidates are 4.194 and 4.053 gmol/(min kg catalyst) for models 1 and 2, respectively. Since neither of the models describes the rate at the new data point (r = 2.470, T = 670, PA0 = 1.5, PB0 = 1.5, PC0 = 0), this one data point and the 33 other one-at-a-time points in Table 1 were reanalyzed using traditional nonlinear regression to generate new maximum likelihood estimates for models 1 and 2, which are reported in Table 5. Note that these new maximum likelihood estimates are "very different" from the original ones in Table 2. What is even more surprising is that it is not only impossible to discriminate the rival models with the additional point, but the fit of the model to the data is better for model 1 than for model 2, which is simply wrong! Further, different optima can be obtained depending on the starting point selected from Table 2 to start the regression algorithm, which suggests that the entire approach is futile if nondesigned data and the associated point estimates for the parameters are used. The Bayesian approach provides a more statistically compelling criterion to discriminate between any two models by locating the point u* in experimental space that minimizes the probability that the model candidates will predict the same reaction rates. This idea is schematically illustrated in Figure 7. In Figure 7a, the probability density pk{r(u1*; θk, φ)} for rate predictions of model k at an additional experiment u1* overlaps considerably with pm{r(u1*; θm, φ)} for model m. If a new experiment were performed at these conditions and the observed rate was at A, it would indicate that model k is much more likely than model m; conversely, if the experimental rate with error occurred at C, model m would be more likely. However, there is a considerable region in the neighborhood of B where both models predict the same values. If the experimental rate occurred close to B, model discrimination would not be possible. In contrast, in Figure 7b there is very little overlap between pk{r(u2*; θk, φ)} and pm{r(u2*; θm, φ)} for the candidate experimental condition u2*, providing clear discrimination between models k and m. Moreover, if the experimentally observed rate were in the region of B, it would indicate that neither of the models is probable and a new candidate model would need to be generated. Thus, experiment u2* is a better choice for model discrimination. The formal statement of the design of experiments for model discrimination illustrated in Figure 7 is the following: choose u* to minimize the overlap of pk{r(u*; θk, φ)} and pm{r(u*; θm, φ)}. The same concept can be applied for discriminating more than two model candidates by calculating some weighted pairwise overlap of the probability density functions for all the adequate models.41

This experimental design criterion for model discrimination using the minimum overlap criterion described above is intuitively appealing but computationally challenging. The steps in the calculation are as follows: (1) samples of new experimental conditions ui that span the allowable experimental space are taken; (2) pk{r(ui; θk, φ)}, the probability density function for the reaction rate at ui, must be computed for all of the k candidate models; (3) the joint probability distribution for each pair of models must be determined; and, finally, (4) the pair of probability distributions with the minimal overlap area must be found to arrive at ui*, the optimal experiment for model discrimination. Rather than resorting to an optimization procedure for locating ui*, it should be possible to simply identify regions in experimental space which are the most attractive. This follows in the spirit of the Bayesian approach of replacing points with regions. One approach to finding these regions is to use a factorial or fractional factorial experiment, analogous to the approach in the point estimate case described earlier. Such an approach is attractive when the dimension of u is small, which is typically the case. In our sample problem, there are four experimental variables and, if we select three values of each of these, for example,

T = {630, 650, 670} (K)
PA0 = {0.5, 1.0, 1.5} (atm)
PB0 = {0.5, 1.0, 1.5} (atm)
PC0 = {0, 0.1, 0.2} (atm)

there are 3^4 = 81 possible new experiments in a full factorial, which is computationally feasible. For u with larger dimensions, there may be too many experimental variables to examine with factorial methods and it may be necessary to resort to optimization procedures. Rather than searching for better regions along a promising direction from a starting point, which is the conventional approach to optimization, a simpler Monte Carlo or Latin hypercube sampling procedure could be used to find attractive regions. However, unless a complete grid search is done over the entire operating region, the user is not guaranteed to find the most promising operating region. Before resorting to such an approach, the experimentalist/modeler should use his or her "prior" knowledge of how he or she would anticipate the model behaving in these as yet unexplored regions and use this information to guide the search procedure. We acknowledge, however, that beyond knowledge of physical limitations on temperature or reactant concentration ratios, or cases where residuals point to model inadequacies, "anticipation" of model response is often virtually impossible if the system complexity is high. That makes efficient, guided searches of these spaces an interesting area for further development. For the example problem with data given in Table 1, models 1 and 2 are still both viable following both conventional and Bayesian model discrimination procedures. Using the Bayesian methodology outlined above, 81 candidate ui values were chosen by factorial design at three levels for the four factors (T, PA0, PB0, PC0); the p{r(ui; θk, φ)} were computed for both models for all candidate ui values; the overlap area of p{r(ui; θ1, φ)} and p{r(ui; θ2, φ)} was determined for all ui values; and u* was determined. Table 7 lists the overlap area for each candidate experiment and identifies the best data point as u* = (T = 630 K, PA0 = 1.5 atm, PB0 = 1.5 atm, PC0 = 0 atm).
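For readers who wish to implement the overlap calculation, a minimal sketch follows. It assumes that, for each candidate u, MCMC draws of the predicted rate are already available for both models; the predict helper named in the usage comment is hypothetical.

import numpy as np

def overlap_area(samples_k, samples_m, n_bins=100):
    """Overlap between two predictive rate densities, each represented by
    posterior draws of r(u; theta, phi) for one candidate model at a fixed u.

    The integral of min(p_k, p_m) over r is approximated on a common
    histogram grid; a value near 1 means the models predict the same rates,
    so the best discriminating experiment has the smallest overlap."""
    lo = min(samples_k.min(), samples_m.min())
    hi = max(samples_k.max(), samples_m.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    width = edges[1] - edges[0]
    p_k, _ = np.histogram(samples_k, bins=edges, density=True)
    p_m, _ = np.histogram(samples_m, bins=edges, density=True)
    return float(np.sum(np.minimum(p_k, p_m)) * width)

# Hypothetical usage over the 81-point factorial grid:
# areas = {i: overlap_area(predict(1, u), predict(2, u)) for i, u in enumerate(grid)}
# u_star = grid[min(areas, key=areas.get)]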
Figure 8 presents the probability distributions of the expected rates at u* and shows that there is still considerable overlap between the two distributions. Using this augmented experimental data set DAUG = D + {u*, r*} (r* is the experimental rate under operating


Figure 7. Experimental design for model discrimination: condition b is preferable to condition a.

Table 6. Estimated Parameters and Corresponding 95% Confidence Intervals of Model Candidates (Including the Designed Experiment for Model Discrimination): Data Set DAUG and Uniform Prior Probability Distribution

                    model 1                                model 2
parameter           MPPDE   θj,max   95% LB   95% UB       MPPDE   θj,max   95% LB   95% UB
ln(k40)              8.41    9.12     6.05     9.94         1.58    1.90     0.99     2.72
Ea/R (10^3 K)       39.7    34.9     22.6     39.8         14.6    13.1     10.2     30.0
ln(K10)             -2.14   -2.04    -3.64    -1.37        -0.83   -1.17    -2.29     0.09
∆H1/R (10^3 K)     -16.1   -13.4    -23.1    -10.2        -13.3   -20.9    -28.1   -10.0
ln(K20)             -5.70   -5.35    -7.28    -2.53
∆H2/R (10^3 K)     -23.2   -12.6    -26.6    -10.2
ln(K30)              1.40    1.38     1.18     1.58         2.64    2.50     2.20     3.02
∆H3/R (10^3 K)     -10.1   -10.5    -16.2    -10.1        -10.2   -10.7    -19.1   -10.0
σ2 × 10^3            1.86    2.26     1.48     4.27         1.70    1.89     1.17     3.64
ln[E{L(Mk|D)}]      54.03                                  58.10
Pr(Mk|D)             0.0027                                 0.9973

condition u*), a new posterior probability distribution p({θk, φ}|DAUG) was computed using p({θk, φ}|D) as the prior distribution. The whole procedure of Bayesian parameter estimation described earlier was repeated, although an efficient algorithm has been developed that makes use of the previous MCMC calculations of p({θk, φ}|D) in the calculation of the new posterior distribution p({θk, φ}|DAUG) (see Appendix B). The parameter estimates, as well as the confidence regions determined from the marginal probability distributions p(θj) or p(φj), are shown in Table 6. Finally, Pr(Mk|DAUG) was determined using the procedures described in section 5, where again the priors of the two models were taken to be Pr(M1) = 0.137 and Pr(M2) = 0.863, which were obtained from the analysis in section 5. Note that we normalized the two probabilities to keep Pr(M1) + Pr(M2) = 1, since model 3 is no longer considered. The relative probabilities Pr(Mk|DAUG) for the two models are reported in Table 6 and show clearly that model 1 can be eliminated. This is a remarkable result. With only one additional point and using the posterior probabilities from the data in Table 1, model 2 is identified as the preferred model even though the difference between the marginal probability distributions shown in Figure 8 is relatively small. Just to be sure, the probability of the various experimental data points as shown in Figure 4 for D was recomputed for DAUG, and it was found that all experimental points were within the 95% confidence region. In summary, the addition of just one well-designed experiment was able to unambiguously discriminate the models using Bayesian methods, while conventional regression methods failed to do so. Of course, this investigation has been illustrated using the one-variable-at-a-time data set, which is not properly designed. The impact of a properly designed initial data set on

model discrimination will be discussed before moving on to the important topic of improving the parameter estimates after an adequate model has been selected.

6.1. Impact of the Design of the Initial Data Set D. The one-at-a-time approach to experimentation is practiced widely. It has the advantage of allowing the researchers to plot the results of changing one operating condition without confusion by the other operating conditions. This is precisely its limitation, because it fails to accommodate interactions among the operating conditions and generates large unexplored regions of the operating space. It is interesting to compare the single-point regression methods with the Bayesian approach to decide whether the latter approach is still superior to the former when a well-designed initial data set is used. A new set of designed experimental data, Dc, is presented in Table 8. It was generated over the same operating region as Table 1 using a slightly modified central composite design, which includes both the 2-level full factorial design (2^4 = 16 experiments) and the 2-level one-variable-at-a-time design (2 × 4 = 8 experiments). The one-at-a-time points are taken at the extremes of the data in Table 1. Nine replicates of the center point are included, not only to test reproducibility but also to keep the number of data points the same at N = 33. Basically, the only difference between the two sets is the location of the experiments. Nonlinear parameter estimation, starting with the same randomly generated starting point as used for the nondesigned data set, was used to fit the data of Table 8, and the results are shown in Table 9 for the three model candidates. The first observation is that once again models 1 and 3 have multiple optimal parameter sets, but model 2 has only one. Some of the parameters in models 1 and 3 are at their upper bounds. Once again, by simply examining the sum of squared errors, it is not


Table 7. Design of Experiments for Model Discrimination: Overlapped Area for Each Candidate Experiment (a)

expr no.  T (K)  PA0 (atm)  PB0 (atm)  PC0 (atm)  area   |  expr no.  T (K)  PA0 (atm)  PB0 (atm)  PC0 (atm)  area
 1        630    1.5        1.5        0          0.602  |  42        670    1          1.5        0.1        0.813
 2        670    1          1.5        0          0.603  |  43        630    0.5        1          0.2        0.814
 3        650    1.5        1.5        0.2        0.610  |  44        670    1          1          0.2        0.818
 4        670    0.5        1.5        0          0.617  |  45        670    0.5        1.5        0.1        0.832
 5        650    1.5        1.5        0          0.618  |  46        650    1.5        0.5        0.2        0.846
 6        630    1          1          0.2        0.628  |  47        670    1.5        1.5        0.1        0.852
 7        650    1          1.5        0.2        0.631  |  48        630    1          0.5        0          0.852
 8        650    1          1.5        0          0.651  |  49        650    1.5        0.5        0          0.853
 9        630    1          1.5        0          0.657  |  50        630    1          1          0.1        0.855
10        630    1.5        1.5        0.1        0.657  |  51        650    1          0.5        0          0.857
11        650    0.5        1.5        0          0.664  |  52        630    1.5        0.5        0.1        0.861
12        670    1          1          0          0.664  |  53        650    0.5        0.5        0          0.865
13        670    1.5        1.5        0          0.670  |  54        670    0.5        1          0.2        0.877
14        630    1.5        1          0.2        0.671  |  55        650    0.5        1          0.2        0.882
15        650    1.5        1          0.2        0.672  |  56        630    0.5        0.5        0          0.893
16        630    0.5        1.5        0          0.678  |  57        650    1          0.5        0.2        0.897
17        630    1          1.5        0.2        0.680  |  58        670    1          1          0.1        0.897
18        650    1          1          0          0.682  |  59        650    1.5        1.5        0.1        0.899
19        670    0.5        1          0          0.690  |  60        670    0.5        1          0.1        0.900
20        630    1.5        0.5        0.2        0.698  |  61        670    1.5        0.5        0.2        0.902
21        630    1.5        1          0          0.701  |  62        650    1.5        0.5        0.1        0.906
22        650    1.5        1          0          0.703  |  63        630    0.5        0.5        0.2        0.908
23        670    1.5        1.5        0.2        0.716  |  64        670    1          0.5        0.2        0.912
24        630    0.5        1.5        0.2        0.722  |  65        630    0.5        1.5        0.1        0.925
25        670    1          1.5        0.2        0.729  |  66        650    1          0.5        0.1        0.925
26        670    1.5        1          0          0.730  |  67        670    1.5        1          0.1        0.926
27        630    1.5        1.5        0.2        0.732  |  68        650    1          1          0.1        0.927
28        630    1          1          0          0.734  |  69        650    1.5        1          0.1        0.927
29        630    1          1.5        0.1        0.746  |  70        630    1          0.5        0.1        0.941
30        650    0.5        1          0          0.747  |  71        670    0.5        0.5        0.2        0.942
31        630    1.5        1          0.1        0.752  |  72        650    0.5        0.5        0.2        0.946
32        630    0.5        1          0          0.763  |  73        650    1          1.5        0.1        0.946
33        670    1          0.5        0          0.763  |  74        670    0.5        0.5        0.1        0.958
34        650    1          1          0.2        0.770  |  75        670    1          0.5        0.1        0.960
35        630    1.5        0.5        0          0.790  |  76        650    0.5        0.5        0.1        0.962
36        670    1.5        1          0.2        0.799  |  77        650    0.5        1          0.1        0.968
37        670    0.5        0.5        0          0.800  |  78        630    0.5        1          0.1        0.971
38        670    0.5        1.5        0.2        0.801  |  79        630    0.5        0.5        0.1        0.976
39        630    1          0.5        0.2        0.804  |  80        650    0.5        1.5        0.1        0.979
40        650    0.5        1.5        0.2        0.807  |  81        670    1.5        0.5        0.1        0.989
41        670    1.5        0.5        0          0.809  |

(a) Experiment no. 1 is the suggested experiment for model discrimination.

Figure 8. Posterior probability density of the predicted rate r from models 1 (dashed line) and 2 (solid line) at the suggested experimental condition u* = (T = 630 K, PA0 = 1.5 atm, PB0 = 1.5 atm, PC0 = 0 atm).

clear whether models 1 and 3 can be rejected, although model 2 has a lower value. What is clear, however, is the presence of local optima for the incorrect models despite the use of a designed data set. The Bayesian approach was then used to analyze the data set Dc using the same prior probability distributions as with the nondesigned data set. The marginal posterior probability density functions of the parameters for the different model candidates are shown in Figure 9, and the 95% confidence intervals are reported in Table 10. Figure 9 shows that the parameters in model 2, which is the correct model, are better defined than those of models 1 and 3. It was observed that for both models 1 and 3 the parameter distributions for E4/R and ∆H2/R are wide and shallow. In the case of ∆H2/R, the entire prior distribution bounds are enclosed in the 95% confidence region. However, as seen in Figure 9, model 2 does not have any parameter distributions that are both wide and shallow. One could consider these shallow distributions as uniform and attribute the slight variation to incomplete sampling. To understand this observation, remember that models 1 and 3 assumed that species B adsorbs, while model 2 did not; since Dc was generated with model 2, it does not contain information about B adsorbing. Comparing Table 10 with the analogous Table 6 shows that the 95% confidence regions for the correct model 2 are significantly smaller with the designed data set than with the undesigned data, whereas the uncertainty in the parameter estimates for the incorrect models 1 and 3 emphasizes the futility of attempting to interpret parameter estimates for an inadequate or false model. Comparing the posterior probabilities of the model candidates, one sees that model 2 is clearly superior to the others. There is less than a 1% chance of generating Dc with model 1 and essentially no probability with model 3. Hence, we are able to conclude that model 2 is superior to the others and are ready to test for model adequacy without additional discrimination experiments.
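The posterior model probabilities follow directly from the log expected likelihoods reported in Tables 6 and 10. A small sketch of this normalization, assuming uniform model priors unless other values are supplied:

import numpy as np

def model_probabilities(log_evidence, prior=None):
    """Posterior model probabilities from log expected likelihoods,
    Pr(Mk|D) proportional to Pr(Mk) E{L(D|Mk)}, using the log-sum-exp trick."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if prior is None:
        prior = np.ones_like(log_evidence) / len(log_evidence)
    log_post = np.log(prior) + log_evidence
    log_post -= log_post.max()          # guard against overflow
    post = np.exp(log_post)
    return post / post.sum()

# Values of ln E{L(Mk|Dc)} from Table 10 for models 1-3:
print(model_probabilities([54.7419, 59.4249, 18.0566]))
# approximately [0.009, 0.991, 0.000], matching Pr(Mk|Dc)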


7. Quality of the Parameter Estimates

Now that a single, acceptable model has been identified using the DAUG experimental data set, we are in a position to examine the quality of the parameter estimates. As discussed earlier in section 4, the marginal distribution of each individual parameter, p(θj) and p(φj), defined in eqs 9 and 10, is a good way to visualize the probability distribution for the individual kinetic and error model parameters. The (θk, φ) parameter estimates are shown in Table 6 for the Bayesian analysis of model 2 with the DAUG data set. The associated probability distributions with DAUG are slightly narrower than the distributions with D reported in Table 4.


The second moment, or variance Var(θi), of the individual parameters may also be determined from the marginal distribution:

Var(θi) = ∫_θi (θi - E(θi))^2 p(θi) dθi

Assuming that the p(θj) and p(φj) distributions are normal, Var(θi) or Var(φi) may be used to produce another set of confidence limits. However, the normality assumption is highly suspect for nonlinear models that usually result in nonsymmetric marginal probability distributions like those shown in Figure 3.
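Because the HPD interval is defined directly on the samples, it can be computed without any distributional assumption. A minimal sketch, assuming a one-dimensional array of MCMC draws for the parameter of interest:

import numpy as np

def marginal_summary(samples, alpha=0.05):
    """Mean, variance, and a 100(1 - alpha)% HPD interval from MCMC draws
    of one parameter; the HPD is the narrowest interval containing the
    requested probability mass, so no normality assumption is needed."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    m = int(np.ceil((1.0 - alpha) * n))      # points inside the interval
    widths = x[m - 1:] - x[: n - m + 1]      # widths of all candidate intervals
    j = int(np.argmin(widths))               # narrowest one
    return x.mean(), x.var(ddof=1), (x[j], x[j + m - 1])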

Table 8. Designed Data Set Dc Generated by a Modified Central Composite Design

T (K)  PA0 (atm)  PB0 (atm)  PC0 (atm)  PA (atm)  PB (atm)  PC (atm)  rate (gmol/(min kg catalyst))
650    1          1          0.1        0.989     0.989     0.111     0.692
670    1.5        1.5        0.2        1.472     1.472     0.228     1.321
670    1.5        1.5        0          1.467     1.467     0.033     2.515
670    1.5        0.5        0.2        1.491     0.491     0.209     0.492
670    1.5        0.5        0          1.489     0.489     0.011     0.869
650    1          1          0.1        0.989     0.989     0.111     0.711
670    0.5        1.5        0.2        0.482     1.482     0.218     0.556
670    0.5        1.5        0          0.474     1.474     0.026     1.218
670    0.5        0.5        0.2        0.494     0.494     0.206     0.172
670    0.5        0.5        0          0.491     0.491     0.009     0.373
650    1          1          0.1        0.989     0.989     0.111     0.687
630    1.5        1.5        0.2        1.492     1.492     0.208     0.568
630    1.5        1.5        0          1.490     1.490     0.010     1.853
630    1.5        0.5        0.2        1.497     0.497     0.203     0.213
630    1.5        0.5        0          1.497     0.497     0.003     0.588
650    1          1          0.1        0.989     0.989     0.111     0.747
630    0.5        1.5        0.2        0.494     1.494     0.206     0.224
630    0.5        1.5        0          0.491     1.491     0.009     1.123
630    0.5        0.5        0.2        0.498     0.498     0.202     0.066
630    0.5        0.5        0          0.497     0.497     0.003     0.426
650    1          1          0.1        0.989     0.989     0.111     0.689
670    1          1          0.1        0.982     0.982     0.118     0.864
630    1          1          0.1        0.994     0.994     0.106     0.396
650    1          1          0.1        0.989     0.989     0.111     0.733
650    1.5        1          0.1        1.488     0.988     0.112     0.980
650    0.5        1          0.1        0.491     0.991     0.109     0.357
650    1          1          0.1        0.989     0.989     0.111     0.706
650    1          1.5        0.1        0.984     1.484     0.116     1.036
650    1          0.5        0.1        0.995     0.495     0.105     0.347
650    1          1          0.1        0.989     0.989     0.111     0.727
650    1          1          0.2        0.991     0.991     0.209     0.511
650    1          1          0          0.988     0.988     0.012     1.334
650    1          1          0.1        0.989     0.989     0.111     0.661

Table 9. Local Optimal Parameter Estimates for Model Candidates Using Nonlinear Optimization, Fitted against Data Set Dc

Local Optimal Parameter Sets for Model 1

ln(k40)  E4/R (10^3 K)  ln(K10)  ∆H1/R (10^3 K)  ln(K20)  ∆H2/R (10^3 K)  ln(K30)  ∆H3/R (10^3 K)  SSE
15.7     7.16           -1.19    -8.57           -13.7     0.00           1.54     -11.0           0.0556
16.1     15.3           -1.19    -8.57           -14.1    -8.11           1.54     -11.0           0.0556
17.2     7.16           -1.19    -8.57           -15.2     0.00           1.54     -11.0           0.0556
18.5     7.16           -1.19    -8.57           -16.4     0.00           1.54     -11.0           0.0556
19.0     14.4           -1.19    -8.57           -16.9    -7.21           1.54     -11.0           0.0556
19.1     7.16           -1.19    -8.57           -17.0     0.00           1.54     -11.0           0.0556
19.6     17.9           -1.19    -8.57           -17.5    -10.8           1.54     -11.0           0.0556
20.0     37.1           -1.19    -8.57           -17.9    -30.0           1.54     -11.0           0.0556
24.1     7.2            -1.19    -8.57           -22.1     0.00           1.54     -11.0           0.0556
28.4     26.1           -1.19    -8.57           -26.4    -18.9           1.54     -11.0           0.0556
29.0     34.9           -1.19    -8.57           -26.9    -27.7           1.54     -11.0           0.0556
39.6     37.0           -1.19    -8.57           -37.6    -29.9           1.54     -11.0           0.0556
47.3     11.4           -1.19    -8.57           -45.3    -4.27           1.54     -11.0           0.0556
92.7     16.3           -1.19    -8.57           -90.6    -9.09           1.54     -11.0           0.0556

Local Optimal Parameter Sets for Model 2

ln(k40)  E4/R (10^3 K)  ln(K10)  ∆H1/R (10^3 K)  ln(K30)  ∆H3/R (10^3 K)  SSE
1.02     9.54           0.145    -14.9           3.06     -18.4           0.0446

Local Optimal Parameter Sets for Model 3

ln(k40)  E4/R (10^3 K)  ln(K20)  ∆H2/R (10^3 K)  ln(K30)  ∆H3/R (10^3 K)  SSE
3.40     30.0           -3.08    -28.5           2.14     -10.5           0.545
3.40     30.0           -3.09    -28.6           2.15     -11.0           0.545


The marginal probability density integrates out any correlation between the various model parameters; however, the model parameter estimates are rarely independent of one another. For any pair of parameters θi and θj, the covariance between these parameters is given by

Cov(θi, θj) = ∫_θi ∫_θj (θi - E(θi))(θj - E(θj)) p(θi, θj) dθj dθi    (18)

where the joint probability density function between a pair of parameters θi and θj is obtained by integrating out all the other parameters in the joint posterior probability distribution:

p(θi, θj) = ∫_θ′i,j ∫_φ p({θ, φ}|DAUG) dφ dθ′i,j,   θ′i,j = {θl | l ≠ i, l ≠ j}    (19)

p(θi, θj) is the two-dimensional analog of a marginal distribution. The correlation coefficient ρij for these two parameters is simply

ρij = Cov(θi, θj) / [√Var(θi) √Var(θj)]    (20)

The correlation coefficient has the conventional interpretation: values close to +1 or -1 imply that the parameters are highly positively or negatively correlated, respectively. However, ρij is a linearized point estimate and may indicate spurious results.52 A more useful way to evaluate these pairwise relationships among the parameters is to plot confidence region contours for the marginal joint probability density function p(θi, θj). As was the case for a single parameter, there is not a unique way to specify a 100(1 - α)% confidence region for a parameter pair; however, the HPD confidence region defined in section 4 can again be used, since it yields the smallest area in hyperspace. Let Ω1-α = {θi, θj} be the points in the HPD region and Ωα be the complement, i.e., the points not in the HPD region. Then, in a fashion analogous to the approach described for determining the confidence limits of the one-dimensional marginal probability distribution, the two-dimensional 100(1 - α)% confidence contours can be generated. For a linear model the confidence regions will be ellipses, and if there is no correlation between a particular pair of parameters, the major/minor axes of the ellipse will be parallel to the x- and y-axes of the contour plot.
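In sample form, eqs 18-20 reduce to ordinary covariance calculations over the MCMC chain. A sketch follows; the 0.8 threshold for flagging strongly coupled pairs is an arbitrary illustrative choice.

import numpy as np

def posterior_correlations(chain, names):
    """Pairwise covariance and correlation of parameters from an MCMC chain
    of shape (n_samples, n_parameters), i.e., eqs 18-20 evaluated by
    Monte Carlo over the posterior draws."""
    cov = np.cov(chain, rowvar=False)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    strong = [(names[i], names[j], corr[i, j])
              for i in range(len(names)) for j in range(i + 1, len(names))
              if abs(corr[i, j]) > 0.8]      # flag strongly coupled pairs
    return cov, corr, strong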

Figure 9. Marginal probability density functions of the parameters in different model candidates, fitted against data set Dc.

Table 10. 95% Confidence Intervals of the Parameters in the Three Model Candidates: Data Set Dc and Uniform Prior Probability Distribution

model 1:
parameter           MPPDE    θj,max   95% LB   95% UB
ln(k40)              9.83     7.31     6.36     9.95
E4/R (10^3 K)       29.8     28.5     19.8     36.3
ln(K10)             -1.20    -1.26    -1.46    -1.06
∆H1/R (10^3 K)     -10.1    -10.1    -14.0    -10.1
ln(K20)             -7.74    -5.15    -7.77    -4.18
∆H2/R (10^3 K)     -21.9    -19.8    -29.8    -11.4
ln(K30)              1.54     1.55     1.45     1.64
∆H3/R (10^3 K)     -11.5    -12.0    -14.2    -10.1
σ2 × 10^3            1.80     2.14     1.29     4.01
ln[E{L(Mk|D)}] = 54.7419,  Pr(Mk|D) = 0.009

model 2:
parameter           MPPDE    θj,max   95% LB   95% UB
ln(k40)              1.03     1.05     0.92     1.17
E4/R (10^3 K)        9.54    10.1      8.24    12.9
ln(K10)              0.13     0.13    -0.11     0.41
∆H1/R (10^3 K)     -14.7    -16.8    -19.9    -14.2
ln(K30)              3.06     3.08     2.93     3.26
∆H3/R (10^3 K)     -18.1    -19.7    -22.1    -18.0
σ2 × 10^3            1.38     1.66     1.07     3.08
ln[E{L(Mk|D)}] = 59.4249,  Pr(Mk|D) = 0.991

model 3:
parameter           MPPDE    θj,max   95% LB   95% UB
ln(k40)              3.32     3.07     1.98     9.91
E4/R (10^3 K)       29.7     16.0     12.1     31.2
ln(K20)             -3.00    -2.61    -9.63    -1.64
∆H2/R (10^3 K)     -28.6    -11.0    -29.8    -10.2
ln(K30)              2.13     2.08     1.70     2.42
∆H3/R (10^3 K)     -11.0    -10.2    -19.8    -10.2
σ2 × 10^3           16.3     18.8     13.3     33.9
ln[E{L(Mk|D)}] = 18.0566,  Pr(Mk|D) = 0.000


Figure 10. Pairwise marginal confidence regions of the parameters in model 2 with data set DAUG. The contours are the 50%, 70%, 90%, and 95% confidence regions from inside to outside, respectively. The units of Ea/R, ∆H1/R, and ∆H3/R are 10^3 K.

Figure 11. Pairwise confidence regions of the model parameters and error parameters in model 2. The units of Ea/R, ∆H1/R, and ∆H3/R are 10^3 K.

The confidence region contours for all pairs of the parameters in model 2, as determined from the DAUG data set, are shown in Figure 10. Correlations are observed between (i) k40 and K30, (ii) K10 and K30, (iii) Ea/R and ∆H1/R, and (iv) especially between k40 and K10; for example, an error in estimating K10 can be compensated by the estimation error of k40. Figure 11 shows that the experimental error σ2 is independent of the other parameters. Notice that the contours are not always the ellipses that would be expected from linear analysis, e.g., the contour surface for K30 vs k40; in many applications the authors have observed significant departures from elliptical behavior. Correlation between pre-exponential factors and activation energies is known to be a problem in parameter estimation; however, for the K10 vs ∆H1/R, K30 vs ∆H3/R, and k40 vs Ea pairs of model 2 the correlations are relatively weak. This weak correlation is observed because we have already compensated for it in eq 5 by using a reference temperature T0 in the data range rather than the traditional pre-exponential factor defined at T = ∞. If there is a strong correlation between two parameters, there is often a way to transform the relationship so that the correlation can be reduced significantly.
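As a concrete illustration of such a transformation, and assuming eq 5 uses the standard centered Arrhenius form, the rate constant is referenced to a temperature T0 inside the data range:

k(T) = k_{T0} exp[-(Ea/R)(1/T - 1/T0)]

Here k_{T0} is estimated where the data actually constrain it, whereas the conventional parameterization k = k0 exp(-Ea/RT) makes ln k0 and Ea nearly collinear because k0 is an extrapolated rate constant at T = ∞.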

8. Design of Experiments for Improving Parameter Estimates

Even after the parameters have been transformed, there is still the very real possibility that pairwise correlations will exist and the HPD marginal confidence limits will be too large. In this situation it is necessary to set up another sequential experimental program designed to improve the quality of the parameter estimates by minimizing the volume of their region of uncertainty in hyperspace, or to design groups of experiments with a similar aim and iterate until the quality of the parameter estimates is acceptable. Prasad and Vlachos have recently described an effective information-based optimal design of experiments applied to a kinetic model describing the catalytic decomposition of ammonia on ruthenium.53 An extension of these methods to the Bayesian point of view is presented in Appendix D. The procedure has been applied to design new experiments for the example problem, with particular focus on the energy parameters, which are poorly defined. Another three-level factorial design for the four experimental variables was used, and thus 81 possible experimental candidates were investigated. Table 11 shows the six experiments with the smallest det(Ψ), the quantitative measure of the region of uncertainty described in Appendix D. The corresponding rates simulated for model 2, including error,


Figure 12. Marginal posterior probability distributions of model 2 for different data sets: (bold solid line) DAUG+6; (dotted line) DAUG; (thin solid line) D. The units of Ea/R, ∆H1/R, and ∆H3/R are 10^3 K.

for the six new experiments are also given in the same table. The new data set DAUG+6 includes the original data set D, the experiment for model discrimination (one experiment), and the experiments for parameter estimation (six experiments); thus DAUG+6 has 40 points. The complete Bayesian analysis of this augmented data set was performed for model 2. The probability distributions for the model parameters using the augmented data set DAUG+6 are shown in Figure 12, where the distributions are significantly tighter than those for the DAUG data set employed for model discrimination. The pairwise marginal confidence regions are shown in Figure 13, where the confidence regions are considerably smaller than the ones in Figure 10 determined from the data set DAUG. The correlations were not completely removed by the additional experiments, but they are less important. The confidence limits for the parameters using DAUG+6 are given in Table 12, where the range has decreased significantly for a number of the parameters, although the ln(K30) and ∆H3/R ranges have not changed much with the addition of the new experiments. The confidence interval of ln(K30) was already relatively tight before the six selected experiments were added to the data set; ∆H3/R is still not well estimated even after the new experiments were added, and designing further experiments would help to improve the confidence limits of this parameter. The criterion for stopping this parameter estimation improvement process is subjective. A typical criterion might be to ensure that the univariate 95% confidence interval is less than 10% of a selected point estimate, such as the MPPDE or the mean value. When all the parameters satisfy the specification, the lack-of-fit test should be performed again to ensure model adequacy if replicated data points are available in the final data set. Also, if the variance parameter φ is significantly larger than the measurement error, then

the model may need to be improved, which is the next step of the model building process.

9. Discussion

A Bayesian framework for building kinetic models by means of a sequential experimentation-analysis program has been described. It allows the knowledge of the catalyst researcher to drive the model building, both by postulating viable models and by assessing the quality of the model parameter estimates. Procedures are available to design experiments that discriminate among model candidates, to assess the adequacy of those candidates against experimental error, and to design experiments that improve the parameter estimates for the best model selected. The key features of this modeling approach are (i) the use of distributions for the parameters rather than point estimates, so that regression procedures can be avoided, (ii) the ability to incorporate the knowledge of the researcher, and (iii) the ability to more accurately predict the behavior of a validated model under different operating conditions. Probability densities are the most appropriate way to describe a model's predictive capabilities, since they fully acknowledge the consequences of unavoidable experimental error in real catalytic systems. Specifically, for a given model and a given experimental data set that includes error, there will be a distribution of predicted outcomes from that model. Thus, both the analysis of a specific model and the comparison of different models need to explicitly acknowledge this distribution of predictions from the models rather than just analyzing or comparing unique predictions coming from single point estimates. In contrast, traditional linear analysis and nonlinear optimization only provide point estimates of model parameters, often luring the researcher into the erroneous impression that there is a unique prediction for a given model.

Table 11. Six Experiments Added to the Data Set DAUG for Improving Parameter Estimates

expr no.  T (K)  PA0 (atm)  PB0 (atm)  PC0 (atm)  ln(det(Ψ))  PA (atm)  PB (atm)  PC (atm)  rate (gmol/(min kg catalyst))
1         670    1.5        1.5        0.1        -24.9       1.470     1.470     0.130     1.727
2         630    1.5        1          0          -24.3       1.493     0.993     0.007     1.284
3         650    1.5        1.5        0          -23.4       1.481     1.481     0.019     2.324
4         650    0.5        1.5        0          -23.3       0.483     1.483     0.017     1.150
5         670    0.5        1.5        0          -23.2       0.474     1.474     0.026     1.125
6         670    1          1          0          -23.2       0.979     0.979     0.021     1.365


Figure 13. Pairwise marginal confidence regions of the parameters in model 2 with data set DAUG+6. The contours are the 50%, 70%, 90%, and 95% confidence regions from inside to outside, respectively. The units of Ea/R, ∆H1/R, and ∆H3/R are 10^3 K.

Table 12. Estimated Parameters and Corresponding 95% Confidence Intervals for Model 2: Data Set DAUG+6 and Uniform Prior

parameter           MPPDE    θj,max   95% LB   95% UB
ln(k40)              1.17     1.18     0.97     1.35
Ea/R (10^3 K)       10.6     10.2      7.41    13.6
ln(K10)             -0.156   -0.139   -0.512    0.234
∆H1/R (10^3 K)     -14.2    -12.2    -19.1    -10.2
ln(K30)              2.91     2.93     2.73     3.16
∆H3/R (10^3 K)     -14.7    -15.0    -22.2    -11.0
σ2 × 10^3            1.91     2.24     1.43     3.85

Another key feature of the methods discussed here is the importance of modeling the error as well as modeling the kinetic relationships. The error associated with any specific experiment is assumed to be normally distributed around the average value for that experiment. Typically, the error is assumed to be constant or proportional to the magnitude of the response (the latter is the case when the logarithm of the data is fit). However, much more complex error structures are typical, e.g., the error is proportional to the magnitude of the experimentally measured response except for very small responses below the detector sensitivity. A simple three-parameter model for capturing this heteroscedasticity in the experimental error is described in Appendix A. The key issue is that, although the experimental error is normally distributed, when that error is propagated through the nonlinear models used in kinetic analysis, the resulting errors in the parameter estimates are often anything but Gaussian. The parameter distribution estimates given in Figure 3 for the three candidate models for the test data set given in Table 1 are a good example: considering all the parameters for the three models, the only distribution that is even remotely Gaussian is ln(K30) for models 1 and 3. When an augmented data set with seven additional experiments was employed, the parameter distributions, as shown in Figure 12, became much smoother and nearly Gaussian. However, the additional data set was designed, using the Bayesian methods, to provide these highly improved (and now nearly Gaussian) parameter estimate distributions. The need for incorporating the nonlinear error structure of the models was demonstrated for a relatively simple kinetic model with data of reasonable quality. We expect that directly including the nonlinearity will become even more important as the complexity of the models increases or the data become noisier.54 It is possible that, with an overwhelming amount of data, nonlinear optimization with linear error analysis may also be able to determine reasonable parameter estimates and improve model discrimination, but even in an era of high throughput experimentation, maximizing the impact of each experiment still has value. A distinguishing feature of the Bayesian method is that it takes full advantage of the catalyst researcher's expertise prior to the determination of the parameters. This is in contrast to regression methods, which only deal with the data and the model without any subjective input from the expert other than specification of candidate models and supply of initial guesses.19,43,55 Although both model building approaches can be used, we believe that the Bayesian approach is of particular value for kinetic problems, where the potential rate and equilibrium constants can vary by orders of magnitude. Any information that the expert can provide can significantly reduce the amount of experimental data needed for model discrimination and parameter estimation. Moreover, in studies of a series of catalytic structures (e.g., the systematic change of ligand molecular structure, the stoichiometry of a mixed metal surface,


etc.) and/or a series of different reactants, one might anticipate that the rate/equilibrium constants and activation energies for one member of a catalyst family would provide a good point of departure for the analysis of other members of the family. Bayesian methods can take full advantage of such expert prior knowledge of related kinetic systems. The expert knowledge is captured via the prior probability distribution p(θk, φ). In the simple example presented here, the prior knowledge was used only to place limits on the various parameters (see Table 3). The assumption of a uniform distribution is called an uninformative prior, although this is a misnomer: there is often considerable expert knowledge in specifying the parameter limits. Sometimes additional information is available from screening studies, a linear analysis, etc. If an initial point estimate is available, an alternative is a triangular prior distribution that has its maximum at the initial point estimate and goes to zero at the expert-defined limits on the parameters. The triangular prior distribution incorporates more knowledge than the uninformative prior. The relationship between the information content of the prior probability distribution and the final parameter confidence region will be an interesting one for further study. New criteria for model discrimination and for parameter estimation have been proposed in this work. First, additional experiments for model discrimination are designed in the region where the probability distributions of the model predictions are the most different. This is a general criterion that requires no assumptions. The conventional approach of experimental design for model discrimination compares only the model predictions calculated at the maximum likelihood estimates; by using probability distributions, our proposed criterion incorporates the uncertainty of the model prediction as well, and this appears to be the source of its improved ability to discriminate models. Second, to improve the parameter estimates, the additional experiments are designed to minimize the volume of the confidence region, which is approximated by the determinant of the covariance matrix. We can calculate the covariance matrix from the samples of the Markov chain Monte Carlo process and then use it to search for new experiments that narrow the parameter probability distributions. These two criteria are intuitive but have not been discussed in the available literature, perhaps because of their computational complexity. Now that the era of high-speed computation has arrived, they should be exploited, because they are both powerful and general. We have not compared these approaches directly to the conventional D-optimal design for parameter estimation, but note that D-optimal design uses the linearized model around the maximum likelihood estimate, a step avoided by the Bayesian approach. The entire Bayesian framework was demonstrated on a very simple catalyst kinetics test problem, where data were generated from one of the models, including reasonable amounts of experimental error. The results of this exercise show that linear methods and nonlinear optimization were unable to identify the correct model from the three candidate models, even with an additional experiment. In contrast, Bayesian methods were able to robustly eliminate one of the models and suggest a single additional experiment that was able to discriminate between the remaining models.
The importance of appropriate experimental design is also indicated, in that the one-variable-at-a-time approach did not provide robust parameter estimates, whereas Bayesian-based design of experiments suggested the optimal set of new experiments needed to improve the parameter estimates. Although the Bayesian approach is much preferred over the conventional approach to model building, the paper has pointed out the computational challenges imposed by its proper implementation. The MCMC sampling methodology outlined in Appendix B was implemented in two different programs. A PC-based program called MODQUEST was written in MATLAB and used to solve the example problem described in this paper. About 10 min were required on a Dell Precision 6300 with an Intel Core 2 Extreme X7900 2.8 GHz CPU and 4 GB RAM running 32-bit Microsoft Windows Vista to obtain convergence for an eight-parameter model such as model 1 on the original data set D. Another software program, written as a combination of C++ and Fortran, was used to handle larger real-world microkinetic problems consisting of systems of differential algebraic equations. In this case, convergence for a 25-parameter microkinetic problem was achieved in less than 48 h on a large data set using an Intel Xeon 3.2 GHz CPU with 4 GB RAM running 64-bit Red Hat Enterprise Linux 4.0. Experience has shown that the computational time is dramatically reduced as the quality of the proposal distribution for the posterior, the starting guess, and the suitability of the model improve. Conversely, computational times increase with the size of the systems of differential algebraic equations to be solved and the stiffness of the system. It is also evident that the computational times could be dramatically reduced by using parallel processors or computing clusters, which are ideal for Monte Carlo-like formulations. The latter approach is being vigorously pursued by the authors. In summary, the power of Bayesian methods for model discrimination and parameter estimation, including design of experiments, has been shown for a simple, model kinetics problem. The traditional tool for analyzing kinetics is nonlinear optimization with a linear statistical analysis around the optimal solution, which can provide good results if there are sufficient well-designed data and the potential models are quite different. The Bayesian approach outperformed these methods in this comparison, however. The ability of this approach to treat nonlinearity without approximation gives it high potential for a wide variety of problems in catalytic kinetics and thus provides a new set of tools to be added to the arsenal of any researcher who is developing models of catalytic systems.

Acknowledgment

The authors would like to acknowledge the financial support of the Department of Energy of the United States (DOE-BES Catalysis Science Grant DE-FG02-03ER15466) and ExxonMobil through collaboration on the design of hydrotreating catalysts.

Appendix A: Likelihood Function and Error Model

It is necessary to specify a form of the probability distributions for the likelihood function L(D|{θk, φ}) and the prior joint probability distributions p(θk, φ) for the model parameters θk and the error model parameters φ before eq 8 can be solved. Selecting the joint prior probability distributions for the parameters is discussed in Appendix C. In this section, we develop the form of the likelihood function. It is reasonable to assume that the experimental errors εi(φ) for each of the N data points are independent and normally distributed random variables with mean zero and variance σi^2. On the basis of this assumption, the joint probability density function for the N data points in the data set D is

p(ε) = ∏_{i=1}^{N} p(εi) = ∏_{i=1}^{N} [1/((2π)^{1/2} σi)] exp(-εi^2 / (2σi^2))    (A.1)

For any set of values of the model parameters θk, the differences between the measured rates ri and the values predicted by model k, fk(θk, ui), are the residuals

eik = ri - fk(θk, ui)    (A.2)

Assuming a high probability that the kth model generated D with the θk parameters, i.e., that the model is valid, the residuals are estimates of the experimental error and may be substituted for the errors in eq A.1 to give the joint probability distribution function

L(D|θk, φ) = ∏_{i=1}^{N} [1/((2π)^{1/2} σi)] exp(-(ri - fk(θk, ui))^2 / (2σi^2))    (A.3)

which is called the likelihood function for model k. When the model is incorrect, the residuals are biased, so they are not estimates of experimental error, but the substitution is made anyway so that a likelihood function is defined for all k models. If replicates are available at each set of experimental conditions ui, it is possible to estimate the variance of the normal distribution, σi^2; in this case it is not necessary to define a statistical error model or the parameters φ. However, when replicates are not available, it is convenient to model the error. Since we have assumed that the errors are normally distributed with mean zero, the error model represents the statistical parameters that characterize uncertainties in the experimental setup and the response variable measurements. In statistical terms, we are going to use a model to characterize the heteroscedasticity in the data. The following three-parameter model has been found to be very useful for representing the variability in reaction systems20

σi^2 = ω^2 r̂i^γ + η^2,   i = 1, ..., N    (A.4)

where ω, γ, and η are independent statistical model parameters, i.e., φ = {ω, γ, η}, and r̂i is the predicted reaction rate at the current values of ui and θk. This model has physically meaningful boundary conditions. If the measurement errors are constant over the entire experimental region, then γ = 0 and the variance is constant. If the measurement errors are directly proportional to the measurements, so that there is a constant percent error in the data, then γ = 2. Finally, if the error in the measurements goes from being proportional to the measurements until the limit of detection is reached, all three terms are needed in the model, with the limit of detection being η^2. In the analysis of the simple model presented in this paper, the mathematical model parameters θk and the statistical model parameters φ were estimated simultaneously. This is particularly challenging in the point estimate approach but very natural in the Bayesian approach, affording insights into the quality of, and interaction between, the mathematical and statistical model parameters.

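A sketch of this likelihood, combining eqs A.2-A.4, is given below; rate_model is a hypothetical callable returning fk(θk, ui), and the absolute value on the predicted rate is an implementation guard rather than part of eq A.4.

import numpy as np

def log_likelihood(theta, phi, u, r, rate_model):
    """Log of eq A.3 with the heteroscedastic error model of eq A.4.

    theta      -- kinetic parameters of the candidate model
    phi        -- (omega, gamma, eta), the error model parameters
    u, r       -- operating conditions and measured rates (N points)
    rate_model -- callable f(theta, u) returning predicted rates (hypothetical)
    """
    omega, gamma, eta = phi
    r_hat = rate_model(theta, u)                       # predicted rates
    var = omega**2 * np.abs(r_hat)**gamma + eta**2     # eq A.4
    resid = r - r_hat                                  # eq A.2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + resid**2 / var)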

Appendix B: Posterior Probability Distribution Evaluation Methods

In order to determine the joint posterior probability distribution of the parameters, p({θk, φ}|D), after the data have been collected, it is necessary to integrate over the entire parameter space to determine the normalization factor in the denominator of eq 8. Once this factor has been determined, the posterior probability distribution is obtained by multiplying the likelihood function with samples taken from the prior distribution function for the parameters. This is computationally feasible when the space of the parameters is small, i.e., Qk + V ≤ 10. One approach to obtaining this integral is to recognize that the integral to be evaluated is simply the expected value of the likelihood function, E{L(D|θk, φ)}. This value can be approximated by the relationship

E{L(D|θk, φ)} = ∫_θk ∫_φ L(D|θk, φ) p(θk, φ) dφ dθk ≈ (1/T) Σ_{j=1}^{T} L(D|{θk,j, φj})    (B.1)

where {θk,j, φj} for j = 1, ..., T is a discrete set of values of the model and statistical parameters selected from the prior distribution p(θk, φ) using a Monte Carlo sampling process.56 The size of T required to give a good approximation to the integral depends on the nonlinearity of the model, the dimension and size of the parameter space Qk + V, and the accuracy required. If evaluating fk(θk,j, ui) is fast and the dimensionality of the parameter space is small, such a standardized Monte Carlo sampling procedure is an effective way to evaluate the integral. A more efficient sampling approach, the Markov chain Monte Carlo (MCMC) method, was developed by Metropolis et al.46 and later modified by Hastings.57 In this procedure it is not necessary to evaluate the integral directly. Rather, the MCMC process converges to a sampling procedure which randomly generates samples from the posterior probability distribution. By collecting a sufficient number of samples from this converged process, it is possible to calculate moments of the distribution (i.e., means, variances), confidence regions, and predictions made with the model itself. The interested reader is referred to the literature for the mechanics of this process.47 Basically, the form of the posterior distribution is proposed, a series of samples is selected from the proposal distribution, and a decision rule is defined involving the prior distribution and the previous point in the series. By repeated application of this rule, the proposal distribution is gradually modified until samples from this modified distribution evolve into a sampling scheme for the desired posterior distribution.
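A minimal random-walk Metropolis sketch of this procedure follows; the Hastings correction for asymmetric proposals is omitted because the Gaussian proposal used here is symmetric, and the fixed step size is a simplification of what a production sampler would tune.

import numpy as np

def metropolis(log_post, x0, n_steps, step=0.05, seed=0):
    """Random-walk Metropolis sampler for the joint posterior of (theta, phi).

    log_post -- callable returning the unnormalized log posterior,
                log L(D|theta, phi) + log p(theta, phi)
    x0       -- starting parameter vector
    step     -- proposal scale (tuned in practice for ~20-40% acceptance)
    """
    rng = np.random.default_rng(seed)
    chain = np.empty((n_steps, len(x0)))
    x, lp = np.asarray(x0, dtype=float), log_post(x0)
    for i in range(n_steps):
        x_new = x + step * rng.standard_normal(len(x))   # symmetric proposal
        lp_new = log_post(x_new)
        if np.log(rng.uniform()) < lp_new - lp:          # accept/reject rule
            x, lp = x_new, lp_new
        chain[i] = x
    return chain   # discard an initial burn-in segment before use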



Figure B.1. Expected log likelihood value obtained by the Monte Carlo method with different numbers of samples.

The efficiency of the MCMC method compared to the simple Monte Carlo scheme will now be shown. Consider the situation when the prior probability distributions of the parameters in the three models from the sample problem are all uniform between the expert-chosen bounds given in Table 3. Also assume that the error is normally distributed with a constant but unknown variance φ = σ2 (this is the situation where γ = 0, η2 = 0, and ω2 = σ2). Using the data set D, the expected value of the likelihood function for model 2 can be calculated from eq B.1 using a simple Monte Carlo approach. Because of the size of the numbers, the log of the likelihood function is calculated and shown in Figure B.1 as the number of samples T increases from 10^3 to 10^7. We have also reported the maximum value of the log likelihood achieved by this sampling scheme, since it is the point which defines the maximum likelihood point estimators. For comparison purposes, the MCMC method is used to generate samples from the posterior probability distribution, and these samples are used to calculate the logarithm of the posterior probability distribution using eq 8. The results, plotted in Figure B.1, are quite dramatic: significantly more trials are needed using simple Monte Carlo than using MCMC to obtain the same degree of accuracy. The MCMC converges after approximately 5 × 10^4 simulations, while more than 10^6 trials are needed for the simple Monte Carlo method, a 20-fold difference in efficiency. This is an isolated example, but comparable results can be seen for the other models.

When the experimental data set used in the MCMC method is expanded for model discrimination (see section 6) or for improving the quality of the parameter estimates (see section 8), it is desirable to take advantage of the MCMC calculations that were employed for the initial data set during the MCMC calculations for the new, augmented data set DAUG = D + DMD. First, Bayes' theorem is applied using the posterior p({θk, φ}|D) of the first data set D as the prior and p(DAUG|{θk, φ}) as the likelihood function to calculate a new posterior probability distribution p({θk, φ}|DAUG) for the augmented data set DAUG. This posterior probability distribution for the augmented data is calculated using the same MCMC methods described above. If effective discrimination or sufficient quality of parameter estimation has still not been achieved with the first augmentation of the data set, it may be necessary to repeat the process and collect additional data sets. At each stage of the iterative process, the augmented data set becomes

DAUG,j = D + Σ_{l=1}^{j} DMD,l    (B.2)

With each addition to the data set, i.e., DMD,l, the posterior from the previous set of data is used as the prior in the determination of the new augmented data set. To the best of our knowledge, this method for more efficient MCMC calculation as the data set is extended is novel.

Appendix C: Determination of the Prior Distribution

In Bayesian analysis, any knowledge about the model should be captured before the data is collected. This knowledge is quantified via the prior probabilities Pr(Mk) during model discrimination and p(θk, φ) during parameter estimation. The prior probability distribution for the models may be obtained from analysis of similar catalytic systems or from expert opinion; however, for new catalytic systems, their unique character makes it difficult to indicate a preference between models before data are collected. In these cases one can only use the uninformative prior Pr(Mk) = 1/P, where P is the number of postulated models at the start of the model building exercise. However, if data are available from exploratory experiments or the literature, they should be used to help estimate the prior probabilities. Although exploratory experiments are not in the true spirit of Bayesian statistics (i.e., every experiment should be designed based upon currently available knowledge), in practice catalysis researchers nearly always conduct a number of "exploratory" runs on the apparatus to discern the idiosyncrasies of the experimental setup, such as the feasible operating range of u, the expected analytical and sampling variability, and the reproducibility of the catalytic system, as well as catalyst deactivation. In the body of the paper an uninformative prior was employed during model discrimination (section 5). In parameter estimation, a set of independent uniform distributions was assumed for the parameters, with arbitrary but realistic upper and lower bounds for the various parameters. The formalism is described in eq 7 and the actual values used are given in Table 3. The purpose of this appendix is to show how information from exploratory experiments can be used, via an exploratory data analysis, to gain prior information to start the model building process. Linear or nonlinear regression analysis (see section 3) is a common technique to obtain point estimates of the model parameters as well as approximations of the confidence limits. Unfortunately, in this type of analysis the model parameters are confounded with the error model parameters unless the error model parameters are determined separately. Such an approach, using replicate sampling, is highly desirable and gives an assessment of the variability in the system to help design future experiments. If the rate expression is linearized so that linear regression can be used, then it is convenient to use a triangular distribution where the most likely values are the point estimates from the linear analysis and the maximum and minimum are at values suggested by the experimentalist/modeler. If nonlinear regression is used to analyze the exploratory data with the model directly, then the optimal point estimates of the parameters are generated, as well as the covariance matrix Σ̂k of the model parameters. In most nonlinear regression packages, this matrix is calculated directly from the inverse of the Hessian matrix, Hk, that results from a Taylor series approximation of the objective function about the optimal point estimates of the parameters θk*. Therefore, the prior probability distribution of θk can be formulated as a normal distribution with mean θk* (the nonlinear least-squares parameter point estimates) and covariance Σ̂k:

p(θk) = [1/((2π)^{m/2} |Σ̂k|^{1/2})] exp[-(1/2)(θk - θk*)^T Σ̂k^{-1} (θk - θk*)]
      = [|Hk|^{1/2}/(2π)^{m/2}] exp[-(1/2)(θk - θk*)^T Hk (θk - θk*)]    (C.1)

In linear regression it is unnecessary to specify a starting point, since the method guarantees a unique solution. When using nonlinear regression, however, a starting point for the algorithm is required. The convergence to a solution is strongly influenced by this starting point, and convergence to different optima is possible from different starting points, especially with sparse, nondesigned data. We have already demonstrated the presence of multiple optima in the body of the paper, even with the large data set D. It is absolutely critical that the variability around the optimum not be too constrained, so that the true optima can be considered in the Bayesian approach. Note that one of the advantages of the Bayesian approach is the ability to search the entire region of the prior distribution and generate an entire posterior distribution instead of meaningless point estimates. Rather than using a linear or nonlinear point estimation approach, it is also possible to use a Bayesian approach to generate a posterior distribution, pretending that the exploratory data are new. However, this requires assuming a prior without any data, which defeats the purpose of the exploratory data analysis.
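A sketch of constructing this prior from a nonlinear regression fit, assuming the fitted Hessian Hk is available from the regression package:

import numpy as np

def normal_prior_from_fit(theta_star, hessian):
    """Multivariate normal prior of eq C.1 built from a nonlinear regression:
    mean = least-squares point estimates, covariance = inverse Hessian."""
    theta_star = np.asarray(theta_star, dtype=float)
    cov = np.linalg.inv(hessian)        # Sigma_hat_k in eq C.1
    dim = len(theta_star)
    _, logdet = np.linalg.slogdet(cov)
    def log_prior(theta):
        d = np.asarray(theta) - theta_star
        return -0.5 * (dim * np.log(2 * np.pi) + logdet
                       + d @ np.linalg.solve(cov, d))
    return log_prior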

Appendix D: Design of Experiments to Improve Parameter Estimation

Given the augmented data set DAUG, eqs 19 and 20 can be used to calculate the variance-covariance matrix Ψ, which can then be used to design a set of experiments that efficiently improve the quality of the parameter estimates. Ψ is defined by


Ψ = [ Σθ    Σθ,φ ]
    [ Σφ,θ  Σφ   ]    (D.1)

where

     [ Var(θ1)       Cov(θ1, θ2)   ···   Cov(θ1, θp) ]
Σθ = [ Cov(θ2, θ1)   Var(θ2)       ···   Cov(θ2, θp) ]
     [ ⋮             ⋮              ⋱    ⋮            ]
     [ Cov(θp, θ1)   Cov(θp, θ2)   ···   Var(θp)     ]

     [ Var(φ1)       Cov(φ1, φ2)   ···   Cov(φ1, φn) ]
Σφ = [ Cov(φ2, φ1)   Var(φ2)       ···   Cov(φ2, φn) ]
     [ ⋮             ⋮              ⋱    ⋮            ]
     [ Cov(φn, φ1)   Cov(φn, φ2)   ···   Var(φn)     ]

       [ Cov(φ1, θ1)   Cov(φ1, θ2)   ···   Cov(φ1, θp) ]
Σφ,θ = [ Cov(φ2, θ1)   Cov(φ2, θ2)   ···   Cov(φ2, θp) ]
       [ ⋮             ⋮              ⋱    ⋮            ]
       [ Cov(φn, θ1)   Cov(φn, θ2)   ···   Cov(φn, θp) ]

       [ Cov(θ1, φ1)   Cov(θ1, φ2)   ···   Cov(θ1, φn) ]
Σθ,φ = [ Cov(θ2, φ1)   Cov(θ2, φ2)   ···   Cov(θ2, φn) ]
       [ ⋮             ⋮              ⋱    ⋮            ]
       [ Cov(θp, φ1)   Cov(θp, φ2)   ···   Cov(θp, φn) ]

The diagonal elements of Ψ are the variances of the individual parameters, and the off-diagonal elements are their covariances. Ψ is a function of both the parameters and the experimental conditions u and can be readily evaluated for the data set DAUG using suitable point estimates of the parameters. The quality of the parameter estimates is related to the elements of Ψ. For example, if the off-diagonal elements are small, the contours of the posterior probability density function will be spherical and the parameter estimates uncorrelated. Also, the smaller the variance estimates, the smaller the uncertainty in the parameter estimates. It may be shown8 that det(Ψ) is proportional to the size of the confidence regions for linear models. The hypervolumes of the confidence regions can be numerically calculated from the volume inside the contours of the full probability density distribution, i.e., eq 8; however, this is a difficult calculation. The linearized volume, i.e., det(Ψ), computed from one million realizations from the posterior probability distribution of the parameters, can be used to approximate the confidence region. To improve the confidence region, q additional experiments ui, i = 1, ..., q, should be selected to minimize this determinant. The design procedure to select these q experiments is the following:

1. Generate a new candidate experiment, ui.
2. Calculate the expected model predictions E{yi} for ui by

E{yi|DAUG} = ∫_θ,φ f(θ, ui) p({θ, φ}|DAUG) dφ dθ ≈ (1/T) Σ_{j=1}^{T} f(θj, ui)    (E.2)

where {θj} are sampled from the posterior probability distribution p({θ, φ}|DAUG).
3. Use E{yi} to form a modified data set DMD = (ui, E{yi|DAUG}).
4. Generate the posterior probability distribution from DAUG+1 = DAUG + DMD.
5. Calculate the variances and covariances of the parameters and form the matrix Ψ.
6. Calculate det(Ψ).
7. Go to step 1 until det(Ψ) is minimized.

The computational burden required to implement this procedure is enormous. Optimum-seeking methods can readily be substituted for finding the best point u1 when q = 1.
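A computational sketch of this design loop is given below; refit stands in for one full posterior update (e.g., the MCMC pass described in Appendix B) and is the expensive step that motivates searching for promising regions rather than a strict optimum.

import numpy as np

def design_for_precision(candidates, chain, rate_model, refit):
    """Rank candidate experiments by the det(Psi) criterion of Appendix D.

    candidates -- list of candidate operating conditions u_i
    chain      -- posterior draws of theta for D_AUG, shape (T, p)
    rate_model -- callable f(theta, u) for the selected model (hypothetical)
    refit      -- callable returning a new chain for the augmented data set,
                  e.g. a wrapper around the sampler sketched in Appendix B
    """
    scores = []
    for u in candidates:
        y_expected = np.mean([rate_model(x, u) for x in chain])   # eq E.2
        new_chain = refit(u, y_expected)        # posterior for D_AUG + (u, E{y})
        psi = np.cov(new_chain, rowvar=False)   # sample estimate of Psi (eq D.1)
        _, logdet = np.linalg.slogdet(psi)
        scores.append(logdet)                   # smaller ln det(Psi) is better
    order = np.argsort(scores)
    return [candidates[i] for i in order], np.sort(np.asarray(scores))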


practice, however, it is more useful to find regions in the space of experimental conditions that show lower values for det(Ψ) rather than focusing on finding a single new experimental condition that minimizes det(Ψ). Similar to the approach for the experimental design for model discrimination, described in section 6, a factorial or fractional factorial design can be applied to determine the best experiments which accomplish this goal. Each additional experiment can provide insight into improvement of the parameter estimates. Consequently, once an additional designed experiment is available, a new posterior probability distribution should be calculated for DAUG+1. Then, the confidence limits and contour regions can be calculated and a new experiment is determined using ΨAUG+1. This iterative process (experimentation and analysis) is continued until suitably refined parameter estimates are obtained. This is a viable approach if the experiments are expensive and time-consuming to generate. If high throughput experimentation, where data can be generated rapidly, is available, then the computational challenges associated with implementing a one-experiment-at-a-time sequential approach dominate. In this case it is more efficient to propose a set of designed experiments ui, i ) 1,..., q, where the estimate of the det(ΨAUG+1) for each individual experiment, the posterior distribution is determined by MCMC for the augmented data set DAUG+q, the joint marginal probability distribution of parameters determined and the confidence estimates of parameters examined. This time the iterative process will involve q experiments. In either case, the sequential process will continue until the catalyst researcher is comfortable with the quality of the estimates or additional experimentation does not improve the quality of the results. If the researcher is still not comfortable with the quality of the parameter estimates, it is necessary to revisit the experimental apparatus and attempt to improve the overall quality of the data itself. Literature Cited (1) Caruthers, J. M.; Lauterbach, J. A.; Thomson, K. T.; Venkatasubramanian, V.; Snively, C. M.; Bhan, A.; Katare, S.; Oskarsdottir, G. Catalyst design: knowledge extraction from high-throughput experimentation. J. Catal. 2003, 216 (1-2), 98–109. (2) Dumesic, J. A.; Milligan, B. A.; Greppi, L. A.; Balse, V. R.; Sarnowski, K. T.; Beall, C. E.; Kataoka, T.; Rudd, D. F.; Trevino, A. A. A Kinetic Modeling Approach to the Design of Catalysts - Formulation of a Catalyst Design Advisory Program. Ind. Eng. Chem. Res. 1987, 26 (7), 1399–1407. (3) Banaresalcantara, R.; Westerberg, A. W.; Ko, E. I.; Rychener, M. D. DECADEsA Hybrid Expert System for Catalyst Selection. 1. Expert System Consideration. Comput. Chem. Eng. 1987, 11 (3), 265–277. (4) Banaresalcantara, R.; Ko, E. I.; Westerberg, A. W.; Rychener, M. D. DECADEsA Hybrid Expert System for Catalyst Selection. 2. Final Architecture and Results. Comput. Chem. Eng. 1988, 12 (9-10), 923–938. (5) Ammal, S. S. C.; Takami, S.; Kubo, M.; Miyamoto, A. Integrated computational chemistry system for catalysts design. Bull. Mater. Sci. 1999, 22 (5), 851–861. (6) Burello, E.; Rothenberg, G. In silico design in homogeneous catalysis using descriptor modelling. Int. J. Mol. Sci. 2006, 7 (9), 375–404. (7) Dumesic, J. A.; Rudd, D. F.; Aparicio, L. M.; Rekoske, J. E.; Trevino, A. A. The Microkinetics of Heterogeneous Catalysis; American Chemical Society: Washington, D.C., 1993; p 316. (8) Box, G. E. P.; Lucas, H. L. 
In either case, the sequential process will continue until the catalyst researcher is comfortable with the quality of the estimates or additional experimentation does not improve the quality of the results. If the researcher is still not comfortable with the quality of the parameter estimates, it is necessary to revisit the experimental apparatus and attempt to improve the overall quality of the data itself.

Literature Cited

(1) Caruthers, J. M.; Lauterbach, J. A.; Thomson, K. T.; Venkatasubramanian, V.; Snively, C. M.; Bhan, A.; Katare, S.; Oskarsdottir, G. Catalyst design: knowledge extraction from high-throughput experimentation. J. Catal. 2003, 216 (1-2), 98–109.
(2) Dumesic, J. A.; Milligan, B. A.; Greppi, L. A.; Balse, V. R.; Sarnowski, K. T.; Beall, C. E.; Kataoka, T.; Rudd, D. F.; Trevino, A. A. A Kinetic Modeling Approach to the Design of Catalysts - Formulation of a Catalyst Design Advisory Program. Ind. Eng. Chem. Res. 1987, 26 (7), 1399–1407.
(3) Banares-Alcantara, R.; Westerberg, A. W.; Ko, E. I.; Rychener, M. D. DECADE - A Hybrid Expert System for Catalyst Selection. 1. Expert System Consideration. Comput. Chem. Eng. 1987, 11 (3), 265–277.
(4) Banares-Alcantara, R.; Ko, E. I.; Westerberg, A. W.; Rychener, M. D. DECADE - A Hybrid Expert System for Catalyst Selection. 2. Final Architecture and Results. Comput. Chem. Eng. 1988, 12 (9-10), 923–938.
(5) Ammal, S. S. C.; Takami, S.; Kubo, M.; Miyamoto, A. Integrated computational chemistry system for catalysts design. Bull. Mater. Sci. 1999, 22 (5), 851–861.
(6) Burello, E.; Rothenberg, G. In silico design in homogeneous catalysis using descriptor modelling. Int. J. Mol. Sci. 2006, 7 (9), 375–404.
(7) Dumesic, J. A.; Rudd, D. F.; Aparicio, L. M.; Rekoske, J. E.; Trevino, A. A. The Microkinetics of Heterogeneous Catalysis; American Chemical Society: Washington, D.C., 1993; p 316.
(8) Box, G. E. P.; Lucas, H. L. Design of Experiments in Non-Linear Situations. Biometrika 1959, 46 (1/2), 77–90.
(9) Chernoff, H. Sequential design of experiments. Ann. Math. Statist. 1959, 30, 755–770.
(10) Franckaerts, J.; Froment, G. F. Kinetic study of the dehydrogenation of ethanol. Chem. Eng. Sci. 1964, 19 (10), 807–818.
(11) Box, G. E. P.; Draper, N. R. The Bayesian Estimation of Common Parameters from Several Responses. Biometrika 1965, 52 (3), 355–365.
(12) Hunter, W. G.; Reiner, A. M. Designs for discriminating between two rival models. Technometrics 1965, 7 (3), 307–323.


(13) Kittrell, J. R.; Hunter, W. G.; Watson, C. C. Nonlinear Least Squares Analysis of Catalytic Rate Models. AIChE J. 1965, 11 (6), 1051–1057.
(14) Box, G. E. P.; Hill, W. J. Discrimination among mechanistic models. Technometrics 1967, 9 (1), 57–71.
(15) Hunter, W. G.; Mezaki, R. An experimental design strategy for distinguishing among rival mechanistic models. An application to the catalytic hydrogenation of propylene. Can. J. Chem. Eng. 1967, 45, 247–249.
(16) Froment, G. F.; Mezaki, R. Sequential Discrimination and Estimation Procedures for Rate Modeling in Heterogeneous Catalysis. Chem. Eng. Sci. 1970, 25, 293–301.
(17) Van Welsenaere, R. J.; Froment, G. F. Parametric Sensitivity and Runaway in Fixed Bed Catalytic Reactors. Chem. Eng. Sci. 1970, 25, 1503–1516.
(18) Reilly, P. M. Statistical methods in model discrimination. Can. J. Chem. Eng. 1970, 48, 168–173.
(19) Bard, Y. Nonlinear Parameter Estimation; Academic Press: New York, 1974.
(20) Reilly, P. M.; Blau, G. E. The Use of Statistical Methods to Build Mathematical Models of Chemical Reaction Systems. Can. J. Chem. Eng. 1974, 52, 289–299.
(21) Reilly, P. M.; Bajramovic, R.; Blau, G. E.; Branson, D. R.; Sauerhoff, M. W. Guidelines for the optimal design of experiments to estimate parameters in first order kinetic models. Can. J. Chem. Eng. 1977, 55, 614.
(22) Stewart, W. E.; Sorensen, J. P. Bayesian Estimation of Common Parameters From Multiresponse Data With Missing Observations. Technometrics 1981, 23 (2), 131–141.
(23) Rabitz, H.; Kramer, M.; Dacol, D. Sensitivity Analysis in Chemical Kinetics. Annu. Rev. Phys. Chem. 1983, 34, 419–461.
(24) Froment, G. F. The kinetics of complex catalytic reactions. Chem. Eng. Sci. 1987, 42 (5), 1073–1087.
(25) Stewart, W. E.; Caracotsios, M.; Sorensen, J. P. Parameter estimation from multiresponse data. AIChE J. 1992, 38 (5), 641–650.
(26) Stewart, W. E.; Shon, Y.; Box, G. E. P. Discrimination and goodness of fit of multiresponse mechanistic models. AIChE J. 1998, 44 (6), 1404–1412.
(27) Asprey, S. P.; Naka, Y. Mathematical Problems in Fitting Kinetic Models - Some New Perspectives. J. Chem. Eng. Jpn. 1999, 32 (3), 328–337.
(28) Stewart, W. E.; Henson, T. L.; Box, G. E. P. Model discrimination and criticism with single-response data. AIChE J. 1996, 42 (11), 3055–3062.
(29) Park, T.-Y.; Froment, G. F. A Hybrid Genetic Algorithm for the Estimation of Parameters in Detailed Kinetic Models. Comput. Chem. Eng. 1998, 22 (Suppl.), S103–S110.
(30) Petzold, L.; Zhu, W. Model reduction for chemical kinetics: an optimization approach. AIChE J. 1999, 45 (4), 869–886.
(31) Ross, J.; Vlad, M. O. Nonlinear Kinetics and New Approaches to Complex Reaction Mechanisms. Annu. Rev. Phys. Chem. 1999, 50, 51–78.
(32) Atkinson, A. C. Non-constant variance and the design of experiments for chemical kinetic models. In Dynamic Model Development: Methods, Theory and Applications; Asprey, S. P., Macchietto, S., Eds.; Elsevier: Amsterdam, 2000; pp 141–158.
(33) Cortright, R. D.; Dumesic, J. A. Kinetics of Heterogeneous Catalytic Reactions: Analysis of Reaction Schemes. Adv. Catal. 2001, 46, 161–264.
(34) Song, J.; Stephanopoulos, G.; Green, W. H. Valid Parameter Range Analyses for Chemical Reaction Kinetic Models. Chem. Eng. Sci. 2002, 57, 4475–4491.
(35) Sirdeshpande, A. R.; Ierapetritou, M. G.; Androulakis, I. P. Design of Flexible Reduced Kinetic Mechanisms. AIChE J. 2001, 47 (11), 2461–2473.

(36) Katare, S.; Bhan, A.; Caruthers, J. M.; Delgass, W. N.; Venkatasubramanian, V. A hybrid genetic algorithm for efficient parameter estimation of large kinetic models. Comput. Chem. Eng. 2004, 28 (12), 2569–2581.
(37) Bhan, A.; Hsu, S.-H.; Blau, G.; Caruthers, J. M.; Venkatasubramanian, V.; Delgass, W. N. Microkinetic Modeling of Propane Aromatization over HZSM-5. J. Catal. 2005, 235 (1), 35–51.
(38) Bogacka, B.; Wright, F. Non-linear design problem in a chemical kinetic model with non-constant error variance. J. Stat. Plan. Inference 2005, 128, 633–648.
(39) Ucinski, D.; Bogacka, B. T-optimum design for discrimination between two multiresponse dynamic models. J. R. Stat. Soc. Ser. B: Stat. Method. 2005, 67, 3–18.
(40) Englezos, P.; Kalogerakis, N. Applied Parameter Estimation for Chemical Engineers; Marcel Dekker, Inc.: New York, 2001.
(41) Blau, G. E.; Lasinski, M.; Orcun, S.; Hsu, S.-H.; Caruthers, J. M.; Delgass, W. N.; Venkatasubramanian, V. High Fidelity Mathematical Model Building with Experimental Data: A Bayesian Approach. Comput. Chem. Eng. 2008, 32 (4-5), 971–989.
(42) Draper, N. R.; Hunter, W. G. Design of experiments for parameter estimation in multiresponse situations. Biometrika 1966, 53 (3/4), 525–533.
(43) Draper, D. Bayesian Hierarchical Modeling; Springer-Verlag: New York, 2000.
(44) Nocedal, J.; Wright, S. J. Numerical Optimization; Springer: New York, 1999; p 636.
(45) Froment, G. F. Model discrimination and parameter estimation in heterogeneous catalysis. AIChE J. 1975, 21 (6), 1041–1057.
(46) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
(47) Gilks, W. R.; Richardson, S.; Spiegelhalter, D. J. Markov Chain Monte Carlo in Practice; Chapman & Hall/CRC: New York, 1996.
(48) Atkinson, A. C.; Cox, D. R. Planning Experiments for Discriminating between Models. J. R. Stat. Soc. Ser. B: Stat. Method. 1974, 36 (3), 321–348.
(49) Atkinson, A. C.; Fedorov, V. V. Optimal design: Experiments for discriminating between several models. Biometrika 1975, 62 (2), 289–303.
(50) Fedorov, V. V.; Pazman, A. Design of physical experiments. Fortschr. Phys. 1968, 16, 325–355.
(51) Hsiang, T.; Reilly, P. M. A practical method for discriminating among mechanistic models. Can. J. Chem. Eng. 1971, 49, 865–871.
(52) Montgomery, D. C.; Runger, G. C. Applied Statistics and Probability for Engineers, 3rd ed.; Wiley: Hoboken, NJ, 2002; p 720.
(53) Prasad, V.; Vlachos, D. G. Multiscale Model and Informatics Based Optimal Design of Experiments: Application to the Catalytic Decomposition of Ammonia on Ruthenium. Ind. Eng. Chem. Res. 2008, 47, 6555–6567.
(54) Hsu, S.-H. Bayesian Model Building Strategy and Chemistry Knowledge Compilation for Kinetic Behaviors of Catalytic Systems. Ph.D. Thesis, Purdue University, West Lafayette, IN, 2006.
(55) Bates, D. M.; Watts, D. G. Nonlinear Regression Analysis and Its Applications; Wiley and Sons: New York, 1988.
(56) Fishman, G. S. A First Course in Monte Carlo; Thomson Brooks/Cole: Belmont, CA, 2005.
(57) Hastings, W. K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 1970, 57, 97–109.

Received for review October 30, 2008
Revised manuscript received February 24, 2009
Accepted March 13, 2009

IE801651Y