Parameter Estimation in Batch Bioreactor ... - ACS Publications

Aug 6, 2011 - ABSTRACT: In this study, we propose a parameter estimation method for genome-scale dynamic flux balance (DFBA) models. A...
6 downloads 0 Views 2MB Size
ARTICLE pubs.acs.org/IECR

Parameter Estimation in Batch Bioreactor Simulation Using Metabolic Models: Sequential Solution with Direct Sensitivities Juha T. Lepp€avuori VTT Technical Research Centre of Finland, P.O. Box 1000, FI-02044, Finland

Michael M. Domach and Lorenz T. Biegler* Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States ABSTRACT: In this study, we propose a parameter estimation method for genome-scale dynamic flux balance (DFBA) models. A bilevel optimization problem is reformulated as a differential-algebraic equation (DAE) optimization problem and solved sequentially, using gradient-based optimization with direct sensitivity equations. The resulting solution method is computationally efficient for today’s largest genome-scale metabolic models. The parameter estimation method combined with parameter selection algorithm is applied on simulated and experimental data. This paper presents the parameter estimation and selection method and numerical results of estimation of kinetic parameters of the DFBA model of anaerobic batch fermentation. The results show improved computational performance over previous approaches, thus making parameter estimation available for genome-scale DFBA models.

1. INTRODUCTION Bioreactors are key unit operations in food, pharmaceutical, and biofuels processes. In recent years, a great amount of research effort has been exerted in developing biological processes and models to support the production of ethanol and platform chemicals from sugars, especially from lignocellulosic materials. On the other hand, genome-scale models for micro-organisms are continuously being developed and used for the analysis of existing strains or for designing new strains. Combining these models to process engineering applications has been seen as a great opportunity for the design and optimization of fermentation processes. Genome-scale metabolic network reconstructions are being developed for various micro-organisms, such as Saccharomyces cerevisiae and Escherichia coli, which are the workhorses of many industrial bioprocesses today. One of the most extensive metabolic networks is the latest version of the yeast consensus network, Yeast 4.0, which contains 2183 individual reactions and 2657 chemical species (1494 metabolites and 1163 proteins or complexes).1 These networks, together with various mathematical tools, such as the MATLAB-based Cobra toolbox,2 have made a significant difference in the understanding of microbiological systems. The estimation of kinetic parameters in biological models is a key challenge in computational biology. The models usually contain a large number of unknown kinetic parameters that cannot be measured. The number of parameters that can be reliably estimated, based on available experimental data, is often much smaller than the total number of parameters. Several model identification approaches have been developed to face these challenges and obtain statistically consistent parameter estimates. Wilkinson et al. proposed a sequential linear-programming-based approach that simultaneously matched the model prediction with measurement data and minimized the distance between r 2011 American Chemical Society

the estimated and nominal parameter values.3,4 The objective of the method is to determine an ensemble of parameters that are not necessarily unique, but do provide the best overall model behavior with the parameter values closest to nominal values. This method has been successfully applied to parameter estimation in a yeast glycolysis signaling model, a model of thermal isomerization of R-pinene, and translation initiation in yeast.3,5 Gadkar et al. presented a three-step model identification method applied to cell signaling network models.6 First, the identifiable set of parameters and measurement datasets to be used were selected based on identifiability analysis performed with nominal parameter values. Second, unmeasured model states were estimated based on selected data with a special approach that utilized knowledge of the principles of cellular processes. Third, the model parameters were identified based on measured and estimated states through least-squares estimation. Lillacci and Khammash recently proposed a three-step parameter estimation method that intended to provide statistically valid parameter estimates.7 The parameters were first estimated through an extended Kalman filtering technique and estimates were then validated using χ2 tests. If the test was not passed, the estimates were further refined by moment minimization. This yielded statistically valid parameter estimates in the examples presented in Lillacci and Khammash.7 All the models used in the above examples were formulated as ordinary differential equations (ODEs). The number of ODEs in the models was in the range of 3125, and the number of estimated parameters varied from 2 to 27.

Received: May 11, 2011 Accepted: August 5, 2011 Revised: August 2, 2011 Published: August 06, 2011 12080

dx.doi.org/10.1021/ie201020g | Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research In this study, we use dynamic flux balance analysis (DFBA) models to simulate and identify the batch bioreactor. DFBA models are based on a stoichiometric model of cell metabolism with dynamic extensions. The number of parameters in DFBA models is typically in the same range as those in the above-mentioned kinetic models. The parameters can be highly dependent on each other, and parameter estimation can be computationally very expensive, especially with genome-scale DFBA models. Furthermore, quantification of the uncertainty of parameter estimates is the key issue for the models developed for process design and optimization purposes. Parameter estimation and optimization of DFBA models leads to a bilevel optimization model, where the flux balance model is a linear program (LP) embedded within the outer parameter estimation problem. This bilevel problem has been solved in two ways: (i) formulating the bilevel problem as a single-level mathematical program with complementarity constraints (MPCC), and (ii) sequentially solving the LP model nested inside the optimization. The MPCC formulation was developed by Raghunathan et al.8 They discretized the differential equations and replaced the inner LP by its stationary conditions for each time step. The discretized differential equations and stationary conditions formed a large set of nonlinear equations. The complementarities were included in the optimization, using a penalty approach. The problem was solved using a general-purpose nonlinear programming (NLP) solver IPOPT9 on the AMPL10 platform. This formulation has been used to estimate stoichiometric biomass coefficients based on experimental data,8 and also using the work of Hjersted and Henson11 to optimize a substrate feeding strategy in batch ethanol production. In both applications, a metabolic network containing > xethanol > > : 1 þ ethanol K

!"

xglucose K glucose 1 þ

xfructose K fructose

!

9 > > > > =

# > > > þ xglucose > ;

ð5Þ

AvðtÞ ¼ 0 vl e vðtÞ e vu vf ¼ bf ðxðtÞÞ vl ¼ bl ðxðtÞÞ vu ¼ bu ðxðtÞÞ

vfructose ¼ Vmax

where vf is the subset of fluxes (in this study, glucose and fructose) that are specified by a transport model, and the bounds vl and vu are determined by transporter and flux boundary models. The dynamic flux balance model and the compartments are illustrated in Figure 1. 2.1. Anaerobic Yeast Fermentation Model. The anaerobic yeast fermentation DFBA model is based on the model reported in refs 18 and 19. The extracellular model contains concentrations of biomass, glucose, fructose, glycerol, ethanol, and ammonium. For the intracellular model, we used two yeast metabolic networks with different levels of detail. Glucose, fructose, and ammonium uptake rates and minimum ATP consumption are specified based on ref 19. Intracellular Model. We applied two metabolic networks as an intracellular model. The first network is a metabolic network of anaerobic central carbon metabolism of yeast taken from Pizarro et al.19 The network has a single compartment intracellular model, which contains 38 intracellular metabolites and 39 reactions, and 7 transfer reactions between intracellular and extracellular compartments. Biomass formation is expressed as θPro Protein þ θCarb Carbohydrates þ θLip Lipids þ θDNA DNA þ θRNA RNA f 1 g of biomass

8 > > > >
> > > < > > xethanol > > : 1 þ ethanol K

!"

xfructose K fructose

xglucose 1 þ glucose K

!

# þ

xfructose

9 > > > > = > > > > ;

ð6Þ where Vmax is the maximum glucose and fructose uptake rate, Kglucose and Kfructose are glucose and fructose saturation constants, and Kethanol is the ethanol inhibition constant. Here, the saturation constants are also used as competitive inhibition constants for the other sugar. In Sainz et al., growth is limited by nitrogen concentration in the external media.18 Ammonium is assumed to be the only nitrogen source. The upper bound for ammonium uptake was 0.3 mol/gdw/h when the external ammonium concentration was >0.1 g/L; otherwise, it was 0 mol/gdw/h. Here, we assumed maximum ammonium uptake to be a continuous function of the external ammonium concentration. Maximum ammonium uptake is modeled in the MichaelisMenten form: ! xammonium u; ammonium ammonium ¼ Vmax ð7Þ v K ammonium þ xammonium is the maximum ammonium uptake rate and where Vammonium max Kammonium is the ammonium saturation constant. The lower bound of maintenance is a function of sugar uptake and ethanol concentration:    vl; maintenance ¼ a þ bxethanol vglucose þ vfructose

ð8Þ

The nominal parameter values are presented in Table 4 later in Section 4.2. Parameter values 14 (Vmax, Kglucose, Kfructose, and Kethanol) are the values for HTX3,18,19 which is the main hexose transporter and is expressed throughout the fermentation process until both sugars are depleted.19 Parameters 5 and 6 (a and b) are linear estimators in eq 8 for the data presented in ref 18. and Kammonium) in eq 7 were Parameters 7 and 8 (Vammonium max selected based on ref 18, in such a way that the bound was 12082

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research

ARTICLE

active at low ammonium concentrations and inactive at higher concentrations.

3. SEQUENTIAL PARAMETER ESTIMATION STRATEGY WITH DIRECT SENSITIVITIES We now develop a sequential solution strategy for parameter estimation of DFBA models. We formulate the model 3 as a set of differential-algebraic equations in time t and develop sensitivity equations to yield gradients for the optimization. This way, we are able to use gradient-based optimization methods instead of derivative-free search methods. A general DFBA parameter estimation problem is formulated as  T   1 min ϕ ¼ ∑ ∑ xei  xem, i W 1 xei  xem, i ð9Þ 2 i e subject to x_ e ¼ ve xbio ðtÞ max f T vðtÞ subject to Aðx, pÞvðtÞ ¼ 0 vl e vðtÞ e vu vf ¼ bf ðx, pÞ vl ¼ bl ðx, pÞ vu ¼ bu ðx, pÞ

Figure 2. MATLAB implementation of parameter estimation.

where W is the covariance matrix of the measurements and xei and e are the predicted and measured concentrations, respectively, xm,i for external metabolites e at time step i. The estimable parameters can be parameters of the intracellular model (such as the P/O ratio or biomass composition) or parameters of transporter and flux boundary models (such as the MichaelisMenten constants). The model is solved by discretizing the differential equations in time. At each time step, flux boundary models are solved for flux boundaries vl and vu. The fluxes v are solved from the LP described by eq 1. Extracellular metabolite concentrations are integrated for the next time step, i + 1, using analytical solution for biomass and implicit trapezoidal method for other metabolites:  bio  bio xbio i þ 1 ¼ xi exp vi h ð10Þ hve ðxbio þ xbio i þ 1Þ xei þ 1 ¼ xei þ i i 2 where h = ti+1  ti is the length of the time step. The gradients for parameter estimation are obtained by developing sensitivity equations for the model described by eq 3. To enable the evaluation of the sensitivity equations, the model is reformulated as a DAE by substituting the stationary conditions f þ A T λ  μ l þ μu ¼ 0 Av ¼ 0 0 e ðv  vl Þ ^ μl g 0 0 e ðvu  vÞ ^ μu g 0

suppress the subscript i for the sake of brevity. The direct use of these linear stationary conditions requires expressing the complements operator. This can be done by introducing binary decision variables (e.g., those given in Burgard et al.22) or by replacing binary variables by additional constraints. Raghunathan et al. formulated the discretized equations described by eqs 10 and 11 as constraints for all time steps, and the parameter estimation problem in eq 9 was replaced by a large nonlinear program, where the discretized equations were solved simultaneously with the parameter estimation.8 However, this approach has difficulties in the presence of linearly dependent rows in stoichiometric matrix A in LP in eq 1. Also, in this formulation, the size of the problem grows large with the genome-scale-metabolic models, and would be outside of the capabilities of today’s laptop hardware. Because we focus here on the more-challenging genome-scale FBA systems, we consider instead a sequential solution method where FBA is solved at each time step, aided by a robust LP solver. To obtain necessary derivative information for the optimization, we differentiate the discretized model, with respect to the estimated parameters p, mentioned in Section 2.1. In order to differentiate the model, the complementary conditions in eq 11 were approximated by smooth differentiable functions similarly as Kaplan et al.,14 when the stationary conditions are written as f þ AT λ  μl þ μu ¼ 0 Av ¼ 0 rhffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi i2ffi     l l l l vðjÞ  vðjÞ  μðjÞ þ ε ¼ 0 vðjÞ  vðjÞ þ μðjÞ  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  u   ðvðjÞu  vðjÞ  μðjÞu þ ε2 ¼ 0 vðjÞ  vðjÞ þ μðjÞu 

ð11Þ

ð12Þ

for the LP described by eq 1 at each time step, where λ, μl, and μu are Lagrange multipliers, ^ is the complements operator, and we

where j ∈ {1, nf}, nf is the number of variables in LP in eq 1, and ε is a small positive constant. We note that by substituting eq 12 for LP in the model in eq 3, the model is a continuously differentiable DAE formulation that 12083

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research

ARTICLE

can be solved using a state-of-the-art DAE solver. The sensitivity equations developed in the following would apply for this formulation as well, and, thus, standard DAE optimization methods with direct sensitivities could be applied. However, in order to speed up the computational performance of the parameter estimation, we used a robust LP solver to solve all the primal and dual variables (v, λ, μl, μu) of LP in eq 1, and formulation given in eq 12 was used to develop the sensitivity equations. The parameter ε in eq 12 was set to be vanishingly small (ε ≈ 0), so that the following sensitivity equations remain consistent with the LP solution. Next, we differentiate the model in eq 13 to obtain the derivatives of the differential state variables x with respect to parameter p. By defining the sensitivity matrix S = (dx/dp)T and algebraic variables y = [v, λ, μl, μu]T, we obtain the sensitivity equations: Sbio iþ1

¼

Sbio i

exp





vbio i h

þ

xbio i

exp





vbio i h

dvbio i h dp

!T

e hve ðSbio þ Sbio hðxbio þ xbio i þ 1Þ i i þ 1 Þ dvi þ ¼ þ i i dp 2 2 !T ∂gi T ∂gi T dyi ∂gi T Si þ þ ¼0 ∂x ∂y dp ∂p

Sei þ 1

!T

Sei

ð13Þ where 8 T l u > > f þ A λi  μi þ μi > > > > > Avi rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi >  h  i2ffi > qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi >   u  > u u u 2 > > vðjÞi  vðjÞi  μðjÞi > > vðjÞi  vðjÞi þ μðjÞi  > > : vf ¼ bf ðx, pÞ, ¼ vl ¼ bl ðx, pÞ, vu ¼ bu ðx, pÞ

ð14Þ and y = [v, λ, μ , μ ] includes all of the primal and dual variables from the LP (1). The gradient dϕ/dp for the objective function in (9), with respect to parameter p, is l

dϕ ¼ dp

u T

∑i ðxei  xem, i ÞT W 1Si

ð15Þ

and the Hessian d2ϕ/dp2 is replaced by a GaussNewton approximation: d2 ϕ ≈H ¼ dp2

∑i STi W 1Si

ð16Þ

which assumes a good fit between the measurements and model predictions. We note that analytical solution for biomass in eq 10 allows for the use of the implicit trapezoidal method in integration of other metabolites. The implicit trapezoidal method is second order Astable, and because the sensitivity equations (eqs 13 and 14) take the same form, the same stability properties also apply to the sensitivity equations. The integration time step was chosen by trial and error in order to meet the accuracy requirements. 3.1. Computer implementation. Parameter estimation method was implemented on the MATLAB platform and the LPs were solved using the GNU Linear Programming Kit (http://www.gnu.org/software/glpk/) as the LP solver. The

sensitivity equations were solved at each time step in parallel with simulation. Minimization of the squared errors was done with the fmincon optimizer with the trust-region reflective method in MATLAB. The implementation is illustrated in Figure 2. Finally, we note that some care is needed in the treatment of LP solutions that have different active constraint sets from one time point to the next. These active set changes may occur between time steps and can lead to discontinuous right-hand sides of the differential equations in eq 3. Moreover, they lead to discontinuities in the sensitivities, with respect to p, and convergence difficulties in the parameter estimation algorithm. Avoiding this troublesome feature requires the active set change to be tracked carefully as a function of time, and the time point located exactly where the active set changes. While thorough treatment of this condition is beyond the scope of this study, we carefully monitored changes of active sets over time. If the optimization stopped before the first-order optimality was near zero, it indicated that the algorithm terminated at such a derivative discontinuity. We overcame this by restarting the solver with manual perturbations for the optimization variables in such cases. 3.2. Selection of Estimated Parameters. The goal of parameter estimation is to determine good estimators for parameters based on measured data. Usually, the goodness of fit in parameter estimation and small confidence regions of estimators are conflicting criteria, especially if the model is overparameterized. The parameters are often estimated from limited data, which further increases the risk of over-fitting. There are a number of statistical criteria that can be used for selecting a set of estimated parameters; these include an Akaike test, as well as several maximum likelihood-based tests. Our goal was to choose a minimal set of parameters that can be estimated with tight confidence intervals and still give a good fit of experimental data. We addressed this challenge with the following algorithm modified from ref 23, and combined it with posterior goodness-of-fit analyses,24 as shown below in eq 22: (1) Start with nominal parameter values and analyze the goodness of fit of the model. (2) Rank parameters based on their variance contribution at the current solution. (3) Select the parameter that has the smallest variance contribution among nonincluded parameters. (4) Estimate parameters using the optimization model described above. (5) Analyze the goodness of fit of the model. • If the fit is better than any of the existing ones, add a candidate parameter to the estimated parameters. Go to step 2. • If the fit is not better, set the value of the candidate parameter to its nominal value and choose the parameter that has the next-smallest variance contribution among nonincluded parameters as a candidate parameter. Go to step 4. (6) Analyze the confidence intervals of estimated parameters. If the confidence intervals are not within acceptable limits, go back to previous solutions until a solution with acceptable confidence intervals is found. This heuristic algorithm starts with one parameter and adds more parameters if the goodness of fit is improved over previous solution and the confidence intervals of all estimates are within 12084

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

2.20 ( 0.37 2.06 ( 1.70 0.25 ( 0.02 28.45 ( 18.51 38.67 ( 38.27 18

2.60 ( 0.15 1, 5, 6, 8

3.80 ( 1.18

20.69 ( 20.95

20.04 ( 2.55 2.53 ( 0.10 1, 3, 5, 6

2.60 ( 0.15 2.46 ( 0.26

2.55 ( 0.13 1, 2, 5, 6

1, 5, 6, 7 1, 4, 5, 6

2.54 ( 0.09 1, 5, 6

1, 4, 6

2.07 ( 0.77

12.03 ( 1.56

12.27 ( 1.38 2.31 ( 0.04 1, 2, 6

The bold value in this column highlights the best goodness of fit. a

2.12 ( 0.74

1.74 ( 1.11

0.25 ( 0.01

0.26 ( 0.01

2.08 ( 1.02

1.74 ( 1.11 2.09 ( 1.19

0.25 ( 0.01

0.25 ( 0.01 0.25 ( 0.01

59.60 ( 8.02

47.16 ( 11.07

2.04 ( 0.66

4.55 ( 81.8

0.25 ( 0.01

4.50 ( 0.12

4.94 ( 0.08

4.50 ( 0.11 2.30 ( 0.04

6

1, 6

predefined limits. If the goodness of fit is not improved when adding a new parameter, then the parameter is set to its nominal value. We note that this algorithm does not necessarily find the set of parameters that give the best fit, and there might be several different sets of parameters that yield similar fits. However, we used this approach to test the parameter estimation algorithm and avoid overfitting of the model. The details of the individual steps are outlined in the following. 3.2.1. Variance Contributions of Individual Model Parameters. We used the method proposed in ref 25 and applied in ref 23 to estimate the variance contributions of individual parameters. The goal of this step is to select parameters that are most likely to be determined accurately, and to avoid estimating parameters that cannot be determined accurately. The sensitivity of the model outputs to the parameter vector is defined as combination of sensitivities at selected sample points: 2 3 Wp1=2 Sðp, t1 Þ 6 7 7 l SðpÞ ¼ 6 ð17Þ 4 5 1=2 Wp Sðp, tN Þ

3.40 36.52 0.06 ( 0.06

6.93

6.86

76.97

77.61 5.79 ( 1.22

6.96

6.86 7.15

76.71

77.58 75.15

ARTICLE

4.32 ( 8.89

7.79

9.32

90.71

79.99

8.22

6.32 103.30

0.039 449.00

108.30

goodness of fita [ 102] objective function [8]

Kammonium (g/L) (mol/gdw/h) b [ 104] (L/g) a Kethanol (g/L) Kfructose (g/L) Kglucose (g/L) Vmax(g/gdw/h) parameters

[7]

[ 104] Vammonium max [6] [5] [4] [3] [2] [1] estimated

Table 1. Estimated Parameter Values with Confidence Intervals, as Well as the Objective Function and Goodness-of-Fit Values, Using the Small-Scale Model with High-Variance Data

Industrial & Engineering Chemistry Research

where Wp is a diagonal weighting matrix of parameters and (t1, ..., tN) are selected sample points in time and S(p,t) is obtained from eq 14. The column vectors of S express the sensitivity of outputs to a particular parameter, and it can also be understood as the sensitivity direction of that parameter. The norm of the column is an indicator of the sensitivity, and the degree of the linear dependence between two columns expresses the similarity of the effects of parameters. The sensitivity ranking of parameters is done by QR decomposition with column permutation SE = QR for S, where E is the column permutation matrix. Sensitivity ranking is directly given by matrix E. By further redefining matrix R as R = DR, where D is a diagonal matrix, it can be used to analyze the variance contributions of individual parameters. The lower bound of the covariance of the estimators of deterministic parameters is given by the Cramer Rao inequality cov(p) g M1, where M is the Fisher information matrix. By assuming stochastic model error with Gaussian properties, the Fisher information matrix can be expressed as M = STS. The trace of the covariance matrix can be regarded as the variance of the parameter set: varðpÞ ¼ trðS̅ T S̅ Þ1

ð18Þ

The individual variance contributions of each parameter are given by varðpi Þ ¼

jjui jj2 di 2

ð19Þ

where ui is a column vector of R 1 and di is the corresponding diagonal element of D. More details can be found in refs 25 and 23. 3.2.2. Covariance and Confidence Regions of Estimators. Once parameters are selected and estimated, the variances of the estimated parameters at optimal solution must be obtained. The covariance of parameters is estimated according to ref 26 by assuming a linear model near the optimal solution: Vp ¼ H 1

ð20Þ

where H is the GaussNewton Hessian (eq 16). The variance of individual parameter values can be obtained from the diagonal of 12085

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research

ARTICLE

Table 2. Parameter Values with Confidence Intervals Estimated from Simulated Data Small-Scale Model

Genome-Scale Model

estimated from parameter

true value

nominal value

high-variance data 2.54 ( 0.09

estimated from low-variance data

true value

nominal value

estimated from

estimated from

high-variance data

low-variance data

Vmax(g/gdw/h)

3

2.7

3.08 ( 0.09

3

2.7

Kglucose (g/L)

30

10.8

27.7 ( 2.81

20

10.8

2.45 ( 0.15

3.21 ( 0.08 20.30 ( 2.21

Kfructose (g/L)

60

22.5

55.5 ( 5.47

60

22.5

61.30 ( 6.58

Kethanol (g/L)

50

41

46.0 ( 3.48

50

41

45.82 ( 5.15

a

0.26

0.24

0.25 ( 0.01

0.26 ( 0.01

0.7

0.8

b [ 104] (L/g) [ 104] Vammonium max

1 3

5.83 50

2.04 ( 0.66

0.98 ( 0.19 3.20 ( 0.57

1 30

5.83 50

0.2

5

0.22 ( 0.04

0.2

5

29.65 ( 0.82

(mol/gdw/h) Kammonium (g/L)

2.45 ( 0.15

1.01 ( 0.37

Figure 3. Predicted concentrations versus high-variance simulated data with the genome-scale model.

the covariance matrix. Using this covariance matrix, the confidence regions for the parameter estimates are given by ðptrue  pÞT Vp 1 ðptrue  pÞ e cðχÞ

ð21Þ

where c(χ) is the χ value of confidence with the number of parameters and degrees of freedom, and p* is the estimated parameter value. 3.2.3. Goodness of Fit. The goodness of the fit with different sets of estimated parameters was analyzed to discriminate between 2

parameter sets. We applied the method proposed by Stewart et al.,24 which uses reverse Bayesian statistics to measure the posterior goodness of fit. When the covariance matrix W of the measurements is known, with the assumption of normally distributed errors, the posterior probability of a candidate model Mk can be expressed: PðM k jY , ΣÞ  PðM k Þ2np, k =2 expð  ϕk Þ

ð22Þ

where Y represents the observed measurements, PðM Þk is the prior probability of the model k, np,k is the number of estimable parameters 12086

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research

ARTICLE

Figure 4. Predicted concentrations versus low-variance simulated data with the genome-scale model.

Table 3. Computational Performance on Simulated Dataa Small-Scale Model High-Variance Data

Genome-Scale Model Low-Variance Data

High-Variance Data

2 parameters 4 parameters 8 parameters 2 parameters 4 parameters 8 parameters number of NLP iterations 5 CPU time [s] 8 a

29 43

201* 356

5 8

135 174

334 677

2 parameters 11 199

Low Variance Data 2 parameters 4 parameters 6 132

150 2000

Numbers marked with an asterisk (*) indicate situations where the algorithm terminated because of discontinuous right-hand sides of the models.

in model k, and 2ϕk is the residual sum of squares in parameter estimation. This method gives penalty of 21/2 for the exponential term when adding a new parameter to the estimation. The nominal parameter values are used for the parameters that are not estimated. More details of the method can be found in ref 24. In the analysis, we set all prior probabilities to 1, and the residual sum of squares were normalized with the number of sample measurements.

4. ESTIMATION OF KINETIC PARAMETERS Two metabolic models were used: small-scale and genomescale. The small-scale model is taken from ref 19 as is. On the other hand, the genome-scale model (based on the iND750 metabolic network) contains many linearly dependent metabolites. These occur as linearly dependent constraints in the LP, which can be handled by most of today’s LP solvers. However, linear dependencies in matrix A cause problems in the sensitivity

calculation, e.g., when solving for R from eq 14. Therefore, we first removed the linear dependencies from A by doing a QR factorization, A = QR, and removing rows with zero values on the diagonal of R. Another feature with the iND750 network is that it leads to degenerate LP solutions. This causes problems in parameter estimation if some of the transfer fluxes between the intracellular and extracellular model have nonunique solutions. Hjersted et al. addressed similar issues when using the iND750 network in developing fed-batch operation strategies for ethanol production.12 To deal with this, we modified the FBA objective function by adding small coefficients to maximize ethanol and glycerol production, together with biomass. The coefficient in f was 1 for biomass, 103 for ethanol, and 106 for glycerol. This has very little effect on growth rate, and, in the case of nonunique solutions for ethanol and glycerol outlet fluxes, it fixes these fluxes to the largest possible values. 12087

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research Table 4. Nominal and Estimated Parameter Values with Confidence Intervals Estimated from Experimental Data parameter

nominal value

estimated value

Vmax (g/gdw/h)

2.7

8.41 ( 4.65

Kglucose (g/L) Kfructose (g/L)

10.8 22.5

Kethanol (g/L)

41

a

0.24

b [ 104] (L/g)

5.83

[ 104] (mol/gdw/h) Vammonium max

50

Kammonium (g/L)

5

ARTICLE

Table 5. Computational Performance on Experimental Winemaking Dataa Value property

1 parameter 2 parameters 3 parameters 4 parameters

number of NLP 3 CPU time [s] 6.30 ( 0.39

For these models, the usability and computational performance of the parameter estimation method was analyzed by estimating the kinetic parameters in the model described in Section 2.1, based on simulated data and experimental winemaking data from Pizarro et al.19 We were particularly interested in how many parameters can be estimated and which model best fits the data. Two intracellular models were considered for parameter estimation, using the optimization model from Section 2: a small-scale model and a genome-scale model. Moreover, the estimability of individual parameters was addressed, and the confidence intervals for the parameter estimates as well as the goodness of the fit of the model were analyzed. The results are presented, and differences between both models are compared. Finally, the computational performance of the parameter estimation method is analyzed. 4.1. Simulated Data. The parameter estimation method and the parameter selection algorithm were tested on simulated data with random noise. Four sets of test data were produced by simulating the model described by eqs 38 with parameters values shown in the “true value” column in Table 2 (shown later in this work), and by adding Gaussian noise with a covariance of Whigh= diag(0.04, 4, 4, 0.4, 4, 0.0004) and Wlow = 0.01 Whigh. This was done with both small-scale and genome-scale models. Since the simulated data are different for the models, direct comparisons between the models cannot be made. The parameter estimation and selection was done using the algorithm in Section 3.2. The performance of the algorithm is illustrated for the small-scale model with high variance data in Table 1. Each row in the table presents one parameter estimation step with a subset of parameters. The numbers in the first column indicate which parameters are estimated. Starting with the nominal values, the smallest covariance was for parameter b. The estimated value with the 95% confidence interval, objective function, and goodness of the fit are shown in the first data column of Table 1. Now the parameter variance contributions were analyzed for this new solution. The smallest variance contribution among the rest of the parameters was observed for parameter Vmax. Estimating parameters Vmax and b yields improvement in both the objective function and the goodness of fit, as shown in the last two columns. Adding a third parameter, Kglucose, improves the objective function but the goodness of fit does not improve. The same happens with the Kethanol parameter. Adding parameter a (in the fifth data column) improves the goodness of fit, and this is shown to be the best fit among all parameter subsets analyzed. Adding a fourth parameter improves the objective function, but the goodness of fit is not improved. Thus, the final selection of estimable parameters are Vmax, a, and b. The results also show that the confidence intervals increase

9

8285*

14161*

15

13480

20300

iterations

4.02 ( 2.48

7

a

Numbers marked with an asterisk (*) indicate the situations when the algorithm terminated on the objective function discontinuity.

when more parameters are added. The last row in Table 1 shows the estimated parameter values, the objective function, and the goodness of fit when all of the parameters are estimated. The results show that, by adding more parameters, the residual sum of squares gets smaller, but the confidence intervals of estimates, as well as the posterior properties, get worse. The same parameter estimation and selection procedure was also applied for the low-variance dataset. Now, the covariance of the data was significantly smaller and the best goodness of fit was achieved by estimating all eight parameters. The estimated values correspond well with the true parameter values, although some of the confidence intervals seem to be slightly too optimistic. The results are shown in Table 2. The predicted concentrations match well with the simulated data. A similar study was conducted with the genome-scale model. The best goodness of fit was achieved when estimating two parameters from high-variance data and six parameters from lowvariance data. Figures 3 and 4 compare the predicted concentrations to the low- and high-variance simulated measurements. Visually, the predicted concentrations match well in both cases, even though, in the high-variance case, only two parameters were estimated, and the parameter values are not close to their true values. Parameters a and b had very little effect on the model output at the current operating point, and thus estimating them in the low-variance case did not improve the goodness of fit. The final estimates with confidence intervals are shown in Table 2. Table 3 shows the number of iterations and CPU time for parameter estimation with both models and datasets. With largely independent parameters, the algorithm converges within only a few iterations. As the parameters become more dependent on each other, the number of iterations increased. For low-variance data, for the first three parameters, the algorithm converged with 512 iterations, four parameters required 29140 iterations, and for more than five parameters, the iterations varied from 20 to 500 iterations, depending on the starting point and parameters on both small-scale and genome-scale models. The number of iterations often correlates with large confidence intervals of the estimates. The numbers marked with (*) indicate situations where the algorithm terminated because of discontinuous right-hand sides of the models. In these cases, variables were perturbed by trial and error to help the algorithm jump over the discontinuity and converge to first-order optimality conditions. The iterations and CPU times in these cases are the sums of all runs. Computational performance is further discussed in Section 4.3. The CPU time difference between small-scale and genomescale models comes mainly from solving the LP at each time step, which is the most costly operation in both models. The sensitivity calculation is the second-most-expensive operation with the genome-scale model and is negligible with the small-scale model. 12088

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research

ARTICLE

Figure 5. Predicted concentrations of biomass, glucose, fructose, glycerol, and ethanol, compared with experimental data on the small-scale model.

The trapezoidal integration of ODEs is computationally fast and is the same for both models. The CPU times for the small-scale model are of the same order of magnitude as the results reported in ref 8. On the other hand, the simultaneous formulation of the parameter estimation problem with the genome-scale model would lead to an optimization problem with approximately a million variables and equations and would be beyond the capabilities of today’s laptop computer hardware. 4.2. Winemaking Data. The experimental winemaking data was obtained from Pizarro et al. and was generated from a 120-h anaerobic batch fermentation with measurements of the glucose, fructose, glycerol, ethanol, and biomass concentrations. The initial assimilable nitrogen concentration in the beginning of the batch is 0.3 g/L. In the small-scale model, ammonium was considered to be the only nitrogen source. More details on the data and experimental settings can be found in ref 19. The Covariance of Measurements. The covariance of the measurements was not known. We assumed that the different concentration measurements were independent, i.e. the nondiagonal values in the covariance matrix are zeros. The diagonal elements, which are the variances of individual measurements, were approximated from the difference between the best-fitting model and data. The parameter estimation method was used to fit one concentration at a time by varying all parameters. The measurement variance for measurement of e, σe2, can be expressed as σe 2 ¼

1 nt  np

∑i ðx ei  xei, m Þ2

ð23Þ

where xe is the concentration given by the best-fitting model, nt the number of sample points. and np the number of estimatable parameters. This procedure can yield a highly overfitted model with extremely unreliable parameter estimates; hence, it was only used to estimate the measurement variances. The covariance estimation was done with the small-scale model, yielding the following variance estimate of [σbiomass2, σglucose2, σfructose2, σglycerol2, σethanol2] = [0.180, 10.6, 5.60, 0.300, 15.9]. Parameter Estimation. Parameter selection and estimation was done with the small-scale model, in which ammonium is the only nitrogen source. The genome-scale model contains many nitrogen compounds, and, for accurate simulation, the most important nitrogen sources would have needed to be included. Because modifying the nitrogen uptake of the genome-scale model is beyond the scope of this paper, it has been omitted here. Parameter selection and estimation was done similarly to that described in Section 4.1, and the results are presented in Tables 4 and 5 and Figure 5. The best goodness of fit was achieved by estimating the parameters Vmax, Kethanol, and b. The residual sum of squares was improved when more parameters were added, but neither the goodness of fit nor confidence intervals of the estimates decreased. The confidence intervals for the estimates, especially for the Vmax and Kethanol parameters, are already large, and by adding more parameters, they became even larger. These results highlight the risk of overparameterizing the model. The computational performance of the algorithm was similar to that shown in Section 4.1. The results in Table 5 show that, with few parameters, the algorithm 12089

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research Table 6. Comparison of the Computational Performance between Sequential Gradient-Based, Derivative-Free, and Simultaneous Solutionsa

number of

sequential

sequential

gradient-based

derivative-free

simultaneous

solution

solution

solutionb

8

8

6

201*334

18643234

325752

356677

600910955

133463

parameters number of iterations CPU time [s] a

Number marked with an asterisk (*) indicates a situation where the algorithm terminated because of the discontinuous right-hand side of the model. b Data taken from Raghunathan et al.8

converged with few iterations. However, with more-dependent parameters, more iterations were needed. The numbers marked with an asterisk (*) indicate the situations when the algorithm terminated on the objective function discontinuity, and manual perturbation for the parameters was done to help the algorithm converge properly. The large number of iterations in these cases is also partly due to this discontinuous objective function. 4.3. Computational Performance. Computational efficiency was the key motivation for the development of the proposed DFBA parameter estimation strategy. As seen in sections 4.1 and 4.2, the CPU time requirement is proportional to the size of the metabolic network and NLP iterations. The difference in CPU times between small and genome-scale models comes mainly from solving the LP at each time step, which is the most costly operation with both models. The sensitivity calculation is the second most expensive operation with the genome-scale model and is negligible with the small model. The trapezoidal integration of ODEs is computationally fast, and the cost is the same for both models, and can be run on today’s regular laptop hardware. We compared the performance of our parameter estimation method with the simultaneous method and sequential gradientfree methods.8,1214 Because the CPU time for sequential approach with derivative-free optimization is not reported in these papers, we estimated all eight parameters of the small-scale model from low-and high-variance simulated data, using the derivative-free fminsearch MATLAB function. Table 6 shows a comparison between the three methods. The results show an order-of-magnitude decrease in iterations and CPU time for the gradient-based method, in comparison to the derivative-free method. Compared with the simultaneous approach, the CPU values are close to those reported in Raghunathan et al.8 However, the simultaneous formulation with the genome-scale model would lead to an optimization problem with approximately one million variables and equations and would be beyond of the capabilities of today's laptop hardware. We note that the model, parameters, and data are different in Raghunathan et al.,8 but it is assumed that the computational cost is dominated by the problem side rather than the choice of individual parameters.

5. CONCLUSIONS A parameter estimation method for dynamic flux balance models (DFBA) models was developed, based on sequential solution with direct sensitivity equations. The sensitivity

ARTICLE

equations were developed by formulating the model as a differential and algebraic equation (DAE) system by substituting stationary conditions of the linear program (LP) to the model. The sensitivity equations allow the use of gradient-based optimization methods that usually require significantly fewer iterations than derivative-free optimization methods. The extra computational cost coming from the sensitivity calculation is low, because of trapezoidal integration of ODEs. The main computational cost arises from solving the LP described by eq 1 at each time step. The algorithm proved to be computationally efficient with both small-scale and genome-scale DFBA models, and can be run on today’s regular laptop hardware. The proposed parameter estimation strategy for DFBA models was complemented with parameter estimability analysis and parameter selection strategy. Kinetic parameters of a DFBA model can be highly dependent, and all parameters cannot be reliably identified. In these cases, one would need to include more experimental data in the parameter estimation or estimate only a subset of parameters. In this study, the preselection of estimated parameters was based on their individual variance contributions before parameter estimation. The final selection was done after the estimation by selecting parameter sets that yield the best goodness of fit, with the confidence regions of estimates within specified bounds. The goodness-of-fit criteria includes a penalty term for the number of parameters and thus the residual sum of squares has to decrease enough in order to justify the inclusion of a new parameter. Additionally, the confidence regions of all the estimates have to lie within the prespecified bounds in order to be accepted as the final set of estimates. This procedure yielded a minimal set of estimated parameters that produce the best goodness of fit with acceptable confidence intervals for estimates. We conclude that proposed parameter estimation strategy makes efficient DFBA parameter estimation available also for genome-scale models. The parameter selection method in turn improves the reliability of the estimates by reducing the risk of overfitting the data. Future research is still needed in utilizing DFBA models in process optimization using the proposed solution strategy. First, better treatment of active set changes in the LP is needed so that continuous objective functions and derivatives can be obtained. Second, substituting second-order sensitivity equations instead of the Gauss-Newton Hessian would lead to better convergence properties, but at the expense of extra work. Finally, further work is needed on model validation and discrimination between rival DFBA models (e.g., small and genome-scale model, and with different subsets of estimated parameters) in order to find the model that is most likely to yield the best prediction. While the parameter selection method presented in this paper is statistically sound, it requires the user to specify the acceptable limits for the confidence intervals of the estimates.

’ AUTHOR INFORMATION Corresponding Author

*E-mail: [email protected].

’ REFERENCES (1) Dobson, P. D.; Smallbone, K.; Jameson, D.; Simeonidis, E.; Lanthaler, K.; Pir, P.; Lu, C.; Swainston, N.; Dunn, W. B.; Fisher, P.; Hull, D.; Brown, M.; Oshota, O.; Stanford, N. J.; Kell, D. B.; King, R. D.; Oliver, S. G.; Stevens, R. D.; Mendes, P. Further developments 12090

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091

Industrial & Engineering Chemistry Research towards a genome-scale metabolic model of yeast. BMC Syst. Biol. 2010, 4, 145. (2) Becker, S. A.; Feist, A. M.; Mo, M. L.; Hannum, G.; Palsson,  B. O.; Herrgard, M. J. Quantitative prediction of cellular metabolism with constraint-based models: The COBRA Toolbox. Nat. Protocols 2007, 2, 727–738. (3) Wilkinson, S. J.; Benson, N; Kell, D. B. Proximate parameter tuning for biochemical networks with uncertain kinetic parameters. Mol. Biosyst. 2008, 4, 74–97. (4) Brown, M.; He, F.; Wilkinson, S. J. Properties of the proximate parameter tuning regularization algorithm. Bull. Math. Biol. 2010, 72 (3), 697–718. (5) Dimelow, R. J.; Wilkinson, S. J. Control of translation initiation: a model-based analysis from limited experimental data. J. R. Soc. Interface 2009, 6, 51–61. (6) Gadkar, K. G.; Varner, J.; Doyle, F. J. Model identification of signal transduction networks from data using a state regulator problem. IEE Proc. Syst. Biol. 2005, 2, 17–30. (7) Lillacci, G.; Khammash, M. Parameter estimation and model selection in computational biology. PLoS Comput. Biol. 2010, 6 (3), No. e1000696 (DOI: 10.1371/journal.pcbi.1000696). (8) Raghunathan, A. U.; Perez-Correa, J. R.; Agosin, E.; Biegler, L. T. Parameter estimation in metabolic flux balance models for batch fermentation-formulation and solution using differential variational inequalities. Ann. Oper. Res. 2006, 148, 251–270. (9) W€achter, A.; Biegler, L. On the implementation of an interiorpoint filter linesearch algorithm for large-scale nonlinear programming. Math. Program. 2004, 106, 25–57. (10) Fourer, R.; Gay, D. M.; Kernighan, B. W. AMPL: A Modeling Language for Mathematical Programming ; Scientific Press Series; Boyd & Fraser Publishing Co.: Danvers, MA, 1993. (11) Hjersted, J. L.; Henson, M. A. Optimization of fed-batch Saccharomyces cerevisiae fermentation using dynamic flux balance models. Biotechnol. Prog. 2006, 22, 1239–1248. (12) Hjersted, J. L.; Henson, M. A.; Mahadevan, R. Genome-Scale Analysis of Saccharomyces cerevisiae Metabolism and Ethanol Productionin Fed-Batch Culture. Biotechnol. Bioeng. 2007, 97 (5), 11901204. (13) Hjersted, J. L.; Henson, M. A. Steady-state and dynamic flux balance analysis of ethanol production by Saccharomyces cerevisiae. IET Syst. Biol. 2009, 3 (3), 167–179. (14) Kaplan, U.; Turkay, M.; Biegler, L.; Karas€ozen, B. Modelling and simulation of metabolic networks for estimation of biomass accumulation parameters. Discrete Appl. Math. 2009, 1757, 2483– 2493. (15) Stewart, D.; Anitescu, M. Optimal control of systems with discontinuous differential equations. Numer. Math. 2010, 114, 653–695. (16) Varma, A.; Palsson, B. O. Stoichiometric flux balance models quantitatively predict growth and metabolic byproduct secretion in wildtype Escherichia coli. Appl. Environ. Microbiol. 1994, 60, 3724–3731. (17) Mahadevan, R; Edwards, J. S.; Doyle, F. J. Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys. J. 2002, 83 (3), 1331–1340. (18) Sainz, J.; Pizarro, F.; Perez-Correa, J.; Agosin, E. Modeling of yeast metabolism and process dynamics in batch fermentation. Biotechnol. Bioeng. 2003, 81, 818–828. (19) Pizarro, F.; Varela, C.; Martabit, C.; Bruno, C.; Perez-Correa, J. R.; Agosin, E. Coupling kinetic expressions and metabolic networks for predicting wine fermentations. Biotechnol. Bioeng. 2007, 98 (5), 986998.  (20) Duarte, N. C.; Herrgard, M. J.; Palsson, B. O. Reconstruction and Validation of Saccharomyces cerevisiae iND750, a Fully Compartmentalized Genome-Scale Metabolic Model. Genome Res. 2004, 14, 1298–1309. (21) Mo, M. L.; Palsson, B. O.; Herrgard, M. J. Connecting extracellular metabolomic measurements to intracellular flux states in yeast. BMC Syst. Biol. 2009, 3, 37. (22) Burgard, A. P.; Pharkya, P.; Maranas, C. D. Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol. Bioeng. 2003, 84, 647–657.

ARTICLE

(23) Lin, W.; Biegler, L. T.; Jacobson, A. M. Modeling and Optimization of Seed Polymerization Process. Chem. Eng. Sci. 2010, 65, 4350–4362. (24) Stewart, W. E.; Shon, Y.; Box, G. E. P. Discrimination and goodness of fit for multiresponse mechanistic models. AIChE J. 1998, 44 (6), 1404–1412. (25) Lund, B.; Foss, B. Parameter ranking by orthogonalizationapplied to nonlinear mechanistic models. Automatica 2008, 44, 278– 281. (26) Bard, Y. Nonlinear Parameter Estimation; Academic Press: New York and London, 1974.

’ NOTE ADDED AFTER ASAP PUBLICATION After this paper was published online August 30, 2011, several corrections were made to the text. The revised version was published October 4, 2011.

12091

dx.doi.org/10.1021/ie201020g |Ind. Eng. Chem. Res. 2011, 50, 12080–12091