Ind. Eng. Chem. Res. 2005, 44, 868-878
Toward Global Parametric Estimability of a Large-Scale Kinetic Single-Cell Model for Mammalian Cell Cultures

Fabio R. Sidoli,†,‡ Athanasios Mantalaris,*,†,‡ and Steven P. Asprey§

Centre for Process Systems Engineering and Department of Chemical Engineering and Chemical Technology, Imperial College London, South Kensington Campus, London SW7 2AZ, U.K., and Orbis Investment Advisory Ltd., Orbis House, 5 Mansfield Street, London W1G 9NG, U.K.
A practical strategy is presented that addresses the related issues of model parameter identifiability and estimability and that was applied to a large-scale, dynamic, and highly nonlinear biological process model describing the metabolic behavior of mammalian cell cultures in a continuous bioreactor. The model used consists of 27 inputs, 32 outputs, and more than 350 parameters; is compartmental in nature; and represents the state of the art in terms of model complexity and fidelity. The strategy adopted falls under the scope of estimability and comprises two parts: (a) a parameter perturbation study that singly perturbs parameters under a number of deterministically sampled model input vectors and consequently partitions them into those that yield significant changes in the outputs (the estimable parameter set) and those that do not, and (b) subsequent evaluation of Monte Carlo estimates of global sensitivity indices of these two sets, which quantitatively assess the amount of parameter sensitivity contained both within and between the sets. Of the 357 parameters, 37 were found to be estimable to within at least ±25% of their nominal parameter values and, under nominal experimental conditions, accounted for 48% of the model's sensitivity. The remaining 320 parameters accounted for just 4% of the model's sensitivity. As expected, significant interactions were found to exist between these two sets: interactions of the estimable parameter set with the inestimable set accounted for 48% of the model's sensitivity.

Introduction

Mammalian cell cultures are a major source of a number of very high value pharmaceutical products that include monoclonal antibodies, viral vaccines, and hormones. The productivity of these cultures is generally low because the culture conditions are highly specialized and the cells are especially susceptible to either reduced productivity or death as a result of minor deviations in the culture conditions. The complexity of the underlying biology that governs intracellular processes and physiological responses to the external environment is immense, making accurate prediction of culture behavior a formidable challenge. Hence, any robust predictive mathematical model needs to capture a significant degree of internal structure before it can accurately predict culture behavior. Kinetic models, being dynamic models, are the only class of models that can incorporate the time-evolving metabolic regulation that is characteristic of cellular processes. So-called "structured" kinetic models have been developed to simulate whole single-cell metabolism,1,2 the most recent of which was proposed by Sanderson.3 The model by Sanderson3 supersedes previous models in terms of cellular structure, offering increased fidelity to the real system, which, in turn, requires additional model state variables, equations, and parameters.
* To whom correspondence should be addressed. E-mail: [email protected].
† Centre for Process Systems Engineering, Imperial College London.
‡ Department of Chemical Engineering and Chemical Technology, Imperial College London.
§ Orbis Investment Advisory Ltd.
An inherent problem with the kinetic approach highlighted by this model is the accurate determination of the model parameters, in this case numbering over 350. It is the numerical values of these parameters that govern model precision once the model structural form is established. These model parameters need to be estimated from subsequent model validation experiments. However, before any such experiments are performed, it is worth ascertaining whether the model can be validated, in theory, from data in the noise-free case and, in practice, under some assumptions about noise corruption. The answers to these two questions can influence the decision as to whether to carry out experiments and can provide valuable information relating to the validity of point estimates computed from parameter estimation procedures. Omitting this a priori model development step and proceeding immediately to experimental validation implicitly assumes that the model can be validated and therefore also assumes certain model properties to be true, which might not, in fact, be the case. This step and the properties it establishes constitute model parameter identifiability and estimability.

Parameter identifiability is concerned with establishing whether the values of measured model variables (outputs) correspond to unique parameter values (see Walter4). That is, can unique parameter estimates be recovered, in theory, from noise-free output data? If a model is unidentifiable, then several parameter vectors exist that correspond to exactly the same input-output behavior of the model, and elimination of any of them is impossible from (even noise-free) data alone.
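As a simple illustration of unidentifiability (not drawn from the cell-culture model considered here), suppose a model has the single output y(t) = θ1θ2 exp(-θ3t). Any two parameter vectors sharing the same product θ1θ2 (and the same θ3) produce exactly the same output trajectory, so even noise-free data can recover only the product; the individual values of θ1 and θ2 cannot be distinguished, and the model is unidentifiable in these two parameters.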
Parameter estimability has a broader definition and involves analyses concerned with actual samples or sampling error; it includes, for example, estimation of variances and the impact of cross-correlation of parameters.5 For relatively simple models, identifiability theory is well developed, and rigorous structural identifiability analysis can be applied (see Walter4). For models that are nonlinear in the inputs or parameters but have few parameters and equations, techniques exist that make them amenable to analysis.6 These techniques are based on series expansions (Taylor series expansion of the outputs, generating series expansions of the inputs) and differential algebra (transformation of a nonlinear model to a linear model, similarity transformations). In contrast, very few techniques exist for arbitrarily large nonlinear models consisting of sets of differential-algebraic equations (DAEs) with many inputs, outputs, and parameters.7 One method for determining local identifiability numerically was presented by Jacquez and Perry;8 more recently, Asprey and Macchietto9 developed an optimization-based method for global identifiability, but there are no reports on the application of these techniques to very large models.

In this paper, we consider parametric estimability of a large dynamic model for mammalian cell cultures. The objective of the study is to assess global properties of the model with respect to the inputs and parameters rather than local properties, which are of limited practical use. Initially, the global identifiability of the model was examined; however, the computationally intensive nature of the optimization-based approach, together with the practical problems associated with successfully performing large-scale optimizations, led us to adopt the method presented in this paper. Because aspects of this new method originate from the original identifiability problem, the latter is briefly outlined so as to make clear how the estimability method was conceived. An outline of the methodology used for the solution of the global identifiability problem and the subsequent global estimability problem, incorporating parameter perturbation and global sensitivity studies, is presented. Details of the mathematical model considered are provided, and the results from the estimability analysis are presented and discussed.

Assessing the Global Estimability of Parameters

Individual Parameter Perturbations. Theoretical model identifiability seeks to establish whether parameter estimates can be uniquely identified assuming noise-free output data, where "noise" refers to nonsystematic measurement errors.10 The qualifier "theoretical" emphasizes the fact that this analysis is to be performed a priori of any experimentation and actual data gathering. In the literature, work on model identifiability applicable to large nonlinear dynamic models with many parameters, inputs, and outputs is scarce.8,9 Furthermore, to the authors' knowledge, there are no reports of the application of such methods to large (mammalian cell culture) models such as that considered in this paper. Initially, the optimization-based method proposed by Asprey and Macchietto9 was adopted; the mathematical problem statement is as follows. A model giving output trajectory y(θ,u(t),ω,t) is globally identifiable if, for any two parameter sets θ ∈ Θ and θ* ∈ Θ*, a time horizon of interest t ∈ [0, t_f], all system inputs u(t) ∈ U and ω ∈ Ω, and the same initial conditions y(t = 0), the global maximum
$$\Phi_I = \max_{\theta \in \Theta,\; \theta^* \in \Theta^*} \; (\theta - \theta^*)^{\mathrm{T}} W_\theta (\theta - \theta^*) \qquad (1)$$

subject to

$$\int_0^{t_f} \left[ y(u(t),\omega,\theta,t) - y(u(t),\omega,\theta^*,t) \right]^{\mathrm{T}} W_y \left[ y(u(t),\omega,\theta,t) - y(u(t),\omega,\theta^*,t) \right] \mathrm{d}t < E_y \qquad \forall\, u(t) \in U,\; \omega \in \Omega \qquad (2)$$

$$f(y(t), \dot{x}(t), x(t), u(t), \omega, \theta, t) = 0 \qquad (3)$$

$$\theta_i^{L} \leq \theta_i \leq \theta_i^{U}, \qquad i = 1, \ldots, P \qquad (4)$$

$$\theta_i^{*L} \leq \theta_i^{*} \leq \theta_i^{*U}, \qquad i = 1, \ldots, P \qquad (5)$$
gives a value of Φ_I below a prescribed tolerance, where y is the M-dimensional vector of model outputs; θ is the P-dimensional vector of model parameters; u(t) is an n_u-dimensional vector of time-varying model inputs; ω is an n_ω-dimensional vector of time-invariant model inputs; x(t) is an n_x-dimensional vector of model state variables; f is a vector of model equations, in this case a DAE system; and t is time.

This is a dynamic optimization problem that must be solved globally. However, for the particular model considered in this paper, such an approach is very challenging for modern optimization techniques. First, the model size is very large, requiring the solution of a DAE system of nearly 1000 equations. Second, the optimization problem itself is very large and is to be performed in a space of more than 700 optimization variables. To add further to the difficulty, poor scaling also causes problems, and reformulation is a prohibitively lengthy and cumbersome task. In addition to these numerical considerations, eq 2 has a major limitation. This constraint is critical: its purpose is to ensure that the solution does not include regions of parameter space that yield model outputs that are, in fact, measurably different, i.e., identifiable. Its major limitation is that it might be dominated by the accumulation of many terms, such that no single output separation is detectable yet the summation across all outputs exceeds the tolerance. It would be more accurate for a similar constraint to hold for each output individually; however, this would lead to an even more difficult optimization problem.

These principal reasons necessitated a different approach to the problem, one that retained consideration of all inputs, parameters, and outputs but still rendered the analysis workable. Instead, the problem was formulated so that a finite, a priori known number of simulations is required. The new problem is strictly one of determining parameter estimability but makes use of the mathematical objects presented for the global identifiability problem (eqs 1-5). In any alternative analysis, it is important to maintain focus on assessing the effect of interval changes in the parameter space, as this is directly relevant to the unique identification of the parameter estimate. Clearly, computation of the effects of changing inputs and parameters is of central importance, and so use is still made of the constraint in eq 2. However, the question now arises as to how best to search the input space for control policies leading to constraint violation, that is, to output separations exceeding the assumed experimental noise. This is done by testing a representative sample of the admissible input space, which necessitates the a priori selection of a set of control vectors u_j(t) and ω_j covering the combined feasible input space U ∪ Ω. A deterministic sampling strategy was used that selects
N_sp samples in an (n_u + n_ω)-dimensional sample space, making use of a low-discrepancy number sequence known as the Sobol' sequence.11 A low-discrepancy number sequence tends to sample the space more uniformly than random numbers and avoids clustering, which means that coverage of the space is generally more effective. In addition, because a Sobol' sequence is deterministic, a progressively refined coverage can be obtained without having to reinitialize sampling of the space (i.e., a priori selection of the sample size is not required). Its foremost advantage is that projection of the N_sp sample points onto any lower-dimensional subspace results in N_sp unique points in the reduced space.

Consideration must also be given to the parameter separations (θ - θ*) that, in the optimization-based approach (Asprey and Macchietto9), were initially set and then determined by the optimizer. One possibility is to treat the admissible parameter space Θ in the same way as the inputs, by taking a sample of separations. Although this would, in effect, perform the optimization by manual exploration, the dimensionality of Θ is such that exploration of the constraint in the product of the U ∪ Ω and Θ spaces would require an intractably large number of samples and therefore simulations. Instead, for each input vector j [i.e., u_j(t) and ω_j], each parameter in the vector θ is sequentially perturbed from its upper bound θ_i^U to its lower bound θ_i^L while all other parameters are held at their nominal values, θ_i^0. This requires P perturbations (where P = dim{θ}) for each input vector j; thus, for N_sp sample input vectors, PN_sp perturbations are required. Each perturbation involves a single simulation of two model instances. Perturbation of the ith parameter with application of input vector j yields a value for the kth output separation, denoted Ψ_ijk, given by
$$\Psi_{ijk} = \frac{\int_0^{t_f} \left[ y_k(u_j(t),\omega_j,\theta_i^{U},t) - y_k(u_j(t),\omega_j,\theta_i^{L},t) \right] \mathrm{d}t}{\int_0^{t_f} y_{\mathrm{range},k}\, \mathrm{d}t} \qquad (6)$$

where

$$y_{\mathrm{range},k} = \max_t \left[ y_k(t),\, y_k^{*}(t) \right] - \min_t \left[ y_k(t),\, y_k^{*}(t) \right] \qquad (7)$$
Equation 7 dynamically normalizes the output separations according to the global maximum and minimum values of the model outputs y(t) or y*(t) over their entire trajectories. This allows direct comparison between the different outputs. For a parameter to be estimable within the parameter perturbation range, it must result in at least one model output separation Ψ_ijk being measurably different (i.e., exceeding a tolerance) under at least one realizable set of inputs u_j(t) and ω_j when its nominal value is perturbed between its admissible lower and upper bounds. Mathematically, we check for violation of output tolerances that apply to each output:
$$\Psi_{ijk} \geq E_{y_k} \qquad (8)$$

The practical meaning of this definition is that there is at least one input vector or "experiment" that can be implemented that will enable the parameter to be estimated to within an inference region of ±25% of the parameter's nominal value, assuming that the output under which it is estimable is measured. Furthermore, because the entire input space is explored and the parameter perturbation range is not local, the property of being estimable is global. The algorithm used is set out in Figure 1 and was implemented in C++ code. Simulation of the model is achieved through use of the gPROMS equation set object. On the basis of this study, the model parameter set θ can be partitioned into parameters deemed estimable, θ_E, and those deemed inestimable, θ_I, so that

$$\dim\{\theta\} = \dim\{\theta_E\} + \dim\{\theta_I\} \qquad (9)$$

Figure 1. Computer algorithm used for the perturbation study.
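To make the procedure of Figure 1 concrete, the following is a minimal Python sketch of the same logic under stated assumptions. In the actual study the algorithm was C++ code driving the gPROMS equation set object for the full DAE model of nearly 1000 equations; here a small two-output algebraic function stands in for the model, and the function name simulate_model, the input and parameter bounds, the tolerance eps_y, and the sample size N_sp are all illustrative choices rather than values taken from the paper.

```python
# Minimal sketch of the perturbation study summarized in Figure 1.
# Assumptions: simulate_model is a two-output algebraic toy stand-in for the
# real ~1000-equation DAE cell-culture model (which was simulated in gPROMS
# and driven from C++); the bounds, tolerance eps_y, and N_sp are illustrative.
import numpy as np
from scipy.stats import qmc
from scipy.integrate import trapezoid


def simulate_model(u, theta, t):
    """Toy stand-in returning output trajectories y_k(t), shape (n_outputs, n_times)."""
    y1 = theta[0] * u[0] * (1.0 - np.exp(-theta[1] * t))
    y2 = theta[2] * u[1] * t / (theta[3] + t)
    return np.vstack([y1, y2])


theta_nom = np.array([1.0, 0.5, 2.0, 1.5])                  # nominal parameter values
theta_lo, theta_hi = 0.75 * theta_nom, 1.25 * theta_nom     # +/-25% perturbation range
u_lo, u_hi = np.array([0.1, 0.1]), np.array([2.0, 2.0])     # admissible input box (U and Omega)
t = np.linspace(0.0, 10.0, 201)                             # time horizon [0, t_f]
eps_y = 0.05                                                # output tolerance in eq 8 (assumed)
N_sp = 64                                                   # number of Sobol' input samples

# Deterministic low-discrepancy (Sobol') coverage of the combined input space.
sampler = qmc.Sobol(d=u_lo.size, scramble=False)
u_samples = qmc.scale(sampler.random(N_sp), u_lo, u_hi)

estimable = np.zeros(theta_nom.size, dtype=bool)
for u in u_samples:                                         # input vector j
    for i in range(theta_nom.size):                         # singly perturbed parameter i
        th_up, th_dn = theta_nom.copy(), theta_nom.copy()
        th_up[i], th_dn[i] = theta_hi[i], theta_lo[i]
        y_up = simulate_model(u, th_up, t)
        y_dn = simulate_model(u, th_dn, t)

        # eq 7: range of each output over both trajectories
        y_all = np.stack([y_up, y_dn])
        y_range = y_all.max(axis=(0, 2)) - y_all.min(axis=(0, 2))

        # eq 6: normalized output separation Psi_ijk (magnitude used here)
        num = trapezoid(y_up - y_dn, t, axis=1)
        den = y_range * (t[-1] - t[0]) + 1e-12              # guard against zero range
        psi = np.abs(num / den)

        if np.any(psi >= eps_y):                            # eq 8: measurable separation
            estimable[i] = True

theta_E = np.flatnonzero(estimable)                         # estimable parameter indices
theta_I = np.flatnonzero(~estimable)                        # inestimable parameter indices
print("theta_E:", theta_E, "theta_I:", theta_I)
```

For the full model, each call to simulate_model would correspond to one gPROMS simulation of the DAE system, and the double loop reproduces the PN_sp perturbations (two model instances each) described above, here taken over all 357 parameters and the N_sp sampled input vectors.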
With this, we can expect, at best, to be able to estimate (to within at least ±25% of nominal) parameter values for the estimable set, θ_E, under the most informative experiment. Furthermore, we know that the model properties are such as to make refined inference of the parameters in θ_I difficult or impossible, depending on the values of the output separations. A limitation of this approach is that the effects of parameter interactions are neglected: because parameters are singly perturbed, only first-order parameter sensitivity information is captured. Extending the approach to handle second- and higher-order terms ("interactions") makes for an intractable analysis. Although no attempt is made to account for these interactions in the method described so far, the effect of interactions existing in the sets resulting from the parameter perturbation study is quantified. This aspect is discussed next.

Quantifying the Sensitivity and Interaction between Parameter Sets. Having isolated a set of estimable parameters (θ_E) and an inestimable set (θ_I), it is important to quantify the sensitivity contained within the sets. This enables us to determine how much sensitivity is retained by consideration of only the estimable parameters, information that is useful for model reduction. With a model as highly parametrized as this one, we would expect parameter interactions to be as significant as the actions of individual parameters. The sensitivities of each set can be calculated via computation of global sensitivity indices.11 A global
sensitivity index can be defined as the ratio of variances
$$S_{i_1,\ldots,i_s} = \frac{D_{i_1,\ldots,i_s}}{D} \qquad (10)$$
where D_{i_1,...,i_s} is the variance in the model output associated with simultaneous changes in parameters i_1, ..., i_s and D is the variance associated with the simultaneous change of all model parameters. First-order indices S_{i_s} account for the sensitivities of the individual parameters, and second-order (S_{i_r i_s}) and higher-order (S_{i_r,...,i_s}) indices account for the effects of interactions of two or more parameters. The indices are described as global for two reasons: (a) the effects of changing parameters simultaneously are considered, as opposed to individual and isolated parameter changes, which would warrant the qualifier local, and (b) parameter changes may span the entire admissible region of the parameter space rather than a neighborhood of the nominal point, which again would be described as local. Using E and I to denote the estimable and inestimable parameter sets, these two sets are treated as macroscopic parameters, and their influence on the model outputs is contained in the following four pertinent sensitivity indices:
$$S_E = \sum_{i_s \in E} S_{i_s}$$

$$S_E^{\mathrm{tot}} = \sum_{\{i_1,\ldots,i_s\}\,\cap\,E\,\neq\,\emptyset} S_{i_1,\ldots,i_s}$$

$$S_I = \sum_{i_s \in I} S_{i_s}$$
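To illustrate how such grouped indices can be estimated in practice, the following is a minimal Python sketch of a Monte Carlo (pick-freeze) calculation of S_E, S_I, and S_E^tot for a cheap scalar toy response. The response function, parameter box, set memberships, sample size, and the particular estimators used (a Saltelli-type estimator for the first-order terms and the Jansen estimator for the total effect of the estimable set) are illustrative assumptions and not the authors' implementation, which follows the Monte Carlo formulation of global sensitivity indices cited in the text (ref 11).

```python
# Minimal sketch of Monte Carlo estimation of the grouped indices S_E, S_I,
# and S_E^tot for a scalar toy response. Response function, bounds, set
# memberships, sample size, and estimators are illustrative assumptions.
import numpy as np
from scipy.stats import qmc


def response(theta):
    """Toy scalar model response; theta has shape (N, 4)."""
    return (theta[:, 0] * theta[:, 1]
            + 0.3 * theta[:, 2] ** 2
            + 0.05 * theta[:, 3]
            + 0.5 * theta[:, 0] * theta[:, 2])


d, N = 4, 2 ** 14                                   # number of parameters, MC sample size
lo, hi = np.zeros(d), np.ones(d)                    # admissible parameter box (assumed)
E_set, I_set = [0, 1], [2, 3]                       # "estimable" and "inestimable" sets

# Two independent (scrambled Sobol') sample matrices A and B over the parameter space.
A = qmc.scale(qmc.Sobol(d, scramble=True, seed=1).random(N), lo, hi)
B = qmc.scale(qmc.Sobol(d, scramble=True, seed=2).random(N), lo, hi)
fA, fB = response(A), response(B)
D_total = np.var(np.concatenate([fA, fB]))          # total variance D


def replace_cols(base, other, cols):
    """Copy of `base` with the columns in `cols` taken from `other` (pick-freeze)."""
    out = base.copy()
    out[:, cols] = other[:, cols]
    return out


# First-order indices S_i (eq 10 with single indices), then summed over each set.
S1 = np.empty(d)
for i in range(d):
    fABi = response(replace_cols(A, B, [i]))        # A with column i taken from B
    S1[i] = np.mean(fB * (fABi - fA)) / D_total
S_E, S_I = S1[E_set].sum(), S1[I_set].sum()

# Total effect of the estimable set: all terms involving at least one parameter in E.
fABE = response(replace_cols(A, B, E_set))
S_E_tot = 0.5 * np.mean((fA - fABE) ** 2) / D_total

print(f"S_E = {S_E:.3f}, S_I = {S_I:.3f}, S_E^tot = {S_E_tot:.3f}")
```

For the cell-culture model, the toy response would be replaced by a functional of the simulated output trajectories, and the sample matrices A and B would span the admissible ranges of all 357 parameters.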