Constrained Version of the Dynamic Response ... - ACS Publications

May 1, 2019 - apply it to the modeling of several pharmaceutical case studies provided by our industrial .... Here, we briefly describe the DRSM-2 mod...
0 downloads 0 Views 4MB Size
Article pubs.acs.org/IECR

Cite This: Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Constrained Version of the Dynamic Response Surface Methodology for Modeling Pharmaceutical Reactions Yachao Dong,† Christos Georgakis,*,† Jason Mustakis,‡ Joel M. Hawkins,‡ Lu Han,‡ Ke Wang,‡ Jonathan P. McMullen,§ Shane T. Grosser,§ and Kevin Stone§ †

Systems Research Institute, Tufts University, Medford, Massachusetts 02155, United States Pfizer Worldwide R&D, Groton, Connecticut 06340, United States § Process Research and Development, Merck & Co., Inc., Rahway, New Jersey 07065, United States Downloaded via MOUNT ALLISON UNIV on August 4, 2019 at 14:04:41 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.



S Supporting Information *

ABSTRACT: The Dynamic Response Surface Methodology (DRSM; Klebanov, N.; Georgakis, C. Ind. Eng. Chem. Res 2016, 55, 4022) is a data-driven modeling methodology for time-resolved measurements from Design of Experiments data sets. To extend its use to semi-infinite domains, a second version was proposed (Wang, Z. Y.; Georgakis, C. Ind. Eng. Chem. Res 2017, 56, 10770), which uses an exponential transformation of time as the independent variable. Here, we further improve upon this later version by imposing regression constraints, motivated by our qualitative understanding of the physical nature of the reaction system. We apply it to the modeling of several pharmaceutical case studies provided by our industrial collaborators. This constrained DRSM approach is able to model these compositional data more parsimoniously and with greater accuracy than the initial DRSM, eliminating the existence of oscillations near the end of some experiments from the model predictions.

1. INTRODUCTION For many different processes, especially in the pharmaceutical and the biopharmaceutical industries where a detailed knowledge-driven model might not be available, it is essential to develop data-driven models.1,2 Such data-driven models are instrumental for understanding the processes in multiple aspects, such as quantifying the effects of different parameters on the outputs and optimizing the operating conditions.3−6 This is often achieved using the Response Surface Methodology (RSM), a companion modeling approach to the Design of Experiments (DoE) methodology. Hanrahan and Lu7 have surveyed numerous applications of the classical DoE and RSM tools for estimating such models, including studies for drug screening, drug absorption prediction, batch fermentation, and high performance liquid chromatography separations, among many others.7,8 The recently proposed Design of Dynamic Experiments (DoDE)9 and the Dynamic Response Surface Methodology (DRSM) models10,11 are two important and powerful extensions of the classical DoE and RSM methodologies. The DoDE and DRSM methodologies have the advantages of allowing time-varying inputs to the experiments and also efficiently modeling the time-resolved output measurements. Other methods of addressing time-resolved measurements include semiparametric regression,12 generalized additive models,13 Gaussian processes,14 and machine learning techniques.15 In this paper, we study the improved DRSM methodology in modeling the data obtained from chemical reactions of small © 2019 American Chemical Society

molecule pharmaceutical synthesis. This includes a challenging simulated example as well as two sets of experimental data. The effectiveness of the original DRSM approach (referred herein as DRSM-1) in modeling the challenging simulated reaction network has been reported previously.16 There, the power of the DRSM-1 approach was demonstrated in modeling timeresolved composition measurements as well as in the discovery of the reaction stoichiometries of the major reactions. Nevertheless, the DRSM-1 approach had limitations, especially in modeling the composition of intermediate species. The corresponding models obtained showed unwanted oscillatory behavior at long times that could not be eliminated even after several parameter estimation algorithms were used. This limitation was primarily attributed to the model’s polynomial dependence on time. The associated DRSM-1 software was also restricted in modeling data that were collected at equidistant time intervals over equal total durations. Because experimental sampling strategies vary across reaction applications, these restrictions prevented DRSM-1 from being applied broadly in process characterization workflows. Here, we will use the improved version of the DRSM methodology that was initially derived for the modeling of Special Issue: Dominique Bonvin Festschrift Received: Revised: Accepted: Published: 13611

February 5, 2019 April 19, 2019 May 1, 2019 May 1, 2019 DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

inputs. Such data-driven models can be successfully used in the optimization of many process operations, with respect to a specific performance index. Moreover, DRSM models for the reacting systems of specific interest here can also be used for obtaining valuable insights into the reaction stoichiometry and kinetics of the effective reaction network or for providing a recursive dynamic model for process control.17 The general workflow of the DoDE and DRSM methodologies includes the following steps:

continuous processes, whose dynamics last, in principle, over semi-infinite time intervals10,11 and are denoted here by DRSM2. The key innovation in the postulation of DRSM-2 is the use of an exponential transformation of time in the model. This Article will demonstrate that the previously observed oscillations near the end of some experiments are eliminated, when the DRSM-2 approach is used. Furthermore, the related DRSM-2 software can now handle data that are collected at time instants not uniformly spaced in time as well as data that are collected from batch runs with different durations. Here, we also introduce an improved constrained version of DRSM-2, denoted by DRSM2c. It imposes in the regression some additional constraints, which are of particular interest to the modeling of chemical reactions. The approach can also effectively deal with some missing data and is able to more accurately identify the underlying reaction stoichiometry compared to DRSM-1. The main constraints that are imposed in DRSM-2c fix the model prediction to the known initial concentrations. In the context of reaction mixtures, DRSM-2c incorporates additional constraints on the concentration values to be non-negative. The application of the DRSM modeling algorithms to data provided by our industrial collaborators is also achieved with ease and accuracy. In one of the two experimental data sets, the composition measurements are obtained at instants that become further apart from each other as the reaction proceeds toward completion. DRSM-2c handles this case effortlessly, whereas DRSM-1 fails to provide a credible model. For the other industrial data set, a substantial fraction of the concentration data is missing, due to human error or equipment malfunctioning; DRSM-2c still provides a very accurate model. We further note that the DRSM methodology can model data that are not necessarily in concentration units but also in chromatographic units such as peak area. This is very useful in the case where the response factors, translating chromatographic peak areas to concentrations, have not been calculated for all measured species, especially the secondary ones. In Section 2, we summarize the previously proposed DoDE and DRSM methodologies. In Section 3, we show the implemented improvements in DRSM-2c. In Sections 4−6, we show the results obtained in the examined case studies using simulated and real experimental data.

(1) Design and perform the experiments, using DoE or DoDE. If there are time-varying input variables, choose a parametrization of the input profiles. Typically, the shifted Legendre polynomials are used for the parametrization of inputs. The number of polynomials is chosen on the basis of the trade-off between more refined input profiles and the expense of conducting the needed experiments. (2) Acquire the time-resolved data and build the DRSM model. The number and type of experiments determines the model type (linear, two-factor interaction, or quadratic), relating the input factors to the output variables. The measurements could be obtained through sparser but more accurate chromatographic measurements or through spectroscopic measurements, which can be very frequent but less accurate, especially for low concentrations. (3) Use the DRSM model. This could be used for process optimization, identification of the reaction mechanism, process characterization, or process control purposes.

3. CONSTRAINED DRSM METHODOLOGY Here, we briefly describe the DRSM-2 modeling algorithm, compare it to DRSM-1, and introduce the additional knowledgedriven constraints that are appropriate for modeling the timeresolved measurements in a reaction system. We demonstrate that the use of an exponential transformation of time as the independent variable significantly increases the effectiveness of the DRSM model. This is demonstrated by the complex simulated data set of ten species taking part in eight reactions, inspired by chemistry found in pharmaceutical research, which provided a substantial challenge to the DRSM-1 algorithm.16 Even though the DRSM-2 algorithm generated a greatly improved model compared to the one estimated by DRSM-1, some additional weaknesses were revealed that motivated the imposition of some knowledge-driven constraints in the estimation of the DRSM parameters. The resulting modeling approach is denoted by DRSM-2c. 3.1. DRSM-2 Methodology. The initial DRSM-1 approach10 used a linear combination of Shifted Legendre Polynomials (SLP) in the dimensionless time τ = t/t0, where t0 is a maximal batch duration, to model the time evolution of the measured data of the output variable y(τ) with a generalization of the classical RSM model framework. For example, consider a quadratic DRSM model, where the estimated output ỹ(τ) is written as

2. DoDE AND DRSM OVERVIEW Here, we briefly overview the DoDE and DRSM data-driven methodologies mentioned above. First, DoDE9 is a generalized Design of Experiments (DoE) framework that allows timedependent inputs or dynamic factors. These dynamic factors are parametrized through a finite linear combination of shifted Legendre polynomials with the parametrizing constants playing the role of traditional factors, often called dynamic subfactors in this methodology. With DoDE, the effect of time varying input variables (e.g., reactant feeding profile or/and reaction temperature) on the process outputs can be evaluated, modeled, and optimized. At present, this is ideally suited for batch operations over a finite time horizon. On the other hand, DRSM is the modeling approach used in either DoE or DoDE experiments when time-resolved output data are available. It builds a relationship between the input variables and the time profiles of the output responses, which are measured at discrete time instances.10,11 The DoDE and DRSM, generalizing DoE and the related RSM model of decades past, can be used to quantify the dynamic behavior of process outputs to dynamic changes in the process

n

y ̃(τ ) = y ̃(τ , x) = β0(τ ) +

n

n

∑ βi (τ)xi + ∑ ∑ i=1

βij(τ )xixj

i=1 j=i+1

n

+

∑ βii(τ)xi 2 i=1

13612

(1) DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

Figure 1. Effects of exponential transformation (a) and fixing initial output and constraining derivative (b).

where xi ∈ (x1, ..., xn) denotes the value of the input factors in a DoE or DoDE design. A similar expression can be written for a linear or a two-factor interaction DRSM model. The parametric functions βq(τ) for q = 0 or q = i,ij,ii denote the generalization of the parameters βq in the classical RSM model. The time dependence of each βq(τ) is approximated by a linear combination of the first few SLP. Specifically,

R

βq(θ ) =

∑ γq,rPr(τ),

(7)

Time constant tc will be selected to represent the slowest dynamics of the process, and we will rely on higher powers of θ to represent faster dynamic components. Despite its initial purpose for nonlinear continuous processes with semi-infinite horizons,11 the above exponential transformation also proves to be very effective in batch problems discussed here. It is crucial in reducing the oscillatory behaviors, which DRSM-1 exhibited when modeling the compositional evolution of intermediate compounds, whose profiles start at zero,16 then rise, and eventually diminish to a zero value at long times. This is illustrated in Figure 1a. More detailed results are discussed in Section 4. Besides selecting the value of tc, we also need to decide on the value of R, representing the order of the polynomials. We will call these two parameters global, because they affect all βq(θ) terms, while parameters γ in eq 6 will be called local, because they are specific for each βq(θ). We determine the optimal values of R and tc by minimizing the value of the Bayesian Information Criterion (BIC), and we employ stepwise regression to estimate the values of the significant local parameters γ. Stepwise regression and Analysis of Variance (ANOVA) examine the significance of each parameter, and the insignificant ones are removed from the model. Moreover, to avoid modeling dynamic behaviors faster than the sampling intervals Δt, we require that

(2)

where Pr(τ) denotes the SLP of the rth order, and R denotes the highest order of polynomials used for the parametrization. The first three SLPs are P0(τ ) = 1; P1(τ ) = −1 + 2τ ; P2(τ ) = 1 − 6τ + 6τ 2 (3)

The transformation from the original DRSM-1 to the DRSM2 model, through the exponential transformation of time,11 was aimed to address problems with a semi-infinite horizon. The same transformation also has utility in batch processes, and we will demonstrate its advantages here. The exponential transformation of time to a new independent variable θ is given by θ = 1 − exp( −t /tc)

(4)

Here, tc is the time constant characterizing the slowest dynamic of the output. It will be selected according to Bayesian Information Criterion to provide the best model fit of the data. This exponential transformation transforms the semiinfinite time interval (0, +∞) into a finite time interval: θ ∈ (0,1). Rather than using eq 1, DRSM-2 uses the following type of model n

y ̃(θ ) = β0(θ ) +

n

i=1

tc/R ≥ Δt

n

∑ βi (θ)xi + ∑ ∑

∑ βii(θ)xi 2 i=1

(8)

where tc/R denotes the fastest dynamics modeled in eq 5, and Δt denotes the smallest time interval between two consecutive samples. If this inequality is not satisfied, it is likely that the DRSM model will exhibit oscillatory behavior at initial times as it tries to model, through the higher powers of θ, faster dynamics than the smallest sampling interval can measure.11 Note that the above discussion focuses a single output or response. If there are multiple responses, as shown in the case studies later, all parameters will be unique to each response, including the time constant tc, the highest order of polynomials R, and the local parameters γ.

βij(θ )xixj

i=1 j=i+1

n

+

(6)

P0(θ ) = 1; P1(θ ) = −1 + 2θ ; P2(θ ) = 1 − 6θ + 6θ 2

∀ q = 0, i , ij , ii

r=0

∀ q = 0, i , ij , ii

r=0

R

βq(τ ) =

∑ γq,rPr(θ),

(5)

This is justified by our knowledge that the response of linear dynamic models is expressed in exponential terms of time rather than polynomial of time. As a result of the transformation, the functions βq(τ) and Pr(τ) in eqs 2 and 3 are replaced by βq(θ) and Pr(θ) as follows 13613

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research 3.2. Constraints at Initial Time. In many cases, regardless of the values of the experimental factors xi, the initial concentration value of a species is a known value, denoted by y0. We wish to prohibit the regression task from estimating a different initial concentration value. We therefore introduce the following hard constraint: n

β0(0) +

n

n

∑ βi (0)xi + ∑ ∑ i=1

Adding eq 14 as a linear constraint in the regression will prevent any initial temporary excursion of the model to negative concentrations. However, there are two implementation details that should be clarified. To ensure that the non-negative initial derivative condition is satisfied at several values of the factors, multiple constraints are needed, as eq 14 involves the factor levels xi. There are two ways to add these multiple constraints:

βij(0)xixj

(1) Use only the values of the factors defined at the experiments performed. This will guarantee a nonnegative initial derivative on the conditions of the conducted experiment but not necessarily at all other values of the factors. This is a potential drawback as we like to use the DRSM model to find the optimal operating conditions for the reaction. (2) Generate a grid of factor values, which are more numerous than the number of experiments and impose the constraint of eq 14 in all of them. If m values are selected for each factor, then mn constraints will be added. This will add non-negative constraints for the whole domain of operating conditions examined if m is large enough.

i=1 j=i+1

n

+

∑ βii(0)xi 2 = y0

(9)

i=1

Because the right-hand side y0 does not depend on the values of the factors xi, the same should apply for each β function the lefthand side. Then, we can conclude that β0(0) = y0 , βq(0) = 0 for q = i , ij , ii , and i , j = 1, 2, ... (10)

On the basis of the definition of the βq(θ) functional parameters and the properties of the SLPs in eqs 6 and 7, the following relationships are derived R

γ0,0 = y0 −

Augmenting non-negative constraints at the experimental points with additional grid points has demonstrated robust model fitting for all the examples used in this paper. Furthermore, we prefer to use stepwise regression in the estimation of the values of the γ parameters, as it helps us to eliminate insignificant parameters and avoid model overparameterization. For implementation reasons (stepwise regression does not allow the imposition of constraints in software like MatLab or other platforms), we proceed in two steps. We first perform stepwise regression without constraints to determine the significant γ coefficients. Then, in the second regression step, we solve a constrained least-squares problem, considering only those γ parameters which were determined as significant in the prior step. The effect of these two constraints at initial times is shown in Figure 1b, where the concentration profiles are described most accurately by fixing their initial value to be zero and constraining its derivative to be non-negative. The current two-stage estimation procedure with model selection and nonnegative constraints was implemented to reduce computational burden. As a result, the constraints were only considered for the model terms selected by the unconstrained stepwise regression step. A more formal alternative is to integrate the physical constraints with the model term selection algorithm. Whether this has a significant impact on the quality of the obtained model is under consideration for further exploration. 3.3. Non-Negative DRSM Output. When the DRSM output is known to be non-negative, as it happens here with species concentrations, we can add non-negative constraints on the expected predictions at different values of the experimental factors as well as at different values of t or, equivalently, of θ. Therefore, the constraints are

R

∑ (−1)r γ0, r ,

γq ,0 = − ∑ (− 1)r γq , r for q = i , ij , ii

r=1

r=1

(11)

In other words, all the γq,0 parameters are expressed in terms of the other γq,r parameters with r = 1, ..., R, and the γq,0 parameters are removed from the regression. The effect of this improvement is shown in Figure 1b for a concentration profile starting at zero. In Figure 1b, we now observe that, once the initial concentration is fixed at zero, an additional problem arises as the DRSM concentration prediction can temporarily dip to negative values. This might not be a problem from the statistical point of view if normal distribution is assumed, because a calculation of the corresponding confidence interval will, most likely, include the zero value. However, in all reaction modeling cases, one expects the nominal model prediction of a concentration to be non-negative. Therefore, one needs to impose an additional knowledge-driven constraint that at t = 0 the time derivative should be non-negative for concentrations that start at a zero value. d y ̃ (t ) dt

= t=0

d y ̃ (θ ) dθ

dθ dt θ(t = 0) = 0

= t=0

d y ̃ (θ ) dθ

θ=0

1 ≥0 tc (12)

This results in d y ̃ (θ ) dθ

≥0 (13)

θ= 0

Considering the definition of ỹ(θ) in eq 5, this leads to l o

| o

∑ γ0,rPr′(0) + ∑ omooxi ∑ [γi ,rPr′(0)]o}oo R

n

n

R

R | l o o o o ′ +∑ ∑ m [ ] x x γ P (0) ∑ } i j r ij , r o o o o i=1 j=i+1 n r=0 ~ n l R | o o o o 2 + ∑m ≥0 xi ∑ [γii , rPr′(0)]} o o o o i=1 n r=0 ~

r=0

i=1

n

r=0

l o

| o

∑ γ0, rPr(θ) + ∑ omooxi ∑ [γi , rPr(θ)]o}oo R

~

n

r=0

n

n

+

∑ i=1

R

o i=1 o n r=0 ~ R | l o o ∑ omooxixj ∑ [γij , rPr(θ)]o}oo + o j=i−1 o n r=0 ~ n

≥ 0 with θ = θk k = 1, 2, ..., nk

l o

| o

∑ omooxi 2 ∑ [γii ,rPr(θ)]o}oo n

i=1

o n

R

r=0

o ~ (15)

Like eq 14, multiple constraint instances of eq 15 are activated using a grid of factor values and time points. We also follow the

(14) 13614

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research ne

above-mentioned two-step approach in the regression of the model parameters: first, stepwise regression and, then, imposition of the constraints. 3.4. Confidence and Prediction Intervals. After obtaining the nominal values of the significant DRSM model parameters, it is necessary to also calculate their confidence or prediction intervals. The figures based on such calculations, which will be provided in the case studies to demonstrate the effectiveness of DRSM-2c, will consist of the nominal model predictions as functions of time, as well as their 95% prediction interval bands with time. The width of the prediction interval bands will depend on time as well as on the values of the experimental factors. As will be seen later, the experimental data, from either real or in silico experiments, will be inside these confidence bands in all but very few exceptions. Assume that, except the term β0(θ), there are in total Q terms of βq(θ) in eq 5. From eqs 5 and 6, we know that there are (Q + 1)(R + 1) parameters of γq,r available to the stepwise regression. We rename the γq,r parameter as δk with k = 1, ..., (Q + 1)(R + 1), combining the two indices into one. For simplicity of presentation, we assume that all δk parameters, (Q + 1)(R + 1) in number, are significant. This is not very likely the case, but it helps us to describe the calculation of uncertainties in the model predictions. When some of the (Q + 1)(R + 1) parameters are not significant, the dimensionality of the vector δ, defined below, will be appropriately smaller. ij γ0,0 yz ijj δ1 yzz jj z zz jj γ zzz jjjj jj 0,1 zz jj δ2 zzzz jj zz jj zz jj zz j z δ = jjj ∂ zzz = jjjj ∂ zzzz with p = (Q + 1)(R + 1) jj zz jj zz jj γ z jj Q , R − 1 zzz jjjj δp − 1 zzzz jj zz jj zz jjj γ zzz jj zz k Q , R { jk δp z{

Σ,

nmT =

(19)

k=1

The variance associated with a DRSM model prediction at a time instant corresponding to exponential-transformed time θ and factor level x is given by σ 2(θ , x) = var(y ̃(θ , x)) = (f1(θ , x) ... f p (θ , x)) · ij f (θ , x) yz jj 1 zz jj zz j zz Σjjj ∂ zz jj z jj f (θ , x)zzz j p z k {

(20)

Therefore, the 100(1 − α)% confidence interval, assuming normal distribution on the expected output at transformed time θ and at the x values of the experimental factors, is CI(θ , x) = (y ̃(θ , x) − A , y ̃(θ , x) + A) with A = A(α , θ , x) = t0.5α , ζσ 2(θ , x)

(21)

Here, t0.5α,ζ is the upper α/2 percentage point of t-distribution with ζ degrees of freedom. The ζ value is equal to the number of total observations nmT minus the number of significant parameters from the original p components of vector δ resulting from the stepwise regression. Additionally, the 100(1 − α)% prediction interval on a new observation is larger due to variability of the single next experiment and is given by PI(θ , x) = (y ̃(θ , x) − B , y ̃(θ , x) + B) with B = B(α , θ , x) = t0.5α , ζ MSE + σ 2(θ , x)

(22)

Here, the MSE term is the mean squared errors from the regression.

(16)

The covariance matrix of the parameter δ is often denoted by jij δ1 zyz jj z jj δ zzz i Σ ... Σ y jj 2 zz jj 11 p1 z zz jj zz jj zz jj zz jj Σ = cov(δ) = covjjj ∂ zzz = jj ∂ ∏ ∂ zzz j z jj z jj δp − 1 zzz jjjj Σ μ Σ zzzz jj zz k 1p pp { jjj zz jj δ zzz k p {

∑ nm(k)

4. CASE STUDY A: SIMULATED REACTION NETWORK We consider a set of simulated data, which was provided by our industrial collaborators and has been studied previously using DRSM-1.16 The data is based on a real reaction network with revised kinetic parameters to conceal sensitive information. Here, we study it using DRSM-2c, which shows great improvement in modeling results. 4.1. Data Set. The underlying reaction network, unknown initially but revealed eventually, consists of ten species and eight independent reactions, which is shown in Figure 2. The numbers denote reactions, and the colored circles denote species, with blue for reactants, gray for intermediates, red for byproducts, and green for the desired product. There are two reversible reactions, Rxns #1 and #4. An alternative representation of the reaction network has been given in the upper part of Table 1 in our previous publication.16 The set of experiments included three independent static factors: reaction temperature throughout the batch (varying between 50 and 90 °C), initial concentration of species B (varying between 0.8 and 1.2 mol/L), and initial concentration of species D (varying between 0 and 2 mol/L). The experiments were generated following a face-centered central composite design, including 17 experiments: 8 at the vertices of the cube, 6 at the centers of the six faces, and 3 repeated experiments at the center of the cube. The output variables are the concentrations of all 10 species at different sampling time instants. These were generated by simulating the reaction kinetics and solving a set of differential equations representing mole balances. The kinetic

(17)

and the covariance is a direct result of the unconstrained parameter estimation task. Because the DRSM model is linear in the original gamma parameters, eqs 5 and 6 can now be written as follows: ij δ1yz jj zz j z y ̃(θ , x) = ∑ δkfk (θ , x) = (f1(θ , x) ... f p (θ , x))jjjj ∂ zzzz jj zz k=1 jj δp zz k { p

(18)

The term f k(θ,x) does not depend on the unknown parameters but consists of SLP terms multiplied by the values of the corresponding factors at a specific experiment. If the number of concentration measurements collected for a given species are nm(k) during the kth experiment out of ne experiments, then the output data vector for the regression is of dimension nmT, given by 13615

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

reactions toward their completion, the equidistant sampling would be more appropriate. We use the full quadratic DRSM model, and we choose the number of polynomials and the time constant tc aiming to minimize the Bayesian information criterion (BIC) as proposed previously.11 Each species is modeled separately as an independent response profile. Three species are used as representatives for discussion: (i) species A, a reactant; (ii) species C, an intermediate, whose concentration drops to nearly zero at the end of the operation; (iii) species F, a product. Figures 3 and 4 show the (simulated) experiment data and the estimated profiles according to DRSM-1 and DRSM-2c for one experiment. The results in other experiments exhibit similar behaviors. For equidistant sampling, both DRSM-1 and DRSM-2c have a good fit of the concentrations of the reactant (species A) and the product (species F); however, for the intermediate species (species C), DRSM-1 exhibits noticeable oscillations at the end of the operation, while DRSM-2c, using the exponential transformation of time, has no such oscillations. For the nonequidistant sampling, the advantage of DRSM-2c is even more notable and is not limited to the intermediates. For the reactant (species 1) and the product (species 6), DRSM-1 leads to divergent oscillations and, therefore, cannot serve as a useful data-driven model. On the contrary, DRSM-2c performs very well for all the species, with good fit to the experiment data and no oscillations. The polynomial order, R, and the time constant, tc, are summarized in Table 1. We observe that DRSM-2 requires a smaller number of polynomials compared to DRSM-1. Because of the advantage of DRSM-2c, all the following results shown in this paper will be based on DRSM-2c, and the version number will be omitted. We show the DRSM estimation results and the 95% prediction interval in Figure 5. This is for a byproduct, species H with R = 3 and tc = 5 h. Other species are shown in Supporting Information Section 2. Experiments 1 to 17 are from the facecentered central composite design used to calculate the DRSM model, while experiments 18 and 19 are at different experimental conditions from those used in building the DRSM model, a cross-validation test. For experiment 18, the coded values of the three factors are 0.5, −0.5, and 0.5, while for experiment 19, they are −0.5, 0.5, and −0.4 respectively. We can see that DRSM not only fits the data in the first 17 experiments well but also provides very good predictions for the cross-validation data. The initial concentration is fixed at zero, and the “inverse action” is

Figure 2. Scheme of the underlying network of case study A.

parameters are given in Supporting Information Section 1. Normally distributed additive errors, with a constant variance N(0, 0.005), were added to the simulation to represent the measurement error. The measurement error is about the magnitude of the two smallest concentrated species, species I and J. The knowledge about the reaction network is only used for simulation purposes. On the basis of the generated data, DRSM models were built using only the data set, without any knowledge of the simulated reactions. 4.2. DRSM Results. First, we compare the modeling results using DRSM-1 and DRSM-2c. We show the results using two sampling strategies: equidistant sampling and nonequidistant sampling. (1) For equidistant sampling, the sampling occurs every hour from 0 until 12 h. (2) For nonequidistant sampling, we increase the sampling interval as time increases using the following time instants: 0.09, 0.19, 0.38, 0.75, 1.5, 3, 6, 9, and 12 h. The nonequidistant sampling is a popular sampling method in the pharmaceutical industry as measurements are focused on a domain where the responses are expected to be the most dynamic (and thus more informative for parameter estimation). If one focuses more on the

Figure 3. Equidistant sampling results. 13616

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

Figure 4. Nonequidistant sampling results.

missing due to equipment faults or human error. For example, in experiment 13, the concentration data of samples between 2 and 5 h are missing. The missing data points are shown as green dots in Figure 6. Additionally, some experiments may be completed at an earlier time than others, e.g., experiment 6 stopped after 6 h. This data set simulating various durations in each experiment and missing samples was made possible by deleting 43% of the sample points in the original data set. DRSM-1 had considerable challenges handling this data set, while DRSM-2c models the provided data concentration profiles effortlessly. The results for an intermediate species are shown in Figure 6. The improvements of fixing the initial concentrations are clearly seen in experiments 2, 3, and 6. DRSM-2c avoids divergent model profiles at the end of the batch in experiments 3, 5, 6, 10, and 16. The oscillations in the DRSM-1 predictions at the end of

Table 1. Number of Polynomials and Time Constants in the DRSM Models for Both Sampling Strategies R

species equidistant sampling nonequidistant sampling

DRSM-1 DRSM-2c DRSM-1 DRSM-2c

tc (h)

A

C

F

8 5 9 6

9 5 5 5

8 6 10 7

A

C

F

5

5

6

5

6

5

avoided by the initial derivative constraints. Additionally, the prediction intervals of DRSM following eq 22 are shown, and most of the experimental data fall inside the prediction intervals. Two more challenges may come up with the data. The first challenge relates to the possibility that some of the data might be

Figure 5. DRSM-2c estimation of concentrations (mol/L) and prediction interval for species H vs time (h). 13617

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

Figure 6. Comparison of DRSM-1 and DRSM-2c for the estimation of concentrations (mol/L) for species E vs time (h), using data with various durations and missing samples.

Figure 7. DRSM estimation of concentrations (mol/L) and prediction interval of species B vs time (min).

experiments 5 and 8 are nonexistent in the DRSM-2c model. Furthermore, DRSM-2c is a more parsimonious model. For example, for species E shown in the figure, the DRSM-2c model has 21 significant gamma parameters with R = 3 and tc = 3.3, while DRSM-1 needs 32 significant gamma parameters with R = 5. In terms of the missing data shown as green dots in the figure, DRSM-2 provides good predictions for most of them; however, if a lot of data are missing, for example, in experiments 6 and 13, predictions might not be very accurate. One cannot expect

perfect predictions when a lot of data are missing. The results for other species are shown in Supporting Information Section 3.

5. CASE STUDY B: EXPERIMENTAL REACTION NETWORK WITH THREE SPECIES A set of experimental data from our industrial collaborators is used here to study the effectiveness of the DRSM model. This case study shows the flexibility of the DRSM methodology in handling experimental data with different sampling strategies 13618

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

Figure 8. DRSM estimation of peak area (%) and prediction interval of species D vs time (h) from the original data.

5.1. Data Set. There are 3 observed species (A, B, C) and 5 factors (a, b, c, d, e) in the design space. The design of the experiments follows a 25−2 fractional factorial design with eight experiments. The two defining relations, “words”, that characterize this quarter fractional design are I = abd and I = ace. The data are obtained through liquid chromatographic measurements at predefined sampling instances, and they are split in two blocks. Block 1 of the data was obtained through a robotic sampling system, while data from block 2 was obtained manually using the same input factors as block 1. Eight experiments are conducted in each block, and there are totally 16 experiments. For each experiment by the robotic system, there are five samples measured, respectively, at 20, 40, 60, 120, and 240 min from the start of the experiments. For each manual experiment, there are five samples obtained approximately at those times but not precisely as well as one additional sample taken in the beginning. 5.2. DRSM Results. From the fractional factorial design, we know that the following alias structure is present: (a = bd = ce, b = ad, c = ae, d = ab, e = ac, bc = de, cd = be). Therefore, only two independent two-factor interaction terms exist. We selected the bc and cd terms to build the DRSM model, as shown in the equation below. Using any other independent 2FI terms, (e.g., bc and be), will lead to the same results.

time-resolved profiles very close to the data from the experiments. This result illustrates the flexibility of the DRSM methodology, which allows different sampling times among experiments in the data. We also studied the block effect of manual versus robotic operation by adding a block term in eq 23. This block effect has been calculated to be insignificant.

6. CASE STUDY C: EXPERIMENTAL REACTION NETWORK WITH FIVE SPECIES This second experimental data set is a larger case study compared to case study B. There are five measured species (A, B, C, D, E) and three factors in the design space. There are 25 experiments in total, of which 14 are distinct. For each experiment, 8 to 14 chromatographic measurements are obtained at different time points, from 0 to 34.3 h. The reported data is in chromatogram peak area units. We developed a quadratic DRSM model from the data set. The DRSM estimations and the 95% prediction interval, considering species D with R = 4 and tc = 8, are shown in Figure 8. The results for other species are shown in Supporting Information Section 6. We can see that the DRSM models for the peak area profiles are quite accurate, and most of the samples fall inside the prediction intervals. A few outlier data are present, like sample 6 from experiment 18. There is also one experiment, #14, which has a slight but consistent DRSM underestimation. A potential reason for this underestimation is that there is a small unknown experimental error between the actual factor levels in the experiments and the designed factor levels that are used for estimation of the model. In many real occasions, not all experiments are carried out over the same duration and many missing samples appear in the reported data. The reduced data set, which is made by deleting

y ̃(θ ) = β0(θ ) + βa(θ )xa + βb(θ )xb + βc (θ )xc + βd (θ )xd + βe(θ )xe + βbc (θ )xbxc + βcd (θ )xcxd

(23)

The DRSM estimation and its 95% prediction interval are shown in Figure 7, using species B as an illustrative example (R = 4, tc = 160). DRSM results for other species are shown in Supporting Information Section 5. The DRSM model provides 13619

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

Figure 9. DRSM estimation of peak area (%) and prediction interval of species D vs time (h) from the data with various durations among experiments and many missing samples.

the reaction stoichiometries active in the mixture, where statistical tests are used in the stoichiometric identification algorithm.

25% of the sample points in the original data set, is shown in Figure 9. We can see that the DRSM model still provides an accurate model within the duration of each experiment. The comparison of Figures 8 and 9 shows that in the reduced data the DRSM predictions at extrapolated times beyond the last measurement are also quite good. From this data set, we have shown that the DRSM-2c can produce an accurate model for data other than concentrations; in this case, peak area as the response factor converting peak area data to concentrations is not available.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.iecr.9b00731.



7. CONCLUDING REMARKS In this paper, we demonstrate that the proposed improved DRSM-2c is effective in modeling challenging composition data for complex organic synthesis, when the measurements are not equidistant, when a large fraction of them is missing, or when experiments are conducted over various durations. The improvements in the DRSM methodology include (1) using exponential transformations, (2) fixing initial concentration levels, and (3) forcing concentration outputs to be non-negative. The advantages of DRSM-2c over DRSM-1 in handling these challenging data encountered in drug synthesis were clearly demonstrated through multiple computational and experimental case studies. Thus, one can obtain a more accurate DRSM model using the approach described here, and the obtained DRSM models have been proven to be quite powerful for multiple purposes, which include the identification of candidate stoichiometries, the optimization of operating conditions in batch processes, and the control of continuous processes. For future research, we will use the DRSM-2c model with target factor analysis for identifying

Kinetics for the simulation case and figures of DRSM estimations for all species in all of the case studies (PDF)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Yachao Dong: 0000-0002-6990-7327 Christos Georgakis: 0000-0001-9895-8564 Jonathan P. McMullen: 0000-0001-5969-2396 Notes

The authors declare no competing financial interest.



βq γq,r δk ζ θ τ Σ 13620

NOTATION parametric function of term q in DRSM parameter in r-th SLP of term q renamed parameter γq,r with one index degree of freedom in regression exponential transformed time in DRSM-2 dimensionless time in DRSM-1 covariance matrix of vector δ DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621

Article

Industrial & Engineering Chemistry Research

(17) Wang, Z. Y.; Klebanov, N.; Georgakis, C. DRSM Model for the Optimization and Control of Batch Processes. In Dynamics and Control of Process Systems, including Biosystems/IFAC2016, Trondhelm, Norway, June 6−8, 2016.

f

function of factor levels multiplied by the SLP basis function n number of factors ne total number of experiments Pr r-th SLP basis function q index of parametric function terms Q total number of parametric function terms r index of polynomials R number of polynomials t time tc time constant in DRSM-2 tα,ζ upper α/2 percentage point of T distribution with ζ degrees of freedom xi i-th factor y/ỹ̃ true/estimated output variable y0 initial output



REFERENCES

(1) Togkalidou, T.; Braatz, R. D.; Johnson, B. K.; Davidson, O.; Andrews, A. Experimental design and inferential modeling in pharmaceutical crystallization. AIChE J. 2001, 47 (1), 160−168. (2) Gabrielsson, J.; Lindberg, N.-O.; Lundstedt, T. Multivariate methods in pharmaceutical applications. J. Chemom. 2002, 16 (3), 141−160. (3) González, A. G. Optimization of pharmaceutical formulations based on response-surface experimental designs. Int. J. Pharm. 1993, 97 (1), 149−159. (4) Singh, B.; Kumar, R.; Ahuja, N. Optimizing drug delivery systems using systematic ″design of experiments.″ Part I: fundamental aspects. Crit. Rev. Ther. Drug Carrier Syst. 2005, 22 (1), 27−105. (5) Ferreira, S. L. C.; Bruns, R. E.; Ferreira, H. S.; Matos, G. D.; David, J. M.; Brandão, G. C.; da Silva, E. G. P.; Portugal, L. A.; dos Reis, P. S.; Souza, A. S.; dos Santos, W. N. L. Box-Behnken design: An alternative for the optimization of analytical methods. Anal. Chim. Acta 2007, 597 (2), 179−186. (6) Bezerra, M. A.; Santelli, R. E.; Oliveira, E. P.; Villar, L. S.; Escaleira, L. A. Response surface methodology (RSM) as a tool for optimization in analytical chemistry. Talanta 2008, 76 (5), 965−977. (7) Hanrahan, G.; Lu, K. Application of Factorial and Response Surface Methodology in Modern Experimental Design and Optimization. Crit. Rev. Anal. Chem. 2006, 36 (3−4), 141−151. (8) Montgomery, D. C. Design and Analysis of Experiments; John Wiley & Sons: 2008. (9) Georgakis, C. Design of Dynamic Experiments: A Data-Driven Methodology for the Optimization of Time-Varying Processes. Ind. Eng. Chem. Res. 2013, 52 (35), 12369−12382. (10) Klebanov, N.; Georgakis, C. Dynamic Response Surface Models: A Data-Driven Approach for the Analysis of Time-Varying Process Outputs. Ind. Eng. Chem. Res. 2016, 55 (14), 4022−4034. (11) Wang, Z. Y.; Georgakis, C. New Dynamic Response Surface Methodology for Modeling Nonlinear Processes over Semi-infinite Time Horizons. Ind. Eng. Chem. Res. 2017, 56 (38), 10770−10782. (12) Ruppert, D.; Wand, M. P.; Carroll, R. J. Semiparametric Regression; Cambridge University Press: 2003. (13) Wood, S. N. Generalized additive models: an introduction with R; Chapman and Hall/CRC: 2006. (14) Tang, Q.; Lau, Y. B.; Hu, S.; Yan, W.; Yang, Y.; Chen, T. Response surface methodology using Gaussian processes: Towards optimizing the trans-stilbene epoxidation over Co2+−NaX catalysts. Chem. Eng. J. 2010, 156 (2), 423−431. (15) Wilson, Z. T.; Sahinidis, N. V. The ALAMO approach to machine learning. Comput. Chem. Eng. 2017, 106, 785−795. (16) Santos-Marques, J.; Georgakis, C.; Mustakis, J.; Hawkins, J. M. From DRSM Models to the Identification of the Reaction Stoichiometry in a Complex Pharmaceutical Case Study. AIChE J. 2019, 65, 1173. 13621

DOI: 10.1021/acs.iecr.9b00731 Ind. Eng. Chem. Res. 2019, 58, 13611−13621