
Optimal Operation of Discretely Controlled Continuous Systems under Uncertainty

Mariano De Paula and Ernesto C. Martínez
INGAR (CONICET-UTN), Avellaneda 3657, Santa Fe S3002 GJC, Argentina

ABSTRACT: Discretely controlled continuous systems constitute a special class of continuous-time hybrid dynamical systems where timely switching to alternative control modes is used for dynamic optimization in uncertain environments. Each mode implements a parametrized feedback control law until a stopping condition triggers due to the activation of a constraint related to states, controls, or disturbances. For optimal operation under uncertainty, a novel simulation-based algorithm that combines dynamic programming with event-driven execution and Gaussian processes is proposed to learn a switching policy for mode selection. To deal with the size/dimension of the state space and a continuum of control mode parameters, Bayesian active learning is proposed using a utility function that trades off information content with policy improvement. Probabilistic models of the state transition dynamics following each mode execution are fitted upon data obtained by increasingly biasing operating conditions. Throughput maximization in a hybrid chemical plant is used as a representative case study.

1. INTRODUCTION
Many engineered systems, such as chemical processes, manufacturing systems, solar collector fields, and biological systems, combine continuous-time dynamics with a switching control logic that is typical of discrete-event systems.1−5 For control and optimization purposes these systems are modeled as hybrid dynamical systems, or simply hybrid systems. The trajectory of a continuous-time hybrid system can be represented as a sequence of continuous evolutions interleaved by discrete events,6 which cause changes in the set of differential equations defining the continuous state flow, also called the "modes" of the hybrid system dynamics. For each mode, the continuous-time dynamics is generally modeled by differential equations corresponding to the evolution of the system state. The discrete-event dynamics is the result of some event-based control strategy used to respond to external disturbances and endogenous inputs affecting the state evolution of the controlled system as a whole.7 For a finite number of modes, a novel modeling paradigm known as integral continuous-time hybrid automata (icHA) has recently been proposed for event-driven optimization-based control, for which no a priori information about the timing and order of the events is assumed.8 Exogenous inputs or disturbances are the events affecting the dynamics of an icHA. The continuous dynamics and the instants at which discrete events occur are further influenced by endogenous inputs related to the value or desirability of states in goal-directed control. From the nonlinear optimal control perspective, when dynamic programming is applied to integral continuous-time hybrid systems, the resulting computational problem is hard to solve, since it usually involves nonconvex mathematical programs.9 Most previous works on applying the dynamic programming formalism to optimal control and mode scheduling of switched systems highlight the issue of combinatorial optimization problems with exponential complexity.10−12 Basically, in view of the dynamic programming principle, the optimal control of an icHA is implicitly determined by the associated Hamilton−Jacobi−Bellman (HJB) equations, and the optimal control solution must be obtained by combining efficient

state space sampling and value function approximation. In a recent work, approximate dynamic programming has been successfully applied to the discrete-time switched LQR control problem.13 Discretely controlled continuous systems (DCCSs) constitute a special class of continuous-time hybrid dynamical systems where "modes" are hereafter considered control modes with defining parameters that vary over a continuum. Each mode implements a parametrized feedback control law until a terminating condition is activated in response to goal achievement, a state constraint, or an external event.14−17 Timely switching to alternative control modes is used for dynamic optimization in uncertain environments. Since a control mode parametrization in a DCCS switches among infinitely many possibilities for optimal performance, a DCCS can be represented as an icHA only when a given switching policy for control mode selection is applied. For optimal control, the optimal switching policy generates a sequence of control modes to complete a goal-directed control task from different initial states while minimizing some performance criterion.18,19 Execution times for control modes depend on the type of behavior or specific goal state that is being pursued and on the occurrence of endogenous inputs and external events influencing the system dynamics. Thus, optimal operation of a DCCS makes it advantageous to resort to a Lebesgue sampling strategy of the state space.20−23 The solution of dynamic optimization problems with continuous-time hybrid automata embedded24 has been thoroughly reviewed by Barton et al.25 A decomposition approach was proposed to divide the original problem into smooth subproblems, such as determining the optimal mode sequence given fixed transition times.


A deterministic branch-and-cut framework incorporating outer approximation cuts and a bound-tightening heuristic was proposed to solve a nonconvex mixed-integer problem. The proposed mixed-integer reformulation is thus able to retain the linear structure of the embedded hybrid system. An important issue is the possibility that mode switching within a hybrid system causes nonsmoothness in the master problem. From this perspective, the control parametrization approach was examined in detail and the difficulties of gradient-based solvers were thoroughly discussed. However, the issue of uncertainty in hybrid system optimization was not addressed in the work of Barton et al.,25 although its importance was recognized, mainly in relationship with safety verification. From the perspective of behavior-based control borrowed from robotics, the important issue of optimality in multimodal optimal control has also been addressed by Mehta and Egerstedt26 using reinforcement learning (RL) techniques.27−29 RL is a simulation-based dynamic programming (DP) approach to solving optimal control problems under uncertainty and proves to be somewhat effective in addressing the problem of concatenating or combining control modes with fixed feedback laws and stopping conditions. To overcome the limitations of a fixed number of modes, Mehta and Egerstedt26 propose to augment the mode set as a function of the current modes using calculus of variations. Moreover, given such a control program, the issue of improving the system performance through the addition of new control laws was formulated as an optimal control problem. In fact, this is achieved through an optimal combination of recurring mode strings. Uncertainty in the initial states is a major obstacle for multimodal control, since fixed control programs are derived from assumed initial conditions. Also, this approach cannot effectively cope with endogenous disturbances and uncertain events affecting the state trajectory toward a goal state. To capture fully the dynamics of controlled hybrid systems where the continuous state interacts with the discrete state in an uncertain environment, a number of modeling frameworks have been proposed.30−37 The early work of Bensoussan and Menaldi30 is a seminal contribution on using dynamic programming for stochastic hybrid control, whereas a number of chapters in the book by Cassandras and Lygeros31 highlight the importance of addressing uncertainty in optimal control of hybrid systems. To address a computationally intractable optimal control problem for a class of stochastic hybrid systems with linear state equations and a separable cost function, a near optimal state feedback control scheme based on a statistical prediction method was recently proposed by Song and Li.32 Besides linearity assumptions, a disadvantage of their approximate numerical solution method is that it samples over the entire state space. As a result, the "curse of dimensionality" cannot be addressed, and numerical solutions can be calculated in practice only for state spaces of small size/dimension. Computing suboptimal continuous and discrete control trajectories for hybrid systems with a probabilistic discrete transition structure has also been addressed by Adamek et al.33 The discrete inputs are mainly used to block undesired state transitions and to realize state-action trajectories that are goal-attaining with high probability.
In the field of robotics, a probabilistic particle-control approximation of chance-constrained stochastic predictive control, which keeps the probability of failure below a defined threshold, was recently proposed.34 The particle-control method is restricted to systems with linear or jump Markov linear dynamics and lacks a goal-driven strategy for defining the duration of decision stages. Also, the solution must be recomputed in real time from different initial conditions since, instead of a dynamic programming approach, the approximate problem is solved using mixed-integer linear programming techniques. Dynamic programming has been applied to probabilistic reachability and safety for controlled discrete-time stochastic hybrid systems35 to obtain the set of initial conditions providing a certain probabilistic guarantee that the system will keep evolving within a desired safe region of the state space, which is characterized in terms of a value function, and "maximally safe" Markov policies are determined. Bemporad and Di Cairano36 focus on optimal and receding horizon control of switched linear affine systems modeled using Discrete Hybrid Stochastic Automata (DHSA), whose discrete-state transitions depend on both deterministic and stochastic events. A finite-time optimal control approach is used to determine the trajectory that provides the best trade-off between tracking performance and the probability for the planned trajectory to actually execute under possible chance constraints. Linear models, a finite number of modes, and the lack of a goal-directed control policy are the main disadvantages of using the modeling paradigm of a DHSA. Finally, it is worth citing the sliding-mode control strategy for stochastic jump systems proposed by Shi et al.,37 in which the jumping parameters are modeled as a continuous-time, discrete-state homogeneous Markov process. The mode describes the random jumps of the system parameters and the occurrence of discontinuities. The salient advantages of sliding-mode control are (i) fast response and good transient performance; (ii) robustness; and (iii) the possibility of stabilizing some complex nonlinear systems that are difficult to control by a monolithic continuous-state feedback law. In this work, control mode parameters are chosen as the degrees of freedom for dynamic optimization so that the optimal sequence of modes is synthesized on the fly as state transitions are observed when stopping conditions are triggered. These conditions may be due to external events/disturbances or to endogenous events. The latter are derived from the optimal value function landscape, such as when its gradient changes from positive to negative. The main argument in this work is that, for optimal operation of a DCCS under uncertainty, a switching policy is required to implement goal-directed control in real time. A novel simulation-based algorithm that combines dynamic programming with Lebesgue sampling and Gaussian process approximation is proposed to learn a switching policy for mode selection. To deal with the size and dimension of the state space and a continuum of feedback law parameters, Bayesian active learning is proposed using a utility function that trades off information content with switching policy improvement. Probabilistic models of the state transition dynamics following each mode execution are learned upon data obtained by increasingly biasing operating conditions.

2. DISCRETELY CONTROLLED CONTINUOUS SYSTEMS (DCCS)
Discretely controlled continuous systems are a special class of hybrid dynamical systems where a goal-directed strategy generates sequences of control modes for controlling an otherwise continuous dynamics. Each control mode implements a feedback control law until a stopping condition triggers and opens the possibility of changing the control mode in order to achieve a goal or deploy a desired behavior. Thus, the hybrid nature of the dynamics is more a consequence of a multimodal


Figure 1. Discretely controlled continuous system.

Figure 4. Lebesgue sampling of the state space using a finite number of modes.

Figure 2. Event-driven continuous-time hybrid automaton.

control strategy applied rather than an intrinsic property of the system.
2.1. Lebesgue-Sampling-Based Dynamics. Definition 1. A DCCS is made up of the five components shown in Figure 1: the continuous dynamics ẋ = f(x, u, d), where x(t) ∈ X ⊂ ℝ^m is the state, u(t) ∈ U is the control vector, and d(t) ∈ D ⊂ ℝ^d is the vector of exogenous disturbances; the event generator (EG), made up of stopping conditions that set the endogenous binary inputs ξe; the control mode, a parametrized feedback law u(x, θ); and the switching policy π(x, d), which defines a decision rule for choosing control mode parameters θ ∈ Θ ⊂ ℝ^p over time t in order to achieve a control goal G.
Definition 2. The event generator (EG) defines the endogenous binary inputs ξe^i, i = 1, ..., ne, according to stopping conditions for states, controls, and disturbances:

[ξex^i = 0] ↔ Hx^i(x) ≤ 0,  i ∈ [1, nex]   (1.1)

[ξeu^i = 0] ↔ Hu^i(u) ≤ 0,  i ∈ [1, neu]   (1.2)

[ξed^i = 0] ↔ Hd^i(d) ≤ 0,  i ∈ [1, ned]   (1.3)

so that ξe ≜ [ξex^1 ... ξex^{nex}  ξeu^1 ... ξeu^{neu}  ξed^1 ... ξed^{ned}] ∈ {0,1}^{ne}, with ne ≜ nex + neu + ned.

Thus, whenever one of these constraints becomes active, the corresponding binary variable ξe^i changes its value from "0" to "1".
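To make Definition 2 concrete, the sketch below runs a single control mode until one of the stopping conditions in eqs 1.1−1.3 becomes active (i.e., some H(·) turns positive). It is a minimal illustration, not the paper's implementation: the dynamics f, the feedback law, the constraint functions, and the explicit Euler integrator are hypothetical placeholders.

```python
import numpy as np

def run_mode(x0, theta, f, feedback, d_of_t, Hx, Hu, Hd, dt=0.01, t_max=24.0):
    """Integrate x_dot = f(x, u, d) under one control mode u = feedback(x, theta)
    until a stopping condition triggers.  Per eqs 1.1-1.3, the binary event
    input flips to 1 as soon as any constraint function H(.) becomes positive."""
    t, x = 0.0, np.asarray(x0, dtype=float)
    while t < t_max:
        d = d_of_t(t)                          # exogenous disturbance at time t
        u = feedback(x, theta)                 # parametrized feedback law (control mode)
        # event generator: stopping conditions for states, controls, disturbances
        events = ([Hi(x) > 0 for Hi in Hx] +
                  [Hi(u) > 0 for Hi in Hu] +
                  [Hi(d) > 0 for Hi in Hd])
        if any(events):
            return x, t, events                # Lebesgue sample: state at the event time
        x = x + dt * np.asarray(f(x, u, d))    # explicit Euler step (placeholder integrator)
        t += dt
    return x, t, []                            # horizon reached without an event
```

A switching policy would then inspect the returned state/disturbance pair and select the parameter θ of the next control mode.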

Figure 3. Lebesgue sampling under a given switching policy.


Figure 5. Two-tank system of Example 1.

Once the control mode has been changed, the concerned variable ξe^i is reset to zero. The H^i are the functional representations of constraints for the states, controls, and disturbances. It is worth noting that the stopping conditions in eq 1.1 are used to represent both path and value constraints for states, whereas the stopping conditions in eq 1.2 may be used to constrain control actions within a feasible region. For disturbances, the stopping conditions in eq 1.3 are directly related to environmental events that may require switching to a different control mode for optimal operation.
Definition 3. For a given disturbance dynamics ḋ(t), a switching policy π: θ ← (x, d) generates a finite sequence of pairs {(x_i, θ_i)}_{i=0}^{N}, where the observed state transitions give rise to a sequence of rewards {r_i}_{i=1}^{N} measuring the goodness of the switching policy π in relationship with the control goal G.
In Definition 3, it is implicitly assumed that the control goal G is mathematically expressed using a reward function r(x, θ) and the expected return R associated with the sequence of rewards for observed state transitions.27−29 Without any loss of generality, it is assumed here that from any decision stage k the return R_k is calculated as a discounted sum:

R_k = Σ_{p=1}^{N} γ^p r_{k+p},   0 ≤ γ ≤ 1

[Table: parameter settings reported for the case study — weighting factors of the utility function, 1.0 and β = 2.0; discount rate, γ = 0.95; control horizon in no. of modes, N = 30; support set cardinality, ||?k3|| = 500.]
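The discounted return R_k defined above translates directly into code; the reward sequence and discount rate below are arbitrary illustrative values.

```python
def discounted_return(rewards, gamma):
    """Return R_k = sum_{p=1..N} gamma**p * r_{k+p}; rewards[0] is r_{k+1}."""
    return sum(gamma ** (p + 1) * r for p, r in enumerate(rewards))

# example: three mode executions with rewards -0.2, -0.1, -0.4 and gamma = 0.95
print(discounted_return([-0.2, -0.1, -0.4], 0.95))   # ~ -0.62
```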

executed in the kth decision stage is interrupted. The dynamics of a DCCS under a given switching policy π can be abstracted as the event-driven continuous-time hybrid automaton depicted in Figure 2. The discrete state of the hybrid automaton is represented by the currently active control mode with a given θ (the nodes of Figure 2). At points in time other than events, there is one and only one active control mode. There is a continuum of control mode parameters θ ∈ Θ ⊂ ℝ^p upon which the execution of the hybrid automaton depends. These will be the continuous decision variables in any dynamic optimization problem embedding the hybrid automaton of Figure 2. The fact that execution times of control modes are different gives rise to a Lebesgue sampling strategy of the state space for event-driven optimal control.20 Thus, a switching policy conveniently converts the continuous-time problem into an event-driven one based on stopping conditions that constrain the dynamics of a DCCS. The importance of Lebesgue sampling of the state space X in the search for an optimal switching policy is that it makes it feasible to overcome the curse of dimensionality when applying dynamic programming techniques with states and controls varying over a continuum. If the length of mode sequences is bounded, for an arbitrary disturbance dynamics ḋ(t), Lebesgue-based sampling results in a subspace of reachable states, denoted 3. Over the continuum of parameters θ ∈ Θ, this subspace of states is given by the set of all states that are reachable from an initial

Figure 8. Results obtained using the 3DP algorithm for the inflow pattern in Figure 5: (a) lower tank level; (b) upper tank outflow rate; (c) lower tank outflow rate; (d) scheduling of mode parameters.


subspace of states x0 ∈ X0 ⊂ X using mode strings of length less than or equal to N under a given switching policy π.
Definition 5. Given x0 ∈ ?03 and d0, an arbitrary disturbance dynamics ḋ(t), and a switching policy π, a state x(t) ∈ X is reachable if there exists a sequence of pairs {(x_i, θ_i)}_{i=0}^{j≤N} such that x(t) ∈ 3.
If the alternative parametrizations of control modes are limited to a finite set ΘDP, a sample-path-based approximation ?3 to the subspace of reachable states 3 under π can be obtained. In Figure 3, an intuitive explanation of Lebesgue sampling using a given switching policy is provided. A simulation-based algorithm to generate ?3 is given in Figure 4. Lebesgue sampling begins learning/exploring from the states that we know (namely x0 ∈ X0, d0, and all the states in ?13 that are reachable in one step). At any iteration, a state is randomly selected from the set of known states, and a mode parameter θ ∈ ΘDP is randomly chosen and applied until one stopping condition is activated. In the algorithm, t(k + 1, θj) is the time at which a stopping condition triggers; the resulting state x(t(k + 1, θj), θj) and disturbance d(t(k + 1, θj)) are obtained, and a pair z(k + 1) is composed. Then, it is tested whether this pair is already a member of the sampled set ?k3+1; if not, the pair is added to the set of sampling points ?k3+1. Lebesgue sampling continues by exploring and updating the sets ?k3, k = 2, ..., N, ensuring that sufficiently many state/disturbance pairs are visited by using a large number L ≫ p. In summary, the algorithm effectively generates a sample-based estimation ?3 of the reachable space 3 while simulating sequences of state transitions generated by mode strings of variable length. It is worth noting that, as the number of available modes in ΘDP increases due to a finer representation of the parameter space, the efficacy of Lebesgue sampling increases, but the algorithm in Figure 4 imposes exponentially growing computational costs. In this way, a dynamic programming approach can be applied using Lebesgue sampling of the state space X, instead of resorting to an arbitrary discretization of X such as the one proposed in the cell-to-cell mapping approach.38−40 Cell mapping methods offer a direct approach to optimal control problems in a discretized state space, known as the cell state space. The cell mapping methods are based on the assumption of a finite set of sampled time intervals, admissible control inputs, and phase space regions. The control performance index is discretized accordingly. Due to the quantized input levels and the discretized state space, cumulative steady-state error appears in the solution, and degenerate cases such as false limit cycles and chattered paths occur in the optimal solution. The main drawback of cell-to-cell methods is that they need to explore the whole state space even though many parts of it can only be visited under poor control policies. Moreover, since the states in a cell are all represented by the same point at its center, for accurate dynamic modeling of DCCSs the cells should be very small, which makes the approach computationally intractable as the dimension/size of X increases. The reader is referred to the quoted references for further details.
2.2. Example 1. In Figure 5, a two-tank system is shown with five taps intermittently discharging into the upper tank at a flow rate of 0.3 m3/(h tap), which in turn discharges its outflow into the controlled (lower) tank. In Figure 6a, a discharge pattern is given to highlight the difficulty of managing the lower tank capacity by properly manipulating its outflow rate, and in Figure 6b, the upper tank level dynamics is shown. The system state for optimal control is defined by the level h2 in the lower tank. The upper tank outflow rate cannot be controlled directly, and its discharge d(t) is the disturbance, which follows Torricelli's law: d(t) = √h1. For optimal operation, the control task G is defined as follows. The lower (controlled) tank must accommodate its inflow rate d(t), which varies significantly, while maximizing the mean discharge rate (throughput) over time by varying the mode parameter θ in the feedback law

u(x, θ) = θ x̃,   0.6 ≤ θ ≤ 1.2   (4)

Figure 9. Results obtained using the 3DP algorithm for a different inflow rate pattern in Example 1. (a) Inflow rate pattern; (b) lower tank outflow rate; (c) scheduling of mode parameters.
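A minimal simulation sketch of Example 1 follows: one control mode of eq 4 is integrated until the number of open taps changes or the lower tank overflows/empties, with the disturbance given by Torricelli's law and the level smoothing of eq 5 introduced just below. Tank cross-sections, the unit discharge coefficient, the integration step, and the smoothing factor are assumed values for illustration only.

```python
import numpy as np

A1, A2 = 1.0, 2.0        # assumed tank cross-sections [m2]
DT = 0.01                # assumed integration step [h]
ALPHA = 0.3              # assumed smoothing factor (eq 5)

def run_mode_two_tank(h1, h2, h2_smooth, theta, taps_of_t, t, h2_max=4.0, t_max=8.0):
    """Integrate both levels under one control mode u = theta * h2_smooth (eq 4)
    until the number of open taps changes, the lower tank overflows, or it runs
    empty (stopping conditions of Example 1)."""
    taps0 = taps_of_t(t)
    while t < t_max:
        taps = taps_of_t(t)
        d = np.sqrt(max(h1, 0.0))             # upper-tank discharge, Torricelli's law
        u = theta * h2_smooth                 # control mode, eq 4
        if taps != taps0 or h2 >= h2_max or h2 <= 0.0:
            return h1, h2, h2_smooth, t       # stopping condition triggered
        h1 += DT * (0.3 * taps - d) / A1      # upper (uncontrolled) tank balance
        h2 += DT * (d - u) / A2               # lower (controlled) tank balance
        h2_smooth = ALPHA * h2 + (1 - ALPHA) * h2_smooth   # eq 5
        t += DT
    return h1, h2, h2_smooth, t

# example disturbance pattern: 5 taps open for the first hour, then 2
levels = run_mode_two_tank(1.0, 1.5, 1.5, theta=0.9,
                           taps_of_t=lambda t: 5 if t < 1.0 else 2, t=0.0)
```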


Simultaneously, the undesirable situations in which the lower tank overflows or stops its downstream discharge must be prevented at all costs. Each control mode u(x, θ) is stopped whenever the number of open taps changes. Important data for Example 1 are given in Table 1. Note that, in order to avoid a sudden upset in the outflow rate u due to mode switching, the feedback control law in eq 4 resorts to a smoothed level x̃ in the lower tank defined as follows:

x̃(t) = αx(t) + (1 − α)x̃(t − 1),   0 ≤ α ≤ 1   (5)

where the smoothing parameter α must be properly set, bearing in mind not only the stability of downstream operations but also timely switching to a different control mode to handle undesirable events (the controlled tank overflows or becomes empty). For a given discharge pattern and an initial state of the DCCS, there exists at least one optimal schedule for the mode parameter θ that maximizes the lower tank throughput over a finite horizon of N control modes. Significant changes to the discharge pattern in Figure 6a will affect the optimality of the switching policy and may also cause the lower tank to overflow should its downstream discharge be interrupted.
2.3. Simulation-Based Dynamic Programming with Lebesgue Sampling. Simulation-based dynamic programming is a solution approach to sequential decision problems which consists of integrating evaluative feedback and reinforcement learning algorithms27−29 to learn to achieve a goal (control task) from simulated interactions between a decision-making agent (or controller) and a system. During learning, the agent interacts with the system by selecting actions (i.e., control mode parameters), and as a result of executing an action, the system under control makes a state transition and the agent receives a numerical signal r, called the reward (cost), that provides a hint of how good (or bad) the executed action is in terms of the observed state transition. Rewards are directly related to the achievement of the goal or behavior for which the sequence of actions is intended. In resorting to simulation-based DP for optimal operation of DCCSs, the objective of the control agent is to learn the optimal switching policy, π*, which defines the control mode parameters for any state/disturbance the system may be in, bearing in mind both short- and long-term rewards. To this aim, the learning controller executes a sequence of modes to maximize the amount of reward received from an initial state/disturbance pair (x0, d0) until a certain goal state is reached. Under a given switching policy π, let us assume the expected cumulative reward Vπ(x0, d0), or value function (in dynamic programming jargon), over a certain time interval is a function of (xπ, d, tπ, θπ), where tπ = {t(k)}_{k=1}^{N} are the time instants at which mode switches occur, xπ = {x(k)}_{k=1}^{N} are the corresponding state values, d = {d(k)}_{k=1}^{N} are the corresponding disturbance values, and θπ = {θ(k)}_{k=1}^{N} defines the policy-specific sequence of control modes. The sequence xπ of state transitions gives rise to rewards {r(k)}_{k=1}^{N} that are used to define a discounted value function

Vπ(x0, d0) := E[ γ^N r(N) + Σ_{k=1}^{N−1} γ^k r(k) ]   (6)

where γ ∈ [0,1] is a parameter called the discount rate, which determines the present value of future rewards: a reward received k time steps in the future is worth only γ^{k−1} times what it would be worth if it were received immediately. If γ = 0, the switching policy is "myopic" in being concerned only with maximizing immediate rewards. As γ → 1, the switching policy takes future rewards into account more strongly (i.e., it becomes more farsighted). An optimal switching policy π* is one that maximizes the associated value function in eq 6, such that the generated sequences (xπ, d, tπ, θπ) and the corresponding rewards {r(k)}_{k=1}^{N} satisfy the recursive Bellman equation

V*(x(k), d(k)) = max_θ { r(k + 1) + γ·E_{x(k+1)}[ V*(x(k + 1), d(k + 1)) | x(k), d(k), θ(k) ] }   (7)

The control value function Q* is defined by

Q*(x(k), d(k), θ(k)) = r(k + 1) + γ·E_{x(k+1)}[ V*(x(k + 1), d(k + 1)) | x(k), d(k), θ(k) ]   (8)

such that V*(x, d) = max_θ Q*(x, d, θ) for all (x, d). Once Q* is learned through simulated interactions, the optimal switching policy is obtained directly through

π*(x, d) = arg max_θ Q*(x, d, θ)   (9)

To find the Q* values for alternative parametrizations of control modes, the well-known dynamic programming algorithm for policy iteration can be used. Thus, DP algorithms are obtained by rewriting Bellman's equation (eq 7) into update rules for improving approximations of the desired value functions and the corresponding switching policy using simulated state transitions. The classic DP algorithm is now combined with Lebesgue sampling to select the optimal control mode at each state x(t) ∈ ?k3, k = 1, ..., N, in a new algorithm, which is referred to here as 3DP. For a given disturbance dynamics ḋ(t), and assuming that the mode parameter space Θ has been arbitrarily discretized as ΘDP, the set ?3 = ?13 ∪ ?23 ∪ ... ∪ ?N3 of reachable states is first computed using Algorithm 1 (Figure 4). Then, the 3DP algorithm recursively computes the control modes defining the optimal switching policy π*. Starting from the terminal step N, 3DP exploits Bellman's optimality principle to determine the value function V*(?N3) and the corresponding optimal mode parameters π*(?N3). For any state x(t) ∈ ?N3, the value function is initialized using r(N). The Q*-values are computed recursively for any tuple (x(k), d(k), θ(k)) in line 6 of Algorithm 2 in Figure 7. It is worth noting that, for a given disturbance dynamics, the successor state of a control mode chosen from ΘDP is always included in ?3. Each optimal parameter value π*(x(k), d(k)) in the current recursion step is the maximizing argument of the Q*-values over all modes that can be chosen in a particular z(k) = (x(k), d(k)), and the value function Vk* is the corresponding maximum value. A word of caution, though, is that the optimality of the resulting policy π*(?N3) heavily depends on how well the sampling set ?N3 approximates the reachable space 3. As the algorithm for Lebesgue sampling in Figure 4 is based on a finite number of control modes in ΘDP, the resulting control policy is only an approximation to the optimal policy, which would be obtained when a continuum of mode parameters is considered.
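The backward recursion of eqs 7−9 over a finite Lebesgue sample and a finite mode set can be sketched as follows. This is a schematic rendering in the spirit of the 3DP algorithm (Algorithm 2 in Figure 7), not a transcription of it: simulate_mode and reward are placeholder callables, state/disturbance pairs are assumed hashable (e.g., tuples), and a deterministic successor is assumed for brevity.

```python
def ldp_backward_recursion(samples, theta_set, simulate_mode, reward, gamma, N):
    """samples[k] lists the (x, d) pairs reachable at stage k (Lebesgue sample).
    simulate_mode((x, d), theta, k) must return the successor pair in samples[k+1].
    Returns the value table V and the greedy switching policy pi (eqs 7-9)."""
    V = {N: {z: reward(z, None, N) for z in samples[N]}}   # terminal values, r(N)
    pi = {}
    for k in range(N - 1, -1, -1):
        V[k], pi[k] = {}, {}
        for z in samples[k]:
            q = {}
            for theta in theta_set:
                z_next = simulate_mode(z, theta, k)
                # eq 8 with a deterministic successor: Q = r(k+1) + gamma * V*(z')
                q[theta] = reward(z_next, theta, k + 1) + gamma * V[k + 1][z_next]
            pi[k][z] = max(q, key=q.get)       # eq 9: argmax_theta Q*
            V[k][z] = q[pi[k][z]]              # eq 7: V* is the maximal Q*
    return V, pi
```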


2.4. Example 1 (Continued). The algorithm 3DP has been applied to determine an optimal sequence of control modes that maximizes the system throughput by rewarding state transitions in such a way that the average outflow rate from the controlled tank is maintained as high as possible without overflowing or interrupting the downstream discharge. Also, sudden changes to the outflow rate made to prevent such undesirable events are heavily penalized. After each control mode has been implemented, the corresponding reward r(k) is calculated using the following reward function:

r(k) = −1 + Y exp[ −(1/2)(ΔF/H)² ] + (1 − Y) exp[ −(1/2)(1/(I·F̄))² ],   if 0 < x(ξe) < xmax   (10)

and r(k) = −∞ whenever the lower tank overflows or becomes empty; x(ξe) is the level in the controlled tank when a stopping condition ξe is activated, F̄ is the average outflow rate over the time interval of the applied mode, and ΔF is the net change in the outflow rate when the mode switches. The reward function parameters are set to Y = 0.7, H = 1/12, and I = 5. All control modes are stopped whenever a tap is opened or closed. Control modes are also interrupted if the lower tank overflows or becomes empty. The proposed 3DP algorithm was applied using the inflow rate pattern in Figure 6a, which imposes an optimization horizon of N = 30 modes. In Figure 6b, the resulting level in the upper tank is shown, whereas the level in the controlled tank is shown in Figure 6c. To obtain a risk-sensitive control policy, the discount parameter must be set close to 1 so that the long-term effects of control actions are taken into account. To this aim, a convenient value γ = 0.95 has been chosen, which guarantees that the lower tank neither overflows nor becomes empty. The number of mode parameters in the fixed set ΘDP is limited to 40 values of θ evenly distributed over the interval θ ∈ [0.7, 1.1]. The parameters used by 3DP in this example are summarized in Table 2. The results obtained are given in Figure 8. The level dynamics in the lower tank (Figure 8a) highlights that the switching policy found by the 3DP algorithm is able to handle significant variability in the upper tank discharge (see Figure 8b) while maximizing its average outflow rate, as shown in Figure 8c. The optimal sequence of mode parameters that maximizes the average throughput is depicted in Figure 8d. It is worth noting that this sequence is highly sensitive to the variability pattern in the upper tank inflow rate, as can be seen in Figure 9. Assuming the inflow rate pattern in Figure 9a, the optimal sequence of control modes is now given by Figure 9c, with the corresponding outflow rate shown in Figure 9b. Accordingly, for optimal operation of the two-tank system, the switching policy should be able to cope with variability in the inflow rate pattern, which alters the level dynamics in the upper (uncontrolled) tank. It is worth noting, then, that both the set of states x(t) ∈ ?3 and the optimal switching policy found by the 3DP algorithm depend on the inflow rate pattern. Thus, to learn a robust switching policy that copes with a stochastic disturbance dynamics, uncertainty in state transitions following control mode executions must be addressed.
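The reward of eq 10, as reconstructed above with Gaussian-shaped terms, can be transcribed directly; the example call uses arbitrary values for the outflow change and the average outflow.

```python
import math

Y, H, I = 0.7, 1.0 / 12.0, 5.0      # reward parameters from Example 1

def reward(delta_F, F_bar, level, level_max):
    """Reward of eq 10: penalize outflow upsets (delta_F) and low average
    throughput (F_bar); overflowing or emptying the lower tank is forbidden."""
    if not 0.0 < level < level_max:
        return float("-inf")
    smooth_switch = Y * math.exp(-0.5 * (delta_F / H) ** 2)
    high_throughput = (1.0 - Y) * math.exp(-0.5 * (1.0 / (I * F_bar)) ** 2)
    return -1.0 + smooth_switch + high_throughput

# e.g., a mode that changed the outflow by 0.05 m3/h with a mean outflow of 0.9 m3/h
print(reward(0.05, 0.9, level=1.2, level_max=4.0))
```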

Figure 10. Lebesgue sample dynamic programming with Gaussian process regression.

3. OPTIMAL MODE SWITCHING UNDER UNCERTAINTY
The main drawback of 3DP is that, when state transitions are subject to uncertainty, unseen states arise. To circumvent this issue, 3DP should be combined with a regression technique for generalizing the switching policy π to states that are not necessarily in the neighborhood of known states x(t) ∈ ?3. Also, the search for a near optimal switching policy under uncertainty must selectively resample state transitions using only the most promising control modes to lead a DCCS from an initial state to the goal state, requiring only a small number of simulation runs. This imposes the need for introducing active learning into the state sampling strategy so as to focus on improving π by generating informative sequences (xπ, d, tπ, θπ) as policy evaluation runs are increasingly used to bias data sampling. It is worth noting that, as the switching policy is iteratively changed, new states are visited, which gives rise to the need for selecting which states are used for value function approximation.
3.1. Gaussian Processes (GPs) in a Nutshell. In the following, a brief introduction to GPs is given based on the books by Rasmussen and Williams41 and MacKay,42 using the value function Vk*(·) as a representative example. Consider a data set {Zk, Vk}, where Vk is a vector of nk sample points of the value function Vk*(·) and the corresponding state/disturbance pairs z(k) = (x(k), d(k)) make up the input data matrix Zk. If observations are generated by Vk* = h(z(k)) + ε, ε ∼ N(0, σε²), we want to infer an inductive model h of the (unknown) function Vk*(·) that generated the observed data. Within a Bayesian framework, the inference of the underlying function h is described by the posterior probability

p(h|Zk, Vk) = p(Vk|h, Zk) p(h) / p(Vk|Zk)   (11)

where p(Vk|h, Zk) is the likelihood and p(h) is a prior on plausible value functions assumed by the GP model. The term p(Vk|Zk) is called the evidence or the marginal likelihood.


Figure 11. Role of sampling bias in Bayesian active learning.

When modeling value functions in DP using GPs, a probabilistic prior p(h) is placed directly in the space of functions, without the necessity of considering an explicit parametrization of the approximating function h. This prior typically reflects assumptions on the, at least local, smoothness of the latent function. Similar to a Gaussian distribution, which is fully specified by a mean vector and a covariance matrix, a GP is specified by a mean function m(·) and a covariance function Cov(·,·), also known as a kernel. A GP can be considered a distribution over functions. However, regarding a function as an infinitely long vector, all necessary computations for inference and prediction of value functions can be broken down to manipulating well-known Gaussian distributions. The fact that the value function Vk*(·) approximation is based on a GP is indicated hereafter by Vk*(·) ∼ GPv(m, Cov). Given a GP model of the value function Vk*(·), we are interested in predicting the value function for an unseen input zk*. The predictive (marginal) distribution of the value function approximation Vk* = GPv(zk*) for a test input zk* is Gaussian distributed with mean and variance given by

E[Vk*(zk*)] = k(zk*, Zk) (K + σε² I)⁻¹ Vk   (12)

Var[Vk*(zk*)] = Cov(zk*, zk*) − Cov(zk*, Zk) (K + σε² I)⁻¹ Cov(Zk, zk*)   (13)

where K is the kernel matrix with entries Kij = Cov(zk^i, zk^j). A common covariance function Cov is the squared exponential (SE):

CovSE(z, z′) := ζ² exp[ −(1/2)(z − z′)ᵀ Λ⁻¹ (z − z′) ]   (14)

where Λ = diag([ℓ1², ℓ2², ..., ℓnx²]), with ℓr, r = 1, ..., nx, being the characteristic length scales, and the parameter ζ² describes the variability of the inductive model h. The parameters of the covariance function Cov are the hyperparameters of the GP and are collected in the vector ψ. To fit the hyperparameters to value function data, the evidence maximization or marginal likelihood optimization approach is recommended.29 The log-evidence is given by

log p(Vk|Zk, ψ) = log ∫ p(Vk|h(Zk), Zk, ψ) p(h(Zk)|Zk, ψ) dh = −(1/2) Vkᵀ (Kψ + σε² I)⁻¹ Vk − (1/2) log|Kψ + σε² I| − (nk/2) log(2π)   (15)

The marginal likelihood can be considered the likelihood of the hyperparameters given the data after having marginalized out the latent function h. We made the dependency of K on the hyperparameters ψ explicit by writing Kψ in eq 15. Evidence maximization yields an inductive model of the value function that (a) rewards data fitting and (b) rewards simplicity of the model. Hence, it automatically implements Occam's razor, that is, preferring the simplest hypothesis (model). Maximizing the evidence is a nonlinear, unconstrained optimization problem. Depending on the data set, this can be a very hard problem to solve. However, after optimizing the hyperparameters ψ, the GP model can always explain the data, although a global optimum has not necessarily been found.
3.2. Gaussian Process Dynamic Programming with Lebesgue Sampling (3GPDP). 3GPDP is a generalization of the 3DP algorithm in Figure 7 to handle continuous state and mode parameter spaces using fully probabilistic GP models to describe value functions. In this section, a multimodal optimal control problem under uncertainty is addressed, where the mode transition dynamics is stochastic and only data gathered from interactions with a simulator or the real system are available. To determine a solution for continuous-valued state and mode parameter spaces, 3GPDP describes the value functions Vk*(·) and Qk*(·,·) directly in function space by representing them with fully probabilistic GP models. These inductive models make intuitive sense, as they use simulation/real data to determine the underlying structures of these value functions, which are a priori unknown. Moreover, they provide information about confidence intervals for value function estimates and near optimal settings of control mode parameters. Similarly to the Lebesgue-sampled 3DP (Algorithm 2) in Figure 7, we now choose finite sets ?̵ k:, k = 1, ..., N, and ΘS for states and mode parameters, respectively. However, instead of being a discrete representation of the state and parameter spaces, these sets are now considered
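Equations 12−15 amount to a few lines of linear algebra. The sketch below is a generic numpy implementation with fixed (not optimized) hyperparameters and illustrative value-function data; it is not tied to any particular GP library.

```python
import numpy as np

def se_kernel(Za, Zb, zeta2, lengthscales):
    """Squared-exponential covariance (eq 14) between two sets of inputs."""
    L = np.asarray(lengthscales, dtype=float)
    diff = (Za[:, None, :] - Zb[None, :, :]) / L         # length-scale-scaled differences
    return zeta2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_fit_predict(Z, V, Z_star, zeta2=1.0, lengthscales=(1.0, 1.0), noise=1e-2):
    """Predictive mean/variance (eqs 12-13) and log marginal likelihood (eq 15)."""
    Z, V, Z_star = np.atleast_2d(Z), np.asarray(V, float), np.atleast_2d(Z_star)
    K = se_kernel(Z, Z, zeta2, lengthscales) + noise * np.eye(len(Z))
    Kinv_V = np.linalg.solve(K, V)
    k_star = se_kernel(Z_star, Z, zeta2, lengthscales)
    mean = k_star @ Kinv_V                                               # eq 12
    var = zeta2 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)  # eq 13
    n = len(V)
    log_evidence = (-0.5 * V @ Kinv_V
                    - 0.5 * np.linalg.slogdet(K)[1]
                    - 0.5 * n * np.log(2.0 * np.pi))                     # eq 15
    return mean, var, log_evidence

# value-function samples at a few (level, inflow) pairs, queried at an unseen pair
Z = np.array([[1.0, 0.3], [2.0, 0.6], [3.0, 0.9]])
V = np.array([-0.8, -0.4, -0.6])
m, v, ev = gp_fit_predict(Z, V, np.array([[2.5, 0.7]]))
```

In practice the hyperparameters (ζ², length scales, noise) would be chosen by maximizing the returned log-evidence rather than fixed as here.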


Figure 12. Combining 3 GPDP with Bayesian active learning.

model for the value function allows addressing uncertainty in state transitions following a mode execution and the variability in corresponding rewards (costs) in a natural way. As the expectation is with respect to the latent function V*k+1(zk+1), which is probabilistically modeled using .7] , E = [.7](zk + 1)], is

the support points (training data) for approximating value f u n c t i o n s V k* ( · ) a n d V k* ( · , · ) us in g G P m o d e ls : V k*( ·) ∼ .7](m = , Cov=) and Q k*( ·, ·) ∼ .7X(mX , CovX), respectively. The training targets (observations) for .7] and .7X are recursively determined by 3GPDP itself. A sketch of the modebased3GPDP algorithm using Lebesgue sampling and for a known state transition simulator is given in Figure 10. The advantage of modeling the state-value function Vk*(·) by .P] is that the GP provides a predictive distribution of V*k (zk) for any input zk through eqs 12 and 13. This property is exploited in the computation of the Qk*-value (line 7); due to the generalization property of .7] , we are not restricted to a finite set of successor states following mode execution when determining E][.P](zk + 1)]. It is worth noting that using a probabilistic

simply m = (.7](zk + 1)), that is, the predictive mean of the Vk+1 * (zk+1). Note that for any zk ∈ ?̵ k: the corresponding .7X models a function of over a continuum of mode parameters not only the finite number of modes in the set ΘS. Therefore, the optimal control mode for zk in line 10 is the maximum of the mean function of .7X . The parametrization for the optimal control mode at zk is obtained by solving the optimization problem: 13752


Figure 13. Results obtained using the BA3 GPDP algorithm for Example 1. (a) Inflow rate pattern; (b) lower tank outflow rate; (c) scheduling of mode parameters.

Figure 14. Results obtained using the BA3 GPDP algorithm for Example 1. (a) Inflow rate pattern; (b) lower tank outflow rate; (c) scheduling of mode parameters.

argmaxθ Q *k (zk , θ ) = argmaxθ mq(θ), subject to: θ min(zk) ≤ θ ≤ θ max(zk)

only needs the value function V*k (zk), the maximum expected cumulative reward at a Lebesgue point, which is estimated using .7] . Furthermore, a good model of Qk* in joint state−action space requires substantially more training points and makes standard GP models computationally very expensive. In contrast to the classic DP, 3GPDP is independent of the time-sampling f r e q u e n c y . W h e n c o m p a r e d w i t h 3DP, t h e s et s ?̵ k:, k = 1, ..., N in 3GPDP contains support points of the GP value function models rather than a discrete representation ?L of the state space L . Thus, by means of Lebesgue sampling,

(16)

using conventional optimization methods. A distinctive feature of simulation-based dynamic optimization using 3GPDP is that for all zk ∈ ?̵ k: , independent GP models for Q k*(zk , ·) are used rather than modeling Q k*( ·, ·) in joint state−actions space as it is proposed by Mehta and Egerstedt.26 The main reason for this choice is that, by means of a simulation model for state transitions, the 3GPDP algorithm in Figure 10 13753


Figure 15. Six results inflows rates of six different tap schedules.

problem to obtain an optimal switching policy π* valid all over 3 . The support sets ?̵ k: k = 1, ..., N , contains the training inputs for the proposed policy GP, that is, the training input locations of the value function GP. The training targets are the parameter values ?̵ k: k = 1, ..., N . If we lack problem-specific priors, this general approach is applicable. Although any function approximation technique can be used for policy modeling purposes, the multimodal policy is also approximated with a GP, the policy GP (line 15 of Algorithm 3). Finding a globally optimal policy is very difficult in general. Moreover, it requires many data points, particularly in high-dimensional state spaces. In practical applications, a globally optimal policy is not required, but rather a policy that can solve a particular control task. Thus far, we have placed the support points ?̵ k: k = 1, ...N for the value function model .7] based on sampling in the states/

3GPDP can use data far more efficiently than the Riemanndiscretized version of GPDP proposed elsewhere.38−40 Also, by resorting to GPs, the propose algorithm 3GPDP can readily handle uncertainty in state transitions due to stochastic disturbance dynamics and noisy measurements. We interpret the optimal control modes π*(?̵ k: ) k = 1, ..., N (line 15 of Algorithm 3) returned by 3GPDP as noisy measurements of an optimal policy under uncertainty. To generalize an optimal, continuous-valued switching policy on reachable state space 3 , we have to model the switching policy based on a finite number of evaluations in the support sets ?̵ k: k = 1, ..., N . We regard the switching policy π* as a deterministic map from states/disturbances to mode parametrizations. To generalize the switching policy based on a finite set of mode parameters ΘS, we have to solve a regression 13754


Figure 16. Testing results obtained using the grand switching policy π.* for Example 1. (a) Inflow rate pattern; (b) upper tank level; (c) upper tank outflow rate; (d) lower tank level; (e) lower tank outflow rate; (f) scheduling of mode parameters.

3.3. Learning the Transition Dynamics. Optimal operation under uncertainty of DCCSs using the 3GPDP algorithm in Figure 10 requires many simulation trials to average the value function V*k (z) for any z, as it is needed to define the optimal policy π *(?̵ k:). Moreover, lacking an initial estimation of a good policy to drive the system from an initial state to a goal state, there exist many states that can only be visited under a rather poor switching policy. Since the objective is to find an optimal policy requiring only a small number of interactions (real or simulated) with a DCCS, it seems worth introducing modeling bias for state transitions due to control modes. As it

disturbance spaces using random sequences of control modes. We consider this a suboptimal strategy, which can be highly data-inefficient for achieving a meaningful cover of the state space as N increases. However, lacking prior knowledge about a near optimal switching policy, for a proper trade-off between exploration and exploitation the state space sampling should be restricted to the relevant part only. In the remainder of this section, we describe how to combine online modeling of the state transition dynamics using GPs with Bayesian active learning by means of a utility function.
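A minimal sketch of how training data for the dynamics GP of eq 17 can be assembled from observed mode executions: inputs are (state, disturbance, mode parameter) tuples and targets are the state/disturbance differences at mode switches. The transition values are illustrative, and the per-dimension regressors are assumed to be fitted with a generic GP routine such as the gp_fit_predict sketch given earlier.

```python
import numpy as np

def dynamics_gp_dataset(transitions):
    """Build per-dimension training sets for the dynamics GP of eq 17.
    `transitions` is a list of (z_k, theta_k, z_k1) observed at mode switches,
    with z = (x, d) flattened to a 1-D array.  One GP is trained per output
    dimension on the difference z_{k+1} - z_k."""
    inputs, deltas = [], []
    for z_k, theta_k, z_k1 in transitions:
        inputs.append(np.concatenate([np.atleast_1d(z_k), np.atleast_1d(theta_k)]))
        deltas.append(np.asarray(z_k1, float) - np.asarray(z_k, float))
    X = np.vstack(inputs)
    D = np.vstack(deltas)
    # targets[j] is the training vector for the GP of the j-th state/disturbance variable
    return X, [D[:, j] for j in range(D.shape[1])]

# three mode executions of the two-tank example: z = (lower level, inflow rate)
trans = [((1.5, 0.8), 0.9, (1.7, 0.6)),
         ((1.7, 0.6), 1.0, (1.4, 0.9)),
         ((1.4, 0.9), 0.8, (1.8, 0.7))]
X, targets = dynamics_gp_dataset(trans)   # feed each target vector to a GP regressor
```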


is widely recognized,27−29 model-based learning methods often make better use of available information, since they capture the underlying pattern (latent function) for state transitions. Probabilistic models such as GPs appropriately quantify uncertainty about state transitions and, assuming the mode transition dynamics is smooth, can lead to very data-efficient solutions. GP models of both transition dynamics and the value functions Vk*(·) and Qk*(·,·) will be built simultaneously, as data biases are introduced through sampling using an increasingly improved switching policy. For optimal control under uncertainty, a model state transition dynamics based on interactions with the real dynamic system or a simulator of it is used here. We assume that the process dynamics evolve smoothly while a control mode is being implemented. Moreover, we implicitly assume that output variability is due to uncertainty about stopping conditions activation. A GP model, the dynamics GP, is learned to describe the state/disturbance transition dynamics through expectations on changes caused by a control mode execution. For each input variable zd, a separate GP model is trained in such a way the effect of uncertainty about its change following a control mode execution is modeled statistically as z d(k + 1) − z d(k) ∼ GPd(md , Covd)

information gain about the latent value function and to reduce uncertainty about the optimal control actions at each input in the shaded area of Figure 11. Hence, we choose a utility function < that captures both objectives to rate the quality of candidate states used to infer GPs. We aim to find the most promising state z*̃ that maximizes the expected utility function based on Bayesian averaging using β Ev(Vk(z)̃ |?̵ k:)+ log(varv[Vk(z)̃ |?̵ k:]) 2

(18)

where > and β are used to control the exploration−exploitation trade-off in BA3 GPDP, and the predictive mean and variance of the value function are Ev[Vk(z)̃ |?̵ k:)]; varv[Vk(z)̃ |?̵ k:)] = Covv(z ̃, z)̃ + Covv(z ̃, ?̵ k:)K −v 1Covv(?̵ k:, z)̃

(19)

The utility of a given input z ̃ expresses how much total reward is expected from it when acting optimally (first term) and how surprising V*k (z)̃ is expected to be given the current training inputs ?̵ k: of the GP model for Vk (second term). The second term in eq 18 is somewhat related to the expected Shannon information (entropy) of the predictive distribution Vk*(z)̃ or the Kullback−Leibler divergence29,43 between the predictive Gaussian distribution of V *k (z)̃ |?̵ k: and V *k (?̵ k:). The parameters > and β are used to trade off optimality of action selection and information gain needed for policy improvement. A large (positive) value of > introduces data bias toward optimal operation, whereas a large value (positive) of β favors gaining information based on the predicted variance of the value function for unseen states at different decision stages. In the search for an improved switching policy, the algorithm BA3 GPDP exploits current knowledge represented and provided by the probabilistic value function model .7] . Gaining information means that BA3 GPDP should explore selectively sparsely sampled subspaces with few training points (states). Thus, by adding inputs z*̃ that are informative for optimal operation of a DCCS, the most promising state trajectories from the initial point to the goal state can be found. In Figure 12, the main steps of the algorithm BA3 GPDP are shown. First a small number (say ; ) of state/disturbance transition trajectories are simulated by applying from different initial inputs z1(0) ∈ ?̵ 0: a sequence of up to N control modes whose parameter values are randomly chosen from the finite set ΘS based on an initial switching policy π0. The resulting tuples (x1(k),d1(k),θ1(k)) observed along the generated trajectories define the training data for the dynamics GP in the first iteration. The observed state transitions are necessarily noisy due to the uncertainty in the activation of stopping conditions. To obtain the switching policy in the first iteration of BA3 GPDP, a DP-like recursion is used to approximate .7] beginning with final states x1(N ) ∈ ?̵ :N for which the value function can be readily calculated (line 9) and then used to train VN*(z1N) (line 10). The internal loop determines the support sets ?̵ k:, k = N − 1, ..., 1 using Bayesian active learning as follows. For each z1(k) = (x1(k), d1(k)) the dynamics GP is used to estimate

(17)

where md is the mean function and Covd is the covariance function. The training inputs to the transition dynamics GPd are tuples (x(k),d(k),θ(k)), whereas the targets are the differences shown in eq 17. The posterior distribution for the dynamics GP reveals the remaining uncertainty about the underlying latent function for any input zd transition caused by a given mode θ(k) when stopping conditions are activated. GP models of the transition dynamics for both states and disturbances are built on the fly using data gathered from interactions with a simulator or the real system. Further details on GP regression can found in the excellent books of Rasmussen and Williams41 and MacKay,42 and references therein. 3.4. Bayesian Active Learning. To find an optimal switching policy, Bayesian active learning is incorporated in the algorithm described in Figure 12 such that only a relevant part of the state space will be explored. A priori it is unclear which parts of the state space are relevant. Hence, ‘‘relevance’’ is rated by a utility function within a Bayesian active learning framework in which the posterior distributions of the value function model .7] will play a central role. This novel online algorithm largely exploits information, which is already computed within the algorithm 3 GPDP. The combination of Bayesian active learning and 3 GPDP will be called BA3 GPDP in the sequel. Instead of a globally, sufficiently accurate value function model, BA3 GPDP aims to find a locally appropriate value function model in the vicinity of most promising trajectories from the initial states to the goal state using a optimal sequence of control modes. Figure 11 depicts how such a solution is found by increasingly biasing data gathering. Starting from an initial state, training inputs for the involved GP models are placed only in a relevant part of the state space (shaded area). The algorithm BA3 GPDP finds a solution leading the system through this relevant region to the goal state. : Consider a given set ?̵̅ k of candidate states that could be added to the support set ?̵ k: used for training the GPs. For efficiency reasons, only the best candidates shall be added. In probabilistic : optimal control, a ‘‘good’’ input z*̃ ∈ ?̵̅ k should provide both

:

the mean successor states that will make up the set ?̵̅ k + 1 from which the most promising ∂ states according to the utility function (eq 18) are chosen to augment ?̵ k: . Once the backward 13756


recursion has been completed, the first version of the switching policy π*1(?̵ k:), k = 1, ..., N is approximated using a Gaussian process, the policy GP. In the next iteration, the policy π1* is applied to all initial inputs z 2(0) ∈ ?̵ 0: so that tuples (x2(k),d1(k),θ2(k)) are obtained along simulated trajectories so that new state-disturbance pairs can be considered as candidates to augment ?̵ k:, k = 1, ..., N . Bayesian active learning is the used to select the most informative ∂ states, and to then all GPd that model the transition dynamics are updated. When the backward recursion ends, a new version of the control policy π*2 (?̵ k:), k = 1, ..., N , is obtained. As switching policies in successive iterations are also modeled using GPs, policy iteration can be stopped when the Kullback−Leibler (KL) distance between two successive policy distributions all over the grand support set ?̵ = ?̵ 0: ∪ ?̵ 1: ∪ ...?̵ k: ∪ ...?̵ :N is lower than a small tolerance δ KL(GPπ *j+1 (?̵ )GPπ *j (?̵ )) =

Second, policy iteration using Bayesian active learning focuses Lebesgue sampling only where it is needed for optimal control, which makes the memory size and computation complexity required for convergence rather small. Hence, BA3 GPDP can also lead to a remarkable speed up of 3 GPDP and several orders of magnitude improvements over DP. Generalization capabilities of GPs are instrumental in significantly lowering the cardinality of sets ?̵ and ΘS required for convergence in policy iteration. 3.5. Example 1 (Continued). To end this section, the algorithm BA3 GPDP is applied to the two-tank system in Figure 5. One of the main advantages of using GPs to model value functions and state transition dynamics is that the finite set ΘS may have a small number of elements, which dramatically lowers computational costs. The number of control modes in ΘS is chosen as 15, whereas the values for the parameter θ are evenly distributed over the interval θ ∈ [0.7,1.1]. As a result, after 15 iterations the support set ?̵ for relevant states provides enough information to approximate the switching policy using GPs. Moreover, resorting to GPs allows obtaining a smoother switching policy, as it is highlighted in Figures 13 and 14 for different input patterns. The remaining uncertainty for control mode parametrizations in Figure 13c and 14c is small enough to comply with the convergence criterion in eq 20. Finally, it is worth noting that, for the input pattern in Figure 14a, the algorithm BA3 GPDP is able find an optimal control policy that applies almost the same control mode through the entire optimization horizon (N = 30 modes). To address uncertainty in the inflow rate to the upper tank, a simple procedure is used. Let us consider the six inflow rates in Figure 15. For each one of them 15 iterations of the BA3 GPDP are made to obtain πi*(?̵ :i), i = 1, ..., 6, which are used to fit a Gaussian process, called the grand policy GP, supported by points in the set ?̵ = ?̵ :1 ∪ ?̵ :2 ∪ ...?̵ :6 . To illustrate the interpolation capability of this comprehensive switching, it has been tested using much longer (an unseen) sequence of events related to opening/closing taps over time, as shown in Figure 16a. As can be seen in Figure 16b, the level in the upper tank varies significantly, which generates an extremely oscillating inflow rate to the lower tank (see Figure 16c) that the switching policy must handle properly. Results obtained for this testing exercise are shown in Figure 16d−f. The grand switching policy is able to successfully extrapolate the mode selection strategy to unseen states. It is noteworthy the maximum usage of the lower tank capacity to accommodate an extreme variability in its inflow rate by a proper schedule of control modes.
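One way to assemble the grand switching policy described above is to stack the support points and optimal mode parameters collected from the individual runs into a single training set and regress θ on the state/disturbance pair. The sketch below assumes the gp_fit_predict routine from the earlier GP sketch is available; the support points, parameter values, and bounds are illustrative.

```python
import numpy as np

def fit_grand_policy(support_sets):
    """Stack the support points and optimal mode parameters from several
    BA3GPDP runs (one per inflow-rate pattern) into a single training set
    for the grand policy GP."""
    Z = np.vstack([z for z, _ in support_sets])
    theta = np.concatenate([t for _, t in support_sets])
    return Z, theta

def grand_policy(z_query, Z, theta, theta_bounds=(0.7, 1.1)):
    """Predict the mode parameter at an unseen state/disturbance pair with the
    policy GP (gp_fit_predict assumed in scope) and clip it to the admissible range."""
    mean, _, _ = gp_fit_predict(Z, theta, np.atleast_2d(z_query))
    return float(np.clip(mean[0], *theta_bounds))

# two (of six) runs, each contributing support points (level, inflow) and optimal thetas
runs = [(np.array([[1.0, 0.3], [2.5, 0.9]]), np.array([0.8, 1.05])),
        (np.array([[1.8, 0.6], [3.2, 1.2]]), np.array([0.95, 1.1]))]
Z, theta = fit_grand_policy(runs)
print(grand_policy([2.0, 0.8], Z, theta))
```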


Second, policy iteration using Bayesian active learning focuses Lebesgue sampling only where it is needed for optimal control, which keeps the memory size and the computational complexity required for convergence rather small. Hence, BAℒGPDP can also provide a remarkable speedup over ℒGPDP and improvements of several orders of magnitude over DP. The generalization capabilities of GPs are instrumental in significantly lowering the cardinality of the sets 𝒳̃ and ΘS required for convergence in policy iteration.

3.5. Example 1 (Continued). To end this section, the algorithm BAℒGPDP is applied to the two-tank system in Figure 5. One of the main advantages of using GPs to model value functions and state transition dynamics is that the finite set ΘS may have a small number of elements, which dramatically lowers computational costs. The number of control modes in ΘS is chosen as 15, with the values of the parameter θ evenly distributed over the interval θ ∈ [0.7, 1.1]. After 15 iterations, the support set 𝒳̃ of relevant states provides enough information to approximate the switching policy using GPs. Moreover, resorting to GPs yields a smoother switching policy, as highlighted in Figures 13 and 14 for different input patterns. The remaining uncertainty in the control mode parametrizations of Figures 13c and 14c is small enough to comply with the convergence criterion in eq 20. Finally, it is worth noting that, for the input pattern in Figure 14a, BAℒGPDP is able to find an optimal control policy that applies almost the same control mode throughout the entire optimization horizon (N = 30 modes).

To address uncertainty in the inflow rate to the upper tank, a simple procedure is used. Consider the six inflow rates in Figure 15. For each of them, 15 iterations of BAℒGPDP are run to obtain πi*(𝒳̃i), i = 1, ..., 6, which are then used to fit a Gaussian process, called the grand policy GP, supported by the points in the set 𝒳̃ = 𝒳̃1 ∪ 𝒳̃2 ∪ ... ∪ 𝒳̃6. To illustrate the interpolation capability of this comprehensive switching policy, it has been tested on a much longer (and unseen) sequence of events related to opening and closing taps over time, as shown in Figure 16a. As can be seen in Figure 16b, the level in the upper tank varies significantly, which generates an extremely oscillating inflow rate to the lower tank (see Figure 16c) that the switching policy must handle properly. Results obtained for this testing exercise are shown in Figure 16d−f. The grand switching policy successfully extrapolates the mode selection strategy to unseen states. The maximum usage of the lower tank capacity to accommodate extreme variability in its inflow rate through a proper schedule of control modes is noteworthy.
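The construction of the grand policy GP from the per-scenario solutions πi*(𝒳̃i) can be sketched as follows. This is only an illustration under stated assumptions: each scenario contributes its support states and the corresponding optimal mode parameters, and a single GP is refitted on their union; the function names are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def fit_grand_policy(per_scenario_solutions):
    """per_scenario_solutions: list of (Z_i, theta_i) pairs, one per inflow-rate scenario,
    where Z_i stacks the support state/disturbance pairs and theta_i holds the optimal
    mode parameters found for that scenario."""
    Z = np.vstack([Z_i for Z_i, _ in per_scenario_solutions])             # grand support set
    theta = np.concatenate([th_i for _, th_i in per_scenario_solutions])  # optimal parametrizations
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(Z.shape[1])) + WhiteKernel(1e-4)
    grand_policy = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    grand_policy.fit(Z, theta)
    return grand_policy

# The fitted GP then interpolates a mode parameter (with an uncertainty estimate) for an
# unseen state/disturbance pair z:
#   theta_mean, theta_std = grand_policy.predict(z.reshape(1, -1), return_std=True)
```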

4. HYBRID PROCESS PLANTS
4.1. Problem Statement. Representative examples of discretely controlled continuous systems in the process industry are hybrid chemical plants made up of parallel batch−continuous production lines with shared resources.44 Processes such as the Solvay PVC production line,45 sugar milling plants,46,47 and petrochemical plants48 all have in common this archetypal combination of batch and continuous modes of operation involving parallelism and synchronization constraints. Buffer tanks play a key role in managing such constraints at the interface between the batch and continuous sectors of a plant, where continuous−discrete interactions must be properly controlled for productivity optimization. To maximize the average throughput of a hybrid chemical plant, the buffer tank capacity must be properly managed by manipulating its downstream outflow rate.


Table 4. Data for the Nominal Schedule (phase times in min for batch reactors BRi, i = 1, ..., 7)

phase                       BR1   BR2   BR3   BR4   BR5   BR6   BR7
First Batch
stand by filling, SF         40     0    60   100    70     0    10
filling, F                   30    30    30    30    30    30    30
stand by reaction, SR        15    15    15    15    15    15    15
reaction, R                  60    65    60    70    80    40    55
discharge, D                 30    30    30    30    30    30    30
maintenance and setup, M     15    15    15    15    15    15    15
Second Batch
stand by filling, SF         20     5    10    25    20    15    20
filling, F                   30    30    30    30    30    30    30
stand by reaction, SR        15    15    15    15    15    15    15
reaction, R                  45    60    70    70    55    60    60
discharge, D                 30    30    30    30    30    30    30
maintenance and setup, M     15    15    15    15    15    15    15
Third Batch
stand by filling, SF         40    45    20    15    20    20    20
filling, F                   30    30    30    30    30    30    30
stand by reaction, SR        15    15    15    15    15    15    15
reaction, R                  70    70    60    45    65    55    60
discharge, D                 30    30    30    30    30    30    30
maintenance and setup, M     15    15    15    15    15    15    15

Figure 17. Hybrid chemical plant.

Table 3. Data for the Hybrid Chemical Plant

parameter                          value
buffer tank height                 0.3 m
buffer tank volume                 0.3 m3
max outflow rate, Fmax             1.0 m3/h
min outflow rate, Fmin             0.1 m3/h
reactor outflow rate, FBRi         0.30 m3/h
smoothing param., α                0.025
no. of reactors                    7

Table 5. Parameters for the Hybrid Plant

parameter                                        value
utility weighting factor (exploration)           1.0
utility weighting factor, β (exploitation)       2.0
discount rate, γ                                 0.98
control horizon in no. of modes, N               20
support set cardinality, ||𝒳̃||                   400

As depicted in Figure 17, hybrid chemical plants are generally made up of several parallel working units (e.g., batch reactors, formulation units, or crystallizers) “sharing” common resources, typically utilities such as cold water, steam, or even reactants whose availability is limited, as in the dairy industry. Batch units often operate by cyclically passing through a sequence of phases, namely loading of raw material, cooling, heating, reaction, discharge, etc. Resource sharing by different process units may increase the duration of some of these tasks in the recipe, which will alter the whole schedule of batch discharges to the buffer tank.

Figure 18. Nominal schedule for the hybrid chemical plant.

Also, waiting times for filling or discharging are usually the manipulated variables used to solve resource conflicts and to level resource demands in scheduling production campaigns. As the buffer tank hold-up is used to feed downstream continuous processes in the plant (centrifuges, dryers, fill-out trains, etc.), tank stability and throughput maximization must face the uncertainty of the inflow rate from upstream discharges. Dynamic optimization of tank throughput is also constrained by the limited resources shared by the batch units and by the capacities of the various devices that constitute the plant. Finally, tank stability is not only related to the risk of overflow or underflow at the buffer tank output; it is also mandatory to avoid sudden changes in the tank outflow rate so as not to upset operating conditions in the downstream continuous processes. In Table 3, relevant data for the hybrid batch plant used as a case study are given. The key issue in these hybrid process plants is that, due to resource sharing in some phases of the parallel batch units (i.e., reactors, crystallizers, etc.), optimal scheduling revolves around avoiding conflicts and peaks in resource usage.44,45 Assuming an industrial environment with no uncertainties involved, periodic scheduling is the alternative of choice to maximize plant throughput.44 As soon as an event occurs, the schedule must be updated and the disruption must be addressed by managing the (finite) tank capacity to avoid upsetting the continuous process. Effective handling of the uncertainty and variability in reactor discharges makes room for the scheduling problem in the batch sector to focus separately on addressing resource conflicts while leveling resource usage.


Figure 19. Results obtained using the BAℒGPDP algorithm for the hybrid chemical plant. (a) Inflow rate for the nominal schedule; (b) buffer tank outflow rate; (c) scheduling of mode parameters.

As a result, plant productivity is maximized and the interface between the batch and continuous sectors of the plant is properly managed. In the next section, the proposed BAℒGPDP algorithm is used to obtain a switching policy for optimal operation under uncertainty of hybrid chemical plants with parallel production lines and shared resources.

4.2. Simulation Results. To begin with, the algorithm BAℒGPDP is applied to learn a switching policy that can handle the intermittent reactor discharges in the nominal schedule shown in Figure 18, with the corresponding data given in Table 4.
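To make the link between the schedule data of Table 4 and the buffer tank inflow explicit, the discharge pattern can be generated as sketched below. This is a simplified reading of the recipe, assuming each reactor runs the phase sequence SF → F → SR → R → D → M back to back and feeds the tank at the constant rate FBR = 0.30 m3/h (Table 3) only during its 30 min discharge phase; it is not the simulator used in the paper, and all names are illustrative.

```python
import numpy as np

FBR = 0.30                                   # reactor outflow during discharge, m3/h (Table 3)
PHASES = ("SF", "F", "SR", "R", "D", "M")    # stand-by filling, filling, stand-by reaction,
                                             # reaction, discharge, maintenance/setup

def inflow_rate(schedule, horizon_min, dt_min=1.0):
    """schedule: {reactor: [ {phase: duration_min, ...} per batch ]} following Table 4.
    Returns the buffer tank inflow rate (m3/h) sampled every dt_min minutes."""
    t_grid = np.arange(0.0, horizon_min, dt_min)
    inflow = np.zeros_like(t_grid)
    for batches in schedule.values():
        t = 0.0
        for batch in batches:                 # batches assumed to run back to back
            for phase in PHASES:
                dur = batch[phase]
                if phase == "D":              # only the discharge phase feeds the tank
                    mask = (t_grid >= t) & (t_grid < t + dur)
                    inflow[mask] += FBR
                t += dur
    return t_grid, inflow

# Example with the BR1 column of Table 4 (first batch only, values in minutes):
# schedule = {"BR1": [{"SF": 40, "F": 30, "SR": 15, "R": 60, "D": 30, "M": 15}]}
```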

Figure 20. Results obtained by applying the optimal policy learned for the nominal schedule to a different schedule. (a) Disturbed schedule; (b) inflow rate pattern; (c) buffer tank level; (d) buffer tank outflow rate.


Figure 21. Training the BAℒGPDP algorithm with a schedule significantly different from the nominal one. (a) Production schedule; (b) inflow rate pattern.

Data for addressing this dynamic optimization task are given in Table 5, whereas the feedback control laws and the reward function are similar to those of eqs 4, 5, and 10, with the buffer tank playing the role of the controlled tank of example 1. The state is defined by the buffer tank level and the corresponding smoothed level used in the feedback law. The nominal schedule of reactor discharges shown in Figure 18 is somewhat easy to handle, and the learned solution provides an optimal mode schedule aimed at dynamically maximizing the tank throughput. It is rather remarkable that, for the nominal schedule in Figure 18, the solution found with high certainty corresponds to repeatedly applying the same feedback law with θ = 1. After 15 iterations, the remaining uncertainty about the optimal mode parametrizations (see Figure 19c) is almost negligible. This simple solution may seem a bit of a surprise, but its rationale is easy to understand by analyzing the reward function in eq 10, which trades off a higher plant throughput against tank stability. For the optimal control policy, the smoothness of the outflow rate sent to the downstream continuous process is noticeable in Figure 19b, despite the discontinuous inflows to the buffer tank (see Figure 19a). Figure 19c highlights the capability of the switching policy to manage the tank capacity in order to filter out the significant variability in the inflow rate to the buffer tank.

How robust to production schedule variability is the repeated use of the feedback law with θ = 1? To answer this question, the switching policy obtained for the nominal schedule is applied to a different production schedule (see Figure 20a) whose reactor discharges give rise to the inflow rate in Figure 20b.
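Since eqs 4, 5, and 10 are not reproduced in this section, the sketch below gives one plausible form of a smoothed-level feedback law parametrized by θ and of a throughput-versus-stability reward, using the values in Tables 3 and 5. The functional forms and the stability weight are assumptions made only for illustration; they are not the paper's equations.

```python
import numpy as np

# Plant data from Table 3.
F_MAX, F_MIN = 1.0, 0.1   # admissible outflow range, m3/h
ALPHA = 0.025             # smoothing parameter of the level filter

def control_mode(theta, smoothed_level_init=0.5):
    """One assumed form of the parametrized feedback law (in the spirit of eqs 4 and 5):
    the outflow tracks an exponentially smoothed tank level, scaled by the mode
    parameter theta and clipped to [F_MIN, F_MAX]."""
    state = {"xs": smoothed_level_init}       # smoothed level, fraction of tank height

    def outflow(level):
        state["xs"] += ALPHA * (level - state["xs"])
        F = theta * (F_MIN + (F_MAX - F_MIN) * state["xs"])
        return float(np.clip(F, F_MIN, F_MAX))

    return outflow

def reward(avg_throughput, outflow_change, w_stability=1.0):
    """Assumed stand-in for the reward of eq 10: trade off average tank throughput
    against sudden changes in the outflow rate (the weight is illustrative)."""
    return avg_throughput - w_stability * abs(outflow_change)
```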

Figure 22. Results obtained using the BAℒGPDP algorithm for the schedule in Figure 21. (a) Buffer tank level; (b) buffer tank outflow; (c) scheduling of mode parameters.

When compared with the inflow rate pattern of Figure 19a for the nominal schedule, it can be seen that there is now a peak in the tank inflow rate at around 500 min that may pose a challenge to tank stability. The tank outflow rate obtained using the same optimal policy learned earlier for the nominal schedule is given in Figure 20c. The smoothness and dynamics of the downstream discharge are certainly an appealing outcome, but the tank level reaches its maximum just when the inflow peak has ended, which highlights an imminent risk of overflowing should the peak last a bit longer.


Figure 23. Testing the grand policy obtained with the BAℒGPDP algorithm by merging four archetype control policies for the hybrid plant. (a) Inflow rate pattern; (b) buffer tank outflow; (c) scheduling of mode parameters.

Figure 24. Testing the grand policy obtained with the BAℒGPDP algorithm by merging four archetype control policies. (a) Inflow rate pattern; (b) buffer tank outflow; (c) scheduling of mode parameters.

To face difficult peaks and valleys in the buffer tank inflow rate, the optimal control policy must resort to modes whose values of θ are conveniently scheduled to maximize throughput while preventing the tank from overflowing or the feed to the downstream continuous process from being interrupted. To illustrate this point, let us consider the production schedule in Figure 21a, which gives rise to the inflow rate pattern in Figure 21b. After 15 iterations, the switching policy obtained using BAℒGPDP provides the sequence of modes in Figure 22c.

Even though there is uncertainty about the optimal values of the mode parameters, it is clear that the solution strategy is to use modes with θ > 1 to handle significant inflow peaks and modes with θ < 1 to deal with deep valleys. Although there is still room for improvement, the switching policy manages the tank capacity rather satisfactorily while maximizing the average tank throughput, as can be seen in Figure 22a and b. From these results, what is needed is a robust switching policy for mode scheduling that can cope with the uncertainty in the tank inflow rate due to unplanned events and ongoing conflicts that disrupt the production schedule in a way that cannot be anticipated.


To define such a robust policy, a merging strategy is proposed here. First, simulation is used to regress with a GP all the optimal mode parametrizations πi*(𝒳̃i), i = 1, 2, ..., for all states contained in a finite number of support sets 𝒳̃i, i = 1, 2, ..., obtained by applying the algorithm BAℒGPDP to different archetypes of production schedules. Accordingly, the grand policy π.*(𝒳̃), where 𝒳̃ = 𝒳̃1 ∪ 𝒳̃2 ∪ ..., is equipped to handle unseen inflow rate patterns by interpolating optimal mode sequences based on merging the solutions found for the production schedule archetypes seen during training. As an example of this merging procedure, assume that, in addition to the nominal schedule, the grand policy π.* is obtained by combining the optimal solutions found for three meaningful variants of the nominal production schedule in which reaction times are increased because of shared resources. It is noteworthy that the grand policy π.* can be generated to the desired level of accuracy since only simulations are involved. The more comprehensive and representative the statistically different variants considered during training, the more robust the grand policy π.* will be. A distinctive advantage of using a GP to approximate π.* is that alternative production schedules can be considered for merging (or not) into this comprehensive policy based on the uncertainty intervals of the mode parameters obtained when unseen inflow rate patterns are tested. As a result, if the predictive distributions of the optimal controls π.*(𝒳̃) have high variance, a conservative grand policy will not add more energy to the system. Hence, the two cornerstones for policy robustness under uncertainty are (i) using the mean function of the policy GP and (ii) selectively adding data to the support set 𝒳̃ in order to lower the variance of the optimal controls. To test the robustness of π.*, it has been applied to two unseen production schedules with their corresponding inflow rate patterns. The results obtained are given in Figures 23 and 24. Despite the fact that only four representative inflow rate patterns were used for training, the capability of π.* to deal with high peaks and deep valleys in the tank inflow rate is remarkably good. Moreover, the tank capacity is properly used to generate a smooth outflow discharge downstream. There is still some room for improvement, though: the uncertainty intervals of the optimal mode parameters in some states (see Figures 23c and 24c) would be reduced by further training with additional production schedules to gather informative simulation data.
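The variance-aware use of the grand policy described above can be sketched as follows. The uncertainty threshold and the conservative fallback value are illustrative assumptions; the text only prescribes that a high-variance prediction should lead to a conservative choice that does not add more energy to the system.

```python
import numpy as np

def select_mode(grand_policy, z, std_threshold=0.05, theta_conservative=1.0):
    """Variance-aware mode selection with the grand policy GP (sketch only).
    grand_policy: fitted GP regressor mapping state/disturbance pairs to mode parameters."""
    mean, std = grand_policy.predict(np.atleast_2d(z), return_std=True)
    if std[0] > std_threshold:
        # High predictive variance: fall back to a parametrization assumed not to
        # add more energy to the system.
        return theta_conservative
    return float(mean[0])   # cornerstone (i): use the mean function of the policy GP
```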

5. CONCLUDING REMARKS
Supply chains, hybrid chemical plants, bioprocesses and biological systems, solar collectors, and wind turbines are all representative examples of a special type of hybrid dynamical system known as discretely controlled continuous systems. For optimal operation of DCCSs under uncertainty, timely switching to different control modes is necessary to handle external events and disturbances that may affect performance and safety. Control modes are made up of parametrized feedback control laws and stopping conditions; the latter are triggered in response to events, state constraints, or the achievement of some goal of interest. Due to the uncertainty of state transitions, three important issues have been addressed in the development of dynamic programming algorithms for near-optimal switching control.



The first issue is that of efficiently sampling a continuum state space in dynamic programming using a finite number of control modes. The approach followed here was motivated by replacing the conventional Riemann sampling of classic dynamic programming algorithms with Lebesgue sampling. For sampling the state space by concatenating control modes in strings of different lengths, Algorithm 1 (Figure 4) was proposed and integrated with dynamic programming to define Algorithm 2 (Figure 7), aptly named ℒDP, or Lebesgue-sampling-based dynamic programming. The second issue is that of dealing with state transition uncertainty and the need to generalize value functions. Due to the variable effect of control actions, unseen states are necessarily encountered even when applying the same sequence of modes. Since the switching policy provided by ℒDP defines optimal modes only for the states generated with Algorithm 1, the generalization capability of Gaussian processes was incorporated, leading to ℒGPDP (Figure 10), a novel algorithm for multimodal control. The final issue addressed in this work is the proper introduction of data bias to make Lebesgue sampling and ℒGPDP more efficient. This is critical not only for generating the control policy via simulation but, more importantly, for updating the policy online with data obtained from applying it to a real system. Inspired by the work of Deisenroth et al.,29 Bayesian active learning in policy evaluation experiments is proposed for optimizing the operating performance of a DCCS. Data gathered over a number of interactions with the system are used to learn a regression metamodel of the transition dynamics, the dynamics GPd, which is instrumental in making the proposed algorithm BAℒGPDP well suited both for learning the switching policy via intensive simulation and for adapting it to the true operating environment a DCCS faces over time. A number of numerical examples were presented to illustrate the applicability of the proposed approach to DCCSs of industrial interest.

When compared with previous works on stochastic hybrid systems, the policy iteration algorithm BAℒGPDP has a number of advantages. First, it can deal successfully with a continuum of control modes by exploiting the generalization capabilities of GPs, whereas other event-driven optimization-based approaches, such as the icHA,8 are based on a finite number of modes. Second, BAℒGPDP generates a solution in the form of a control policy that can be used in a reactive manner, without the need to continuously solve an optimization program online to resynthesize the sequence of control modes. Last but not least, the control policy can be updated “on the fly” to account for significant changes in the process or disturbance dynamics.

AUTHOR INFORMATION

Corresponding Author

§Phone: +54 (342) 4534451. Fax: +54 (342) 4553439. E-mail: [email protected].

Notes

The authors declare no competing financial interest.



NOMENCLATURE

Symbols

d = vector of disturbances
𝒟 = disturbance space
f = system dynamics
F = outflow rate
ℱ = state transition simulator



GP (subscript) = Gaussian process model
GPd = Gaussian process for zd
GPV = Gaussian process for the value function
GPQ = Gaussian process for Q*(·,·)
Cov(·,·) = covariance function or kernel
2(·,·) = feedback control law
KL(·‖·) = Kullback−Leibler distance
m(·) = mean function of a Gaussian process
N = number of decision stages
Q*(·,·) = optimal state/disturbance−mode parameter value function
r(·) = reward function
u = vector of control variables
< = utility function
𝒰 = control space
V*(·) = optimal value function
^ = exploration coefficient
x = vector of system state variables
𝒳 = state space
ℒ = subspace of reachable states
𝒳ℒ = set of Lebesgue sampling points
𝒳kℒ = set of Lebesgue sampling points in the kth decision stage
𝒳̃k = set of support states
z = state/disturbance pair

Greek Symbols

α = smoothing parameter
β = exploitation coefficient
γ = discount rate
δ = tolerance for the Kullback−Leibler distance
ξe = set of endogenous binary inputs
π = switching policy
π.* = grand policy
Θ = parameter space
θ = vector of mode parameters
ΘDP = finite set of mode parameter values
ΘS = support set for mode parameters
ψ = vector of hyperparameters in a GP



REFERENCES

(1) Geist, S.; Gromov, D.; Raisch, J. Timed discrete event control of parallel production lines with continuous outputs. Discrete Event Dyn. Syst. 2008, 18, 241.
(2) Engell, S.; Kowalewski, S.; Schulz, C.; Stursberg, O. Continuous-Discrete Interactions in Chemical Processing Plants. Proc. IEEE 2000, 88, 1050.
(3) Markakis, M. G.; Mitsis, G. D.; Papavassilopoulos, G. P.; Ioannou, P. A.; Marmarelis, V. Z. A switching control strategy for the attenuation of blood glucose disturbances. Optim. Control Appl. Methods 2011, 32, 185.
(4) Liu, C.; Gong, Z.; Feng, E.; Yin, H. Optimal switching control for microbial fed-batch culture. Nonlinear Anal.: Hybrid Syst. 2008, 2, 1168.
(5) Pasamontes, M.; Alvarez, J. D.; Guzman, J. L.; Lemos, J. M.; Berenguel, M. A switching control strategy applied to a solar collector field. Control Eng. Pract. 2011, 19, 145.
(6) Lygeros, J.; Johansson, K. H.; Simic, S. N.; Zhang, J.; Sastry, S. S. Dynamical properties of hybrid automata. IEEE Trans. Autom. Control 2003, 48, 2.
(7) Goebel, R.; Sanfelice, R.; Teel, A. Hybrid dynamical systems. IEEE Control Syst. 2009, 29, 28.
(8) Di Cairano, S.; Bemporad, A.; Júlvez, J. Event-driven optimization-based control of hybrid systems with integral continuous-time dynamics. Automatica 2009, 45, 1243.
(9) Xu, X.; Antsaklis, P. J. Results and perspectives on computational methods for optimal control of switched systems. In Hybrid Systems: Computation and Control; Maler, O., Pnueli, A., Eds.; Springer: Germany, 2003; pp 540−555.
(10) Borrelli, F.; Baotić, M.; Bemporad, A.; Morari, M. Dynamic programming for constrained optimal control of discrete-time linear hybrid systems. Automatica 2005, 41, 1709.
(11) Lincoln, B.; Rantzer, A. Relaxing dynamic programming. IEEE Trans. Autom. Control 2006, 51, 1249.
(12) Rantzer, A. Relaxed dynamic programming in switching systems. IEE Proc.: Control Theory Appl. 2006, 153, 567.
(13) Zhang, W.; Hu, J.; Abate, A. On the value function of the discrete-time switched LQR problem. IEEE Trans. Autom. Control 2009, 54, 2669.
(14) Liberzon, D. Switching in Systems and Control; Systems and Control: Foundations and Applications; Birkhäuser Boston Inc.: Boston, 2003.
(15) Egerstedt, M.; Azuma, S.; Axelsson, H. Transition-time optimization for switched-mode dynamical systems. IEEE Trans. Autom. Control 2006, 51, 110.
(16) Axelsson, H.; Wardi, Y.; Egerstedt, M.; Verriest, E. I. Gradient descent approach to optimal mode scheduling in hybrid dynamical systems. J. Optimization Theory Appl. 2008, 136, 167.
(17) Ding, X. C.; Wardi, Y.; Egerstedt, M. On-line optimization of switched-mode dynamical systems. IEEE Trans. Autom. Control 2009, 54, 2266.
(18) Görges, D.; Izák, M.; Liu, S. Optimal control and scheduling of switched systems. IEEE Trans. Autom. Control 2011, 56, 135.
(19) De Paula, M.; Martínez, E. C. Development of multi-modal control programs for continuous-discrete process supervision. Comput.-Aided Chem. Eng. 2009, 27, 1383.
(20) Xu, Y. K.; Cao, X. R. Lebesgue-sampling-based optimal control problems with time aggregation. IEEE Trans. Autom. Control 2011, 56, 1097.
(21) Lehmann, D.; Lunze, J. Extension and experimental evaluation of an event-based state-feedback approach. Control Eng. Pract. 2011, 19, 101.
(22) Lunze, J.; Lehmann, D. A state-feedback approach to event-based control. Automatica 2010, 46, 211−215.
(23) Azuma, S.; Imura, J.; Sugie, T. Lebesgue piecewise affine approximation of nonlinear systems. Nonlinear Anal.: Hybrid Syst. 2010, 4, 92.
(24) Bemporad, A.; Morari, M. Control of systems integrating logic, dynamics, and constraints. Automatica 1999, 35, 407.
(25) Barton, P.; Lee, C.; Yunt, M. Optimization of hybrid systems. Comput. Chem. Eng. 2006, 30, 1576.
(26) Mehta, T. R.; Egerstedt, M. An optimal control approach to mode generation in hybrid systems. Nonlinear Anal. 2006, 65, 963.
(27) Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, 1998.
(28) Busoniu, L.; Babuska, R.; Schutter, B. D.; Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators, 1st ed.; CRC Press: Boca Raton, FL, 2010.
(29) Deisenroth, M. P.; Rasmussen, C. E.; Peters, J. Gaussian process dynamic programming. Neurocomputing 2009, 72, 1508.
(30) Bensoussan, A.; Menaldi, J. L. Stochastic hybrid control. J. Math. Anal. Appl. 2000, 249, 261.
(31) Cassandras, C. G.; Lygeros, J. Stochastic Hybrid Systems; Taylor & Francis: Boca Raton, FL, 2007; Chapters 1, 2, 6, 7.
(32) Song, C.; Li, P. Near optimal control for a class of stochastic hybrid systems. Automatica 2010, 46, 1553.
(33) Adamek, F.; Sobotka, M.; Stursberg, O. Stochastic optimal control for hybrid systems with uncertain discrete dynamics. Proc. IEEE Int. Conf. Automation Sci. Eng., CASE 2008, Washington, DC, Aug 2008; pp 23−28.
(34) Blackmore, L.; Ono, M.; Bektassov, A.; Williams, B. C. A probabilistic particle-control approximation of chance-constrained stochastic predictive control. IEEE Trans. Robotics 2010, 26, 502.


(35) Abate, A.; Prandini, M.; Lygeros, J.; Sastry, S. Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems. Automatica 2008, 44, 2724.
(36) Bemporad, A.; Di Cairano, S. Model-predictive control of discrete hybrid stochastic automata. IEEE Trans. Autom. Control 2011, 56, 1307.
(37) Shi, P.; Xia, Y.; Liu, G. P.; Rees, D. On designing of sliding-mode control for stochastic jump systems. IEEE Trans. Autom. Control 2006, 51, 97.
(38) Crespo, L. G.; Sun, J. Q. Solution of fixed final state optimal control problems via simple cell mapping. Nonlinear Dyn. 2000, 23, 391.
(39) Crespo, L. G.; Sun, J. Q. Stochastic optimal control of nonlinear systems via short-time Gaussian approximation and cell mapping. Nonlinear Dyn. 2002, 28, 323.
(40) Crespo, L. G.; Sun, J. Q. Stochastic optimal control via Bellman's principle. Automatica 2003, 39, 2109.
(41) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, 2006.
(42) MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, U.K., 2003.
(43) Verdinelli, I.; Kadane, J. B. Bayesian designs for maximizing information and outcome. J. Am. Stat. Assoc. 1992, 87, 510.
(44) Simeonova, I. On-line periodic scheduling of hybrid chemical plants with parallel production lines and shared resources. PhD Thesis, Université Catholique de Louvain, Louvain-la-Neuve/Brussels, 2008.
(45) Melas, S. PVC Line Predictive Inventory Control: A Production Rate Optimization for a Hybrid System; Solvay Group Report; Solvay: Brussels, Belgium, 2003; p 1.
(46) Ghaeli, M.; Bahri, P. A.; Lee, P. L. Scheduling of a mixed batch/continuous sugar milling plant using Petri nets. Comput. Chem. Eng. 2008, 32, 580.
(47) Crisafulli, S.; Peirce, R. Surge tank control in a cane raw sugar factory. J. Process Control 1999, 9, 33.
(48) Tani, T.; Murakoshi, M.; Umano, M. Neuro-fuzzy hybrid control system of tank level in petroleum plant. IEEE Trans. Fuzzy Syst. 1996, 4, 360.
