
Data-based Suboptimal Neuro-control Design with Reinforcement Learning for Dissipative Spatially Distributed Processes

Biao Luo,†,‡ Huai-Ning Wu,*,† and Han-Xiong Li§,∥

† Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University (Beijing University of Aeronautics and Astronautics), Beijing 100191, P. R. China
‡ State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, P. R. China
§ Department of Systems Engineering and Engineering Management, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR
∥ State Key Laboratory of High Performance Complex Manufacturing, Central South University, Changsha, Hunan, China

Received: September 24, 2013. Revised: March 3, 2014. Accepted: April 6, 2014. Published: April 7, 2014.

ABSTRACT: For many complicated industrial processes, an accurate system model is often unavailable. In this paper, we consider partially unknown spatially distributed processes (SDPs) described by general highly dissipative nonlinear partial differential equations (PDEs) and develop a data-based adaptive suboptimal neuro-control method by introducing the idea of reinforcement learning (RL). First, based on the empirical eigenfunctions computed with Karhunen−Loève decomposition, singular perturbation theory is used to derive a reduced-order ordinary differential equation model that captures the dominant dynamics of the SDP. Second, the Hamilton−Jacobi−Bellman (HJB) approach is used for the suboptimal control design, the idea of policy iteration (PI) is introduced for online learning of the solution of the HJB equation, and its convergence is established. Third, a neural network (NN) is employed to approximate the cost function in the PI procedure, and an NN weight tuning algorithm based on the gradient descent method is proposed. We prove that the developed online adaptive suboptimal neuro-controller ensures that the original closed-loop PDE system is semiglobally uniformly ultimately bounded. Finally, the developed data-based control method is applied to a nonlinear diffusion-reaction process, and the results demonstrate its effectiveness.

1. INTRODUCTION

Due to the underlying physical spatial-temporal phenomena, such as diffusion, convection, phase-dispersion, vibration, and flow, spatially distributed processes (SDPs) are essentially nonlinear partial differential equation (PDE) systems. To date, many optimal control approaches1−14 have been reported for PDE systems, which can generally be divided into two classes. The first class comprises the design-then-reduce1,2,8−10 methods, where an infinite-dimensional controller is synthesized based on the original PDE system and then lumped for realization. For example, algebraic operator Riccati equations1,8 were derived for linear quadratic optimal control of PDE systems described by an abstract evolution equation. In ref 2, a policy iteration method was proposed for solving the space-dependent Riccati differential equation associated with the optimal control design of linear hyperbolic PDE systems. The second class comprises the reduce-then-design3−6 approaches, which use an approximate reduced-order model of ordinary differential equations (ODEs) for control design. The reduce-then-design framework is mostly used for highly dissipative PDE systems, since their long-term dynamics can be accurately represented by a finite (typically small) number of modes. For instance, the finite difference method and orthogonal collocation were used to obtain reduced-order ODE models,4 which were subsequently applied for linear quadratic regulator design. One popular approach for model reduction is Galerkin's method based on empirical eigenfunctions (EEFs) or analytical basis functions. Recently, some optimal control

approaches together with model reduction have been developed, such as nonlinear programming,6 dual heuristic programming,3 model predictive control,7,12,13 the offline model-based iterative method,5 and the input/output approach using residence time distributions.14 However, most of these existing approaches require the full PDE system model, which is often unavailable for many practical industrial SDPs, such as chemical processes, thermal processes, traffic flows, fluid dynamics, aeronautics, and astronautics. For these SDPs, it is often costly or impossible to obtain their mathematical models. Consequently, these difficulties prevent the direct application of model-based approaches to the control design of SDPs. In recent years, the idea of reinforcement learning (RL) has been introduced to design controllers when the system model is unavailable. RL is a machine learning technique that has been widely investigated by the artificial intelligence community.15−23 As one of the most popular RL schemes, approximate dynamic programming15,23 uses value function approximation structures (such as linear or nonlinear function approximation24) for the implementation of RL algorithms; thus, many works (see the survey paper (ref 25) and references therein) even use approximate dynamic programming as a synonym of RL. Sutton and Barto15 suggested a definition of the RL method,



i.e., any method that is well suited to solving an RL problem can be considered an RL method, where the RL problem is defined in terms of the optimal control of Markov decision processes. This establishes a clear relationship between RL methods and the optimal control community. Moreover, RL and approximate dynamic programming methods have the ability to find an optimal control policy from an unknown environment; thus, it is promising to introduce the ideas of RL into the control design of practical industrial SDPs. However, most of the existing RL and approximate dynamic programming methods were developed for discrete-time Markov decision processes15−17,24 and ODE systems25−32 with known or unknown dynamics. For example, heuristic dynamic programming was suggested for nonlinear optimal control design of discrete-time ODE systems.27,29 A neural heuristic dynamic programming method33 was provided to learn an output feedback control policy. Inspired by action-dependent heuristic dynamic programming, Si and Wang34 developed a direct heuristic dynamic programming approach. The finite-horizon optimal control of nonlinear discrete-time ODE systems31 was studied by introducing an ε-error bound. Lewis and Vamvoudakis32 derived two approximate dynamic programming algorithms, output feedback policy iteration (PI) and value iteration, which only require measurements of input/output data. To solve continuous-time problems, Murray et al.35 suggested two PI algorithms that avoid the necessity of knowing the internal system dynamics. Vrabie et al.36,37 extended the result and proposed a new PI algorithm to solve the linear quadratic regulator problem36 and the nonlinear optimal control problem37 online along a single state trajectory. Vamvoudakis and Lewis38 gave an online PI algorithm that synchronously tunes the weights of both actor and critic NNs for the nonlinear optimal control problem. However, to the best of our knowledge, how to introduce RL and approximate dynamic programming techniques into the optimal controller synthesis of nonlinear continuous-time SDPs remains an interesting open issue, which is rarely addressed and motivates the present study.

To solve the optimal control problem of nonlinear industrial SDPs without an accurate process model, in this paper, we consider a class of highly dissipative nonlinear continuous-time PDE systems with partially unknown dynamics and propose a data-based suboptimal neuro-control method by using RL and Karhunen−Loève decomposition techniques. The whole control design procedure is divided into two phases. The first phase is to set up a reduced-order model. Specifically, Karhunen−Loève decomposition is initially employed to compute EEFs for the SDP from snapshots measured online, and then these EEFs are used as basis functions for deriving a reduced-order model via singular perturbation theory. In the second phase, an RL-based suboptimal neuro-control method is developed on the basis of the reduced-order model and the Hamilton−Jacobi−Bellman (HJB) approach, where an online PI algorithm is developed for learning the HJB equation's solution. By constructing a sequence of Lyapunov function equations that successively approximate the HJB equation, the convergence of the PI algorithm is proved. For implementation purposes, the data-based suboptimal neuro-controller is synthesized by applying an NN to the cost function approximation in the PI algorithm, and a tuning algorithm for the NN weights is proposed via the gradient descent method. Moreover, we prove that the developed online suboptimal neuro-controller can ensure that the signals of the closed-loop PDE system are semiglobally uniformly ultimately bounded. Finally, the effectiveness of the developed method is demonstrated on the simulation of a nonlinear diffusion-reaction process.

The rest of this paper is organized as follows. The problem description is given in section 2, and the methodology framework of this paper is briefly described in section 3. The derivation of the reduced-order model based on Karhunen−Loève decomposition and singular perturbation theory is presented in section 4. In section 5, the online data-based suboptimal neuro-control method with RL and NN is developed. Section 6 gives simulation studies, and a brief conclusion is drawn in section 7.



NOTATION

, n , and n × m are the set of real numbers, the n-dimensional Euclidean space, and the set of all real matrices, respectively. |·|, ∥·∥, and ⟨·⟩n denote the absolute value for scalars, Euclidean norm, and inner product for vectors, respectively. Let ∞ be the vector space of infinite sequences a = [a1...a∞]T of real 2 1/2 numbers equipped with the norm ∥ a ∥∞ ≜ (∑∞ i=1 ai ) , n which is a natural generalization of space  . The superscript T is used for the transpose, and I denotes the identity matrix of appropriate dimension, and ∇ ≜ ∂/∂x denotes a gradient operator notation. σ̅(·) and σ(·) denote the maximum singular value and the minimum singular value. For a symmetric matrix, M,M > (≥)0 means that it is a positive (semipositive) definite matrix. ∥ν∥M2 ≜ νT Mν for some real vector v and symmetric matrix M > (≥)0 with appropriate dimensions. C1(Ω) is function space on Ω with first derivatives being continuous. L2([z1,z2],n ) is an infinite-dimensional Hilbert space of ndimensional square integrable vector functions ω(z) ∈ L2([z1, z2],n ), z ∈ [z1, z2] ⊂  equipped with the inner product and norm: ⟨ω1(·), ω2(·)⟩ ≜ ∫ zz21 ⟨ω1(z), ω2(z)⟩ n dz and ∥ω1(·)∥2 ≜ 

⟨ω1(·),ω1(·)⟩1/2 for ∀ ω1(z), ω2(z) ∈ L2([z1, z2],n).

2. PROBLEM DESCRIPTION

We consider continuous-time SDPs that are described by highly dissipative nonlinear PDEs with the following state-space representation:

∂y(z, t)/∂t = ℱ(y, ∂y/∂z, ∂²y/∂z², ..., ∂^{n₀}y/∂z^{n₀}) + B̄(z)u(t)    (1)

subject to the mixed-type boundary conditions

q(y, ∂y/∂z, ∂²y/∂z², ..., ∂^{n₀−1}y/∂z^{n₀−1}) = 0  at  z = z₁ or z = z₂    (2)

and the initial condition

y(z, 0) = y₀(z)    (3)

where z ∈ [z₁, z₂] ⊂ ℝ is the spatial coordinate, t ∈ [0, ∞) is the temporal coordinate, y(z, t) = [y₁(z, t) ... y_n(z, t)]ᵀ ∈ ℝ^n is the state, and u(t) ∈ ℝ^p is the manipulated input. ℱ ∈ ℝ^n is a sufficiently smooth nonlinear vector function that involves a highly dissipative, possibly nonlinear, spatial differential operator of order n₀ (an even number), which is referred to as the internal system dynamics and is assumed unknown in this study. q is a sufficiently smooth nonlinear vector function. B̄(z) is a known sufficiently smooth matrix function of appropriate dimension, and y₀(z) is a smooth vector function representing the initial state profile.


Remark 1. The highly dissipative nonlinear PDE system 1−3 represents a large number of nonlinear industrial SDPs. Representative examples include transport-reaction processes with significant diffusive and dispersive mechanisms that are naturally described by nonlinear parabolic PDEs (such as the activated sludge process39 for wastewater treatment, chemical reactors,4 the catalytic rod,5,6 heat transfer,3 rapid thermal chemical vapor deposition,40 crystal growth processes,41 etc.) and several fluid dynamic systems (such as the Kuramoto−Sivashinsky equation,6,42 the Navier−Stokes equations,42 etc.). For practical complicated SDPs, accurate process modeling and identification43 are often costly or impossible to conduct. This implies that their system model is partially or completely unknown. Thus, solving the control problem of such SDPs based on online data is promising, but still a challenging problem.

For simplicity, we denote y(·, t) ≜ y(z, t) ∈ L²([z₁, z₂]; ℝ^n) and M(·) ≜ M(z), z ∈ [z₁, z₂], for a space-varying matrix function M(z), z ∈ [z₁, z₂]. We define the following infinite-horizon linear quadratic cost functional:

V_u(y(·, t)) ≜ ∫_t^{+∞} (⟨y(·, τ), Q̄(·)y(·, τ)⟩ + ∥u(τ)∥²_R) dτ    (4)

where R > 0 and Q̄(z) ⩾ 0. In this paper, we consider the optimal control problem of the PDE system 1−3 with cost functional V_u(y₀(·, t)), i.e.,

u(t) ≜ u*(t) ≜ arg min_u V_u(y₀(·, t))    (5)

In the RL community, the general RL problem is posed on the standard Markov decision process15 by defining the state with respect to statistics for prediction. From ref 15, the Markov decision process can be briefly described as follows: given any state s and action a, the probability of each possible next state s′ is

𝒫^a_{s,s′} = Pr{s_{t+1} = s′ | s_t = s, a_t = a}    (6)

These quantities are called transition probabilities. Similarly, given a current state s and action a, together with any next state s′, the expected value of the next reward is

ℛ^a_{s,s′} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′}    (7)

One can refer to references 15, 16, and 24 for more detailed knowledge about the Markov decision process. The Markov decision process is one of the most general settings for decision-making problems and has been widely used in many different communities. Thus, it is worthwhile to discuss the relationship between the Markov decision process and the optimal control problem considered in this paper. In fact, the optimal control problem 5 of the infinite-dimensional PDE system 1−3 with cost functional 4 can also be viewed as a decision-making problem, which can be explained by the Markov decision process model. y(·, t) can be viewed as the state s and u(t) as the action a; the next state y(·, t + Δt), playing the role of s′, is generated from the real SDP 1−3 with probability 𝒫^a_{s,s′} = 1 for such a deterministic system, and the instant cost is ℛ^a_{s,s′} = ∫_t^{t+Δt} (⟨y(·, τ), Q̄(·)y(·, τ)⟩ + ∥u(τ)∥²_R) dτ.

It is worth pointing out that most of the existing RL methods focus on finite-dimensional systems, such as the Markov decision processes 6 and 7, which means that at each time instant the state is a numerical vector that belongs to a finite-dimensional Euclidean space ℝ^n. In this paper, we aim to solve the optimal control problem of the infinite-dimensional SDP 1−5, which means that at each time instant its state y(·, t) is a vector function of space that belongs to an infinite-dimensional Hilbert functional space L²([z₁, z₂]; ℝ^n). To the best of our knowledge, this problem has not been solved with RL ideas, especially when the internal system dynamics is unknown.

3. METHODOLOGY FRAMEWORK

For clarity, before starting the control synthesis, it is necessary to give a brief introduction to the methodology framework of this paper. Figure 1 shows the structure of the methodology, which has two main units.

Figure 1. Methodology of online data-based suboptimal neuro-control for SDP with Karhunen−Loève decomposition (KLD) and RL.

The first is a model reduction unit. By collecting data from the real SDP, EEFs of the SDP are computed with Karhunen−Loève decomposition without requiring the internal system dynamics. Subsequently, singular perturbation theory is applied to derive a reduced-order model, which is prepared for the optimal control design. The second is a unit that learns a suboptimal control law for the SDP with the RL method. By measuring the system states online, a critic-actor structure based on NNs is employed to implement the suboptimal control without the internal system model. In addition, a modal analyzer is used to transform the PDE state into the ODE state of the reduced-order model, i.e., to convert a time- and space-varying signal into a time-varying mode signal. In the following sections, we give a detailed description of the proposed method. It is worth mentioning again that the whole control method is an online procedure that uses online measured data for implementation instead of an accurate model of the SDP.

4. REFORMULATION OF THE OPTIMAL CONTROL PROBLEM OF THE REDUCED-ORDER MODEL

Noting that the internal dynamic model of the PDE system 1−3 is unknown, we use Karhunen−Loève decomposition to compute a set of EEFs, which are then employed to derive a reduced-order model based on singular perturbation theory. Subsequently, the optimal control problem is reformulated based on the reduced-order model, in preparation for synthesizing a finite-dimensional online suboptimal controller with RL.


4.1. Computation of EEFs with Karhunen−Loève Decomposition. Here, we omit the detailed derivation of Karhunen−Loève decomposition and give the implementation procedure as follows:

Algorithm 1. Karhunen−Loève decomposition implementation procedure
• Step 1: For the real SDP, collect an ensemble of states {yᵢ(z)}, i = 1, ..., M.
• Step 2: Compute the matrix C = (c_{kj})_{M×M} with c_{kj} = (1/M) ∫_{z₁}^{z₂} yₖᵀ(ξ)yⱼ(ξ) dξ.
• Step 3: Solve the eigenvalue problem Cαᵢ = λᵢαᵢ and order the eigenvalues so that λ₁ ⩾ λ₂ ⩾ ... ⩾ λ_M.
• Step 4: Normalize each eigenvector αᵢ to satisfy ⟨αᵢ, αᵢ⟩_{ℝ^M} = 1/(Mλᵢ). Then, the EEFs can be computed as ϕᵢ = Y(z)αᵢ, where Y(z) = [y₁(z) ... y_M(z)].
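For illustration, the following is a minimal NumPy sketch of Algorithm 1 on the discrete sensor grid discussed in Remark 2 below, with the dimension N chosen by the energy criterion of eq 17; the function name, the simple Riemann-sum quadrature, and the numerical safeguard are our own assumptions rather than part of the paper.

```python
import numpy as np

def kl_decomposition(Y, dz, zeta=1e-3):
    """Method-of-snapshots Karhunen-Loeve decomposition (Algorithm 1).

    Y    : (n_z, M) snapshot matrix; column i is y_i(z) sampled on the sensor grid
    dz   : grid spacing, used to approximate the inner-product integrals
    zeta : tolerance of eq 17; N is the smallest number of EEFs whose eigenvalues
           capture a fraction 1 - zeta of the total eigenvalue sum
    """
    n_z, M = Y.shape
    # Step 2: c_kj = (1/M) * integral of y_k(z)^T y_j(z) dz, by a Riemann sum
    C = (Y.T @ Y) * dz / M
    # Step 3: eigen-decomposition with eigenvalues sorted in decreasing order
    lam, alpha = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, alpha = lam[order], alpha[:, order]
    # Step 4: rescale each alpha_i so that <alpha_i, alpha_i> = 1/(M * lam_i),
    # which makes the EEFs phi_i = Y alpha_i orthonormal in L2([z1, z2])
    alpha = alpha / np.sqrt(np.maximum(M * lam, 1e-30))
    Phi = Y @ alpha                      # column i is the EEF phi_i(z) on the grid
    # dimension of the slow subspace from the energy criterion of eq 17
    N = int(np.searchsorted(np.cumsum(lam) / np.sum(lam), 1.0 - zeta)) + 1
    return Phi, lam, N
```

Here Y would be assembled from the measured profiles yᵢ(z_l) of Remark 2, and the leading N columns of Phi give Φ_s(z) on the sensor grid.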

Remark 2. Karhunen−Loève decomposition has been widely used to compute EEFs for SDPs.5,6,42−45 The EEFs are computed from the ensemble {yᵢ(z)}, which can be obtained either by direct measurements of the process or by computer simulations of the model. In this paper, we assume that the ensemble is collected from direct measurement of the SDP. However, the system state yᵢ(z) is a continuous function of the space variable z, which is impossible to measure over the whole spatial domain; in fact, for real applications, the computer can only process discrete data. Thus, one practical way of collecting the ensemble {yᵢ(z)} is to locate a finite number of sensors at n_z different points z_l = z₁ + (l − 1)Δz, l = 1, ..., n_z, with equal spacing Δz = (z₂ − z₁)/(n_z − 1), and measure the state values yᵢ(z_l). Consequently, the elements of the matrix C can be approximately computed with c_{kj} = (Δz/M) ∑_{l=1}^{n_z−1} yₖᵀ(z_l)yⱼ(z_l). Another potential method could be to use the measurements of a limited number of sensors to reconstruct the system state over the whole spatial domain, which is a promising topic and is left for future investigation. From this point of view, Karhunen−Loève decomposition can be regarded as a data-based approach.5,6,42−45

Remark 3. It is noted from the Karhunen−Loève decomposition procedure that it needs measurement of the space-continuous state y(z), which requires a sufficiently large number of sensors. Recently, the development of micro-electro-mechanical systems (MEMS) techniques has made this possible, and some results have been reported, such as the distributed control of diffusion-reaction processes,4 drag reduction,46 spatially invariant systems,47 and temperature control of catalytic surfaces.48 Admittedly, measurement at the boundary is one of the most practical options for PDE systems. Due to the complexity and difficulty of control design using boundary measurements only, this paper first starts from distributed measurements to develop a framework for adaptive suboptimal control design and theoretical analysis of PDE systems with the idea of RL. The issue of boundary measurements is left for future investigation.

4.2. Reformulation of Optimal Control Problem Based on Singular Perturbation Theory. It is known that the eigenspectrum of highly dissipative PDE systems can be divided into an infinite-dimensional stable fast part and a finite-dimensional slow part, which implies that the long-term dynamic behavior of PDE systems can be accurately described by a finite number of modes. In this subsection, we use singular perturbation theory to derive a reduced-order model of the PDE system 1−3. Without loss of generality, we consider the PDE system 1−3 with n = 1. Letting {ϕᵢ(z)} be a complete set of orthogonal basis functions, assume that the PDE state y(z, t) can be represented as an infinite weighted sum over {ϕᵢ(z)}, i.e.,

y(z, t) = ∑_{i=1}^{∞} xᵢ(t)ϕᵢ(z)    (8)

where xᵢ(t) is the ith mode of the PDE system. By taking the inner product with ϕᵢ(z) (i = 1, ...) on both sides of the PDE system 1−3, we obtain the following infinite-dimensional ODE system:

ẋ(t) = f_s(x(t), x_f(t)) + Bu(t)
ẋ_f(t) = ℱ_f(x(t), x_f(t)) + B_f u(t)
x(0) = x₀,  x_f(0) = x_{f0}    (9)

where

x(t) = ⟨y(·, t), Φ_s(·)⟩ ≜ [x₁(t) ... x_N(t)]ᵀ    (10)

and x_f(t) = ⟨y(·, t), Φ_f(·)⟩ ≜ [x_{N+1}(t) ... x_∞(t)]ᵀ ∈ ℝ^∞, f_s(x, x_f) ≜ ⟨ℱ, Φ_s(·)⟩, ℱ_f(x, x_f) ≜ ⟨ℱ, Φ_f(·)⟩, B ≜ ⟨B̄(·), Φ_s(·)⟩, B_f ≜ ⟨B̄(·), Φ_f(·)⟩, x₀ ≜ ⟨y₀(·), Φ_s(·)⟩, and x_{f0} ≜ ⟨y₀(·), Φ_f(·)⟩, with Φ_s(z) ≜ [ϕ₁(z) ... ϕ_N(z)]ᵀ and Φ_f(z) ≜ [ϕ_{N+1}(z) ... ϕ_∞(z)]ᵀ. Owing to the highly dissipative nature of the PDE system 1−3, it is reasonable to assume that

ℱ_f(x, x_f) = (1/ε)A_{fε}x_f + f_f(x, x_f)

where ε > 0 is a small parameter for slow−fast separation, A_{fε} is a stable matrix (in the sense that the state of the system ẋ_f = A_{fε}x_f tends exponentially to zero), and f_f(x, x_f) satisfies

∥f_f(x, x_f)∥ ⩽ k₁∥x∥ + k₂∥x_f∥_∞    (11)

for ∥x∥ ⩽ β₁ and ∥x_f∥ ⩽ β₂ with k₁, k₂ > 0. Then, system 9 can be rewritten in the following standard singularly perturbed form:49

ẋ(t) = f_s(x(t), x_f(t)) + Bu(t)
εẋ_f(t) = A_{fε}x_f(t) + εf_f(x(t), x_f(t)) + εB_f u(t)
x(0) = x₀,  x_f(0) = x_{f0}    (12)

By introducing the fast time scale τ = t/ε and letting ε = 0, the following fast subsystem is obtained:

dx_f/dτ = A_{fε}x_f    (13)

It is noted from reference 47 that system 13 is exponentially stable. By setting ε = 0, it follows from system 12 that x_f = 0, and thus the following reduced-order model is obtained:

ẋ(t) = f(x(t)) + Bu(t),  x(0) = x₀    (14)

where f(x) ≜ f_s(x, 0).
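The modal analyzer of Figure 1 amounts to the inner products in eq 10, evaluated numerically on the sensor grid. A minimal sketch, again with illustrative names and a simple Riemann-sum quadrature of our own choosing:

```python
import numpy as np

def pde_state_to_modes(y, Phi_s, dz):
    """Approximate x(t) = <y(., t), Phi_s(.)> of eq 10 on the sensor grid.

    y     : (n_z,) measured profile y(z_l, t) at the sensor locations
    Phi_s : (n_z, N) slow EEFs phi_1(z), ..., phi_N(z) sampled at the same locations
    dz    : grid spacing, so <y, phi_i> ~ sum_l y(z_l) phi_i(z_l) dz
    """
    return Phi_s.T @ y * dz              # (N,) vector of modes x_1(t), ..., x_N(t)

def modes_to_profile(x, Phi_s):
    """Reconstruct the slow part of the profile, y_N(z, t) = sum_i x_i(t) phi_i(z)."""
    return Phi_s @ x
```

The same two helpers also give the reconstruction-error ratio used later in eq 46.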


The substitution of eq 8 into the cost functional 4 yields

V_u(y(·, t)) = ∫_t^{+∞} (⟨y(·, τ), Q̄(·)y(·, τ)⟩ + ∥u(τ)∥²_R) dτ = ∫_t^{+∞} [ℛ(x(τ), u(τ)) + E(x(τ), x_f(τ))] dτ    (15)

where ℛ(x(τ), u(τ)) ≜ ∥x(τ)∥²_{Q_s} + ∥u(τ)∥²_R, E(x(τ), x_f(τ)) ≜ 2xᵀ(τ)Q_{sf}x_f(τ) + x_fᵀ(τ)Q_f x_f(τ), Q_s ≜ (q^s_{ij}) ∈ ℝ^{N×N}, Q_{sf} ≜ (q^{sf}_{ij}) ∈ ℝ^{N×∞}, and Q_f ≜ (q^f_{ij}) ∈ ℝ^{∞×∞} with q^s_{ij} ≜ ⟨ϕᵢ(·), Q̄(·)ϕⱼ(·)⟩, q^{sf}_{ij} ≜ ⟨ϕᵢ(·), Q̄(·)ϕ_{N−1+j}(·)⟩, and q^f_{ij} ≜ ⟨ϕ_{N−1+i}(·), Q̄(·)ϕ_{N−1+j}(·)⟩, while V_u(x) ∈ C¹(Ω) and V_u(0) = 0. By using x_f = 0, we have that E(x, x_f) = 0, and the cost functional 15 is then written as

V_u(x(t)) = ∫_t^{+∞} (∥x(τ)∥²_{Q_s} + ∥u(τ)∥²_R) dτ    (16)

In the next section, the reduced-order model 14 and the cost functional 16 are used as a basis for the online suboptimal control design of the PDE system 1−3. In this paper, the EEFs are used for model reduction; i.e., the basis functions ϕᵢ(z) are computed with Karhunen−Loève decomposition. The dimension of Φ_s(z) (i.e., N) is chosen such that it satisfies

∑_{i=1}^{N} λᵢ / ∑_{j=1}^{M} λⱼ ⩾ 1 − ζ    (17)

where ζ > 0 specifies the estimation accuracy.

5. DATA-BASED SUBOPTIMAL NEURO-CONTROL DESIGN WITH RL

The optimal control problem of the finite-dimensional nonlinear ODE system 14 with cost functional 16 can be transformed into solving the so-called HJB equation. However, HJB equations are nonlinear PDEs that are extremely difficult to solve both numerically and analytically. Even worse, the internal dynamic model f(x) in the reduced-order model 14 is also unknown, since the internal dynamic model ℱ of the original PDE system 1−3 is unknown. This implies that an explicit formulation of the HJB equation for the nonlinear system 14 is unavailable, which further complicates the optimal control design problem. Thus, we use RL together with an NN to design an online suboptimal neuro-controller based on the reduced-order model 14 and cost functional 16, without the internal system model.

5.1. Suboptimal Controller Synthesis by the HJB Approach. We first introduce the following definition of admissible control.

Definition 1. (Admissible control) For the given system 14, x ∈ Ω ⊂ ℝ^N, a control u(x): Ω → ℝ^p is defined to be admissible with respect to the cost functional 16 on Ω, denoted by u(x) ∈ U(Ω), if (1) u is continuous on Ω, (2) u(0) = 0, (3) u(x) stabilizes the system, and (4) V_u(x) < ∞ for ∀ x ∈ Ω.

The optimal control problem of the reduced-order model 14 with cost functional 16 can be formulated as finding an admissible control policy such that the linear quadratic cost functional V_u(x₀) is minimized, i.e.,

u(t) ≜ u*(t) ≜ arg min_u V_u(x₀(t))    (18)

Define the Hamiltonian of the problem

H(x, u, ∇V) ≜ (∇V)ᵀ(f(x) + Bu) + ℛ(x, u)    (19)

Then, the optimal control law u*(x) can be obtained by solving the HJB equation

min_{u∈U(Ω)} H(x, u, ∇V*) = 0    (20)

From the first-order necessary condition for optimality of u,

∂H(x, u, ∇V*)/∂u = 0

it follows that the optimal control u* is given by

u* = −(1/2)R⁻¹Bᵀ∇V*    (21)

Substituting eq 21 into eq 20, the HJB equation is rewritten as

(∇V*)ᵀf + ∥x∥²_{Q_s} − (1/4)(∇V*)ᵀBR⁻¹Bᵀ∇V* = 0    (22)

Obviously, the optimal control law 21 relies on the solution of HJB eq 22 for the value function V*(x).

5.2. RL for Online Solution of the HJB Equation. Since the internal dynamics f in the reduced-order model 14 is unknown, model-based approaches cannot be employed to solve HJB eq 22. To overcome this difficulty, we introduce the idea of a PI method15 to learn the solution of the HJB equation, which can find a suboptimal control policy online from an unknown environment. The PI procedure for solving the HJB equation is presented as follows:

Algorithm 2. Policy iteration (PI)
• Step 1: Give an initial control policy u^(0) ∈ U(Ω).
• Step 2: (Policy evaluation) Using the control policy u^(i), solve the equation

V^(i+1)(x(t)) = ∫_t^{t+Δt} ℛ(x(τ), u^(i)(τ)) dτ + V^(i+1)(x(t + Δt))    (23)

where V^(i+1) is the unknown cost function.
• Step 3: (Policy improvement) Update the control policy by

u^(i+1) = arg min_{u∈U(Ω)} H(x, u, ∇V^(i+1))    (24)

the explicit expression of which is

u^(i+1) = −(1/2)R⁻¹Bᵀ∇V^(i+1)    (25)

• Step 4: Set i = i + 1. Go back to step 2 and continue.

Now, we give some lemmas to demonstrate the convergence of the PI algorithm, i.e., its relationship with HJB eq 22.

Lemma 1. Consider system 14. Let u^(0) ∈ U(Ω) and u^(i+1) be given by eq 25. Then, (1) V^(i+1) is the solution of eq 23 if and only if it is a solution of the Lyapunov function equation

(∇V^(i+1))ᵀ(f + Bu^(i)) + ℛ(x, u^(i)) = 0    (26)

and (2) u^(i) ∈ U(Ω).

Proof. See the Appendix.

Part 2 of Lemma 1 shows that if an initial admissible control policy is given, then all control policies in the PI algorithm are admissible.

Lemma 2. Consider system 14. For any u ∈ U(Ω), the cost function V associated with u satisfies the Lyapunov function equation


(∇V)ᵀ(f + Bu) + ℛ(x, u) = 0    (27)

Proof. See the Appendix.

From Lemmas 1 and 2, it follows that the solution V^(i+1) obtained by eq 23 in the PI algorithm is the cost function associated with the control policy u^(i).

Lemma 3. Consider system 14. Given u^(0) ∈ U(Ω), let V* be the solution of HJB eq 22, u* be given by eq 21, V^(i+1) be the solution of the Lyapunov function eq 26, and u^(i+1) be given by eq 25. Then, V^(i) ⩾ V^(i+1). Moreover, when i → ∞, V^(i) → V* and u^(i) → u*.

Proof. See ref 48.

Lemma 1 shows that eq 23 in the PI algorithm is equivalent to the Lyapunov function eq 26, while Lemma 3 indicates that the solution of the Lyapunov function eq 26 converges to the solution of the HJB eq 22 as i → ∞. Thus, we can conclude that the PI algorithm converges to the solution of the HJB eq 22.
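For intuition, the policy-evaluation step of Algorithm 2 can be written directly against measured data: eq 23 says that the integrated running cost plus the change of the value estimate should vanish along the trajectory. The sketch below solves this in one batch least-squares step once a finite critic basis Υ(x) (introduced in subsection 5.3 below) is fixed; this is a common alternative to, not the same as, the gradient tuning law of eq 34 used in this paper, and all names are illustrative.

```python
import numpy as np

def policy_evaluation_lstsq(costs, d_upsilon):
    """Batch policy evaluation: fit critic weights w such that
       integral_cost_k + w^T [Upsilon(x_{k+1}) - Upsilon(x_k)] ~ 0 for every sample k.

    costs     : (K,) integrals of R(x, u^(i)) over each interval [t_k, t_k + dt]
    d_upsilon : (K, L) rows Upsilon(x(t_k + dt)) - Upsilon(x(t_k))
    """
    w, *_ = np.linalg.lstsq(d_upsilon, -costs, rcond=None)
    return w

def policy_improvement(w, grad_upsilon_x, B, R_inv):
    """Eq 25 with the critic parametrization: u = -1/2 R^{-1} B^T (dUpsilon/dx)^T w."""
    return -0.5 * R_inv @ B.T @ grad_upsilon_x.T @ w
```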

5.3. Online Actor-Critic Structure with NN. It is observed from the PI algorithm that eq 23 must be solved at each iterative step. An online actor-critic structure (as shown in Figure 1) is developed, in which critic and action NNs are used for estimating the cost function and the control policy, respectively. From Lemmas 1 and 2, V^(i+1) of eq 23 in the PI algorithm is the cost function of the control policy u^(i). Thus, it follows from Lemmas 1 and 2 that the problem of solving eq 23 can be generally formulated as: given u ∈ U(Ω), solve the equation

V(x(t)) = ∫_t^{t+Δt} ℛ(x(τ), u(τ)) dτ + V(x(t + Δt))    (28)

for the cost function V of the control policy u. It follows from the high-order Weierstrass approximation theorem52 that a continuous function can be represented by an infinite-dimensional linearly independent basis function set. For real practical applications, it is usually required to approximate the function on a compact set with a finite-dimensional function set. We first consider the critic NN for estimating V(x). Let Υ(x) = [γ₁(x) γ₂(x) ··· γ_L(x)]ᵀ be a vector of linearly independent activation functions, where L is the number of NN hidden layer neurons. Then, V(x) can be expressed with Υ(x) as follows:

V(x) = wᵀΥ(x) + ε₁    (29)

where w = [w₁, w₂, ..., w_L]ᵀ is the unknown ideal constant weight vector, and ε₁ is the NN approximation error. The substitution of eq 29 into eq 28 yields

∫_t^{t+Δt} ℛ(x(τ), u(τ)) dτ + wᵀΔΥ(t) = ε₂    (30)

where ε₂ ≜ ε₁(x(t)) − ε₁(x(t + Δt)) and ΔΥ(t) ≜ Υ(x(t + Δt)) − Υ(x(t)). The ideal critic NN weight vector w provides the best approximate solution of eq 28. However, since w is unknown, the output of the critic NN is

V̂(x) = ŵᵀΥ(x)    (31)

where ŵ = [ŵ₁, ŵ₂, ..., ŵ_L]ᵀ is the estimate of the ideal weight vector w. With eq 31, eq 28 is rewritten as

∫_t^{t+Δt} ℛ(x(τ), u(τ)) dτ + ŵᵀΔΥ(t) = e₁    (32)

It follows from eqs 30 and 32 that when ŵ → w, e₁ → ε₂. Thus, choose the objective function as the squared residual error

E₁(ŵ) = (1/2)e₁ᵀe₁    (33)

It is desired to select ŵ such that E₁(ŵ) is minimized. Select the tuning law of the critic NN weight vector as a gradient descent scheme, which is given by

ŵ̇ = −α₁ ∂E₁/∂ŵ = −α₁(∫_t^{t+Δt} ℛ(x(τ), u(τ)) dτ + ŵᵀΔΥ(t))ΔΥ(t)    (34)

where α₁ > 0 is the adaptive gain of the critic NN. When the critic NN weight estimation error is defined as w̃ ≜ ŵ − w, it follows from eqs 30 and 34 that

w̃̇ = ŵ̇ = −α₁(∫_t^{t+Δt} ℛ(x(τ), u(τ)) dτ + (w̃ + w)ᵀΔΥ(t))ΔΥ(t) = −α₁(w̃ᵀΔΥ(t) + ε₂)ΔΥ(t)    (35)

After the weight vector of the critic NN has converged, it follows from eqs 31 and 25 that the output of the action NN is directly obtained by

û = −(1/2)R⁻¹Bᵀ∇Υᵀŵ    (36)

It is noted that the weight vectors of the critic and actor NNs are the same, which means that only the critic NN is needed.

Remark 4. Control policy 36 with the parameter tuning law 34 is an adaptive control method that uses the idea of RL. The concept of the persistence of excitation condition is important in identification and adaptive control problems,53,54 and it is also required in the training of NN weights.55 It should be mentioned that the persistence of excitation condition is a sufficient condition56 for the convergence of the unknown parameters in adaptive control, so that the state space can be visited as much as possible. This implies that the persistence of excitation condition acts similarly to "exploration" in the RL community. In the proposed control method, the signal should be sufficiently rich, and it is assumed to be persistently exciting. From the definition of the persistence of excitation condition,54 there exist η₁, η₂, and T₀ > 0 such that η₁I ⩽ ∫_t^{t+T₀} ΔΥ(τ)ΔΥᵀ(τ) dτ ⩽ η₂I.
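To fix ideas, the following sketch spells out eqs 34 and 36 for the two-mode polynomial basis that is used for case 1 in section 6; the forward-Euler discretization of the continuous-time tuning law and the helper names are our own choices rather than the paper's implementation.

```python
import numpy as np

def upsilon(x):
    """Activation vector Upsilon(x) for N = 2 modes (the case-1 basis of section 6)."""
    x1, x2 = x
    return np.array([x1**2, x1*x2, x2**2, x1**2 * x2**2,
                     x1 * x2**3, x1**3 * x2, x1**4, x2**4])

def grad_upsilon(x):
    """Gradient dUpsilon/dx of the activation vector, one row per activation (L x N)."""
    x1, x2 = x
    return np.array([[2*x1, 0.0], [x2, x1], [0.0, 2*x2],
                     [2*x1*x2**2, 2*x1**2*x2], [x2**3, 3*x1*x2**2],
                     [3*x1**2*x2, x1**3], [4*x1**3, 0.0], [0.0, 4*x2**3]])

def critic_update(w_hat, running_cost, d_upsilon, alpha1, dt_update):
    """One forward-Euler step of the gradient tuning law, eq 34:
       w_hat_dot = -alpha1 * (integrated_cost + w_hat^T dUpsilon) * dUpsilon."""
    e1 = running_cost + w_hat @ d_upsilon
    return w_hat - alpha1 * e1 * d_upsilon * dt_update

def control(w_hat, x, B, R_inv):
    """Action-NN output, eq 36: u = -1/2 R^{-1} B^T grad(Upsilon(x))^T w_hat."""
    return -0.5 * R_inv @ B.T @ grad_upsilon(x).T @ w_hat
```

Here B is the projected input matrix ⟨B̄(·), Φ_s(·)⟩ of eq 9, and R_inv stands for R⁻¹.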


5.4. Specific Suboptimal Neuro-control Implementation Procedure. In the previous sections, based on the structure in Figure 1, we have developed an online data-based suboptimal neuro-control approach for the PDE system without the need for the internal dynamics. However, each part of the control method has been presented separately, and the actor-critic method in subsection 5.3 is only designed for solving one instance of eq 23 in PI. Therefore, it is necessary to put them together and give a complete and implementable procedure.

Algorithm 3. Suboptimal neuro-control implementation procedure
• Step 1: Give the number of snapshots M, ζ in eq 17, and the time interval Δt. Select the activation function vector Υ(x) of the critic NN and an initial NN weight vector ŵ(0) such that û^(0) = −(1/2)R⁻¹Bᵀ∇Υᵀŵ(0) is an admissible control law.
• Step 2: Collect an ensemble of snapshots from the real SDP. Based on the ensemble, use the Karhunen−Loève decomposition described in subsection 4.1 to compute the EEFs and determine the dimension N of the reduced-order model based on eq 17.
• Step 3: Set i = 0.
• Step 4: Apply the following control law û^(i) to the real SDP:

û = û^(i) = −(1/2)R⁻¹Bᵀ∇Υᵀw^(i)    (37)

Measure the real system states y(·, t) and y(·, t + Δt), and use eq 10 to obtain the mode states x(t) and x(t + Δt). Then, tune the critic NN weight vector ŵ with

ŵ̇ = −α₁(∫_t^{t+Δt} ℛ(x(τ), û^(i)(τ)) dτ + ŵᵀΔΥ(t))ΔΥ(t)    (38)

until it converges, and then take w^(i+1) as the convergent ŵ.
• Step 5: Set i = i + 1. If the convergence of the critic NN weight vector is achieved, stop tuning the weight vector and keep it invariant from that instant on; otherwise, go back to step 4 and continue.

From the procedure described above, it is obvious that the developed suboptimal neuro-control approach is an online one, as the sketch below also illustrates. By measuring the states of the real SDP online, neither the Karhunen−Loève decomposition procedure for computing the EEFs nor the PI algorithm for learning the suboptimal control law requires the internal dynamic model of the SDP.
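Putting the pieces together, steps 4 and 5 of Algorithm 3 form the loop sketched below, which reuses the helper functions from the previous sketches (pde_state_to_modes, upsilon, grad_upsilon, critic_update, control); the plant interface measure_profile/apply_control and the one-step quadrature of the running cost are placeholders of our own, not part of the paper.

```python
import numpy as np

def online_neuro_control(measure_profile, apply_control, Phi_s, dz, B, Qs, R,
                         w_hat, alpha1, dt, n_steps):
    """Skeleton of Algorithm 3 (steps 4-5): measure the SDP state, project it to the
    modes of eq 10, tune the critic weights by eq 38, and apply the control of eq 37."""
    R_mat = np.atleast_2d(R)
    R_inv = np.linalg.inv(R_mat)
    x = pde_state_to_modes(measure_profile(), Phi_s, dz)
    for step in range(n_steps):
        u = control(w_hat, x, B, R_inv)                     # eq 37 with current weights
        apply_control(u)
        x_next = pde_state_to_modes(measure_profile(), Phi_s, dz)
        # one-step quadrature of int_t^{t+dt} (||x||_Qs^2 + ||u||_R^2) dtau
        running_cost = (x @ Qs @ x + u @ R_mat @ u) * dt
        d_upsilon = upsilon(x_next) - upsilon(x)
        w_hat = critic_update(w_hat, running_cost, d_upsilon, alpha1, dt)   # eq 38
        x = x_next
    return w_hat
```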

Remark 5. Note that the motivation of this paper is to introduce the idea of RL for the control design of infinite-dimensional SDPs in the control community and to establish the related stability theory. To the best of our knowledge, this has rarely been studied in either the machine learning or the control community, since most existing works on RL mainly focus on finite-dimensional systems. It is hopeful that the idea of PI used in this paper can be replaced with other RL approaches, such as TD methods15 or Q-learning.57 In this paper, we are not trying to develop the most effective or a totally new RL algorithm. Also, it is not a direct task to establish the related theory for solving the optimal control problem of infinite-dimensional SDPs by introducing the idea of TD methods into the control community. That is to say, for infinite-dimensional SDPs, it is still theoretically unclear whether these RL methods will converge and guarantee stability. Thus, these issues are left for future studies. It is promising to introduce other ideas from RL to solve the control problems of infinite-dimensional SDPs, and a great deal of work remains to be done.

5.5. Stability Analysis. Before starting the stability analysis, we first give the following definition and assumptions.

Definition 2.58 Consider system 14; the solution x(t) is semiglobally uniformly ultimately bounded if, for any compact set Ω ⊂ ℝ^N and x(t₀) = x₀ ∈ Ω, there exist positive constants μ and T(μ, x₀) such that ∥x(t)∥ < μ for all t > t₀ + T.

Assumption 1. (1) For ∀ u ∈ U(Ω), the residual error ε₂ in eq 30 is bounded on a compact set containing Ω such that |ε₂| ⩽ ε_{2M}, and the ideal weight vector w is bounded, i.e., ∥w∥ ⩽ υ_{wM}, where ε_{2M} and υ_{wM} are positive numbers. (2) The critic NN activation functions Υ and their gradients ∇Υ are bounded on a compact set containing Ω, i.e., ∥Υ∥ ⩽ υ_{γM} and ∥∇Υ∥ ⩽ υ_{DγM}, where υ_{γM} and υ_{DγM} are positive numbers.

The following theorem shows that the critic NN weight vector ŵ converges to the ideal unknown weight vector w in eq 29 under the NN weight tuning rule 34. This means that V̂(x) given in eq 31 converges to the actual cost function V(x) of the control policy u.

Theorem 1. Consider the system 14 and cost functional 16. Let assumption 1 hold and the tuning law for the critic NN weight vector be given by eq 34. Then, the critic NN weight estimation error is semiglobally uniformly ultimately bounded.

Proof. See the Appendix.

Notice that theorem 1 gives the convergence of the NN method in subsection 5.3. However, this method is just used for solving eq 23 in the PI algorithm. Thus, we also need to show that the tuning law 38 for the critic NN weight vector can guarantee the convergence of the PI algorithm and the stability of both the closed-loop reduced-order model 14 and the original PDE system 1−3. With the NN activation function vector Υ(x), the solution of HJB eq 22 is represented as

V*(x) = (w*)ᵀΥ(x) + ε₃    (39)

where w* = [w₁*, w₂*, ..., w_L*]ᵀ is the unknown ideal constant weight vector, and ε₃ is the approximation error. It is known that the HJB eq 22 can also be rewritten as

V*(x(t + Δt)) − V*(x(t)) + ∫_t^{t+Δt} ℛ(x(τ), u*(τ)) dτ = 0    (40)

The substitution of eq 39 into the HJB eq 40 yields

∫_t^{t+Δt} ℛ(x(τ), u*(τ)) dτ + w*ᵀΔΥ(t) = ε₄    (41)

where ε₄ ≜ ε₃(x(t)) − ε₃(x(t + Δt)). Define the weight estimation error w̃₁ ≜ ŵ − w*. By using the control policy u^(i) in an iterative step of the PI algorithm, it follows from eq 38 that

w̃̇₁ = ŵ̇ = −α₁(∫_t^{t+Δt} ℛ(x(τ), û^(i)(τ)) dτ + (w̃₁ + w*)ᵀΔΥ(t))ΔΥ(t) = −α₁(w̃₁ᵀΔΥ(t) + ε₂^(i))ΔΥ(t)    (42)

where

ε₂^(i) ≜ ∫_t^{t+Δt} ℛ(x(τ), û^(i)(τ)) dτ + w*ᵀΔΥ(t)

It is noted from lemma 3 that u^(i) → u* when i → ∞; this implies that ε₂^(i) → ε₄, i.e., ŵ → w*. To show the convergence of w̃₁, the following assumption is required.


Assumption 2. The NN approximation error ε₃ of V* and its derivative ∇ε₃ are bounded on a compact set containing Ω, i.e., ∥ε₃∥ ⩽ ε_{3M} and ∥∇ε₃∥ ⩽ ε_{3DM}, where ε_{3M} and ε_{3DM} are positive numbers.

Note that assumptions 1 and 2 are a set of standard assumptions.

Theorem 2. Consider the PDE system 1−3. Use eq 38 as the tuning law for the critic NN weight vector, and let the control policy be given by eq 37. Select an initial critic NN weight vector w^(0) such that û^(0) ∈ U(Ω), and let assumptions 1 and 2 hold. Then, there exist positive real constants σ₁, σ₂, and ε* such that, if ∥x₀∥ ⩽ σ₁, ∥x_{f0}∥_∞ ⩽ σ₂, and ε ∈ (0, ε*), the estimation error w̃₁ of eq 42 and the solution x of the closed-loop reduced-order model 14 are semiglobally uniformly ultimately bounded; moreover, the state y(z, t) of the original closed-loop PDE system 1−3 is semiglobally uniformly ultimately bounded in the L₂-norm, i.e., there exists a positive number σ₃ such that ∥y(·, t)∥ < σ₃.

Proof. See the Appendix.

In theorem 2, it is proved that the original infinite-dimensional closed-loop PDE system 1−3 is semiglobally uniformly ultimately bounded in the L₂-norm under the adaptive suboptimal control policy 37 and the NN weight tuning law 38 derived from the idea of RL. To this end, it is necessary to use singular perturbation theory to find a small parameter ε*. Moreover, the persistence of excitation condition and the critic NN estimation error are also considered in theorem 2.

6. APPLICATION TO A NONLINEAR DIFFUSION-REACTION PROCESS

To test the developed data-based suboptimal neuro-control method, we consider the following nonlinear diffusion-reaction process:

∂y(z, t)/∂t = ∂/∂z (k(y) ∂y(z, t)/∂z) + β_T(z)(e^{−ρ/(1+y)} − e^{−ρ}) + β_U(b(z)u(t) − y)    (43)

subject to the Dirichlet boundary conditions

y(0, t) = y(π, t) = 0    (44)

and the initial condition

y₀(z) = 0.3 sin(3z)    (45)

where y(z, t) is the PDE state, z ∈ [0, π], k(y) is the state-dependent diffusion coefficient, β_U is the heat transfer coefficient, β_T(z) is the spatially varying heat of reaction, ρ is the activation energy, and b(z) is the actuator distribution function. The weight matrices in cost functional 4 are selected as Q̄(z) = 1 for ∀ z ∈ [0, π] and R = 1.
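Remark 7 below notes that, in the simulations, the model 43−45 is used only to generate data. For completeness, here is one way such snapshot data could be produced with an explicit finite-difference scheme; the discretization, step sizes, and default parameters (set to the case-1 values) are our own assumptions and are not specified in the paper.

```python
import numpy as np

def simulate_snapshots(n_z=101, dt=1e-4, t_end=2.0, sample_every=200,
                       k=lambda y: np.ones_like(y),
                       beta_T=lambda z: 12.0 + 0.0 * z, beta_U=1.0, rho=1.0,
                       b=lambda z: ((z > 0.4 * np.pi) & (z < 0.5 * np.pi)).astype(float),
                       u=lambda t: 0.0):
    """Explicit finite-difference integration of eqs 43-45, returning a snapshot matrix."""
    z = np.linspace(0.0, np.pi, n_z)
    dz = z[1] - z[0]
    y = 0.3 * np.sin(3.0 * z)                              # initial condition, eq 45
    snapshots = []
    for step in range(int(t_end / dt)):
        # d/dz ( k(y) dy/dz ), with k evaluated at the cell faces
        flux = k(0.5 * (y[1:] + y[:-1])) * np.diff(y) / dz
        diffusion = np.zeros_like(y)
        diffusion[1:-1] = np.diff(flux) / dz
        reaction = beta_T(z) * (np.exp(-rho / (1.0 + y)) - np.exp(-rho))
        y = y + dt * (diffusion + reaction + beta_U * (b(z) * u(step * dt) - y))
        y[0] = y[-1] = 0.0                                 # Dirichlet BCs, eq 44
        if step % sample_every == 0:
            snapshots.append(y.copy())
    return z, np.column_stack(snapshots)                   # (n_z, M) ensemble {y_i(z)}
```

The resulting ensemble can be fed directly to the kl_decomposition sketch of subsection 4.1.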

Remark 6. In the computer simulations, the computational CPU time of the developed suboptimal neuro-control method consists of two parts. The first part is for computing the EEFs with Karhunen−Loève decomposition, which is an offline procedure. This means that once the EEFs are computed, they remain unchanged during the online control process. The second part is consumed by the online adaptive control, which is almost negligible since it is a real-time control approach.

Case 1: Linear Spatial Differential Operator. In this case, we consider the SDP 43−45 with a constant diffusion coefficient k = 1. The other system parameters are given as β_T(z) = 12, β_U = 1, ρ = 1, and b(z) = H(z − 0.4π) − H(z − 0.5π). Figure 2 gives the open-loop system state profile.

Figure 2. The state profile of the open-loop diffusion-reaction process for case 1.

By using Karhunen−Loève decomposition for computing EEFs, it is found that the first two EEFs (i.e., N = 2) satisfy condition 17; they are shown in Figure 3. This means that the first two EEFs will be employed to compute the states of the reduced-order model.

Figure 3. First two EEFs for case 1.

Remark 7. It is worth mentioning that, for the computer simulations, the system model 43−45 is used only for generating data. In a real application, the data can be measured from the real system rather than from a simulation of the system model 43−45. That is to say, the internal system model is not required in practice.

To test the discrepancy between the actual state y(·, t) and the estimated state y_N(·, t), Figure 4 shows the following ratio for the open-loop system:

r_y(t) ≜ (∥y(·, t) − y_N(·, t)∥₂² / ∥y(·, t)∥₂²)^{1/2}    (46)

Figure 4. For case 1, the curve r_y(t) for the open-loop system state in Figure 2 with the EEFs in Figure 3.

It is seen from Figure 4 that a good approximation is achieved in reproducing the states of the SDP with the first two EEFs. We now apply the developed online suboptimal neuro-control method to the SDP 43−45.


Select the state measurement time interval Δt = 0.02 s, the NN activation function vector Υ(x) of size 8 (i.e., L = 8) as Υ(x) = [x₁², x₁x₂, x₂², x₁²x₂², x₁x₂³, x₁³x₂, x₁⁴, x₂⁴]ᵀ, and the initial NN weight vector ŵ(0) as ŵ(0) = [3, 0, 0, 0, 0, 0, 0, 0]ᵀ. Figure 5 gives the norm of the NN weight vector ŵ(t); the NN weight vector converges to ŵ = [3.7974, 0.1811, 0.0149, 0.0263, −0.0036, 0.3356, 1.4785, 0.0009]ᵀ at t = 5 s. From that instant on, we remove the noise, stop the PI algorithm, and keep the weights invariant.

Figure 5. Trajectory of the NN weight vector norm ŵ(t) for case 1.

Figures 6−8 show the control action, the actual state trajectories of the closed-loop reduced-order system, and the state profile of the diffusion-reaction process, respectively. It can be seen from the figures that these signals tend to zero.

Figure 6. Control action for case 1.

Figure 7. State trajectories of the closed-loop reduced-order system for case 1.

Figure 8. The state profile of the closed-loop diffusion-reaction process for case 1.

Case 2: Nonlinear Spatial Differential Operator. In this case, we consider the SDP 43−45 with the nonlinear spatial differential operator k(y) = 0.5 + 0.7/(y + 1). The other system parameters are given as β_T(z) = 16(cos(z) + 1), β_U = 2, ρ = 2, and b(z) = H(z − 0.4π) − H(z − 0.5π). With the same parameter settings as in case 1, Figure 9 shows the open-loop state profile, and Figure 10 gives the first three EEFs computed by Karhunen−Loève decomposition.

Figure 9. The state profile of the open-loop diffusion-reaction process for case 2.

Figure 10. First two EEFs for case 2.

Since the first three EEFs account for more than 99% of the energy, we use the third-order ODE model (i.e., N = 3) to describe the SDP. Figure 11 demonstrates the ratio 46 for the open-loop system state.

Figure 11. For case 2, r_y(t) for the open-loop system state in Figure 9 with the EEFs in Figure 10.

We select the NN activation function vector Υ(x) of size 18 (i.e., L = 18) as Υ(x) = [x₁², x₁x₂, x₁x₃, x₂², x₂x₃, x₃², x₁²x₂², x₁²x₃², x₂²x₃², x₁x₂³, x₁x₃³, x₂x₃³, x₁³x₂, x₁³x₃, x₂³x₃, x₁⁴, x₂⁴, x₃⁴]ᵀ and the elements of the initial NN weight vector ŵ(0) as ŵ₁ = 1.4 and ŵᵢ = 0 for i = 2, ..., 18.

Figure 12. Trajectory of the NN weight vector norm ŵ(t) for case 2.

Figure 12 gives the norm of the NN weight vector ŵ(t); the NN weight vector converges to ŵ = [1.4858, 0.0272, 0.0132, 0.0059, 0.0044, 0.0022, 0.0128, 0.0019, 0.0001, 0.0058, 0.0003, 0.0001, 0.0308, 0.0118, 0.0006, 0.0861, 0.0025, 0.0001]ᵀ at t =


online measured data instead of the accurate mathematical models of the real industrial SDPs. Finally, simulation studies are conducted, and the achieved results demonstrate the effectiveness of the developed suboptimal neuro-control method.

5 (s). It can be seen from Figure 12 that a good convergence behavior is achieved. From that instant on, we remove the noise, stop the PI algorithm, and keep the weights invariant. Figures 13−15 show the control action, actual state trajectories



APPENDIX

Proof of Lemma 1

(1) Applying the control policy u(i) to the system 14, we have the following closed-loop system: x ̇ = f + Bu(i) Figure 13. Control action for case 2.

(A1)

By using eq A1, the derivative of V

(i+1)

can be written as

d (i + 1) V = (∇V (i + 1))T (f + Bu(i)) dt

Integrating the above equation from t to t + Δt yields V (i + 1)(x(t + Δt )) − V (i + 1)(x(t )) = Figure 14. State trajectories of closed-loop reduced-order system for case 2.

If V

∫t

(i+1)

t +Δt

(∇V (i + 1))T (f (x(τ )) + Bu(i)(τ )) dτ

(A2)

is a solution of Lyapunov function eq 26, we have

(∇V (i + 1))T (f + Bu(i)) = −9(x , u(i))

The substitution of the above equation into A2 yields V (i + 1)(x(t + Δt )) − V (i + 1)(x(t )) =−

∫t

t +Δt

9(x(τ ), u(i)(τ )) dτ

which means that V(i+1) is also a solution of eq 23. Now, we prove the solution’s uniqueness of eq 23 by contradiction. From 23, we have d (i + 1) 1 V (V (i + 1)(x(t + Δt )) − V (i + 1)(x(t ))) = lim Δt → 0 Δt dt

Figure 15. The state profile of the closed-loop diffusion-reaction process for case 2.

of the closed-loop reduced-order system, and the state profile of the diffusion-reaction process, respectively. It can be seen from the figures that these signals tend to zero.

= − lim

Δt → 0

1 Δt

∫t

t +Δt

9(x(τ ), u(i)(τ )) dτ

t +Δt 1 ( 9(x(τ ), u(i)(τ )) dτ Δt → 0 Δt 0 t − 9(x(τ ), u(i)(τ )) dτ )

= − lim

7. CONCLUSION In this paper, the optimal control design problem of the highly dissipative nonlinear continuous-time SDPs with partially unknown dynamics has been solved by introducing the idea of RL. Based on the EEFs computed by Karhunen−Loève decomposition, a reduced-order model is obtained with the singular perturbation theory, and the optimal control problem is reformulated based on the reduced-order model. Then, the HJB theory is applied for synthesizing a suboptimal controller, and the idea of PI is introduced for online learning of the solution of the HJB equation. By using NN to approximate the cost function, a data-based suboptimal neuro-control method is developed that can be implemented on the basis of an actioncritic structure. Subsequently, a NN weight tuning law based on the gradient descent algorithm is provided. Moreover, we also show that the tuning law can guarantee the stabilities of the closed-loop reduced-order system and the original PDE system. The whole suboptimal neuro-control method is an online procedure without requiring the internal system model, because both Karhunen−Loève decomposition and the PI algorithm use



∫0

=−

d dt

∫0

t +Δt

9(x(τ ), u(i)(τ )) dτ

= −9(x(t ), u(i)(t )) (A3)

Assume that Ṽ (i+1) is another solution of 23. Then it follows from A3 that d ̃ (i + 1) V = −9(x(t ), u(i)(t )) dt

(A4)

Subtracting A4 from A3, we get d (i + 1) (i + 1) (V )=0 − Ṽ dt

for ∀ x ∈ Ω. Then, V(i+1)(x) − Ṽ (i+1)(x) = c, where c is a real constant, and c = 0 when x = 0. Thus, V(i+1)(x) − Ṽ (i+1)(x) = 0, 8115

dx.doi.org/10.1021/ie4031743 | Ind. Eng. Chem. Res. 2014, 53, 8106−8119

Industrial & Engineering Chemistry Research

Article

for ∀ x ∈ Ω, i.e., V(i+1) = Ṽ (i+1) This completes the part 1 of lemma 1. (2) Due to V(i) ∈ C1(Ω), it follows from eq 25 that u(i) is continuous. Since V(i) ⩾ 0 and V(i)(0) = 0 is the minimum, the expression 25 of u(i) implies that u(i)(0) = 0. To show that u(i) is a stabilizing control law, select V(i) as a Lyapunov function candidate. Under the control policy u(i), the closed-loop system is given by eq A1. Then, taking the derivative of V(i) along the trajectories of eq A1 yields V̇

(i)

= (∇V (i))T (f + Bu(i))

·

L̇(t ) = α1−1w̃ T ŵ = −w̃ T (w̃ T ΔΥ + ε2)ΔΥ = −(w̃ T ΔΥ)2 − ε2w̃ T ΔΥ = −|w̃ T ΔΥ|2 + |ε2 || w̃ T ΔΥ| = −|w̃ T ΔΥ|(|w̃ T ΔΥ| − |ε2|)

(A5)

According to part 1 of lemma 1 (i.e., eq 26), we have (∇V (i + 1))T f = −(∇V (i))T Bu(i − 1) − 9(x , u(i − 1))

(A6)

(i)

= (∇V (i))T B(u(i) − u(i − 1)) − 9(x , u(i − 1))

= −2(u(i))T Ru(i) + 2(u(i))T Ru(i − 1) − 9(x , u(i − 1))

(i)



= (u(i))T Ru(i − 1) − (u(i))T Ru(i) + (u(i))T Ru(i − 1) − (u(i − 1))T Ru(i − 1) − (u(i))T Ru(i) − || x ||Q2 s

=

= (u(i))T R(u(i − 1) − u(i)) + (u(i − 1))T R(u(i)) − u(i − 1)) − (u(i))T Ru(i) − || x ||Q2 s

|| α1ΔΥ || || w̃ T ΔΥ + ε2 ||

η1

∫t

μ1 +

t + T0

η2T0 η1 η2T0 η1

α1η2 η1 α1η2 η1

(μ1 + ε2M )

α1η2 η1

t + T0

|| ΔΥ(τ )|| dτ

(μ1 + ε2M )

|| ΔΥ(τ )||2 dτ )1/2 (

μ1 +

∫t

∫t

t + T0

dτ )1/2

(μ1 + ε2M ) η2T0

(μ1 + α1η2(μ1 + ε2M ))

⎧ ∂y ⎛ ∂y ∂ 2y ∂ n0y ⎞ 1 ⎪ = 3⎜y , , 2 , ..., n0 ⎟ − B̅ (z)R−1 ⎪ ∂t ∂z ⎠ 2 ⎝ ∂z ∂z ⎪ T T (i) B ∇Υ w ⎨ t +Δt ⎪ 9(x(τ ), u(̂ i)(τ )) dτ ⎪ ŵ = −α1( t ⎪ + ŵ T ΔΥ)ΔΥ ⎩

which implies V < 0 for ∀ x ≠ 0. In other words, the control policy u(i) can stabilize system 14. Thus, ũ(i) ∈ U(Ω). Proof of Lemma 2

With the admissible control policy u, the closed-loop system and its cost function V are given by eqs 14 and 16 (by taking V = Vu), respectively. Taking the derivative of V with respect to system 14, we have



(A14)

Using the SP technique in subsection 4.2 for model reduction, the system A14 is equivalently rewritten as

(A10)

⎧ 1 −1 T T (i) ⎪ x ̇ = fs (x , 0) − BR B ∇Υ w + fs (x , xf ) 2 ⎪ − f ( x , 0) ⎪ s ⎪ ⎪ εx ̇ = A x + εf (x , x ) − 1 εB R−1BT fε f f f f ⎨ f 2 T (i) ⎪ ∇Υ w ⎪ t +Δt ⎪ ̇ 9(x(τ ), u(̂ i)(τ )) dτ ⎪ ŵ = −α1( t ⎪ + ŵ T ΔΥ)ΔΥ ⎩

On the other hand, taking the derivative on eq 16 yields (A11)

It is clear from eqs A10 and A11 that V satisfies Lyapunov function eq 27. Proof of Theorem 1



Select the following Lyapunov function candidate: 1 T w̃ w̃ 2α1

t + T0

Under control policy 37 and critic NN weight vector tuning law 38 in the PI algorithm, the closed-loop PDE system is represented as

(A9)

L (t ) =

η1

∫t

Proof of Theorem 2

(i)

d V = −9(x , u) dt

η2

Therefore, w̃ is semi-globally uniformly ultimately bounded.

= −(u(i − 1) − u(i))T R(u(i − 1) − u(i)) − (u(i))T Ru(i) − || x ||Q2 s

d V = (∇V )T (f + Bu) dt

η2T0 (

Since (u(i))T Ru(i−1) = (u(i−1))TRu(i) and 9 (x,u(i−1)) = ∥x∥Qs2 + (u(i−1))T Ru(i−1), eq A9 can be rewritten as

μ1 +

η1

< (A9)



η2T0

(A8)

Hence, it follows from eqs A7 and A8 that V̇

η1


0. Leting m10 ≜ σ(II), it follows from eq A27 that (A23)

L̇(t ) ≤ −m10 || νxw ||2 + bmT || νxw || + m9

Taking the derivatives of V*(x), Vf(xf), and L1(t) along the trajectories of eq A15 yields

Thus, L̇ (t) ⩽ 0 if 8117

dx.doi.org/10.1021/ie4031743 | Ind. Eng. Chem. Res. 2014, 53, 8106−8119

Industrial & Engineering Chemistry Research || νxw || ≥

m8 +

Article

(14) Li, M. H.; Christofides, P. D. An input/output approach to the optimal transition control of a class of distributed chemical reactors. Chem. Eng. Sci. 2007, 62, 2979−2988. (15) Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, 1998. (16) Bertsekas, D. P.; Tsitsiklis, J. N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, 1996. (17) Hafner, R.; Riedmiller, M. Reinforcement learning in feedback control. Mach. Learn. 2011, 84, 137−169. (18) Hernandez-del-Olmo, F.; Gaudioso, E.; Nevado, A. Autonomous adaptive and active tuning up of the dissolved oxygen setpoint in a wastewater treatment plant using reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 768−774. (19) Kaelbling, L. P.; Littman, M. L.; Moore, A. W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237−285. (20) Busoniu, L.; Babuska, R.; De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern., Part C 2008, 38, 156−172. (21) Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 2000, 12, 219−245. (22) Grondman, I.; Busoniu, L.; Lopes, G. A. D.; Babuska, R. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 1291−1307. (23) Powell, W. B. Approximate Dynamic Programming: Solving the Curses of Dimensionality; Wiley-Interscience: Hoboken, NJ, 2007. (24) Tsitsiklis, J. N.; Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control 1997, 42, 674−690. (25) Lewis, F. L.; Vrabie, D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ. Syst. Mag. 2009, 9, 32−50. (26) He, P.; Jagannathan, S. Reinforcement learning neural-networkbased controller for nonlinear discrete-time systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B 2007, 37, 425−436. (27) Al-Tamimi, A.; Lewis, F. L.; Abu-Khalaf, M. Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans. Syst. Man Cybern. Part B 2008, 38, 943−949. (28) Lu, C.; Si, J.; Xie, X. R. Direct heuristic dynamic programming for damping oscillations in a large power system. IEEE Trans. Syst. Man Cybern. Part B 2008, 38, 1008−1013. (29) Zhang, H.; Song, R.; Wei, Q.; Zhang, T. Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans. Neural Netw. 2011, 22, 1851−1862. (30) Tosukhowong, T.; Lee, J. H. Approximate dynamic programming based optimal control applied to an integrated plant with a reactor and a distillation column with recycle. AIChE J. 2009, 55, 919− 930. (31) Wang, F.; Jin, N.; Liu, D.; Wei, Q. Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 2011, 22, 24−36. (32) Lewis, F. L.; Vamvoudakis, K. G. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man Cybern. Part B 2011, 41, 14−25. (33) Yang, Q.; Jagannathan, S. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Trans. Syst. Man Cybern. Part B 2012, 42, 377−390. (34) Si, J.; Wang, Y. Online learning control by association and reinforcement. IEEE Trans. Neural Netw. 2001, 12, 264−276. (35) Murray, J. J.; Cox, C. 
J.; Lendaris, G. G.; Saeks, R. Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Part C 2002, 32, 140−153.

m82 + 4m10m9 2m10

This means that x,xf and w̃ 1 are semi-globally uniformly ultimately bounded. Considering ∥y(,t)∥22 = ∥x(t)∥2 + ∥xf(t)∥R∞2, this means that the state y(z,t) of the original closed-loop PDE system 1−3 is semi-globally uniformly ultimately bounded in 3 2 -norm.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was supported in part by the National Basic Research Program of China (973 Program) (2012CB720003), the National Natural Science Foundation of China under Grants (61121003; 51175519), a project from RGC of Hong Kong (CityU: 117310), and Innovation Foundation of BUAA for Ph.D. Graduates under Grant YWF-13-RBYJ-009. The authors would like to thank anonymous reviewers for their valuable comments.



REFERENCES

(1) Curtain, R. F.; Zwart, H. J. An Introduction to Infinite-dimensional Linear Systems Theory; Springer Verlag: New York, 1995. (2) Luo, B.; Wu, H.-N. Online policy iteration algorithm for optimal control of linear hyperbolic PDE systems. J. Process Control 2012, 22, 1161−1170. (3) Yadav, V.; Padhi, R.; Balakrishnan, S. N. Robust/optimal temperature profile control of a high-speed aerospace vehicle using neural networks. IEEE Trans. Neural Networks 2007, 18, 1115−1128. (4) Li, M.; Christofides, P. D. Optimal control of diffusionconvection-reaction processes using reduced-order models. Comput. Chem. Eng. 2008, 32, 2123−2135. (5) Luo, B.; Wu, H.-N. Approximate optimal control design for nonlinear one-dimensional parabolic PDE systems using empirical eigenfunctions and neural network. IEEE Trans. Syst. Man Cybern. Part B 2012, 42, 1538−1549. (6) Armaou, A.; Christofides, P. D. Dynamic optimization of dissipative PDE systems using nonlinear order reduction. Chem. Eng. Sci. 2002, 57, 5083−5114. (7) Dubljevic, S.; Christofides, P. D. Predictive output feedback control of parabolic partial differential equations (PDEs). Ind. Eng. Chem. Res. 2006, 45, 8421−8429. (8) Aksikas, I.; Winkin, J. J.; Dochain, D. Optimal LQ-feedback regulation of a nonisothermal plug flow reactor model by spectral factorization. IEEE Trans. Automat. Control 2007, 52, 1179−1193. (9) Iftime, O. V.; M. A. Demetriou, M. A. Optimal control of switched distributed parameter systems with spatially scheduled actuators. Automatica 2009, 45, 312−323. (10) Wu, H.-N.; Luo, B. Heuristic dynamic programming algorithm for optimal control design of linear continuous-time hyperbolic PDE systems. Ind. Eng. Chem. Res. 2012, 51, 9310−9319. (11) Christofides, P. D. Control of nonlinear distributed process systems: Recent developments and challenges. AIChE J. 2001, 47, 514−518. (12) Dubljevic, S.; El-Farra, N. H.; Mhaskar, P.; Christofides, P. D. Predictive control of parabolic PDEs with state and control constraints. Int. J. Robust Nonlinear Control 2006, 16, 749−772. (13) Dubljevic, S.; Mhaskar, P.; El-Farra, N. H.; Christofides, P. D. Predictive control of transport-reaction processes. Comput. Chem. Eng. 2005, 29, 2335−2345. 8118

dx.doi.org/10.1021/ie4031743 | Ind. Eng. Chem. Res. 2014, 53, 8106−8119

Industrial & Engineering Chemistry Research

Article

(36) Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M.; Lewis, F. L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477−484. (37) Vrabie, D.; Lewis, F. L. Neural network approach to continuoustime direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw. 2009, 22, 237−246. (38) Vamvoudakis, K. G.; Lewis, F. L. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010, 46, 878−888. (39) Lee, T. T.; Wang, F. Y.; Newell, R. B. Advances in distributed parameter approach to the dynamics and control of activated sludge processes for wastewater treatment. Water Res. 2006, 40, 853−869. (40) Baker, J.; Christofides, P. D. Output feedback control of parabolic PDE systems with nonlinear spatial differential operators. Ind. Eng. Chem. Res. 1999, 38, 4372−4380. (41) Armaou, A.; Christofides, P. D. Crystal temperature control in the Czochralski crystal growth process. AIChE J. 2001, 47, 79−106. (42) Holmes, P.; Lumley, J. L.; Berkooz, G. Turbulence, Coherent Structures, Dynamical Systems and Symmetry; Cambridge University Press: Cambridge, UK, 1996. (43) Qi, C.-K.; Li, H.-X.; Li, S.; Zhao, X.; Gao, F. Probabilistic PCAbased spatiotemporal multimodeling for nonlinear distributed parameter processes. Ind. Eng. Chem. Res. 2012, 51, 6811−6822. (44) Pitchaiah, S.; Armaou, A. Output feedback control of dissipative PDE systems with partial sensor information based on adaptive model reduction. AIChE J. 2013, 59, 747−760. (45) Pitchaiah, S.; Armaou, A. Output feedback control of distributed parameter systems using adaptive proper orthogonal decomposition. Ind. Eng. Chem. Res. 2010, 49, 10496−10509. (46) Ho, C.; Tai, Y. REVIEW: MEMS and its applications for flow control. ASME J. Fluid Eng. 1996, 118, 437−447. (47) Bamieh, B.; Paganini, F.; Dahleh, M. A. Distributed control of spatially invariant systems. IEEE Trans. Automat. Control 2002, 47, 1091−1107. (48) Wolff, J.; Papathanasiou, A. G.; Rotermund, H. H.; Ertl, G.; Li, X.; Kevrekidis, I. G. Local manipulation of catalytic surface reactivity. J. Catal. 2003, 216, 246−256. (49) Christofides, P. D.; Daoutidis, P. Finite-dimensional control of parabolic PDE systems using approximate inertial manifolds. J. Math. Anal. Appl. 1997, 216, 398−420. (50) Christofides, P. D. Robust control of parabolic PDE systems. Chem. Eng. Sci. 1998, 53, 2949−2965. (51) Saridis, G. N.; Lee, C. G. An approximation theory of optimal control for trainable manipulators. IEEE Trans. Syst. Man Cybern. Part B 1979, 9, 152−159. (52) Courant, R.; Hilbert, D. Methods of Mathematical Physics; WileyVCH: Hoboken, NJ, 1989; Vol. 1. (53) Sastry, S.; Bodson, M. Adaptive Control: Stability, Convergence, and Robustness; Prentice-Hall, Inc.: Upper Saddle River, NJ, 1989. (54) Ioannou, P. A.; Sun, J. Robust adaptive control; PTR PrenticeHall: Upper Saddle River, NJ, 1996. (55) Wang, C.; Hill, D. J. Learning from neural control. IEEE Trans. Neural Netw. 2006, 17, 130−146. (56) Bitmead, R. Persistence of excitation conditions and the convergence of adaptive schemes. IEEE Trans. Inf. Theory 1984, 30, 183−191. (57) Watkins, C. J. C. H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279−292. (58) Ge, S. S.; Hang, C. C.; Lee, T. H.; Zhang, T. Stable Adaptive Neural Network Control; Kluwer: Boston, MA, 2001. (59) Khalil, H. K. Nonlinear Systems, 3rd ed, Prentice-Hall: Upper Saddle River, NJ, 2002.

