Ind. Eng. Chem. Res. 2011, 50, 1389–1399
Postdecision-State-Based Approximate Dynamic Programming for Robust Predictive Control of Constrained Stochastic Processes

Wee Chin Wong and Jay H. Lee*

School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive NW, Atlanta, Georgia 30332, United States

*To whom correspondence should be addressed. E-mail: jay.lee@chbe.gatech.edu.

This article proposes the use of approximate dynamic programming (ADP) based on the "postdecision state" as a computationally efficient tool for the systematic handling of uncertainty in the context of stochastic process control. The need to handle uncertainty systematically, which is unmet by current advanced process control methodologies such as model predictive control (MPC), is highlighted through two classes of problems commonly encountered in process control: (i) scenarios where processes, made to operate near constraint boundaries for economic reasons, exhibit frequent excursions into the infeasible region as a result of exogenous disturbances, and (ii) situations where there exists significant interaction between state or parameter estimation and control, such that the often employed assumptions of certainty equivalence or separation fail to hold. Most previous work on ADP, as specialized for process control problems, is better suited for deterministic problems. For stochastic problems, such as those treated in this work, the postdecision-state formulation confers immediate practical benefits since it allows the efficient use of off-the-shelf optimization solvers found in all MPC technology.

1. Introduction

The current generation of advanced process control (APC) methods does not handle uncertainty in a systematic, optimal fashion. Take model predictive control (MPC), the current de facto APC solution in the process industries,1 as an example. In MPC, the current control action is obtained by minimizing a cost criterion defined on a finite time interval. In its standard form, nominal parameter values and deterministic trajectories of future disturbance signals are assumed in order to obtain an optimization problem amenable to online solution. Parameter uncertainties or uncertainties in the disturbance signals are seldom accounted for directly in the optimization. Even though the mechanism of receding horizon control imparts some degree of robustness through feedback updates, the resulting control can be substantially suboptimal since uncertainty is not treated explicitly in the optimization of the input. The lack of systematic handling of uncertainty is one of the main limitations of MPC at the current time.

Most of the past attempts at addressing uncertainty within the framework of MPC are reflected in so-called "robust MPC" formulations based on the objective of minimizing the worst-case scenario, at the expense of overly conservative policies.2,3 For a review of robust MPC formulations, readers are referred to ref 4. One fundamental problem may be that MPC in its pure form solves an open-loop optimal control problem at each sample time and thus may not be the best-suited framework for achieving optimal feedback control under uncertainty. An open-loop formulation ignores the beneficial effect of information feedback pertaining to the revelation of uncertainties and, therefore, can lead to substantial suboptimality, even if uncertainties were accounted for explicitly in the optimization. Multiscenario formulations5 of MPC have also been developed, but they do not give closed-loop optimal policies in general because they ignore the closed-loop nature of the problem. Stochastic programming based methodologies6 allow for recourse actions at the computational expense of enumerating an

exponentially growing number of scenarios with respect to the number of stages; the approach is all but impractical beyond two-stage problems.

The first intended contribution of this paper is to highlight two categories of practical situations where a systematic treatment of uncertainty is necessary for satisfactory closed-loop behavior. The first category involves stochastic optimal control in the presence of constraints. It is often cited that the optimal operating point of a unit operation lies at constraint boundaries.7 Indeed, the popularity of MPC is oftentimes ascribed to its ability to handle constraints. Since the effects of uncertainties are not accounted for explicitly, MPC tends to place the system trajectory next to a constraint boundary; as a result, excursions into the infeasible region occur frequently. Adjustments to the set-point are oftentimes required so as to "back off"8,9 from the constraint boundaries. As will be demonstrated in section 4, this approach can be substantially suboptimal, and the postdecision-state-based ADP approach we propose in this manuscript tends to yield a control policy that gives significantly tighter control than the one derived from MPC with shifted set-points.

The other class of problems involves situations where significant interaction exists between estimation and control. In this case, the control action directly impacts the propagation of future uncertainty. For example, dual control10,11 deals with the control of systems whose parameters are initially unknown; the control action then serves the dual role of probing the system as well as regulating it. Popular APC approaches, such as MPC, do not achieve an optimal balance between the two objectives.12 Similarly, in general stochastic nonlinear control problems, due consideration needs to be given to the interplay between control and state estimation.13 Although a state estimator, for example, the extended Kalman filter (EKF), is generally capable of returning an entire (albeit approximate) probability distribution, most APC solutions employ only a point estimate. Very often, uncertainty in the point estimate is completely ignored. The implicit assumptions of separation and certainty equivalence do not hold in general, thereby resulting in poor closed-loop performance.

The second intended contribution of this paper is a new approach to solving stochastic optimal control problems via approximate dynamic programming (ADP) based on a so-called "postdecision"14,15 (in contrast to the more commonly encountered "predecision") state variable. Dynamic programming (DP), first developed by Bellman,16 is a well-known general approach for solving multistage, stochastic control problems. However, methods approximating exact DP are often required because of the computational intractability stemming from a "curse of dimensionality". In previous studies, a particular ADP strategy, based on closed-loop simulations and function approximation, had been successfully employed as a computationally attractive method for obtaining near-optimal closed-loop policies for process control problems.17-20 However, the predecision-state-based ADP formulation involves an optimization over an expected quantity, which can be computationally cumbersome. The use of the postdecision state allows the generally noncommutative optimization and expectation operations to be interchanged, affording a significant reduction in the off-line and online computation. For example, the online computation of the optimal control action involves a single-stage deterministic optimization (without any expectation operator). The proposed formulation allows the efficient use of off-the-shelf optimization solvers, which form the cornerstone of all MPC technology. Another computational benefit is that the off-line computation of the approximate value function too can be significantly simpler, because it decomposes a large optimization into many small ones that can be solved in parallel.

The rest of the article is organized as follows. In section 2, DP based on the postdecision state, the definition of which was first introduced in ref 14, is presented as a method for stochastic optimal control. An approximate DP algorithm, along the lines of that proposed in refs 17-19 and 21, is given in section 3. However, the new version of the algorithm uses postdecision states and employs a smooth value function for the use of off-the-shelf optimization packages. The proposed algorithm is demonstrated on two representative examples in sections 4 (a problem where constraint violations pose an issue) and 5 (a problem involving interaction between estimation and control). Conclusions and future work are presented in section 6.

2. Dynamic Programming Based on the Postdecision State

2.1. Classical Formulation of Dynamic Programming Based on the Predecision State. Consider the optimal control of the following discrete-time stochastic system:

x_{t+1} = f(x_t, u_t, \omega_t)   (1)

where x_t ∈ X ⊆ R^{n_x} refers to the system state at discrete time index t, u_t ∈ U ⊆ R^{n_u} is a control or action vector, and ω_t is an exogenous, unmeasured, white noise signal. x may contain physically meaningful states as well as measured disturbances and parameters subject to uncertainty. f refers to the single-stage transition function. For problems where the system's dynamics are represented by ordinary differential equations, f is the result of numerical integration across a single sample time, with the vectors u and ω held constant. Throughout this article, it is assumed that full state feedback is available. In the event that only output feedback is available, x is interpreted as an information vector (I) that contains the sufficient statistics of the state estimate's probability density function; f in this case gives the temporal relationship between I at time indices t and t + 1. Such lifting is possible because the information vector is governed by another related set of equations (i.e., the filter dynamics).

Let µ ∈ Γ be a state-feedback control policy that maps the state vector to the action vector, where Γ represents the set of all admissible, stationary control policies. J^µ(x) will be used to denote the "cost-to-go" or value function, defined as the infinite-horizon, discounted sum of the stage-wise costs achieved by the control policy µ starting from an arbitrary state x:

J^{\mu}(x) = E\Big[ \sum_{k=0}^{\infty} \gamma^{k} \phi\big(x_k, u_k = \mu(x_k)\big) \,\Big|\, x_0 = x \Big]   (2)

φ represents a prespecified stage-wise cost (e.g., φ(x, u) := |x|_Q^2 + |u|_R^2, where Q and R are weighting matrices), and γ ∈ [0, 1) is a discount factor. The goal then is to find the optimal (stationary) policy µ*: X → U that yields the minimum cost-to-go function, as shown below:

J^{\mu^*}(x) = \min_{\mu \in \Gamma} E\Big[ \sum_{k=0}^{\infty} \gamma^{k} \phi\big(x_{t+k}, u_{t+k} = \mu(x_{t+k})\big) \,\Big|\, x_t = x \Big]   (3)

J^{µ*}: X → R_0^+ is the optimal cost-to-go function and is an indication of the attractiveness of a given state in terms of the future costs. By definition, J^{µ*}(x) ≤ J^µ(x), ∀x and ∀µ ∈ Γ. On the basis of the principle of optimality,16 one is able to rewrite eq 3, thereby obtaining Bellman's optimality equations:

J^{\mu^*}(x) = \min_{u \in U} \big\{ \phi(x, u) + \gamma E_{(\omega|x)}[ J^{\mu^*}(f(x, u, \omega)) ] \big\} = (T J^{\mu^*})(x)   (4)

T above represents the single-pass DP operator represented by the minimization operation. With J^{µ*}(x) at hand, the optimal policy can be implicitly defined (and implemented) through the online solution of the following single-stage optimization:

\mu^*(x) = \arg\min_{u \in U} \big\{ \phi(x, u) + \gamma E_{(\omega|x)}[ J^{\mu^*}(f(x, u, \omega)) ] \big\}   (5)

In essence, the optimal control problem can be considered solved once J^{µ*} is known. The repeated, off-line application of T to an arbitrarily initialized cost-to-go leads to convergence and underpins the idea behind value iteration (VI):

J^{\mu^*}(x) = T J^{\mu^*}(x) = \lim_{i \to \infty} (T)^{i} J^{\mu}(x), \quad \forall \mu, x   (6)

Note that eqs 4 and 5 involve a minimization over an expected quantity whose probability distribution function is generally unknown.

2.2. Interchanging Optimization and Expectation Operations via the Postdecision State. The postdecision state, x^p ∈ X^p ⊆ R^{n_x}, is introduced to alleviate the problem of having to perform minimization of an expected quantity. x^p refers to the system state immediately after the control vector is introduced to the system but before the uncertainty is realized. To be precise, we restrict our discussion from here on to those f that can be decomposed into the following subtransitions:

x_t^p = f_1(x_t, u_t)   (7)

x_{t+1} = f_2(x_t^p, \omega_t)   (8)

where the composition of f_1 and f_2 is equivalent, in effect, to f in eq 1. [Such a decomposition is possible when one uses the appropriate definitions of f_1 and f_2 and the appropriate definitions of the pre- and postdecision state variables (e.g., concatenations may be required). The case of additive noise is perhaps the most instructive; it spans a large number of examples found in the process industries.] Note that f_1 describes a deterministic transition between the predecision state variable (x) and x^p, whereas f_2 involves the transition due to uncertainty after the control action is implemented. The value function of x^p, J^{µ,p}(x^p), may be defined in terms of the value function of x as

J^{\mu,p}(x_t^p) = E_{(\omega|x_t^p)}[ J^{\mu}(x_{t+1}) ], \quad \forall \mu   (9)

By considering the optimal policy µ* and substituting eq 4 into eq 9, the min and E operators are interchanged, yielding

J^{\mu^*,p}(x_t^p) = E_{(\omega|x_t^p)}\Big[ \min_{u_{t+1} \in U} \big\{ \phi(x_{t+1}, u_{t+1}) + \gamma J^{\mu^*,p}(x_{t+1}^p) \big\} \Big]   (10)

The single-stage online optimization that defines the control policy given the value function is also streamlined as follows:

\mu^*(x) = \arg\min_{u \in U} \big\{ \phi(x, u) + \gamma J^{\mu^*,p}(f_1(x, u)) \big\}   (11)

As with the predecision-state formulation, the optimal control problem is essentially solved once J^{µ*,p} is known. The postdecision-state DP operator, T^p, may be similarly defined so that VI (based on x^p) converges to J^{µ*,p}:

J^{\mu^*,p}(x^p) = T^p J^{\mu^*,p}(x^p) = \lim_{i \to \infty} (T^p)^{i} J^{\mu,p}(x^p), \quad \forall \mu, x^p   (12)
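To make the interchange of the min and E operators concrete, the following minimal Python sketch (our illustration, not code from the paper) contrasts one predecision backup (eq 4), where the expectation must be evaluated inside the minimization, with one postdecision backup (eq 10), where each inner problem is deterministic and only the solved minima are averaged. The scalar dynamics, stage cost, noise samples, and quadratic value-function guesses are placeholder assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder problem data (illustrative only): scalar state/input, additive noise.
gamma = 0.98
omega_samples = np.random.normal(0.0, 0.1, size=200)   # Monte Carlo samples of the disturbance
phi = lambda x, u: x**2 + 0.1 * u**2                    # stage-wise cost
f1  = lambda x, u: 0.9 * x + u                          # deterministic subtransition (eq 7)
f2  = lambda xp, w: xp + w                              # stochastic subtransition (eq 8)

def predecision_backup(x, J, u0=0.0):
    """One application of T (eq 4): the expectation sits inside the minimization,
    so every candidate u requires averaging J over all noise samples."""
    def q(u):
        x_next = f2(f1(x, u[0]), omega_samples)         # vectorized over the noise samples
        return phi(x, u[0]) + gamma * np.mean(J(x_next))
    return minimize(q, x0=[u0]).fun

def postdecision_backup(xp, Jp, u0=0.0):
    """One application of T^p (eq 10): for each sampled noise realization the inner
    problem is deterministic; the expectation is an average of the solved minima."""
    vals = []
    for w in omega_samples:
        x_next = f2(xp, w)                              # predecision state reached from xp
        def q(u):
            return phi(x_next, u[0]) + gamma * Jp(f1(x_next, u[0]))
        vals.append(minimize(q, x0=[u0]).fun)
    return np.mean(vals)

# Example: back up an arbitrary quadratic guess of the value function at one point.
J_guess  = lambda x: 2.0 * x**2
Jp_guess = lambda xp: 2.0 * xp**2
print(predecision_backup(1.0, J_guess), postdecision_backup(1.0, Jp_guess))
```

Note that the deterministic inner problems of the postdecision backup are exactly the kind handled by standard math programming solvers, and they can be distributed across the noise samples, which is the computational point made below.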

Convergence (see eqs 6 and 12) to a unique point is guaranteed because both T and T^p are γ-contraction maps (see section 3.2). The introduction of the postdecision state allows the generally noncommutative min and E operators to be interchanged. To see the benefits achieved, suppose that the evaluation of the expectation is to be done by taking a sample average. Then, (i) eq 10, used during the off-line value iteration, consists of an independent collection of deterministic optimization problems, each of which is easier to solve using off-the-shelf solvers than the one in the predecision case, (ii) the off-line optimizations for different samples may be run in parallel, and (iii) online optimizations are streamlined since they do not involve an expectation operator at all.

In general, DP does not yield analytical forms of the value function. A straightforward, though often impractical, solution for performing VI would involve discretization of the state space and the subsequent application of Bellman's equations for each of the points until convergence. In process control problems, because of the continuous nature of the state and action spaces, such numerical solutions quickly become bottlenecked as the problem dimensions grow; the number of discretized points, and hence the computation, grows exponentially with the dimension. To circumvent this so-called "curse of dimensionality", one needs to resort to approximations that involve (i) an intelligent state-sampling/discretization scheme to avoid having to run VI over all (or an exorbitant number of) points in X and (ii) an efficient representation of the cost-to-go.19,22 For the latter, an appropriately chosen function approximator is required for the purpose of value-function generalization. Discussions specific to the case involving the postdecision state (x^p) are found in the following section.

3. Postdecision-State-Based Approximate Dynamic Programming

3.1. Basic Idea. It is often the case that only a small portion of X, X^p, and U will ever be visited under the optimal or


high-quality suboptimal control policies. Denote the subsets of the predecision and postdecision state spaces that are "relevant", that is, visited with nontrivial probability under the optimal control, as X*_REL and X*,p_REL, respectively. A premise behind ADP is that, though such sets would be continuous, they would in general be much smaller than X and X^p. The key notion is that if one could identify X*_REL and X*,p_REL, or parsimonious supersets of them, one could sample the sets with sufficient density to perform the dynamic programming at significantly reduced computation. Of course, the difficulty is that it is not easy to obtain such a set ahead of time without knowing the optimal controller itself.

This work extends the ADP approach proposed in refs 17-19 and 21 for process control applications to the postdecision-state setting. An additional modification is to the structure of the function approximator, to ensure its smoothness. The main idea is to employ carefully designed simulation schemes for the sampling of the state space and function approximation (for the purpose of cost-to-go interpolation) to this end. The off-line VI computations are as follows:

1. Identify a finite-sized, "relevant" predecision state space, X_sam ⊂ X, and its postdecision counterpart, X^p_sam. Note that each (x, x^p) tuple is related through eq 7, where x^p is defined to be the state variable obtained immediately after an action is taken but before the realization of any uncertainties. Also, the numbers of elements in X_sam and X^p_sam are equal, that is, |X_sam| = |X^p_sam| = N. Identification of these finite sets is achieved, for instance, by simulating different combinations of operating conditions and disturbances under some initial suboptimal policy (or set of policies), µ_[0]. Dithering may also be introduced for the purpose of random exploration.

2. Assign a cost-to-go to all elements of X^p_sam. The initial, finite-sized cost-to-go table, denoted by T^p_[0] := {x^p, Ĵ^{µ*,p}_[0](x^p) | x^p ∈ X^p_sam}, is thus obtained. The circumflex symbol is used to emphasize the approximate nature of the cost-to-go sequence, even in the limit of the iterative process. Exact initialization is not critical per se, since it will be improved in the following step of value iteration. In the sequel, initialization proceeds as follows (a minimal sketch is given after Figure 1):

\hat{J}^{\mu^*,p}_{[0]}(x^p) = E_{(\omega|x^p)}\big[ \hat{J}^{\mu^*}_{[0]}(f_2(x^p, \omega)) \big]   (13)

where Ĵ^{µ*}_{[0]}(x) = E[Σ_{k=0}^∞ γ^k φ(x_k, µ_[0](x_k)) | x_0 = x]. For practical implementation, the exact computation of the infinite sum may be approximated by a summation up to a finite length t̃ that corresponds to γ^t̃ ≤ τ, where τ is a sufficiently small number. When there are several candidate initial suboptimal policies, the most promising one, yielding the lowest cost, may be used for initialization of the value function.

3. Obtain converged cost-to-go values for X^p_sam through VI, yielding the sequence of value tables {T^p_[0], T^p_[1], ...}. Since the VI requires the evaluation of the cost-to-go function for postdecision states (x^p) not necessarily in X^p_sam, a function approximator is needed to interpolate among the stored values (see the discussion in section 3.2). A certain choice of function approximator ensures that each pass of the iteration is a contraction map with a unique fixed point (see section 3.2). In other words, each step of the modified VI, indexed by i, involves

\hat{J}^{\mu^*,p}_{[i+1]}(x^p) = \big( T^p F(\hat{J}^{\mu^*,p}_{[i]}) \big)(x^p), \quad \forall x^p \in X^p_{sam}   (14)

Here F(Ĵ^{µ*,p}_{[i]}) denotes the cost-to-go function approximator based on the stored values from T^p_[i]. Termination occurs when |Ĵ^{µ*,p}_{[i+1]} - Ĵ^{µ*,p}_{[i]}|_∞ is less than a predefined tolerance. Alternatively, the relative measure |(1/Ĵ^{µ*,p}_{[i]})(Ĵ^{µ*,p}_{[i+1]} - Ĵ^{µ*,p}_{[i]})|_∞ may be used for cases where the cost-to-go values are exceedingly large.

4. Return to step 1, since the relevant domain of the state space may not be properly ascertained a priori. If this is the case, the control policy for the next round of simulations would be the one instructed by the current converged cost-to-go table.

The aforementioned steps are summarized in Figure 1.

Figure 1. Proposed ADP algorithm based on the postdecision state.
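As a concrete illustration of the initialization in step 2, the sketch below estimates Ĵ^{µ*}_{[0]} by truncated closed-loop rollouts under an initial suboptimal policy µ_[0] and then averages over realizations of f_2 to approximate eq 13. The simulator, policy, noise model, and tolerance τ are placeholders of our own, not quantities from the paper.

```python
import numpy as np

gamma, tau = 0.98, 1e-3
t_trunc = int(np.ceil(np.log(tau) / np.log(gamma)))    # smallest t~ with gamma**t~ <= tau

def rollout_cost(x0, policy, f1, f2, phi, sample_noise, rng):
    """Truncated estimate of J^{mu[0]}(x0): discounted sum of stage costs under mu[0]."""
    x, total = x0, 0.0
    for k in range(t_trunc):
        u = policy(x)
        total += (gamma ** k) * phi(x, u)
        x = f2(f1(x, u), sample_noise(rng))
    return total

def init_postdecision_value(xp, policy, f1, f2, phi, sample_noise, n_mc=100, seed=0):
    """Eq 13: J^p_[0](x^p) = E_w[ J_[0](f2(x^p, w)) ], estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    costs = [rollout_cost(f2(xp, sample_noise(rng)), policy, f1, f2, phi, sample_noise, rng)
             for _ in range(n_mc)]
    return float(np.mean(costs))

# Illustrative placeholders (not from the paper):
f1 = lambda x, u: 0.9 * x + u
f2 = lambda xp, w: xp + w
phi = lambda x, u: x**2 + 0.1 * u**2
policy = lambda x: -0.5 * x                            # some initial suboptimal policy mu[0]
noise = lambda rng: rng.normal(0.0, 0.1)
print(init_postdecision_value(1.0, policy, f1, f2, phi, noise))
```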

The concept of the postdecision state was previously adopted in ref 23 in the context of solving Markov decision processes (MDPs). The authors used an approximate policy iteration scheme in which J^{µ,p}(x^p), ∀x^p ∈ X^p, is assumed to be linear in a set of basis functions (known or otherwise assumed to be orthogonal polynomials of sufficiently large degree). The coefficients are learned through a least-squares procedure once the system of interest is allowed to evolve according to the current policy, which is similar to step 1, where relevant states are collected. Unlike this work, which uses nonparametric function approximation, the main limitation there may be that suitable basis functions are difficult to ascertain in general.

3.2. Function Approximation and Stable Learning. The need for function approximation for the purpose of generalization has been discussed. Given a training set (that is, an existing value table, where the VI index is dropped in this subsection for notational compactness), T^p := {x^p(j), Ĵ^{µ*,p}(x^p(j))}_{j=1}^N, composed of a finite number (N) of inputs (x^p(j) ∈ X^p_sam) and target values (Ĵ^{µ*,p}(x^p(j))), a function approximator, F, whose domain is X^p, maps an arbitrary query point x_q^p ∈ X^p to (a subset of) the real line. The dominant and natural choice for function approximators has been parametric global approximators, such as neural networks, or the use of basis functions, such as high-order orthogonal polynomials or Fourier series.24,25 While this approach has met with some success in certain applications, for example, in backgammon,26 it is not immune from divergent behavior17,19 when employed in the context of ADP. In certain cases, the off-line iteration fails to converge, with the cost-to-go approximation showing nonmonotonic behavior or instability with respect to the iterations. The authors of ref 27 were the first to attribute this failure of function approximation to an "overestimation" effect. The author of ref 28 demonstrated that suboptimality can be severe when a global approximator with a linear combination of basis functions is employed, and the authors of ref 29 provide insightful illustrations showing the failure of popular function approximators during off-line learning.

There are considerably fewer papers that address function approximation schemes for problems with continuous state and action spaces.21,23 The problem of linear quadratic regulation, for which the value function is known to be quadratic in structure, is a noted exception. The authors of refs 30 and 31 proposed a kernel-based approach for problems with continuous states but finite actions and demonstrated convergence to the optimal cost-to-go function with an increasing number of samples and decreasing kernel bandwidth under a model-free scheme. The authors of ref 23 proposed a provably convergent approximate policy iteration under the assumption of known basis functions and other technical conditions.

Stable learning during the off-line value iteration step of the proposed ADP strategy is highly desirable, as it can be frustrating to run a large number of iterations only to have the result "blow up" all of a sudden because of some complicated coupling between the function approximation error and value iteration. To have provable convergence of the approximate value iteration (not necessarily to the optimal value function, however), one needs to use a function approximator with a certain property called "nonexpansion". Gordon32 discussed the viability of using such a class of function approximators. With such a choice, the overall operator composed of value iteration and then function approximation can be shown to be a contraction map, therefore ensuring convergence.

Definition: A γ-contraction mapping m (with respect to the infinity norm), defined on a normed vector space V and mapping elements of this space to itself, satisfies

|m(V_1) - m(V_2)|_{\infty} \le \gamma\, |V_1 - V_2|_{\infty}, \quad \gamma \in [0, 1)

where V_1 and V_2 are arbitrarily chosen elements of V.

Definition: When γ = 1, m: V → V is termed a nonexpansion.32

From Banach's fixed-point theorem, it can be easily shown that the iterated sequence {V, m(V), m^2(V), ...} converges to a unique fixed point. As explained earlier, the proposed ADP method starts with initial estimates Ĵ^{µ*,p}_{[0]}(x^p), ∀x^p ∈ X^p_sam. This is followed by function approximation (recall that this mapping is denoted by F) and an application of the DP operator, T^p, to yield Ĵ^{µ*,p}_{[1]}(x^p), ∀x^p ∈ X^p_sam. The process is then repeated. A sufficient condition for convergence is to demonstrate that the overall operator, T^p composed with the function approximator F, is also a contraction map. This, in turn, holds true if F is a nonexpansion map.

Proposition 0.1: T^p F is a γ-contraction map if F is nonexpansive.

Proof: Given arbitrary vectors Ĵ_1^{µ*,p}, Ĵ_2^{µ*,p} ∈ R^N, both of which correspond to the same X^p_sam,

|T^p F(\hat{J}_1^{\mu^*,p}) - T^p F(\hat{J}_2^{\mu^*,p})|_{\infty} \le \gamma\, |F(\hat{J}_1^{\mu^*,p}) - F(\hat{J}_2^{\mu^*,p})|_{\infty} \le \gamma\, |\hat{J}_1^{\mu^*,p} - \hat{J}_2^{\mu^*,p}|_{\infty}   (15)

The first line is true since T^p is a γ-contraction map defined on the space of value functions. The second inequality follows if one employs a function approximator with the nonexpansion property. Function approximators that employ averaging, as defined below, can be shown to possess the nonexpansion property.

Definition: F is an averager if every fitted value is a weighted average of the target values, potentially with the addition of a bias term. Specifically,

F(\hat{J}^{\mu^*,p})(x_q^p) = \beta_0\big(x_q^p, \{x^p(j)\}_{j=1}^{N}\big) + \sum_{i=1}^{N} \beta_i\big(x_q^p, \{x^p(j)\}_{j=1}^{N}\big)\, \hat{J}^{\mu^*,p}(x^p(i))   (16)

Here, {β_i}_{i=0}^N ≥ 0 and Σ_{i=1}^N β_i ≤ 1. Note that the weights β are allowed to depend on the query point (x_q^p) and the input values ({x^p(i)}_{i=1}^N) but not on the target values. That such an averager is a nonexpansion (i.e., that eq 15 is true) is easily proved (see the Appendix).

3.3. Smooth Function Approximation via Kernel Regression. To use off-the-shelf optimization solvers, a preferred choice of F is one that is differentiable. Nadaraya-Watson kernel regression33 represents a popular class of smooth averagers used to approximate the conditional expectation of a random variable in a nonparametric manner. Consider approximating the conditional expectation of a random variable J ∈ R given the random variable x ∈ R^n. From simple probability, one has

E[J|x] = \int \frac{p(x, J)}{p(x)}\, J\, \mathrm{d}J   (17)

Using kernel density estimation (KDE) [also known as the Parzen window method; the KDE here is composed of a mixture of N Gaussians, each centered at a particular training point, x(j)], a popular nonparametric smooth density estimator, one may approximate p(x) as

\hat{p}(x) = \frac{1}{N} \sum_{j=1}^{N} K\!\left( \frac{|x - x(j)|_2}{\sigma} \right)   (18)

where x is an arbitrary query point, {x(j)}_{j=1}^N is a training set, and K(·) is a kernel function that decays exponentially with the squared distance between a query point and a particular point in the training set. A popular choice of K(·) is the Gaussian kernel (with a diagonal covariance matrix Σ_{n×n} := diag(σ^2, ..., σ^2)):

K\!\left( \frac{|x - x(i)|_2}{\sigma} \right) = \frac{1}{(2\pi\sigma^2)^{0.5 n}} \exp\!\big( -0.5\, \sigma^{-2}\, |x - x(i)|_2^2 \big)

The bandwidth parameter, σ, controls the trade-off between bias and variance errors and may be ascertained through cross-validation.33 From eqs 17 and 18, the function approximator for Ĵ^{µ*,p} is given by

F(\hat{J}^{\mu^*,p})(x_q^p) = \frac{1}{\sum_{j=1}^{N} K\!\left( \frac{|x_q^p - x^p(j)|_2}{\sigma} \right)} \sum_{j=1}^{N} K\!\left( \frac{|x_q^p - x^p(j)|_2}{\sigma} \right) \hat{J}^{\mu^*,p}(x^p(j))   (19)

From eq 19, it is immediately apparent that the value function at a query point is given by the weighted average of the value functions of neighboring points, with a larger contribution ascribed to closer neighbors. Another interpretation of Nadaraya-Watson kernel regression is that it gives a local constant fit to the data.33 Small values of σ imply that the values of relatively few neighbors (of the query point) are used in the averaging process for function approximation, and vice versa. It is noted that previous process-control-relevant ADP work21 based on the predecision state used a k-nearest neighbor (k-NN) approach for function approximation. While that yields a local, nonexpansive averager, it also corresponds to one that is nonsmooth.

3.4. Cautious Learning: Preventing Overextrapolations. It has been demonstrated21,34 that simply using a local averager (that is, with β_0 = 0), though guaranteeing convergence, does not necessarily give a converged function leading to stable closed-loop behavior. This is because the function approximation error can be significant, particularly when the training data are insufficient. Safeguards against overextrapolation during value iteration are often needed for the successful implementation of ADP schemes using local, nonexpansive averagers, as has been shown in ref 21. For a query point located in regions with little data present, distance-weighted averaging may fail to provide meaningful generalizations of the cost-to-go. As suggested in ref 21, the prevention of accepting such a query point may be achieved by including in the cost-to-go the term β_0 ≥ 0, a penalty that is imposed whenever the minimization step encounters a query point (x_q^p) located in a region with data sparsity:21

\beta_0(x_q^p) = A\, U\!\left( \frac{1}{\hat{p}(x_q^p)} - F \right) \left( \frac{\frac{1}{\hat{p}(x_q^p)} - F}{F} \right)^{2}   (20)

Here, F is a data-sparsity (defined as the inverse of data-density) threshold, A is a scaling parameter, and U is the step function that returns zero whenever its argument is nonpositive and unity otherwise (that is, when 1/p̂(x_q^p) > F). p̂(x_q^p) is a measure of data density, as ascertained by fitting a kernel density estimator over the independent variables of the training set. Using the tuning rules suggested in ref 21, F is given the value

F := \left[ \frac{1}{N} \sum_{i=1}^{N} K(1) \right]^{-1} = K(1)^{-1}   (21)

In other words, the threshold data density is the density obtained by assuming that an arbitrary query point is 1 standard deviation (σ) away from every other point in the training set. σ is obtained by cross-validating the initial value table, T^p_[0]. A is assigned a value such that the penalty term corresponds to 10 times the maximum estimate of the initial value function (that is, Ĵ^{µ*,p}_max := 10 |Ĵ^{µ*,p}_[0]|_∞) when a query point is assumed to be 3 standard deviations away from every other point in the training set. Namely, one gets

A := \hat{J}^{\mu^*,p}_{\max}\, \frac{1}{\big( K(3)^{-1} - K(1)^{-1} \big)^{2}}   (22)

To ensure the boundedness of β_0, a maximum value is imposed on the penalty term:

\beta_0(x_q^p) \leftarrow \min\big( \beta_0(x_q^p),\, \hat{J}^{\mu^*,p}_{\max} \big)   (23)


The bias term, as proposed above, is nonsmooth because of the presence of the step function in eq 20 and the upper bound on β_0 in eq 23. To obtain a smooth approximation of it, we propose to perform the following computations whenever a query point is encountered and its local data density, p̂(x_q^p), is calculated:

\hat{p}_1(x_q^p) \leftarrow \widetilde{\max}\big( \hat{p}(x_q^p), K(3) \big)   (24)

\hat{p}_2(x_q^p) \leftarrow \widetilde{\min}\big( \hat{p}_1(x_q^p), K(1) \big)   (25)

\widetilde{\max}(a, b) := \frac{b + a}{2} + \frac{\sqrt{(b - a)^2 + \delta}}{2}   (26)

\widetilde{\min}(a, b) := \frac{a + b}{2} - \frac{\sqrt{(b - a)^2 + \delta}}{2}   (27)

where \widetilde{\max} and \widetilde{\min} are smooth approximations of the max and min operators, respectively (see eqs 26 and 27), and δ is a small positive real number (set to 0.1% for the rest of this article). The effect of eqs 24 and 25 is to constrain the data density (at any query point) to the interval [K(3), K(1)]. Whenever the data density exceeds the upper bound of this interval, it is assigned a value of K(1), so that no penalty is assigned. Whenever the data density goes below the lower bound of this interval, it is assigned a value of K(3), so that the maximum penalty is given. With this, the function approximator has the following form:

F(\hat{J}^{\mu^*,p})(x_q^p) = A \left( \frac{\frac{1}{\hat{p}_2(x_q^p)} - F}{F} \right)^{2} + \frac{1}{\sum_{j=1}^{N} K\!\left( \frac{|x_q^p - x^p(j)|_2}{\sigma} \right)} \sum_{i=1}^{N} K\!\left( \frac{|x_q^p - x^p(i)|_2}{\sigma} \right) \hat{J}^{\mu^*,p}(x^p(i))

with A given by eq 22 and F by eq 21.

[

4. Constrained Stochastic Linear Systems

4.1. Double Integrator Problem. We start the demonstration of the algorithm by means of a simple constrained linear system. The following constrained double integrator problem, studied in ref 35 in the context of MPC for stochastic systems, is considered:

x_{t+1} = A x_t + B u_t + \Upsilon \omega_t   (28)

where A = [1 0; 1 1] (in Matlab notation), B = [1; 0], Υ = [1; 0], and ω_t is zero-mean, white Gaussian noise with second moment E[ω_t ω_t'] = 0.2, ∀t. The control objective is to bring the system from an arbitrary initial state ([0; 14] in the following simulations) to the origin while minimizing the actuator movement. The second dimension of the state vector is constrained to be non-negative (x_2 ≥ 0), and the input is constrained to u_t ∈ [-0.5, 0.5]. In this example, full state feedback is assumed.

The performance of a linear model predictive controller (LMPC), with the prediction and control horizon (p) set to a sufficiently long duration of 15 time steps, is compared against the proposed ADP approach based on the postdecision state variable. The postdecision state is defined as the quantity obtained after an action is taken but before the uncertainty is realized; that is, x_t^p := A x_t + B u_t. Since ω is an unbounded signal, a soft-constraint approach is employed for both LMPC (to avoid running into infeasibility issues) and the proposed ADP strategy. As is typically done, LMPC is implemented assuming that ω remains at its nominal value of 0 over the prediction horizon. Namely, for LMPC, the following is solved at each time step t:

\min_{\{\nu_k\}_{k=0}^{p}} \sum_{k=0}^{p} 0.7\, |\tilde{x}_{k+1}|_2^2 + 0.33\, |\nu_k|_2^2 + 100\, |\varepsilon_k|_2^2   (29)

where x̃_0 is initialized as x_t, and ε_k, k = 0, 1, ..., p, are non-negative auxiliary decision variables representing the least amount of slack required to make the LMPC problem feasible; that is, [0 1] x̃_k + ε_k ≥ 0. These inequalities are easily incorporated into the math program defined by eq 29. The relationship between x̃_{k+1} and (x̃_k, ν_k) is given by eq 28, with the noise term assumed to be zero throughout the entire prediction horizon. Also, the input vector is constrained to satisfy the aforementioned bounds of ±0.5.

For the proposed ADP approach, the discount factor is set to a value close to unity, that is, γ = 0.98. Furthermore, the stage-wise cost is modified to penalize deviations from the state constraints. Namely, the stage-wise cost used during VI is given by

\phi(x_t, u_t) := 0.70\, |x_t|_2^2 + 0.33\, |u_t|_2^2 + 100\, \max(0, -[0\ 1]x_t)^2
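For concreteness, the sketch below sets up the double integrator of eq 28 and the modified stage-wise cost above, and computes one control move through the single-stage optimization of eq 11 over the postdecision state Ax + Bu. The quadratic stand-in for the learned value function and the choice of a bounded scalar solver are our own assumptions; in the paper the value function comes from the converged kernel-regression table.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Double integrator of eq 28: x_{t+1} = A x_t + B u_t + Y w_t
A = np.array([[1.0, 0.0], [1.0, 1.0]])
B = np.array([1.0, 0.0])
Y = np.array([1.0, 0.0])
gamma = 0.98

def stage_cost(x, u):
    """Modified stage-wise cost used during VI: quadratic terms plus a penalty on x2 < 0."""
    return 0.70 * float(x @ x) + 0.33 * u**2 + 100.0 * max(0.0, -x[1]) ** 2

def adp_action(x, J_hat):
    """Eq 11 specialized to eq 28: minimize phi(x,u) + gamma * J^p(Ax + Bu) over u in [-0.5, 0.5]."""
    obj = lambda u: stage_cost(x, u) + gamma * J_hat(A @ x + B * u)
    res = minimize_scalar(obj, bounds=(-0.5, 0.5), method="bounded")
    return res.x

# Placeholder for the learned postdecision value function (NOT the converged table):
J_hat = lambda xp: 5.0 * float(xp @ xp)
x = np.array([0.0, 14.0])                  # initial state used in the simulations
u = adp_action(x, J_hat)
w = np.random.normal(0.0, np.sqrt(0.2))    # E[w^2] = 0.2
x_next = A @ x + B * u + Y * w
print(u, x_next)
```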

Hard constraints on u are imposed during the off-line value iteration process and during the online implementation of the ADP-based controller.

To construct X^p_sam, the aforementioned LMPC controller is used to conduct closed-loop experiments bringing the system from 40 different initial predecision states to the origin. (Note that the initial state used for online testing is different from these 40 initial states.) Namely, all combinations of the sets {-2, -1, 0, 1, 2} and {-4, -2, 0, 2, 4, 6, 8, 10} are considered to create the values for the first and second dimensions of the initial state, respectively. Consequently, a total of 3500 training points, whose initial cost-to-go values were initialized by computing the cost for LMPC over a sufficiently long horizon (of 120), was obtained as a result of the initialization scheme. For the purpose of function approximation, kernel regression was employed with the bandwidth, σ, set to 0.10. To avoid overextrapolation, the parameters were set as A = 2.6 × 10^3 and F = 0.1036. Value iteration converged within 50 iterations, with the relative error termination criterion set to

\left| \frac{\hat{J}^{\mu^*,p}_{[i+1]} - \hat{J}^{\mu^*,p}_{[i]}}{\hat{J}^{\mu^*,p}_{[i]}} \right|_{\infty} \le 0.1

Results from 100 stochastic realizations are presented in Table 1. As can be seen, the proposed ADP controller has an average finite-horizon score (denoted by Ê in the table) an order of magnitude lower than that of a deterministic approach typified by LMPC. In particular, LMPC suffers from excessively high variance in terms of closed-loop performance.

Table 1. Double Integrator Example: Comparing Closed-Loop Performance
score Ê[Σ_{t=0}^{30} φ(x_t, u_t)]:  ADP, 1600;  LMPC, 10 000

A look at the time series plots of the second dimension of x for both methods (see Figure 2) reveals that LMPC results in significant constraint

violation. On the other hand, the majority of the realizations based on the ADP approach do not violate the lower bound constraint. A plot of the average trajectory is presented in Figure 3. As can be seen, the proposed ADP approach automatically recedes from the constraint and settles (on average) to a value of 5 for the second dimension and 0 for the first (not shown). It must be noted that employing slMPC with a set-point shifted to [0; 5]' does not produce an equivalent result. Figure 4 serves to emphasize this point; it is evident that excursions into the infeasible region are still significant.

Figure 2. Double integrator example: x2 vs t for 100 realizations. Lower bound for x2 is 0.

Figure 3. Mean trajectory (〈x2〉 vs t) of 100 stochastic realizations.

Figure 4. LMPC with set point shifted to [0; 5]': x2 vs t for 100 realizations.

4.2. Linearized Bioreactor. Consider the maximization of a chemostat's productivity subject to constraints on raw material conversion. Such requirements are commonly found within the process industries. The governing equations of an archetypal chemostat are as follows:

\dot{x}_1 = x_1 \frac{\mu_{\max} x_2}{\kappa + x_2} - x_1 u

\dot{x}_2 = u\,[x_{2f} - x_2] - \frac{\mu_{\max}}{Y} \frac{x_1 x_2}{\kappa + x_2}

where x = [x_1; x_2] ∈ R^2 is the state vector composed of the instantaneous concentrations of the product (x_1) and substrate (x_2), respectively, and u ≥ 0, the dilution rate, is the non-negative manipulated variable. x_{2f} refers to the instantaneous concentration of the substrate feed. The maximum specific growth rate µ_max is set to 1, Y to 1, and κ to 0.02. In the following, the minimum acceptable conversion, ψ := 1 - (x_2/x_{2f}), is set to be 95%. Productivity is defined as P_t := x_{1,t} u_t. Upon linearization about steady-state conditions corresponding to 95% conversion and a mean feed concentration of 1, we obtain

\frac{\mathrm{d}x}{\mathrm{d}t} = \Phi_c x + B_c u + \Gamma_c x_{2f}   (30)

With a sampling time of 0.5 (dimensionless) units and x_{2f} simulated to fluctuate around a mean value of 1 while perturbed by zero-mean, white Gaussian noise with a covariance of 10^{-3}, we obtain the concatenated, discrete-time linear system

\begin{pmatrix} x_{t+1} \\ x_{2f,t+1} \end{pmatrix} = \begin{pmatrix} \Phi & \Gamma \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_t \\ x_{2f,t} \end{pmatrix} + \begin{pmatrix} B \\ 0 \end{pmatrix} u_t + \begin{pmatrix} 0 \\ 1 \end{pmatrix} \omega_t, \quad \omega_t \sim N(0, 10^{-3})   (31)
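The nonlinear chemostat model above is easy to simulate directly; the following sketch integrates the ODEs with the stated parameters (µ_max = 1, Y = 1, κ = 0.02) and evaluates the conversion ψ and productivity P. The constant dilution rate and initial condition are illustrative placeholders, not operating data from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

mu_max, Y, kappa = 1.0, 1.0, 0.02

def chemostat(t, x, u, x2f):
    """Nonlinear chemostat ODEs: x1 = product, x2 = substrate, u = dilution rate."""
    x1, x2 = x
    growth = mu_max * x2 / (kappa + x2)
    dx1 = x1 * growth - x1 * u
    dx2 = u * (x2f - x2) - growth * x1 / Y
    return [dx1, dx2]

# Placeholder operating point: constant dilution rate and nominal feed concentration.
u, x2f = 0.7, 1.0
sol = solve_ivp(chemostat, (0.0, 20.0), [0.5, 0.05], args=(u, x2f),
                t_eval=np.linspace(0.0, 20.0, 41))

x1, x2 = sol.y
conversion = 1.0 - x2 / x2f          # psi = 1 - x2/x2f (required to stay >= 0.95)
productivity = x1 * u                # P_t = x1,t * u_t
print(conversion[-1], productivity[-1])
```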

To investigate how well the aim of maximizing productivity can be fulfilled while obeying the conversion constraint, the performance of linear MPC is compared against that of the proposed ADP strategy, assuming that full feedback (of the state and the instantaneous feed concentration) is available. For MPC, a prediction and control horizon, p, of 10 sample units is used. The following math program is solved at each time instant:

\min_{\{\nu_k\}_{k=0}^{p}} \sum_{k=0}^{p} \big( |\tilde{P}_{k+1} - P^*|_2^2 + 100\, |\varepsilon_k|_2^2 \big)

where the associated quadratic program is initialized with x̃_0 = x_t, the relationship between x̃_{k+1} and x̃_k is governed by the discrete-time dynamics shown above (assuming no temporal changes in x_{2f}), and the corresponding productivity and conversion are given by P̃_k := x̃_k ν_k and ψ̃_k := 1 - (x̃_{2,k}/x_{2f,t}), respectively. ε_k ≥ 0 is a non-negative slack variable: ε_k + ψ̃_k ≥ 0.95.

For the proposed ADP approach, γ is set to 0.98, and the stage-wise cost is defined as φ(x_t, u_t) := |P_t - P^*|_2^2 + 100 max(0, 0.95 - ψ_t)^2. To determine X^p_sam, an MPC controller (with a horizon length of 10 time units) was used to conduct closed-loop experiments regulating the system at an initial state corresponding to a conversion of 0.95. A total of 250 training points was obtained from the initialization scheme. Kernel regression was employed for function approximation with the bandwidth, σ, set to 0.12, A to 1.50, and F to 0.08, to prevent overextrapolation.

Results from a typical realization are depicted in Figure 5. The premise behind a sound control scheme is to regulate the system at an equilibrium point that corresponds to the largest possible value of the dilution rate without exceeding the conversion bound, so that productivity is maximized. It is apparent that the ADP-based approach, compared to MPC, results in minimal constraint violation, albeit at a lower productivity.

Figure 5. Constrained linearized chemostat: closed-loop performance of a typical realization. ADP, solid line (-); slMPC, dotted line (···); lower bound on conversion, dash-dot (-·-).

5. Problem Involving Interaction between Estimation and Control

Most of the work on MPC has assumed full state feedback or readily available point state estimates, the latter being the more likely scenario. The implicit assumption of the separation principle between controller and estimator design, while accurate for linear systems, does not necessarily hold for nonlinear ones. In the general case, one would expect the quality of the state estimate to significantly affect the control action and vice versa. In other words, for nonlinear stochastic control problems, there exist significant interactions between state estimation and control, such that popular open-loop optimal control formulations (such as MPC) might give poor performance. In this example, the proposed ADP approach is applied to a system that is weakly unobservable at the reference state. The same problem was studied by Hovd and Bitmead,13 who modified the

optimization problem within MPC to account for the quality of the state estimates. Consider the following dynamical system:

x_{t+1} = x_t + 0.1 x_t u_t + u_t + \omega_t   (32)

y_t = x_t^3 + v_t   (33)

where x_t ∈ R^{n_x} refers to the system state at discrete time index t, u_t ∈ R^{n_u} is the manipulated variable, and y_t ∈ R^{n_y} is the observed variable. ω_t (the state excitation noise) and v_t (the measurement noise) are uncorrelated, zero-mean, white Gaussian noise signals with covariances of Q = 0.01 and R = 0.1, respectively. Since full-state feedback is assumed to be unavailable, state estimation is performed via an extended Kalman filter (EKF). The time and measurement update equations are given by eqs 34 and 35, respectively:

x_{t|t-1} = x_{t-1|t-1} + 0.1\, x_{t-1|t-1} u_{t-1} + u_{t-1}
P_{t|t-1} = (1 + 0.1 u_{t-1})^2 P_{t-1|t-1} + Q   (34)

R^{\varepsilon}_{t|t-1} = (3 x_{t|t-1}^2)^2 P_{t|t-1} + R
K_t = P_{t|t-1} (3 x_{t|t-1}^2) (R^{\varepsilon}_{t|t-1})^{-1}
x_{t|t} = x_{t|t-1} + K_t (y_t - x_{t|t-1}^3)
P_{t|t} = P_{t|t-1} - K_t (3 x_{t|t-1}^2) P_{t|t-1}   (35)

The control objective is to bring the system from an arbitrary initial state to the origin while minimizing the actuator movement.
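The EKF recursions of eqs 34 and 35 are scalar for this system and can be transcribed directly; the sketch below does so with Q = 0.01 and R = 0.1 as stated. The input value and the simulated plant step in the usage portion are placeholders of our own.

```python
import numpy as np

Q, R = 0.01, 0.1

def ekf_time_update(x_post, P_post, u):
    """Eq 34: propagate the estimate and its variance through x_{t+1} = x + 0.1*x*u + u + w."""
    x_prior = x_post + 0.1 * x_post * u + u
    P_prior = (1.0 + 0.1 * u) ** 2 * P_post + Q
    return x_prior, P_prior

def ekf_measurement_update(x_prior, P_prior, y):
    """Eq 35: correct with the cubic measurement y = x^3 + v; the linearized gain uses H = 3*x^2."""
    H = 3.0 * x_prior ** 2
    R_innov = H ** 2 * P_prior + R                  # innovations covariance R^eps
    K = P_prior * H / R_innov                       # Kalman gain
    x_post = x_prior + K * (y - x_prior ** 3)
    P_post = P_prior - K * H * P_prior
    return x_post, P_post

# One closed-loop step with a placeholder input and a simulated plant (illustrative only):
rng = np.random.default_rng(1)
x_true, x_est, P = 8.0, 8.0, 1.0
u = -0.5
x_true = x_true + 0.1 * x_true * u + u + rng.normal(0.0, np.sqrt(Q))
y = x_true ** 3 + rng.normal(0.0, np.sqrt(R))
x_pred, P_pred = ekf_time_update(x_est, P, u)
x_est, P = ekf_measurement_update(x_pred, P_pred, y)
print(x_est, P)
```

Note how the gain vanishes as the estimate approaches zero (H = 3x^2 → 0), which is the loss of observability exploited in the discussion below.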

The performance of MPC is compared against the proposed ADP approach. For MPC, the following open-loop optimization is solved at each time instant t:

\min_{\{\nu_k\}_{k=0}^{p}} \sum_{k=0}^{p} \{ 2 \tilde{x}_k^2 \} + 10\, \tilde{x}_{p+1}^2   (36)

where x̃_0 is initialized at x_{t|t}, the prediction horizon (p) is set to 9, and the relationship between x̃_{k+1} and x̃_k is governed by eq 32, with ω assumed to be at its mean value of zero across the prediction horizon. It is noted that increasing the prediction horizon length has little effect on the closed-loop performance.

For the proposed ADP approach, the discount factor, γ, is set to 0.98 to ensure the convergence of VI. The predecision hyper-state at time t contains the same information as the pair (x_{t|t}, P_{t|t}); that is, I_t = (x_{t|t}, P_{t|t})'. With this, the stage-wise cost used within the context of ADP is given by

\phi(I_t, u_t) = E_{(\cdot|I_t)}[ 2 x_t^2 ] = 2 x_{t|t}^2 + 2 P_{t|t}

From eq 34, it is natural to define the postdecision state as the quantities obtained immediately after the time-update step; that is, I_t^p := (x_{t+1|t}, P_{t+1|t}). Transitions from the post- to the predecision state occur through a realization of the innovations term, whose covariance is given by R^ε_{t+1|t}. To construct X^p_sam, the aforementioned NMPC controller (with a prediction horizon of 9) was used to conduct closed-loop experiments (each lasting 50 time units) to bring the system from 15 different predecision initial states, namely {-10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10, 13, 15, 17, 20}, to the origin. For each initial state, 5 different but representative trajectories of ω and v were considered. A total of 250 training points was obtained. The initial postdecision-state value functions were approximated by computing the cost corresponding to running the NMPC controller over a horizon of 120 time steps. For the purpose of function approximation, a bandwidth (σ) of 0.10 (obtained through leave-one-out cross validation on the initial postdecision-state value table, T^p_[0]) was used. To avoid overextrapolation, A was selected to be 8.7311 and F was set to 0.1036. A termination criterion of |Ĵ^{µ*,p}_{[i+1]} - Ĵ^{µ*,p}_{[i]}|_∞ ≤ 0.10 was enforced. VI converged within 50 iterations.

Results from a typical realization comparing the performance of the proposed ADP and NMPC controllers in bringing the system from an initial state of x = 8 to the origin are given in Figures 6-8. This choice of initial state is representative; similar results are obtained with other initial states. With NMPC, the system is brought to the origin quickly. However, once the state estimate gets to zero, there is a loss of observability (see eqs 34 and 35 and note the exponential increase in the prediction error shown in Figure 8b). Consequently, the control action remains at zero, resulting in the uncontrolled drift-like behavior observed in the state (see Figure 6b). For the proposed ADP controller, the system is perturbed whenever the actual state drifts a significant distance away from the reference state (as reflected in the increase in prediction error shown in Figure 8a). Periodic perturbations can be seen in Figure 7a. Similar observations have been made by Hovd and Bitmead.13 Closed-loop performance results for 100 stochastic realizations are given in Table 2, where Ê refers to an empirical average (over 100 realizations).

Figure 6. Interaction between estimation and control: x vs t for a typical realization.

Figure 7. Interaction between estimation and control: u vs t for a typical realization.

Figure 8. Interaction between estimation and control: P_{t|t} vs t for a typical realization.

Table 2. Interaction between Estimation and Control: Performance over 100 Realizations
score Ê[Σ_{t=0} 2 x_t^2]:  ADP, 120.67;  NMPC, 295.48

6. Conclusions and Future Research

In this paper, it has been argued that the postdecision-state formulation offers the ability to use deterministic math programming solvers both off-line and online and may therefore be more convenient than the more conventional predecision-state formulation. Key examples are used to highlight the importance of treating uncertainty in a systematic fashion and to demonstrate the proposed algorithm.

The standard ADP algorithm used throughout this work (that is, without step 4) involves fixing, at the beginning, the set of sampled states in X_sam/X^p_sam and ceasing to introduce any more samples as the learning proceeds. One potential problem with this is that X_sam/X^p_sam may not contain sufficient samples in all the important regions of the state space. Future work involves designing intelligent ways to expand X^p_sam. Another area under investigation is the integration of ADP with model predictive control. Possible approaches include (i) using the learned cost-to-go function to reduce the horizon size and (ii) a dual-mode implementation, where MPC replaces the ADP controller whenever one encounters a state that is sufficiently new that the information in the learned value function cannot be trusted.

Appendix: Proof That Averagers Are Nonexpansions

Consider arbitrary vectors Ĵ_1^{µ*,p}, Ĵ_2^{µ*,p} ∈ R^N, both of which correspond to the same X^p_sam. Taking an arbitrary query point, x_q^p, the following may be written:

|F(\hat{J}_1^{\mu^*,p})(x_q^p) - F(\hat{J}_2^{\mu^*,p})(x_q^p)|
  = \Big| \sum_{i=1}^{N} \beta_i\big(x_q^p, \{x^p(j)\}_{j=1}^{N}\big) \big( \hat{J}_1^{\mu^*,p}(x^p(i)) - \hat{J}_2^{\mu^*,p}(x^p(i)) \big) \Big|
  \le \sum_{i=1}^{N} \beta_i\big(x_q^p, \{x^p(j)\}_{j=1}^{N}\big)\, |\hat{J}_1^{\mu^*,p} - \hat{J}_2^{\mu^*,p}|_{\infty}
  \le |\hat{J}_1^{\mu^*,p} - \hat{J}_2^{\mu^*,p}|_{\infty}   (37)

The first line is true by definition. The second is true since all the weights (β_i) are non-negative, and the third is true since the weights (excluding the bias term) sum to less than unity. Since the above holds for an arbitrary x_q^p, one immediately concludes that

|F(\hat{J}_1^{\mu^*,p}) - F(\hat{J}_2^{\mu^*,p})|_{\infty} \le |\hat{J}_1^{\mu^*,p} - \hat{J}_2^{\mu^*,p}|_{\infty}
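As a numerical complement to the proof, the following spot check (with arbitrary data of our own) verifies that a kernel averager with nonnegative weights summing to one never amplifies the sup-norm distance between two stored value tables, as eq 37 asserts.

```python
import numpy as np

# Numerical spot-check of eq 37 for a kernel averager: the sup-norm distance between the
# fitted functions never exceeds that between the stored value tables. Data are arbitrary.
rng = np.random.default_rng(0)
Xp = rng.uniform(-1, 1, size=(50, 2))
queries = rng.uniform(-1, 1, size=(200, 2))
J1, J2 = rng.normal(size=50), rng.normal(size=50)
sigma = 0.3

def fit(J, xq):
    k = np.exp(-0.5 * np.sum((Xp - xq) ** 2, axis=1) / sigma**2)
    return np.dot(k, J) / np.sum(k)

lhs = max(abs(fit(J1, q) - fit(J2, q)) for q in queries)
print(lhs <= np.max(np.abs(J1 - J2)))      # True: the averager is a nonexpansion
```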

Literature Cited

(1) Morari, M.; Lee, J. H. Model predictive control: Past, present and future. Comput. Chem. Eng. 1999, 23, 667–682.
(2) Scokaert, P.; Mayne, D. Min-max feedback model predictive control for constrained linear systems. IEEE Trans. Autom. Control 1998, 43, 1136–1142.
(3) Lee, J. H.; Yu, Z. Worst-case formulations of model predictive control. Automatica 1997, 33, 763–781.
(4) Bemporad, A.; Morari, M. Robust model predictive control: A survey. In Robustness in Identification and Control; Lecture Notes in Control and Information Sciences, Vol. 245; Springer: Berlin/Heidelberg, Germany, 1999.
(5) Laird, C. D.; Biegler, L. T. Large-scale nonlinear programming for multi-scenario optimization. In Modeling, Simulation and Optimization of Complex Processes; Bock, H. G., Kostina, E., Phu, H. X., Rannacher, R., Eds.; Springer: Heidelberg, Germany, 2008; pp 323–336.
(6) Pena, D. M. d. l.; Bemporad, A.; Alamo, T. Stochastic Programming Applied to Model Predictive Control; 44th IEEE Conference on Decision and Control, and the European Control Conference, Seville, Spain, 2005; IEEE: Piscataway, NJ, 2005; pp 1361–1366.
(7) Arkun, Y.; Stephanopoulos, G. Studies in the synthesis of control structures for chemical processes. Part IV. Design of steady-state optimizing control structures for chemical process units. AIChE J. 1980, 26, 975–991.
(8) Narraway, L. T.; Perkins, J. D.; Barton, G. Interaction between process design and process control: economic analysis of process dynamics. J. Process Control 1991, 1, 243–250.
(9) Loeblein, C.; Perkins, J. D. Structural design for on-line process optimization: I. Dynamic economics of MPC. AIChE J. 1999, 45, 1018.
(10) Fel'dbaum, A. Dual control theory, part I. Autom. Remote Control 1960, 21, 874–880.
(11) Fel'dbaum, A. Dual control theory, part III. Autom. Remote Control 1960, 21, 1453–1464.
(12) Lee, J. M.; Lee, J. H. An approximate dynamic programming based approach to dual adaptive control. J. Process Control 2009, 19, 859–864.
(13) Hovd, M.; Bitmead, R. R. Interaction between Control and State Estimation in Nonlinear MPC. Presented at the 7th International Symposium on Dynamics and Control of Process Systems, Boston, MA, 2004.
(14) Roy, B. V.; Bertsekas, D. P.; Lee, Y.; Tsitsiklis, J. A neuro-dynamic programming approach to retailer inventory management. IEEE Conf. Decis. Control 1997, 4052–4057.
(15) Powell, W. B. What You Should Know about Dynamic Programming; Technical Report; Wiley Periodicals, Inc., Naval Research Logistics, 2009.
(16) Bellman, R. E. Dynamic Programming; Princeton University Press: Princeton, NJ, 1957.
(17) Lee, J. M.; Lee, J. H. Simulation-based learning of cost-to-go for control of nonlinear processes. Korean J. Chem. Eng. 2004, 21, 338–344.
(18) Lee, J. M.; Lee, J. H. Approximate dynamic programming-based approaches for input-output data-driven control of nonlinear processes. Automatica 2005, 41, 1281–1288.
(19) Lee, J. H.; Lee, J. M. Approximate dynamic programming based approach to process control and scheduling. Comput. Chem. Eng. 2006, 30, 1603–1618.
(20) Lee, J. M.; Lee, J. H. Value function-based approach to the scheduling of multiple controllers. J. Process Control 2008, 18, 533–542.
(21) Lee, J. M.; Kaisare, N. S.; Lee, J. H. Choice of approximator and design of penalty function for an approximate dynamic programming based control approach. J. Process Control 2006, 16, 135–156.
(22) Powell, W. B. Approximate Dynamic Programming: Solving the Curses of Dimensionality; Wiley Series in Probability and Statistics; Wiley-Interscience: New York, 2007.
(23) Ma, J.; Powell, W. A convergent recursive least squares policy iteration algorithm for multidimensional Markov decision processes with continuous state and action spaces. Presented at the IEEE Conference on Approximate Dynamic Programming and Reinforcement Learning (part of the IEEE Symposium on Computational Intelligence), Nashville, TN, 2009.
(24) Tsitsiklis, J.; Roy, B. V. Feature-based methods for large scale dynamic programming. Mach. Learn. J. 1996, 22, 59–94.
(25) Konidaris, G.; Osentoski, S. Value Function Approximation in Reinforcement Learning Using the Fourier Basis; Technical Report, 2008.
(26) Tesauro, G. Practical issues in temporal-difference learning. Mach. Learn. J. 1992, 8, 257–277.
(27) Thrun, S.; Schwartz, A. Issues in Using Function Approximation for Reinforcement Learning; Fourth Connectionist Models Summer School: Hillsdale, NJ, 1993.
(28) Sabes, P. Approximating Q-Values with Basis Function Representations; Fourth Connectionist Models Summer School: Hillsdale, NJ, 1993.
(29) Boyan, J. A.; Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. Adv. Neural Inf. Process. Syst. 1995, 7, 369–376.
(30) Bradtke, S. Reinforcement learning applied to linear quadratic regulation. Adv. Neural Inf. Process. Syst. 1993, 295–302.
(31) Ormoneit, D.; Sen, S. Kernel-based reinforcement learning. Mach. Learn. J. 2002, 49, 161–178.
(32) Gordon, G. J. Stable Function Approximation in Dynamic Programming; Technical Report; Carnegie Mellon: Pittsburgh, PA, 1995.
(33) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed.; Springer-Verlag: New York, 2008.
(34) Smart, W.; Kaelbling, L. Practical reinforcement learning in continuous spaces. 17th Int. Conf. Mach. Learn. 2000, pp 903–910.
(35) Batina, I. Ph.D. Thesis, Technische Universiteit Eindhoven, The Netherlands, 2004.

Received for review April 1, 2010. Revised manuscript received October 26, 2010. Accepted November 21, 2010.

IE1007882