On-Line Optimization of Stochastic Processes

A. M. Morshedi, Department of Chemical Engineering, Pahlavi University, Shiraz, Iran
R. H. Luecke,* Department of Chemical Engineering, University of Missouri-Columbia, Columbia, Missouri 65201
A search method for optimization of stochastic systems is described in which an optimizer developed for noise-free systems is used and the effect of noise is reduced by a filter based on sequential regression analyses fitting past data to a quadratic. The value of the object function at the new search point, as predicted by the quadratic, is combined with the current measured value to give an unbiased, minimum-variance estimate. Use of this estimator yielded a significant improvement in the response of the search, but a wandering, random phenomenon was observed as the search approached the neighborhood of the optimum. This behavior indicated that noise reduction was insufficient in that region, so a further modification was made to increase filtering as the search progressed. Substantial improvements were obtained, and this modified search was far more efficient than other methods on comparable problems.
Introduction

On-line optimization of chemical processes has received extensive attention in the literature. Sophisticated mathematical and statistical concepts have been invoked to define "best" solutions to this problem, and yet few workers in the field would pronounce the problem solved. The reason for this state of affairs is that chemical processes encompass incredibly complicated physical situations. The number and nature of the constraints and the accuracy and precision of data available to describe a given process may vary widely, and a particular mathematical approach may apply in only very limited situations.

One basic problem in optimizing process systems is often described as obtaining an accurate estimate of the state of the system. More precisely, one might characterize this problem as obtaining an accurate estimate of the system state insofar as it is affected by variations in selected, known inputs. It is impossible, or at least uneconomical, to measure all significant inputs to a typical chemical process. Thus it is inevitable that statistical concepts be invoked to account for the effects of unknown or unmonitored deterministic relationships. In a large number of processes, some of the stochastic loads are of similar magnitude to the controlled inputs.

A classification of optimization techniques may be made on the degree of knowledge of the process assumed (or needed) to be available. The most ambitious methods use state estimators based on process models. Of those estimation techniques based on unsteady-state models, the Kalman filter can be shown to be "best." In this context "best" refers to minimum variance estimates. Nonetheless, there is wide use of other types of dynamic filters, since the form of system constraints is often not strictly compatible with the derivation of the Kalman filter. Furthermore, the Kalman filter is often quite sensitive to inaccuracies in model form or parameters.

In many process situations the quantity and quality of available observations are such that little is gained by construction of models, either steady state or unsteady state. In these cases, the search for optimum steady-state conditions may be treated in a manner analogous to conventional optimization of a known mathematical function. The main difference is that the data from the process are corrupted by noise.
The approach to solution of the noisy search problem can be divided into three categories: stochastic approximation, multifactor experimental design or interpolating methods, and optimal replication techniques.

Stochastic approximation search patterns are designed in such a way as to have at least convergence in probability regardless of noise levels (Wilks, 1963). Despite proven convergence as the number of stages approaches infinity, the convergence rate of stochastic approximation tends to be discouragingly slow. Many studies have been initiated in order to accelerate the convergence rate (see, for example, the bibliography in Wilde (1964), Venter (1967), and also Albert and Gardner (1967)). In some other investigations, proof of convergence has been sacrificed in an effort to speed the response rate (Ahlgren and Stevens, 1966; Kushner, 1963). Despite these efforts the convergence rate remains slow in terms of engineering requirements. One basic weakness of stochastic approximation is that in the limiting case where the noise is zero, the method is similar to either steepest ascent or cyclic minimization of one variable at a time. It is well known that convergence of these methods is very slow, particularly with response functions having long, narrow ridges. The presence of noise does not, of course, improve this performance. Saridis (1974) has reviewed stochastic approximation for a wide variety of problems, including the steady-state optimization problem discussed here. Accelerated stochastic approximation yielded the best results, but this method has been shown to be less efficient than the methods of Ahlgren and Stevens (1966) and of Luecke (1970), which are used as the standards for comparison in this work.

In another approach to noisy optimization, ordinary steepest ascent (descent) is used as a search procedure. The search is continued until the rate of convergence becomes very slow. At this point the response surface is approximated by an interpolating polynomial of second (or higher) order, and the extremum of the approximating polynomial is found. The predicted extremum is then used as the center point for a new multifactor experiment to predict a new extremum (Box and Wilson, 1951; Box and Hunter, 1957). A problem with this sequence is the difficulty of approaching the extremum closely enough for the interpolating polynomial to approximate the response surface. The interpolation method (along with steepest ascent) also
encounters difficulty with both sharp ridges and asymmetric surfaces (Heaps and Wells, 1965). Little work has been done with the interpolating polynomial regarding constrained optimizations.

Search algorithms which have been designed for noise-free functions can also be used as search procedures for optima of noisy functions if the effect of the noise present in the measurement of the function is sufficiently decreased. The advantage of this approach is that efficient optimizers, such as those designed to move smoothly along steep ridges, can be used. Other features of well-known optimization routines, such as bounded optimization, can also be available. In one study (Luecke, 1970), the effect of error was decreased by averaging replications of measurements of the object function. The performance was equivalent or superior to that of stochastic approximation on all test functions, but especially on the test functions with terms higher than quadratic. Ram Narayan (1971) explored optimization of a simulated complex chemical process using this approach, in which the number of replications was doubled whenever the search failed to make progress. This procedure converged somewhat faster than the best results described for modified stochastic approximation on a complicated four-dimensional noisy problem (Ahlgren and Stevens, 1966).

In the present work, replication is used, but information accumulated in previous search stages is also utilized in the search. An interpolating polynomial is constructed using data available from past search stages. This polynomial is used to predict the functional value at the new search point, and the predicted value is combined with the current measured value by means of two weighting factors which minimize the variance.
Mathematical Development

In this section a sequential filter is developed for reducing the noise of a given measurement of a variable y which is a function of k independent states. Suppose that a search procedure on a noisy function has proceeded for n − 1 steps; then for n − 1 values of the state vector x_j, the value of the output ȳ_j has been measured. A second-degree polynomial may be fitted to these data by means of regression and used to predict the dependent variable at the new value of the independent variable x_n which has been specified by the search procedure. The value of the dependent variable ȳ_n can be measured at the new point x_n and combined with the predicted value to form an unbiased, minimum-variance estimate of the true value of the dependent variable.

Since in the neighborhood of an extremum any smooth surface osculates with a quadratic surface, it is assumed that in the region of immediate interest the surface can be approximated by a polynomial of degree two

y_n = β_0 x_0n + β_1 x_1n + . . . + β_k x_kn + β_11 x_1n² + . . . + β_kk x_kn² + β_12 x_1n x_2n + . . . + β_(k−1,k) x_(k−1)n x_kn   (1)

where y_n is the response at the nth point, and x_0n, x_1n, . . ., x_kn, x_1n², . . ., x_kn², x_1n x_2n, . . ., etc., are the independent variables. With the variable x_0n given the value of unity for all values of n, the estimates β̂_0, β̂_1, etc., of the coefficients β_0, β_1, etc., may be found by linear regression and used with eq 1 to predict the value of the true dependent variable y_n at x_n. Let y_pn denote this predicted value.

At this point two sets of information are available: (1) the predicted value y_pn of the true dependent variable at x_n; (2) the corrupted measurement of the dependent variable at x_n, denoted ȳ_n. The objective is to combine these two estimates to form a better estimate of the value of the true dependent variable at x_n.
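To make the regression step concrete, the following is a minimal sketch, not from the paper, of fitting eq 1 for the two-variable case used later in the numerical experiments; the function names and the NumPy least-squares call are our own choices.

```python
import numpy as np

def quad_features(x):
    """Regressor row of eq 1 for one point x = (x1, x2) with k = 2:
    [x0, x1, x2, x1^2, x2^2, x1*x2], where x0 = 1."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

def fit_quadratic(points, y_meas):
    """Least-squares estimates beta_hat = (X'X)^-1 X' y_bar of the
    eq 1 coefficients from past (x_j, y_bar_j) pairs."""
    X = np.array([quad_features(x) for x in points])
    beta_hat, *_ = np.linalg.lstsq(X, np.asarray(y_meas, dtype=float), rcond=None)
    return X, beta_hat
```

With beta_hat in hand, `quad_features(x_new) @ beta_hat` gives the predicted value y_pn used below.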
To combine y_pn and ȳ_n, two constant weighting factors w_1 and w_2 are chosen such that

ŷ = w_1 y_p + w_2 ȳ   (2)

where ŷ is the estimate of the true value. (Note that the subscripts n have been suppressed.) The values of w_1 and w_2 are to be chosen in such a way that ŷ is unbiased and has minimum variance. For ŷ to be unbiased, E(ŷ) must be equal to the true value of the dependent variable, i.e., E(ŷ) = y. Taking the expectation of both sides of eq 2 gives

E(ŷ) = w_1 E(y_p) + w_2 E(ȳ)   (3)

Since ȳ is the unbiased measurement of y, E(ȳ) = y. If eq 1 is the true form of the exact function, then y_p is an unbiased estimate of y; that is, E(y_p) = y. Thus from eq 3

w_1 + w_2 = 1   (4)
To minimize the variance of ŷ given by eq 2, it must be known whether or not y_p and ȳ are statistically independent. Two derivations of w_1 and w_2 are presented: one where y_p and ȳ are assumed to be independent and one where these two values are not statistically independent. If the true curve is a quadratic, then E(ȳ) is exactly equal to E(y_p) and the two values are not independent. However, if there are significant nonlinearities in the true curve, the expectation of the prediction from eq 1 will not be the true value of y. This will reduce the degree of dependence between y_p and ȳ. In that case additional weight should be given to ȳ as being a better estimate of the local value of y.

a. Derivation of w_1 and w_2 Assuming Statistical Independence of y_p and ȳ. Taking the variance of both sides of eq 2 gives

σ(ŷ) = w_1² σ(y_p) + w_2² σ(ȳ)   (5)

w_1 and w_2 are chosen such that σ(ŷ) in eq 5 is a minimum subject to the constraint eq 4. To do this, eq 4 can be adjoined to the objective eq 5 by using a Lagrange multiplier. The new equation, which is then an unconstrained objective function, has the form

F(w_1, w_2, λ) = w_1² σ(y_p) + w_2² σ(ȳ) − λ(w_1 + w_2 − 1)   (6)

To find a stationary point of this function, which is an extremum of the constrained objective function, F is differentiated with respect to w_1, w_2, and λ and the results are set equal to zero. The resulting three equations can be solved simultaneously to give

w_1 = σ(ȳ)/[σ(y_p) + σ(ȳ)]   (7)

and

w_2 = σ(y_p)/[σ(y_p) + σ(ȳ)]   (8)
Using these values for w_1 and w_2, eq 2 can be written as

ŷ = [σ(ȳ) y_p + σ(y_p) ȳ]/[σ(y_p) + σ(ȳ)]   (9)
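As a sketch, the independent-case combination of eqs 7–9 is a one-line inverse-variance average; the variances are assumed known here, whereas in practice they must be estimated.

```python
def combine_independent(y_p, y_bar, var_p, var_m):
    """Eqs 7-9: minimum-variance unbiased blend of the regression
    prediction y_p and the measurement y_bar when the two are
    statistically independent.  var_p = sigma(y_p), var_m = sigma(y_bar)."""
    w1 = var_m / (var_p + var_m)   # eq 7
    w2 = var_p / (var_p + var_m)   # eq 8
    return w1 * y_p + w2 * y_bar   # eq 9
```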
b. Derivation of w_1 and w_2 with ȳ and y_p Not Independent. In minimizing σ(ŷ) in eq 2 with respect to w_1 and w_2, suppose that y_p and ȳ are not statistically independent. The variance of both sides of eq 2 becomes

σ(ŷ) = w_1² σ(y_p) + w_2² σ(ȳ) + 2 w_1 w_2 cov(y_p, ȳ)   (10)
As before, the constraint eq 4 can be adjoined to the objective (eq 10) by using a Lagrange multiplier. The new equation, which is then an unconstrained objective function, has the form

F(w_1, w_2, λ) = w_1² σ(y_p) + w_2² σ(ȳ) + 2 w_1 w_2 cov(y_p, ȳ) − λ(w_1 + w_2 − 1)   (11)
Differentiating eq 11 with respect to w_1, w_2, and λ, setting the resulting equations to zero, and solving yields

w_1 = [σ(ȳ) − cov(y_p, ȳ)]/[σ(y_p) + σ(ȳ) − 2 cov(y_p, ȳ)]   (12)

and

w_2 = [σ(y_p) − cov(y_p, ȳ)]/[σ(y_p) + σ(ȳ) − 2 cov(y_p, ȳ)]   (13)
Using these values for w_1 and w_2, eq 2 can be written as

ŷ = {[σ(ȳ) − cov(y_p, ȳ)] y_p + [σ(y_p) − cov(y_p, ȳ)] ȳ}/[σ(y_p) + σ(ȳ) − 2 cov(y_p, ȳ)]   (14)
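The correlated-case weights translate directly into code; this sketch assumes the three second-moment quantities are supplied (eq 21 below gives the covariance estimate).

```python
def combine_dependent(y_p, y_bar, var_p, var_m, cov_pm):
    """Eqs 12-14: minimum-variance unbiased blend when y_p and y_bar
    are correlated, with cov_pm = cov(y_p, y_bar)."""
    denom = var_p + var_m - 2.0 * cov_pm
    w1 = (var_m - cov_pm) / denom   # eq 12
    w2 = (var_p - cov_pm) / denom   # eq 13
    return w1 * y_p + w2 * y_bar    # eq 14
```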
c. Derivation of the Equation for Calculation of cov(y_p, ȳ). To solve for w_1 and w_2 from eq 12 and 13, the quantity cov(y_p, ȳ) must be known. This quantity can be calculated by the following

cov(y_p, ȳ) = E(y_p ȳ) − E(y_p) E(ȳ)   (15)

If eq 1 is a good approximation of the true equation, then E(y_p) = y; that is, y_p is an unbiased estimate of y. Equation 15 can be rewritten as

cov(y_p, ȳ) = E(y_p ȳ) − y²   (16)
d. Computation of E(y_p ȳ). From the regression eq 1 (using vector notation)

y_pn = x_n′ β̂   (17)

where x_n′ is taken to include the second-order products given in eq 1. Multiplying both sides of eq 17 by ȳ_n, the measurement of the dependent variable at x_n, and replacing β̂ by (X′X)⁻¹X′ȳ (the solution to eq 17) gives

y_pn ȳ_n = x_n′ (X′X)⁻¹ X′ ȳ ȳ_n   (18)
Taking the expectation of eq 18 gives

E(y_pn ȳ_n) = x_n′ (X′X)⁻¹ X′ E(ȳ ȳ_n)   (19)

Since E(ε_n) = 0,

E(ȳ_n²) = E(y_n + ε_n)² = E(y_n² + 2 y_n ε_n + ε_n²) = y_n² + σ(ε_n)   (20)
Thus, if all of the measurements are independent, the covariance of y_p and ȳ_n at x_n can be written as

cov(y_p, ȳ_n) = x_n′ (X′X)⁻¹ x_n σ(ε_n)   (21)
The estimate for the covariance is found by replacing the true values y_1, y_2, . . . by their measured values. The filter is an integral part of the search procedure. Other filters could be postulated, but most of the commonly used methods are not applicable because either they require more process information than is assumed here or they are designed to give the mean value of a series of measurements. Note that eq 1 is used to predict a value for y at x_n using values of ȳ measured at other points. The regression estimate given by eq 1 has the advantage of being based on the functional form expected for the response surface. The value is relatively easy to generate, and pertinent input can be restricted to the approximate region of the optimum.
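Pulling the pieces together, one full filter stage might look as follows. This is our own simplified reading, not the authors' code: it reuses the helpers sketched earlier, fits eq 1 to the past points only (so the fresh measurement is independent of the prediction and the eq 7–9 weights apply), takes var(y_p) as the usual least-squares prediction variance sigma(eps)·x_n′(X′X)⁻¹x_n, and assumes the noise variance is known rather than estimated.

```python
import numpy as np

def filter_step(points, y_meas, x_new, y_bar_new, noise_var):
    """One stage of the sequential filter (simplified): predict from the
    quadratic fit to past data, then blend with the new measurement.
    Requires at least 6 past points so that X'X is invertible."""
    X, beta_hat = fit_quadratic(points, y_meas)    # past data only
    x_n = quad_features(x_new)
    y_p = x_n @ beta_hat                           # prediction at x_new
    var_p = noise_var * (x_n @ np.linalg.solve(X.T @ X, x_n))
    return combine_independent(y_p, y_bar_new, var_p, noise_var)
```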
Results

Since the complexity of the mathematics precludes analytic comparison between stochastic search methods, a numerical approach was used for this comparison on two well-known test functions (Rosenbrock and Storey, 1966)

y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]²   (22)

y = 10(x_2 − x_1²)² + (1 − x_1)²   (23)
Equation 22 is a tilted ellipse with a minimum at x_1 = x_2 = 5 and the starting point at x_1 = 0, x_2 = −1. Equation 23 is a function having a long, narrow curved valley with its spine near the parabola x_1² = x_2. The initial point is [−1.2, 1.0], so that the search will have to descend into the valley and then follow the valley floor around the curve to the point of the minimum (1, 1). Use of these functions allows comparison with the numerical results obtained in earlier investigations (Luecke, 1970).

Noise was added to the value of these functions at a specified value of the independent variable in the form of a digitally generated, normally distributed random number with a specified mean and standard deviation. For each test function, two different noise levels were considered. Equation 22 was corrupted with noise levels of 0.2 and 0.5, which give percent errors of 2.2% and 5.56%, respectively, of the range of the function, where the range is the difference between the value of the function at the start of the search and the value of the function at the optimum. Equation 23, a much more difficult function to optimize, was corrupted with noise levels of 0.02 and 0.05, which give percent errors of 0.136% and 0.34%, respectively, of the range of the function.

During the search sequence each new point would be expected to be closer to the optimum, and hence more useful in predicting the value of the next search points. Since the later points should be weighted more than the earlier points, whenever the number of observations in the regression analysis exceeds a selected upper bound, the first half of the observations, which are old values, are discarded. Only the more recent points found by the search procedure are used in the regression analysis, so that the quadratic eq 1 may better describe that limited portion of the surface. The bound on the number of observations used in the regression analysis was considered to be a parameter; three values were selected for the upper bound, namely 8, 20, and 30. The pattern optimization search procedure described by Rosenbrock and Storey (1966) has been used as the search procedure in this work.
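For illustration, here is a small sketch, with names of our own choosing, of the noise corruption and the window rule just described, using eq 23 as the test function.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_eq23(x):
    """Test function of eq 23: a long, narrow curved valley."""
    return 10.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def noisy_eval(f, x, sigma):
    """Function value corrupted by zero-mean normal noise with the
    specified standard deviation, as in the experiments."""
    return f(x) + sigma * rng.normal()

def trim_window(points, y_meas, bound):
    """Whenever the stored observation count exceeds the selected upper
    bound (8, 20, or 30 in the experiments), discard the oldest half."""
    if len(points) > bound:
        half = len(points) // 2
        points, y_meas = points[half:], y_meas[half:]
    return points, y_meas
```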
Numerical Results

a. Function of Eq 22. The results of search experiments on the function given by eq 22 are shown in Figures 1 through 5. In each case the value of the objective function is plotted against the number of functional evaluations. In Figure 1 the convergence rate of the algorithm described here is compared to that using the replication technique (Luecke, 1970). The algorithm here not only uses the prediction of eq 2 but also allows for doubling of the number of functional replications whenever the search method ceases to make progress. The average of the replications provides the measured value in eq 2. This mean value is also incorporated, with the coordinates x_n, as a single point in the regression equation. At the beginning of the search the progress toward the optimum is good. However, difficulties arise as the optimum is approached.
Figure 1. Comparison of convergence rates of replication technique and algorithm using prediction by regression. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.2.

Figure 2. Phase plane plot of prediction by regression method. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.2.

Figure 3. Convergence rate of modified regression method. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.2.

Figure 4. Phase plane plot of modified regression method. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.2.
These difficulties can be observed especially with 8 regression observations (Figure 1), in which the value of the object function increased after attaining a low level. The cause of this behavior can be better visualized using the phase plane plots, that is, x_1 plotted vs. x_2, as shown in Figure 2. While efficient early progress is noted, many random movements in the vicinity of the optimum can be observed where the gradient has become very small. The small gradient allows the effect of noise present in the value of the object function to become more pronounced. To eliminate this nearly random wandering, it is necessary to reduce the effective noise level in the function at a higher rate than in the earlier part of the search. Two methods were explored: (1) the
rate of replication increase was raised so that the number of experiments at each point was increased by a factor of 8 after a search failure rather than a factor of 2; and (2) the number of search points used in the regression equation was increased by a factor of 2 after each search failure. Implicit in this latter change is the assumption that search failures occur mainly near the optimum, where the quadratic approximation remains valid for larger numbers of search points. A combination of both of these changes was also tested and should be considered a standard part of the algorithm. The result of these three changes, shown in Figure 3, is a substantial improvement over the basic algorithm; the general level of final values was two to eight times smaller. The phase plane plot (Figure 4) shows that the wandering in the neighborhood of the optimum was reduced.

The same series of experiments was repeated on the same function but at a higher noise level, 0.5 instead of 0.2. The basic algorithm showed significantly better performance than the replication technique (Figure 5); the modified algorithm was as good but also maintained a high level of consistent performance (Figure 6).
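A sketch of this failure-triggered schedule follows; the naming is ours, since the paper states the factors but not an implementation.

```python
def on_search_failure(n_replications, regression_bound):
    """After each search failure: raise the per-point replication count
    eightfold (instead of the basic twofold) and double the bound on
    the number of points kept for the regression."""
    return 8 * n_replications, 2 * regression_bound
```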
Figure 5. Convergence rate of algorithm using prediction by regression. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.5.

Figure 7. Convergence rate of regression algorithm. Function: y = 10(x_2 − x_1²)² + (1 − x_1)². Noise level = 0.02.
b. Function of Eq 23. The convergence rates for the basic algorithm on the higher order function of eq 23 are shown in Figure 7. In all cases, important improvements over the replication technique are obtained. Some random wandering is noted as the optimum is approached, which can be reduced by the methods previously described.

c. Best Isothermal Reactor Yield. In order to illustrate the usefulness of this approach on process industry problems, an example is given of optimizing the yield of an isothermal chemical reactor. In this problem, given by Rosenbrock and Storey (1966), a set of parallel and consecutive chemical reactions among the species A, B, C, and D occurs:
(24)
The kinetic rate constants k_i obey the usual Arrhenius law

k_i = C_i exp(−E_i/RT)   (25)
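In code, eq 25 is a one-liner; the value and units chosen for the gas constant here are our assumption.

```python
import numpy as np

def rate_constant(C_i, E_i, T, R=8.314):
    """Arrhenius law of eq 25: k_i = C_i * exp(-E_i / (R * T)).
    Units of C_i and E_i must be consistent with R (J/(mol K) assumed)."""
    return C_i * np.exp(-E_i / (R * T))
```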
Figure 6. Convergence rate of modified regression method. Function: y = (x_1 − x_2)² + [(x_1 + x_2 − 10)/3]². Noise level = 0.5.
It is desired to determine the residence time and temperature which will give a maximum yield of product C. Values of the various constants are given by Rosenbrock and Storey (1966). The results shown in Figure 8 show that good search results are obtained even at moderate noise levels. It is seen that using the replication technique alone, convergence is not achieved even at the low noise level. Indeed, it should be noted that this is a very difficult search problem, and few search techniques are successful on it even in the absence of noise. It certainly would be unreasonable to expect a simple gradient method, for example, to work with noise present.
Figure 8. Convergence rate of modified regression method on isothermal reactor optimization.
Conclusions

As expected, higher convergence rates were usually obtained at lower noise levels in cases where the other factors were equal. The few instances where the opposite occurred, as a result of the random phenomena, were obtained in the immediate neighborhood of the optimum.

The best convergence rates were obtained using the larger number of points in the regression analysis. With only a small number of points, performance was often not much better than by replication alone. Again this would be expected, since the use of 8 pieces of data in an equation containing 6 constants could not be very effective in filtering noise. However, the larger number of points could encompass a larger area over which the function was poorly approximated by a quadratic surface. Thus it is most efficient to increase the number of regression points only as the optimum is approached. This effect may be seen on the higher order function (eq 23), where the advantage of better filtering with larger numbers of points seems to be counteracted in part by the poorer fit of the quadratic equation. In the region around the optimum, where the effect of noise is more serious and where the quadratic approximation is better, the number of observations in the regression may be effectively increased.

The most encouraging results were obtained when the rate of increase in replications was raised along with increasing the number of regression points after each search failure. These results, shown in Figures 3 and 4, show the highest convergence rates with only minimal "wandering" in the vicinity of the optimum. The complete algorithm should include these features.
Acknowledgment The authors appreciate the technical assistance received from the Computing Center and drafting services of Pahlavi University.
Nomenclature
A, B, C, D = chemical species, eq 24
cov = covariance
C_i = reaction rate constant, eq 25
d = sum of all the successful moves
E = expected value
E_i = reaction activation energy
F = unconstrained objective function to be minimized
G = mean value of the function
k_i = reaction rate constant
R = gas constant
[R_k] = set of k-tuples (x_1, . . ., x_k)
T = observed temperature
σ( ) = variance
x_in = ith independent variable at the nth experiment
x_n′ = vector of independent variables
X = matrix of independent variables (subscript n is implicit)
y_n = true value of the function at x_n
y_i = response at the ith point
ȳ_n = measured value of y_n at x_n
y_pn = predicted value at x_n
ŷ_n = filtered value at x_n
ȳ = vector of observations
w_1, w_2 = weighting factors in eq 2
Greek Letters
β_i = coefficients in eq 1
β̂ = vector of estimates of the coefficients
ε_n = measurement error at the nth experiment
λ = Lagrange multiplier
Literature Cited

Ahlgren, T. D., Stevens, W. P., Ind. Eng. Chem., Process Des. Dev., 5, 290 (1966).
Box, G. E. P., Wilson, K. B., J. Roy. Statist. Soc., Ser. B, 13, 1 (1951).
Box, G. E. P., Hunter, J. S., Ann. Math. Statist., 28, 195 (1957).
Heaps, H. S., Wells, R. V., Can. J. Chem. Eng., 319 (1965).
Kushner, H. J., Trans. ASME, J. Basic Eng., 85, 157 (1963).
Luecke, R. H., Ind. Eng. Chem., Fundam., 7, 523 (1967).
Luecke, R. H., Int. J. Control, 12, 21 (1970).
Ram Narayan, M. G., M.S. Thesis, University of Missouri, 1970.
Rosenbrock, H. H., Storey, C., "Computational Techniques for Chemical Engineers," Pergamon Press, London, 1966.
Saridis, G. N., IEEE Trans. Autom. Control, AC-19, 798 (1974).
Stewart, G. W., J. Assoc. Comput. Mach., 14, 72 (1967).
Venter, J. H., Ann. Math. Statist., 38, 181 (1967).
Wilde, D. J., "Optimum Seeking Methods," Prentice-Hall, Englewood Cliffs, N.J., 1964.
Wilks, S. S., "Mathematical Statistics," Wiley, New York, N.Y., 1963.
Received for review November 24, 1975
Accepted June 23, 1977