Using the Method of Steepest Descent

HANS L. STEINMETZ

Six types of problems where this technique plus analog or hybrid computers provide a powerful method of attack

The method of steepest descent can be traced back to a paper published in 1847 by the French mathematician Cauchy (2). More recently the method has become popular as a tool for solving a variety of optimization problems. The method permits transposition of a problem of time-independent nature into a time-dependent domain. For this reason it is advantageous to use an analog computer for solution. Systems which require solution of ordinary differential equations (or whose independent variable can be made to increase linearly with time) while one searches for the best value of a specified set of parameters (as in some regression and dynamic optimization problems) may be solved by combining the method of steepest descent with conventional analog computing techniques. This is possible, as shown in later examples, without creating a discrepancy between the two time domains needed: one for solving the time-dependent equations of the physical system and one for solving the differential equations resulting from the application of the method of steepest descent.

The Method of Steepest Descent
Suppose that $n$ real-valued functions $f_1, \ldots, f_n$ of $m$ real variables $x_1, \ldots, x_m$ are given. These functions can be written as

$$f_1 = f_1(x_1, \ldots, x_m)$$
$$f_2 = f_2(x_1, \ldots, x_m)$$
$$\vdots$$
$$f_n = f_n(x_1, \ldots, x_m)$$

or, in a somewhat abbreviated form, $f_i = f_i(x_1, \ldots, x_m)$; $i = 1, \ldots, n$. Let the functions $f_i$ be defined and have continuous first-order partial derivatives within a closed region of a system with coordinates $x_1, \ldots, x_m$. Then, for a fixed $i$, the solution of

$$f_i(x_1, \ldots, x_m) = 0 \qquad (1)$$
yields only certain sets of values of the $x_1, \ldots, x_m$ for which Equation 1 holds. For this reason, any arbitrarily chosen particular set of values for the $x_1, \ldots, x_m$ would, in general, render a value of $f_i$ which is different from zero. This fact can be used to define an error

$$\epsilon_i = f_i(x_1, \ldots, x_m) \qquad (2)$$

which is zero only for those sets of values of the $x_1, \ldots, x_m$ for which Equation 1 holds. Obviously, $\epsilon_i = \epsilon_i(x_1, \ldots, x_m)$.

The method of steepest descent solves the problem of finding a particular set of values for $x_1, \ldots, x_m$ such that

$$S = \epsilon_1^2 + \ldots + \epsilon_n^2 = \sum_{i=1}^{n} \epsilon_i^2 \qquad (3)$$
has a local minimum. The quantity $S$ is termed the sum of the squared errors. Considering each $x_j$ to be a function of time, one now constructs a trajectory with time as parameter such that the corresponding value of $S$ along this trajectory decreases asymptotically to a local minimum as time increases (see Figure 1). Note that $S$, because of its definition as the sum of the squared errors, cannot become negative. Obviously the goal is reached if one can make

$$\frac{dS}{dt} < 0 \qquad (4)$$

AUTHOR Hans L. Steinmetz is Research Engineer at the Richmond Laboratory of Chevron Research Co., Richmond, Calif.
for all values of $t$ except at the minimum where it is zero. The total time derivative of $S$ can be obtained from Equation 3 as

$$\frac{dS}{dt} = 2 \sum_{i=1}^{n} \epsilon_i \frac{d\epsilon_i}{dt} \qquad (5)$$

The total time derivative of the error $\epsilon_i$ is

$$\frac{d\epsilon_i}{dt} = \sum_{j=1}^{m} \frac{\partial f_i}{\partial x_j} \frac{dx_j}{dt} \qquad (6)$$

Inserting Equation 6 into Equation 5 one obtains

$$\frac{dS}{dt} = 2 \sum_{i=1}^{n} \epsilon_i \sum_{j=1}^{m} \frac{\partial f_i}{\partial x_j} \frac{dx_j}{dt} \qquad (7)$$

which can be rewritten as

$$\frac{dS}{dt} = 2 \sum_{j=1}^{m} \left[ \sum_{i=1}^{n} \epsilon_i \frac{\partial f_i}{\partial x_j} \right] \frac{dx_j}{dt} \qquad (8)$$

after interchanging the two summations. Generating the time derivative of $x_1, \ldots, x_m$ such that

$$\frac{dx_j}{dt} = -\sum_{i=1}^{n} \epsilon_i \frac{\partial f_i}{\partial x_j} \qquad (9)$$

makes it possible to use the method of steepest descent. Substituting Equation 9 into Equation 8 gives

$$\frac{dS}{dt} = -2 \sum_{j=1}^{m} \left[ \sum_{i=1}^{n} \epsilon_i \frac{\partial f_i}{\partial x_j} \right]^2 \leq 0 \qquad (10)$$

Equation 10 states that, regardless of the sign of the time derivatives of the $x_1, \ldots, x_m$, the total time derivative of $S$ is always smaller than or equal to zero. This means that the sum of the squared errors as a function of time will continue to decrease until a local minimum is reached. A stationary point other than the desired minimum will most likely be overcome by machine noise. Thus

$$S \to S_{\min} \text{ as } t \to \infty \qquad (11)$$

(If it is desired to maximize $S$, a plus sign should be used at the right of Equations 9 and 10.) Note that the method of steepest descent as outlined here solves a static problem, namely that of finding the point in the $m$-dimensional coordinate system for which the sum of the squared errors is minimum. By virtue of its transformation into the time domain, a certain amount of time for computer solution is required. However, application of the method of steepest descent is possible also to dynamic systems (for which the variables of the system are functions of time $\tau$). Here, solution requires both time $t$ for finding $S_{\min}$ and time $\tau$ for integrating the system variables. To eliminate this apparent discrepancy in the variable time one must solve the complete problem in two different time domains. How this can be done on an analog computer is shown in later examples.

Applications of the Method of Steepest Descent

Some useful applications of the method of steepest descent follow.

1

Finding a Point in an m-Dimensional Coordinate System for Which the Sum of the Squared Errors Is Minimum
An example of this type of problem is depicted in Figure 2 where it is desired to find a point in the $x_1x_2$-plane for which the sum of the squared errors is minimum. Let, for example,

$$f_1 = x_1 - (x_2 - 1)^2 - 1 \qquad (12)$$

$$f_2 = x_1 + (x_2 - 0.5)^2 - 0.5 \qquad (13)$$
[Figure 1. Time characteristic of sum of squared errors]

[Figure 2. Trajectory of $(x_1, x_2)$ from an assumed initial starting point $P_1$ to the point $P_2$ at which $S = S_{\min}$]

Using the definition of error according to Equation 2, one obtains from Equation 9 the time derivatives of $x_1$ and $x_2$ as
$$\frac{dx_1}{dt} = -2x_1 - x_2 + 2.25 \qquad (14)$$

$$\frac{dx_2}{dt} = -4x_2^3 + 9x_2^2 - 8.5x_2 - x_1 + 3.75 \qquad (15)$$
Equations 14 and 15 are programmed on the analog computer. The initial conditions for $x_1$ and $x_2$ correspond to the starting point $P_1$ in Figure 2. As time advances, the values of $x_1$ and $x_2$ change in accordance with the dotted trajectory. At point $P_2$, the sum of the squared errors $S$ is minimum.
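The trajectory is easy to verify away from the analog machine. Below is a minimal digital sketch (Python, forward-Euler integration; the step size and starting point are illustrative assumptions, not values from the article) that integrates Equations 14 and 15:

```python
# Forward-Euler integration of the steepest-descent system,
# Equations 14 and 15. Step size h and the starting point are
# assumed values, for illustration only.

def f1(x1, x2):
    return x1 - (x2 - 1.0) ** 2 - 1.0          # Equation 12

def f2(x1, x2):
    return x1 + (x2 - 0.5) ** 2 - 0.5          # Equation 13

x1, x2, h = 2.0, 2.0, 0.01                     # assumed start P1
for _ in range(20000):
    dx1 = -2.0 * x1 - x2 + 2.25                               # Eq 14
    dx2 = -4.0 * x2**3 + 9.0 * x2**2 - 8.5 * x2 - x1 + 3.75   # Eq 15
    x1, x2 = x1 + h * dx1, x2 + h * dx2

S = f1(x1, x2) ** 2 + f2(x1, x2) ** 2          # Equation 3
print(f"x1 = {x1:.4f}, x2 = {x2:.4f}, S = {S:.4f}")
```

The run settles at $x_1 = x_2 = 0.75$, the only stationary point of $S$ for this pair of functions; $S$ remains positive there because Equations 12 and 13 have no common zero.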
2
Finding the Local Extremum of a Multivariable Function
An example is the extremization of the Hamiltonian function in connection with Pontryagin's maximum principle (4). The Hamiltonian function $H$, as used in Pontryagin's maximum principle, is a function of the state variables $x_1, \ldots, x_m$, the momentum variables $p_1, \ldots, p_m$, and the control variables $u_1, \ldots, u_s$. These three sets of variables are functions of time also. The principle requires that, for fixed values of $x_1, \ldots, x_m$ and $p_1, \ldots, p_m$, the control variables $u_1, \ldots, u_s$ are chosen such that $H(x_1, \ldots, x_m;\ p_1, \ldots, p_m;\ u_1, \ldots, u_s)$ is extremized at each instant of time. Applying the method of steepest descent requires that one generates

$$\frac{du_k}{dt} = \pm \frac{\partial H}{\partial u_k}; \quad k = 1, \ldots, s \qquad (16)$$
The plus sign in Equation 16 is used for maximization; the minus sign is used for minimization of the Hamiltonian function. (It is assumed here that a relative extremum exists and that a solution of Equation 16 does not specify a stationary point, such as a point of inflection.) The time required to generate the right-hand member of Equation 16 must be shorter than the time required to solve the time-dependent equations describing the physical system. Figure 3 illustrates graphically the method for finding the maximum of $H$ when Equation 16 is used.

[Figure 3. Maximization of Hamiltonian function using the method of steepest descent]
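At one frozen instant of time (fixed $x$'s and $p$'s), Equation 16 is an ordinary gradient ascent on $H$ over the controls. A minimal sketch in Python follows; the one-control quadratic Hamiltonian is purely illustrative, not one from the article:

```python
# Euler integration of Equation 16 with the plus sign, du/dt = +dH/du,
# for an assumed Hamiltonian H(u) = -(u - 2)**2 + 5 with the state
# and momentum variables held fixed.

def dH_du(u):
    return -2.0 * (u - 2.0)    # derivative of the assumed H

u, h = 0.0, 0.01               # assumed initial control and step size
for _ in range(2000):
    u += h * dH_du(u)          # one Euler step of Equation 16

print(f"u = {u:.4f}")          # settles near u = 2, the maximizer of H
```

On the analog machine the same loop must, as noted above, settle much faster than the state and momentum variables change.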
3
Linear and Nonlinear Regression of Discrete or Continuous Functions
This type of problem requires that an experimentally obtained set of values describing the response be given. The regression problem finds the value for a set of parameters contained in a specified combination of elementary functions which is expected to describe satisfactorily the characteristic of the system. The method of steepest descent can be used to obtain the value of these parameters such that the sum (in the discrete case) or the time integral (in the continuous case) of the squared errors is a minimum. Let the experimentally obtained (observed) dependent variables be designated by $y_1(\tau), \ldots, y_n(\tau)$, where $\tau$ is machine time which corresponds to the independent variable of the system. Then, the error defined in Equation 2 corresponds to

$$\epsilon_i = f_i[y_1(\tau), \ldots, y_n(\tau);\ x_1, \ldots, x_m]$$

where $x_1, \ldots, x_m$ are the unknown regression parameters. The same definition of error can be used when the system is described by a set of ordinary differential equations. One then has, in the case of first-order differential equations,

$$\epsilon_i = f_i\left[y_1(\tau), \ldots, y_n(\tau);\ \frac{dy_1}{d\tau}, \ldots, \frac{dy_n}{d\tau};\ x_1, \ldots, x_m\right]$$
where the $y_j(\tau)$ and $dy_j/d\tau$ are known from experiment. (Note that for this application the term "dynamic optimization" is often used.) This definition of error for systems described by differential equations is of great importance. Some authors (1) define the error in a statistical sense as the difference between a calculated and the observed response, $y_i^{\text{calcd}} - y_i^{\text{obsd}}$. The mathematical difficulty arising from such a definition is that partial differentiation of the error with respect to the regression parameters (to form Equation 9) requires calculation of $\partial y_i^{\text{calcd}}/\partial x_j$, the latter term being called the "influence coefficient." But calculation of the influence coefficients requires that the regression parameters $x_1, \ldots, x_m$ change only very slowly during one operation cycle, and the analytical form of $y_i^{\text{calcd}}$ must be known, which is rarely the case for systems described by differential equations. If the $dy_j/d\tau$ are not known from experiment, solution by steepest descent procedures on a digital computer is suggested. The $dy_j/d\tau$ can then be approximated by
difference equations while $y_i^{\text{calcd}}$ can be obtained from numerical integration. In contrast, if the $dy_j/d\tau$ are known from experiment, the error definition as suggested here permits calculation of the partials $\partial\epsilon_i/\partial x_j$ from knowledge of the functions $f_i$, which, as mentioned earlier, are composed of elementary functions and can be differentiated without difficulty.

As an illustration of this method, consider the following chemical reaction model

$$A \xrightarrow{k_1} B \xrightarrow{k_2} C$$

The corresponding equations describing a well-stirred batch reaction process are

$$\frac{dy_1}{d\tau} = -y_1 k_1$$

$$\frac{dy_2}{d\tau} = y_1 k_1 - y_2 k_2$$

where

$$k_1 = x_1 e^{-x_2/RT} \qquad (19)$$

$$k_2 = x_3 e^{-x_4/RT} \qquad (20)$$

Here, $y_1$ and $y_2$ represent concentrations of A and B, respectively, while the $x$'s are unknown parameters (representing frequency factors and activation energies) of the rate constants defined in Equations 19 and 20. The reaction time is represented by $\tau$. Proceeding in the same way as before, one can write the two functions $f_1$ and $f_2$ as

$$f_1(x_1, x_2, x_3, x_4) = \frac{dy_1}{d\tau} + y_1 x_1 e^{-x_2/RT}$$

$$f_2(x_1, x_2, x_3, x_4) = \frac{dy_2}{d\tau} - y_1 x_1 e^{-x_2/RT} + y_2 x_3 e^{-x_4/RT}$$

The errors, as defined previously, become

$$\epsilon_1 = \frac{dy_1}{d\tau} + y_1 x_1 e^{-x_2/RT}$$

$$\epsilon_2 = \frac{dy_2}{d\tau} - y_1 x_1 e^{-x_2/RT} + y_2 x_3 e^{-x_4/RT}$$

The partial derivatives of the errors with respect to the unknown parameters $(x_1, x_2, x_3, x_4)$ are:

$$\frac{\partial\epsilon_1}{\partial x_1} = y_1 e^{-x_2/RT} \qquad \frac{\partial\epsilon_1}{\partial x_2} = -\frac{y_1 x_1}{RT} e^{-x_2/RT} \qquad \frac{\partial\epsilon_1}{\partial x_3} = \frac{\partial\epsilon_1}{\partial x_4} = 0$$

$$\frac{\partial\epsilon_2}{\partial x_1} = -y_1 e^{-x_2/RT} \qquad \frac{\partial\epsilon_2}{\partial x_2} = \frac{y_1 x_1}{RT} e^{-x_2/RT} \qquad \frac{\partial\epsilon_2}{\partial x_3} = y_2 e^{-x_4/RT} \qquad \frac{\partial\epsilon_2}{\partial x_4} = -\frac{y_2 x_3}{RT} e^{-x_4/RT}$$

Using Equation 9, one can now write the equations for the time derivatives of the unknown parameters $(x_1, x_2, x_3, x_4)$:

$$\frac{dx_1}{dt} = -(\epsilon_1 - \epsilon_2)\, y_1 e^{-x_2/RT} \qquad (21)$$

$$\frac{dx_2}{dt} = (\epsilon_1 - \epsilon_2)\, \frac{y_1 x_1}{RT} e^{-x_2/RT} \qquad (22)$$

$$\frac{dx_3}{dt} = -\epsilon_2\, y_2 e^{-x_4/RT} \qquad (23)$$

$$\frac{dx_4}{dt} = \epsilon_2\, \frac{y_2 x_3}{RT} e^{-x_4/RT} \qquad (24)$$
This concludes the mathematical formulation of the problem. Equations 21, 22, 23, and 24 are programmed on the analog computer. Figure 4 shows an unscaled analog computer diagram for solution. This program constitutes a nonlinear regression program. The unknown regression parameters $(x_1, x_2, x_3, x_4)$ are contained in Equations 19 and 20 nonlinearly. The observed functions to be approximated by the kinetic model are $y_1$, $y_2$, $dy_1/d\tau$, and $dy_2/d\tau$. The temperature $T$ must also be known as a function of $\tau$. (Although it is assumed here that $y_1$, $y_2$, and $T$ are continuous functions of $\tau$, it is also possible to solve the problem if these variables are known only at discrete values of $\tau$.) Reference to the computer diagram of Figure 4 shows that these observed functions are programmed into the computer by means of arbitrary function generators.

[Figure 4. Unscaled analog computer diagram for solution of Equations 21-24]

The two time domains mentioned earlier are the $t$ and $\tau$ domains. The integrator whose output is $\tau$ operates repetitively, integrating from zero to $\tau_{\max}$ during one repetitive operation cycle. The integrators having the unknown regression parameters $(x_1, x_2, x_3, x_4)$ at their respective outputs integrate continuously, that is, not repetitively. As soon as the best "fit" is obtained, their respective outputs remain (almost) constant during one repetitive operation cycle. The better the fit, that is, the better the model matches the test data, the less these outputs change. If desired, a calculated response can be generated by the computer which, if compared with the observed response, would enable a visual display of the quality of the fit. If the fit is unsatisfactory, the kinetic model may be changed. The program for the analog computer of Figure 4 must then be modified accordingly before a new run can be made.

One last remark may be in place before leaving the subject. In certain cases it is possible to simplify the analog computer program considerably. For instance, if Equations 21, 22, 23, and 24 were all multiplied on the
right-hand side by a common positive quantity, this quantity could be omitted, thus saving a number of multipliers. If the omitted quantity is always negative, the sign of the right-hand sides must be changed.
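To make the two-time-domain scheme concrete, here is a small digital caricature of the program of Figure 4. It is a sketch under stated assumptions, not the article's program: the "observed" data are manufactured from assumed true parameters, the temperature is held constant, and units are chosen so that $RT = 1$, giving $k_1 = x_1 e^{-x_2}$ and $k_2 = x_3 e^{-x_4}$. With constant temperature only the rate constants themselves are identifiable (any pair $x_1, x_2$ with the same value of $x_1 e^{-x_2}$ fits equally well), so the final check compares fitted rate constants rather than the raw parameters.

```python
import math

X_TRUE = (3.0, 1.0, 2.0, 0.5)   # assumed "true" parameters behind the data

def rate(a, b):
    return a * math.exp(-b)      # Equations 19 and 20 with RT = 1

def observed(tau):
    """Closed-form y1, y2, dy1/dtau, dy2/dtau for A -> B -> C."""
    ka, kb = rate(*X_TRUE[:2]), rate(*X_TRUE[2:])
    y1 = math.exp(-ka * tau)
    y2 = ka / (kb - ka) * (math.exp(-ka * tau) - math.exp(-kb * tau))
    return y1, y2, -ka * y1, ka * y1 - kb * y2

def cycle(x, gain=0.5, d_tau=0.02, n=100):
    """One repetitive-operation cycle: sweep machine time tau from 0
    to tau_max while nudging x1..x4 along Equations 21-24."""
    x1, x2, x3, x4 = x
    for i in range(n):
        y1, y2, dy1, dy2 = observed(i * d_tau)
        e1 = dy1 + y1 * rate(x1, x2)                       # error 1
        e2 = dy2 - y1 * rate(x1, x2) + y2 * rate(x3, x4)   # error 2
        g1, g2 = y1 * math.exp(-x2), y2 * math.exp(-x4)
        d1 = -(e1 - e2) * g1                               # Equation 21
        d2 = (e1 - e2) * g1 * x1                           # Equation 22
        d3 = -e2 * g2                                      # Equation 23
        d4 = e2 * g2 * x3                                  # Equation 24
        x1 += gain * d_tau * d1
        x2 += gain * d_tau * d2
        x3 += gain * d_tau * d3
        x4 += gain * d_tau * d4
    return x1, x2, x3, x4

x = (2.0, 0.5, 1.5, 1.0)          # assumed initial parameter guesses
for _ in range(2000):             # repetitive cycles (slow t domain)
    x = cycle(x)

print("fitted k1 =", round(rate(*x[:2]), 4), "; true k1 =", round(rate(*X_TRUE[:2]), 4))
print("fitted k2 =", round(rate(*x[2:]), 4), "; true k2 =", round(rate(*X_TRUE[2:]), 4))
```

The per-sample step gain times d_tau plays the role of the continuous integrator gain; as on the analog machine, it must be small enough that $x_1, \ldots, x_4$ stay nearly constant within one repetitive cycle.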
4

Solution of Time-Independent Sets of Equations, Linear or Nonlinear in Their Variables
It is interesting to note that application of the method of steepest descent to sets of time-independent equations bypasses all of the problems usually arising with instability of such systems (3). The conventional approach of solving sets of time-independent equations by adding accumulation terms to the equations may lead to instability; in linear systems this can be overcome by adding stabilization circuits into those loops having a loop gain larger than unity. If the system contains nonlinear equations, this method cannot always be used because loops are not necessarily isolated from each other, so that stabilization of one loop may cause instability of another loop. Applying the method of steepest descent has the advantage of not only bypassing instability problems for just one equilibrium point, but also assuring that stability is obtained for all equilibrium points for which the functions $f_i$ are defined. The latter is not true in general when the conventional approach is used for solution of systems containing nonlinear equations. Using the notation of matrix algebra, one can prove that the equations arising from the method of steepest descent are stable in the following way. (It is suggested that the reader unfamiliar with matrix algebra pass over this section.)
The equations to be solved are

$$\frac{dx_j}{dt} = -\sum_{i=1}^{n} \epsilon_i \frac{\partial \epsilon_i}{\partial x_j} \qquad (9)$$

for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. As the number of equations must now correspond to the number of variables, one has $m = n$. Let

$$\frac{dx_j}{dt} = g_j(x) \qquad (25)$$

where $x$ is a vector whose components are the state variables $x_1, \ldots, x_n$. Define a new vector $\bar{x}$ whose components $\bar{x}_j$ are so chosen that $f_i(\bar{x}) = 0$ for all $i$'s. (Note that $\bar{x}$ is unique for nonsingular linear systems.) Expanding $g_j(x)$ about $\bar{x}$ in a Taylor series and then linearizing gives

$$g_j(x) = g_j(\bar{x}) + \sum_{k=1}^{n} \left.\frac{\partial g_j}{\partial x_k}\right|_{\bar{x}} (x_k - \bar{x}_k) \qquad (26)$$

By definition of $\bar{x}$, $g_j(\bar{x}) = 0$. Therefore, Equation 26 reduces to

$$g_j(x) = \sum_{k=1}^{n} \left.\frac{\partial g_j}{\partial x_k}\right|_{\bar{x}} (x_k - \bar{x}_k) \qquad (27)$$

From Equations 9 and 25,

$$g_j(x) = -\sum_{i=1}^{n} \epsilon_i \frac{\partial \epsilon_i}{\partial x_j} \qquad (28)$$

Differentiating Equation 28 according to the chain rule results in

$$\frac{\partial g_j}{\partial x_k} = -\sum_{i=1}^{n} \left[ \frac{\partial \epsilon_i}{\partial x_k} \frac{\partial \epsilon_i}{\partial x_j} + \epsilon_i \frac{\partial^2 \epsilon_i}{\partial x_j \partial x_k} \right]$$

Evaluating $\partial g_j/\partial x_k$ at $x = \bar{x}$, one has $\epsilon_i = 0$ and

$$\left.\frac{\partial g_j}{\partial x_k}\right|_{\bar{x}} = -\sum_{i=1}^{n} \frac{\partial f_i}{\partial x_k} \frac{\partial f_i}{\partial x_j} \qquad (29)$$

Combining Equations 25, 27, and 29 gives

$$\frac{dx_j}{dt} = -\sum_{k=1}^{n} \left[ \sum_{i=1}^{n} \frac{\partial f_i}{\partial x_k} \frac{\partial f_i}{\partial x_j} \right]_{\bar{x}} (x_k - \bar{x}_k) \qquad (30)$$

Substituting $z_k = x_k - \bar{x}_k$, Equation 30 for $j = 1, \ldots, n$ can then be written in matrix notation as

$$\frac{dz}{dt} = -Az$$

By simplifying notation according to

$$b_{ij} = \left.\frac{\partial f_i}{\partial x_j}\right|_{\bar{x}}$$

the $(jk)$th element of $A$ can be expressed as

$$a_{jk} = \sum_{i=1}^{n} b_{ik} b_{ij} \qquad (31)$$

Equation 31 states that

$$A = B^T B$$

which implies that $A$ is positive definite if $B$ is nonsingular (for any vector $z$, $z^T A z = z^T B^T B z = \|Bz\|^2 \geq 0$, with equality only for $Bz = 0$). By well-known theory a nonlinear system is stable about an equilibrium point if the coefficient matrix of the system, when linearized about this equilibrium point, is positive definite. Since no restriction was made in the choice of the equilibrium point, the system must be stable for all equilibrium points.

To illustrate the above, consider the following simple example. Let the system be described by

$$x_1 + x_2 = 2 \qquad (32)$$

$$x_1 x_2 = 0.25 \qquad (33)$$

The problem is to find the two equilibrium points $E_1$ and $E_2$ in the $x_1x_2$-plane of Figure 5, both satisfying Equations 32 and 33 simultaneously. If one uses the conventional method for solving Equations 32 and 33 by adding an accumulation term according to

$$-\frac{dx_1}{dt} = x_1 + x_2 - 2 \qquad (34)$$

$$-\frac{dx_2}{dt} = x_1 x_2 - 0.25 \qquad (35)$$

a stable equilibrium point $E_1$ and an unstable equilibrium point $E_2$ would be obtained. Analyzing Equations 34 and 35, one finds that all equilibrium points with coordinate $x_1 \leq 0.5$ are unstable. For equilibrium point $E_2$, the value of $x_1$ is 0.134; therefore it must be unstable. (Since the system is symmetric in $x_1$ and $x_2$, $E_1$ is unstable and $E_2$ stable when Equation 32 is solved for $x_2$ and Equation 33 for $x_1$ instead.) In particular, the system described by Equations 34 and 35 is unstable if the starting point for $(x_1, x_2)$ falls into the shaded area of Figure 5.

[Figure 5. Equilibrium points $E_1$ and $E_2$ in the $x_1x_2$-plane]

In contrast, when using the method of steepest descent one obtains stable equilibria for both $E_1$ and $E_2$. The equations to be solved are

$$\frac{dx_1}{dt} = -[(x_1 + x_2 - 2) + (x_1 x_2 - 0.25)x_2] \qquad (36)$$

$$\frac{dx_2}{dt} = -[(x_1 + x_2 - 2) + (x_1 x_2 - 0.25)x_1] \qquad (37)$$

Equations 36 and 37 yield equilibrium point $E_1$ for all starting values $x_1 > x_2$ and equilibrium point $E_2$ for all starting values $x_2 > x_1$. The fact that both equilibria can be obtained shows the advantage when the method of steepest descent is used for solving this system. However, programming Equations 36 and 37 requires three multipliers as against only one
multiplier when programming Equations 34 and 35. Thus, the method of steepest descent has the disadvantage of using more nonlinear computing equipment than the conventional method. The advent of hybrid computer techniques, however, makes this disadvantage less important in certain instances.
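The two-basin behavior is easy to confirm digitally. The forward-Euler sketch below (step size and starting points are assumed values, not the article's) integrates Equations 36 and 37 from one starting point with $x_1 > x_2$ and one with $x_2 > x_1$:

```python
# Forward-Euler integration of Equations 36 and 37. The equilibria
# satisfy x1 + x2 = 2 and x1*x2 = 0.25, i.e. x = 1 +/- sqrt(3)/2,
# so E1 = (1.866, 0.134) and E2 = (0.134, 1.866) approximately.

def step(x1, x2, h=0.005):
    e1 = x1 + x2 - 2.0                 # error from Equation 32
    e2 = x1 * x2 - 0.25                # error from Equation 33
    dx1 = -(e1 + e2 * x2)              # Equation 36
    dx2 = -(e1 + e2 * x1)              # Equation 37
    return x1 + h * dx1, x2 + h * dx2

for start in [(3.0, 0.0), (0.0, 3.0)]: # x1 > x2, then x2 > x1
    x1, x2 = start
    for _ in range(20000):
        x1, x2 = step(x1, x2)
    print(f"start {start} -> ({x1:.3f}, {x2:.3f})")
```

Both runs converge stably, the first to $E_1$ and the second to $E_2$, as the text asserts.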
5

Solution of Sets of Equations, Linear or Nonlinear in Their Variables, with Time-Varying Coefficients

The method of steepest descent for this application is powerful if the equilibrium point moves over a large region in the $x_1, \ldots, x_m$ coordinate system as a function of time because of the time variation of the coefficients. In such a case one often experiences, when using the so-called conventional method for solving these equations, stability within one region and instability within another region for which a solution of the system is sought. An example is given in the section below where, in addition to solving sets of equations with time-varying coefficients, an imbedded steepest descent method is used to perform a regression on certain parameters of the system.

6

As in Section 5 but with Imbedded Method of Steepest Descent to Perform a Regression as in Section 3

An example is the determination of chemical equilibrium constants for a certain reaction mix. Such a problem can sometimes be put into the following form. Given are an experimentally obtained (observed) variable $f_{\text{obsd}}(y_0)$ and a calculated variable $f_{\text{calcd}}(y, y_0, x_1, \ldots, x_m)$. Here, $y$ is the concentration of a certain component at chemical equilibrium with initial concentration $y_0$, and $x_1, \ldots, x_m$ are the unknown equilibrium constants. From the kinetic model we obtain the polynomial expression

$$P = \sum_{j=0}^{m} y^j(x_1, \ldots, x_m)\, a_j(x_1, \ldots, x_m;\ y_0) = 0 \qquad (38)$$

The first step is to obtain $y$ from Equation 38 for a given $y_0$ and assumed values for $x_1, \ldots, x_m$. This can be done using the method of steepest descent operating on $\epsilon = P$ (since the zeros of $P$ are minima of $P^2$). Then

$$\frac{dy}{d\tau} = -C_1 P \frac{\partial P}{\partial y} \qquad (39)$$

where $C_1$ is a large positive constant and $\tau$ is machine time. The second step is to generate (again in accordance with the method of steepest descent)

$$\frac{dx_i}{dt} = -C_2\, \epsilon \frac{\partial \epsilon}{\partial x_i} \qquad (40)$$

for $i = 1, \ldots, m$. $C_2$ is a positive constant much smaller than $C_1$. It is now advantageous to define the error $\epsilon$ in Equation 40 as

$$\epsilon = f_{\text{obsd}}(y_0) - f_{\text{calcd}}[y(x_1, \ldots, x_m),\ y_0]$$

Its partial derivative with respect to $x_i$ is

$$\frac{\partial \epsilon}{\partial x_i} = -\frac{\partial f_{\text{calcd}}}{\partial y} \frac{\partial y}{\partial x_i} \qquad (41)$$

The partial derivatives $\partial y/\partial x_i$ in Equation 41 can be calculated from Equation 38. It should be noted that the coefficients $a_j$ are time-varying if one replaces $y_0$ with machine time $\tau$. The computer operates in the repetitive mode, solving the system equations over and over again. During the repetitive cycle the values of $x_1, \ldots, x_m$ change only slowly (by making $C_2$ in Equation 40 small; Equation 38 is solved instantaneously because of a large $C_1$ in Equation 39 to obtain the correct value of $y$). Thus, the method of steepest descent is applied twice in the same problem: to solve for $y$ using Equation 39 and to find the best-fitting regression parameters using Equation 40.

It may further be remarked that, if one is solving the problem in the manner outlined here, a hybrid computer (analog and sequential digital) can be used to great advantage. The reason for this is that $\partial y/\partial x_i$ in Equation 41 is a function of various cross products between the $x_1, \ldots, x_m$. To generate these cross products on the analog computer would soon cause capacity problems. If using a hybrid computer, one would hold $x_1, \ldots, x_m$ constant during each compute cycle and then update $x_1, \ldots, x_m$ and recalculate these cross products on the digital machine between each compute cycle of the analog computer.
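A digital caricature of the two imbedded loops may clarify the timing. The toy problem below is an illustrative assumption, not from the article: a single unknown constant $x$ and the quadratic $P(y; x, y_0) = y^2 + xy - y_0$ standing in for Equation 38, with $f_{\text{calcd}} = y$. The inner loop applies Equation 39 (large $C_1$) to drive $y$ to the root of $P$ within each cycle; the outer loop applies Equation 40 (small $C_2$) to drift $x$ toward agreement with the observed response, using $\partial y/\partial x$ obtained from Equation 38 by implicit differentiation.

```python
import math

X_TRUE = 1.0                             # assumed true equilibrium constant

def f_obsd(y0):
    """Observed response: positive root of P for the true parameter."""
    return (-X_TRUE + math.sqrt(X_TRUE**2 + 4.0 * y0)) / 2.0

def inner_solve(y, x, y0, c1=10.0, d_tau=0.01, n=200):
    """Equation 39, dy/dtau = -C1 * P * dP/dy: the fast loop that
    settles y on a zero of P while x is effectively frozen."""
    for _ in range(n):
        P = y * y + x * y - y0           # stand-in for Equation 38
        y -= c1 * d_tau * P * (2.0 * y + x)
    return y

x, y, c2 = 0.3, 0.5, 0.05                # assumed guesses; C2 << C1
for _ in range(400):                     # repetitive operation cycles
    for k in range(1, 11):               # y0 swept as machine time
        y0 = 0.1 * k
        y = inner_solve(y, x, y0)        # fast loop: solve P = 0
        eps = f_obsd(y0) - y             # error of Equation 40
        dy_dx = -y / (2.0 * y + x)       # implicit differentiation of P = 0
        x -= c2 * eps * (-dy_dx)         # Equation 40 (deps/dx = -dy/dx)

print(f"x = {x:.4f} (true value {X_TRUE})")
```

On a hybrid machine the inner loop would run on the analog console while the update of $x$ and the cross products behind $\partial y/\partial x$ would be computed digitally between compute cycles, as described above.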
Conclusions

The given examples show the power of the method of steepest descent when applied to analog computer solution of the discussed types of problems. Some of these problems either exceed the computational capabilities of digital computers or, because of the sequential nature of digital computers, would make their application unfeasible. The examples used in this paper do not exhaust all of the possible applications of the method of steepest descent. They serve merely as an outline of some programming techniques and as a stimulus for attacking those problems which belong in the realm of analog computation.

REFERENCES

(1) Bekey, G. A., Simulation 2 (2), 19 (1964).
(2) Cauchy, A., Compt. Rend. Acad. Sci. Paris 25, 536 (1847).
(3) Rogers, A. E., Connolly, T. W., "Analog Computation in Engineering Design," McGraw-Hill, New York, 1960.
(4) Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., Mishchenko, E. F., "The Mathematical Theory of Optimal Processes," Interscience, New York, 1962.
VOL 5 8
NO. 1
JANUARY 1 9 6 6
39