Ind. Eng. Chem. Res. 1995, 34, 4420-4435
WaveARX Neural Network Development for System Identification Using a Systematic Design Synthesis

Junghui Chen† and Duane D. Bruns*,‡

Department of Chemical Engineering, University of Tennessee, Knoxville, Knoxville, Tennessee 37996-2200
The WaveARX network, a new neural network architecture, is introduced. Its development was motivated by the opportunity to capitalize on recent research results that allow some shortcomings of the traditional artificial neural net (ANN) to be addressed. ANN has been shown to be a valuable tool for system identification but suffers from slow convergence and long training time due to the globalized activation function. The structure of ANN is derived from trial and error procedures, and the trained network parameters often are strongly dependent on the random selection of the initial values. There are not even guidelines on the number of neurons needed. Also, few identification techniques are available for distinguishing linear from nonlinear contributions to a system's behavior. The WaveARX integrates the multiresolution analysis concepts of the wavelet transform and the traditional AutoRegressive external input model (ARX) into a three-layer feedforward network. Additional network design problems are solved as the WaveARX formalisms provide a systematic design synthesis for the network architecture, training procedure, and excellent initial values of the network parameters. The new structure also isolates and quantifies the linear and nonlinear components of the training data sets. The wavelet function is extended to multidimensional input space using the concept of a norm. The capabilities of the network are demonstrated through several examples in comparison with some widely used linear and nonlinear identification techniques. Separately, the wavelet network of the WaveARX model is shown for the example investigated to have a better performance than two other existing wavelet-based neural networks.
1. Introduction

Extensive research in the field of chemical process control has developed a variety of algorithms for system modeling and identification. In recent years, the artificial neural network (ANN) has received considerable attention and has been applied to system identification and controller design. This work shows that ANN can capture the characteristics of system patterns and perform function approximations for nonlinear systems. It has been proven that a three-layered ANN is able to approximate an arbitrary function which is absolutely integrable and has bounded variations (Irie and Miyake, 1988). Many applications done by researchers have also confirmed this salient feature (Narendra and Parthasarathy, 1990; Bhat and McAvoy, 1990; Chen and Weigand, 1992; Psichogios and Ungar, 1992; Su and McAvoy, 1993; Eikens and Karim, 1994; Barton and Himmelblau, 1994). However, there are no design procedures or even guidelines on how to build the structure of ANN. The missing structural information includes aspects such as the number of neurons, the connections between the neurons, and the initial values of the network parameters. Ambiguous structure determination usually brings, at least, two risks. First, when there are not enough neurons in a network, it is unlikely to form an accurate model for the predicted system. Second, a network with more neurons than needed may overfit the data and have poor generalization to unseen data (Bhat and McAvoy, 1992). Also, a global activation function, which is often used as in the backpropagation neural network (BPN) (Rumelhart and McClelland, 1986) approach, generally results in long training times (Moody and Darken, 1989; Bakshi and Stephanopoulos, 1993). These drawbacks motivate us to look for ways to improve ANN design procedures.
† e-mail: [email protected].
* e-mail: [email protected].
So far, many modeling techniques have been undertaken to capture the characteristics of linear or nonlinear systems, but few of them are capable of dealing with systems that have a combination of both. A nonlinear system seldom fits very well into any linear model. On the other hand, and perhaps less obvious, nonlinear modeling techniques are generally poor at quantifying an exact linear relationship. In practice, one is likely to encounter a system without having prior knowledge of its linearity and/or nonlinearity. Even having considerable information about a system may not tell if the input and output relationships in the operation region of interest are linear, nonlinear, or a combination of both. This will be illustrated by the examples studied in section 4. Therefore, a generic mathematical model covering both classifications would be valuable for system identification when the above problem occurs. A new algorithm called the WaveARX network is proposed and rigorously developed in this paper. We apply wavelet transforms, the ANN concepts, and the AutoRegressive external input model (ARX) to the construction of a feedforward neural network for system identification. The wavelet transform and ANN combination has been reported to be superior to the conventional ANN. For example, some research suggests that using a neural network with its neurons based on a wavelet expansion should provide a better approximation for nonlinear systems (Pati and Krishnaprasad, 1990; Pati and Krishnaprasad, 1993; Bakshi and Stephanopoulos, 1993; Bakshi et al., 1994), although these wavelet networks employ a fixed shape (fixed-scale and translation parameters, which are discussed in section 2) of the wavelet function. Zhang and Benveniste (1992) enhance the network using an adaptive wavelet function (adjusting the scale and translation parameters). Szu et al. apply an adaptive wavelet to speech signals (Szu et al., 1992). One common advantage they get from wavelet transforms is that it provides a local function
for approximation in the joint time-frequency space. In this research, the localization property of wavelet transforms is also used to make the WaveARX network a relatively fast algorithm in both training and applications. Unlike BPN, whose global function results in structural complexity, time-consuming learning, and a great possibility of overfitting data, the WaveARX network has a compact form, reducing training time, improving computational efficiency, and enhancing model accuracy. In addition, using the concept of wavelet transform multiresolution analysis (MRA) (Mallat, 1989) allows a systematic way of determining the number of neurons needed in the network. Finally, most of the wavelet net researchers do not consider the system with partially linear and partially nonlinear characteristics. The present research addresses this problem by incorporating ARX into the wavelet network structure so that we can easily tell the linear from the nonlinear contributions in the input and output relationships of a system. The objectives of the WaveARX network formalisms when applied to a given process are (1) to provide a systematic design procedure to synthesize the neural network including the ARX input model, (2) to isolate and quantify the linear and/or nonlinear characteristics, (3) to map input-output data with the desired accuracy, and (4) to reduce computation time in both the training and application phases. The ability of this new neural network formulation and synthesis algorithm to attain these objectives is not only developed in theory but also demonstrated by using examples to compare with several other popular linear and nonlinear modeling techniques. This paper is organized as follows: Section 2 elaborates on the wavelet transform background theory that forms part of the foundation for the WaveARX network. Section 3 details a systematic design procedure and algorithm for the architecture, construction, and training of the WaveARX net. In section 4, the new network and algorithm are compared with other system modeling methods by implementation on mathematical equations and chemical engineering process models. The examples include static and dynamic systems that contain linear, nonlinear, or a mixture of both these characteristics. Finally, conclusions are made and future studies are considered in the last section.
2. Wavelet Transforms

In process control an abundance of linear modeling techniques has been established based on solid theoretical foundations, but the development of nonlinear counterparts has not yet entered its mature stage. This inspires us to look for novel algorithms to solve, at least, part of the nonlinear puzzle. Recently, ANN has been one of the hottest topics in the nonlinear system modeling and identification research areas. The main reason is that ANN is a powerful method for nonlinear function approximation. From learning the relationship between given input and output data, ANN can estimate the complex parameters in its model structure which can only be awkwardly computed using the traditional methods, such as the Volterra-Wiener function expansion (Soloway and Bialasiewicz, 1992; Johansson, 1993) and the NARX model (Pottmann et al., 1993; Pröll and Karim, 1994). However, the structural determination of ANN looks arbitrary since it is a black box approach modeling the causal relationships between inputs and outputs. Another failing is that
ANN suffers from a slow training process due to the global nature of the activation function on its neurons. Therefore, we reformulate ANN to overcome these problems. This reformulation is inferred from the wavelet transform theory. It shows promise as an excellent tool for system approximation with the attribute of fast training and its structure being determined in a systematic way. When wavelet transform theory is applied to ANN, it should be considered in terms of the wavelet function, discrete wavelet transforms, frames, and multiresolution analysis.

2.1. Wavelet Function. Wavelets, as their name suggests, are little waves. They are functions satisfying the following three requirements (Young, 1992). First, they must have oscillations that look like waves above and below the x-axis. Second, their amplitudes decay quickly to zero in both the positive and negative directions. Last, the integral of the wavelet should be zero. More rigorously, a wavelet, ψ(x), in L²(R) is an admissible wavelet function if it satisfies the admissibility condition,

C_ψ = ∫ |ψ̂(ω)|² / |ω| dω < ∞    (1)
where ψ̂(ω) is the Fourier transform of ψ. The original wavelet function that is used to generate a family of wavelets by way of scale and translation operations is called the mother wavelet. Let the wavelet function, ψ(x), be a mother function to explain the meaning of scale and translation.

Scale: For any arbitrary γ ≥ 0 and for any real number s ≠ 0, the width scaling of ψ(x) by s is defined as

ψ_s(x) = |s|^(-γ) ψ(x/s)    (2)

Dividing the argument of ψ by s expands or compresses the wavelet shape horizontally. Similarly, the factor |s|^(-γ) has an effect on the vertical magnitude or wavelet amplitude. That is, it expands or compresses vertically the shape of the original function.

Translation: For any arbitrary t > 0, the translation of the mother wavelet by t is defined as

ψ_t(x) = ψ(x - t)    (3)

Subtracting t from the argument of ψ translates or time shifts the original function, which retains its same shape. If the mother wavelet is both scaled and translated, a daughter wavelet, ψ_{s,t}(x), is formed

ψ_{s,t}(x) = |s|^(-γ) ψ((x - t)/s)    (4)

Assume ψ belongs to L²(R) and that the energy of the scaled wavelet is to be the same as that of the mother wavelet (i.e., ||ψ_{s,t}|| = ||ψ||); then it is necessary that γ = 1/2. More explicitly, the wavelet family with each member having equal energy is defined by

ψ_{s,t}(x) = |s|^(-1/2) ψ((x - t)/s)    (5)

The most useful property of wavelet transforms is probably the time-frequency localization. Take the Mexican hat function, ψ(x) = (2/√3) π^(-1/4) (1 - x²) e^(-x²/2),
Figure 1. (a) Mexican hat wavelet with the change of scales in the time domain. (b) Mexican hat wavelet with the change of translations in the time domain. (c) Mexican hat wavelet with the change of scales in the frequency domain.

for example. In Figure 1a,b, all of the wavelets, no matter how they are scaled or translated, have the same number of oscillations. From Figure 1a, each of the wavelets is considered a window; both the width and the height of the window change with each different scale s. That is, as s changes, the wavelet function covers a different size interval which is nonzero in both the horizontal and vertical directions. Within a fixed interval, the smaller the scale is, the more cycles of the scaled wavelet would be included. Since the cycles in the vertical interval indicate frequencies, we may construct a smaller window (smaller s) for higher frequencies or a bigger window (bigger s) for lower frequencies. Figure 1b shows the translation simply shifts the wavelet function. In the frequency domain (Figure 1c), the scale shifts the frequency concentration of the wavelets. For large s, the frequency concentration is shifted to the lower frequencies. Contrarily, as s decreases, the frequency concentration is shifted toward the higher ones. The operations of scale and translation allow the wavelet function to be narrowed, widened, or shifted. By the operations of scale and translation, wavelet transforms can be accomplished. In this paper, only discrete wavelet transforms are used; therefore, for additional details of continuous wavelet transforms, the reader is referred to the literature (Daubechies, 1992; Chui, 1992; Benedetto and Frazier, 1993).

2.2. Discrete Wavelet Transforms. A discrete form of the wavelet function must be considered for digital implementation on a computer. Discrete wavelet transforms have a limited choice of scales, s_k, and translations, t_{kl}, collected in discrete lattices. This discrete wavelet family can be shown to be

ψ_{kl}(x) = a^(-k/2) ψ(a^(-k) x - lb)    (6)

where s_k = a^k, t_{kl} = lba^k, and k, l ∈ Z. Here a and b are the discrete scale and translation step sizes, respectively. In the notation ψ_{kl}, the first index k refers to the discrete scale steps while the second one, l, indicates the discrete translation steps. For efficient computation and convenience in discussion, the binary partitions, a = 2 and b = 1, are selected (Mallat, 1989). That is,

ψ_{kl}(x) = 2^(-k/2) ψ(2^(-k) x - l)    (7)
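As a concrete illustration of eqs 6 and 7, the short Python sketch below (our illustration, not part of the original paper; function names and the plotting grid are assumptions) evaluates a Mexican hat mother wavelet on the dyadic lattice a = 2, b = 1.

```python
import numpy as np

def mexican_hat(x):
    # Mexican hat (Laplacian of a Gaussian), a common admissible mother wavelet
    return (2.0 / np.sqrt(3.0)) * np.pi ** (-0.25) * (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def psi_kl(x, k, l, a=2.0, b=1.0):
    # Daughter wavelet of eq 6: psi_kl(x) = a**(-k/2) * psi(a**(-k) * x - l * b)
    return a ** (-k / 2.0) * mexican_hat(a ** (-k) * x - l * b)

x = np.linspace(-8.0, 8.0, 801)
coarse = psi_kl(x, k=2, l=0)    # large scale s_k = a**k = 4: wide, low-frequency window
fine = psi_kl(x, k=-1, l=0)     # small scale s_k = 0.5: narrow, high-frequency window
shifted = psi_kl(x, k=0, l=3)   # translation moves the window along the x-axis
```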
Although an f(x) in L²(R) can be perfectly inverted from its continuous wavelet transform based on an admissible mother wavelet, the reconstruction may be nonconvergent for the discrete wavelet transform. It depends on the choice of the mother wavelet ψ and the density of the discrete lattices, which is determined by the discrete scale and translation step sizes. For the purpose of the reconstruction from the discretely scaled and translated mother wavelet, a generalized frame is introduced.

2.3. Frames. A frame is made up of a set of vectors. It allows any nonzero function to have a nonzero projection onto at least one of its vectors. The sum of these projections creates a finite function. Frames were first introduced by Duffin and Schaeffer (Duffin and Schaeffer, 1952) and later addressed by Daubechies (Daubechies, 1992). They are defined as follows.

Definition: Given a family of wavelet functions, ψ_{kl}, k, l ∈ Z, in a Hilbert space H, the wavelet family is called a frame if there exist constants, 0 < A ≤ B < ∞, such that for all f(x) in H,

A ||f||² ≤ Σ_{k,l} |⟨ψ_{kl}, f⟩|² ≤ B ||f||²    (8)
where A and B are the lower and upper frame bounds, respectively. Each ⟨ψ_{kl}, f⟩ represents the transform of f(x) with respect to the wavelet function ψ_{kl}. Note that, for the special case A = B = 1, the frame is called a tight exact frame which constitutes an orthonormal basis. This is similar to the Parseval identity in Fourier analysis. Thus, f can be reconstructed from the transformation of f in the time-scale domain. When A < B, the reconstruction of f is still possible if f is within the frame bounds. With an admissible wavelet function and a frame, any f(x) in H can be written in terms of the frame elements (Daubechies, 1992)

f(x) = Σ_{k,l} ⟨f, S⁻¹ψ_{kl}⟩ ψ_{kl}(x)    (9)

where S is the frame operator H → H. In practice, it is impossible to count infinite frame terms in a wavelet decomposition. Therefore, an approximation to f(x) can be obtained:

f(x) ≈ Σ_N w_{kl} ψ_{kl}(x)    (10)

where w_{kl} = ⟨f, S⁻¹ψ_{kl}⟩ and N is the total number of wavelet functions selected. However, arbitrary truncations may lead to large errors. The terms which should be kept in a wavelet decomposition are discussed in section 3. It may appear that the Fourier transform is similar to the wavelet transform, because both of them are
Figure 2. Scale-translation plane (joint time-frequency plot) of the wavelet transform.
formed from compositions of waves with various frequencies. However, their localization properties make them very different. By way of the scale and translation operations, the wavelet functions provide a time-frequency window to capture and isolate the local characteristics in both the frequency and the time domains. Therefore, the wavelet transform is sensitive to sharp signal transients. The Fourier basis functions, on the other hand, can be localized in frequency only. Any small change in frequency at any point in time leads to changes everywhere in time. This makes it difficult, if not impossible, for the Fourier transform to handle a signal with local irregularities and transients. The short-time Fourier transform has been proposed to solve the localization problem of the Fourier transform. However, its window size is fixed in the time and frequency plane. Accordingly, the quality of resolution cannot be improved in both the time and frequency domains due to the Heisenberg inequality (Rioul and Vetterli, 1991). As opposed to the short-time Fourier transform, the wavelet transform is free from this resolution limitation. It is helpful to analyze information based on different resolutions whenever the detail and/or the whole picture of system signals is needed.

2.4. Multiresolution Analysis. In the discrete wavelet transform, s_k and t_{kl} are analogous to frequency and time, respectively. Initially introduced by Mallat (Mallat, 1989), the choice of the discrete lattices, s_k and t_{kl}, produces a multiresolution analysis (MRA). As each signal is decomposed by MRA, the data resolution cells from the bottom (large s_k) to the top (small s_k) can be illustrated in Figure 2. As shown in Figure 2, the resolution becomes finer by going upward through the cell layers. A large scale value (lower in the figure) stretches the mother wavelet; thus, it creates a coarse time resolution. Contrarily, a small scale value compresses the mother wavelet, yielding a fine time resolution. MRA is just like making the signal pass through the identical structured cells. Each cell can be considered as a small filter which can catch the signal belonging to the cell's time-frequency region.

3. The WaveARX Network

Utilizing the insight provided by wavelet transforms, we further exploit the MRA structure for formalizing the construction and training methods for the WaveARX network.
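Before describing the network, the resolution-cell bookkeeping of Figure 2 can be made concrete with a short sketch (ours, not the authors' code; the cell-indexing convention l = floor(x / 2^k) is an assumption consistent with eq 7). Each occupied cell (k, l) later becomes a candidate hidden neuron.

```python
import numpy as np

def active_cells(x_data, k_levels):
    """Return the dyadic resolution cells (k, l) that contain data.

    At scale level k the cell width is 2**k, so a sample x lies in the cell
    with translation index l = floor(x / 2**k).  An occupied cell corresponds
    to a candidate neuron with scale s = 2**k and translation t = l * 2**k.
    """
    cells = set()
    for k in k_levels:
        width = 2.0 ** k
        for x in np.atleast_1d(x_data):
            cells.add((k, int(np.floor(x / width))))
    return cells

x_data = np.array([-3.2, -0.7, 0.4, 2.9, 3.1])
print(sorted(active_cells(x_data, k_levels=[0, 1, 2])))
```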
Figure 3. WaveARX network structure.
3.1. Basic Architecture. According to the theoretical discussion in section 2, the architecture of the WaveARX network can be written as

F(x) = Σ_k Σ_l w_{kl} ψ((x - t_{kl})/s_k) + c x + c_0    (11)

where the double sum constitutes the wavelet network part, s_k = a^k, t_{kl} = lba^k, x is an input, and c and c_0 are corresponding coefficients. This architecture is expanded later to incorporate the concept of a norm for multidimensional systems. The WaveARX network consists of two parts: the wavelet network and the ARX model (Figure 3). The wavelet network contains three layers of neurons in the order of input, hidden, and output layers. Each layer has at least one neuron. The neurons in the hidden layer contain the wavelets defined by MRA as their activation functions. When any input data information flows to the input layer of the neural network, the neuron of the input layer generates a weighted (a^(-k), k ∈ Z) input for each neuron in the hidden layer. Likewise, the weighted (w_{kl}) output from each hidden neuron provides the inputs for the output neurons. The neurons in the output layer simply sum all of their inputs according to the w_{kl} weights. In other words, w_{kl} are the weights for the linear combinations formed by the output neurons. The factor lb, l ∈ Z, is a threshold for the activation function. Therefore, the wavelet network depicts the nonlinear relationships between the input and output variables. Constructed in parallel with the wavelet network is a linear model, ARX. By providing direct connections from the input to the output, ARX is used to model the linear relationships of a system. Even though the feedforward wavelet network scheme generally maps the nonlinear character of data very well, it often produces a poorer mapping of linear characteristics, especially when compared to the well-established linear modeling techniques. Besides, if the linear characteristics are incorporated into a neural network, the final model is very difficult to analyze since the linear
information on the input-output relationships is spread throughout the network. Therefore, it is highly advantageous to integrate a linear model in parallel with the nonlinear network. The goal is to isolate the linear and nonlinear contributions to the total system behavior. A similar idea of combining a linear term with a wavelet-based network can be found in Zhang's research (Zhang, 1993). He mentioned that incorporating an additional linear term can easily capture linear characteristics of nonlinear approximation problems. However, he did not extend this part of the discussion or demonstrate through any example how the combinational network can extract the linear component from the system. Here in our research, we not only provide tactics for separating linear from nonlinear contributions but also make this attribute an objective which is reachable in our simulation studies. Actually, the structure of the wavelet network looks like the radial basis function network (RBFN) (Moody and Darken, 1989). The equation of the RBFN can be shown as

F(x) = Σ_i h_i φ(||x - c_i|| / σ_i)    (12)

where the center, c_i, and width, σ_i, are the parameters in the input space of each neuron, h_i is the weight, φ is the activation function, and x is the input vector. The conventional activation function φ of RBFN is symmetric so that it can treat the data distribution equally. The wavelet function in the WaveARX network is similar to the RBF. Both the WaveARX network and RBFN make use of local functions. The terminology local approximation means each individual neuron approximation function produces a response only for a small range of input values. Local functions are important because they can analyze data distribution with precision. Despite the similarity between the wavelet function and the RBF, the wavelet function is generally considered to be superior to the RBF for identification. The RBFN approach is a data-driven technique, because it locates the center of the neuron and determines the neuron width based on data density. In spite of many advantages of RBFN, calculating the appropriate position of centers is time-consuming, especially when the data set is lengthy (Musavi et al., 1992). On the other hand, the wavelet function performs MRA for structure determination. Based on data density like the RBF, the wavelet's scale operation further defines different resolutions for the input data. A higher resolution cell reinforces the details, while a lower resolution cell displays the picture as a whole; therefore, a trade-off between accuracy and generalization should be made when we select the cells (neurons) to achieve the desired results. In addition, the wavelet function performs MRA, so the operation regions of any two functions can overlap. The degree of overlapping is determined based on what resolution we need. This ensures that the data outside the operation region would not be overlooked. However, there is no control of whether the operation regions of any two RBFs would overlap. This may cause poor generalization of RBFN. The localized wavelet function of the WaveARX network leads to the advantages of fast convergence and easy training over the widely used BPN, because the WaveARX network can simply correct those parameters that are influenced by the change in the input data. On the contrary, the global approximation character of BPN
deals with the overall data set. For global approximation a small change in the input values leads to modifications in all the weight estimates. Consequently, this increases computational complexity. Despite the discussed benefits from MRA, using MRA in multidimensional spaces results in structural complexity. MRA suggests that the scale value should increase by a factor of 2 (because a = 2 was selected for this study) for each step toward higher resolution. In other words, the required number of neurons is doubled each time the scale is increased to gain the next level of approximation accuracy. Likewise, with the increase in dimension, the number of neurons and parameters gets much larger. This often leads to an unnecessarily big network structure. To trim the redundant neurons, we make use of a scheme called the classical Gram-Schmidt algorithm (CGS), which is detailed in the next section. Extending wavelet functions to multidimensional spaces could be a problem, too. Computing nonlinearity in a wavelet function is sensitive to the dimension. As the dimension increases, approximating nonlinearity using the wavelet function becomes computationally expensive. In order to overcome this shortcoming, the norm concept is used. This is similar to the use of a norm with the RBF and is called the "radial wavelet function" in other research (Zhang, 1993; Bakshi et al., 1994). The wavelet function utilizing the norm concept is further defined as ψ: Rⁿ → R such that

ψ(x) = φ(||x||)    (13)

where φ: R → R is a radial wavelet function and x ∈ Rⁿ. Under such a circumstance, the wavelet function still satisfies the following admissibility condition (Zhang and Benveniste, 1992)

∫ |ψ̂(ω)|² / ||ω||ⁿ dω < ∞    (14)

where ψ̂(ω) is the multivariable Fourier transform of ψ with ω ∈ Rⁿ. Since φ is a scalar function, the wavelet function is independent of the number of input dimensions. Thus, we refine our definition of the WaveARX model as stated in eq 11
F(x) = Σ_k Σ_l w_{kl} ψ(||(x - t_{kl})/s_k||) + c^T x + c_0    (15)

where the double sum is the wavelet network part, s_k = a^k, t_{kl} = lba^k, k ∈ Z, and l ∈ Zⁿ. Note that, in the later discussion, the scale (s_k) and translation (t_{kl}) are referred to as inside parameters while the weights (w_{kl}), ARX coefficients (c), and the constant (c_0) are referred to as outside parameters.

3.2. Constructing and Training the WaveARX Network. The WaveARX network is established through training. Assume there is a set of given input and desired output pairs (x, y) ∈ (Rⁿ, R)

f(x_j) = y_j,  j = 1, 2, ..., p    (16)

where p is the total number of the training patterns and
f(x) is the function to be approximated. In order to reach the desired approximation accuracy, several conditions should be considered for constructing the
Figure 5. Approximation function in Φ.
Figure 4. Selecting neuron candidates: (a) one dimension; (b) two dimensions.
WaveARX network: (1) How many neurons (N_s) are needed? (2) What values should be taken as initial scales and translations? (3) How should we adjust the inside ({s_k}, {t_{kl}}) and outside ({w_{kl}}, c, c_0) parameters of the network so as to minimize the objective function? A quadratic error criterion is used to determine a suitable network, parameterized by a vector P,

P* = arg min_P J = arg min_P (1/2) Σ_{j=1}^{p} (F(x_j) - y_j)²    (17)

where F(x_j) = Σ_{k,l} w_{kl} ψ(||(x_j - t_{kl})/s_k||) + c^T x_j + c_0 and
P* represents the optimal values of the inside and outside parameter sets P; however, solution of this least-squares problem is not sufficient. A network with a reasonably small number of neurons is desired, so long as the approximation is sufficiently accurate. Three procedures are undertaken in a systematic way during training to address the concerns mentioned above. They include selecting neuron candidates, purifying the neurons, and optimizing the network parameters. Following is a detailed discussion of the three procedures which provide a guideline for selecting neurons, getting initial values for the network, and adjusting the parameters.

1. Selecting Neuron Candidates: Based on MRA, the distribution of the input training data can be viewed at different levels of resolution. A hidden neuron of the WaveARX network is assigned to the center location of each resolution cell that covers the training data. The scale function enables the cell length to be supported by the width of the neuron. The resolution cells in one- and two-dimensional spaces are illustrated in Figure 4. The index (k, l) refers to a scale level index and translation index, respectively. The location of each cell is given a unique index (k, l). For simplicity of explanation, the one-dimensional case is taken as an example (Figure 4a). Supposing the input data are distributed on (2,3) at Scale Level 2, Translation Position 3, they must also be on (1,2) and (0,1) at Scale Level 1, Translation Position 2 and Scale Level 0, Translation Position 1, respectively. From the input space point of view, these three cells are active, so they are selected as the neuron candidates. In this way, the three neurons overlap in
some regions. As the neuron candidates are found, the scale and translation values are determined on each (k, l). Consequently, the initial values of the weights (a^(-k)) between the input and hidden layers and of the thresholds (lb) are also obtained. Overall, the concept of MRA is employed to define each neuron of the net with respect to each resolution cell. The number of resolution cells which cover input data defines the number of candidate neurons needed in the wavelet neural network. The wavelet function defining each cell is the activation function associated with the corresponding network neuron. The scale and translation values of the cell wavelets are taken as the initial values for the adjustable network parameters. Thus, unlike the conventional feedforward neural network which lacks any theoretical framework in structure determination, MRA provides a step-by-step architectural design procedure so that a prototype of a wavelet feedforward neural network can be set up. However, not all of the neuron candidates may have a strong impact on the desired output. The initial structure should be refined by employing the output training data. Thus, in the next section, a new algorithm is introduced for purifying the candidate neurons.

2. Purifying the Neurons: As mentioned above, not every neuron candidate makes a significant contribution to the network model. The reason is that the projections of the output training data may be distributed densely on some neuron candidates and sparsely on the others. Using more neurons than needed often results in overfitting of the data and leads to poor generalization; eliminating neurons which have no function or respond only to a few data points may not reduce the accuracy of the model but will help its generalization and training attributes. For further clarification, this concept is briefly explained from a geometrical viewpoint (Figure 5). Consider f(x): Rⁿ → R a function to be approximated in a function space L²(Rⁿ). The approximation lies in a subspace of the function space spanned by the basis {φ_i, 1 ≤ i ≤ N + n} which consists of a set of wavelet functions ψ_i (i = 1, 2, ..., N) and the function x (x ∈ Rⁿ). (The bias term is computed separately.) If the projection of f onto span {φ_i} is insignificant relative to the projection onto {φ_j}, that means the neurons in {φ_i} do not have a significant impact on f(x) and are considered redundant. Under such a circumstance, it is possible to remove the neurons in {φ_i} and keep those in {φ_j} without sacrificing the performance of the approximating function. Thus, the projection F(x) onto the purified subspace is still close enough to the original function f(x).
More specifically,

f(x) = F(x) + e    (18)
where e is an error function which consists of the approximation error and any additional disturbance. If p is the number of training patterns available, an analogous expression can be written as

y_j = ŷ_j + e_j = Σ_{k,l} w_{kl} ψ_{kl}(x_j) + c^T x_j + c_0 + e_j,  j = 1, 2, ..., p    (19)

in which the sum collects the wavelet neurons and c^T x_j the linear terms, and ŷ_j represents the approximation model, F(x_j). A formal definition of this algorithm can be denoted in matrix form as

y = Φ w + e    (20)

where

y = [y_1 y_2 ... y_p]^T,  e = [e_1 e_2 ... e_p]^T

the columns of the regressor matrix Φ = [φ_1 φ_2 ... φ_{N+n+1}] contain, in order, the N wavelet-neuron responses, the n linear terms, and a constant column

φ_{N+n+1} = [1, 1, ..., 1]^T,  i = N + n + 1 (constant)

in which e is the training error and w collects the corresponding weights w_{kl}, ARX coefficients c, and constant c_0. As is well-known, the minimization of the least-squares criterion

min J = (1/2) min e^T e    (23)

leads to the "Normal Equation" for the estimated parameters w

w = (Φ^T Φ)^(-1) Φ^T y    (24)

Using eq 24 to achieve a desired performance may require an unnecessarily large number of neurons. This will add complexity and cause an ill-conditioned matrix Φ^T Φ. Hence, the direct solution of eq 24 is seldom used. To keep the simplest structure of the network, it is necessary to remove the unwanted neurons. There are several ways for computing the optimum, such as Gaussian elimination, orthogonal decomposition, and singular value decomposition. Here, an orthogonal method called the classical Gram-Schmidt algorithm (CGS) is used to direct the screening for these redundant neurons. It is one of the orthogonal least-squares methods introduced by Chen et al. (Chen et al., 1989; Chen and Weigand, 1992). Like regularization theory in RBFN, CGS is a way to prevent the network from overfitting the training data. In the following procedures, only wavelet and linear terms will be selected. However, the constant term, which is the bias of the output y, will still be kept in the approximation model. Thus, w and Φ are redefined as w_pur and Φ_pur with the constant column removed.

The regressor matrix Φ_pur can be factored into a product QΓ using an unnormalized QR decomposition, where Q is a p × (N + n) matrix with (N + n) orthogonal columns and Γ is an (N + n) × (N + n) unit upper triangular matrix. That is, Φ_pur = QΓ with

Γ =
[ 1  α_{12}  α_{13}  ...  α_{1,N+n}     ]
[ 0    1     α_{23}  ...  α_{2,N+n}     ]
[ ...                 ...               ]
[ 0    0     ...   1   α_{N+n-1,N+n}    ]
[ 0    0     ...   0        1           ]

so that y = QΓw_pur + e = Qg + e, where g = Γw_pur. By introducing the residuals e in eq 18 and under the assumption that e is white noise and therefore uncorrelated with F(x) or Φ, it yields

y^T y = Σ_{i=1}^{N+n} g_i² q_i^T q_i + e^T e    (29)

or

Σ_{j=1}^{p} y_j² = Σ_{i=1}^{N+n} g_i² q_i^T q_i + Σ_{j=1}^{p} e_j²    (30)

The term g_i² q_i^T q_i gives the energy of y expanded in the q_i direction. Due to q_i, the sum of squared regressors [SSR] is defined as

[SSR]_i = g_i² q_i^T q_i,  1 ≤ i ≤ N + n    (31)

Equation 31 gives a criterion to select the most important wavelet neuron or linear term, which maximizes [SSR]. Therefore, the search procedures are as follows. First, y is projected, in turn, onto each of the candidate basis vectors q_1^(i) = φ_i, i = 1, 2, ..., N + n; that is,

g_1^(i) = (q_1^(i))^T y / ((q_1^(i))^T q_1^(i))    (32)

Then compute the sum of squared regression [SSR]_1^(i) by (g_1^(i))² (q_1^(i))^T q_1^(i) and select the most important basis which maximizes the sum of squared regression (i.e., max_i {[SSR]_1^(i), 1 ≤ i ≤ N + n}). Thus, the basis vector of the unidimensional space, q_1 = q_1^(i_1) = φ_{i_1}, is selected as the first column of Q, where i_1 is the index of the most important basis which has the maximum [SSR]_1. Next, the second most important neuron from the remaining set is to be selected. The component of each
of the remaining neurons {φ_i, 1 ≤ i ≤ N + n and i ≠ i_1} along the direction of q_1 is to be subtracted off:

α_{1i} = q_1^T φ_i / (q_1^T q_1),  q_2^(i) = φ_i - α_{1i} q_1    (33)

Similar to the first step, g_2 for each of the remaining candidates can be computed as shown below:

g_2^(i) = (q_2^(i))^T y / ((q_2^(i))^T q_2^(i))    (34)

The next most important neuron can thus be found by identifying the neuron whose g_2 maximizes [SSR]_2 (i.e., getting i_2 that has max_i {[SSR]_2^(i), 1 ≤ i ≤ N + n and i ≠ i_1}). Therefore, q_2 = q_2^(i_2) = φ_{i_2} - α_{12} q_1 is selected as the second column of Q. Since q_1 and q_2 are derived from φ_{i_1} and φ_{i_2}, span[φ_{i_1}, φ_{i_2}] and span[q_1, q_2] are the same. The same computation procedures are repeated until the desired number of bases (N_s) is reached. (See step 3 below for determining the desired number of bases.) The procedure of purifying the neurons using the CGS scheme can be summarized as follows.

Step 1: k = 1, 1 ≤ i ≤ N + n, compute

q_1^(i) = φ_i,  g_1^(i) = (q_1^(i))^T y / ((q_1^(i))^T q_1^(i)),  [SSR]_1^(i) = (g_1^(i))² (q_1^(i))^T q_1^(i)    (35)

find the index i_1 that yields

max_i {[SSR]_1^(i), 1 ≤ i ≤ N + n}    (36)

and select

q_1 = q_1^(i_1) = φ_{i_1}    (37)

Step 2: k = k + 1 and 1 ≤ i ≤ N + n, i ≠ i_1, ..., i ≠ i_{k-1}, compute

α_{jk}^(i) = q_j^T φ_i / (q_j^T q_j), j = 1, ..., k - 1;  q_k^(i) = φ_i - Σ_{j=1}^{k-1} α_{jk}^(i) q_j;  g_k^(i) = (q_k^(i))^T y / ((q_k^(i))^T q_k^(i))    (38)

find the index i_k that yields

max_i {[SSR]_k^(i), 1 ≤ i ≤ N + n, i ≠ i_1, ..., i ≠ i_{k-1}}    (39)

and select q_k = q_k^(i_k).

Step 3: To generalize the selection procedure, two criteria are used, including Akaike's information criterion (AIC(4)) and the Bayesian information criterion (BIC) (Kashyap, 1977). These criteria consist of a scalar function with two terms. The first term measures goodness of fit, and the other penalizes larger models. Both criteria are taken into account when we evaluate the performance and complexity of the network. When AIC(4)_k ≥ AIC(4)_{k-1} or BIC_k ≥ BIC_{k-1}, set N_s = k; otherwise, proceed to find the next most important direction. Therefore, the total number of neurons in the WaveARX network is determined when this step is completed.

Step 4: Since the space spanned by the selected q_{i_1}, ..., q_{i_{N_s}} is the same space spanned by φ_{i_1}, ..., φ_{i_{N_s}}, the N_s wavelets are selected, and the outside parameters follow from

w_pur = Φ_pur⁺ y,  where Φ_pur⁺ = (Φ_pur^T Φ_pur)^(-1) Φ_pur^T

and Φ_pur⁺ is called the pseudoinverse of the matrix Φ_pur.
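A compact sketch of the selection loop in Steps 1-3 is given below (our Python rendering of the equations above; the stopping rule is simplified to a fixed number of terms rather than AIC/BIC, and all names are ours).

```python
import numpy as np

def cgs_select(Phi, y, n_select):
    """Forward selection of regressor columns of Phi by the [SSR] criterion (eqs 31-39).

    Phi : (p, N+n) array of candidate wavelet-neuron and linear regressors.
    y   : (p,) desired output vector.
    Returns the column indices selected, in order of importance.
    """
    p, m = Phi.shape
    Q = []                                  # orthogonalized columns q_1, q_2, ...
    selected = []
    for _ in range(n_select):
        best_i, best_ssr, best_q = None, -np.inf, None
        for i in range(m):
            if i in selected:
                continue
            q = Phi[:, i].astype(float).copy()
            for qj in Q:                    # subtract components along chosen directions (eq 38)
                q -= (qj @ Phi[:, i]) / (qj @ qj) * qj
            qq = q @ q
            if qq < 1e-12:                  # numerically dependent on selected columns
                continue
            g = (q @ y) / qq
            ssr = g * g * qq                # [SSR]_k^(i) = g^2 q^T q, eq 31
            if ssr > best_ssr:
                best_i, best_ssr, best_q = i, ssr, q
        selected.append(best_i)
        Q.append(best_q)
    return selected
```

With the purified columns Phi[:, selected], the outside parameters then follow from the least-squares pseudoinverse, for example numpy.linalg.pinv(Phi[:, selected]) @ y, mirroring the expression quoted above.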
The CGS algorithm is selected over other orthogonal least-squares methods, such as modified Gram-Schmidt, because the former does not require the storage of the entire large matrix during computation (Chen et al., 1989; Pröll and Karim, 1994).

3. Optimizing the Network Parameters: After purifying the neurons, a refined network which closely models the system is achieved. However, a higher resolution ratio for each neuron can be obtained using a gradient search algorithm. This procedure is optional; it is used only when the training error is not small enough or more approximation accuracy is needed. Past research, such as by Bakshi and Stephanopoulos (Bakshi and Stephanopoulos, 1993), improves the wavelet network approximation accuracy by increasing the number of neurons. This may yield a more complex network structure. An alternative way to their incremental strategy in structure determination is to optimize the wavelet functions using the gradient search algorithm. Rather than leaving the wavelet shape fixed or inserting additional neurons, we can increase the model accuracy by adjusting the scale and translation of the wavelet function based on the objective function (eq 17). In each stage q,

Θ^(q+1) = Θ^q - λ [H(Θ^q)]^(-1) ∇J(Θ^q)    (43)

w = Φ_pur⁺ y    (44)

where Θ = [{t_{kl}}, {s_k}]; [H(Θ^q)]^(-1) is the inverse of the Hessian matrix of J(Θ^q); ∇J contains the gradient vectors, ∂J/∂t and ∂J/∂s (t = {t_{kl}}, s = {s_k}); and λ is the step length. Newton's method is used here to minimize the objective function J. Since the model accuracy is improved by adjusting the wavelet functions rather than adding extra neurons, the WaveARX network structure can be kept as simple as possible. Actually, the optimization is 2-fold: adjusting inside parameters and outside parameters separately. This is opposed to the gradient method used by Zhang and Benveniste (Zhang and Benveniste, 1992) and Szu et al. (Szu et al., 1992), which adjusts all translation, scale, and weight coefficients of the wavelet network at one time. Their performance may be trapped in a local minimum error
and cause slow training. We use a two-stage optimization scheme which undertakes the gradient method (eq 43) for the inside parameters (Θ) and the least-squares estimation (eq 44) for the outside parameters (w). This method leads to efficient computation. This optimization scheme is not the only choice; it depends on the trade-off between performance and efficiency. Once the network is optimized, the best possible approximation model is achieved for the given system and available training data. In summary, the training procedures are outlined as follows. Suppose a set of data (x_j, y_j), j = 1, 2, ..., p, that contains training and testing data is available. Set the desired operation region that can be covered by the wavelet function.

1. Set a proper scale level to cover the resolution (or density) of the input data.
2. Construct the MRA plane and select neuron candidates; consequently, the inside parameters are obtained.
3. k = 1.
4. Select the most important neuron using CGS.
5. If the criterion (AIC(4) or BIC) is reached, go to 6; else k = k + 1 and go to 4.
6. Compute the outside parameters.
7. Calculate J for the training and testing data, respectively.
8. If J cannot be reduced or has reached the preset criterion, go to 9; else adjust the inside parameters and then go to 6.
9. If J is satisfactory, stop training; else reset a higher scale level and go to 1.
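To fix ideas, a minimal sketch of the WaveARX forward pass of eq 15 and of the least-squares step for the outside parameters (eq 44) is given below; this is our illustration under the stated equations, with array shapes and helper names that are our own assumptions rather than the authors' code. The two-stage training of steps 6-8 then alternates a gradient or Newton correction of the inside parameters (t, s) against eq 17 with a re-solution of this linear least-squares problem.

```python
import numpy as np

def radial_mexican_hat(r):
    # Radial mother wavelet phi(||.||) placed on each hidden neuron (eq 13)
    return (2.0 / np.sqrt(3.0)) * np.pi ** (-0.25) * (1.0 - r ** 2) * np.exp(-r ** 2 / 2.0)

def wavearx_predict(X, t, s, w, c, c0):
    """WaveARX output of eq 15: wavelet network part plus parallel ARX (linear) part.

    X : (p, n) inputs; t : (N, n) translations; s : (N,) scales;
    w : (N,) wavelet weights; c : (n,) linear coefficients; c0 : scalar bias.
    """
    r = np.linalg.norm((X[:, None, :] - t[None, :, :]) / s[None, :, None], axis=2)
    return radial_mexican_hat(r) @ w + X @ c + c0

def fit_outside_parameters(X, y, t, s):
    # Least-squares solution for the outside parameters (w, c, c0), cf. eq 44
    r = np.linalg.norm((X[:, None, :] - t[None, :, :]) / s[None, :, None], axis=2)
    Phi = np.hstack([radial_mexican_hat(r), X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    n_wave = t.shape[0]
    return theta[:n_wave], theta[n_wave:-1], theta[-1]
```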
4. Simulation Studies

Several static and dynamic simulation studies are presented, respectively, in sections 4.1 and 4.2 to test the capabilities of the WaveARX network. In the first example of section 4.1, only the wavelet network part of the WaveARX network is employed to compare with two other existing wavelet-based neural networks. The purpose of this comparison is to show the possible advantages of CGS and the two-stage optimization used by the proposed wavelet network. Departing from this point, the wavelet network is combined with a linear ARX model and the combinational model, WaveARX, is further investigated. In the rest of the examples, various types of mathematical models are used to generate comparisons between the WaveARX network and other widely used linear and nonlinear methods. For static identification, the methods include multiple linear regression (MLR), polynomial regression (PolyReg), the backpropagation neural network (BPN), and the radial basis function network (RBFN). Note that ARX or ARMA is the linear method used in section 4.2 instead of MLR since dynamic systems will be investigated. For PolyReg, the model written by the authors utilizes self-product terms but no cross terms. The BPN and RBFN methods come from the neural network toolbox of MATLAB (Howard, 1993). The BPN method uses the adaptive learning rate. The user must determine the number of neurons by trial and error. The initial parameters of the BPN method are selected randomly. The radius of RBFN is defined beforehand by the user, and the number of neurons and the width of the radius are obtained by trial and error in order to seek a good fit to the model. In the WaveARX network, the Laplacian of the Gaussian function, which satisfies the admissibility condition of wavelets, is used as the activation function.

For these various examples, the WaveARX network or the wavelet network alone has as good as or a better approximation ability when compared to these methods. Its advantages are manifested in several ways, including accuracy (R²), number of neurons, number of training iterations (epochs), or quantifying linear and nonlinear contributions using the WaveARX net. R² is determined based on Milton and Arnold's definition (Milton and Arnold, 1990). It is denoted as

R² = (1 - SSE/S_yy) × 100%    (45)

where SSE = Σ_{i=1}^{p} (F(x_i) - y_i)², S_yy = Σ_{i=1}^{p} (y_i - ȳ)², and ȳ is the mean value of the desired validation output. As already mentioned, the WaveARX network retains two types of system interactions

y = y_Wave + y_ARX    (46)

where y_Wave is the nonlinear contribution and y_ARX is the linear contribution. In order to quantify these two characteristics, the linear and nonlinear contributions are scaled between 0 and 1. The scaled contributions can be written as

ŷ_Wave = (y_Wave - C_Wave)/w_Wave,  ŷ_ARX = (y_ARX - C_ARX)/w_ARX    (47)

where w_Wave = y_Wave^max - y_Wave^min and w_ARX = y_ARX^max - y_ARX^min are the ranges of the two contributions and C_Wave = y_Wave^min and C_ARX = y_ARX^min translate them to the interval [0, 1]. Therefore, the contributions from linearity and nonlinearity can be defined by

% linear contribution = w_ARX / (w_Wave + w_ARX)    (48)

% nonlinear contribution = w_Wave / (w_Wave + w_ARX)    (49)

Note that the constants, C_Wave and C_ARX, do not change the interpretation of process data because they translate the scaled contributions only.
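For reference, eqs 45, 48, and 49 translate directly into a few lines of Python (our sketch; variable names are ours, and the weights are taken as the ranges used for the 0-1 scaling in eq 47 as written above).

```python
import numpy as np

def r_squared(y_pred, y_true):
    # Goodness of fit of eq 45, in percent
    sse = np.sum((y_pred - y_true) ** 2)
    syy = np.sum((y_true - np.mean(y_true)) ** 2)
    return (1.0 - sse / syy) * 100.0

def contribution_split(y_wave, y_arx):
    # Linear/nonlinear split of eqs 48-49 from the ranges of the two model parts
    w_wave = np.max(y_wave) - np.min(y_wave)
    w_arx = np.max(y_arx) - np.min(y_arx)
    linear = w_arx / (w_wave + w_arx)
    return 100.0 * linear, 100.0 * (1.0 - linear)
```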
4.1. Static Models. A static model is a function that maps from the input domain to the output domain without using any differential or integral operators. In this section, two examples are investigated. The first one compares our wavelet network with other wavelet-based neural networks. The second one demonstrates the static modeling ability of the five algorithms.

Example 1: Piecewise Function. The first simulation study is undertaken to compare our systematically designed wavelet network with two other wavelet-based networks and their implementation algorithms (Zhang and Benveniste, 1992; Bakshi et al., 1994). Zhang and Benveniste and Bakshi et al. have published wavelet
Figure 6. (a) Approximation results of wavelet network before optimization based on Bakshi's operation condition (27 neurons, L∞ error = 0.5378). (b) Approximation results of wavelet network after optimization based on Bakshi's operation condition (10 iterations, 27 neurons, L∞ error = 0.1456). (c) Approximation results of wavelet network after optimization based on Zhang's operation condition (25 iterations, 7 neurons, δ = 0.0480).
networks that are very similar to each other and to our wavelet formulation. However, the design procedures and algorithms to implement the networks are different. A mathematical equation studied by Zhang and Benveniste and Bakshi et al. is used to compare our wavelet network implementation with their wavelet-based network and design procedures. The equation is defined as
f(x) =
  -2.186x - 12.864,                           -10 ≤ x < -2
  4.246x,                                      -2 ≤ x < 0
  10 e^(-0.05x-0.5) sin[(0.03x + 0.7)x],        0 ≤ x ≤ 10    (50)
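A direct transcription of eq 50 is given below (our sketch; the uniform sampling grid used to generate points over [-10, 10] is an assumption for illustration, not a statement of how the published training sets were produced).

```python
import numpy as np

def piecewise_f(x):
    # Piecewise test function of eq 50 (Zhang and Benveniste, 1992)
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    left = x < -2.0
    mid = (x >= -2.0) & (x < 0.0)
    right = x >= 0.0
    y[left] = -2.186 * x[left] - 12.864
    y[mid] = 4.246 * x[mid]
    y[right] = 10.0 * np.exp(-0.05 * x[right] - 0.5) * np.sin((0.03 * x[right] + 0.7) * x[right])
    return y

x_train = np.linspace(-10.0, 10.0, 100)   # 100 sample points on [-10, 10] (grid assumed)
y_train = piecewise_f(x_train)
```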
Note that in this example it is not meant to find the optimal performance of WaveARX. The goal is to make a fair comparison to the other two formulations. Since the other two networks do not incorporate a linear model, only the wavelet network part of the WaveARX is utilized. The comparison also uses the published results. Therefore, our design is constrained to employ the (a) same number of neurons, (b) same training data set, and (c) same evaluation criterion. The research of Bakshi et al. develops a network with 27 neurons which uses 100 training data points to minimize the L∞ error. The resulting L∞ error for the Wave-Net was 0.6842 (Bakshi et al., 1994). Under the same constraints our design procedure yields an L∞ error before optimization of 0.5378. (See Figure 6a.) This probably results from the CGS that is used for our network, which is able to select the most significant 27 neurons and avoids using neurons that can be pruned. In applications without the number of neurons preset, this ensures a compact network structure. The Wave-Net procedure does not include any optimization after the network is designed. However, when we optimize the wavelet function of our network by 10 epochs, the L∞ error of the approximation is reduced to 0.1456 (Figure 6b). This indicates that our optimizing
wavelet function formulation can improve the model accuracy, while the Wave-Net with fixed wavelet function parameters cannot without inserting additional neurons. From another angle, using fewer neurons, our wavelet network can perform as well as the Wave-Net. For comparison between our wavelet network and Zhang and Benveniste's, a uniformly sampled test data set of 200 points is generated and 7 neurons are used as was done in their study. It is shown that the figure of merit δ (Zhang and Benveniste, 1992) of our wavelet network before optimization is 0.2762. With 25 epochs of optimization (Figure 6c), our wavelet network (δ = 0.0480) surpasses Zhang and Benveniste's using 10 000 epochs (δ = 0.05057). The underlying problem of Zhang and Benveniste's network may be that their optimization is performed for all translation, scale, and weight coefficients at the same time. This may trap their network in a local minimum error and lead to slow training. As opposed to their optimization method, the two-stage optimization scheme used for our wavelet network adjusts the inside and then outside parameters in an iterative fashion. Since the least-squares estimation method finds the weights (outside parameters), the gradient method has a reduced parameter searching space as it only finds the translation and scale parameters (inside parameters).
where eqs a-c contain linear, nonlinear, and a mixture of linear and nonlinear terms, respectively. Each of them has three inputs. The inputs are random sequences uniformly distributed within the interval of [-1, 11. A total of 100 sets of training data are generated from the simulations for each case. An additional 100 sets of data are also used as validation data to measure the goodness of fit (R2)after the models are trained. Table la-c illustrates the simulation results of eqs a-c, respectively. From Table l a , it is shown that all the algorithms have very good fits. Among them, WaveARX is a little better than the nonlinear models, BPN and RBFN, and it achieves a perfect fit (R2= 100%) as does the linear modeling technique, MLR. Using the wavelet network alone a fit of 97.84% is reached. Besides, the coefficients ( c )of the WaveARX network are 0.8,0.4, and 0.2, which equal those in eq a. The weight (w) for the neuron in indicating that no the WaveARX net is -1.67 x nonlinear contribution is found. WaveARX's good fit and its exact match with the equation coefficients suggest that the WaveARX network is capable of learning the system's linear characteristics and placing it appropriately in the ARX structure. Table l b demonstrates the ability of these algorithms to mimic the nonlinear eq b. It is not surprising t o find that the linear model, MLR, has a poor fit and cannot learn the nonlinearity very well. PolyReg is not good a t fitting the nonlinearity either, probably because it lacks cross terms. On the contrary, the fits of the nonlinear methods, BPN, RBFN, wavelet network, and WaveARX, look a good deal better. Using the wavelet network alone with 400 training epochs (R2= 92.68%)
Table 1. Comparison among Five Identification Algorithms Using Mathematical Models That Contain Linear and/or Nonlinear Characteristics

(a) y = 0.2x_1 + 0.4x_2 + 0.8x_3
model              R² (%)   neurons   epochs
MLR                100.00
BPN                 99.85      5        250
RBFN                97.74     20
wavelet network     97.84      7        140
WaveARX^a          100.00      1          0

(b) y = x_1/(x_2 + x_3) + sin[(x_3 + 4)x_1]
model              R² (%)   neurons   epochs
MLR                 46.75
PolyReg [3]^b       69.70
BPN                 93.19      5       1000
RBFN                94.03     25
wavelet network     92.68      5        400
WaveARX^c           93.70      5         50

(c) y = 0.6x_1 + 0.4x_2 + 3x_1x_3 + 2.5 sin(x_3 - 1)
model              R² (%)   neurons   epochs
MLR                 87.76
PolyReg [3]         95.40
BPN                 99.48      5       1000
RBFN                96.25     25
wavelet network     96.97      5        200
WaveARX^d           99.51      5         60

^a Linearity: 100.0%. Nonlinearity: 0.0%. ^b Represents a polynomial of the third degree. ^c Linearity: 3.3%. Nonlinearity: 96.7%. ^d Linearity: 70.7%. Nonlinearity: 29.3%.
is almost as good as BPN with 1000 epochs. This suggests that it takes a lot longer to train BPN than the wavelet network. It is also interesting to note that it takes far more neurons for the RBFN (25 neurons) to achieve a similar approximation performance as the wavelet network (5 neurons). With five hidden neurons, BPN and WaveARX achieve an R² of about 93%. However, WaveARX with 50 epochs can obtain a similar approximation result as BPN, which is trained by 1000 epochs. Obviously, the local nature of the WaveARX/wavelet network and the initial values for its parameters obtained from the design procedures save computation time compared with the conventional global BPN technique. Being faced with an equation containing both linear and nonlinear terms as in eq c, WaveARX and the wavelet network still produce a good performance. In Table 1c, WaveARX, wavelet network, BPN, RBFN, and PolyReg with fits above 95% all look promising, whereas the MLR fit is relatively poor at 88%. The approximation abilities of WaveARX and BPN are the best, but BPN, again, experiences a long training time. The coefficients (c) of the linear part of the WaveARX net are 0.6003 and 0.4038, which are very close to those of the linear terms of eq c (0.6 and 0.4). The ARX model of the WaveARX network is able to isolate the linear characteristics of the first two inputs and automatically leave out the nonlinear third input. In this sense, the WaveARX network can display the linear and nonlinear contributions separately and provide an accurate description of the relationship between the inputs and the desired output. Furthermore, the WaveARX network is able to quantify how linear and how nonlinear a system is. Given eq a, we get 0% of nonlinear contribution and 100% of linear contribution through the WaveARX network, which matches the real characteristics of the equation. From eq c, the WaveARX network is also able to quantify the input to output behavior as 70.7% of linearity and 29.3% of nonlinearity. As for eq b, which is a nonlinear system visually, the WaveARX network essentially contains 3.3% of linear contribution under the operation condition of interest.
Figure 7. ARMA model as a two-layer neural network.
4.2. Dynamic Models. A dynamic model refers to a system whose output value is always dependent on time. That is, the current output is dependent not only on the current input but also on past inputs and outputs. In order to get an accurate dynamic model, the process input data should cover enough of the expected operation region and persistently excite the characteristics of the output data set (Slotine and Li, 1991). Although the algebraic structure of the WaveARX network may appear static, it is actually dynamic. For simple illustration of this point, an ARMA model which is often used for model-based control is shown to be analogous to the WaveARX network. A typical ARMA model is denoted as

y(k + 1) = Σ_{i=0}^{n_y} a_i y(k - i) + Σ_{j=0}^{n_u} b_j u(k - j)    (51)

In eq 51, the model is dynamic due to the summations over past inputs and outputs. It is also equivalent to a simple two-layer (consisting of one input and one output layer) neural network without a hidden layer and activation functions. The equivalence is depicted by comparing Figures 3 and 7. In Figure 3, the vector x contains u(k), u(k - 1), ..., u(k - n_u), y(k), y(k - 1), ..., y(k - n_y), and the model output corresponds to y(k + 1). The three-layer WaveARX neural network with wavelet functions on the hidden neurons is more sophisticated than the ARMA model and should be capable of achieving better results and more general applications. Therefore, the WaveARX network structure can be employed for dynamic modeling. In the following simulation studies, suppose y(k) is the current time series value. The idea is to use the model

ŷ(k + 1) = F(x(k))    (52)

to predict the process behavior one step ahead for y(k + 1), where the inputs to the model

x(k) = [y(k), y(k - 1), ..., y(k - n_y), u(k), u(k - 1), ..., u(k - n_u)]^T    (53)

are current and past input/output observations of the series. In eq 53, n_y and n_u represent the maximum lags on all output and input, respectively. Two examples are tested here to compare the dynamic modeling ability among several algorithms.
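The one-step-ahead regressor of eqs 52 and 53 can be assembled as in the sketch below (our illustration; n_y = n_u = 2 matches the lags used in Example 3, and the synthetic series stand in for the pH and NaOH flow-rate records).

```python
import numpy as np

def build_regressors(y, u, n_y=2, n_u=2):
    """Build x(k) = [y(k),...,y(k-n_y), u(k),...,u(k-n_u)]^T and targets y(k+1) (eqs 52-53)."""
    start = max(n_y, n_u)
    X, target = [], []
    for k in range(start, len(y) - 1):
        row = [y[k - i] for i in range(n_y + 1)] + [u[k - j] for j in range(n_u + 1)]
        X.append(row)
        target.append(y[k + 1])
    return np.array(X), np.array(target)

# Placeholder series standing in for measured output and input records
y_series = np.sin(0.1 * np.arange(200))
u_series = np.cos(0.07 * np.arange(200))
X, y_next = build_regressors(y_series, u_series)
```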
Table 2. Identifying pH CSTR System Using Five Algorithms

(a) Operating Region: pH around 9
model              R² (%)   neurons   epochs
ARMA                92.22
PolyReg [2]         93.09
BPN                 96.98      9       1000
RBFN                94.05      8
wavelet network     96.77      8         25
WaveARX^a           93.71      6
WaveARX^b           97.15      6         10

(b) Operating Region: pH around 7
model              R² (%)   neurons
ARMA                55.98
PolyReg [2]         71.85
BPN                 79.46      9
RBFN                79.07      8
wavelet network     77.01      8
WaveARX^c           82.41      6

^a WaveARX: before optimization. ^b Linearity: 46.1%. Nonlinearity: 53.9%. ^c Linearity: 64.3%. Nonlinearity: 35.7%.
Figure 8. (a) Input of training data set for pH CSTR. (b) Output of training data set for pH CSTR (pH around 9).
Figure 9. Predicting pH CSTR (pH around 9).
on the pH CSTR simulation case of Bhat and McAvoy (Bhat and McAvoy, 1990). The dynamic mathematical model is for the response of pH to changes in NaOH flow rate. To train the prediction techniques, a total of 1000 sets of inputloutput data are used. Another 500 sets of new data are employed for validation. They are produced by adding 3%pseudorandom amplitude step changes around the steady-state NaOH flow rate of 515 L'min and pH of 9. It is found from our experiment that using the two time lags of the pH and NaOH flow rate achieved the best results (i.e., n, = 2 and nu = 2). The sampling period is 0.4 min. The mathematical model and parameters are the same as those used in Bhat and McAvoy's study. PRBS signals were used by Bhat and McAvoy to generate the training set. They produced a uniform distribution over all the operation region. However, in reality the operating point of the reactor is more likely to be around the desired operating point. Therefore, the input data for this study are generated around the operating point using the Gaussian distribution. This data generating method also creates more of a challenge for generalization because of the uneven data distribution. The approximation models are first trained around the pH value of 9 (Figure 8). The results of learning
The results of learning the continuous pH CSTR system are shown in Table 2a and pictured in Figure 9. It is found that, prior to optimization, the WaveARX network with six hidden neurons performs almost as well as the other algorithms in terms of fitness (Table 2a). After 10 epochs of optimization, WaveARX surpasses the others and reaches an R2 of 97% for approximating the dynamic pH CSTR. In effect, the 10 epochs of optimization provide an opportunity to adapt the internal parameters of the initial WaveARX network. The wavelet network alone, with 8 neurons, also reaches an R2 of 96.77%, but only after 25 epochs of optimization. It achieves approximation results similar to BPN, while BPN requires many more training epochs. This demonstrates that a small optimization effort during training can enhance the modeling accuracy of the WaveARX/wavelet network. Although all the algorithms predict the dynamic pH CSTR system well when the pH value is around 9, their generalization ability is tested using validation data around the pH value of 7. No new training is done for this new operation region. Table 2b shows the approximation outcome using 250 sets of new input and output pairs produced by adding 1% pseudorandom amplitude step changes around the steady state of NaOH flow rate = 513 L/min and pH = 7.
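The fitness values in Tables 2 and 3 are reported as R2. Assuming the conventional coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the observed output), which is our reading rather than a definition stated in this section, a minimal check is:

```python
import numpy as np

def r_squared_percent(y, y_hat):
    """R2 = 1 - SS_res / SS_tot, reported in percent as in Tables 2 and 3."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - ss_res / ss_tot)
```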
Figure 10. Predicting pH CSTR using ARMA and WaveARX network (pH around 7).
Under such circumstances, ARMA (Table 2b) is rather poor at modeling the reactor's response (R2 = 56%) because of the difference in system character between the two steady states and, perhaps, the nonlinearity of the operation region around pH = 7. Figure 10 shows the predictions of ARMA and the WaveARX network in comparison with the actual response; the performance of the nonlinear model, WaveARX, is clearly superior. The four neural networks, BPN, RBFN, wavelet network, and WaveARX network, are better than ARMA and PolyReg in terms of generalization. A possible advantage of these network functions may lie in their exponential terms, which cover more than a second-order polynomial in the sense that an exponential can be expanded into a Taylor series. WaveARX also performs a little better than PolyReg, BPN, and RBFN (Table 2b). The wavelet network alone produces approximations similar to BPN and RBFN. Compared with the first training results given in Table 2a, the models do not generalize to the new inputs as well as might be expected, probably because the previous training does not cover enough data information around the pH value of 7. In spite of this insufficient data information, WaveARX still achieves an R2 of 82%, the best approximation result among the tested algorithms.
Example 4: CSTR System with Multi-Input Variables. In this example, a CSTR with a nonisothermal exothermic reaction is simulated using the parameters of You and Nikolaou's study (You and Nikolaou, 1993). Four input and three coupled dependent variables are employed. The input variables are the input temperature (Ti), input concentration (CAi), input flow rate (Fi), and input heat ratio (u), while the output variables are the liquid level (L), output concentration (CA), and output temperature (T). The output temperature (T) is also the output variable to be approximated. A total of 500 sets of training data and 500 sets of testing data are generated by adding 3% pseudorandom amplitude changes to all input signals around the steady-state feed conditions. We use 0.2 h as the sampling period. From our studies, we found that using one time lag for u and Ti and two time lags for CAi, Fi, L, CA, and T achieved the best results.
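With several inputs and outputs carrying different maximum lags, the regressor of eq 53 simply concatenates the current and lagged samples of every signal. The sketch below shows one way to assemble it; the dictionary keys and the random series are placeholders for this illustration, with the lag choices taken from the text above.

```python
import numpy as np

def multivariable_regressor(signals, lags, k):
    """Concatenate current and past samples of each signal: a signal with
    maximum lag L contributes s(k), s(k-1), ..., s(k-L) to the regressor."""
    parts = []
    for name, series in signals.items():
        parts.extend(series[k - i] for i in range(lags[name] + 1))
    return np.array(parts)

# Lag choices reported for Example 4: one lag for u and Ti, two for the rest
lags = {"u": 1, "Ti": 1, "CAi": 2, "Fi": 2, "L": 2, "CA": 2, "T": 2}
rng = np.random.default_rng(1)
signals = {name: rng.normal(size=100) for name in lags}   # placeholder data
x_k = multivariable_regressor(signals, lags, k=10)        # 19 lagged samples here
```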
Figure 11. Linear contributions of nonisothermal CSTR extracted from ARX of the WaveARX network.
Figure 12. Nonlinear contributions of nonisothermal CSTR extracted from wavelet net of the WaveARX network.
Since it is difficult to present the relationships among the four independent and three dependent variables in a graph, there is little insight beforehand on whether the system's behavior in the operation region around the steady state is linear or nonlinear. From the training results in Table 3, it is found that all methods work very well. During the neuron purifying procedure, in which the most important directions are identified in order (eqs 35-42), most of the linear contributions enter earlier than the nonlinear ones. That is, the ARX part of the WaveARX structure has a significant impact on the approximation model. The linear and nonlinear contributions can be separated according to the two structures, the ARX and the wavelet net, in the WaveARX network, as shown in Figures 11 and 12. Note that the scale in Figure 11 for the ARX contributions covers nearly 100 K, while that in Figure 12 for the nonlinear wavelet network covers less than 1 K. This shows that the system character for the selected variations around the steady-state operating point is highly linear. To partly verify this conclusion, we consider the results for ARMA shown in Table 3 and Figure 13. Table 3 shows that the linear modeling technique alone provides an excellent fit (R2 = 99.8%), and the two model predictions in Figure 13 look nearly identical.
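The neuron purifying procedure referred to above (eqs 35-42, not reproduced in this section) ranks the columns of the WaveARX regression matrix by their contribution to the output. The sketch below is only in that spirit: a greedy, Gram-Schmidt-style selection that scores each remaining column by the fraction of output energy its orthogonalized component explains, similar to orthogonal least squares (Chen et al., 1989); it is not the paper's exact formulation.

```python
import numpy as np

def rank_columns_by_contribution(Phi, y):
    """Order the columns of Phi (one column per ARX term or wavelet neuron)
    by how much remaining output energy each explains after orthogonalizing
    against the columns already chosen."""
    Phi = np.array(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    remaining = list(range(Phi.shape[1]))
    residual = y.copy()
    order, ratios = [], []
    y_energy = y @ y
    for _ in range(Phi.shape[1]):
        scores = [((Phi[:, j] @ residual) ** 2) / ((Phi[:, j] @ Phi[:, j]) * y_energy)
                  if Phi[:, j] @ Phi[:, j] > 1e-12 else 0.0
                  for j in remaining]
        best = remaining[int(np.argmax(scores))]
        order.append(best)
        ratios.append(max(scores))
        q = Phi[:, best].copy()
        qq = q @ q
        if qq <= 1e-12:                     # nothing useful left to select
            break
        residual -= q * (q @ residual) / qq
        for j in remaining:                 # deflate unchosen columns against q
            if j != best:
                Phi[:, j] -= q * (q @ Phi[:, j]) / qq
        remaining.remove(best)
    return order, ratios
```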
Table 3. Comparison among Five Algorithms in Identifying CSTR with Nonisothermal Exothermic Reaction

model              R2 (%)    neurons    epochs
ARMA               99.83
PolyReg [3]        99.83
BPN                99.73     11         2500
RBFN               99.72     60
wavelet network    99.10     6          440
WaveARX^a          99.84     2          4

^a Linearity: 98.8%. Nonlinearity: 1.2%.
Figure 13. Comparison between ARMA and WaveARX network in predicting nonisothermal CSTR.
For further illustration, plots of the steady-state behavior around the steady-state operating point, covering the amplitude of the input variations, are given in Figure 14. Each panel shows the variation of only one input with the others held constant at their steady-state values. Finally, we separate the linear from the nonlinear contributions using the WaveARX network. It is found that the CSTR behavior is mostly linear (98.8%), while the nonlinearity accounts for only 1.2%. These results confirm that the behavior is essentially linear, and the WaveARX network demonstrates its ability to identify it. The importance of including a linear modeling technique in parallel with the neural network is again illustrated. This example has been used several times in the literature as a highly nonlinear problem; in fact, for the operation region treated, it is essentially a linear problem.
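The linearity/nonlinearity percentages quoted above and in the footnotes of Tables 2 and 3 imply a split of the WaveARX prediction into its ARX and wavelet-net parts. How the paper computes that split is given by its earlier equations, not repeated here; the sketch below simply illustrates one such measure, the share of sum-of-squares energy contributed by each part, which is our assumption for this example.

```python
import numpy as np

def contribution_shares(y_arx, y_wavelet):
    """Percentage split between the linear (ARX) and nonlinear (wavelet net)
    contributions to a prediction, measured here by sum-of-squares energy."""
    e_lin = float(np.sum(np.square(y_arx)))
    e_nonlin = float(np.sum(np.square(y_wavelet)))
    total = e_lin + e_nonlin
    return 100.0 * e_lin / total, 100.0 * e_nonlin / total
```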
Figure 14. Steady-state relationships between T and Fi, CAi, Ti, and u in nonisothermal CSTR.
5. Conclusion and Future Study

In summary, the WaveARX network is a promising system identification algorithm incorporating the MRA features of wavelet transforms and the traditional ARX or ARMA technique into a feedforward net. With the three-layer neural network formulated side by side with the linear ARX method, the parallel identification scheme captures the linear and nonlinear contributions separately without any prior knowledge of the system behavior. The value of this feature is emphasized by evaluating several literature examples in which we successfully distinguish the linear from the nonlinear characteristics of the predicted systems. Although one of the mathematical models used is highly nonlinear, for the magnitude of the variations made in the system inputs the relationship between the input and output data sets is essentially linear.

The systematic design synthesis procedure eliminates many problems associated with ANN. The wavelet transform offers a theoretical framework for constructing and training the network. For practical use, the localization property of the wavelet function gives the WaveARX network the benefits of fast convergence and easy training; MRA of the wavelet function systematically determines the neuron candidates; and, finally, extending the wavelet function to a multidimensional space using a norm leads to approximation of multidimensional systems and saves computation time. The approximation ability of the WaveARX network is demonstrated through simulation examples of both mathematical and physical systems. The simulation studies show that it has a fitting performance better than, or at least similar to, other widely used linear and nonlinear modeling techniques.

When compared with two other existing wavelet-based neural networks, the wavelet network we propose still achieves some better performance characteristics. In the first simulation study, our wavelet network with an adaptive wavelet function forms a more compact structure than the Wave-Net for a similar approximation result. Zhang and Benveniste's wavelet network requires many more training epochs than ours, even though it also employs an adaptive wavelet function. This suggests that the two-stage optimization scheme we use saves a great deal of computation time.

Although this paper has illustrated the feasibility of using the WaveARX network as a process identification tool, this is just a preliminary study. Broader application and evaluation of the WaveARX network is part of our research plan. It is speculated that its advantages may be exploited for on-line adaptive network applications, extension to real process cases, and extracting qualitative and quantitative behavior from chaotic systems. Also, whether the wavelet function incorporating a norm still satisfies the wavelet frame bound criterion and retains the wavelet approximation properties is a topic planned for our future work.
Nomenclature

e = error vector of WaveARX, ∈ R^p
f = function to be approximated
g_i = inner product ratio of orthogonalized vectors
h_i = weight of the ith neuron of RBFN
p = number of training patterns
q_i = column vector of Q
s = scale for mother wavelet function
s_k = discrete scale
t = translation for mother wavelet function
t_kl = discrete translation
w = coefficient vector of WaveARX, ∈ R^(N+n+1)
w_o = coefficient vector of WaveARX without bias term, ∈ R^(N+n)
w_pur = coefficient vector of WaveARX from purified Φ, ∈ R^(N_s)
x = input vector, ∈ R^n
y = output, ∈ R^1
ȳ = mean value of y
ŷ = predicted output, ∈ R^1
y = output pattern vector, ∈ R^p

Greek Symbols

Γ = matrix from QR decomposition
θ = [{s_k}, {t_kl}]
Φ = WaveARX function matrix, ∈ R^(p×(N+n+1))
Φ_o = WaveARX function matrix without bias term, ∈ R^(p×(N+n))
Φ_pur = matrix from purified Φ, ∈ R^(p×N_s)
γ_ij = element of Γ
λ = learning rate
σ_i = width of the ith neuron of RBF
φ = radial basis function
ψ = wavelet function
Ψ = Fourier transform of ψ
ψ_kl = discrete wavelet function
ψ_s = scale of mother wavelet function
ψ_s,t = scale and translation of mother wavelet function
ψ_t = translation of mother wavelet function
ω = frequency
∇ = gradient operator
Literature Cited

Bakshi, B.; Stephanopoulos, G. Wave-Net: A Multi-Resolution, Hierarchical Neural Network with Localized Learning. AIChE J. 1993, 39, 57.
Bakshi, B.; Koulouris, A.; Stephanopoulos, G. Learning at Multiple Resolutions: Wavelets as Basis Functions in Artificial Neural Networks and Inductive Decision Trees. Wavelet Applications in Chemical Engineering; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1994.
Barton, R. S.; Himmelblau, D. M. Rectification of Packed Distillation Column Data via Subset Identification. International Neural Network Society Annual Meeting, San Diego, CA, 1994; pp 1-167.
Benedetto, J. J., Frazier, M. W., Eds. Wavelets: Mathematics and Applications; CRC Press: Boca Raton, FL, 1993.
Bhat, N.; McAvoy, T. J. Use of Neural Nets for Dynamic Modeling and Control of Chemical Process Systems. Comput. Chem. Eng. 1990, 14 (4/5), 573.
Bhat, N. V.; McAvoy, T. J. Determining Model Structure for Neural Models by Network Stripping. Comput. Chem. Eng. 1992, 56 (2), 319.
Chen, Q.; Weigand, W. A. Neural Net Model of Batch Processes and Optimization Based on an Extended Genetic Algorithm. IEEE 1992, 519-523.
Chen, S.; Billings, S. A.; Luo, W. Orthogonal Least Squares Methods and Their Application to Non-linear System Identification. Int. J. Control 1989, 11 (7), 1873.
Chui, C. K. An Introduction to Wavelets; Academic Press: New York, 1992.
Daubechies, I. Ten Lectures on Wavelets; Society for Industrial and Applied Mathematics: Philadelphia, 1992.
Duffin, R. J.; Schaeffer, A. C. A Class of Nonharmonic Fourier Series. Trans. Am. Math. Soc. 1952, 72, 341.
Eikens, B.; Karim, M. N. Identification of a Waste Water Neutralization Process Using Neural Networks. International Neural Network Society Annual Meeting, San Diego, CA, 1994.
Howard, D. Neural Network Toolbox for Use with MATLAB: User's Guide; MathWorks, Inc., 1993.
Irie, B.; Miyake, S. Capabilities of Three-Layered Perceptrons. Proc. IEEE Int. Conf. Neural Networks 1988, 641.
Johansson, R. System Modeling and Identification; Prentice-Hall: Englewood Cliffs, NJ, 1993.
Kashyap, R. Bayesian Comparison of Different Classes of Dynamic Models Using Empirical Data. IEEE Trans. Autom. Control 1977, 22, 715.
Mallat, S. G. A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 31, 674.
Milton, J. S.; Arnold, J. C. Introduction to Probability and Statistics; McGraw-Hill: New York, 1990.
Moody, J.; Darken, C. J. Fast Learning in Networks of Locally-Tuned Processing Units. Neural Comput. 1989, 1, 281.
Musavi, M. T.; Ahmed, W.; Chan, K. H.; Faris, K. B.; Hummels, D. M. On the Training of Radial Basis Function Classifiers. Neural Networks 1992, 5, 595.
Narendra, K. S.; Parthsarathy, K. Identification and Control of Dynamic Systems Using Neural Networks. IEEE Trans. Neural Networks 1990, 1 (1), 4.
Pati, Y. C.; Krishnaprasad, P. Analysis and Synthesis of Feedforward Neural Networks Using Discrete Affine Wavelet Transformations. Technical Report SRC TR90-44; Systems Research Center, University of Maryland: College Park, MD, 1990.
Pati, Y. C.; Krishnaprasad, P. S. Analysis and Synthesis of Feedforward Neural Networks Using Discrete Affine Wavelet Transformations. IEEE Trans. Neural Networks 1993, 4 (1), 73.
Pottmann, M.; Unbehauen, H.; Seborg, D. E. Application of a General Multi-Model Approach for Identification of Highly Nonlinear Processes-A Case Study. Int. J. Control 1993, 57 (1), 97.
Pröll, T.; Karim, M. N. Model-Predictive pH Control Using Real-Time NARX Approach. AIChE J. 1994, 40 (2), 269.
Psichogios, D. C.; Ungar, L. H. A Hybrid Neural Network-First Principles Approach to Process Modeling. AIChE J. 1992, 38 (10), 1499.
Rioul, O.; Vetterli, M. Wavelets and Signal Processing. IEEE SP Mag. 1991, 14.
Rumelhart, D. E.; McClelland, J. L. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, 1986; Vol. I.
Slotine, J.-J. E.; Li, W. Applied Nonlinear Control; Prentice-Hall: Englewood Cliffs, NJ, 1991.
Soloway, D. I.; Bialasiewicz, J. T. Neural Network Modeling of Nonlinear Systems Based on Volterra Series Extension of a Linear Model. Technical Report, NASA Technical Memorandum; Langley Research Center: Hampton, VA, 1992.
Su, H. T.; McAvoy, T. J. Integration of Multilayer Perceptron Networks and Linear Dynamic Models: A Hammerstein Modeling Approach. Ind. Eng. Chem. Res. 1993, 32, 1927.
Szu, H. H.; Telfer, B.; Kadambe, S. Neural Network Adaptive Wavelets for Signal Representation and Classification. Opt. Eng. 1992, 31 (9), 1907.
You, Y.; Nikolaou, M. Dynamic Process Modeling with Recurrent Neural Networks. AIChE J. 1993, 39 (10), 1654.
Young, R. K. Wavelet Theory and Its Applications; Kluwer Academic: Dordrecht, The Netherlands, 1992.
Zhang, Q. Regressor Selection and Wavelet Network Construction. Proc. 32nd Conf. Decis. Control 1993, 3688.
Zhang, Q.; Benveniste, A. Wavelet Networks. IEEE Trans. Neural Networks 1992, 3 (6), 889.
Received for review October 20, 1994
Revised manuscript received May 12, 1995
Accepted July 25, 1995
IE9406097

Abstract published in Advance ACS Abstracts, November 1, 1995.