Ind. Eng. Chem. Res. 2002, 41, 751-759
A Fast and Efficient Algorithm for Training Radial Basis Function Neural Networks Based on a Fuzzy Partition of the Input Space

Haralambos Sarimveis,* Alex Alexandridis, George Tsekouras, and George Bafas

National Technical University of Athens, Department of Chemical Engineering, 9 Heroon Polytechniou str., Zografou Campus, Athens 15780, Greece

* To whom all correspondence should be addressed. Telephone: +30-1-7723236. Fax: +30-1-7723155. E-mail: [email protected].
The popular radial basis function (RBF) neural network architecture and a new fast and efficient method for training such a network are used to model nonlinear dynamical multi-input multi-output (MIMO) discrete-time systems. The proposed training methodology is based on a fuzzy partition of the input space and combines self-organized and supervised learning. The algorithm is illustrated through the development of neural network models using simulated and experimental data. Results show that the methodology is much faster and produces more accurate models compared to the standard techniques used to train RBF networks. Another important advantage is that, for a given fuzzy partition of the input space, the proposed method is able to determine the proper network structure without using a trial and error procedure.

1. Introduction

The dynamic representation of a process by a set of mathematical equations is essential to the development of a control system. However, in many cases it is difficult to obtain accurate models for chemical engineering systems, due to the inherent nonlinearity, complexity, and uncertainty of chemical processes. A substantial amount of interest has been focused on the utilization of artificial neural networks (ANNs), such as multilayered perceptrons, which have also been used for process fault diagnosis and control.1-5

Training of a neural network consists of two stages: in the first stage, the structure of the network, that is, the number of hidden layers and nodes, is selected. In the second stage, the network parameters associated with the neurons and/or the interconnection links are determined using an optimization algorithm, which minimizes the errors between the true outputs and the network predictions over a set of training examples. This is the training procedure, during which the network learns the relationships between the input and output variables. Most ANN learning algorithms are based on nonlinear optimization methods and require a lot of computational time.

Radial basis function (RBF) neural networks form a class of ANNs which has certain advantages over other types of ANNs, including better approximation capabilities, simpler network structures, and faster learning algorithms. Not surprisingly, RBF networks are becoming increasingly popular in many scientific areas. Especially in chemical engineering, a number of RBF network applications have been reported in solving system identification6,7 and process control8-10 problems.

RBF neural network training algorithms are split into two basic categories: those where the number of hidden nodes is predetermined and those which involve structure selection mechanisms. The algorithms of the first category are time-consuming, since they obviously require a trial and error procedure
to find the proper number of hidden nodes. A popular training technique that belongs to this category selects the centers of the nodes using the k-means clustering methodology.11-13 In a second step the algorithm uses linear regression to determine the rest of the network parameters.14,15 The second category of algorithms determines both the network structure and the network parameters. A number of techniques, which belong to this category, have been proposed by several researchers, including an orthogonal least squares algorithm,16 individual training of each hidden unit based on functional analysis,17 or initial selection of a large number of hidden units which is reduced as the algorithm proceeds.18 Most of these algorithms require a lot of computational time and may be trapped in local minima. In a recent publication,19 genetic algorithms are employed to select the structure of the network and the parameters simultaneously. However, using this approach, the centers of the hidden nodes are selected from the set of training examples. In this article, we propose a new algorithm for training RBF networks, which determines the proper number of hidden nodes and calculates the model parameters, given a fuzzy partition of the input space into a number of fuzzy subsets. After the centers of the hidden nodes have been selected, the widths of the nodes are determined by the P-nearest neighbor heuristic, and the weights between the hidden layer and the output layer are calculated by linear regression. The methodology is illustrated through the application to two chemical engineering systems: a simulated continuous stirred tank reactor (CSTR) and a real continuous pulp digester, which is a complicated reactor used in the pulp and paper industry. A number of advantages of the proposed learning strategy are identified, and the results are compared with those produced by the standard learning methodology, which is based on the k-means clustering. The rest of the paper is structured as follows: The RBF network basic characteristics and topology are first introduced, followed by a short description of the standard training methodologies. A discussion on the fuzzy partition of the input space is presented in section
3, and the proposed training procedure is described in section 4. Section 5 discusses the application of the new technique in the development of dynamic neural network models for the CSTR and the pulp digester. The paper ends with the concluding remarks, where the advantages of the proposed methodology are highlighted.

2. Radial Basis Functions: Overview

2.1. RBF Models for Dynamical Systems. An RBF neural network can be considered as a special three-layer network (Figure 1), which is linear with respect to the output parameters after fixing all the radial basis function centers and nonlinearities f(·) in the hidden layer.

Figure 1. Standard topology of an RBF neural network.

In a dynamic RBF neural network model, at time point k, past values of the process input and output variables constitute the input layer to the network. Therefore, a dynamic RBF network is a special type of nonlinear autoregressive with exogenous inputs (NARX) model. Given a process with R process inputs, M process outputs, and L hidden nodes, at time point k the input vector to the network can be written as follows:

$$x^{T}(k) = [x_1(k), x_2(k), ..., x_N(k)] = [u_1(k-1), ..., u_1(k-p_1), ..., u_R(k-1), ..., u_R(k-p_R), y_1(k-1), ..., y_1(k-q_1), ..., y_M(k-1), ..., y_M(k-q_M)] \quad (1)$$

where p_R is the number of past values of process input u_R and q_M is the number of past values of process output y_M. The output of the network contains the estimated values of the current process outputs and is given by

$$\hat{y}^{T}(k) = [\hat{y}_1(k), \hat{y}_2(k), ..., \hat{y}_M(k)] \quad (2)$$

with

$$\hat{y}_m(k) = \sum_{l=1}^{L} w_{m,l}\, f_l\!\left(\sqrt{\sum_{n=1}^{N} (x_n - \hat{x}_{l,n})^2}\right) \quad (3)$$

where f_l is the radial basis function, x̂_l is the center of the lth unit, and w_m^T = [w_{m,1}, w_{m,2}, ..., w_{m,L}] is the vector of weights which multiply the hidden node responses in order to calculate the mth output of the network. A typical choice for the radial basis function, which was also used in this work, is the Gaussian function:

$$f(\nu) = \exp\!\left(-\frac{\nu^2}{\sigma^2}\right) \quad (4)$$

where σ is the width of the node.

2.2. Standard RBF Network Learning Methodology. The standard formulation of the training algorithm involves a set of input-output pairs (x(k), y(k)), k = 1, ..., K, where x(k) is the input vector, y(k) is the corresponding target or desired output vector, and K is the number of training examples. Assuming that the structure of the network is selected by trial and error, for a given number of hidden nodes the training algorithm should minimize, with respect to the centers and widths of the hidden nodes and the weights between the hidden layer and the output layer, the objective function

$$E(\hat{x}_1, \sigma_1, \hat{x}_2, \sigma_2, ..., \hat{x}_L, \sigma_L, w_1, w_2, ..., w_M) = \frac{1}{2}\sum_{k=1}^{K} \|y(k) - \hat{y}(k)\|^2 \quad (5)$$
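For concreteness, the network response of eqs 3 and 4 and the objective of eq 5 can be sketched in a few lines of NumPy; the function names and array shapes below are illustrative choices, not part of the original formulation.

```python
import numpy as np

def rbf_forward(X, centers, widths, W):
    """Network predictions of eq 3 with Gaussian nodes (eq 4).

    X: (K, N) inputs; centers: (L, N); widths: (L,); W: (M, L) output weights.
    Returns the (K, M) matrix of predicted outputs y_hat.
    """
    # nu[k, l] = Euclidean distance of input vector k to center l
    nu = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    H = np.exp(-nu**2 / widths**2)        # Gaussian activations, eq 4
    return H @ W.T

def sse_objective(X, Y, centers, widths, W):
    """Objective function E of eq 5 (one-half the sum of squared errors)."""
    return 0.5 * np.sum((Y - rbf_forward(X, centers, widths, W)) ** 2)
```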
The standard algorithm decomposes the training problem and determines the RBF network parameters in two steps: in the first step the centers of the nodes are obtained using the k-means clustering algorithm,11-13 and in the second the rest of the network parameters are calculated. An adaptive formulation of the k-means clustering algorithm can be described as follows: the center of each cluster is initialized to a different, randomly chosen, data point. Each new exemplar (training example) x(k) is assigned to the closest cluster center x̂_closest, which is modified according to
$$\Delta\hat{x}_{\mathrm{closest}} = \eta\,(x(k) - \hat{x}_{\mathrm{closest}}) \quad (6)$$
where η is the learning rate. None of the other centers is modified by this exemplar. A simple way to determine the closest cluster center to the current exemplar is to find the center which, when subtracted from the current training example, gives the lowest squared Euclidean norm. Several passes over the training examples are needed until the algorithm converges.14 After the receptive field centers have been determined, the width of each hidden node is obtained using the P-nearest neighbor heuristic:14
$$\sigma_l = \left(\frac{1}{P}\sum_{j=1}^{P} \|\hat{x}_l - \hat{x}_j\|^2\right)^{1/2} \quad (7)$$
where x̂_j are the P nearest centers to x̂_l. Finally, the output weights are calculated by linear least squares regression, since the output nodes are simple summation units.
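To make this two-step procedure concrete, a minimal sketch follows: adaptive k-means center updates (eq 6), P-nearest-neighbor widths (eq 7), and least-squares output weights. The learning-rate value, the fixed number of passes, and the initialization are simplified placeholders rather than the exact schedules discussed later in the paper.

```python
import numpy as np

def kmeans_centers(X, L, eta=0.1, passes=20, seed=0):
    """Adaptive k-means: move the closest center toward each exemplar (eq 6)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=L, replace=False)].copy()
    for _ in range(passes):                 # several passes until (near) convergence
        for x in X:
            closest = np.argmin(np.sum((centers - x) ** 2, axis=1))
            centers[closest] += eta * (x - centers[closest])        # eq 6
    return centers

def pnn_widths(centers, P):
    """Width of each node from its P nearest centers (eq 7)."""
    D2 = np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    D2.sort(axis=1)                         # ascending; column 0 is the zero self-distance
    return np.sqrt(D2[:, 1:P + 1].mean(axis=1))

def output_weights(H, Y):
    """Least-squares hidden-to-output weights (output nodes are summation units)."""
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)   # H: (K, L) activations, Y: (K, M) targets
    return W.T
```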
3. Fuzzy Partition of the Input Space

The standard algorithm described in the previous section is much faster than most of the training algorithms used for other types of ANNs, but it has two important drawbacks: (i) In the selection of the hidden node centers, several passes over all the training examples are required. This iterative procedure increases the computational effort, especially when a large database of training examples is available. The problem is compounded by the fact that the algorithm must be run several times, since many different network structures should be examined. (ii) The standard methodology depends on an initial random selection of centers, so that different sets of centers are obtained for different runs of the same network structure. It is clear that the development of a faster and more efficient algorithm, which can determine both the network structure and the network parameters, would be very important for building dynamical models using the RBF network architecture.

The proposed training methodology overcomes all the major drawbacks mentioned in the previous paragraph. On the basis of a fuzzy partition of the input space, the structure and the centers of the nodes are selected in a single step, using only one pass of the training examples. Furthermore, for a given fuzzy partition of the input space, the algorithm always gives the same set of centers, since no random center selection is involved. Before we describe the proposed methodology, a definition of the fuzzy partition is necessary and is presented next.

Fuzzy logic theory is used to describe systems with great uncertainty.20,21 The basis of all fuzzy systems is the partition of the input variables into a number of fuzzy sets. In the proposed methodology we evenly partition the universe of discourse (domain) of each input variable x_i (i = 1, 2, ..., N) into c_i triangular fuzzy sets A_{i1}, A_{i2}, ..., A_{ic_i}, with membership functions of the form
$$\mu_A(x) = \begin{cases} 1 - \dfrac{|x - a|}{\delta a}, & \text{if } x \in [a - \delta a,\; a + \delta a] \\ 0, & \text{otherwise} \end{cases} \quad (8)$$

where a is the center element, at which the membership value of unity is assigned, and δa is half of the respective width, which is selected so that the two vertexes of the triangle lie at the centers of the two adjacent fuzzy sets. It follows that a fuzzy set A can be fully described by the corresponding center and width elements:

$$A = \{a, \delta a\} \quad (9)$$

If we define the fuzzy partition of the domain of each input variable as

$$T_i = \{A_{i1}, A_{i2}, ..., A_{ic_i}\}, \quad 1 \le i \le N \quad (10)$$

a fuzzy partition of the entire input space X can be defined20 by dismembering it into C fuzzy subspaces A^1, A^2, ..., A^C, with

$$C = \prod_{i=1}^{N} c_i \quad (11)$$

The lth fuzzy subspace (1 ≤ l ≤ C) is obtained as a combination of N particular fuzzy sets A_{1l} ∈ T_1, A_{2l} ∈ T_2, ..., A_{Nl} ∈ T_N and can be represented as

$$A^l = [A_{1l}, A_{2l}, ..., A_{Nl}]^T \quad (12)$$

Figure 2 shows a fuzzy partition of the two-dimensional input space, where both universes of discourse are evenly dismembered into 5 triangular fuzzy sets. This partition defines 25 fuzzy subspaces in the input space. In the same figure, a particular fuzzy subspace A = [A_{13}, A_{23}] in the two-dimensional input space is specified.

Figure 2. Fuzzy partition and definition of a fuzzy subspace in the two-dimensional input space.

Employing expression 9 and defining the vectors

$$r^l = [a_{1l}, a_{2l}, ..., a_{Nl}]^T, \quad \delta r = [\delta a_1, \delta a_2, ..., \delta a_N]^T \quad (13)$$

the subspace A^l in eq 12 can be written as

$$A^l = \{r^l, \delta r\} \quad (14)$$

Expression 14 shows that A^l can be viewed as a hyperrectangle within the input space, where vector r^l is the central point and vector δr is its side. The side vector δr is the same for all the fuzzy subspaces, since we have assumed even partitions of the universes of discourse.

4. Fuzzy Logic for RBF Network Configuration

Assuming that the input space has been partitioned as described in section 3, the determination of the appropriate node centers is reduced to the problem of generating a set of fuzzy subspaces which uniformly cover the input data distribution. To solve this problem, we first introduce the notion of the multidimensional membership function μ_{A^l}(x(k)) of an input vector x(k) into A^l, which is defined as22

$$\mu_{A^l}(x(k)) = \begin{cases} 1 - rd_l(x(k)), & \text{if } rd_l(x(k)) \le 1 \\ 0, & \text{otherwise} \end{cases} \quad (15)$$

where rd_l(x(k)) is the Euclidean relative distance between A^l and the input data vector:

$$rd_l(x(k)) = \frac{\left[\sum_{i=1}^{N} (a_{il} - x_i(k))^2\right]^{1/2}}{\left[\sum_{i=1}^{N} (\delta a_i)^2\right]^{1/2}} \quad (16)$$
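A short sketch of these definitions, with illustrative variable names: the triangular membership of eq 8 for a set A = {a, δa}, and the relative distance and multidimensional membership of eqs 15 and 16 for a subspace A^l = {r, δr}.

```python
import numpy as np

def triangular_membership(x, a, da):
    """mu_A(x) of eq 8 for the symmetric triangular set A = {a, da} (eq 9)."""
    # 1 - |x - a|/da is negative outside [a - da, a + da], so clipping at 0
    # reproduces the two-branch definition of eq 8
    return np.maximum(0.0, 1.0 - np.abs(x - a) / da)

def relative_distance(x, r, dr):
    """Euclidean relative distance rd_l between A^l = {r, dr} and input x (eq 16)."""
    return np.linalg.norm(r - x) / np.linalg.norm(dr)

def subspace_membership(x, r, dr):
    """Multidimensional membership mu_{A^l}(x) of eq 15."""
    return max(0.0, 1.0 - relative_distance(x, r, dr))
```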
Obviously, the fuzzy subspace that best describes an input data vector is the one with the smallest Euclidean relative distance, since this subspace assigns to the vector the greatest membership degree. To obtain this subspace for a given input vector, we can use the following simple procedure,23 assuming that all the input universes of discourse have been dismembered into a number of fuzzy sets.

Algorithm 1: Determination of the Closest Fuzzy Subspace to a Given Input Vector. Suppose we are given an input data vector x(k), with x(k) ∈ R^N, and all the input universes of discourse have been dismembered into fuzzy sets. For example, in Figure 3, where a system with two inputs is depicted, the fuzzy sets for the variables x_1, x_2 are

T_1 = {A_{11}, A_{12}, A_{13}, A_{14}, A_{15}}
T_2 = {A_{21}, A_{22}, A_{23}, A_{24}, A_{25}}

Figure 3. Selection of the appropriate fuzzy sets for a two-dimensional input vector.

(Step 1) Determine the membership values of x_1(k), x_2(k), ..., x_N(k) in the respective fuzzy sets. For example, in Figure 3a, x_1(k) has membership degree 0.7 in A_{12}, 0.3 in A_{13}, and zero in all other fuzzy sets. Similarly, in Figure 3b, x_2(k) has membership degree 0.75 in A_{21}, 0.25 in A_{22}, and zero in all other fuzzy sets.

(Step 2) Assign to x_1(k), ..., x_N(k) the fuzzy sets with the maximum membership degree. For example, in Figure 3a, A_{12} is assigned to x_1(k), and in Figure 3b, A_{21} is assigned to x_2(k).

(Step 3) For the given input vector x(k), the closest fuzzy subspace A^{l_0} is given as the combination of the fuzzy sets selected in step 2. For example, in Figure 3, A^{l_0} = [A_{12}, A_{21}] = {[a_{12}, a_{21}], [δa_1, δa_2]}.

Following this analysis, a new training algorithm is proposed to determine the structure and the centers of the hidden layer of an RBF neural network. The algorithm assumes the availability of K training examples, which correspond to K input data vectors x^T(k) = [x_1(k), x_2(k), ..., x_N(k)] (k = 1, 2, ..., K), and a fuzzy partition of the universe of discourse of each input variable x_i (i = 1, 2, ..., N) into c_i symmetric triangular fuzzy sets.

Algorithm 2: Proposed Algorithm for the Selection of the RBF Network Structure and the Determination of the Hidden Node Centers.

Take the first input data vector x(1). Set L = 1.
Apply algorithm 1 to the input vector x(1):
{
    FOR i = 1 TO N
        A_{i1} = {a_{i1}, δa_i} ← max_{1≤j≤c_i} [μ_{A_{ij}}(x_i(1))]   (17)
    END
    Generate the first fuzzy subspace
        A^1 = {r^1, δr} = {[a_{11}, a_{21}, ..., a_{N1}]^T, [δa_1, δa_2, ..., δa_N]^T}   (18)
}
FOR k = 2 TO K
    rd_{l_0}(x(k)) = min_{1≤l≤L} [rd_l(x(k))]   (19)
    IF rd_{l_0}(x(k)) > 1   (20)
    THEN
        L = L + 1
        Apply algorithm 1 to the input vector x(k):
        {
            FOR i = 1 TO N
                A_{iL} = {a_{iL}, δa_i} ← max_{1≤j≤c_i} [μ_{A_{ij}}(x_i(k))]   (21)
            END
            Generate the Lth fuzzy subspace
                A^L = {r^L, δr} = {[a_{1L}, a_{2L}, ..., a_{NL}]^T, [δa_1, δa_2, ..., δa_N]^T}   (22)
        }
    ENDIF
END

After the end of the algorithm, the hidden layer is formulated by assigning each generated subspace to a hidden node. The centers of the subspaces become the centers of the radial basis function hidden units. The algorithm starts with the first data point x(1) and generates the first fuzzy subspace by applying algorithm 1: for i = 1, 2, ..., N, it selects the fuzzy set that assigns the maximum membership degree to x_i(1). This selection procedure is represented by eq 17. The combination of the N selected fuzzy sets formulates the fuzzy subspace which best describes the input vector x(1) (eq 18). For all the remaining input examples x(k), k = 2, 3, ..., K, the algorithm determines the fuzzy subspace which is the closest, among the L generated subspaces, to the input vector x(k). More specifically, the algorithm computes the Euclidean relative distances rd_l(x(k)) (l = 1, 2, ..., L) between x(k) and the L fuzzy subspaces which have been generated so far.
If the minimum of these distances (eq 19) is greater than unity (eq 20), then x(k) is not sufficiently described by any of the fuzzy subspaces, and the algorithm generates a new subspace by applying algorithm 1. If inequality 20 is not satisfied, a fuzzy subspace which has already been generated can be assigned to the particular input example; therefore, there is no need to generate a new subspace, and the algorithm proceeds to the next training example.

The methodology offers an alternative to the standard methodology regarding the selection of the number of hidden nodes and the unit centers, which is the most crucial and time-consuming step in the development of an efficient RBF model. For an easier visualization, we can consider that the method defines a multidimensional grid on the input space and selects some of the knots as cluster centers. However, only the knots that are close to the training examples are examined. This reduces significantly the required number of calculations, even for problems with many input variables and dense fuzzy partitions of the input space. Once the hidden node centers have been determined, the algorithm uses the standard techniques to calculate the rest of the network characteristics, that is, the widths of the nodes and the coefficients which connect the hidden layer to the output layer.

The improvements are due to the new clustering technique for estimating the number and initial locations of the cluster centers, which is based on a fuzzy partition of the input space. They can be itemized as follows:

(1) The methodology increases substantially the speed of the training algorithm, since it needs only one pass of the training data, while the standard technique requires several iterations to converge. This happens because the proposed clustering methodology does not solve an optimization problem, whereas in the k-means clustering algorithm a nonlinear cost function is minimized. For a more quantitative comparison, we can compute the number of required distance calculations, which is the most time-consuming part of both clustering algorithms. Keeping the notation used so far (denoting the number of training examples by K and the number of hidden nodes by L) and using the symbol I for the number of iterations in the k-means algorithm, it can easily be derived that the proposed algorithm requires only

$$\frac{L(L-1)}{2} + (K - L)L = KL - \frac{L^2}{2} - \frac{L}{2}$$

distance calculations in the worst case, while, in the k-means algorithm, IKL distance calculations are needed.

(2) It does not involve any random selection of clusters and produces the same results each time we run it for the same problem, in contrast to the k-means approach, which is based on an initial random selection of the cluster centers.

(3) Given a fuzzy partition of the input space, the method obtains both the dimension of the hidden layer (i.e., the number of nodes) and the hidden node centers. Contrary to that, in the k-means algorithm, the number of clusters must be predefined.

The proposed methodology also has certain advantages compared to a number of different clustering techniques: The c-means24 clustering method is slow, since, similarly to the k-means algorithm, it minimizes a nonlinear cost function and requires many iterations. The mountain method25 is based on defining a grid on the data space, but it examines all the knots one by one, so that the computation grows exponentially with the dimension of the problem. Some other clustering techniques, such as the subtractive clustering method26 and the nearest neighborhood scheme,27 are fast, but only the training input data can be selected as cluster centers. This selection may cause problems in the training procedure and in the prediction capabilities of the produced RBF network, especially when different radial basis functions, such as the thin plate splines, are utilized.
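A compact sketch of algorithms 1 and 2 is given below, assuming the centers of the c_i triangular sets of each variable are stored in a list `set_centers` and their common half-widths in a vector `da`; both names are our own. As an illustration of the count above, K = 1000 and L = 285 (a case from Table 4) give at most KL − L²/2 − L/2 ≈ 2.4 × 10⁵ distance calculations, against 2.85 × 10⁵ per k-means iteration.

```python
import numpy as np

def closest_sets(x, set_centers):
    """Algorithm 1: for each variable, pick the fuzzy set of maximum membership;
    for an even triangular partition this is simply the nearest set center."""
    return np.array([c[np.argmin(np.abs(c - xi))] for xi, c in zip(x, set_centers)])

def select_centers(X, set_centers, da):
    """Algorithm 2: single pass over the data; a new fuzzy subspace (hidden
    node center) is generated whenever min_l rd_l(x) > 1 (eqs 19 and 20)."""
    denom = np.linalg.norm(da)                        # denominator of eq 16
    centers = [closest_sets(X[0], set_centers)]       # first subspace, eqs 17-18
    for x in X[1:]:
        rd = np.linalg.norm(np.asarray(centers) - x, axis=1) / denom   # eq 16
        if rd.min() > 1.0:                            # eq 20: x is not covered
            centers.append(closest_sets(x, set_centers))   # eqs 21-22
    return np.asarray(centers)
```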
The results shown in the next section verify the fast computational times of the proposed algorithm and show that with the new technique we can also improve the accuracy of the produced RBF models.

5. Results Using RBF Network Models

5.1. Prediction of Concentration and Temperature in a Continuous Stirred Tank Reactor. The proposed methodology was used to model a nonisothermal continuous stirred tank reactor (CSTR), where the following exothermic irreversible reaction between sodium thiosulfate and hydrogen peroxide takes place:
$$2\mathrm{Na_2S_2O_3} + 4\mathrm{H_2O_2} \rightarrow \mathrm{Na_2S_3O_6} + \mathrm{Na_2SO_4} + 4\mathrm{H_2O} \quad (23)$$

This process is characterized by the following dynamic equations:28
$$\frac{dC_A}{dt} = \frac{F}{V}(C_{A,\mathrm{in}} - C_A) - 2k_0 \exp\!\left(-\frac{E}{RT}\right) C_A^2$$

$$\frac{dT}{dt} = \frac{F}{V}(T_{\mathrm{in}} - T) + 2\,\frac{(-\Delta H)_R}{\rho c_p}\, k_0 \exp\!\left(-\frac{E}{RT}\right) C_A^2 - \frac{UA}{V \rho c_p}(T - T_j) \quad (24)$$
where V is the volume of the CSTR; (ΔH)_R is the heat of the reaction; and −E/R, k_0, c_p, and ρ are constants of the reaction and the reactants. The variables F, C_{A,in}, T_in, and T_j are considered as inputs to the system, while C_A and T are the outputs. F is the flow rate into the reactor, C_{A,in} is the inlet concentration of the reactant Na2S2O3, C_A is the concentration of Na2S2O3 inside the reactor, T_in is the inlet temperature, T is the temperature inside the reactor, and T_j is the temperature of the coolant. The values of the process parameters are shown in Table 1.

Table 1. Process Parameter Values in the CSTR Example

process parameter    value
V                    100 L
UA                   20 000 J/(s·K)
ρ                    1000 g/L
c_p                  4.2 J/(g·K)
(−ΔH)_R              596 619 J/mol
k_0                  6.85 × 10^11 L/(s·mol)
E                    76 534.704 J/mol
R                    8.314 J/(mol·K)

Using a sampling period of 1 s, we created a set of 2000 data examples by adding random number signals to the steady state values of the input variables, which are shown in Table 2. The variables were scaled between -1 and 1, so that all the input and output values used to train the network were of the same order of magnitude.
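For illustration, eq 24 with the constants of Table 1 and the steady-state inputs of Table 2 can be integrated as follows; the initial state and the use of solve_ivp are our own choices, standing in for the data-generation procedure described above.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Constants from Table 1
V, UA, rho, cp = 100.0, 20_000.0, 1000.0, 4.2
dHr, k0, E, R = 596_619.0, 6.85e11, 76_534.704, 8.314

def cstr_rhs(t, z, F, CAin, Tin, Tj):
    """Right-hand side of eq 24; state z = [CA, T]."""
    CA, T = z
    rate = k0 * np.exp(-E / (R * T)) * CA**2
    dCA = F / V * (CAin - CA) - 2.0 * rate
    dT = F / V * (Tin - T) + 2.0 * dHr / (rho * cp) * rate \
         - UA / (V * rho * cp) * (T - Tj)
    return [dCA, dT]

# One 1 s sampling interval from an arbitrary illustrative state,
# driven by the steady-state input values of Table 2
sol = solve_ivp(cstr_rhs, (0.0, 1.0), [0.5, 300.0],
                args=(20.0, 1.0, 275.0, 250.0), rtol=1e-8)
```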
Table 2. Steady State Values of the Input Variables in the CSTR Example

input variable    steady state value
F                 20 L/s
T_in              275 K
T_j               250 K
C_A,in            1 mol/L

Table 3. Performance of a Linear Model in the CSTR Example

SSE training          SSE validation
C_A        T          C_A        T
0.2854     5198       0.3665     6759
The data were then partitioned into two subsets. The first 1000 points were used for training the network, and the rest of the data were used for validation. The input vector to the RBF network consisted of 20 past values of each process input, a total of 80 variables. No values of the process outputs (concentration of Na2S2O3 and temperature) were used as additional inputs to the network.

The data were first used to develop a simple linear model, where the coefficients were determined by linear regression. The respective sums of squared errors (SSEs) for the training and validation data are shown in Table 3. Then, we developed a number of RBF neural network models, using different initial fuzzy partitions of the input variables, as shown in Table 4. Each fuzzy partition corresponds to the formulation of a number of subspaces in the input space. For example, the fuzzy partition of each input variable into 4 fuzzy sets produces 4^80 subspaces in the input space. The proposed technique was applied in each case to select the appropriate fuzzy subspaces and define their centers as the centers of the hidden layer nodes. The second column in Table 4 shows that the selected subspaces constitute only a small percentage of the total number of fuzzy subspaces. However, the dimension of the hidden layer increases when we use a denser fuzzy partition of the input space. The rest of the neural network parameters were determined using the standard techniques.

With the exception of the first two simulations, which resulted in very small network structures for the size of the identification problem, the models proved to be very successful in predicting the dynamic behavior of the concentration of Na2S2O3 and the temperature inside the reactor. The training algorithm not only produced very accurate models but also completed all the calculations in small computational times, as shown in Table 4, where the SSEs for the training and validation data are depicted, as well as the computational times using a PC with an 800 MHz Pentium III processor. The observations were in line with neural network theory, which states that utilization of too many hidden nodes overfits the system and deteriorates its predictive capabilities. The predictions of the best network configuration, which was the result of the fuzzy partition of each input variable into four sets, along with the actual data for the concentration of Na2S2O3 and the temperature inside the reactor, are shown in Figure 4 for the first 500 validation data.
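A sketch of how the 80-dimensional lagged regressor of eq 1 and the linear baseline of Table 3 can be built; the array layout and the `lagged_matrix` helper are illustrative assumptions.

```python
import numpy as np

def lagged_matrix(U, lags=20):
    """NARX regressors of eq 1 using inputs only: row k holds
    [u_1(k-1), ..., u_1(k-lags), ..., u_R(k-1), ..., u_R(k-lags)].

    U: (K, R) scaled process inputs; returns a (K - lags, lags * R) matrix.
    """
    K, R = U.shape
    X = np.empty((K - lags, lags * R))
    for row, k in enumerate(range(lags, K)):
        past = U[k - lags:k][::-1]     # inputs at times k-1, k-2, ..., k-lags
        X[row] = past.T.ravel()
    return X

# Linear baseline of Table 3 (illustrative split: first 1000 samples for training):
# X_tr, Y_tr = lagged_matrix(U[:1000]), Y[20:1000]
# theta, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)
```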
Figure 4. Actual values and RBF neural network predictions in the CSTR example, using the proposed training algorithm: (a) sodium thiosulfate concentration; (b) temperature.

Table 5. Effect of P on the Performance of the RBF Neural Network

        SSE training         SSE validation
P       C_A      T           C_A      T
20      0.0050   261.9       0.0068   555.3
40      0.0048   255.4       0.0065   598.2
60      0.0046   240.0       0.0063   464.2
90      0.0045   224.4       0.0062   429.7
120     0.0044   212.8       0.0061   404.2
150     0.0045   220.2       0.0062   415.3
In the above simulations, P in eq 7 was chosen so that the calculated unit widths provide a balance between matching and generalization. The width of a unit must be greater than the distance to its closest neighbors, so that whenever an input vector fires the unit, a number of close hidden nodes are also activated to a significant degree. However, the unit width should not be too large, so as to restrict the influence of the unit and prevent it from having a high activation far from the training data it represents. Following the above rules, the values of P were selected to be equal to about 20% of the number of hidden nodes but were not optimized. To examine the importance of this parameter, we chose the best network structure and trained the system using different values of P. The calculated SSEs, which are given in Table 5, verify the previous discussion, since the accuracy of the produced model is optimized for a certain value of P between 1 and the number of hidden nodes.
Table 4. Different Runs of the New Training Algorithm in the CSTR Example

                                                 no. of               SSE training       SSE validation     training CPU
fuzzy partition                                  hidden nodes   P     C_A      T         C_A      T         time (s)
3 sets for each variable                         5              1     0.1407   19437     0.1296   20132     3
4 sets for temperatures, 3 sets for others       28             10    0.0517   5318      0.0485   5327      8
4 sets for each variable                         285            60    0.0046   240.0     0.0063   464.2     40
5 sets for temperatures, 4 sets for others       627            130   0.0009   71.4      0.0062   609.1     144
5 sets for each variable                         912            180   0.0005   53.8      0.0082   685.2     212
Figure 5. Actual values and RBF neural network predictions in the CSTR example, using the k-means training algorithm: (a) sodium thiosulfate concentration; (b) temperature.

Table 6. Different Runs of the k-Means Training Algorithm in the CSTR Example

no. of                 SSE training       SSE validation     training CPU
hidden nodes   P       C_A      T         C_A      T         time (s)
5              1       0.1797   19189     0.1857   20689     30
28             10      0.0618   8901      0.0613   8174      127
285            60      0.0051   419.1     0.0077   1063      1205
627            130     0.0008   75.9      0.0087   1107      3179
912            180     0.0004   57.8      0.0092   1211      5392

Table 7. Effect of η on the Performance of the RBF Neural Network

                         SSE training       SSE validation
η                        C_A      T         C_A      T
0.01                     0.0057   649.9     0.1122   1729
0.1                      0.0042   474.9     0.0102   1585
0.3                      0.0050   556.2     0.0087   1329
0.5                      0.0051   419.1     0.0077   1063
0.7                      0.0047   407       0.0095   1285.0
0.9                      0.0060   415.8     0.0102   1316.7
MacQueen k-means         0.0045   449.2     0.093    1145
square root k-means      0.0041   420.2     0.0082   1102
The proposed methodology was compared to the standard learning technique by retraining the same network structures using the k-means clustering algorithm. For a fair comparison, we used the initial guesses for P and not the optimized values, which produce lower SSEs for the proposed algorithm. However, in all the examples we applied many different techniques for the selection of the learning rate η, such as the MacQueen k-means,12 the square root k-means,12 and a number of constant values. The results shown in Table 6 are the ones which correspond to the lowest SSEs for the validation data, as far as the learning rate schedule is concerned. For the third network structure, the model predictions along with the real values for the two output parameters and the first 500 validation data points can be found in Figure 5. For the same network structure, the results we obtained using the different techniques for the selection of the learning rate are presented in Table 7. It is clear that, in all the cases we examined, the prediction capabilities of the RBF neural network models trained with the proposed methodology outperformed those of the networks trained by the standard methodology. Moreover, the training procedure based on the k-means clustering methodology needed much more time to determine the network parameters.
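The learning-rate schedules compared here can be sketched as follows; the per-cluster counter forms for the MacQueen and square-root variants are the ones commonly attributed to Darken and Moody,12 given as an assumption rather than the paper's exact implementation.

```python
def learning_rate(schedule, n, eta0=0.5):
    """Learning rate for the k-means update of eq 6, where n counts the
    exemplars assigned so far to the winning cluster."""
    if schedule == "constant":
        return eta0                # fixed eta, as in the constant-value runs
    if schedule == "macqueen":
        return 1.0 / n             # MacQueen's running-average k-means
    if schedule == "sqrt":
        return 1.0 / n**0.5        # square-root schedule
    raise ValueError(f"unknown schedule: {schedule}")
```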
5.2. Prediction of Kappa Number in a Continuous Pulp Digester. The second example, which was used to illustrate the new training algorithm, is taken from the pulp and paper industry. More specifically, a number of RBF networks were trained to simulate the dynamic behavior of a continuous pulp digester, where the kappa number of the produced pulp is the only output, while past values of several process inputs and one past value of the kappa number form the input layer of the network. The above selection of input variables corresponds to 32 input nodes. For confidentiality reasons, the list of input variables to the network cannot be cited. The kappa number is a very common index in the pulp and paper industry and represents the residual amount of lignin in the produced pulp. We used 584 data samples for training and 452 samples for validation. The data samples were collected during regular operation of the process with a time step of 1 h and were scaled so that the values of all the input and output variables were between -1 and 1.

After the development of a simple linear regression model, the results of which are depicted in Table 8, the proposed methodology was applied to a number of fuzzy partitions of the input space. The results we obtained for all the fuzzy partitions are listed in Table 9. The lowest validation SSE was achieved by partitioning each input variable into 5 fuzzy sets, thus forming 5^32 fuzzy subspaces. Application of the proposed procedure to this particular fuzzy partition selected only 26 of these subspaces and generated an equal number of hidden nodes. In all the simulations, the values of P were chosen following the rules presented in the previous subsection (P was set equal to about 20% of the number of hidden nodes). The SSEs calculated for the validation data show that the RBF neural network models are clearly more accurate than the linear model.

Table 8. Performance of a Linear Model in the Digester Example

SSE training    SSE validation
1104            1482

Table 9. Different Runs of the New Training Algorithm in the Digester Example

fuzzy partition        no. of              SSE        SSE          training
(sets per variable)    hidden nodes   P    training   validation   CPU time (s)
3                      3              1    1097       1042         4
4                      16             3    926        843          7
5                      26             5    798        725          9
6                      35             7    703        754          12
7                      48             10   655        892          14
The proposed methodology was also compared to the standard k-means algorithm, which was used to retrain the same network structures. Table 10 summarizes the important observations we obtained using the standard technique. The comparison again favors the proposed methodology, since it is faster and produces more accurate models. The kappa number predictions along with the experimental data for the best (third) network configuration are shown in Figures 6 and 7, respectively, for the two training algorithms and the set of validation data.

Table 10. Different Runs of the k-Means Training Algorithm in the Digester Example

no. of
hidden nodes   P    SSE training   SSE validation   training CPU time (s)
3              1    1245           1193             28
16             3    1081           1025             51
26             5    937            857              68
35             7    681            883              95
48             10   621            948              131

Figure 6. Actual values for kappa number and RBF neural network predictions in the digester example, using the proposed training algorithm.

Figure 7. Actual values for kappa number and RBF neural network predictions in the digester example, using the k-means training algorithm.

6. Conclusions

In this work a new algorithm was proposed for training RBF neural networks. The algorithm first determines the centers of the nonlinear internal units in a fast self-organizing manner, based on a fuzzy partition of the input space. In a second stage the algorithm calculates the widths of the Gaussian functions. Finally, the connection weights between the hidden and the output layers are computed by solving a simple quadratic minimization problem, which minimizes the errors between the desired and predicted outputs. The proposed methodology offers four important advantages compared to other learning techniques:

(i) It requires only one pass of the training examples, so it is very fast.

(ii) It does not depend on an initial random selection of the hidden nodes, so the same results are obtained whenever we train the same network structure.

(iii) It determines both the network structure and the network parameters and minimizes the number of trials. In fact, given a fuzzy partition of the input space, the widths of the hidden nodes are the only parameters which have to be determined by trial and error. If a different radial basis function, such as the thin plate spline, is selected, then no trial and error is needed for a given fuzzy partition.

(iv) It produces hidden node centers which are different from the training examples.

Simulations and experimental data were used to test the ability of the proposed algorithm to learn the dynamics of nonlinear systems. The results show that the methodology can be very successful when it is applied to system identification problems. The methodology was compared to the standard learning technique, which is based on the k-means clustering. In both examples, the proposed technique surpassed the standard learning algorithm in speed and accuracy of predictions. However, the most important advantage of the proposed algorithm is its ability to determine both the network structure and the network parameters using very limited computational time.

Literature Cited
(1) Bhat, N.; McAvoy, T. J. Use of Neural Nets for Dynamic Modeling and Control of Chemical Process Systems. Comput. Chem. Eng. 1990, 14, 573.
(2) Hoskins, J. C.; Himmelblau, D. M. Process Control via Artificial Neural Networks and Reinforced Learning. Comput. Chem. Eng. 1992, 16, 241.
(3) Narendra, K. S.; Parthasarathy, K. Identification and Control of Dynamical Systems Using Neural Networks. IEEE Trans. Neural Networks 1990, 1, 4.
(4) Ungar, L. H.; Powell, B. A.; Kamens, S. N. Adaptive Networks for Fault Diagnosis and Process Control. Comput. Chem. Eng. 1990, 14, 561.
(5) You, Y.; Nikolaou, M. Dynamic Process Modeling with Recurrent Neural Networks. AIChE J. 1993, 30, 1654.
(6) Luo, W.; Karim, M. N.; Morris, A. J.; Martin, E. B. Control Relevant Identification of a pH Wastewater Neutralisation Process Using Adaptive Radial Basis Function Networks. Comput. Chem. Eng. 1996, S1017.
(7) Bomberger, J. D.; Seborg, D. E. Determination of Model Order for NARX Models Directly from Input-Output Data. J. Process Control 1998, 8, 459.
(8) Pottmann, M.; Seborg, D. E. A Nonlinear Predictive Control Strategy Based on Radial Basis Function Models. Comput. Chem. Eng. 1997, 21, 965.
(9) Knapp, T. D.; Budman, H. M.; Broderick, G. Adaptive Control of a CSTR with a Neural Network Model. J. Process Control 2001, 11, 53.
(10) Bhartiya, S.; Whiteley, J. R. Factorized Approach to Nonlinear MPC Using a Radial Basis Function Model. AIChE J. 2001, 47, 358.
(11) MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; U. C. Berkeley Press: 1967; p 281.
(12) Darken, C.; Moody, J. Fast Adaptive K-Means Clustering: Some Empirical Results. In Proceedings of the IEEE INNS International Joint Conference on Neural Networks; San Diego, CA, 1990; Vol. II, p 233.
(13) Moody, J.; Darken, C. Fast Learning in Networks of Locally-Tuned Processing Units. Neural Computation 1989, 1, 281.
(14) Leonard, J. A.; Kramer, M. A. Radial Basis Function Networks for Classifying Process Faults. IEEE Control Systems 1991, 31.
(15) Powell, M. J. D. Radial Basis Functions for Multivariable Interpolation: A Review. In Algorithms for Approximation; Mason, J. C., Cox, M. G., Eds.; Oxford: 1987; p 143.
(16) Chen, S.; Billings, S. A.; Cowan, C. F. N.; Grant, P. W. Practical Identification of NARMAX Models Using Radial Basis Functions. Int. J. Control 1990, 52, 1327.
(17) Holcomb, T.; Morari, M. Local Training for Radial Basis Function Networks: Towards Solving the Hidden Unit Problem. In Proceedings of the American Control Conference; Boston, MA, 1991; p 2331.
(18) Musavi, M. T.; Ahmed, W.; Chan, K. H.; Faris, K. B.; Hummels, D. M. On the Training of Radial Basis Function Classifiers. Neural Networks 1992, 5, 595.
(19) Billings, S. A.; Zheng, G. L. Radial Basis Function Network Configuration Using Genetic Algorithms. Neural Networks 1995, 8, 877.
(20) Sugeno, M.; Yasukawa, T. A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. Fuzzy Systems 1993, 1, 71.
(21) Raptis, C. G.; Siettos, C. I.; Kiranoudis, C. T.; Bafas, G. V. Classification of Aged Wine Distillates Using Fuzzy and Neural Network Systems. J. Food Eng. 2000, 46, 267.
(22) Nie, J. Fuzzy Control of Multivariable Nonlinear Servomechanisms with Explicit Decoupling Scheme. IEEE Trans. Fuzzy Systems 1997, 5, 304.
(23) Wang, L. X.; Mendel, J. M. Generating Fuzzy Rules from Numerical Data with Applications. IEEE Trans. Systems Man Cybernet. 1992, 22, 1414.
(24) Bezdek, J. Cluster Validity with Fuzzy Sets. J. Cybernetics 1974, 3, 58.
(25) Yager, R. R.; Filev, D. P. Approximate Clustering via the Mountain Method. IEEE Trans. Systems Man Cybernet. 1994, 24, 1279.
(26) Chiu, S. L. Fuzzy Model Identification Based on Cluster Estimation. J. Intelligent Fuzzy Systems 1994, 2, 267.
(27) Wang, L. X. Adaptive Fuzzy Systems and Control; Prentice Hall: Upper Saddle River, NJ, 1994.
(28) Kazantzis, N.; Kravaris, C. Synthesis of State Feedback Regulators for Nonlinear Processes. Chem. Eng. Sci. 2000, 55, 3437.
Received for review March 23, 2001
Revised manuscript received June 25, 2001
Accepted October 18, 2001

IE010263H