Ind. Eng. Chem. Res. 2003, 42, 1275-1278


RESEARCH NOTES

Improved Training Rules for Multilayered Feedforward Neural Networks

Su Whan Sung,* Tai-yong Lee, and Sunwon Park

Department of Chemical and Biomolecular Engineering and Center for Ultramicrochemical Process Systems, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-Gu, Daejeon 305-701, Korea

* To whom all correspondence should be addressed. Tel.: +82-42-866-5785. Fax: +82-42-861-3647. E-mail: [email protected].

We propose an improved supervisory training rule for multilayered feedforward neural networks (FNNs). The proposed method analytically estimates the optimal solution for the output weights of the FNN. Using this optimal solution, it reduces the search space of the iterative high-dimensional nonlinear optimization problem of supervisory training by the number of output weights. As a result, we secure a much faster convergence rate and better robustness than the previous full-dimensional training rules.

1. Introduction

Artificial neural networks (ANNs) have been widely used in many research fields.1-10 One of the most popular ANNs for the modeling of dynamic systems is the multilayered feedforward neural network (FNN). Even though the FNN is popular and capable of describing any nonlinear dynamics by increasing the number of hidden nodes,7 its application is limited by two difficulties in the training step: one is determining a parsimonious model structure for the neural network, and the other is shortening the computation time for estimating the optimal model parameters. Several authors have discussed the former and developed good methodologies to identify and eliminate redundant model parameters.8,9 The focus of this research is on the latter. The computation load to train the multilayered FNN is heavy because a high-dimensional nonlinear optimization problem must be solved. To overcome this shortcoming, many researchers have developed efficient training rules such as the steepest descent, Levenberg-Marquardt, quasi-Newton, and conjugate direction methods.9,10

In this paper, we propose an improved training rule that reduces the searching dimension of the high-dimensional nonlinear optimization problem by the number of output weight parameters. It shows much better robustness and a significantly faster convergence rate than the previous full-dimensional approaches because of the reduced number of adjustable parameters subject to the nonlinear iteration.

This paper is organized as follows: In section 2, we obtain the optimal solution for the output weight matrix of the FNN analytically. Section 3 derives an improved training rule for the FNN on the basis of the optimal solution derived in section 2.

Section 4 gives guidelines for initializing the weights, and we confirm the merits of the proposed method compared with previous full-dimensional approaches through simulation studies in section 5. Section 6 concludes the paper.

2. Supervisory Training Rule for Multilayered FNNs

In this section, we identify a useful feature of multilayered FNNs that forms the basis for the improved supervisory training rule. The three-layered FNN is shown in Figure 1. The model output is mathematically formulated as follows:

\[
\hat{Z}_{\mathrm{aug}} = V \times X_{\mathrm{aug}} \qquad (1)
\]
\[
\hat{h}_j(t) = \frac{1}{1+\exp(-\hat{z}_j(t))}, \qquad j = 1, 2, 3, ..., n_j \qquad (2)
\]
\[
\hat{Y}_{\mathrm{aug}} = W \times \hat{H}_{\mathrm{aug}} \qquad (3)
\]

where the matrices $V \in \mathbb{R}^{n_j \times n_i}$ and $W \in \mathbb{R}^{n_k \times n_j}$ are the input weight matrix and the output weight matrix, respectively, and $\hat{h}_j(t)$ and $\hat{y}_k(t)$ denote the outputs of the $j$th hidden node and the $k$th output node, respectively. In this research, we use the sigmoidal function of (2) as the nonlinear activation function; however, other activation functions such as the hyperbolic tangent can also be used. The augmented matrices $X_{\mathrm{aug}}$, $\hat{Z}_{\mathrm{aug}}$, $\hat{H}_{\mathrm{aug}}$, and $\hat{Y}_{\mathrm{aug}}$ are constructed from the following vectors:

\[
X(t) = [x_1(t), x_2(t), ..., x_{n_i}(t)]^T \qquad (4)
\]
\[
X_{\mathrm{aug}} = [X(1), X(2), ..., X(N)] \qquad (5)
\]
\[
\hat{Z}(t) = [\hat{z}_1(t), \hat{z}_2(t), ..., \hat{z}_{n_j}(t)]^T \qquad (6)
\]
\[
\hat{H}(t) = [\hat{h}_1(t), \hat{h}_2(t), ..., \hat{h}_{n_j}(t)]^T \qquad (7)
\]
\[
\hat{Y}(t) = [\hat{y}_1(t), \hat{y}_2(t), ..., \hat{y}_{n_k}(t)]^T \qquad (8)
\]
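Before continuing, note that (1)-(3), with the augmented matrices of (4)-(8), translate directly into a few lines of code. Below is a minimal NumPy sketch of the forward pass; the array names mirror the symbols above, and the sizes at the bottom are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def forward_pass(V, W, X_aug):
    """Evaluate eqs (1)-(3) for all N samples at once.

    V : (nj, ni) input weight matrix, W : (nk, nj) output weight matrix,
    X_aug : (ni, N) augmented input matrix of eq (5).
    Returns (H_aug, Y_hat) with shapes (nj, N) and (nk, N).
    """
    Z_aug = V @ X_aug                        # eq (1)
    H_aug = 1.0 / (1.0 + np.exp(-Z_aug))     # eq (2), sigmoidal activation
    Y_hat = W @ H_aug                        # eq (3), linear output nodes
    return H_aug, Y_hat

# Illustrative sizes only (assumed for this sketch):
ni, nj, nk, N = 5, 12, 1, 150
rng = np.random.default_rng(0)
V = rng.uniform(-1.0, 1.0, (nj, ni))
W = rng.uniform(-1.0, 1.0, (nk, nj))
X_aug = rng.uniform(-1.0, 1.0, (ni, N))
H_aug, Y_hat = forward_pass(V, W, X_aug)
```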


Figure 1. Three-layered FNN.

where $x_i(t)$ denotes the input of the $i$th input node. $\hat{Z}_{\mathrm{aug}} \in \mathbb{R}^{n_j \times N}$, $\hat{H}_{\mathrm{aug}} \in \mathbb{R}^{n_j \times N}$, and $\hat{Y}_{\mathrm{aug}} \in \mathbb{R}^{n_k \times N}$ are defined in the same way as $X_{\mathrm{aug}} \in \mathbb{R}^{n_i \times N}$ in (5). The objective of the training rule is to find the input weight matrix $V$ and the output weight matrix $W$ that minimize the following modeling error:

\[
\min_{V,W} \left\{ E(V,W) = 0.5 \sum_{t=1}^{N} [Y(t) - \hat{Y}(t)]^T [Y(t) - \hat{Y}(t)] \right\} \qquad (9)
\]

subject to (1)-(3), where $Y(t) = [y_1(t), ..., y_{n_k}(t)]^T$ and $\hat{Y}(t) = [\hat{y}_1(t), ..., \hat{y}_{n_k}(t)]^T$ are the measured process output vector and the model output vector, respectively. From (3), we can easily find the following optimal output weight matrix $W$ that minimizes (9) for a given input weight matrix $V$:

\[
W = Y_{\mathrm{aug}} \times [\hat{H}_{\mathrm{aug}}]^T \times (\hat{H}_{\mathrm{aug}} \times [\hat{H}_{\mathrm{aug}}]^T)^{-1} \qquad (10)
\]

Now, we can reduce the searching dimension by as many parameters as the output weight matrix $W$ contains, using (10).

Remarks. The neural network in Figure 1 and the above derived equations assume linear output nodes. However, they can be used directly when the output nodes have a nonlinear activation function, because we can obtain the target $y_k(t)$ corresponding to the process output by applying the inverse of the nonlinear activation function.
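To make the reduction concrete, here is a minimal NumPy sketch of (10): given the current input weights V, the hidden outputs are computed with the forward pass above, and the output weight matrix then follows in closed form. Using `np.linalg.pinv` instead of the plain inverse is my own safeguard for an ill-conditioned $\hat{H}_{\mathrm{aug}}(\hat{H}_{\mathrm{aug}})^T$ (a pseudo-inverse option is also mentioned in the Remark of the next section); the function name is illustrative.

```python
def optimal_output_weights(V, X_aug, Y_aug):
    """Closed-form output weight matrix of eq (10) for a given V.

    Y_aug : (nk, N) matrix of the measured outputs Y(1), ..., Y(N).
    Returns the optimal W and the hidden-node outputs H_aug.
    """
    Z_aug = V @ X_aug
    H_aug = 1.0 / (1.0 + np.exp(-Z_aug))            # eqs (1)-(2)
    # W = Y_aug * H_aug^T * (H_aug * H_aug^T)^(-1); pinv used as a numerical safeguard
    W = Y_aug @ H_aug.T @ np.linalg.pinv(H_aug @ H_aug.T)
    return W, H_aug
```

With W available in closed form, only the $n_j \times n_i$ entries of V remain as decision variables in the iterative search.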

3. Improved Supervisory Training Rules

In the previous section, we analytically derived the optimal solution for the output weight matrix of the multilayered FNN. In this section, we derive an improved training rule on the basis of that optimal solution. There are many available training rules for multilayered FNNs, such as the steepest descent, Levenberg-Marquardt, quasi-Newton, and conjugate direction methods.9,10 We can improve all of these methods by reducing the searching dimension with the derived optimal output weight matrix of (10). As an example, consider the steepest descent method. The objective function is

\[
\min_{V} \left\{ E(V) = 0.5 \sum_{t=1}^{N} [Y(t) - \hat{Y}(t)]^T [Y(t) - \hat{Y}(t)] \right\} \qquad (11)
\]

subject to (1)-(3) and (10). The steepest descent method solves the optimization problem by repeating

\[
v_{j,i}(m) = v_{j,i}(m-1) - \eta \left. \frac{\partial E(V)}{\partial v_{j,i}} \right|_{V=V(m-1)}, \qquad j = 1, 2, ..., n_j; \; i = 1, 2, ..., n_i \qquad (12)
\]

where $m$ is the iteration number and $v_{j,i}$ is the element of the input weight matrix $V$ in the $j$th row and $i$th column. $\eta$ adjusts the convergence rate. The first derivative of the cost function in (12) with respect to the input weight matrix is calculated as follows. From (1)-(3), (10), and (11), we derive

\[
\frac{\partial E(V)}{\partial v_{j,i}} = - \sum_{t=1}^{N} [Y(t) - \hat{Y}(t)]^T \frac{\partial \hat{Y}(t)}{\partial v_{j,i}} \qquad (13)
\]
\[
\frac{\partial \hat{Y}(t)}{\partial v_{j,i}} = \frac{\partial W}{\partial v_{j,i}} \hat{H}(t) + W \frac{\partial \hat{H}(t)}{\partial v_{j,i}} \qquad (14)
\]
\[
\frac{\partial W}{\partial v_{j,i}} = \left\{ Y_{\mathrm{aug}} \left( \frac{\partial \hat{H}_{\mathrm{aug}}}{\partial v_{j,i}} \right)^{T} - W \left[ \hat{H}_{\mathrm{aug}} \left( \frac{\partial \hat{H}_{\mathrm{aug}}}{\partial v_{j,i}} \right)^{T} + \frac{\partial \hat{H}_{\mathrm{aug}}}{\partial v_{j,i}} (\hat{H}_{\mathrm{aug}})^{T} \right] \right\} [\hat{H}_{\mathrm{aug}} (\hat{H}_{\mathrm{aug}})^{T}]^{-1} \qquad (15)
\]
\[
\frac{\partial \hat{h}_j(t)}{\partial v_{j,i}} = \hat{h}_j(t)\,[1 - \hat{h}_j(t)]\, x_i(t) \qquad (16)
\]

where (13) and (14) are derived from (11) and (3), respectively, and (15) and (16) come from (10) and (2); only the $j$th row of $\partial \hat{H}_{\mathrm{aug}}/\partial v_{j,i}$ is nonzero. In summary, the proposed method repeats (12) until the parameters converge, calculating the first derivative vector of the cost function at each iteration using (10) and (13)-(16).

Up to here, we have derived the proposed supervisory training rule for multilayered FNNs. It estimates the output weight matrix analytically and thereby removes the elements of the output weight matrix from the search space of the nonlinear iterative optimization. Consequently, its convergence rate is much faster and its robustness is enhanced compared with the previous approaches. Also, initializing the weights is significantly simpler because the proposed method does not need to initialize the output weight matrix.

Remark. If the number of hidden nodes is too large (that is, overparametrization), the condition number of $\hat{H}_{\mathrm{aug}}(\hat{H}_{\mathrm{aug}})^T$ becomes very large. In this case, we should reduce the number of hidden nodes, or we can use the pseudo-inverse instead of the inverse in (10) and (15).
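For illustration, a minimal NumPy sketch of one iteration of the proposed rule follows: the output weights come from (10), the gradient with respect to V follows (13)-(16) literally with a double loop over (j, i), and (12) updates V. The function and variable names (grad_E, train_step, eta, max_iter) are my own, and the loop is written for clarity rather than speed.

```python
def grad_E(V, W, X_aug, Y_aug):
    """Gradient of E(V) with respect to every v_{j,i}, following eqs (13)-(16)."""
    H = 1.0 / (1.0 + np.exp(-(V @ X_aug)))           # eqs (1)-(2)
    Err = Y_aug - W @ H                               # Y(t) - Y_hat(t), stacked over t
    HHt_inv = np.linalg.pinv(H @ H.T)
    nj, ni = V.shape
    G = np.zeros_like(V)
    for j in range(nj):
        for i in range(ni):
            dH = np.zeros_like(H)
            dH[j, :] = H[j, :] * (1.0 - H[j, :]) * X_aug[i, :]             # eq (16)
            dW = (Y_aug @ dH.T - W @ (H @ dH.T + dH @ H.T)) @ HHt_inv      # eq (15)
            dY = dW @ H + W @ dH                                           # eq (14)
            G[j, i] = -np.sum(Err * dY)                                    # eq (13)
    return G

def train_step(V, X_aug, Y_aug, eta=0.001):
    """One iteration of the proposed rule: eq (10) for W, then eq (12) for V."""
    W, _ = optimal_output_weights(V, X_aug, Y_aug)    # analytic output weights
    V_new = V - eta * grad_E(V, W, X_aug, Y_aug)      # steepest-descent update of V
    return V_new, W

# e.g.  for m in range(max_iter): V, W = train_step(V, X_aug, Y_aug)
```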


4. Guidelines for Weight Initialization

We need to pay attention to the condition number (or the minimum singular value) of the matrix $\hat{H}_{\mathrm{aug}}(\hat{H}_{\mathrm{aug}})^T$ for a systematic initialization of the input weight matrix $V$. In general, a small condition number is preferred: a small condition number means that all orthogonal directions are evenly informative, whereas a large condition number means that some orthogonal directions (equivalently, some portion of the weights) are useless. We can use this criterion to initialize the input weight matrix systematically. For example, if we initialize the weights with uniformly distributed random numbers, we may choose the initial weight matrix whose condition number is the smallest among candidates generated by changing the magnitude and the seed of the random number generator, as sketched below.

If the behavior of the activation function is linear, we can infer from (1) and (2) that the rank of $\hat{H}_{\mathrm{aug}}$ is min($n_i$, $n_j$) regardless of the total number of data points $N$. This means that we cannot create more informative directions than this rank allows, irrespective of how many hidden nodes ($n_j$) and/or how many data points ($N$) are used. If the magnitude of the random numbers in the initial input weight matrix is very small, they activate only a small region of the nonlinear activation function; the activation function then behaves nearly linearly, and the data sets are not fully informative, as discussed above. So, we recommend random numbers of large magnitude to activate a wide range of the nonlinear activation function. However, the magnitude should not be so large that the hidden nodes saturate.
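A minimal sketch of this selection procedure is given below. The candidate magnitudes, the number of seeds, and the function name pick_initial_V are illustrative assumptions; the selection criterion is the condition number of $\hat{H}_{\mathrm{aug}}(\hat{H}_{\mathrm{aug}})^T$ as described above.

```python
def pick_initial_V(X_aug, nj, magnitudes=(0.5, 1.0, 2.0, 5.0), n_seeds=10):
    """Choose a random initial V whose H_aug * H_aug^T has the smallest condition number."""
    ni = X_aug.shape[0]
    best_V, best_cond = None, np.inf
    for mag in magnitudes:                   # candidate magnitudes of the uniform noise
        for seed in range(n_seeds):          # candidate seeds
            rng = np.random.default_rng(seed)
            V = rng.uniform(-mag, mag, (nj, ni))
            H = 1.0 / (1.0 + np.exp(-(V @ X_aug)))
            cond = np.linalg.cond(H @ H.T)   # condition number of H_aug * H_aug^T
            if cond < best_cond:
                best_V, best_cond = V, cond
    return best_V
```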

5. Simulation Results

We simulated a pH process to confirm the merits of the proposed method relative to previous full-dimensional ones. Figure 2 shows the pH process, in which a weak acid influent of phosphoric acid (H3PO4) is titrated by a strong base, sodium hydroxide (NaOH), in a continuous stirred tank reactor.

Figure 2. Scheme of the pH process.

The feed flow rate and the reactor volume are 1 L/min and 5 L, respectively. The total concentrations of phosphoric acid in the influent stream and sodium hydroxide in the titrating stream are 0.06 and 0.2 mol/L, respectively.


The initial total ion concentrations of phosphoric acid and sodium hydroxide in the reactor are 0.06 and 0.0 mol/L, respectively. For the detailed material balance equations and the equilibrium equation, refer to Sung et al.11 We excited the pH process with a titrating stream of uniformly distributed random values between 0 and 0.6. The input nodes of the neural network consist of u(t-1), u(t-2), pH(t-1)/12, pH(t-2)/12, and one bias; the output of the neural network is pH(t)/12. The sampling time is 1 min, and the number of hidden nodes is 12.

Typical modeling error patterns of the proposed method and the previous full-dimensional approach during training are shown in parts a and b of Figure 3, respectively. Here, the η values of the proposed method and the previous method were chosen as 0.001 and 0.1 by trial and error. Note that the error of the previous approach after 3000 iterations is much larger than the error of the proposed approach after just 1 iteration, because the proposed method estimates the optimal output weight matrix analytically. The computation time of the proposed approach is 20% longer than that of the previous approach per iteration. However, the convergence rate of the proposed method with respect to the iteration number is so fast that its overall rate is much better than those of the previous ones.

Figure 3. Error convergence of (a) the proposed training rule and (b) the previous training rule.

Parts a and b of Figure 4 show the model performances of the proposed training method and the previous one after 3000 iterations. The model obtained by the previous method is poor because the number of iterations is not sufficient, whereas the proposed method gives acceptable model performance. We simulated various cases with different initial values and numbers of hidden nodes and reached the same conclusion: the proposed method shows a much faster convergence rate and better robustness than the previous one.

Figure 4. Comparison of the process output and the model output (training data from t = 1 to 150 and validation data from t = 151 to 268) for (a) the proposed method and (b) the previous method.
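As a side note, the data arrangement used in this example is easy to express in code. The sketch below shows how the matrices $X_{\mathrm{aug}}$ and $Y_{\mathrm{aug}}$ of (5) and (3) could be assembled from recorded sequences; the array names u and ph and the loop bounds are my own illustration, and only the regressor choice (u(t-1), u(t-2), pH(t-1)/12, pH(t-2)/12, and a bias, with output pH(t)/12) comes from the text above.

```python
def build_training_matrices(u, ph):
    """Arrange the recorded input u(t) and measured pH(t) into X_aug and Y_aug."""
    N = len(ph)
    X_cols, Y_cols = [], []
    for t in range(2, N):                      # two past samples are needed
        X_cols.append([u[t-1], u[t-2], ph[t-1] / 12.0, ph[t-2] / 12.0, 1.0])  # bias node
        Y_cols.append([ph[t] / 12.0])
    X_aug = np.array(X_cols).T                 # shape (5, N-2), cf. eq (5)
    Y_aug = np.array(Y_cols).T                 # shape (1, N-2)
    return X_aug, Y_aug
```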

6. Conclusions

We found a useful feature of the supervisory training of FNNs. An improved supervisory training rule has been proposed that optimally removes the adjustable parameters corresponding to the output weight matrix of the FNN from the iterative search. The proposed method shows significant improvements in convergence rate and robustness compared with previous full-dimensional approaches. Also, it does not need to initialize the output weight matrix W.

Acknowledgment

This work is supported by the BK21 Project and the Center for Ultramicrochemical Process Systems sponsored by KOSEF.

Literature Cited

(1) Bhat, N.; McAvoy, T. J. Use of neural nets for dynamic modeling and control of chemical process systems. Comput. Chem. Eng. 1990, 14, 573.
(2) Chen, S.; Billings, S. A.; Grant, P. M. Nonlinear system identification using neural networks. Int. J. Control 1990, 51, 1191.
(3) Bhat, N. V.; Minderman, P. A.; McAvoy, T.; Wang, N. S. Modeling chemical process systems via neural computation. IEEE Control Syst. Mag. 1990, 1, 24.
(4) Fukuda, T.; Shibata, T. Theory and application of neural networks for industrial control systems. IEEE Trans. Ind. Electron. Control Instrum. 1992, 39, 472.
(5) Ydstie, B. E. Forecasting and control using adaptive connectionist networks. Comput. Chem. Eng. 1990, 14, 583.
(6) Stevanovic, J. S. Neural networks for process analysis and optimization: modeling and applications. Comput. Chem. Eng. 1994, 18, 1149.
(7) Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Networks 1990, 2, 359.
(8) Henrique, H. M.; Lima, E. L.; Seborg, D. E. Model structure determination in neural network models. Chem. Eng. Sci. 2000, 55, 5457.
(9) Boozarjomehry, R. B.; Svrcek, W. Y. Automatic design of neural network structures. Comput. Chem. Eng. 2001, 25, 1075.
(10) Derks, E. P. P. A.; Buydens, L. M. C. Aspects of network training and validation on noisy data. Part 1. Training aspects. Chemom. Intell. Lab. Syst. 1998, 41, 171.
(11) Sung, S. W.; Yang, D. R.; Lee, I. pH Control Using an Identification Reactor. Ind. Eng. Chem. Res. 1995, 34, 2418.

Received for review August 26, 2002. Revised manuscript received January 13, 2003. Accepted January 31, 2003.

IE020663K