Ind. Eng. Chem. Res. 2003, 42, 1707-1712


PROCESS DESIGN AND CONTROL

Artificial Neural Network Meta Models To Enhance the Prediction and Consistency of Multiphase Reactor Correlations

Laurentiu A. Tarca, Bernard P. A. Grandjean,* and Faïçal Larachi

Department of Chemical Engineering & CERPIC, Laval University, Quebec, Canada G1K 7P4

* To whom correspondence should be addressed. Tel.: (1-418) 656 2859. Fax: (1-418) 656 5993. E-mail: [email protected].

To increase confidence in neural network modeling of multiphase reactor characteristics, we have to take advantage of some a priori knowledge of the physical laws governing these systems in order to build neural models having phenomenological consistency (PC). A common form of PC is the monotonicity constraint of a characteristic to be modeled with respect to some important dimensional variables describing the multiphase system. When the inputs of a neural model are functions (usually dimensionless) of the variables with respect to which monotonicity is expected, monotonicity might not be guaranteed, and such a drawback is only revealed after training. A genetic algorithm-based methodology was proposed to produce several highly accurate and nearly PC networks differing in their inputs and architectures. PC and accuracy were shown to be meaningfully improved by combining such networks in a linear meta model. A new optimality criterion for the meta-model parameter identification was proposed, and the results were compared with those of the classical mean-squared error optimality criterion. The proof of concept of the approach was illustrated by modeling the two-phase pressure drop in countercurrently operated randomly packed beds.

1. Introduction

The design and efficient exploitation of multiphase reactors require knowledge of their hydrodynamics and mass- and heat-transfer characteristics, e.g., pressure drop, phase holdups, mass- and heat-transfer coefficients, etc. Rigorous treatment of multiphase flow problems from first principles remains a difficult task and has not yet attained sufficient maturity to supplant correlation-based approaches. Artificial neural networks (ANNs), as correlation tools, have gained wide acceptance in the field because of their inherent ability to map the nonlinear relationships that tie independent variables (either dimensional inputs, e.g., pressure, diameter, etc., or dimensionless inputs, e.g., Reynolds, Weber, and Froude numbers) to the reactor characteristic to be predicted, i.e., a dimensional or dimensionless output.1-4

To increase confidence in these tools for yielding phenomenologically consistent (PC) models, some a priori knowledge emanating from the physical laws governing these systems must be taken into account during model building. This knowledge is sometimes referred to as the modeler's bias.5 A simple form of supplementary information about the function to be learned is a set of "monotonicity constraints" on the output with respect to some of the process variables intervening as, or embedded in, some of the ANN inputs. Monotonicity asserts that an increase (or decrease) in a particular input cannot induce a decrease (or increase) in the output. ANN monotonicity constraints such as "flow monotonically increases with

pressure" are encountered not only in chemical engineering6 but in other fields as well, e.g., human cognition, reasoning, decision making, etc.7,8

Guaranteed monotonicity of the ANN output with respect to some process variables may be obtained6 if the ANN is trained directly with these variables. When, on the contrary, the ANN inputs are functions of the process variables, e.g., viscosity embedded in a Reynolds-number input, producing guaranteed monotonic ANNs is no longer tractable, and the degree of monotonicity has to be evaluated through numerical tests. In a heuristic approach proposed for ANN dimensionless correlations of multiphase reactors, the monotonicity of the simulated output with respect to particular dimensional variables was evaluated a posteriori by simulating the network's output in the vicinity of a reference point in the multivariate input space, usually the centroid of the experimental database serving for ANN learning and testing.1-4,9 In a recent work,10 we have shown that this approach is not always successful and may not yield ANNs that are monotonic across the whole database domain. In response, a combined genetic algorithm (GA)-ANN methodology was proposed10 to develop PC ANN models that preserve, to a large extent, the monotonicity constraints across the whole database domain. In this approach, the GA searches for the most adequate combination of dimensionless groups to be used in the ANN correlation, e.g., for the irrigated two-phase pressure gradient10 in packed towers. The adequacy of a given selection of dimensionless numbers and ANN architecture is judged on the basis of the resulting ANN model's accuracy and monotonicity in the vicinity of all of the data points available for training and testing. However, several accurate and PC ANN models can be found to perform nearly equally well, each exhibiting a different architecture, i.e., types of inputs, numbers of inputs, and numbers of hidden nodes. No really decisive criterion can elect one at the expense of the other solutions, so, as a subjective rule, modelers often choose the one with the least complex architecture.9,10

In this work, we studied the potential of combining several good ANNs in order to achieve better predictions than those of individual ANNs. The incentive behind combining ANNs stems from the fact that, although all of the models are individually equally good on average, they are not equally good at each individual point of the database. Some ANNs may be locally good, while others, because of different input sets and architectures, may not be. Hence, simultaneously combining a cohort of ANNs could result in a synergistic effect, especially in the database regions where the contrast in performance between individual ANNs is large. This approach has already been investigated in several research works11-15 and consists of feeding the predictions of several distinct networks, referred to here as base models (level 0), into an upper-level model (level 1), referred to as the meta model, which is generally linear. What makes a meta model more robust than the individual base models is the fact that, as said earlier, the base models are different and thus may be specialized on different regions of the input space. Hopefully, they will not all be wrong at the same point of the database space. The base models may differ in data representation (i.e., each learner may use a different representation of the same inputs), in training scheme, in initial weight sets, etc. A comprehensive classification of the differences that may exist between base models and of the strategies for combining their outputs into a meta model is given by Alpaydin.14

In the context of regression learning, Hashem and Schmeiser12 investigated the combination of a number of trained networks by performing weighted sums of the outputs of the base (component) networks. An unconstrained mean-squared error-optimal linear combination (MSE-OLC) of networks with constant terms was proposed, which theoretically yields the smallest MSE. Independently, Perrone16 developed the general ensemble method for constructing improved regression estimates, which is equivalent to the constrained MSE-OLC.12 In these works, the weights of the meta model are determined by minimizing the MSE of the meta model on the same data on which the base models were trained. This has the advantage of being simple to use, but if the base models are highly cross-correlated, i.e., all weak in the same regions of the input space, the meta model will lack robustness.17 Breiman17 extended the Wolpert18 approach of stacking regressions by proposing to determine the meta-model regression coefficients (level 1 model) from the performance of the base models (level 0 models) on generalization data rather than on the training data as in ref 12. The major drawback of Breiman's17 method is its computational heaviness, because the base models need to be retrained on the cross-validation data. In addition, a feature common to almost all of the works11-18 on network combination is their focus on improving prediction accuracy without concern for the monotonicity constraints that the meta model must fulfill.
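As a concrete illustration of the MSE-OLC idea,12 the sketch below fits an unconstrained linear combination (with a constant term) of base-model predictions by ordinary least squares. It is a generic reconstruction, not the code used in the works cited; the function name and the synthetic data are ours.

```python
# Sketch only: unconstrained MSE-optimal linear combination (MSE-OLC) of base-model
# predictions with a constant term, fitted by ordinary least squares.
import numpy as np

def mse_olc_weights(base_preds, y, constant_term=True):
    """base_preds: (N, n) array, one column per base model; y: (N,) targets.
    Returns the combination weights (intercept first when constant_term=True)."""
    X = np.column_stack([np.ones(len(y)), base_preds]) if constant_term else base_preds
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit = MSE-optimal weights
    return w

# Illustration on synthetic data (three imperfect "base models" of the same target)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)
base_preds = np.column_stack([y * (1.0 + 0.2 * rng.standard_normal(500)) for _ in range(3)])
w = mse_olc_weights(base_preds, y)
combined = np.column_stack([np.ones(500), base_preds]) @ w
```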
2. Related Work

Piché et al.1 compiled experimental measurements of the pressure drop (ΔP/Z) in countercurrently operated packed beds to form a large database (5005 records) and proposed an ANN pressure drop correlation. The latter predicts dimensionless pressure drops using, as inputs, several dimensionless Buckingham Π groups computed from the physical property vector p = {UG, UL, ρG, µL, aT, ε, φ, Z, DC, ρL, σL, µG} recorded in the database. The type of neural network used was a multilayer perceptron with a sigmoidal transfer function in the hidden and output layers. The network was trained with a limited-memory BFGS algorithm.19 The selection of the ANN input set was carried out by trial and error among a set of Ni candidates (i = 1-28; see Table 1), with the goal of building an ANN correlation that is accurate and satisfies the following phenomenological rules:

\frac{\partial(\Delta P/Z)}{\partial U_G} > 0; \quad \frac{\partial(\Delta P/Z)}{\partial U_L} > 0; \quad \frac{\partial(\Delta P/Z)}{\partial \rho_G} > 0; \quad \frac{\partial(\Delta P/Z)}{\partial \mu_L} > 0; \quad \frac{\partial(\Delta P/Z)}{\partial a_T} > 0   (1)

Table 1. Candidate Dimensionless Input Variables for Pressure Drop Modeling

dimensionless no. Ni, where i =    candidate variable
1     Reynolds (ReG)
2     Blake (BlG)
3     Froude (FrG)
4     Galileo (GaG)
5     modified Galileo (GaGm)
6     Stokes (StG)
7     modified Stokes (StGm)
8     Reynolds (ReL)
9     Blake (BlL)
10    Froude (FrL)
11    Weber (WeL)
12    Morton (MoL)
13    Eötvös (EoL)
14    modified Eötvös (EoLm)
15    Galileo (GaL)
16    modified Galileo (GaLm)
17    Stokes (StL)
18    modified Stokes (StLm)
19    Capillary (CaL)
20    Ohnesorge (OhL)
21    wall factor K1
22    wall factor K2
23    wall factor K3
24    correction number SB
25    correction number SB (2)
26    correction number SB (3)
27    Lockhart-Martinelli (χ)
28    modified Reynolds (Rem)

Because the eq 1 conditions were checked only in the vicinity of particular data points pk of the database, there was no guarantee that the retained ANN model still fulfilled them in most parts of the database domain. Such violations of the monotonicity constraints are due to the common overfitting problem and to the poor quality of some experimental measurements contained in the database. A more exhaustive test of ANN monotonicity was proposed recently,10 which consisted of checking eq 1 in the neighborhood of all of the data points pk available for training. A phenomenological consistency error (PCE) was defined to quantify the proportion of data points pk in whose vicinity at least one of the eq 1 monotonicity constraints was violated. The new pressure drop ANN correlation yielded an accuracy of 20% (average absolute relative error, AARE) and a PCE of 17% on all of the 5005 data points, versus an AARE of 19.4% and a PCE of 81% for the ref 10 correlation.
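The a posteriori monotonicity test behind the PCE can be sketched as follows. This is our own reconstruction, not the authors' code: the 1% perturbation size, the function names, and the user-supplied mapping from dimensional variables to the model's dimensionless inputs are assumptions, not values from the paper.

```python
# Sketch of the a posteriori monotonicity (PCE) test. `model` maps dimensionless
# inputs to the predicted pressure drop; `to_dimensionless` maps a dict of
# dimensional variables (UG, UL, rhoG, muL, aT, ...) to those inputs.
MONOTONE_VARS = ("UG", "UL", "rhoG", "muL", "aT")   # variables appearing in eq 1

def pce(model, to_dimensionless, points, eps=0.01):
    """Fraction of data points in whose vicinity at least one eq 1 constraint fails."""
    violated = 0
    for p in points:
        y0 = model(to_dimensionless(p))
        for var in MONOTONE_VARS:
            q = dict(p)
            q[var] *= 1.0 + eps                     # small increase of one variable
            if model(to_dimensionless(q)) < y0:     # the output must not decrease
                violated += 1
                break
    return violated / len(points)
```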

Definitely, better adherence to the eq 1 conditions ensured that the network retained the main trends conveyed by the data and was less prone to follow the noise in the training data.

3. Base Models and the Meta Model

Let us denote by y the dimensionless pressure drop, ΔP/(ρLgZ), that we wish to model. Let us also assume that y is generated by a deterministic function g, such that y(p) = g(p) + ε, where ε is a normally distributed, zero-mean random variable. Estimators of the function g may be various types of neural networks, e.g., multilayer perceptrons or radial basis function networks, trained with pairs [pk, y(pk)]. The inputs of such neural networks may be dimensionless Buckingham Π groups "edited" from some of the dimensional variables contained in the vectors pk. Using dimensionless numbers has the advantage of enlarging the applicability range of the model but at the same time introduces ambiguity and uncertainty as to how to choose the best selection of them. Depending on the pertinence of these inputs, the resulting networks may or may not be accurate and PC, i.e., may or may not exhibit low AARE and PCE values. The GA-ANN methodology10 enabled identification of several networks exhibiting low AARE and PCE. Instead of retaining the best among them and discarding the others, we have chosen to retain the three best networks (referred to as bmr, r = 1-3) and build a meta model. As shown in Table 2, these base models differ among themselves by the number and type of dimensionless numbers they use as inputs, as well as by the number of hidden nodes. All of the other training parameters were the same: bm1-bm3 were trained on 2/3 of the available data (NT = 3503); the remaining 1/3 (NG = 1502) was used to evaluate their generalization capabilities, as is standard in ANN modeling.20

Table 2. Base Models Used To Build the Meta Model

model   inputs                               no. of inputs   no. of hidden nodes   AARE [%]   STDEV [%]   MAXARE [%]   MSE^a     PCE [%]
bm1     N10, N13, N14, N18, N23, N26, N27    7               15                    19.49      18.02       164          1.22e-2   21.6
bm2     N9, N10, N13, N14, N21, N24, N27     7               14                    19.95      18.77       195          1.29e-2   17.3
bm3     N10, N14, N17, N18, N24, N27         6               14                    21.84      20.35       194          1.56e-3   20.2

^a MSE was computed on the log values of calculated and experimental y.

The following statistics were computed:

i. AARE

\mathrm{AARE} = \frac{1}{N} \sum_{k=1}^{N} \left| \frac{y(p_k) - y^{\mathrm{calc}}(p_k)}{y(p_k)} \right|   (2)

with y(pk) the experimental value of y and ycalc(pk) the predicted value of y for the data point pk.

ii. Standard deviation of the absolute relative error (STDEV)

\mathrm{STDEV} = \sqrt{ \sum_{k=1}^{N} \left[ \left| \frac{y(p_k) - y^{\mathrm{calc}}(p_k)}{y(p_k)} \right| - \mathrm{AARE} \right]^2 \Big/ (N-1) }   (3)

iii. Maximum absolute relative error (MAXARE)

\mathrm{MAXARE} = \max_k \left| \frac{y(p_k) - y^{\mathrm{calc}}(p_k)}{y(p_k)} \right|   (4)

iv. MSE

\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left[ y(p_k) - y^{\mathrm{calc}}(p_k) \right]^2   (5)

v. PCE, computed as the percentage of data points pk in whose vicinity at least one of the eq 1 monotonicity constraints was violated.10

The most popular measure for assessing the accuracy of a model is the MSE on the generalization data, but it is not always sufficient to describe the quality of the model's fit. More informative and relevant measures include AARE, STDEV, and MAXARE. Among the series bm1-bm3 (Table 2), bm1 was the best model, with the lowest AARE on the generalization data, while its PCE was close to that of the other two models.
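For concreteness, a minimal sketch of statistics i-iv (eqs 2-5) is given below; y and y_calc stand for arrays of experimental and predicted values for the same data points, and the use of base-10 logarithms for the MSE (footnote a of Table 2) is our assumption, since the paper does not state the base.

```python
# Minimal sketch of statistics i-iv (eqs 2-5); names and the log base are ours.
import numpy as np

def fit_statistics(y, y_calc):
    are = np.abs((y - y_calc) / y)                              # absolute relative errors
    aare = are.mean()                                           # eq 2
    stdev = np.sqrt(((are - aare) ** 2).sum() / (len(y) - 1))   # eq 3
    maxare = are.max()                                          # eq 4
    mse = np.mean((np.log10(y) - np.log10(y_calc)) ** 2)        # eq 5, on log values (footnote a)
    return {"AARE": aare, "STDEV": stdev, "MAXARE": maxare, "MSE": mse}
```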

The meta model to be constructed (Figure 1) from the n = 3 base models bmr is a weighted summation of their outputs:

\mathrm{meta}(p) = \sum_{r=1}^{n} \beta_r \, \mathrm{bm}_r(p)   (6)

Figure 1. Meta-model construction: the original variables, converted into different combinations of dimensionless numbers, become the inputs of the base models, whose outputs are in turn fed to the meta model, which predicts a dimensionless pressure drop.

Because y spans several decades, the log values of the base-model outputs were taken as the linear regressors of the log value of y. However, when statistics i-v are reported, actual y values are employed, except for the MSE. The simplest way to estimate β is by minimization of the MSE of the meta model on the training data.12 This does not necessarily lead to the lowest AARE and STDEV for the meta model. To circumvent this limitation, we defined in this work a different optimality criterion for determining the meta-model regression coefficients β. This AARE + STDEV criterion is simply the sum of the AARE and STDEV that the meta model achieves on the NT training data:

C(\beta) = \frac{1}{N_T} \sum_{k=1}^{N_T} \left| \frac{y(p_k) - \mathrm{meta}(p_k)}{y(p_k)} \right| + \sqrt{ \sum_{i=1}^{N_T} \left[ \left| \frac{y(p_i) - \mathrm{meta}(p_i)}{y(p_i)} \right| - \frac{1}{N_T} \sum_{k=1}^{N_T} \left| \frac{y(p_k) - \mathrm{meta}(p_k)}{y(p_k)} \right| \right]^2 \Big/ (N_T - 1) }   (7)

The proper value of β is obtained by minimizing the criterion C using Newton's method with the derivatives computed numerically.
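The fit of eqs 6 and 7 can be sketched as follows. This is our own reconstruction, not the authors' code: the array names, the use of base-10 logarithms, and the SciPy quasi-Newton optimizer (standing in for the Newton's method with numerical derivatives mentioned above) are all assumptions.

```python
# Sketch of the meta-model fit (eqs 6 and 7); a reconstruction under the stated assumptions.
import numpy as np
from scipy.optimize import minimize

def criterion(beta, log_bm, y):
    """AARE + STDEV of the meta model (eq 7) evaluated on the training data."""
    meta = 10.0 ** (log_bm @ beta)                  # eq 6 written on the log outputs
    are = np.abs((y - meta) / y)
    aare = are.mean()
    stdev = np.sqrt(((are - aare) ** 2).sum() / (len(y) - 1))
    return aare + stdev

def fit_meta_weights(log_bm, y):
    """log_bm: (NT, n) array of log10 base-model outputs; y: experimental values."""
    beta0 = np.full(log_bm.shape[1], 1.0 / log_bm.shape[1])   # start from a simple average
    result = minimize(criterion, beta0, args=(log_bm, y), method="BFGS")
    return result.x
```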

4. Results

The β regression (or weighting) coefficients were determined according to the MSE (ref 12) and the AARE + STDEV (eq 7) optimality criteria. The obtained weighting coefficients (Table 3) can be interpreted as the certainty of a network in its output.11 All of the coefficients are significant and fairly close to one another. However, as discussed elsewhere,12,17 this alone is not enough to guarantee the robustness of the resulting meta model, because a potential problem affecting the estimation of the β coefficients is collinearity among the base-model outputs. Collinearity is likely to occur because all of the bmr models are trained to approximate the same function. Let us denote by X the matrix whose columns are the log values of the outputs of the bmr models (the linear regressors in the meta model) for each point pk in the training set. Eigenvectors (also called principal components) of the scaled, noncentered X′X matrix, as well as condition indices and variance-decomposition proportions for the individual variables (regressors in the meta model), were computed using the SPSS software. According to ref 21, a collinearity problem is expected when an eigenvector associated with a high condition index contributes strongly to the variance of two or more variables. This was not the case for our three base models bmr.

Table 3. Values of the Weighting Coefficients in the Meta Model

model                              β1      β2      β3
AARE + STDEV optimal meta model    0.367   0.380   0.279
MSE optimal meta model             0.392   0.404   0.204

An ultimate test that the base models complement each other and that the resulting meta model is robust is to compare its performance with that of the best base model and of the simple average model. The statistics i-v presented above were used in the comparisons, and the results are summarized in Table 4. As can be seen, both the MSE and AARE + STDEV optimal meta models outperform the best base model and the simple average model. Even the simple average model is already a major improvement over the best base model. Moreover, AARE, STDEV, MAXARE, and PCE are improved by 13%, 20%, 23%, and 173%, respectively, when the AARE + STDEV optimal meta model is used instead of the best base model.

Table 4. Performances of the Meta Model Compared with the Best and Simple Average Models

model                              AARE [%]   STDEV [%]   MAXARE [%]   MSE^a     PCE [%]
AARE + STDEV optimal meta model    17.28      15.00       133          1.17e-2   7.9
MSE optimal meta model             17.97      16.51       148          1.08e-2   9.1
simple average                     18.05      16.69       156          1.09e-2   8.0
best base model                    19.49      18.02       164          1.22e-2   21.6

^a MSE was computed on the log values of calculated and experimental y.
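The condition-index and variance-decomposition check used above to rule out collinearity among the base-model outputs can be sketched along the following lines; the paper used SPSS, and this NumPy version is our own reconstruction of the standard Belsley procedure,21 not the authors' computation.

```python
# Sketch of Belsley collinearity diagnostics (ref 21) for the regressor matrix X of
# log base-model outputs: condition indices and variance-decomposition proportions
# of the scaled, noncentered X'X matrix.
import numpy as np

def collinearity_diagnostics(X):
    Xs = X / np.linalg.norm(X, axis=0)          # scale columns to unit length, no centering
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_idx = s.max() / s                      # one condition index per principal component
    phi = (Vt.T ** 2) / s ** 2                  # phi[j, k] ~ share of var(beta_j) from component k
    var_prop = (phi / phi.sum(axis=1, keepdims=True)).T   # rows: components; columns: regressors
    return cond_idx, var_prop
```

A component with a high condition index (commonly, above about 30) that accounts for a large share of the variance of two or more coefficients signals a collinearity problem.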

To illustrate how a meta model achieves lower AARE and PCE than the base models, an experimental point p was chosen, and the base models and the meta model were used to simulate the output over a range of liquid velocity values, UL, in the vicinity of p (Figure 2).

Figure 2. Meta model showing monotonicity with respect to UL and accuracy in the prediction for data point p, while the base models show either imprecision or a lack of phenomenological consistency.

This example shows that, although the base model bm2 predicts the
pressure drop value at point p very well, it failed one of the eq 1 monotonicity constraints, which would cause mispredictions for points with lower UL. On the other hand, although the other two models, bm1 and bm3, exhibit monotonically increasing trends, their predictions are not as accurate, overestimating (bm1) or underestimating (bm3) the pressure drop at point p. Conversely, the meta model not only is very accurate near p but also adheres to the monotonicity constraint with respect to the liquid velocity.

5. Conclusion

An AARE + STDEV or MSE optimal linear combination of several neural networks, trained with different representations of the same raw data, may be successfully used in modeling multiphase reactor characteristics. This approach improves the prediction quality (assessed by means of AARE, STDEV, and MAXARE) and the phenomenological consistency (judged by means of PCE) of the resulting meta model. The AARE + STDEV optimality criterion allows identification of more appropriate weighting coefficients than the MSE criterion. While other works on combining several neural networks target only the improvement of the model's predictions, in this work we also stress the gain in phenomenological consistency of the model.

Acknowledgment

Financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds pour la Formation de Chercheurs et d'Aide à la Recherche (Quebec) is gratefully acknowledged.

Notation

aT = bed specific surface area (m2/m3)
bmr = an individual neural network model
C(β) = AARE + STDEV criterion depending on the β coefficients
DC = column diameter
g = gravitational constant (m/s2)
meta(pk) = output of the meta model for the input point pk
N = number of samples in a data set
Ni = dimensionless group computed from the dimensional variables
pk = vector that contains the values of the dimensional variables recorded at position k in the database
U = superficial phase velocity (m/s)
y(pk) = true, experimental value of y for the point pk
ycalc(pk) = predicted (calculated) value of y for the data point pk
X = matrix whose columns are the log values of the outputs of the bmr models for each point pk in the training set
Z = bed height

Greek Letters

β = weighting coefficient vector in the meta model
ΔP = irrigated pressure drop (Pa)
ε = bed porosity; also normally distributed, zero-mean random variable
µ = phase viscosity (kg/(m·s))
ρ = phase density (kg/m3)
σ = phase surface tension (N/m)
φ = particle sphericity

Abbreviations

ARE = absolute relative error
AARE = average absolute relative error
ANN = artificial neural network
GA = genetic algorithm
MAXARE = maximum absolute relative error
MSE-OLC = mean-squared error-optimal linear combination
PC = phenomenological consistency
PCE = phenomenological consistency error
STDEV = standard deviation of ARE

Subscripts

L = liquid
G = gas, generalization
S = solid
T = training

Literature Cited

(1) Piché, S.; Larachi, F.; Grandjean, B. P. A. Improving the prediction of irrigated pressure drop in packed absorption towers. Can. J. Chem. Eng. 2001, 79, 584.
(2) Jamialahmadi, M.; Zehtaban, M. R.; Müller-Steinhagen, H.; Sarrafi, A.; Smith, J. M. Study of bubble formation under constant flow conditions. Trans. Inst. Chem. Eng. 2001, 79, 523.
(3) Iliuta, I.; Larachi, F.; Grandjean, B. P. A.; Wild, G. Gas-liquid interfacial mass transfer in trickle-bed reactors: State-of-the-art correlations. Chem. Eng. Sci. 1999, 54, 5633.
(4) Larachi, F.; Iliuta, I.; Chen, M.; Grandjean, B. P. A. Onset of pulsing in trickle beds: Evaluation of current tools and state-of-the-art correlation. Can. J. Chem. Eng. 1999, 77, 751.
(5) Sill, J. Monotonic networks. Adv. Neural Inf. Process. Syst. 1998, 10, 661.
(6) Kay, H.; Ungar, L. H. Estimating monotonic functions and their bounds. AIChE J. 2000, 46, 2426.
(7) Wang, S. Learning monotonic-concave interval concepts using the back-propagation neural networks. Comput. Intell. 1996, 12, 260.
(8) Abu-Mostafa, Y. S. A method for learning from hints. Adv. Neural Inf. Process. Syst. 1993, 5, 73.
(9) Tarca, L. A.; Grandjean, B. P. A.; Larachi, F. Integrated genetic algorithm-artificial neural network strategy for modeling important multiphase flow characteristics. Ind. Eng. Chem. Res. 2002, 41, 2543.
(10) Tarca, L. A.; Grandjean, B. P. A.; Larachi, F. Reinforcing the phenomenological consistency in artificial neural network modeling of multiphase reactors. Chem. Eng. Process. 2002, accepted for publication.
(11) Alpaydin, E. Multiple networks for function learning. Proceedings of the IEEE International Conference on Neural Networks; IEEE Press: New York, 1993; Vol. 1, pp 9-14.
(12) Hashem, S.; Schmeiser, B. Optimal linear combinations of neural networks. Neural Networks 1997, 10, 599.
(13) Benediktsson, J. A.; Seveinsson, J. R.; Ersoy, O. K.; Swain, P. H. Parallel consensual neural networks. Proceedings of the IEEE International Conference on Neural Networks; IEEE Press: New York, 1993; Vol. 1, pp 27-32.
(14) Alpaydin, E. Techniques for combining multiple learners. Proceedings of Engineering Intelligent Systems; ICSC Press: 1998; Vol. 2, pp 6-12.
(15) Ueda, N. Optimal linear combination of neural networks for improving classification performance. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 207.
(16) Perrone, M. P. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. Thesis, Department of Physics, Brown University, 1993.
(17) Breiman, L. Stacked regressions; Technical Report 367; Department of Statistics, University of California: Berkeley, CA, 1992.


(18) Wolpert, D. H. Stacked generalization. Neural Networks 1992, 5, 241.
(19) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes: The Art of Scientific Computing; Cambridge University Press: Cambridge, U.K., 1989.
(20) Flexer, A. Statistical evaluation of neural network experiments: Minimum requirements and current practice; Technical Report; The Austrian Research Institute for Artificial Intelligence: Vienna, Austria, 1994.

(21) Belsley, D. A. Conditioning Diagnostics: Collinearity and Weak Data in Regression; John Wiley & Sons: New York, 1991; p 14.

Received for review September 18, 2002
Revised manuscript received January 20, 2003
Accepted February 10, 2003