Ind. Eng. Chem. Res. 2008, 47, 5782–5796
Accounts of Experiences in the Application of Artificial Neural Networks in Chemical Engineering

David M. Himmelblau

Department of Chemical Engineering, The University of Texas at Austin, Texas 78712
Considerable literature describing the use of artificial neural networks (ANNs) has evolved for a diverse range of applications such as fitting experimental data, machine diagnostics, pattern recognition, quality control, signal processing, process modeling, and process control, all topics of interest to chemists and chemical engineers. Because ANNs are nets of simple functions, they can provide satisfactory empirical models of complex nonlinear processes useful for a wide variety of purposes. This article describes the characteristics of ANNs, including their advantages and disadvantages, focuses on two types of neural networks that have proved in our experience to be effective in practical applications, and presents short examples of four specific applications. In the competitive field of modeling, ANNs have secured a niche that now, after two decades, seems secure.

1. Introduction

The purpose of this article is to provide some candid advice for users of artificial neural networks (ANNs) based on two decades of experience. The article is directed to readers who know little or nothing about ANNs but are curious to know how they might be used in their work. Then again, even someone who has considerable experience in the use of ANNs will probably find some concepts to think about. The article is not a general review of ANNs, a tutorial, a literature survey, or a historical record of progress. Instead, it is a report based on some of our experiences in a narrow range of applications of ANNs to some common chemical engineering problems.

When I first looked into artificial neural networks, sometimes called connectionist nets, in the late 1980s, it was in the midst of (and because of) the evolution of considerable hyperbole in the literature. ANNs were said to be a form of artificial intelligence; they could model any process; in fact, they were the best thing since sliced bread.
Of course, all this hyperbole has faded away, leaving ANNs as a class of empirical models quite akin to nonlinear regression rather than artificial intelligence.

This article begins, for those not acquainted with ANNs, with a brief summary of the basic concepts of ANNs that can be of use to chemical engineers, including how they can be formulated. It then discusses some of the important characteristics of ANNs, based on our experience, that can serve as guides in applying ANNs. Finally, the article concludes with four short typical examples of applications: (i) fault detection, (ii) prediction of polymer quality, (iii) data rectification, and (iv) modeling and control. Because this article is not intended to be a literature review, I have prepared an Appendix A to the article (located in the Supporting Information) that contains over 200 references to the literature that pertain to the application of ANNs to various problems that might be of interest to you in your work. A good start to review ANNs in general would be the Handbook of Neural Computation1 and Statistics and Neural Net Users.2 Two books especially pertinent to chemical engineering are Neural Networks in Bioprocessing and Chemical Engineering3 and Neural Networks for Chemical Engineers.4

What are some of the advantages in using artificial neural networks as models in contrast with using first-principles models, state-space models, rule-based models, splines, regression, or other empirical models? In developing a reasonable theoretically based model, you only have to estimate the values of the coefficients in the model. On the other hand, evolution of an empirical model can lead to frustration because of the seemingly overwhelming choice of structural options for a model in addition to estimating the values of the unknown coefficients in the model. Nevertheless, because numerous different models may agree reasonably with the data that are available, you can select the simplest model that is adequate and conforms to any constraints without being concerned with selecting the “perfect” (“ideal”, “true”) model. Some general advantages of using ANNs as models are as follows: (1) ANNs can represent a highly nonlinear process with a complex structure, in some instances better than many other empirical models. (2) ANNs are quite flexible models. (3) ANNs are often robust with respect to input noise. (4) Once developed and their coefficients determined, they can provide a rapid response for a new input (as opposed to solving repeatedly a set of different equations). You can also include as input non-numeric data such as the day of the week, higher or lower, or colors. I will mention some of the disadvantages later on.

2. What are Artificial Neural Networks (ANNs)?

An ANN forms a mapping F between an input space U and an output space Y. In my view, you can distinguish three different kinds of mappings: (1) Both the input and output spaces are composed of continuous variables, a typical case for data analysis. Discontinuities such as isolated spikes or missing input values for the
Figure 1. Structure of a single processing node with the sequence of processing of information.
model cause problems in estimating the coefficients in an ANN (as well as other types of models). (2) The input space is composed of continuous variables whereas the output space is composed of a finite set of discrete variables, as occurs in classification and fault detection. (3) Both the input space and the output space are composed of discrete variables representing discrete points that are mapped using difference equations in time.

As the term artificial neural network implies, early work in the field of neural networks centered on modeling the behavior of neurons found in biological systems, especially in the human brain. Engineering systems are considerably less complex than the brain; hence, from an engineering viewpoint, ANNs can be viewed as nonlinear empirical models that are especially useful in representing processes from input-output data, making predictions in time, classifying data, and recognizing patterns. In system modeling and identification as well as pattern recognition, the crucial steps are to (i) identify the form or structure of the model and (ii) determine the values of the adjustable parameters in the model.

To read the literature on the theory and application of artificial neural networks, you have to become familiar with the prevalent jargon, a jargon that is somewhat foreign to outsiders but will be explained as we proceed. Figure 1 shows the basic structure of a single processing unit in an ANN, which will be referred to as a node in this article (sometimes a perceptron in the literature) and is analogous in concept to a single neuron in the human brain. The flow of information is as follows. A node receives one or more input signals yi, which may come from other nodes in the net or from some other source such as an external input of data. Each input is adjusted according to the value wi,j (not shown in Figure 1), which is called a weight and that I usually call a coefficient in the ANN. These weights are conceptually similar to the synaptic strength between two connected neurons in the human brain. The weighted signals to the node are summed, and the resulting signal a, called the activation, is sent to a transfer function, g, which can be any type of mathematical function but is usually taken to be a simple bounded differentiable function such as the sigmoid (Figure 2) or the arctangent function

f(x) = (1/π) tan⁻¹(x) + 1/2
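As a concrete illustration, a single node of this kind can be written in a few lines of Python; the function names and the optional bias term here are my own choices for the sketch, not notation from the article:

```python
import math

def sigmoid(a):
    """Sigmoid transfer function; output bounded on (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def arctan_transfer(a):
    """Scaled arctangent transfer function f(a) = (1/pi) arctan(a) + 1/2."""
    return math.atan(a) / math.pi + 0.5

def node_output(inputs, weights, bias=0.0, g=sigmoid):
    """One processing node: the weighted inputs are summed into the
    activation a, which is then passed through the transfer function g."""
    a = sum(w * y for w, y in zip(weights, inputs)) + bias
    return g(a)
```

Both transfer functions are bounded and differentiable, which is all that the training algorithms discussed later require of g.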
If the transfer function is active over the entire input space, it is termed a global transfer function. An underlying feature of empirical models involving several variables is the famous theorem given by Kolmogorov5 that states that any function of many variables can be represented by a superposition of functions of one variable and addition. A function is called separable if it can be decomposed as follows:

f(x1, x2, ..., xn) = f1(x1) + f2(x2) + ... + fn(xn)

Kolmogorov did not explain how univariate functions can be
Figure 2. Plot of a sigmoid transfer function.
Figure 3. Structure of a layered neural network.
used, but the nodes in an ANN, which at first glance seem complex, form such sums. A collection of nodes connected to each other forms the artificial neural network. Cybenko6 and numerous subsequent researchers have shown that various networks of such functions (in theory) can approximate any input-output relation to the desired degree of accuracy (in the limit, exactly). Of course, how many nodes to use in an ANN is up to the user, but the references in Appendix A are replete with ideas. Huang and Babri7 showed that boundedness and unequal limits as a → -∞ and a → +∞ are sufficient conditions for a standard feedforward artificial neural network to uniformly approximate continuous functions in Rn.

Figure 3 is an example of a small feedforward artificial neural network. You can find numerous other related architectures in the literature; some time ago, Lippman8 documented more than 50 other network configurations. I do not have the space here to describe hybrid nets, that is, nets composed of different or similar ANNs or nets connected to other types of models that are not ANNs, but a vast amount of literature exists for such architectures as well.

In the feedforward (the information flows forward from the inputs) types of ANNs, a group of nodes called the input layer receives a signal from some external source. In general, this input layer does not adjust the signal unless it needs scaling. Another group of nodes, called the output layer, returns the signals to the external environment. The remaining nodes in the network are called hidden nodes because they do not receive signals from or send signals to an external source or location. The hidden nodes may be grouped into one or more hidden layers. Each of the arcs between two nodes (the lines between the circles in Figure 3) has a weight associated with it (not shown in the figure).
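A minimal sketch of the forward pass through a fully connected, layered net of the kind shown in Figure 3; the layer sizes, the use of NumPy, and the random (untrained) weights are my choices for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, layers):
    """Propagate an input vector through a feedforward net.

    `layers` is a list of (W, b) pairs, one per hidden/output layer;
    the input layer itself only passes the (possibly scaled) data on.
    """
    for W, b in layers:
        x = sigmoid(W @ x + b)   # each arc's weight multiplies the signal on it
    return x

rng = np.random.default_rng(0)
# A 3-input, 4-hidden-node, 2-output net with random weights (untrained).
net = [(rng.normal(size=(4, 3)), np.zeros(4)),
       (rng.normal(size=(2, 4)), np.zeros(2))]
y_out = forward(np.array([0.1, -0.2, 0.3]), net)
```

Because every node uses a bounded transfer function, the outputs of this particular net necessarily fall in (0, 1); in practice the measured outputs are scaled into that range before training.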
Figure 3 shows a layered network in which the layers are fully connected from one layer to the next (input to hidden, hidden to hidden, hidden to output). Although this type of connectivity is frequently used, other patterns of connectivity are possible. Connections may be made between nodes in nonadjacent layers or within the same layer. As you can imagine, feedback connections from a node in one layer to a node in a previous layer can also be made. This latter type of
connection is called a recurrent connection that will be discussed in more detail in Section 9, and depending on the type of application for which the network is being used, such a connection may have a time delay associated with it.

3. Training the ANN (Optimization)

Another part of the jargon associated with ANN models relates to model identification. Generally, there is no direct analytical method of calculating what the values of the weights should be if a network is to model a particular behavior of a process. Instead, the network must be trained on a set of data (called the training set) collected from the process to be modeled. Training is just the procedure of estimating the values of the weights and establishing the network structure, and an algorithm used to do this is called a “learning” algorithm. The learning algorithm is nothing more than some type of optimization algorithm, such as nonlinear programming, genetic programming, interval analysis, or simulated annealing, plus methods to substitute finite differences for derivatives. Once a network is trained, it can provide a response with a few simple calculations, which is one of the advantages of using an ANN instead of a first-principles model in cases for which the model equations have to be solved rapidly over and over again.

A key difficulty with optimization to determine the ANN weights is that multiple minima occur (see the paper by Fukuoka et al.9) in the objective function that is used. Since most training procedures used for neural networks typically find local minima starting from randomly selected starting guesses for the parameters, an analyst should expect to find a variety of local minima of varying credence. While use of a global optimization procedure might eventually reach the global optimum, the execution time for such algorithms on serial computers usually expands to an unacceptable degree.
Consequently, satisfactory representation of data rests on the use of the best local minimum that can be achieved in a reasonable time.

Regardless of what algorithm is used to calculate the values of the weights, all of the training methods encompass the same general steps. First, the available data are divided into training and test sets. The following procedure is then used (called “supervised learning”) to determine the values of the weights in the network:
(1) For a selected ANN architecture, the values of the weights in the network are initialized, often as small random numbers.
(2) The inputs of the training set are fed to the network, and the resulting outputs are calculated.
(3) Some measure of the error between the outputs of the network and the known (correct) values is calculated.
(4) An optimization algorithm is executed in which the gradients of the objective function with respect to each of the individual weights may have to be calculated; the weights are changed according to the optimization search direction and step length to reach the starting point of the next iteration.
(5) The procedure returns to step 2.
(6) Iteration terminates when the value of the error calculated using the data in the test set starts to increase, quite possibly at a local minimum.

You can start with different sets of initial weights and different architectures to discover eventually a suitable net, the model you want. The criteria to determine the best weights and coefficients in functions in empirical models such as an ANN involve the notion of convergence in minimizing a norm, a topic discussed in many books on numerical analysis. Different norms lead to
Table 1. Possible Objective Functions for Determining the Values of Coefficients in a Model [ŷ Is the Predicted Value from the Model (Could Be a Vector) and y Is the Measured Value (a Vector)]

type of criterion — function (for a scalar)
absolute error (L1 norm) — Σ(i=1 to n) |yi − ŷi|
squared error (L2 norm is the square root) — Σ(i=1 to n) (yi − ŷi)²
mean squared error — (1/n) Σ(i=1 to n) (yi − ŷi)²
general power (Minkowski distance) — [Σ(i=1 to n) |yi − ŷi|^p]^(1/p), 1 ≤ p ≤ ∞
Chebyshev norm — min over b of max over 1 ≤ t ≤ n of |yi(t) − ŷi(t, b)|
Akaike (AIC) — n ln[(Σ(i=1 to n) (yi − ŷi)²)/n] + 2p (p is the number of coefficients in ŷi, but might be an arbitrary fixed or adjustable value)
different approximations. Table 1 lists some useful types of objective functions that can be used in a training algorithm. The mean-squared error is the most commonly used, but other choices can be made depending on the optimization code employed. Also, depending on the code used, you can add to the objective function simple constraints such as bounds on the weights (upper and lower) or more complicated constraints such as equations or inequalities. Penalties added to the objective function to reduce the magnitude of the weights, (1/2) Σ Σ wij², called “weight decay”, can be used because large curvature requires large weights.

One of the early myths associated with ANNs was that, because ANNs exhibited artificial intelligence, you could introduce raw observations/measurements as inputs to the net, and the net could be trained to learn (represent satisfactorily) a function or process. This myth was rapidly dispelled. One of the difficulties in using ANNs as models is that, once the ANN is accepted as a valid model, the process from which the data came can change characteristics. The input and/or output spaces may shift, leaving gaps in the coverage by the data set that was involved in the training. For example, a steady-state model may shift to a new steady-state operating point, meaning that an ANN trained for the previous point probably will not be valid at the new point. For a dynamic model, the dynamics of the process can change as well, with a similar outcome.

For success in modeling, the input data should meet many of the conditions required by strategies using statistics for model building: (1) The variables in the response space are linear (in the coefficients) combinations of the input variables. (2) The variables in the input space are deterministic (hopefully orthogonal to each other) and uncorrelated in time. (3) The output variables are statistically independent.
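The supervised-learning loop described above, with the mean-squared-error objective from Table 1 plus an optional weight-decay penalty, can be sketched as follows. The tiny network, the finite-difference gradients (one of the derivative substitutes mentioned earlier), and the target function are all illustrative choices of mine, not a prescription from the article:

```python
import numpy as np

def predict(w, x):
    """A 1-input, 2-hidden-node, 1-output net; w packs all 7 coefficients."""
    h = np.tanh(np.outer(x, w[0:2]) + w[2:4])   # hidden layer
    return h @ w[4:6] + w[6]                     # linear output node

def objective(w, x, y, lam):
    """Mean-squared error (Table 1) plus weight decay lam * (1/2) sum w^2."""
    return np.mean((y - predict(w, x)) ** 2) + lam * 0.5 * np.sum(w ** 2)

def train(x, y, lam=0.0, iters=3000, lr=0.1, seed=1):
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.normal(size=7)                 # step 1: small random weights
    eps = 1e-6
    for _ in range(iters):                       # steps 2-5
        # finite-difference gradient of the objective w.r.t. each weight
        g = np.array([(objective(w + eps * e, x, y, lam)
                       - objective(w - eps * e, x, y, lam)) / (2 * eps)
                      for e in np.eye(7)])
        w -= lr * g                              # steepest-descent step
    return w

x = np.linspace(-1.0, 1.0, 20)
y = np.tanh(1.5 * x)          # noise-free target the net can represent exactly
w = train(x, y)
final_err = np.mean((y - predict(w, x)) ** 2)
```

In practice you would also reserve a test set and stop training when its error starts to rise (step 6); a fixed iteration count is used here only to keep the sketch short.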
As you know, these criteria are only occasionally met by process measurements but are usually ignored in drawing conclusions about a model. Among the factors that cause difficulty are the following: (1) There is cross-correlation among the input variables. (2) Input data collected in time will almost certainly be autocorrelated unless preprocessing occurs. (3) If more than one output node, and thus variable, is used in the ANN, the measured values of the output variables can be cross-correlated, as can the values of the input variables. If the
input variables for a process are orthogonal, training an ANN will be facilitated. A dynamical system is said to be observable if, for any two different states, there exists an input sequence of finite length that produces a different value for each output in the output sequence. For models that are nonlinear in the coefficients as well as in the values of the input variables, such as a multilayer feedforward ANN, Levin and Narendra10 showed that, for any system with a discrete input sequence 1 to p, for a scalar such as
Figure 4. Two different functions used to fit the same data.
ŷ(k + 1) = F[y(k) ··· y(p + 1); u(k) ··· u(k + p − 1)]

where ŷ is the predicted value of F, y is the measured value of the output of F, and u is the measured value of the input, if (2n + 1) observations of the output are made, the ANN is observable. For ANNs for discrete values of the variables, with feedback, the conditions for observability as well as stability are not clear at present.

The purpose of partitioning the available data into the training and test sets is to evaluate how well the network generalizes (predicts) to domains that were not included in the training set. For nontrivial problems with multiple inputs and outputs, you probably cannot collect all of the possible input-output patterns needed to span the input-output space for a particular behavior or process to be modeled. Therefore, you have to train the network with some subset of all of the possible input-output patterns. However, the training set must be representative of the domain of interest if you expect the network to learn (interpolate satisfactorily between measured points) the underlying relationships and correlations in the process that generated the data. If you do not select a suitable input set of data, the net may not predict well for similar data and may predict poorly for completely novel data. If you want to extrapolate, as for example to predict the net output from input data taken outside the space of the training set of data, you must be careful, because blind extrapolation using any model is hazardous, particularly with an ANN.

4. Smoothing

Before presenting some specific examples of the application of ANNs, I have to mention smoothing. These ideas apply to most models but are particularly important to consider for ANNs because ANNs are complex structures and involve numerous coefficients (weights, biases, etc.).
Formulation of a model from a finite sample of measurements without any prior knowledge about the process is an ill-posed problem in the sense that a unique model may not exist and/or the model may not depend continuously on the measurements, that is, nothing is known about the process between the measurements. In practice, some prior knowledge comes into play. At a minimum, you know or hope the process has the property of smoothness so that the values of the process variables do not abruptly change in the vicinity of a measurement. Smoothing, often called regularization, of course, suppresses detail and eliminates outliers that may tell you something about the underlying process. Examine Figure 4. Would selection of the slightly curved line to represent the data be a better choice than the more complex function? Does the more complex function actually represent a real feature of the underlying process? Answers to such questions rest on knowledge about the character of the underlying process and can lead to the collection of additional data. In the limit in which the number of coefficients in the model equals the number of data points, with deterministic data, a complex model might go through
Figure 5. Change in f(x) as the smoothing parameter λ is reduced.
every data point. Note in Figure 5 how the turning points become greater as the smoothing parameter λ decreases. To damp out the frequent turning points in fitting data, a penalty to adjust nonsmoothness can be added to the objective functions listed in Table 1, such as

(λ/2) Σ(i=1 to n) [(yi″)² / (1 + (yi′)²)³] (xi − x(i−1))
The penalty term in the summation is the square of the curvature of the function (yi = f(xi), y″ = d²y/dx², and y′ = dy/dx), and λ is a smoothing factor that weights (scales) the penalty term relative to the normal objective function. The penalty term is often simplified to just the square of the second derivative, deleting the denominator. Keep in mind that adding a nonlinear penalty to an objective function considerably complicates the task of the optimization code in locating a global minimum.

Next, let us look at a feature related to smoothing. Although not widely recognized, noise in the input variable x contributes a smoothing effect in modeling. As a simple example, suppose we minimize the following objective function E1 in which y (the true value) and x are deterministic scalars and ε is the noise associated with x:

E1 = [y(x) − f(x + ε)]²   (1a)
   = y²(x) − 2y(x)f(x + ε) + f²(x + ε)   (1b)

If the noise ε is small, f(x + ε) can be approximated by a first-order Taylor series (note ∆x here is ε):

f(x + ε) ≈ f(x) + f′(x)ε   (2)

Introducing eq 2 into eq 1b and suppressing the arguments to avoid confusion gives

E1 ≈ y² − 2yf − 2yf′ε + f² + 2ff′ε + (f′ε)²   (3)
Next, take the expected value of E1 to get (〈 〉 denotes the expected value)
〈E1〉 ≈ y² − 2yf − 2yf′〈ε〉 + f² + 2ff′〈ε〉 + (f′)²〈ε²〉   (4)

To simplify eq 4, assume ε has zero mean and is uncorrelated so that 〈ε〉 = 0 and 〈ε²〉 = σ². Then

〈E1〉 ≈ y² − 2yf + f² + (f′)²σ²   (5a)
    ≈ (y − f)² + σ²(f′)²   (5b)
The term involving (y − f) is the usual least-squares error that would be obtained for deterministic input, but the second term is a smoothing term involving the square of the first derivative of the model. Compare eq 5b with eq 1a. What eq 5b means is that, when noise exists in the inputs to a model and you apply least squares to get the best coefficients in the model, you smooth whether you want to or not. Noise in the inputs has another effect in the identification of linear dynamic models, because when the effect of input noise is neglected, and it exists, prediction error methods can give biased mean values and parameter estimates. The same idea probably applies to nonlinear models. If the noise characteristics of the process measurements are known, this problem can be ameliorated to some degree, but how to resolve the problem in general is still unknown.
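Eq 5b is easy to verify numerically. In the following Monte Carlo sketch, the model f, the point x, the offset y − f(x), and the noise level σ are all arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(42)

f = np.tanh                              # example model
fprime = lambda x: 1.0 - np.tanh(x) ** 2

x, sigma = 0.7, 0.1
y = f(x) + 0.05                          # "true" value deliberately offset from f(x)

# Left side: E1 = (y - f(x + eps))^2 averaged over many noise samples
eps = rng.normal(0.0, sigma, size=1_000_000)
E1 = np.mean((y - f(x + eps)) ** 2)

# Right side: eq 5b, the deterministic least-squares error plus the
# smoothing term sigma^2 (f')^2
approx = (y - f(x)) ** 2 + sigma ** 2 * fprime(x) ** 2
```

The two values agree to within the second-order terms dropped in eq 2, that is, the input noise contributes exactly the derivative penalty appearing in eq 5b.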
Figure 6. Illustration of how the error in prediction by a previously trained ANN using a test set of data increases if too many stages of optimization are used training the ANN.
5. Overfitting and Underfitting

What is overfitting and how can you avoid it? Overfitting and underfitting are related to smoothing. The point of training an ANN on a training set of data is to subsequently represent process data not in the training set, i.e., the ANN is supposed to generalize to other similar data. ANNs, like other flexible nonlinear models, can suffer from either underfitting or overfitting depending on the degree of smoothing. A network that is not sufficiently complex can fail to detect fully the signal in complicated data, leading to underfitting. A network that is too complex, as easily occurs with an ANN, may fit the noise or measurement errors, not the “true” values of the variables, causing overfitting. Figure 4 represents both overfitting and underfitting depending on how you think your data should be modeled. Suppose that you have spectroscopic data with lots of detail. Then the right-side figure would be drastic underfitting; the model requires more detail (coefficients, structure, etc.). On the other hand, if the data should be represented by a smooth curve, the left-hand figure represents a bad case of overfitting; none of the predicted values between the experimental points are valid. Overfitting is annoying and embarrassing because it easily leads to predictions that are outside the range of the training data and may also produce wild predictions in a multilayer network even with noise-free data. For a good discussion of overfitting, see the article by Geman et al.11 on the bias/variance tradeoff. Underfitting produces excessive bias in the outputs, whereas overfitting produces excessive variance.

How can you avoid overfitting? Several ways are worth trying: (1) Use lots of data relative to the number of coefficients in the ANN. (2) Select a satisfactory model that contains the fewest weights possible. (3) In training, do not train excessively. As the number of iterations of the optimization steps increases, the error in the predictions of the training set will continue to decrease, ever more slowly, as the fit of the ANN improves. Look at the lower curve in Figure 6. Overfitting may occur, or a bad local minimum may be engaged with excessive training. Unfortunately, the prediction errors on the reserved test set of data can increase if the ANN is trained excessively. Look at the upper curve in Figure 6. (4) Train for some number of optimization iterations, then stop, and test the model on the test data. Do not make any weight changes in view of the results of the test data because making such changes will, in effect, add the test set to the training set. (5) Prune outliers from the collected data set insofar as possible.

6. Bias-Variance Issue
Keep in mind how smoothing affects the bias-variance ratio. If the model were perfect and the data were deterministic and exactly agreed with the model, then neither bias nor variance would occur. However, in modeling process data, bias does arise because the expected value of an estimator of a model parameter or output, which is a statistic, does not equal the “true” value. Bias pertains to accuracy. Bias cannot be measured directly for real data. Variance refers to a measure of precision of a population or a sample; the sample variance for a scalar is the sum of the squares of the differences between replicate measurements and some reference value (usually the sample mean) divided by a number related to the number of parameters involved in the estimate. The sum of the bias squared and the variance equals the total sample mean-square error when using a particular objective function. Variance for multiple outputs from a model corresponds to a matrix (the covariance matrix) if a model is linear in the parameters. Whether you prefer high accuracy (small bias) at the expense of precision, or the contrary, depends on the proposed use of the model.

In the literature about ANNs, attention usually focuses on minimizing the mean-square error of the measurements, the third objective function in Table 1. Articles occasionally appear in the literature (see the paper by Hong et al.12) that propose techniques to identify errors-in-variables data (data with errors in the inputs), but such strategies all make enough assumptions at the beginning to preclude their use for real process data. In general, the more data that are available, the greater is the accuracy, and the lower is the precision, of the ANN output(s).

If an ANN has multiple outputs (output nodes), you may be able to improve the modeling results with ANNs by employing a set of nets, each with one output but using the same input data for each net.
You may be able to tolerate the increased bias that results to gain a substantial decrease in variance in the model predictions. If you use a single net containing all of the output nodes, you have essentially assumed that the output variables have a similar degree of dependence on the input variables. What this statement means is that the errors in the outputs including noise yield about the same variance for each
output variable, that the output variables are uncorrelated, and that variables that are difficult to include in the modeling do not overly influence and perhaps dominate the data fitting.

7. Selection of an ANN

The decision to select a particular ANN or some other model to represent a process involves model validation/verification. The existing literature exposes quite a variety of views on the difference in meaning between these two words that I will omit here. Because insufficient and inadequate data limit the ability of ANNs to interpolate (predict) system behavior without considerable uncertainty, different models can be chosen depending on the criteria used in their evaluation. Model validation is the procedure of evaluating and testing different aspects of the model for the purpose of refining, enhancing, and building confidence in the model predictions in such a way that allows for sound decision-making. Model validation is, thus, a process, not an end result by itself. It cannot ensure an acceptable model. Rather, it provides a safeguard against faulty models or inadequately developed and tested models. If the validation process indicates that major deficiencies exist in the model, you can contemplate a new round of data collection, characterization, conceptualization, calibration, modeling, and prediction.

Some differences exist between the focus of models developed via statistical analysis versus those developed via ANNs: (1) Usually, but not always, fitting data by ANNs involves large amounts of training data, whereas statistical methods treat smaller samples. (2) Usually ANN models are more complex and nonlinear and involve more parameters (weights) than typical empirical models that involve statistical analysis. (3) Usually the goal of statistical methods is to identify the effect of each variable on the response(s) so as to justify increasing or decreasing model components.
The main objective of ANNs is to generalize or predict (both interpolate and extrapolate), but it is difficult to interpret the final ANN structure in terms of the components in a physical process. (4) Usually the computational details pertinent to an ANN are more complex than those based on statistical criteria, and the computation takes longer than for algorithms used in statistics. (5) Usually an approximate confidence region for the predictions is possible in statistical analysis, but this is rarely true in ANN applications. Even ad hoc estimates of a confidence region are hard to carry out for an ANN.

As a result, answers to questions such as the following are hard to find: (1) How can you assess whether the prediction errors are random from one data point to another? (2) How can you find whether or not any errors are distributed normally or by some specific sample distribution? (3) How can you test (without trial and error) to determine if any significant nodes should be omitted from the ANN or added to it? Replicate data from which to calculate confidence regions are not available, the undue influence of certain inputs is uncertain, the effect of input noise is unknown, particularly if it is not stationary, and so on. What can you do, then, to evaluate one ANN versus another, to say nothing of comparing the “best” ANN with a different type of model (such as regression, state-space, rule-based, etc.)? Because ANNs are empirical models, the question of the adequacy of a model has to be related to the process of interest
and the decision criteria employed. An appropriate network should exhibit the following: (1) Good "generalization" for new data, avoiding under- and overfitting. (2) Computational efficiency, which means that the smaller the network and the fewer the parameters, the less data is needed and the shorter the identification time. (3) Interpretation of the input-output relation in terms of the process operations, which is provisionally possible if independent ANN model units are combined in such a way that the elements of the ANN can be compared with the elements of a flowsheet. Because ANNs are not unique, that is, many nets can produce essentially identical outputs for the same inputs, and many different goals can be deemed "best", searching for the best net in some sense is rarely an efficient use of your time. A small "satisfactory" net may suffice to classify data but may not suffice for making predictions in time. One of the practical difficulties in selecting a criterion for choosing an adequate model is that a single criterion may be inadequate. A comparison between the output of a model and a set of data in terms of a single criterion, say, a distance, can prove to be quite unsatisfactory even if the distance is defined in terms of a suitable error function (maximum error, root-mean error, etc.). A number of criteria may be involved in the decision making, some of which may be incompatible, such as a model with desirable characteristics in the frequency domain that clash with those in the time domain. Two models can routinely have the same mean-square error in representing a process yet differ greatly in how satisfactorily they represent the process for new data. The conclusion is that simply minimizing the sum of the squares of the deviations between the responses of a proposed model and a data set may not be the best way to choose an adequate model, nor does it guarantee the best model.
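To make the last point concrete, here is a minimal numeric sketch (the residuals are invented for illustration, not data from this article): two models with identical mean-square error can leave residuals with very different structure, which a simple lag-1 autocorrelation exposes.

```python
import numpy as np

# Residuals of two hypothetical models of the same process: model A's
# errors alternate like noise, model B's errors trend systematically.
resid_a = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1])
resid_b = np.linspace(-1.0, 1.0, 9)
# Rescale model B's residuals so both models have exactly the same MSE.
resid_b *= np.sqrt(np.mean(resid_a**2) / np.mean(resid_b**2))

mse_a = np.mean(resid_a**2)
mse_b = np.mean(resid_b**2)

def lag1_autocorr(r):
    """Lag-1 autocorrelation: near zero for random residuals,
    positive for trending residuals, negative for alternating ones."""
    r = r - r.mean()
    return float(np.sum(r[:-1] * r[1:]) / np.sum(r * r))

print(mse_a, mse_b)              # identical mean-square errors
print(lag1_autocorr(resid_a))    # negative: alternating residuals
print(lag1_autocorr(resid_b))    # positive: trending residuals
```

On a single-number criterion such as MSE the two models are indistinguishable; it is the residual structure, not the error magnitude, that tells them apart.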
Among the criteria for "best", one or a combination of the following is commonly used: (i) fewest coefficients consistent with reasonable error, (ii) simplest form consistent with reasonable error and sensitivity, (iii) rationale based on physical grounds ("seems to follow ...'s law"), (iv) minimum sum of squares of deviations between predicted and empirical values, and (v) satisfactory residuals (no trends, oscillations, etc.). What can you do in practice to evaluate a particular ANN as a model for your application? I'll mention two procedures: cross-validation and residual analysis.

7.1. Cross-validation. The simplest form of cross-validation is via data splitting, which consists of dividing the data set into two parts, the training set and the test set, as I mentioned previously. Clearly, if too small a set of data is used for initially fitting the model, the prediction performance degrades. For small data sets, leave-one-out is a scheme in which all of the data points except one are used in the estimation phase (by minimizing least squares), and the remaining data point is used for testing the merit of the model by predicting the response for the selected data point. Each data point is successively deleted from the set and then returned to it. The mean sum of the squares of the residuals over the entire sequence of predictions can be viewed as proportional to the error in fitting a model. Various schemes exist to delete more than one data point at a time. In so-called V-fold cross-validation, instead of deleting one observation at a time, the n data points are divided into V groups (usually of the same size d), and the mean-squared error is minimized to fit the model using the (n - d) remaining data points.

7.2. Residual Analysis. The second way to evaluate your ANN is to use graphical residual analysis, with a residual being
the difference between the value predicted by the ANN and the measured value of a data point. Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between an ANN as a model and the data. Numerical methods for model validation that use just a single number, such as the final prediction error or a chi-squared value, tend to be narrowly focused on a particular aspect of the relationship between the model and the data and, by compressing information into a single descriptive number or test value, miss important details about the data. If the model fit to the data is satisfactory, the residuals for each output variable should approximate random errors uncorrelated with all linear and nonlinear combinations of past inputs and outputs (Billings and Varn13). If nonrandom structure is evident in the residuals, it is a sign that the model fits the data poorly. You can use various kinds of plots to test different aspects of a model and to give guidance on the interpretation of the results. Typical plots check for (i) nonconstant variance in an output variable; (ii) drift in the process (you usually assume that the process is ergodic and stationary so that time averages and sample averages at replicated data points are the same); (iii) biased disturbances and outliers; (iv) whether random errors are independent; (v) whether random errors are collinear (correlated); and (vi) the degree of pruning or expansion needed for the ANN. With respect to expanding the ANN (adding nodes) or pruning it (deleting nodes), during training or afterward, changing the network structure based on concepts of how the underlying parts of the physical process function is one possible guide. Another is simply to try out various network sizes and structures ad hoc: add a node(s) to a small net or remove a node(s) from a big net.
Because no unique model exists, either method can improve the starting model, but neither is a rigorous method. If you choose to start the training (identification) with more nodes and connections than you eventually plan to end up with, the net will contain considerable redundant information after the training terminates. What you should do then is prune the nodes and/or links from the network without significantly degrading performance. Pruning techniques can be categorized into two classes. One is the sensitivity method (Lee14): the sensitivity of the error function is estimated after the network is trained, and then the weights or nodes with the lowest sensitivity are pruned. The other class adds terms to the objective function that prune the network by driving some weights to zero during training (Kamruzaman et al.15 and Reed16). These techniques require some parameter tuning, which is problem dependent, to obtain good performance. An alternate approach is to build a net (the "growing" technique) by starting with a small number of hidden nodes and adding new nodes, or splitting existing nodes, if the performance of the net is not satisfactory and seemingly can be improved. Many authors emphasize that ANNs are limited to only interpolation or, at best, one-step-ahead prediction, because they do not generalize sufficiently well to make extended forecasts. However, with the right architecture, even a feedforward ANN can forecast satisfactorily for physical processes (Sastri17). The next two sections, one on feedforward ANNs and the other on recurrent ANNs, provide more detail about the two most commonly used ANNs in chemistry and chemical engineering. I would like to be able to cite references in which ANNs are compared with other forms of modeling for the same input data set. Although a few articles make such comparisons, because
Figure 7. Three-layer feedforward ANN including bias nodes.
the input data is usually not preprocessed in the same way, nor is the same optimization algorithm used for the comparison, such tests really cannot be relied on to validate the comparison. If a known function (not measurements) is used in the comparison, the choice of the function is usually too simple, and the kind of error added to the function inputs and outputs is usually not the same. While the entire input space can be well covered with a known function, such is not the case with an operating process. In the next two sections, I will describe in some detail the two most useful ANNs we have found for chemists and chemical engineers to apply to their problems: (1) feedforward ANNs and (2) recurrent ANNs.

8. Feedforward Artificial Neural Networks

Three-layer (sometimes called two-layer, ignoring the input nodes) feedforward artificial neural networks are common models in the literature (see the paper by Fine18 and Appendix A in the Supporting Information). Computational nodes are arranged in layers, and information feeds forward from layer to layer via weighted connections, as illustrated in Figure 7. While the neural network literature uses jargon such as training patterns, test sets, connection weights, and hidden layers for modeling involving ANNs, here I formulate ANN models in terms of classical deterministic, discrete nonlinear system identification. Graphs of the network information flow such as Figure 7 help one to understand the more complex equations. An analogous group of equations can be formulated for ANNs in continuous time. Mathematically, the typical three-layer feedforward network can be expressed as

y_i = φ_o(Cφ_h(Bu_i + b_h) + b_o)   (6)
where y_i is an output vector corresponding to an input vector u_i, C is the connection matrix (matrix of weights) encompassing all of the arcs from the hidden layer to the output layer, B is the connection matrix from the input layer to the hidden layer, and b_h and b_o are the bias vectors for the hidden and output layers, respectively; φ_h(·) and φ_o(·) are the vector-valued functions corresponding to the activation (transfer) functions of the nodes in the hidden and output layers, respectively. Thus, feedforward neural network models have the general structure of

y_i = f(u)   (7)
where f(·) is a nonlinear mapping. Hence, feedforward neural networks are structurally similar to nonlinear regression models, and eq 7 represents a steady-state process. To use a feedforward ANN for identification of dynamic (unsteady-state) systems or for predictions in time, a vector composed of a moving window of past input values (delayed coordinates) must be introduced as the input to the net. This
Figure 8. Simple recurrent ANN showing the feedback of delayed information (bias nodes not shown).
procedure yields a model analogous to a nonlinear finite impulse response model, where the output is still y_t and the input is the window u_t = [u_t, u_{t-1}, ..., u_{t-m}], or

y_t = f([u_t, u_{t-1}, ..., u_{t-m}])   (8)
The selected length of the moving window of input data must be long enough to capture the system dynamics for each variable. In practice, the duration of the data windows is determined by trial and error (or cross-validation), and each individual input variable might have a separate length for its data window to achieve optimal performance. If you use windows of past inputs and outputs in feedforward neural network models for dynamic modeling, the nets tend to be quite large, with the result that they include hundreds of parameters whose values have to be estimated. Each additional input variable to the neural net model adds greatly to the size of the network and the number of parameters that must be estimated. As a specific example, if the input vector at time t consists of 4 different variables, and the number of past values of each is selected to be 6, the net must contain 24 input nodes. If this hypothetical network were to have 12 hidden nodes and 2 output nodes, the total number of parameters to be estimated, including the bias terms, would be 326. The large number of parameters necessitates large quantities of training or identification data and causes much slower identification times. Net size for unsteady-state processes can be reduced by using recurrent ANNs, described in the next section.

9. Recurrent Artificial Neural Networks

Recurrent neural networks (RNNs) have architectures similar to standard feedforward artificial neural networks, with layers of nodes connected via weighted feedforward connections, but also with added weighted recurrent (feedback) connections, possibly time delayed. Examine Figure 8. Recurrent neural network models have the same relationship to feedforward ANNs as autoregressive infinite impulse response models have to moving-average finite impulse response models.
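To make concrete the size penalty that data windows impose on a feedforward net (a penalty that recurrent feedback largely avoids), the counting in the 4-variable, 6-lag example above can be sketched as follows; the helper names are mine, not from the article.

```python
import numpy as np

def window_inputs(U, m):
    """Stack the m most recent samples of every input variable into one
    flat input vector per time step (the moving window of eq 8).
    U has shape (T, n_vars); the result has shape (T - m + 1, n_vars * m)."""
    T = U.shape[0]
    return np.hstack([U[i : T - m + 1 + i] for i in range(m)])

def n_parameters(n_in, n_hidden, n_out):
    """Weights plus bias terms for a three-layer feedforward net."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

U = np.arange(40.0).reshape(10, 4)   # 10 samples of 4 input variables
X = window_inputs(U, 6)              # 6 values of each variable per window
print(X.shape)                       # (5, 24): 24 input nodes
print(n_parameters(24, 12, 2))       # 326, as in the text
```

Every extra lag multiplies the input-layer width by the number of variables, which is why windowed feedforward nets grow so quickly.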
RNNs provide a more parsimonious model structure of reduced complexity because the feedback connections largely obviate the necessity of data windows of time-lagged inputs. RNNs also have a direct nonlinear state-space interpretation that is useful in optimal estimation, as discussed subsequently. Two individual variations of recurrent neural network architectures are commonly employed. The first is called an internally recurrent network (IRN), which is characterized by time-delayed feedback connections from the outputs of hidden nodes back to the inputs of hidden nodes. Examine the connections in the hidden layer in Figure 8. The remainder of the ANN is
composed of standard feedforward architecture. This structure is also known as an Elman19 network. Externally recurrent networks (ERNs), on the other hand, contain time-delayed feedback connections from the output layer to a hidden layer. You can also envision a hybrid recurrent network that contains both types of recurrent connections and might be described as an internal-external recurrent network (IERN). Figure 8 shows such a network. Simulation studies, both published and unpublished, have indicated no clear advantage in the accuracy of prediction by an IRN versus an ERN, or even an IERN, for dynamic modeling. Both IRN and ERN models seem to be equally satisfactory in most process-modeling applications. Another possibility is to include a moving window of past outputs along with the past inputs to the network,

y_t = f([y_{t-1}, y_{t-2}, ..., y_{t-n}; u_t, u_{t-1}, ..., u_{t-m}])   (9)
analogous to a more general nonlinear time-series model. If you assign the vectors u_t, x_t, and y_t to denote the vector outputs of the input, hidden, and output nodes, respectively, at time t, you can formulate an IRN network as a deterministic discrete-time model

x_{t+1} = φ_h(Ax_t + Bu_t + b_h)   (10)

y_{t+1} = φ_o(Cx_{t+1} + b_o)   (11)
where φ_h(·) and φ_o(·) are the vector-valued functions corresponding to the activation functions in the hidden and output layers, respectively. In most applications, the scalar elements of the activation function for each hidden node might be

φ_i(v_i) = exp(-v_i^2/2)   (12)
where v_i is the total input to each node. Usually all the activation functions are made identical for simplicity. Linear activation functions are typically used in the output layer. The matrices A, B, and C are the matrices of connection weights for the hidden-to-hidden recurrent, input-to-hidden, and hidden-to-output connections, respectively, and the vectors b_h and b_o are the bias vectors for the hidden and output layers, respectively. By posing the IRN model in the above form, you will note that this type of recurrent neural network is a nonlinear extension of the standard linear state-space model in which the outputs of the hidden layer nodes, x_t, are the states of the model. In a similar fashion, you can write nonlinear state-space equations for the ERN. Whereas in the IRN model the states are the outputs of the hidden nodes, in the ERN model the states are the outputs of the nodes in the output layer, so that the state-space equations are

x_{t+1} = φ_o(Cφ_h(Dx_t + Bu_t + b_h) + b_o)   (13)

y_{t+1} = x_{t+1}   (14)
where the matrices B and C and the vectors b_h and b_o have the same meaning for the ERN as for the IRN, and the matrix D is the matrix of weights for the recurrent connections from the output layer at time t to the inputs of the hidden layer at time t + 1. Although the ERN and IRN can exhibit comparable modeling performance, they have different features that may make one more desirable than the other for a particular process. Just like a conventional linear state-space model, the IRN does not have any structural limit on the number of model states because the number of hidden nodes can be freely varied. The ERN, however, can only have the same number of states as model
outputs because the outputs are the states. The IRN, thus, tends to be more flexible in modeling. The ERN has the advantage that the model states have a clear physical interpretation because they are the variables of the process itself, whereas the states of the IRN are hypothetical and neither unique nor canonical. Since both types of models have been posed as difference equations in time (rather than differential equations), to complete the model, a vector of initial values of the model states must be specified. Initialization of ERN models is simple because the user can observe the current values of the process output and use those values to initialize the states. Just as with linear state-space models, IRN models are more difficult to initialize because the states of the ANN lack physical meaning. In applications, you usually initialize the states of IRN models with the median value of the activation function of the hidden nodes (0.5 if the activation function ranges from 0 to 1.0). Inaccuracies in the state initialization typically result in initial inaccuracies in the model predictions, but these die out in a time of the order of the dominant time constant of the process being modeled. Such startup transients can be minimized by holding the network inputs constant using the initial input vector and cycling the IRN model until the states and, hence, the output of the network become constant. This is equivalent to assuming that u_t = u_0 for all t < 0.

10. Example Applications

In this section, I describe four examples of the use of ANNs to model (i) pattern classifications, specifically fault detection; (ii) forecasting in time by an IRN versus a linear steady-state model of a reactor; (iii) data rectification using an IRN; and (iv) model predictive control of a packed column using an ERN. Portions of these examples have appeared in the Korean Journal of Chemical Engineering,20 while details come from the references cited for each example.

10.1. Example 1: Fault Detection Using an ANN. This example from Suewatanakal21 demonstrates the use of a feedforward ANN to detect faults in a heat exchanger. Figure 9 is a sketch showing the input and output measurements for a simple heat exchanger. The temperature and flow rate deviations from normal were deemed to be symptoms of the physical causes of faults. The physical causes considered faults here were tube plugging and partial fouling in the heat exchanger internals of the tubes and shell. The diagnostic ability of a feedforward neural network is compared with Bayesian and KNN classifiers to detect the internal faults.

Figure 9. Simplified sketch of a shell-and-tube heat exchanger.

Rather than using data from an operating heat exchanger, the Simulation Sciences code PROCESS was used to generate data for clean and fouled conditions for a steady-state countercurrent heat exchanger. To generate the data for both the clean and faulty conditions, a data file for each faulty (and normal) condition was prepared. Information about the thermodynamic properties of the streams, the fluid and exchanger physical properties, the configuration of the heat exchanger, the number of tubes, the sizes of the tubes and shell, and the fouling layer thicknesses in the tube and shell sides (for fouled conditions) was prespecified. Table 2 lists the physical data and the normal parameters for the heat exchanger.

Table 2. Prespecified Data for the Normally Operating Heat Exchanger^a

                       stream 1   stream 2   stream 3   stream 4
temperature, T (K)       533        977        708.2      599.8
flow rate, F (kg/h)     9080       4086       9080       4086
pressure, P (kPa)        207        345        148        236

^a Tube-side specifications: feed is a mixture of water, ethyl benzene, and styrene of composition (in weight percent) = 55:25:20; number of tubes = 108; length = 4.88 m; outside tube diameter = 3.18 cm; thickness = 0.26 cm; tube arrangement = square; tube pitch = 3.97 cm; and tube-side fouling-layer thickness = 0 cm. Shell-side specifications: feed = water; inside diameter = 54 cm; shell-side fouling-layer thickness = 0 cm. Baffles: number of cuts (segments) for each baffle = single.

For the fault of tube plugging (tube side only), the degree of the fault was classified into four cases: the equivalent number of totally plugged tubes was 5, 3, 2, and 1. In the study of fouling (both for the tube side and the shell side), the degree of fouling was expressed as a function of the decreased cross-sectional area, which was also classified into four cases, namely, a decrease of 8%, 5%, 3%, and 1%, respectively. To make the simulated measurements more realistic, two different levels of normally distributed noise were added to the deterministic flows, pressure, and temperatures so that the coefficients of variation of the noise were 0.02 and 0.01.

Figure 10. Network architecture used for fault detection (bias nodes are not shown) in the heat exchanger.

Figure 10 illustrates a typical feed used to classify the respective faults. To train the ANN, each measurement, namely, the temperatures of all four streams (T1, T2, T3, and T4), the two flow rates (F1 and F2), and the pressure drops in the tube and shell sides (∆Ptube and ∆Pshell), for both the normal (clean) and faulty states, had to be linearly scaled to lie between -1 and 1. The network was trained so that 0.9 represented the normal state (yes) while 0.1 represented the faulty state (no). For example, a target output pattern of [0.9, 0.9, 0.9] represented a pattern in which the heat exchanger was clean. The target pattern from a state in which there was one or more plugged tubes was [0.1, 0.9, 0.9]. The target output pattern of [0.9, 0.1, 0.9] represented a state in which fouled tubes existed, while a target output pattern of [0.9, 0.9, 0.1] represented a state in which fouling existed on the shell side. The training data set contained 80 patterns each for the four different classes (a total of 320 training patterns).
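The scaling and target-pattern scheme just described can be sketched as follows; the measurement ranges and output values are invented for illustration, and the decoding follows the target-pattern convention above (a node driven toward 0.1, i.e., below the 0.5 threshold, flags its fault).

```python
def scale_to_unit(x, lo, hi):
    """Linearly map a measurement from its observed range [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

# One output node per fault hypothesis; 0.9 = "no fault here", 0.1 = "fault".
TARGETS = {
    "clean":              [0.9, 0.9, 0.9],
    "plugged tubes":      [0.1, 0.9, 0.9],
    "tube-side fouling":  [0.9, 0.1, 0.9],
    "shell-side fouling": [0.9, 0.9, 0.1],
}

def diagnose(outputs, threshold=0.5):
    """Apply the 0.5 discrimination threshold to the three output nodes:
    any node below the threshold flags the corresponding fault."""
    names = ["plugged tubes", "tube-side fouling", "shell-side fouling"]
    faults = [n for n, o in zip(names, outputs) if o < threshold]
    return faults or ["clean"]

print(scale_to_unit(708.2, 500.0, 1000.0))   # a temperature mapped into [-1, 1]
print(diagnose([0.92, 0.12, 0.88]))          # -> ['tube-side fouling']
```

The `500.0`-to-`1000.0` temperature range is a hypothetical placeholder; in practice each measurement would be scaled over the range observed in its own training data.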
Another set of 20 patterns for each class was used for testing the classification capability of the ANN (a total of 80 patterns).

Table 3. Classification Rates for the Neural Network (When the Noise Coefficient of Variation Was 0.02)

tube plugging (% correctly classified)
number of plugged tubes     5     3     2     1
training                   100   100   100   100
testing                    100   100   100    95

tube-side fouling (% correctly classified)
% area decreased            5     3     2     1
training                   100   100    98    97
testing                     95    94    93    86

shell-side fouling (% correctly classified)
% area decreased            5     3     2     1
training                   100   100    98    97
testing                     96    95    94    93

Table 4. Comparison of Classification Rates Yielded by Traditional Methods (When the Noise Coefficient of Variation Was 0.02)

tube plugging (% correct)       Bayes procedure           KNN
number of plugged tubes      training   testing    training   testing
5                              100        100        100        98
3                              100        100        100        95
2                              100         98        100        92
1                              100         93        100        89

NPSOL was the optimization code used in the training of the ANN. In testing the ANN, a threshold value of 0.5 for the output of an output node was used as the discrimination criterion for classification. If the activation of an output node was >0.5, then that node was deemed activated and represented a faulty state of the exchanger. If the node value was