Dealing with Irregular Data in Soft Sensors: Bayesian Method and Comparative Study

Shima Khatibisepehr and Biao Huang*
Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB, Canada T6G 2G6
*To whom correspondence should be addressed. E-mail: [email protected].

The main challenge in developing soft sensors for the process industry is the irregularity of data, such as measurement noise, outliers, and missing values. This paper presents a comparative study of data-driven soft-sensor algorithms, including Bayesian methods. The algorithms considered are ordinary least-squares regression, robust regression, error-in-variables methods, partial least-squares regression, and Bayesian inference. Methods for handling irregular data are reviewed, and an iterative Bayesian algorithm for handling measurement noise and outliers is proposed. The performance of the Bayesian methods is compared with that of other existing methods through simulations, a pilot-scale experiment, and an industrial application.

Introduction

Models that are based entirely on physical principles such as mass, component, and energy balances are known as first-principles models.9,10,14 Since complete knowledge of process behavior is required to build them, first-principles models are often expensive and time-consuming to develop. Black-box modeling has been proposed for situations in which physical understanding of the process is absent.20,36 However, black-box models do not use any prior information about the process, and their applications are therefore limited. Gray-box models,2,23,33 also called hybrid models, are a useful alternative when insight into a system is required but a complete first-principles model is difficult to construct. The a priori knowledge used in gray-box modeling varies with the application; in this paper, it is limited to the structural relations between variables.

One of the key applications of gray-box modeling is soft-sensor development.1,5 Soft sensors are mathematical models that provide online estimates of difficult-to-measure variables from readily available measurements. They are often needed in chemical processes because some important process variables or modes are difficult or expensive to measure online. Modeling methods applied in soft-sensor development can be divided into two main groups: (1) ordinary least-squares (OLS) regression and (2) multivariate statistical techniques. OLS regression is a simple and straightforward approach for describing the relationship between variables. If the model error term is normally, independently, and identically distributed, OLS yields the most efficient unbiased estimates of the coefficients. It suffers from numerical problems, however, when the input variables are strongly correlated. As a result, multivariate statistical methods such as partial least-squares (PLS)19,35 have attracted great interest. PLS is a regression technique applied to linear or linear-in-parameters models that simultaneously explains variation in both input and output variables and maximizes their covariance. This approach has been used in many industrial applications in chemical processes, such as estimating distillation compositions and polymer quality variables.28,31,34

Typically, a soft-sensor model is built to facilitate prediction of future responses in situations where only some variables are measured online.5,10,23,34 The results of conventional model estimation techniques are usually condensed into a set of weights on a given set of variables that can then be used for making further predictions. For this reason, most estimation-based soft-sensor approaches require complete data samples in order to work or to produce valid predictions; however, blanks in industrial data are common27,34 because of sensor failures or differing data acquisition rates. Another issue in soft sensing is measurement noise and outliers.24,34 Outliers, which can be regarded simply as data points located far from the rest of the data, pose essentially the same kind of problem as measurement noise, but their effects are devastating if they are not detected. Outliers occur frequently in industrial data sets and are harmful to soft-sensor models derived by regression. A variety of methods exist to handle incomplete and inconsistent data,4,6,24 but many of them are problem specific.

Applying Bayesian methods19 to soft sensors enables us to improve prediction results and to overcome the problems of outliers and missing values in real-world data, as will be demonstrated in this article. In contrast to classical statistical estimation techniques, Bayesian methods yield a posterior distribution over network weights or model parameters.3,32 If the model input variables are set to the values of a new case (new input), the posterior distribution gives rise to a distribution over the predicted output of the model, known as the predictive distribution for this new case. Although the mean of the predictive distribution could be used as a single-valued prediction, as conventional soft sensors do, the full predictive distribution reveals how uncertain the prediction is. Additionally, Bayesian methods handle incomplete data sets with ease, because the dependencies among all variables and all data are discovered through the Bayesian modeling process.8,12,19,26

This paper begins with an introduction to Bayesian inference methods. Then, existing treatments for missing values and measurement noise/outliers are reviewed briefly. We propose to use the Bayesian approach to solve modeling and prediction problems in soft sensors, including practical issues such as the treatment of measurement noise, outliers, and missing values. An iterative Bayesian algorithm for handling measurement noise and outliers is developed. Finally, the effectiveness of the Bayesian methods in comparison with other methods is studied through simulations, a pilot-scale experiment, and an industrial application.
1. Introduction to the Bayesian Methods

1.1. Notation. A Bayesian model or network is a graphical model that represents probabilistic relationships between a set of random variables in a system. We denote random variables by upper-case letters (e.g., X, Y) and the actual values of these variables by the corresponding lower-case letters (e.g., x, y). Sets of variables are symbolized by bold-face upper-case letters (e.g., X, Y), and their values are represented by the corresponding bold-face lower-case letters (e.g., x, y). We use the notations P(X) and P(X = x) to denote the probability distribution for X and the probability that X takes the value x, respectively. As a convention, P(X|Y) refers to the conditional distribution of X given Y, and P(X|Y = y) refers to the conditional distribution of X given Y = y. Finally, E(X) stands for the expectation of X when the context has made the corresponding distribution clear.

1.2. Representation. Bayesian networks are directed graphical models in which nodes represent random variables and arcs represent conditional dependence/independence. Let X = {X1, ..., XN} be a set of random variables, where the nodes are ordered topologically (parents before children) as 1, ..., N in the network. Each node Xi directly depends on its parents Par(Xi), and a set of conditional probability distributions (CPDs) parametrizes this dependency. In the case of discrete variables, this distribution is often stored as a conditional probability table (CPT), that is, a table giving the probabilities for all combinations of values that the variable and its parents can take. According to an alternative definition of independence for Bayesian networks, known as the directed local Markov property, Par(Xi) is a set of parents of Xi that renders Xi independent of all its other predecessors. Given these specifications, the joint probability distribution can be calculated as follows:

P(X1, ..., XN) = P(X1) P(X2|X1) P(X3|X1, X2) ... P(XN|X1, ..., XN-1)
             = ∏_{i=1}^{N} P(Xi | X1:i-1)
             = ∏_{i=1}^{N} P(Xi | Par(Xi))     (1)
The first line of eq 1 follows from the chain rule of probability, and it is rewritten in the form of the second line. Finally, the third line follows because node Xi is independent of all its ancestors, X1:i-1, given its parents. The last equality holds only if the network is arranged according to Pearl's network construction algorithm, which is discussed below.

1.3. Network Construction. The condition that Par(Xi) ⊆ {X1, ..., Xi-1} allows us to construct a network from a given ordering of nodes using Pearl's network construction algorithm:19

Algorithm 1 (Pearl's Network Construction Algorithm).
1. Choose the set of relevant variables, X, that describe the domain.
2. Choose an ordering for the variables, X = {X1, ..., XN}.
3. While there are variables left:
   a. add the next variable, Xi, to the network;
   b. add arcs to the Xi node from some minimal set of nodes already in the network, Par(Xi), such that the following conditional independence property is satisfied: P(Xi|X1, ..., Xi-1) = P(Xi|Par(Xi)), where X1, ..., Xi-1 are all the variables preceding Xi, including Par(Xi);
   c. define the CPT for Xi.

Pearl's network construction algorithm satisfies the Markov property and expresses the conditional dependencies of the underlying probability distribution.
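To make the factorization of eq 1 and the construction of Algorithm 1 concrete, the following minimal Python sketch encodes a three-node chain network and evaluates a joint probability as the product of P(Xi|Par(Xi)). The node names and CPT values are purely illustrative and are not taken from the paper.

```python
# Minimal illustration of eq 1: P(X1, ..., XN) = prod_i P(Xi | Par(Xi)).
# The three binary nodes A -> B -> C and their CPT values are hypothetical.

parents = {            # topological order: parents listed before children
    "A": [],
    "B": ["A"],
    "C": ["B"],
}

# CPTs: map (node value, tuple of parent values) -> probability
cpt = {
    "A": {(1, ()): 0.3, (0, ()): 0.7},
    "B": {(1, (1,)): 0.8, (0, (1,)): 0.2, (1, (0,)): 0.1, (0, (0,)): 0.9},
    "C": {(1, (1,)): 0.5, (0, (1,)): 0.5, (1, (0,)): 0.2, (0, (0,)): 0.8},
}

def joint_probability(assignment):
    """Evaluate the joint probability by the factorization of eq 1."""
    p = 1.0
    for node, pars in parents.items():
        par_vals = tuple(assignment[q] for q in pars)
        p *= cpt[node][(assignment[node], par_vals)]
    return p

print(joint_probability({"A": 1, "B": 1, "C": 0}))   # 0.3 * 0.8 * 0.5 = 0.12
```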
Figure 1. Types of reasoning.
2. Bayesian Inference

The most common task performed with Bayesian networks is probabilistic inference, that is, computing the posterior probability distribution for a set of query nodes given values for some evidence nodes. The different types of reasoning that can be performed by Bayesian inference are shown in Figure 1 and outlined below.
1. Diagnostic reasoning: Moving in the opposite direction of the network arcs, this typically infers causes of problems from evidence and symptoms.
2. Predictive reasoning: Following the direction of the network arcs, this forecasts future beliefs about effects on the basis of new information about causes.
3. Intercausal reasoning: When there are exactly two possible causes for a particular effect, represented by a v-structure in the BN, knowledge of the effect together with the presence of one explanatory cause renders the alternative cause less likely.
4. Combined reasoning: Performing diagnostic and predictive reasoning simultaneously, this predicts a query node whenever both its parents and children are observed.

Inference, or evaluation, is the process of updating the probabilities of outcomes based on the model parameters and the values of the measured variables. Given a set of independently and identically distributed variables, Ð = {X1, ..., XN}, we partition the variables into Ð = (XE, XQ) to distinguish between evidence and query variables. Once we have observed a set of evidence variables, XE, Bayesian inference uses the predictive distribution to predict the variables of interest, XQ:

P(XQ | XE) = ∫ P(XQ | Θ, XE) P(Θ | XE) dΘ     (2)

that is, a complete distribution of the predicted variables is provided. To illustrate this inference problem, we consider the relatively simple but widely studied problem of linear regression for independently and identically distributed data. For simplicity, we shall consider a single-output system:

XN = f(X1, ..., XN-1; Θ) + ε     (3)
f(X; Θ) = Σ_{i=1}^{M} θi φi(X) = Θ^T Φ(X)     (4)
where ε ∼ N(0, σ2) is the independently and identically distributed noise, Θ = (θ1, ..., θM) is the set of parameters, X = (X1, ..., XN-1), and Φ(X) = (φ1(X), ..., φM(X)) collects the fixed nonlinear basis functions. For a set of n observations of XN, {xN^1, xN^2, ..., xN^n}, least-squares regression is the classical approach to obtain single-valued estimates of the parameters by minimizing the error function

Err(Θ) = (1/2) Σ_{i=1}^{n} {f(x1^i, ..., xN-1^i; Θ) - xN^i}^2     (5)

To predict the output for a new input x* = {x1^new, ..., xN-1^new}, the estimated parameter set, ΘLS, is used to evaluate f(x*; ΘLS).
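A minimal numerical sketch of eqs 4 and 5 is given below; the polynomial basis, the synthetic data, and the coefficient values are illustrative assumptions rather than anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic single-input example: x is the regressor, xN the response (eq 3).
x = rng.uniform(-1.0, 1.0, size=200)
xN = 1.5 * x - 0.5 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Fixed basis functions phi_i(x); a polynomial basis is assumed here (eq 4).
def design_matrix(x):
    return np.column_stack([np.ones_like(x), x, x**2])

Phi = design_matrix(x)

# Minimizing the squared error of eq 5 gives the usual least-squares solution.
theta_ls, *_ = np.linalg.lstsq(Phi, xN, rcond=None)

# Prediction for a new input x* uses f(x*; Theta_LS).
x_star = np.array([0.25])
print(theta_ls, design_matrix(x_star) @ theta_ls)
```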
Maximum likelihood estimation is another important parameter estimation technique, which maximizes P(xN^1, xN^2, ..., xN^n | Ð, Θ, σ2). If we assume a normal distribution, ε ∼ N(0, σ2), for the noise term, the likelihood of all the data is given by

P(xN^1, xN^2, ..., xN^n | Ð, Θ, σ2) = ∏_{i=1}^{n} P(xN^i | x1^i, ..., xN-1^i, Θ, σ2)
  = ∏_{i=1}^{n} (1/√(2πσ2)) exp[-{f(x1^i, ..., xN-1^i; Θ) - xN^i}^2 / (2σ2)]     (6)
Thus, the negative logarithm of the likelihood is

-log P(xN^1, xN^2, ..., xN^n | Ð, Θ, σ2) = (n/2) log(2πσ2) + (1/(2σ2)) Σ_{i=1}^{n} {f(x1^i, ..., xN-1^i; Θ) - xN^i}^2     (7)

Since the first term on the right-hand side of eq 7 is independent of Θ, minimizing the negative logarithm of the likelihood function is equivalent to minimizing the squared error function in the case of a normally distributed error.

By taking the prior distribution of the parameters into account, maximum a posteriori (MAP) Bayesian estimation calculates the most probable values of Θ under the posterior distribution P(Θ|Ð, σ2). In terms of the prediction for a given x*, one may make use of ΘMAP to determine P(xN^new | x*, ΘMAP, σ2). It has been shown32 that the mean of this distribution is identical to the solution of the penalized least-squares approach. This classical prediction method does not take the uncertainty over the parameters into consideration when making the prediction. On the other hand, in the predictive distribution for xN^new obtained from the Bayesian method, the uncertainties in the estimation of Θ are integrated (marginalized) out:

P(xN^new | x*, Ð, σ2) = ∫ P(xN^new | Θ, x*, σ2) P(Θ | x*, Ð, σ2) dΘ     (8)
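For the linear-in-parameters model of eq 4 with Gaussian noise, the posterior over Θ and the predictive distribution of eq 8 are available in closed form once a Gaussian prior is placed on Θ. The sketch below illustrates this; the prior precision, the noise level, and the data are assumed values chosen only for illustration (the paper itself uses a uniform prior when no prior knowledge is available, in which case the MAP estimate reduces to maximum likelihood).

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data for a straight-line model f(x; Theta) = theta0 + theta1 * x.
x = rng.uniform(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.2, 50)

Phi = np.column_stack([np.ones_like(x), x])
sigma2 = 0.2**2          # noise variance (assumed known here)
alpha = 1e-2             # assumed Gaussian prior precision on Theta

# Posterior over Theta: N(m, S)  (standard Bayesian linear-regression result).
S_inv = alpha * np.eye(2) + Phi.T @ Phi / sigma2
S = np.linalg.inv(S_inv)
m = S @ (Phi.T @ y) / sigma2

# Predictive distribution for a new input x* (eq 8): a full Gaussian with the
# parameter uncertainty marginalized out, not just a point prediction.
phi_star = np.array([1.0, 0.6])
pred_mean = phi_star @ m
pred_var = sigma2 + phi_star @ S @ phi_star
print(pred_mean, np.sqrt(pred_var))
```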
To conclude this section, we summarize the steps in Bayesian inference that are necessary in order to make a prediction:
1. specifying the prior distribution of the parameters (a uniform distribution if no prior is available);
2. computing the data likelihood;
3. computing the posterior distribution of the parameters, having observed the training data set;
4. computing the predictive distribution, which marginalizes out any uncertain parameters.

3. Dealing with Missing Data

3.1. Nature of Missing Data. Let Ð = {X1, ..., XN} denote a data set in which some of the values are missing. The observed and missing parts of Ð are represented by Do and Dm, respectively. For any data set, a matrix M = {Mil} indicates whether Dil is observed (Mil = 1) or missing (Mil = 0). The missing-data mechanism is described by the conditional distribution P(M|Ð, Θ), where Θ denotes a set of parameters. Based on this conditional dependence, several mechanisms of missing data have been defined:29
1. Missing completely at random (MCAR): In this case, the probability that data are missing does not depend on any part of Ð:

P(M|Ð, Θ) = P(M|Θ)     (9)

2. Missing at random (MAR): In MAR, the probability that data are missing depends only on the observed data:

P(M|Ð, Θ) = P(M|Do, Θ)     (10)
3. Not missing at random (NMAR): In this case, the missing-data mechanism depends on both the observed and the missing data.

3.2. Treatments of Missing Data. There are two different ways of handling missing data. The most common way is to simply exclude the cases with missing values from the analysis. If we do not want to lose data and, implicitly, information, we may instead try to estimate the missing items; this second way is generally called imputation. Six techniques belonging to these two classes are presented below.
1. Casewise deletion: This method selects only the cases that do not contain missing values for any of the variables. Under the MCAR assumption, it leads to unbiased parameter estimates.25 Nevertheless, a lot of nonmissing content is thrown out, and informative data are lost.
2. Mean substitution: A natural method of imputation is to replace all missing values of a variable by the mean of that variable. Although mean substitution preserves the sample size, it may considerably change the values of variances, correlations, and regression coefficients. This method assumes an MCAR missing mechanism.
3. The LOCF method: In the last-observation-carried-forward (LOCF) method, the last measured observation before the missing one is imputed. This approach is applicable only to situations in which measurements are expected to be constant over time.
4. Regression imputation: A regression model built from the complete cases is used to predict the missing values. This approach is less likely to produce bias, but it may still underestimate the variance.25 Regression imputation assumes that data are MAR.
5. NIPALS algorithm: Consider a data matrix X with the structure X = TP′, where T is a matrix of scores and P is a matrix of loadings. The treatment of missing values with PLS-NIPALS can be viewed implicitly as a simple imputation method in which the PLS loadings and components are iteratively calculated as the slopes of least-squares lines passing through the origin of the available data. It is equivalent to setting the residuals of all missing elements in the least-squares objective function to zero at each iteration step.
6. EM-based Bayesian algorithm: Expectation-maximization (EM) is a well-known algorithm for estimating the parameters of the probability density function of an incomplete sample. At each iteration step, the missing values are replaced by their expected values from the conditional distribution given the present data and the current estimates of the means and covariances. In addition to imputing missing values, this algorithm can directly estimate their probability distributions.
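The sketch below illustrates three of the listed treatments (casewise deletion, mean substitution, and LOCF) on a small fabricated array with missing entries; regression imputation and the EM-based approach are revisited in Section 5.

```python
import numpy as np

# Toy data matrix with missing entries encoded as NaN (values are made up).
X = np.array([[1.0, 2.1],
              [np.nan, 1.9],
              [3.2, np.nan],
              [4.1, 4.0],
              [5.0, 5.2]])

# 1. Casewise deletion: keep only rows with no missing values.
complete = X[~np.isnan(X).any(axis=1)]

# 2. Mean substitution: replace each NaN by the column mean of observed values.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# 3. LOCF: carry the last observed value in each column forward.
X_locf = X.copy()
for j in range(X.shape[1]):
    for i in range(1, X.shape[0]):
        if np.isnan(X_locf[i, j]):
            X_locf[i, j] = X_locf[i - 1, j]

print(complete, X_mean, X_locf, sep="\n")
```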
4. Dealing with Measurement Noises and Outliers

Measurement noise and outliers are common in industrial data sets. Outliers are observations far from most others in a set of data and may be regarded as particularly severe measurement noise. Typically, these observations represent a large random error that we would like to eliminate to obviate possible harmful effects. Some approaches for dealing with outliers are listed below.

4.1. Box Plot. The box plot is a helpful graphical tool for determining how severe the outlying observations are. A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. Defining Q0.25 and Q0.75 to be the first and third quartiles, the outlier detection criteria are characterized by the following fences:

lower inner fence: Q0.25 - 1.5 × (Q0.75 - Q0.25)
upper inner fence: Q0.75 + 1.5 × (Q0.75 - Q0.25)
lower outer fence: Q0.25 - 3 × (Q0.75 - Q0.25)
upper outer fence: Q0.75 + 3 × (Q0.75 - Q0.25)

By definition, a mild outlier is a point beyond an inner fence on either side, while an extreme outlier is a point beyond an outer fence. Having detected the outliers, data transformation or deletion is applied to soften their impact.
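A short sketch of the fence computation follows; the quartile routine and the sample data are illustrative choices.

```python
import numpy as np

x = np.random.default_rng(2).normal(0.0, 1.0, 500)
x[:3] = [8.0, -9.0, 12.0]          # inject a few artificial outliers

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25

inner = (q25 - 1.5 * iqr, q75 + 1.5 * iqr)   # inner fences
outer = (q25 - 3.0 * iqr, q75 + 3.0 * iqr)   # outer fences

# Mild outliers lie beyond an inner fence but within the outer fences;
# extreme outliers lie beyond an outer fence.
mild = x[((x < inner[0]) | (x > inner[1])) & (x >= outer[0]) & (x <= outer[1])]
extreme = x[(x < outer[0]) | (x > outer[1])]
print(inner, outer, mild, extreme)
```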
4.2. Robust Regression. Robust regression is a regression technique designed to limit the effect of outliers. The idea is to assign a weight to each observation so that outliers are given reduced weight. The most common method of robust regression is M-estimation.17 The resulting procedure is called iteratively reweighted least-squares (IRLS), because the estimated parameters depend upon the weights, the weights depend upon the residuals, and the residuals depend upon the estimated parameters.
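A minimal IRLS sketch with Huber weights is given below. The tuning constant, the robust scale estimate, and the synthetic data are common choices assumed here for illustration; in practice a library implementation of M-estimation would normally be used.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic straight-line data with a few gross outliers (illustrative only).
x = rng.uniform(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)
y[:5] += 25.0                       # outliers

X = np.column_stack([np.ones_like(x), x])

def huber_weights(r, c=1.345):
    """Huber weight function: 1 for small residuals, c/|r| beyond the cutoff."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / a)

beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS fit
for _ in range(50):
    r = y - X @ beta
    scale = 1.4826 * np.median(np.abs(r - np.median(r)))   # robust scale (MAD)
    w = np.sqrt(huber_weights(r / scale))
    # Weighted least squares: downweighted observations influence the fit less.
    beta_new = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print(beta)   # close to the true coefficients despite the outliers
```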
4.3. Bayesian Method. Consider that outliers are equivalent to random measurement noise of large magnitude. The Bayesian method can then be used to recover the noise-free variables. The proposed procedure is (I) to use Pearl's network-construction algorithm to build a Bayesian framework for the error-in-variables (EIV) setting (Algorithm 2) and (II) to apply an iterative algorithm to infer the noise-free variables and perform the subsequent modeling based on the inferred "noise-free" variables (Algorithm 3).

Algorithm 2. An example is shown in Figure 2.
1. Denote the original (noise-free) variables by Xi° and the measured (noise-contaminated) variables by Xi.
2. While there are variables left:
   a. add the next variables, Xi and Xi°, to the network;
   b. set Par(Xi) = Xi°; and then
   c. add arcs to the Xi° node from some minimal set of nodes already in the network, Par(Xi°), such that the following conditional independence property is satisfied: P(Xi°|X′1, ..., X′i-1) = P(Xi°|Par(Xi°)), where X′1, ..., X′i-1 are all the variables preceding Xi°, including Par(Xi°).

To complete the definition of the network model, we now need to define the conditional probability distributions (CPDs) for each of the nodes. A priori, the following standard assumptions about the underlying probability distributions are made:

ei ∼ N(0, εi)     (11)
Zi ∼ N(µi + Wi Par(Zi), σi²)     (12)

where W is the regression matrix. The outliers are those observations that have a small probability of occurrence, described by the tails of the distribution. Combining eqs 11 and 12, the general form of the CPDs for the measured and hidden (noise-free) variables can be expressed as follows:

Xi ∼ N(xi°, εi)
Xi° ∼ N(µi° + Wi Par(Xi°), σi°)     (13)
Applying this framework, the original variables can be recovered through the iterative algorithm given below. Once the Bayesian network model is built, Bayesian inference and parameter estimation proceed as follows:

Algorithm 3.
1. Start with a set of measured variables, Ð = {X1, ..., XN}, to learn the parameters of the Bayesian model.
2. Bayesian diagnostic inference: Predict the original variables from the identified model as Ð = {X̂1°, ..., X̂N°}.
3. Bayesian learning: Determine the Bayesian model parameters (such as the conditional variances of the noisy measured variables) from the predicted data set Ð = {X̂1°, ..., X̂N°}.

Steps 2 and 3 are repeated until the parameters of the model converge. Algorithm 3 enables us to detect outliers/noise and recover the original variables.
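The following sketch captures the spirit of Algorithms 2 and 3 for a single noisy linear relation: given the current coefficient estimates, the noise-free variables are inferred as posterior means of the linear-Gaussian network, the coefficients are then refit on the inferred values, and the two steps alternate. The measurement-noise variances are assumed known and the structural variance is fixed, which are simplifications of the full procedure; the data-generating values follow eq 14 and the "actual" row of Table 1.

```python
import numpy as np

rng = np.random.default_rng(4)

# Generate noisy data from the first relation of eq 14 (a1 = a2 = 1, Table 1).
n = 500
a1_true, a2_true = 1.0, 1.0
x1o = rng.normal(0.0, 1.0, n)                 # assumed distribution of X1 (noise-free)
x2o = a1_true * x1o + a2_true
x1 = x1o + rng.normal(0.0, np.sqrt(0.1), n)   # e1 ~ N(0, 0.1)
x2 = x2o + rng.normal(0.0, np.sqrt(0.1), n)   # e2 ~ N(0, 0.1)

R = np.diag([0.1, 0.1])    # measurement-noise covariance (assumed known here)
q = 0.01                   # assumed small structural variance of the X2 relation

a1, a2 = np.polyfit(x1, x2, 1)                # start from the naive OLS fit
for _ in range(30):
    # Step 2 (diagnostic inference): posterior mean of the noise-free pair
    # given the measurements, under the current linear-Gaussian model.
    mu = np.array([0.0, a2])                          # prior mean of (X1°, X2°)
    P = np.array([[1.0, a1], [a1, a1**2 + q]])        # prior covariance
    G = P @ np.linalg.inv(P + R)                      # Gaussian conditioning gain
    latent = mu + (np.column_stack([x1, x2]) - mu) @ G.T
    x1hat, x2hat = latent[:, 0], latent[:, 1]

    # Step 3 (learning): refit the structural coefficients on the inferred values.
    a1_new, a2_new = np.polyfit(x1hat, x2hat, 1)
    if abs(a1_new - a1) + abs(a2_new - a2) < 1e-10:
        break
    a1, a2 = a1_new, a2_new

print(a1, a2)   # typically closer to (1, 1) than the attenuated OLS fit on noisy data
```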
5. Comparative Study of Soft-Sensor Algorithms

The main objective in soft-sensor development is to find a model that gives the best prediction in the intended application. Since the success of an empirical model depends on the quality of the process data, pretreatment of the data is crucial: noise reduction, outlier removal, and missing-value treatment all need to be considered. In this section, Bayesian methods are compared with other existing methods for estimation and prediction in the presence of missing data and outliers.

5.1. A Simpler Case. Assume that we want to develop a soft sensor corresponding to the sequential model in Figure 2 in order to predict X2 whenever a set of {X1, X3} is available. Xi and Xi° represent the measured and noise-free variables, respectively. The true underlying models are given as follows:

Xi = Xi° + ei
e1 ∼ N(0, 0.1), e2 ∼ N(0, 0.1), e3 ∼ N(0, 0.25)
X2° = a1X1° + a2
X3° = b1X2° + b2     (14)

Four methods are applied to this problem.
Figure 2. The Bayesian EIV framework for a sequential model.

Table 1. Estimated Parameters for the Three-Variable Network Example

method            a1       a2       b1       b2        mean square error
actual            1        1        2        0
OLS               0.9095   1.0013   2.1076   -0.135    0.0095
EIV               0.9577   1.0681   2.0646   0.0977    0.0050
Bayesian method   0.9825   0.9988   1.9810   0.0003    1.6719e-004
Figure 3. The MSEs of the estimates of the simulated sequential model.
(1) Ordinary least-squares (OLS) regression: The system is represented by two sequential models:

X̂2,X1 = â1X1 + â2     (15)
X̂2,X3 = (1/b̂1)X3 - b̂2/b̂1     (16)

Each of these models takes only a portion of the available measurements into account to estimate the corresponding parameters and predict the output.

(2) Partial least-squares (PLS) regression: To make use of all the information at hand in predicting the output variable, the PLS approach is used to model the system linearly for prediction purposes:

X̄2,PLS = c1X̄1 + c2X̄3     (17)

Both one-component and two-component PLS models are investigated in this example. Unless otherwise stated, it is assumed that the data are either mean-centered or autoscaled prior to the PLS analysis.

(3) The EIV approach: Once again, the parameters of eqs 15 and 16 are estimated, this time using EIV. Since EIV models are useful mainly when the primary goal is parameter estimation rather than prediction, we do not consider them in our prediction performance comparison.

(4) The Bayesian method: The EM algorithm is applied to estimate the parameters of the Bayesian network model. Bayesian inference is then used to infer the output variable, given new observations of the input variables.

Using these methods, the parameters of the model are estimated and presented in Table 1. The mean square errors (MSEs) of the estimates are also given in Figure 3. Comparison of the MSEs tells us that the Bayesian model fits the training data better than the other models do.

Although X2 may be observed during the model identification phase (in practice this is possible and typically achieved through laboratory analysis during the identification experiments), it may not be available as an online measurement. Thus, for soft-sensor applications the derived model is used to predict X2° given only the measurements of X1 and X3. Let X̂2,X1 and X̂2,X3 denote the results obtained from OLS, X̂2,PLS1 from one-component PLS, X̂2,PLS2 from two-component PLS, and X̂2° from the Bayesian method; all are presented in Figure 4 and compared in Table 2. Since the classical regression models represent only parts of the system, they do not make use of all the available data. Hence, as expected, their prediction performance is poorer than those of the two-component PLS and the Bayesian method, which take all the variables into account. According to Table 2, the solution given by the two-component PLS is the same as the solution given by the Bayesian method. This result is not surprising, because we have not quantified any a priori knowledge about the parameter values; as mentioned before, the MAP predictions are identical to the maximum likelihood predictions under a noninformative (uniform) prior. It is also notable that the two-component PLS performs better than the one-component PLS, because it captures a greater percentage of the variance.

We have just seen that the Bayesian method and PLS perform the same with regard to the predictions. The advantages of the Bayesian method will become apparent, however, in the treatment of missing values and outliers, as well as non-normal errors, in the examples to be presented shortly.
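For reference, a sketch of the one- and two-component PLS fits on data simulated from eq 14 (with the "actual" coefficients of Table 1) is shown below; the use of scikit-learn and the random data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)

# Simulate the sequential model of eq 14 with the "actual" values of Table 1.
n = 500
x1o = rng.normal(0.0, 1.0, n)
x2o = 1.0 * x1o + 1.0
x3o = 2.0 * x2o + 0.0
x1 = x1o + rng.normal(0.0, np.sqrt(0.10), n)
x2 = x2o + rng.normal(0.0, np.sqrt(0.10), n)
x3 = x3o + rng.normal(0.0, np.sqrt(0.25), n)

X = np.column_stack([x1, x3])          # measurements available online
y = x2                                 # quality variable measured off-line

results = {}
for k in (1, 2):
    pls = PLSRegression(n_components=k)        # centers/scales the data internally
    pls.fit(X, y)
    pred = pls.predict(X).ravel()
    results[k] = np.mean((pred - x2o) ** 2)    # MSE against the noise-free X2

# The two-component model generally gives the lower error, consistent with Table 2.
print(results)
```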
Figure 4. Scatter plot comparison of different estimates of X2° in the simulated sequential model.

Table 2. Comparison of Predicted X2° in the Simulated Sequential Model

method                          absolute error average   standard deviation   mean square error
OLS regression (based on X1)    0.2184                   0.9422               0.0779
OLS regression (based on X3)    0.1962                   1.0056               0.0599
PLS regression (1 component)    0.1758                   0.9897               0.0485
PLS regression (2 components)   0.1461                   1.0040               0.0342
Bayesian method                 0.1461                   1.0040               0.0342
In this example, the information provided is also sufficient to recover the noise-free variables X1° and X3°; this is a type of diagnostic reasoning. Figure 5 and Figure 6 present scatter plot comparisons of the estimated noise-free variables (X̂1° and X̂3°) and the measured variables (X1 and X3) in relation to the original noise-free variables (X1° and X3°). The ideal case would be for all data points to lie exactly on the diagonal, indicating that the measurement errors are completely captured. Figure 5 and Figure 6 reveal that the Bayesian method is able to capture the corresponding noise-free variables to some extent, even though only the values of X1 and X3 are available. We evaluate the accuracy of X̂1° and X̂3° in comparison with X1 and X3 in Table 3. Taking the actual noise-free values (X1° and X3°) as references, accuracy is represented by the averaged absolute errors; smaller absolute error averages denote a smaller amount of measurement error associated with the data points. As shown in Table 4, much more noise can be captured if X2 is also observed in addition to X1 and X3.
Figure 5. Scatter plot comparison of measured and recovered X1° in the simulated sequential model.

Figure 6. Scatter plot comparison of measured and recovered X3° in the simulated sequential model.

Table 3. Comparison of Measured and Recovered Noise-Free Variables in the Simulated Sequential Model (X1 and X3 Are Measured)

                         X1       X̂1°      X3       X̂3°
absolute error average   0.2501   0.1531   0.3958   0.3029

Table 4. Comparison of Measured and Recovered Noise-Free Variables in the Simulated Sequential Model (X1, X2, and X3 Are Measured)

                         X1       X̂1°      X3       X̂3°      X2       X̂2°
absolute error average   0.2501   0.1332   0.3958   0.2627   0.2469   0.132

Figure 7. The Bayesian EIV framework for a multiple linear model.

5.2. A More Complex Case. 5.2.1. Missing Values. In soft-sensor applications, it is common to represent the model as a multiple linear regression:

Xn = a0 + a1X1 + a2X2 + ··· + amXm     (18)

We shall demonstrate how the Bayesian method can be applied to handle the practical issues that come up in deriving a multiple linear regression model. As an example, consider the network presented in Figure 7, with the following model expressions:

X1° ∼ N(0, 1)
X2° ∼ N(0, 2)
X3° = 3 + 2X1° + X2°
e1 ∼ N(0, 0.1)
e2 ∼ N(0, 0.25)
e3 ∼ N(0, 0.4)     (19)

First, we generate a complete data set and then randomly hide a predetermined percentage of the values to simulate partially observed measurements. In keeping with the MAR mechanism, data are deleted randomly with missing-value probabilities of 5%, 10%, 20%, 30%, and 50%, and the same missing mechanism is applied equally to the network input variables. This procedure is repeated five times for each level of missing data. Applying the Bayesian method, we fit the specified model and estimate its parameters in the maximum likelihood sense by using the EM algorithm. At each iteration step, the missing values are replaced by their expected values from the conditional normal distribution given the present data and the current estimates of the means and covariances. To illustrate, consider having to estimate Y = aX + b and then using X to estimate Y wherever it is missing. We would first take estimates of the variance, covariance, and mean, perhaps from casewise deletion. We would then use these estimated parameters to solve for the regression coefficients a and b. Having filled in the missing data with these estimates, we would then use the completed data to recalculate the regression coefficients, and we would continue this process until the parameters converge.
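A minimal sketch of the fill-and-refit loop just described for Y = aX + b is shown below; the data, the missing pattern, and the stopping rule are fabricated for illustration, and a full EM implementation would also propagate second-moment information rather than only point imputations.

```python
import numpy as np

rng = np.random.default_rng(6)

# Fabricated MAR example: Y = aX + b with some Y values missing.
n = 200
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)
missing = rng.random(n) < 0.3
y_obs = y.copy()
y_obs[missing] = np.nan

# Initial estimates of a and b from the complete cases (casewise deletion).
a, b = np.polyfit(x[~missing], y_obs[~missing], 1)

for _ in range(100):
    # Fill step: replace the missing Y values by their conditional expectations.
    y_fill = np.where(missing, a * x + b, y_obs)
    # Refit step: recalculate the regression coefficients on the completed data.
    a_new, b_new = np.polyfit(x, y_fill, 1)
    if abs(a_new - a) + abs(b_new - b) < 1e-12:
        break
    a, b = a_new, b_new

print(a, b)
```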
Figure 8. The mean square relative errors (MSREs) of the estimates of the multiple linear model for different levels of missing values.
The Bayesian approach is compared with the following methods: (1) OLS regression, (2) EIV, and (3) PLS regression with the NIPALS algorithm. Since incomplete data are not allowed in OLS regression and EIV, these methods handle missing values by casewise deletion, mean substitution, or the LOCF technique. PLS regression in its standard form, involving the use of the NIPALS algorithm, can deal with missing values directly. Figure 8 and Figure 9 show the performance of all approaches on the basis of the mean square relative error (MSRE) of the estimated parameters. The Bayesian method performs the best for all levels of missing data.

Figure 9. 100 × MSRE of estimates by various techniques under different levels of missing data.

Table 5. Criteria for Outlier Detection

specification       X1        X2        X3
median              0.0850    -0.0268   3.1158
lower quartile      -0.6511   -1.0214   1.4284
upper quartile      0.7788    1.0608    4.9519
lower inner fence   -2.7961   -4.1447   -3.8568
upper inner fence   2.9238    4.1840    10.2371
lower outer fence   -4.9411   -7.2680   -9.1420
upper outer fence   5.0688    7.3073    15.5223

Table 6. Parameters of the Simulated Multiple Linear Model in the Presence of Outliers

method         a0       a1       a2       mean squared relative error
true           3        2        1
OLS            3.0142   1.2019   0.5347   0.1253
PLS-1 comp     3.0137   1.2202   0.5091   0.1310
PLS-2 comp     3.0142   1.2019   0.5347   0.1253
EIV            4.2064   1.4261   0.596    0.1358
OLS-box plot   3.0883   1.7491   0.8258   0.0157
robust         3.0023   1.7822   0.8366   0.0129
Bayesian       2.9731   1.8695   1.0754   0.0033

Table 7. Parameters of Lognormal Measurement Errors

measurement error   µ       σ
e1                  0.001   0.35
e2                  0.001   0.35
e3                  0.001   0.4

Table 8. Parameters of the Simulated Multiple Linear Model with Lognormal Errors

method     a0        a1       a2       mean squared relative error
true       3         2        1
OLS        6.6191    1.7262   0.9132   0.4839
EIV        23.4718   0.6584   0.3143   15.829
PLS        6.6232    1.7319   0.9071   0.4951
Bayesian   3.6118    2.0795   1.0166   0.0145
OLS, EIV, and PLS are comparable for complete data sets; however, the performance of PLS deteriorates significantly as the percentage of missing values increases. Clearly, the NIPALS algorithm is quite sensitive to missing values. If the NIPALS algorithm is replaced with casewise deletion or mean substitution, the PLS estimates converge to the OLS regression solution. LOCF also performs poorly in both OLS and EIV. The performances of casewise deletion and mean substitution are acceptable and comparable. This is not surprising, because the measured variables are normally distributed; as a result, it is more likely that the values of the missing observations will be close to the population mean. The main drawback of casewise deletion is that, with a large number of variables and missing data present in each variable, the probability of a complete case is low, and therefore the set of complete cases may be empty. One interesting feature of Figure 9 is the analogous trend of the MSREs when the parameter estimates are obtained by OLS and EIV. It is noted, however, that the overall performance of EIV is better than that of OLS regression for each of the missing-value processing techniques considered.

Figure 10. The MSREs of the estimates of the multiple linear model in the presence of outliers.

5.2.2. Outliers. Let us now compare the Bayesian method with conventional methods for noisy measurements with outliers. We first generate a complete data set from the model described by eq 18. The outliers are simulated according to the six-σ rule:

|xi,j° - µi°| > 3√σi°     (20)

where µi° and σi° are the mean and variance of the variable Xi°, so that Xi° is distributed as Xi° ∼ N(µi°, σi°). We evaluate the performance of the following parameter estimation methods in the presence of outliers: (1) OLS regression; (2) PLS regression; (3) EIV; (4) OLS regression combined with the box plot technique, in which we first remove the outliers from the training
data set using the box plot outlier detection criteria and then perform OLS regression on the reduced data set (the fences used to investigate the presence of outliers in our data set are presented in Table 5); (5) robust regression; and (6) the Bayesian method (Algorithm 3).

Table 6 reports the numerical values of the estimated parameters and the MSREs of the estimates. The MSREs are also plotted in Figure 10. It is obvious that OLS and PLS are not efficient parameter estimation techniques in the presence of outliers; however, if we first use the box plot criteria to detect the outliers and remove them from the training data set, the performance of OLS improves. Since robust regression automatically detects outliers and downweights them, the associated MSRE of the estimates is reasonable. EIV and Bayesian inference are suitable algorithms for compensating for the noise. Yet, Figure 10 shows that EIV fails in parameter estimation if a data set contains significant outliers. The Bayesian method not only captures the outliers but also preserves the size of the training data set. Comparison of the MSREs of the different methods in this study confirms that the Bayesian method deals with measurement noise and outliers better than the other methods.

5.2.3. Lognormal Errors. So far, we have considered only the case where the measurement errors are normally distributed. The normality assumption may not always be realistic for industrial data. To argue against unconditional reliance on the normality of the error term in OLS and PLS regression, we now assume that the measurement errors have a log-normal distribution, with probability density function
f(x|µ, σ) = (1/(xσ√(2π))) exp(-[ln x - µ]²/(2σ²))   for x > 0     (21)
Given eq 21, it follows that the expected value and variance are
E(X) = exp(µ + σ²/2)     (22)
Var(X) = [exp(σ²) - 1] × exp(2µ + σ²)     (23)

Suppose the model structure is

X1° ∼ N(10, 1)
X2° ∼ N(10, 2)
X3° = 3 + 2X1° + X2°     (24)
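A quick numerical check of eqs 22 and 23, using the e1 parameters listed in Table 7, can be done as follows (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

mu, sigma = 0.001, 0.35                     # e1 parameters from Table 7
e = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

mean_theory = np.exp(mu + sigma**2 / 2)                          # eq 22
var_theory = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)  # eq 23

print(e.mean(), mean_theory)    # both close to 1.064
print(e.var(), var_theory)      # both close to 0.148
```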
The parameters used to simulate the log-normal distributions of the measurement errors are given in Table 7. The model parameters estimated by OLS regression, EIV, PLS regression, and the Bayesian method are compared in Table 8. The MSREs of the estimates illustrate the potential of the Bayesian method to improve parameter estimation when the measurement noise is not normally distributed.

In conclusion, Bayesian methods help us to attain increased performance by improving the treatment of measurement noise, outliers, missing values, and non-normal errors. The price to be paid is increased computational time due to the iterative nature of Bayesian learning and inference algorithms. It is noteworthy that the performance of the Bayesian methods can be further improved if informative priors are used, for example, known variances of the measurement errors.26

6. Experimental Evaluation: A Three-Tank System

In this section, we perform an experimental evaluation of the Bayesian methods and briefly discuss the results. Consider the multitank system shown in Figure 11, which consists of three tanks placed above one another.
Figure 11. The configuration of the three-tank system.
Liquid is pumped into the upper tank from the supply tank by a pump driven by a DC motor. The liquid flows out of the tanks only as a result of gravity, and the output orifices act as flow resistors. Assuming laminar outflow of an "ideal fluid" for this three-tank system, the equations describing the process are

dV1/dt = q - C1√H1
dV2/dt = C1√H1 - C2√H2
dV3/dt = C2√H2 - C3√H3     (25)
where Vi is the fluid volume in the ith tank (i = 1, 2, 3), q is the inflow to the upper tank, Hi is the fluid level in the ith tank (i = 1, 2, 3), and Ci is the resistance of the output orifice of the ith tank. The liquid levels in the tanks are the state variables of the system. The tank system has four inputs: the liquid inflow, q, and the valve settings, C1, C2, C3. Therefore, several models of the tank system can be analyzed. Let us consider steady-state operation. The basic equation in this study is the mass balance:

q = C1H1^α1 = C2H2^α2 = C3H3^α3     (26)
Sometimes, turbulence and acceleration of the liquid in the tubes must be taken into account; therefore, the general flow coefficients (αi) are used in eq 26. In the case of laminar flow, we assume αi = 0.5 according to Bernoulli's law. It is therefore possible to transform the nonlinear system into a linear one:

C1²H1 = C2²H2 = C3²H3     (27)

To formulate the relation between H2 and the measured variables, eq 27 can be further rewritten as

H2,H1 = a1 + a2H1
H2,H3 = b1 + b2H3     (28)
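A small sketch of the steady-state relations in eqs 26-28 is given below, assuming laminar flow (αi = 0.5) and arbitrary illustrative values for the inflow and valve coefficients:

```python
import numpy as np

# Arbitrary illustrative values for the inflow and valve coefficients.
q = 2.0
C1, C2, C3 = 1.0, 1.2, 0.8

# Steady state of eq 26 with alpha_i = 0.5:  q = Ci * sqrt(Hi)  =>  Hi = (q / Ci)**2
H1, H2, H3 = (q / C1) ** 2, (q / C2) ** 2, (q / C3) ** 2

# eq 27 holds at steady state: C1^2 H1 = C2^2 H2 = C3^2 H3 (= q^2)
print(C1**2 * H1, C2**2 * H2, C3**2 * H3)

# eq 28: H2 expressed linearly in terms of either H1 or H3
a2 = (C1 / C2) ** 2          # under the ideal eq 27, the intercept a1 is zero
b2 = (C3 / C2) ** 2          # likewise b1 is zero
print(H2, a2 * H1, b2 * H3)
```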
Figure 12. Trend comparison of prediction results obtained from the PLS and Bayesian models using a completely observed testing data set.
Figure 13. Trend comparison of prediction results obtained from the PLS and Bayesian models using the first partially observed testing data set.
Consider the linear model representing the liquid level in the middle tank of the three-tank system as the variable to be predicted. Once the model has been identified, the aim is to predict H2° from a new set of observations. To investigate the prediction performance of the Bayesian method on real data, it is compared with PLS for both complete and incomplete samples; since PLS was shown to perform best among the alternatives studied in the previous sections, we compare the Bayesian approach only with PLS regression. A training data set consisting of 851 steady-state points is used to derive the Bayesian and PLS models. The PLS model is

H̄2,PLS = 0.6061 × H̄1 + 0.0887 × H̄3     (29)
The fitted models are then applied to perform predictions on 52 new cases. The prediction results for the completely measured testing data are plotted in Figure 12. The pump used in the process is quite noisy, and the achieved prediction performance is considered reasonably good based on our experience with this experimental setup. As mentioned earlier, PLS and the Bayesian method have similar prediction performance under complete-data conditions.

In the next step, we remove part of the measurements from the testing data set as if they had not been measured, constructing two partially measured testing data sets. The testing data matrix can be considered a collection of top and bottom liquid-level measurements, Ð = {H1, H3}. Without loss of generality, in the first data set the missing values of the upper tank level are taken to be the first half of the corresponding data vector, H1, while at the same time the last half of the data vector associated with the lower tank level, H3, is deleted. In the second data set, the last half of the upper tank level measurements (H1) and the first half of the lower tank level measurements (H3) are taken out. For each data set, the predictions obtained from both models are shown in Figure 13 and Figure 14. When the upper tank level is not measured, the prediction performance of the PLS model obviously deteriorates; on the other hand, missing the lower tank level does not seriously affect the prediction performance of PLS. Studying the PLS model, we see that very little weight is given to the lower tank level in comparison with the upper tank level. As a result, the failure of PLS to handle the missing values is much clearer when the upper level has missing data.
Figure 14. Trend comparison of prediction results obtained from the PLS and Bayesian models using the second partially observed testing data set.
Consequently, PLS does not capture the significant changes of the middle-tank level from an incomplete data set. In contrast, Figure 13 and Figure 14 reveal that the absence of either the upper or the lower tank level measurements does not significantly influence the performance of the Bayesian model in predicting the level of the middle tank.

7. Industrial Evaluation

We now turn to the application and evaluation of Bayesian inference in the context of soft-sensor implementation for an industrial process. For proprietary reasons, the actual process and tag names are not disclosed, and all data are normalized.

7.1. Existing Measurement. In a separation vessel, it is very important to maintain the component A to component B ratio (A:B) in the feed streams at certain levels so as to achieve effective and efficient separation at an affordable cost. A:B is a key quality variable that is not available "on demand"; it becomes available only after several hours of laboratory analysis. For this reason, a refractometer and a calculation tag have been used to provide online A:B estimates and increase control efficiency, where the calculation tag was developed using conventional data-fitting methods.
Figure 15. Scatter plot comparison of existing A:B measurements.
Laboratory data are used to evaluate the accuracy of the A:B ratios given by the refractometer and the existing calculation tag. The outputs of the refractometer and calculation tags are compared with the (infrequently sampled) laboratory data in Figure 15. It is clear that the existing measurements are not accurate enough. Specifically, the measurements from the calculation tag are lower than the laboratory data, so A:B ratios smaller than the true values were used in the A:B ratio controllers. This underestimation resulted in the addition of more component A than required and consequently increased the operating cost of the plant. As a result, there is a need to improve the accuracy of the online A:B estimation.

7.2. Soft-Sensor Development. Our overall purpose is to increase the production of component B while diligently maintaining environmental compliance through a reduction in the usage of component A, via better A:B measurements. To achieve this objective, we are interested in deploying soft-sensor technology for this problem. Presently, however, we do not have a complete model of the process. Thus, we adopt a gray-box model in which first principles suggest an appropriate model structure, while historical data are used to determine the parametric relationship between the A:B ratio and the online-measurable process variables. The A:B estimation model is expressed as follows:

DB = (X0 + θ1X1) / (θ2X2 + θ3X3 + θ4X4 + θ5X5 + θ6X6)     (30)
where Xi (for i = 0-6) are the measured input variables and θi (for i = 1-6) are the model parameters. Altogether, 2385 data points have been used to train and evaluate the soft sensors. Formulating the soft-sensor problem in a probabilistic framework based on eq 30, a Bayesian soft sensor is developed. The performance of the Bayesian soft sensor is validated on a set of cross-validation data and compared with the old (non-Bayesian) soft sensor. Scatter plot comparisons of each A:B soft sensor vis-à-vis the laboratory data are presented in Figure 16. It is obvious that unreliable measurements result in overestimation of the A:B values by the non-Bayesian soft sensor. Figure 17 depicts a comparison of the Bayesian soft-sensor and refractometer outputs. This figure demonstrates that the Bayesian soft sensor provides reasonably successful predictions as a result of capturing changes in both the measured and the quality variables.
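Assuming that eq 30 is the ratio model as reconstructed above, the sketch below shows how such parameters could be fit by nonlinear least squares; the data, the starting values, and the use of SciPy are illustrative assumptions, since the industrial data and the actual fitting procedure are not disclosed.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(8)

def db_model(theta, X):
    """eq 30: DB = (X0 + theta1*X1) / (theta2*X2 + ... + theta6*X6)."""
    num = X[:, 0] + theta[0] * X[:, 1]
    den = X[:, 2:] @ theta[1:]
    return num / den

# Synthetic inputs and "true" parameters, purely for illustration.
X = rng.uniform(0.5, 1.5, size=(300, 7))
theta_true = np.array([0.8, 1.2, 0.9, 1.1, 0.7, 1.0])
db = db_model(theta_true, X) + rng.normal(0.0, 0.01, 300)

# Fit the parameters by nonlinear least squares on the residuals.
res = least_squares(lambda t: db_model(t, X) - db, x0=np.ones(6))
print(res.x)
```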
Figure 16. Scatter plot comparison of Bayesian (new) and non-Bayesian (old) A:B soft sensors.
Figure 17. Scatter plot comparison of refractometer and Bayesian A:B soft sensor.
8. Conclusion

In this work, the practical issues of soft-sensor development have been studied. To derive a soft-sensor model from industrial data, it is important to deal with missing values, noisy measurements, and outliers. Several approaches for handling irregular data sets have been reviewed: casewise deletion, mean substitution, LOCF, regression imputation, the NIPALS algorithm, and the EM algorithm. Robust regression and the box plot criteria were also presented as the most commonly used techniques for detecting and dealing with outliers. Bayesian inference theory and algorithms have been discussed in detail. Since the Bayesian method enables us to cope with missing values, it is natural to formulate the soft-sensor problem in a Bayesian framework. A Bayesian learning/inference procedure (Algorithms 2 and 3) has also been presented for dealing with measurement noise and outliers. Simulated data sets, designed specifically to accentuate the presence of outliers, missing values, and log-normal errors, have been used to assess the performance of each approach to soft-sensor development and implementation. Experimental and industrial data have also been analyzed to demonstrate the effectiveness of the Bayesian method on real data sets.
Nomenclature

Ai = measured input variable in the DB estimation model
Ci = resistance of the output orifice of the ith tank
Ð = set of independent and identically distributed observations
Dm = unobserved variables in a data set
Do = observed variables in a data set
E(X) = expectation of X
Err(X) = error function of X
Hi = fluid level in the ith tank
M = model structure
N(0, εi) = normal (Gaussian) distribution
P(X) = probability distribution of X
P(X|Y) = conditional distribution of X given Y
P(X|Y = y) = conditional distribution of X given Y = y
P(X = x) = probability that X takes the value x
Par(X) = set of parents of X
Q = covariance of the state noise
Q0.25 = first quartile
Q0.75 = third quartile
R = covariance of the observation noise
T = total feed flow rate to IPS II
Vi = fluid volume in the ith tank
W = regression matrix
X° = noise-free X
ei = noise associated with the measured variable Xi
q = inflow to the top tank
Θ = model parameters
εi = variance of Xi
µi = mean of Xi
θi = DB estimation model parameters
Literature Cited

(1) Aaron, L. Z.; Shan, T. Y. Development of an Industrial Soft Sensor; Department of Chemical and Biomolecular Engineering, National University of Singapore: Singapore, 2004.
(2) Abony, J.; Madar, J.; Szeifert, F. Combining First Principles Models and Neural Networks for Generic Model Control. 6th On-line World Conference on Soft Computing in Industrial Applications (WSC6); Internet, September 2001.
(3) Bishop, C. M.; Tipping, M. E. Bayesian Regression and Classification. In Advances in Learning Theory: Methods, Models and Applications; Suykens, J. A. K., et al., Eds.; NATO Science Series III: Computer and Systems Sciences, Vol. 190; IOS Press: Amsterdam, The Netherlands, 2003.
(4) Brandel, J. Empirical Bayes Methods for Missing Data Analysis; Technical Report 2004:11; Department of Mathematics, Uppsala University: Sweden, June 2004.
(5) Champagne, M.; Amazouz, M.; Platon, R. The Application of Soft Sensors in the Pulp and Paper and Cement Manufacturing Sectors for Process and Energy Performance Improvement; Technical Report CETC 2005-100(TR); Effective Assets Inc., June 2005.
(6) Chen, C. Robust Regression and Outlier Detection with the ROBUSTREG Procedure. Proceedings of the 27th Annual SAS Users Group International Conference; SAS Institute: Cary, NC, 2002.
(7) D'Agostini, G. Bayesian Inference in Processing Experimental Data: Principles and Basic Applications. Rep. Prog. Phys. 2003, 66, 1383-1420.
(8) Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc., Ser. B 1977, 39, 1-38.
(9) Esmaily Radvar, G. Practical Issues in Non-linear System Identification. M.Sc. Thesis, Department of Chemical and Materials Engineering, University of Alberta, Spring 2002.
(10) Friedman, Y. Z.; Neto, E. A.; Porfirio, C. R. First Principles Distillation Inference Models for Product Quality Prediction. Hydrocarbon Process. 2002, 81 (2), 53-58.
(11) Friedman, N. The Bayesian Structural EM Algorithm. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, 1998; Morgan Kaufmann, 1998; pp 129-138.
(12) Ghahramani, Z. Learning Bayesian Networks. Lecture Notes in Artificial Intelligence; Department of Computer Science, University of Toronto: Toronto, Canada, 1997.
(13) Graebe, S. F.; Goodwin, G. C. Control Design and Implementation in Continuous Steel Casting. IEEE Control Syst. 1995, 64-71.
(14) Grantham, S. D.; Ungar, L. H. A First Principles Approach to Automated Troubleshooting of Chemical Plants. Comput. Chem. Eng. 1990, 14, 783-798.
(15) Heckerman, D. A Tutorial on Learning Bayesian Networks; Technical Report MSR-TR-95-06; Microsoft Research, November 1996.
(16) Helsper, E. L.; Van der Gaag, C. Building Bayesian Networks through Ontologies. Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, France, 2002; IOS Press, 2002; pp 680-684.
(17) Huber, P. J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73-101.
(18) Johnson, R. A.; Wichern, D. W. Applied Multivariate Statistical Analysis; Prentice Hall: Upper Saddle River, NJ, 1998.
(19) Korb, K. B.; Nicholson, A. E. Bayesian Artificial Intelligence; Chapman & Hall/CRC: Boca Raton, FL, 2004.
(20) Lakshminarayanan, L.; Tangirala, A. K.; Shah, S. L. Soft Sensor Design Using Partial Least Squares and Neural Networks: Comparison and Industrial Applications. AIChE Annual Meeting, Miami Beach, FL, November 1998.
(21) Leray, P.; Francois, O. Bayesian Network Structural Learning and Incomplete Data. Proceedings of the AKRR'05 International and Interdisciplinary Conference, Helsinki, Finland, June 2005.
(22) Lewicki, P.; Hill, T. Statistics: Methods and Applications; StatSoft, Inc.: Tulsa, OK, 2005.
(23) Li, B. Project Review: Soft Sensors; Technical Report; University of Alberta: Alberta, Canada, February 2005.
(24) Lin, B.; Recke, B.; Knudsen, J. K. H.; Bay Jørgensen, S. A Systematic Approach for Soft Sensor Development. Comput. Chem. Eng. 2007, 31, 419-425.
(25) Magnani, M. Techniques for Dealing with Missing Data in Knowledge Discovery Tasks; Department of Computer Science, University of Bologna, June 2004; http://magnanim.web.cs.unibo.it/data/pdf/missingdata.pdf (accessed May 2007).
(26) Murphy, K. P. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Dissertation, University of California, Berkeley, CA, 2002.
(27) Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemom. Intell. Lab. Syst. 1996, 35, 45-65.
(28) Park, S.; Han, C. A Nonlinear Soft Sensor Based on Multivariate Smoothing Procedure for Quality Estimation in Distillation Columns. Comput. Chem. Eng. 2000, 24, 871-877.
(29) Rubin, D. Inference and Missing Data. Biometrika 1976, 63, 581-592.
(30) Shrive, F. M.; Stuart, H.; Quan, H.; Ghali, W. A. Dealing with Missing Data in a Multi-Question Depression Scale: A Comparison of Imputation Methods. BMC Med. Res. Methodol. 2006.
(31) Skagerberg, B.; MacGregor, J. F.; Kiparissides, C. Multivariate Data Analysis Applied to Low-Density Polyethylene Reactors. Chemom. Intell. Lab. Syst. 1992, 14 (1-3), 341-356.
(32) Tipping, M. E. Bayesian Inference: An Introduction to Principles and Prediction in Machine Learning. In Advanced Lectures on Machine Learning; Bousquet, O., von Luxburg, U., Rätsch, G., Eds.; Springer: New York, 2004; pp 41-62.
(33) Van Lith, P. Hybrid Fuzzy-First Principles Modeling; Twente University Press: The Netherlands, 2002.
(34) Wang, D.; Srinivasan, R.; Liu, J.; Guru, P. N. S.; Leong, K. M. Data-driven Soft Sensor Approach for Quality Prediction in a Refinery Process. 4th International IEEE Conference on Industrial Informatics, Singapore, August 2006.
(35) Wold, S.; Eriksson, L.; Trygg, J.; Kettaneh, N. The PLS Method: Partial Least Squares Projections to Latent Structures and Its Applications in Industrial RDP; Institute of Chemistry, Umeå University, June 2004.
(36) Zamprogna, E.; Barolo, M.; Seborg, D. E. Optimal Selection of Soft Sensor Inputs for Batch Distillation Columns Using Principal Component Analysis. J. Process Control 2005, 15, 39-52.
Received for review March 9, 2008
Revised manuscript received August 3, 2008
Accepted September 5, 2008

IE800386V