
Nonlinear System Identification with Selective Recursive Gaussian Process Models

Lester Lik Teck Chan,† Yi Liu,‡ and Junghui Chen*,†

†R&D Center for Membrane Technology, Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li, Taiwan 320, Republic of China
‡Engineering Research Center of Process Equipment and Remanufacturing, Ministry of Education, Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310014, Zhejiang, P. R. China

ABSTRACT: The Gaussian process (GP) model has been applied to the identification of process models. A GP model is represented by its mean and covariance functions, and it provides a full predictive distribution of the output, including an estimate of the variance of the predicted output. GP-model-based methods have proven successful, but they can incur a high computational load because of the required matrix inversion. In this work, a method which recursively updates the covariance matrix is proposed. The update scheme is selective, admitting only data from the regions that require improvement. This enables the model to be updated without placing a high computational demand. In addition, the update is augmented by a pruning process which removes redundant data from the model to keep the data size compact. A mathematical example as well as an application to an industrial plant is presented to illustrate the applicability of the proposed method and to demonstrate its effectiveness in prediction performance.

1. INTRODUCTION

In chemical plants, processes are normally subject to changes in the process operation or to unexpected disturbances that enter the process. Thus, process models are not static, and they should be constantly improved during plant operation. That is, the model representation should support model updating, in the sense that the original model, built on falsified or incomplete knowledge, can be replaced with an improved model in a simple way. However, the detailed development of a mechanistic model takes a long time, so a mechanistic model is infeasible for agile, responsive manufacturing, in which the operating regions often need to be changed for different required products. The development of computers and data storage facilities allows process data to be widely available in many chemical plants, and data-driven system identification is popular in practical applications.

System identification is a subject with a long history of research interest, and it remains active because of the need to accommodate more requirements and to improve operating processes. The objective of system identification is to characterize an operating process within a specified class to which the process under test is equivalent. A thorough review of how modeling combines concepts and features from other fields has been presented by Ljung et al.1 The choice of identification models depends on a number of factors, such as the desired accuracy, the ease with which the model can be handled, its flexibility, and the techniques available. In past research, linear models were attractive for several reasons, but they have limitations.2 Most real-life processes show nonlinear dynamic behaviors, and a linear model can only describe such a system over a small range of input and output values. Nowadays, as operating processes are pushed to the limits of their performance and accurate descriptions over large input and output ranges are demanded, linear models are no longer satisfactory. Therefore, in recent years, there has been considerable interest in identification methods

for nonlinear systems, such as neural networks and fuzzy models. Himmelblau3 has given an account of the experience in applications of artificial neural networks (ANNs) in chemical engineering. ANNs are nets of simple functions, and they can provide satisfactory, useful empirical models of complex nonlinear processes for a wide variety of purposes. They receive input−output data of the identified process as patterns to be recognized and identify a process accordingly. Fuzzy models are useful when the identified process contains a lot of uncertainties, unpredictable dynamics, and other unknown phenomena that cannot be mathematically modeled at all. The advantage of neural networks and fuzzy models lies in the fact that they can learn the characteristics of inputs and outputs and approximate nonlinear systems, but there are still several major disadvantages. First, these prediction models often supply only a scalar prediction at any sampling point, without any measure of the confidence in that prediction, because the models do not take into account the uncertainty of the model structure. More useful models would show error bars with each prediction, or even supply a full predictive distribution. Second, overfitting may occur when a statistical model describes random errors or noise instead of the underlying relationship. That is, the predicted models are only concerned with reducing model errors at the training input points; what the model does at points between inputs is inconsequential to the modeling process.4 For these two problems to be solved, Gaussian process (GP) models are ideally suitable. Recently, GP models have been increasingly applied to different nonlinear dynamic systems. The GP model approach to curve fitting was first introduced by O'Hagan5 and was later compared to widely used models by


Rasmussen,6 leading to the rapid expansion of research into GP models since then. Rasmussen7 has applied the technique to process modeling by estimating the state transition function for nonlinear processes. Azman and Kocijan8 applied GP models to the modeling of a biosystem, whereas Lundgren and Sjoberg9 used the GP model for linear and nonlinear model validation. Sciascio and Amicarelli10 developed a biomass concentration estimator for a batch biotechnological process based on the GP model. The GP model has also been utilized with other tools. Chen and Ren11 proposed the application of bagging to obtain more robust and accurate predictions based on GP models, whereas Yan et al.12 applied the GP model to migration, which adapts a base model from an old process to a new but similar one. GPs have several key properties, including fewer parameters to optimize and the capability to provide a predictive variance that indicates the reliability of the output in a local stochastic region, which traditional nonlinear models lack. This means that if the predictive distribution is tightly packed, the confidence level of the model prediction is high; on the other hand, a predictive distribution spread widely over a range of values indicates that the model is highly uncertain.

The aforementioned methods are direct applications of the GP model to dynamic systems. However, real-time applications of the identification of a GP model are more interesting for various purposes, such as tracking of time-varying systems for adaptive control, prediction and diagnosis, and filtering. Kadlec et al.13 reviewed and discussed algorithms for adaptive data-driven soft sensing, citing the importance of models that follow changes in processes. The original identification methods based on a fixed set of measurements for GP models are not suitable for real-time applications, and the recursive updating capability of GP models has not been much reported so far. Ni et al.14 proposed a recursive GP and its application to modeling a nonlinear system. In their work, a recursive GP algorithm was developed to automatically adapt to process drift in both samplewise and blockwise manners; the GP model is updated by down-weighting the previous samples with a forgetting factor, and the output is updated by adding a bias term. Ni et al.15 applied the same strategy, with the addition of a filter to the preprocessing of the output target, and improved the accuracy of the prediction. Similar work can also be found in ref 16. In the literature, there is no selection of data when the GP model is updated, so the model would be updated using some redundant data; the training of the model, particularly in real-time applications, would not be efficient. On the other hand, to overcome the limitation of large data sets, various sparse representations of the GP models have been proposed in the literature. The aim is to concentrate efforts on a subset of the full data set, and the most crucial step in these sparse methods is the selection of the subsets.17 Csato and Opper18 proposed a method based on the combination of a Bayesian online algorithm with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the GP models; the online algorithm allows the GP training to speed up by sweeping through a data set only once. Seeger et al.17 presented a novel, heuristic method for very fast forward selection, leading to a sufficiently stable approximation of the log marginal likelihood of the training data, which can be optimized by adapting a large number of hyper-parameters automatically; the hyper-parameter adaptation is done by maximizing an approximation to the marginal likelihood of the data. Keerthi and Chu19 presented a new basis selection criterion for building sparse GP regression models, providing promising gains in accuracy as well

as efficiency. Ranganathan et al.20 presented a GP inference algorithm called online sparse matrix Gaussian processes. It is based upon the fact that, for kernels with local support, the Gram matrix is typically sparse; consequently, maintaining and updating the sparse Cholesky factor of the Gram matrix can be done efficiently using Givens rotations. In the work of Petelin et al.,21 the GP model is constantly adapted based on the new characteristics evolved from the basis vector set and the hyper-parameter values; the covariance matrix is updated only when the hyper-parameter values have changed.

The consequences of these facts motivate us to exploit the adaptive GP model, an adaptive technique for real-time identification. With the estimate of the variance from the GP model serving as a level of confidence, a scheme for updating the model is proposed. The objectives of the adaptive GP formalism, when applied to a given process, are (a) to provide a dynamic structure for online applications [more specifically, a dynamic architecture means the GP can self-adjust the required measurement data and update the parameters with the changes of the process behavior]; (b) to produce a compact form for effective model prediction [in other words, the adaptive GP can not only select the active data for the training model but also hold the training data at the minimum required size when modeling the dynamic process behavior]; and (c) to be able to predict the responses of processes under changes of the operating environment.

The rest of this paper is organized as follows. The GP model definition and the training method for a GP model are presented in section 2. In section 3, the recursive update algorithm is detailed, with growing as well as pruning of the covariance matrix, and the proposed concept is illustrated. This is followed by case studies in section 4, where a mathematical example and an industrial plant are presented to show the application of the proposed method, accompanied by a discussion of the results. Finally, some concluding remarks are made.

2. MODELING OF SYSTEMS

In process modeling, the overall objective is to find a model that best fits the identification data set. GP models provide a prediction of the output variables for a new input through Bayesian inference. For an output variable $\mathbf{y} = (y_1, ..., y_N)^{\mathrm{T}}$, a GP models the regression function as having a Gaussian prior distribution with zero mean,

$$\mathbf{y} = (y_1, ..., y_N)^{\mathrm{T}} \sim G(\mathbf{0}, \mathbf{K}) \qquad (1)$$

$\mathbf{K}$ is the $N \times N$ covariance matrix with elements defined by a covariance function, $K_{ij} = C(\mathbf{x}_i, \mathbf{x}_j)$, where $\mathbf{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$ and $\mathbf{x}_n \in \mathbb{R}^D$, $n = 1, ..., N$, are the input variables. The definition of the covariance function is nontrivial; a commonly used covariance function is

$$C(\mathbf{x}_i, \mathbf{x}_j) = v_0 \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D} a_d (x_{d,i} - x_{d,j})^2\right) + b\,\delta_{i,j} \qquad (2)$$

where $\theta = [v_0\;\; a_{d,d=1,...,D}\;\; b]$ is the vector of modeling parameters, commonly known as hyper-parameters. $x_d$ is the $d$th component of the vector $\mathbf{x}$, and $\delta$ is the Kronecker delta defined as

$$\delta_{i,j} = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases} \qquad (3)$$
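To make the notation concrete, the following is a minimal NumPy sketch of eq 2; the function and variable names (sq_exp, cov_matrix, v0, a, b) are illustrative choices, not from the paper.

```python
import numpy as np

def sq_exp(xi, xj, v0, a):
    # Smooth part of eq 2: v0 * exp(-0.5 * sum_d a_d * (x_{d,i} - x_{d,j})^2)
    return v0 * np.exp(-0.5 * np.sum(a * (xi - xj) ** 2))

def cov_matrix(X, v0, a, b):
    # K_ij = C(x_i, x_j); the Kronecker delta term b*delta_ij of eq 2
    # contributes only on the diagonal (i = j), i.e., as a noise variance.
    N = len(X)
    K = np.array([[sq_exp(X[i], X[j], v0, a) for j in range(N)]
                  for i in range(N)])
    return K + b * np.eye(N)
```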


Hyper-parameter $v_0$ controls the overall scale of the local correlation, $a_d$ allows a different distance measure in each input dimension, and $b$ is the estimate of the noise variance. The hyper-parameters can be estimated by maximization of the log-likelihood

$$L(\theta) = \log(p(\mathbf{y}|\theta, \mathbf{X})) = -\frac{1}{2}\log|\mathbf{K}| - \frac{1}{2}\mathbf{y}^{\mathrm{T}}\mathbf{K}^{-1}\mathbf{y} - \frac{N}{2}\log(2\pi) \qquad (4)$$

where $p(\mathbf{y}|\theta, \mathbf{X}) = G(\mathbf{0}, \mathbf{K})$. The optimization problem can be solved by using the derivative of the log-likelihood with respect to each hyper-parameter, given as

$$\frac{\partial L}{\partial \theta} = -\frac{1}{2}\mathrm{tr}\!\left(\mathbf{K}^{-1}\frac{\partial \mathbf{K}}{\partial \theta}\right) + \frac{1}{2}\mathbf{y}^{\mathrm{T}}\mathbf{K}^{-1}\frac{\partial \mathbf{K}}{\partial \theta}\mathbf{K}^{-1}\mathbf{y} \qquad (5)$$

After the hyper-parameters are obtained, the prediction of the GP model at a new input $\mathbf{x}_*$ can be computed. The predictive distribution of $y_*$ is

$$p(y_*|\mathbf{y}, \mathbf{X}, \mathbf{x}_*) = \frac{p(\mathbf{y}, y_*|\mathbf{X}, \mathbf{x}_*)}{p(\mathbf{y}|\mathbf{X})} \qquad (6)$$

with mean $\hat{y}_*$ and variance $\sigma^2$ given by

$$\hat{y}_* = \mathbf{k}^{\mathrm{T}}\mathbf{K}^{-1}\mathbf{y} \qquad (7)$$

$$\sigma^2 = k_* - \mathbf{k}^{\mathrm{T}}\mathbf{K}^{-1}\mathbf{k} \qquad (8)$$

where $\mathbf{k}$ is the covariance vector between the new input and the training data and $k_*$ is the covariance of the new input,

$$\mathbf{k} = [C(\mathbf{x}_*, \mathbf{x}_1)\;\; C(\mathbf{x}_*, \mathbf{x}_2) \cdots C(\mathbf{x}_*, \mathbf{x}_N)]^{\mathrm{T}} \qquad (9)$$

$$k_* = C(\mathbf{x}_*, \mathbf{x}_*) \qquad (10)$$

The vector $\mathbf{k}^{\mathrm{T}}\mathbf{K}^{-1}$ can be viewed as a smoothing term which weights the training outputs to make a prediction for the new input vector $\mathbf{x}_*$. Equation 8 provides a confidence level on the model prediction, as a higher variance value indicates that the region around the input vector contains few data or is corrupted by noise.

This solution requires the computation of the inverse of the $N \times N$ covariance matrix ($\mathbf{K}^{-1}$), which has an associated computation cost of $O(N^3)$. The training of the GP model can, therefore, be computationally demanding and time-consuming, especially for a large set of data. In actual plant operation, the varying condition of the process implies that the prediction based on a trained model may show discrepancy. When new data become available, the hyper-parameters should be updated to provide a better prediction that reflects the changing condition of the process. To alleviate the computational load resulting from retraining, particularly the computation of the inverse of the covariance matrix, the incorporation of the new data into the covariance matrix using a recursive update method is proposed in this work. In addition, a data selection method is developed to complement the update method. Since the size of the data may grow as the process continues to operate, some of the data used in the modeling may contain redundant information which is not critical to the accuracy of the prediction. Because the same information is carried in other sets of data, the data set that has the least effect on the prediction performance after removal is chosen for removal. The following section provides a detailed derivation of the proposed method; it begins with the recursive update method and then introduces the data removal method.
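Continuing the assumptions of the previous sketch, the following evaluates the log-likelihood of eq 4 and the predictive mean and variance of eqs 7 and 8. For clarity, a precomputed inverse K_inv is passed in; in practice a Cholesky factorization would be preferred over direct inversion.

```python
import numpy as np

def log_likelihood(y, K):
    # Eq 4: L(theta) = -0.5*log|K| - 0.5*y^T K^{-1} y - (N/2)*log(2*pi)
    N = len(y)
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * logdet
            - 0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * N * np.log(2.0 * np.pi))

def predict(x_star, X, y, K_inv, v0, a, b):
    # Eq 9: covariance vector between the new input and the training data
    k = np.array([sq_exp(x_star, xi, v0, a) for xi in X])
    k_star = v0 + b                 # eq 10 evaluated with the kernel of eq 2
    y_hat = k @ K_inv @ y           # predictive mean, eq 7
    var = k_star - k @ K_inv @ k    # predictive variance, eq 8
    return y_hat, var

# Usage: K_inv = np.linalg.inv(cov_matrix(X, v0, a, b)) is computed once;
# the recursive update of section 3.1 then avoids repeating the O(N^3) cost.
```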

3. ACTIVE RECURSIVE GAUSSIAN PROCESS MODEL

With changing process conditions, the model prediction may not be accurate. In order to judge the performance of the prediction, the prediction error is used,

$$e = \hat{y} - y \qquad (11)$$

where $y$ is the measured output. With the GP variance serving as a measure of the confidence in the prediction, it is sensible to use this variance to determine whether new data would benefit the performance of the model. Like the mechanism that assesses the performance of the current operation using a control chart, the scheme detects whether the prediction deviation ($e$) exceeds the minimum bounds in the current operation. This means that the model parameters are retained in the local plant until there is statistically sufficient evidence that the error value has changed. The new data that are above the limits are deemed active; they will be used to update the model by a recursive scheme. As a result, the model does not need to be retrained, and the active updating of the model generally ensures more efficient modeling. The use of current data also means that the model reflects the current status of the process more accurately. In GP modeling, the method reported by Ni et al.14−16 updates the mean and variance of the data set recursively but still involves the retraining of the hyper-parameters. In this work, the covariance matrix is recursively updated with new data to provide a better performance in terms of calculation. However, the data used for the GP model may contain similar information, so the next steps are to determine redundant data and to remove them in a process termed pruning. The purpose is to keep the data set compact without sacrificing the performance of the model.

Figure 1. Concept of the active recursive update GP model.

Figure 1 illustrates the concept of the proposed method. The bold solid curve represents the predicted model, with the shaded region denoting ±2σ of the predictive distribution around the mean, whereas the dashed-line curve represents the real process. The shaded region gives a measure of the confidence of the region. Where there are data in the original training set, as indicated by stars (*), the prediction confidence is high, as indicated by a thin confidence band. On the other hand, a region with a wide confidence band indicates low confidence, so the prediction may not be accurate because of a large discrepancy between the predicted model and the real process. The model can be improved by incorporating information for regions with


low confidence. Therefore, a region with a ±2σ value larger than a threshold value is designated as an active one. The selection of the variance threshold determines the distance between the data to be admitted and the point with the highest uncertainty. Choosing a threshold that is too small will result in too many data being admitted and added computation without benefiting the performance. A suitable threshold can be chosen based on the outputs with the higher variances: with their locations as references, the neighboring variances can be used to determine the threshold. Consequently, data that fall within the active regions will be used to update the model. The data to be added are represented by circles (○). With more training examples in the specified region, the model has richer information and the confidence band becomes thinner, meaning a higher level of confidence in the model. However, not all active data make the same contribution, and some might contain similar information. Using more data than needed often results in overfitting and leads to poor generalization; eliminating data which have no function or which respond only to a few data points may not reduce the accuracy of the model, but it will help its generalization and training attributes. Moreover, to keep the data size compact, similar information is pruned, as indicated by the filled circles (●) in Figure 1. The detailed explanation of how to remove the data is given in the following sections.

3.1. Addition of Data by Recursive Updating of the Gaussian Process Model. From the definition of the GP model, if N data sets are initially used to train the model, denote a prediction resulting from additional data with the subscript N + 1. The prediction and variance can be written as

$$\hat{y}_* = \mathbf{k}_{N+1}^{\mathrm{T}}\mathbf{K}_{N+1}^{-1}\mathbf{y}_{N+1} \qquad (12)$$

$$\sigma_{N+1}^2 = k_* - \mathbf{k}_{N+1}^{\mathrm{T}}\mathbf{K}_{N+1}^{-1}\mathbf{k}_{N+1} \qquad (13)$$

where $\mathbf{K}_{N+1}$ is the updated covariance matrix of the input data ($K_{ij} = C(\mathbf{x}_i, \mathbf{x}_j)$, $i, j = 1, ..., N+1$); the vectors $\mathbf{k}_{N+1}$ and $\mathbf{y}_{N+1}$ are defined as

$$\mathbf{k}_{N+1} = [C(\mathbf{x}_*, \mathbf{x}_1)\;\; C(\mathbf{x}_*, \mathbf{x}_2) \cdots C(\mathbf{x}_*, \mathbf{x}_{N+1})]^{\mathrm{T}} \qquad (14)$$

$$\mathbf{y}_{N+1} = [y_1 \cdots y_{N+1}]^{\mathrm{T}} \qquad (15)$$

Instead of using all the available N + 1 data to recalculate the covariance function, $\mathbf{K}_{N+1}^{-1}$ is obtained using the already available $\mathbf{K}_N^{-1}$. By the Sherman−Morrison−Woodbury formula,22 the covariance matrix can be updated as

$$\mathbf{K}_{N+1}^{-1} = \begin{bmatrix} \mathbf{K}_N^{-1} & \mathbf{0} \\ \mathbf{0}^{\mathrm{T}} & 0 \end{bmatrix} + \mathbf{r}_{N+1}\mathbf{r}_{N+1}^{\mathrm{T}} z_{N+1} \qquad (16)$$

It is interesting to note that eq 16 has an intuitive interpretation: the new covariance matrix $\mathbf{K}_{N+1}^{-1}$ is equal to the old covariance matrix $\mathbf{K}_N^{-1}$ plus a new updating term based on the new data ($\mathbf{x}_{N+1}$). In eq 16,

$$\mathbf{r}_{N+1} = [\mathbf{p}_{N+1}^{\mathrm{T}}\mathbf{K}_N^{-1}\;\; -1]^{\mathrm{T}} \qquad (17)$$

$$\mathbf{p}_{N+1} = [C(\mathbf{x}_1, \mathbf{x}_{N+1}) \cdots C(\mathbf{x}_N, \mathbf{x}_{N+1})]^{\mathrm{T}} \qquad (18)$$

$$z_{N+1} = \frac{1}{p_{N+1} - \mathbf{p}_{N+1}^{\mathrm{T}}\mathbf{K}_N^{-1}\mathbf{p}_{N+1}} \qquad (19)$$

$$p_{N+1} = C(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) \qquad (20)$$

Therefore, when new data become available, $\mathbf{K}_{N+1}^{-1}$ can be updated incrementally, and the mean and variance of the predictive distribution of the output for a new input can be directly computed using eqs 12 and 13. As the Sherman−Morrison−Woodbury formula is a rank-1 update, the proposed method is similar to that of Seeger,17 which is also a rank-1 update method.
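Under the same assumptions as the earlier sketches, the rank-1 growth step of eqs 16−20 can be written as follows; grow_inverse returns $\mathbf{K}_{N+1}^{-1}$ from $\mathbf{K}_N^{-1}$ without re-inverting.

```python
import numpy as np

def grow_inverse(K_inv, X, x_new, v0, a, b):
    # Rank-1 growth of the inverse covariance matrix, eqs 16-20.
    N = len(X)
    p = np.array([sq_exp(xi, x_new, v0, a) for xi in X])  # eq 18
    p_scalar = v0 + b                         # eq 20: C(x_new, x_new) under eq 2
    z = 1.0 / (p_scalar - p @ K_inv @ p)      # eq 19
    r = np.append(K_inv @ p, -1.0)            # eq 17 (K_N^{-1} is symmetric)
    K_big = np.zeros((N + 1, N + 1))
    K_big[:N, :N] = K_inv                     # block-diagonal embedding of eq 16
    return K_big + z * np.outer(r, r)         # eq 16
```

The cost of this step is dominated by the matrix−vector products, i.e., $O(N^2)$ rather than the $O(N^3)$ of a fresh inversion.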

3.2. Removal of Data by Recursive Pruning. As the data continue to grow, some of the data set may become redundant. In this situation, the removal of data trims the matrix and keeps it at a reasonable size. A common method is to remove the oldest data from the set, but this does not take into account the importance of the data to the model. In this work, the method adopted is to remove the least critical data set from the currently available data, which is similar to the recursive least-squares SVM and adaptive kernel-based learning methods.23−26 When the data set is above a preset number, $N_{\mathrm{thr}}$, pruning is carried out. For the N input vectors, substituting each data pair into eq 12, the prediction in matrix form can be obtained,

$$\mathbf{y} = \mathbf{A}\mathbf{b} \qquad (21)$$

where

$$\mathbf{A} = \begin{bmatrix} \mathbf{k}_1^{\mathrm{T}} \\ \mathbf{k}_2^{\mathrm{T}} \\ \vdots \\ \mathbf{k}_N^{\mathrm{T}} \end{bmatrix} \qquad (22)$$

$$\mathbf{y} = [y_1 \cdots y_N]^{\mathrm{T}} \qquad (23)$$

$$\mathbf{b} = \mathbf{K}^{-1}\mathbf{y} \qquad (24)$$

Assume the nth data point is to be removed; the prediction outputs in eq 21 can be reformulated as follows

$$\begin{bmatrix} \mathbf{y}_1 \\ y_n \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{a}_{1n} & \mathbf{A}_{12} \\ \mathbf{a}_{1n}^{\mathrm{T}} & a_n & \mathbf{a}_{2n}^{\mathrm{T}} \\ \mathbf{A}_{21} & \mathbf{a}_{2n} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 \\ b_n \\ \mathbf{b}_2 \end{bmatrix} \qquad (25)$$

where $\mathbf{y}_1 = [y_1 \cdots y_{n-1}]^{\mathrm{T}}$ and $\mathbf{y}_2 = [y_{n+1} \cdots y_N]^{\mathrm{T}}$. The prediction outputs can be decomposed into two parts, the nth prediction output and the rest of the prediction outputs,

$$y_n = [\mathbf{a}_{1n}^{\mathrm{T}}\;\; a_n\;\; \mathbf{a}_{2n}^{\mathrm{T}}] \begin{bmatrix} \mathbf{b}_1 \\ b_n \\ \mathbf{b}_2 \end{bmatrix} = \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{b}}_n + a_n b_n \qquad (26)$$

$$\bar{\mathbf{y}}_n = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{a}_{1n} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{a}_{2n} & \mathbf{A}_{22} \end{bmatrix} \mathbf{b} \qquad (27)$$

where $\bar{\mathbf{a}}_n = \begin{bmatrix} \mathbf{a}_{1n} \\ \mathbf{a}_{2n} \end{bmatrix}$ contains the nth row of $\mathbf{A}$ except for the nth element $a_n$, $\bar{\mathbf{b}}_n = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{bmatrix}$ contains all the elements of $\mathbf{b}$ except for the nth element $b_n$, and $\bar{\mathbf{y}}_n = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}$ contains all the elements of $\mathbf{y}$ except for the nth element $y_n$. Let $\boldsymbol{\beta}_n$ be the parameter after the reduction of the nth row of data, given by

$$\boldsymbol{\beta}_n = \bar{\mathbf{A}}_n^{-1}\bar{\mathbf{y}}_n \qquad (28)$$

where $\bar{\mathbf{A}}_n = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$. Thus, the prediction $\hat{y}_n^r$ of the reduced data is

$$\hat{y}_n^r = \bar{\mathbf{a}}_n^{\mathrm{T}}\boldsymbol{\beta}_n = \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{y}}_n \qquad (29)$$

Substituting eq 27 into eq 29 and rearranging,

$$\hat{y}_n^r = \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1} \begin{bmatrix} \mathbf{A}_{11} & \mathbf{a}_{1n} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{a}_{2n} & \mathbf{A}_{22} \end{bmatrix} \mathbf{b} = y_n - b_n(a_n - \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n) \qquad (30)$$

Thus, the error ($e_n^r$) between the original and the reduced model is

$$e_n^r = y_n - \hat{y}_n^r = b_n(a_n - \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n) \qquad (31)$$

Denote $\mathbf{C} = \mathbf{A}^{-1}$, where

$$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{11} & \mathbf{c}_{1n} & \mathbf{C}_{12} \\ \mathbf{c}_{1n}^{\mathrm{T}} & c_n & \mathbf{c}_{2n}^{\mathrm{T}} \\ \mathbf{C}_{21} & \mathbf{c}_{2n} & \mathbf{C}_{22} \end{bmatrix}$$

and $\bar{\mathbf{c}}_n = \begin{bmatrix} \mathbf{c}_{1n} \\ \mathbf{c}_{2n} \end{bmatrix}$. Applying elementary transformations to the matrices $\mathbf{A}$ and $\mathbf{C}$, the following result can be obtained

$$\hat{\mathbf{A}}_n^{-1} = \hat{\mathbf{C}}_n \qquad (32)$$

where $\hat{\mathbf{A}}_n = \begin{bmatrix} a_n & \bar{\mathbf{a}}_n^{\mathrm{T}} \\ \bar{\mathbf{a}}_n & \bar{\mathbf{A}}_n \end{bmatrix}$ and $\hat{\mathbf{C}}_n = \begin{bmatrix} c_n & \bar{\mathbf{c}}_n^{\mathrm{T}} \\ \bar{\mathbf{c}}_n & \bar{\mathbf{C}}_n \end{bmatrix}$. Furthermore, using the block matrix inversion formula on $\hat{\mathbf{A}}_n$25 gives

$$\hat{\mathbf{A}}_n^{-1} = \begin{bmatrix} a_n & \bar{\mathbf{a}}_n^{\mathrm{T}} \\ \bar{\mathbf{a}}_n & \bar{\mathbf{A}}_n \end{bmatrix}^{-1} = \begin{bmatrix} d_n^{-1} & -d_n^{-1}\bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1} \\ -d_n^{-1}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n & \bar{\mathbf{A}}_n^{-1} + d_n^{-1}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n\bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1} \end{bmatrix} \qquad (33)$$

where $d_n = a_n - \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n$; that is, $1/c_n = a_n - \bar{\mathbf{a}}_n^{\mathrm{T}}\bar{\mathbf{A}}_n^{-1}\bar{\mathbf{a}}_n$, where $c_n$ is the first element of the matrix $\hat{\mathbf{C}}_n$. Therefore, eq 31 becomes

$$e_n^r = y_n - \hat{y}_n^r = \frac{b_n}{c_n} \qquad (34)$$

Note that $c_n$ is an element of the matrix inverse of $\mathbf{A}$; the inversion ($\mathbf{A}^{-1}$) is done only the first time and is thereafter updated recursively in a manner similar to the covariance update in eq 16, but with the elements of $\mathbf{A}$ instead. From all the candidates, the pruned data point ($n_{\mathrm{pruned}}$) is decided by the absolute error criterion: the deleted data point is the one that introduces the smallest absolute error when removed,

$$n_{\mathrm{pruned}} = \arg\min_n |e_n^r| \qquad (35)$$
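A sketch of the pruning criterion of eqs 34 and 35 under the assumptions above. Note that, with the covariance function of eq 2 evaluated on the training inputs, the matrix A of eq 22 coincides with K, so the sketch reuses the maintained K_inv for A^{-1}; the paper keeps the two matrices notationally distinct, and this identification is our reading, not a statement from the source.

```python
import numpy as np

def prune_index(K_inv, y):
    # Eq 24: b = K^{-1} y; eq 34: e_n^r = b_n / c_n, where c_n is the nth
    # diagonal element of A^{-1} (here taken equal to K^{-1}, see lead-in).
    b_vec = K_inv @ y
    errors = b_vec / np.diag(K_inv)
    return int(np.argmin(np.abs(errors)))   # eq 35: smallest |e_n^r|
```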

After removing the first redundant data point, the same procedure is repeated to find the next redundant data point, until the preset threshold size ($N_{\mathrm{thr}}$) is reached. By removing the redundant data, the size of the data matrix can be kept compact without affecting the prediction performance. The data are admitted only to the regions with low confidence in the prediction; consequently, they enrich the information of the model. On the other hand, only similar data are removed. By performing this check, overfitting issues can be avoided, because there are neither excessive nor too few data in the same region.

To sum up, Figure 2 shows the schematic of the proposed method. With available data from the process, a regression model is obtained together with the mean and variance (eqs 7 and 8). If the current model is not satisfactory when new data become available, the new data are used to update the model (eq 16). If the training data set is larger than the preset threshold size $N_{\mathrm{thr}}$, pruning of the data point with the smallest error is carried out (eq 34). The selection of the size threshold has an influence on the modeling performance. A model of small size may capture the process changes quickly but does not contain enough information to reflect the current process operating conditions sufficiently. On the other hand, while a sufficient number of samples may represent the process operating conditions appropriately, a model of large size may contain useless data. The size threshold should ideally admit enough data to appropriately describe the characteristics of the system; it can be chosen based on an initial prediction performance test, which can be done off-line first.

Figure 2. Active recursive updating and pruning of the GP model.

The computation of the direct inverse of the $N \times N$ covariance matrix ($\mathbf{K}^{-1}$) has an associated computation cost of $O(N^3)$, whereas the recursive algorithm for the new covariance matrix adopted in this paper only needs $O(N^2)$. Generally, in many industrial processes, the operating conditions change frequently over a relatively large range, and the model should be updated by introducing new information and pruning useless information in a simple way. Consequently, the recursive update of the new covariance matrix can alleviate the calculation load and ensure a compact model size, which is suitable in practice.
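Putting the pieces together, the following is a runnable schematic of the loop in Figure 2 under the same assumptions as the earlier sketches; var_thr and N_thr are the user-chosen variance and size thresholds, and shrink_inverse is the downdate implied by rearranging the block inversion formula in eq 33.

```python
import numpy as np

def shrink_inverse(M_inv, n):
    # Downdate M^{-1} when row and column n of M are deleted; the correction
    # term removes the coupling through element n (block inversion, eq 33).
    c = M_inv[:, n]
    M_new = M_inv - np.outer(c, c) / M_inv[n, n]
    return np.delete(np.delete(M_new, n, axis=0), n, axis=1)

def online_step(x_new, y_new, X, y, K_inv, theta, var_thr, N_thr):
    v0, a, b = theta
    y_hat, var = predict(x_new, X, y, K_inv, v0, a, b)
    if 2.0 * np.sqrt(max(var, 0.0)) > var_thr:            # active region: low confidence
        K_inv = grow_inverse(K_inv, X, x_new, v0, a, b)   # addition, eq 16
        X = list(X) + [x_new]
        y = np.append(y, y_new)
        while len(X) > N_thr:                             # keep the data set compact
            n = prune_index(K_inv, y)                     # pruning, eqs 34-35
            K_inv = shrink_inverse(K_inv, n)
            del X[n]
            y = np.delete(y, n)
    return y_hat, var, X, y, K_inv
```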

4. CASE STUDIES

The characteristics of the proposed method and its application are discussed in this section. Two case studies, a numerical example and an application to an industrial plant, are presented. In the numerical example, the features of the proposed method and the way to obtain a good model are presented. This is followed by a demonstration of how the proposed method can be implemented in an online identification configuration. The performance of the model prediction is assessed with the absolute relative error (ARE),

$$\mathrm{ARE}_k = \left|\frac{y_k - \hat{y}_k}{y_k}\right| \qquad (36)$$

By considering the absolute value, the main concern is the magnitude of the error instead of its sign, and the relative error has no physical units associated with it. In addition, the root-mean-square error (RMSE) as well as the relative root-mean-square error (RRMSE) over the total of K elements is used as a reference,

$$\mathrm{RMSE} = \sqrt{\frac{1}{K}\sum_{k=1}^{K}(y_k - \hat{y}_k)^2} \qquad (37)$$

$$\mathrm{RRMSE} = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\frac{y_k - \hat{y}_k}{y_k}\right)^2} \qquad (38)$$

The RMSE measures the differences between the values predicted by a model and the actual observed values. RMSE is a good measure of accuracy, but only for comparing the forecasting errors of different models for a particular variable, not between variables, as it is scale-dependent. On the other hand, RRMSE gives an indication of how good a measurement is relative to the size of the quantity being measured; relative values are ratios, have no units, and are commonly expressed as fractions or percentages.

4.1. Case I: Numerical Example. The numerical example in this study is a modified piecewise function studied by Zhang and Benveniste27 in wavelet-based network procedures. The function is defined as

$$y(x) = \begin{cases} 4.246x & -2 \leq x < 0 \\ 10e^{-0.05x-0.5}\sin[(0.03x + 0.7)x] & 0 \leq x < 10 \end{cases} \qquad (39)$$
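For reproducibility, a sketch of eq 39 and the generation of noisy training data follows; the noise level and the input locations are assumptions, as the paper states only that 15 initial data sets were simulated with added noise.

```python
import numpy as np

def piecewise(x):
    # Eq 39: linear branch on [-2, 0), nonlinear branch on [0, 10)
    return np.where(x < 0.0,
                    4.246 * x,
                    10.0 * np.exp(-0.05 * x - 0.5) * np.sin((0.03 * x + 0.7) * x))

rng = np.random.default_rng(0)
x_init = np.sort(rng.uniform(-2.0, 10.0, 15))   # 15 initial inputs (assumed locations)
y_init = piecewise(x_init) + 0.2 * rng.standard_normal(15)  # assumed noise level
```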

It consists of a linear function and a nonlinear function. A model is initially trained using 15 initial sets of data simulated with the addition of noise. The identification result is presented in Figure 3. In all the figures, the predicted value from the model is represented by a solid line, and the actual measured value is represented by a dashed line. Note that in this work the purpose is not to improve the identification capability of the GP model itself; rather, it is to improve a current model built with insufficient data by incorporating new data. Regarding the capability of the Gaussian process model in regression, extensive work has been carried out by Rasmussen,6 and Gregorcic and Lightbody28 explained the relationship between the GP model and the radial basis function neural network. The following illustrates how the model is improved by adding and pruning data. In this case study, there are 12 intervals if the input space is divided at each integer value; a threshold of 20 should cover the entire input space of interest, and thus the size threshold of 20 is chosen.

Figure 3. Model trained with initial data in case I.

It is found that some of the predictions are good while others are not. The shaded region in the figure represents the mean ±2σ. A wide band indicates a lower level of confidence, and it can be seen that the predictions for the areas with a wide band are not accurate because of lack of information. The training data distribution shows that there is a deficiency of data in these areas. In order to build a healthy model, it is imperative that new data in these regions be added to improve the prediction performance of the model. In this case study, the variance threshold is chosen on the basis of the locations of the outputs with the highest and the second highest variance, corresponding to the inputs x = 2 and x = 5. With these inputs as references, the corresponding variances at x ± 1 are used to determine the variance threshold. The values of 2σ at these locations are 1.8, 2.2, 2.5, and 2, respectively, and the lowest value, 1.8, is chosen as the threshold. On the basis of this chosen value, the ranges [1, 3] and [4, 6] are identified as the active regions, where new data will be added. These regions are indicated


by the shaded regions in Figure 4a. Data that fall within these regions will be admitted and used to update the model recursively using the proposed algorithm. Figure 4a shows the prediction results after the model is updated recursively with 10 sets of new data in these regions. Compared with Figure 3, the prediction performance of the model shown in Figure 4a is improved and the prediction variance has decreased, indicating an improved level of confidence in the model. With the data limit set to 20 sets, 5 sets of data are removed after the addition. These five sets are chosen because they affect the prediction performance the least. The removed data are indicated by filled black dots in Figure 4b. Note that the prediction is similar after the removal of data, and the prediction variance remains consistent with that before the removal.

Figure 4. First update of the model in case I: (a) addition of data, (b) pruning of data.

Further improvement of the prediction performance can be achieved by successive addition and pruning of data. The result after two stages is shown in Figure 5. It can be seen that the prediction performance has improved with a reduction in the prediction variance, that is, an increase in the level of confidence of the model. Compared to the initial model, the prediction performance and the model capability have been markedly improved.

Figure 5. Model after the second update in case I: (a) addition of data, (b) pruning of data.

Figure 6 shows the ARE at different updates. The value after the initial identification is represented by the (*) symbol, whereas the symbols (Δ) and (○) represent the values after the first and the second update, respectively. It is found that the ARE is lowered with each successive update, most pronouncedly in the regions where new data have been added. In addition, Table 1 shows the RMSE and RRMSE for the initial model and the subsequent updates. Both values gradually decrease, indicating the improvement of the prediction performance of the recursively updated model.

Figure 6. Absolute relative error for each update: (*) initial, (Δ) after first update, (○) after second update.

Table 1. RMSE and RRMSE of Different Updates in Case I

                 RMSE    RRMSE (%)
initial          0.74    1.33
first update     0.66    0.99
second update    0.59    0.53

4.2. Case II: Application to an Industrial Plant. The data were taken from an industrial plant in Taiwan. The plant produces high temperature, pressurized steam for power generation, and the exhaust (low pressure steam) is later used for downstream processes. In this study, the operating data from the boiler generating the steam are used. The boiler is a single-drum water tube boiler with a maximum steam capacity of 350 t/h. Figure 7 shows the schematic of the boiler. The feedwater enters the drum and then circulates in tubes heated by the combustion chamber. The hot water rises into the steam drum, and the saturated vapor

is dried and superheated. The superheated steam can then be used to drive the steam turbine to generate electricity. The fuel is heavy oil and pulverized coal. The plant uses the Emerson DeltaV DCS system, and the data for the boiler can be obtained from this system.

Figure 7. Schematic of the boiler.

The operating data are recorded every minute, starting with data at the low load, and two days of data have been used in this study. The input variables are the air flow (kNm³/h), the fuel flow (t/h), and the feedwater flow (t/h), and the output variable is the superheated steam flow (t/h). A total of 2880 data sets were recorded for the 2-day period. The data are down-sampled by a factor of 8 (one out of every 8 samples is retained), which gives a total of 360 data sets. Down-sampling may strengthen the identification of process changes.15 Naturally, the computation is expected to increase with more data (a smaller down-sampling factor) and decrease with less data (a larger down-sampling factor); the accuracy will depend on the data selection, and the selection of subsets has been reported in the literature.17,18 The data are further separated into two parts, namely, a training set (180 data sets generated from one day of operation) and a testing set (the remaining 180 data sets, which consist of the operation of the next day). The training sets are used to build the initial model; the testing sets represent the online data and are delivered as a stream of samples. In fact, the plant is operated at two distinct operation modes according to the steam load: the high load from 8 in the morning until 8 in the evening, and the low load the rest of the time. Figure 8 shows the data distribution of the input. Two clusters are found, corresponding to the high and the low operation modes. The data from day 1 and day 2 are suitable for demonstrating the updated recursive GP model because the data from the two days are not similar. It is noted that, for confidentiality reasons, the real data values have been modified.

Figure 8. Data distribution for 2 daily operations of the boiler, where the circle (○) and plus (+) represent data from day 1 and day 2, respectively.
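A sketch of this data preparation follows, with hypothetical array names (u_raw and y_raw stand in for the plant records, which are not public); the regressor x(k) = [y(k−1) u(k−1)] anticipates the ARX structure described in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)
u_raw = rng.random((2880, 3))   # placeholders for air flow, fuel, feedwater flow
y_raw = rng.random(2880)        # placeholder for superheated steam flow

u = u_raw[::8]                  # down-sample by a factor of 8: 2880 -> 360 sets
y = y_raw[::8]

# ARX regressors x(k) = [y(k-1) u(k-1)] with targets y(k); the one-step
# shift leaves 359 pairs, split at sample 180 (day 1 vs day 2).
X = [np.concatenate(([y[k - 1]], u[k - 1])) for k in range(1, len(y))]
y_t = y[1:]

X_train, y_train = X[:180], y_t[:180]   # day 1: builds the initial model
X_test, y_test = X[180:], y_t[180:]     # day 2: delivered as an online stream
```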

The process is modeled with the input vector $\mathbf{x} = [y(k-1)\;\; u(k-1)]$, which is based on the past outputs and inputs to form the ARX structure. In the first part of this case study, the data consist of operation over one day; therefore, most information, except slight daily variation, is included. The variance threshold value is chosen to illustrate the capability of the proposed method: because the model is fairly accurate even without update, a small threshold value of 2 is set. On the other hand, the size threshold of 60 is chosen based on the period of the changing load. Figures 9 and 10 show the plot of ARE and the results predicted by the models with and without update, respectively. It can be seen that the relative error is smaller in the case with the updated model. A larger error can be observed in the nonupdated model and is more evident when the process is making the transition from the low load to the high load.

Figure 9. ARE for the predictions in case II when the training data contain the high and the low loads data: (a) model with update; (b) model without update.

Figure 10. Predicted results in case II when the training data contain the high and the low loads data: (a) model with update; (b) model without update.

Because of the availability of information on the current state of the process, the prediction performance of the updated model is better than that of the nonupdated one. This implies that the updated model can better cope with the transitional stage of the operation. Figure 11 shows the enlarged portion from sample 200 to sample 300 for the models with and without update.

Figure 11. Prediction results for sampling points (200−300) in case II when the training data contain the high and the low loads data: (a) model with update; (b) model without update.

In addition, Table 2 shows the RMSE and RRMSE values for the predictions with and without update. The updated model provides a better performance, as indicated by its RMSE and RRMSE values of 2.60 and 0.96%, compared to 3.21 and 1.18% for the model without update. In the updated model, new data are continuously assimilated and the model is updated recursively, so the model carries richer operation information than the nonupdated model; consequently, the prediction performance of the model is improved.

Table 2. RMSE and RRMSE of the Predictions with and without Update in Case II When the Training Data Contain the High and the Low Loads

            RMSE    RRMSE (%)
update      2.60    0.96
no update   3.21    1.18

To further illustrate the capability of the updated model in dealing with different operating modes (high and low loads), the case is tested when the data availability is confined to one mode. This is achieved by training a model with data from the low load only (the first 60 data sets for training and the remaining 300 sets for testing) and then using it for the model prediction. Because this model is accurate for only part of the operation, with the same threshold values, data are added and removed successively at each time point once the second operation mode is encountered. Figure 12 shows the ARE results for both models with and without update. The ARE remains at the same level throughout the whole operation for the updated model. In contrast, the relative error for the nonupdated case exhibits two peaks of errors corresponding to the high load operation region. Although the nonupdated model performance is not good for the whole operation, it performs well in the low load region.

Figure 12. ARE for the predictions in case II when the training data contain the low load data only: (a) model with update; (b) model without update.

Figure 13 shows the predicted results for both models. For the nonupdated model, the predicted value for the low load region is relatively good, but for the high load operation, the prediction performance deteriorates greatly. On the other hand, the prediction performance of the updated model is good in both the low load and the high load operation.

Figure 13. Predicted results in case II when the training data contain the low load data only: (a) model with update; (b) model without update.

Table 3 shows the RMSE and RRMSE values: 3.34 and 1.23% for the updated model and 13.90 and 4.95% for the model without update. In this case, the performance of the updating scheme is again better, and the difference in performance is even more prominent. Because the initial trained model contains only low load information, the prediction performance for that particular region is good; in the high load region, however, the prediction performance is poor. In the case of the updated model, data are continuously taken in to reflect the current operating condition, so the model is able to better cope with the operation shifting to the high load mode, which leads to the superior performance. These results confirm the capability and the feasibility of the updated model in online operation.

Table 3. RMSE and RRMSE of the Predictions with and without Update in Case II When the Training Data Contain Only the Low Loads

            RMSE    RRMSE (%)
update      3.34    1.23
no update   13.90   4.95

5. CONCLUSION

From a full identification perspective, the GP approach to regression is simple and elegant, and it can model nonlinear problems in a probabilistic framework. Its disadvantage is the computational complexity: estimating the mean requires an inversion of the covariance matrix, which becomes problematic for industrial applications with a large data size.


On the basis of this insight, a nonlinear identification of processes based on a recursive GP model is proposed in this work. The proposed algorithm profits from the fact that it is not necessary to retrain the model when the model is updated. The two main features of the proposed algorithm are the following: 1. A method is proposed to recursively update the covariance matrix of the GP model. In addition, with the GP model, an indication of the confidence in the predicted values is obtained. 2. The update is augmented by a growing and a pruning process. The active updating of the model ensures a more efficient identification, and the use of current data means that the model is able to reflect the current status of the process more accurately. The pruning process removes the data that are least critical to the model, with the aim of keeping the covariance matrix at a reasonable size.

The application of the proposed method to the numerical example and the industrial cogeneration plant confirmed its prediction performance. In the numerical example, it is shown that a healthy model can be obtained by using the selective update based on the prediction variance provided by the GP model. The industrial application demonstrates the feasibility of the method in an online application. On the basis of this compact GP model, further work will focus on the control of nonlinear systems and on implementation efficiency issues. It is also noted that the hyper-parameters may not, in fact, vary greatly with the addition of data, but optimality may be lost because no retraining occurs; future work will look into heuristics for retraining when the loss of marginal-likelihood optimality becomes too great, in a manner similar to the work of Petelin et al.21 Although the heuristic rules for selecting the variance and the size thresholds are found to be satisfactory for the case studies in this work, a more precise method for determining these values will also be devised in our future work.



AUTHOR INFORMATION

Corresponding Author

*Tel.: +886-3-2654107. Fax: +886-3-2654199. E-mail address: [email protected].

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors wish to express their sincere gratitude to the National Science Council, R.O.C., for the financial support.

REFERENCES

(1) Ljung, L.; Hjalmarsson, H.; Ohlsson, H. Four encounters with system identification. Eur. J. Control 2011, 17 (5−6), 449−471.
(2) Verdult, V. Nonlinear system identification: A state-space approach. Ph.D. Thesis, University of Twente, 2002.
(3) Himmelblau, D. M. Accounts of experiences in the application of artificial neural networks in chemical engineering. Ind. Eng. Chem. Res. 2008, 47 (16), 5782−5796.
(4) Boyle, P. Gaussian processes for regression and optimization. Ph.D. Thesis, Victoria University of Wellington, 2007.
(5) O'Hagan, A. Curve fitting and optimal design for prediction. J. R. Stat. Soc. B 1978, 40 (1), 1−42.
(6) Rasmussen, C. E. Evaluation of Gaussian processes and other methods for non-linear regression. Ph.D. Thesis, University of Toronto, 1996.
(7) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; MIT Press, 2006.
(8) Azman, K.; Kocijan, J. Application of Gaussian processes for black-box modelling of biosystems. ISA Trans. 2007, 46 (4), 443−457.
(9) Lundgren, A.; Sjoberg, J. Gaussian processes framework for validation of linear and nonlinear models. In 13th IFAC Symposium on System Identification (SYSID 2003), Rotterdam, The Netherlands, Aug 27−29, 2003; pp 67−72.
(10) Sciascio, F. D.; Amicarelli, A. N. Biomass estimation in batch biotechnological processes by Bayesian Gaussian process regression. Comput. Chem. Eng. 2008, 32, 3264−3273.
(11) Chen, T.; Ren, J. Bagging for Gaussian process regression. Neurocomputing 2009, 72, 1605−1610.
(12) Yan, W.; Hu, S.; Yang, Y.; Gao, F.; Chen, T. Bayesian migration of Gaussian process regression for rapid process modeling and optimization. Chem. Eng. J. 2011, 166, 1095−1103.
(13) Kadlec, P.; Grbic, R.; Gabrys, B. Review of adaptation mechanisms for data-driven soft sensors. Comput. Chem. Eng. 2011, 35, 1−24.
(14) Ni, W.; Tan, S.; Ng, W. Recursive GPR for nonlinear dynamic process modeling. Chem. Eng. J. 2011, 173, 636−643.
(15) Ni, W.; Tan, S.; Ng, W.; Brown, S. D. Moving-window GPR for nonlinear dynamic system modeling with dual updating and dual preprocessing. Ind. Eng. Chem. Res. 2012, 51 (18), 6416−6428.
(16) Ni, W.; Wang, K.; Chen, T.; Tan, S.; Ng, W. GPR model with signal preprocessing and bias update for dynamic processes modeling. Control Eng. Pract. 2012, 20 (12), 1281−1292.
(17) Seeger, M.; Williams, C. K. I.; Lawrence, N. D. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, 2003.
(18) Csato, L.; Opper, M. Sparse on-line Gaussian processes. Neural Comput. 2002, 14 (3), 641−668.
(19) Keerthi, S. S.; Chu, W. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, 2005; Vol. 18.
(20) Ranganathan, A.; Yang, M.-H.; Ho, J. Online sparse Gaussian process regression and its applications. IEEE Trans. Image Process. 2011, 20 (2), 391−404.
(21) Petelin, D.; Grancharova, A.; Kocijan, J. Evolving Gaussian process models for prediction of ozone concentration in the air. Simul. Modell. Pract. Theory 2013, 33, 68−80.
(22) Golub, G. H.; Van Loan, C. F. Matrix Computations, 3rd ed.; The Johns Hopkins University Press: Baltimore, 1996.
(23) Wang, X. D.; Laing, W. F.; Cai, X. S.; Lv, G. Y.; Zhang, C. J.; Zhang, H. R. Application of adaptive least square support vector machines in nonlinear system identification. In Proceedings of the Sixth World Congress on Intelligent Control and Automation, Dalian, China, June 21−23, 2006; pp 1897−1900.
(24) Tang, H. S.; Xue, S. T.; Chen, R.; Sato, T. Online weighted LS-SVM for hysteretic structural system identification. Eng. Struct. 2006, 28 (12), 1728−1735.
(25) Wang, H. Q.; Li, P.; Gao, F. R.; Song, Z. H.; Ding, S. X. Kernel classifier with adaptive structure and fixed memory for process diagnosis. AIChE J. 2006, 52 (10), 3515−3531.
(26) Liu, Y.; Wang, H.; Yu, J.; Li, P. Selective recursive kernel learning for online identification of nonlinear systems with NARX form. J. Process Control 2010, 20 (2), 181−194.
(27) Zhang, Q.; Benveniste, A. Wavelet networks. IEEE Trans. Neural Networks 1992, 3 (6), 889−898.
(28) Gregorcic, G.; Lightbody, G. Gaussian process approach for modeling of nonlinear systems. Eng. Appl. Artif. Intell. 2009, 22, 522−533.
