Correntropy Kernel Learning for Nonlinear System Identification with Outliers

Yi Liu† and Junghui Chen*,‡
†Engineering Research Center of Process Equipment and Remanufacturing, Ministry of Education, Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
‡R&D Center for Membrane Technology, Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li, Taiwan, 320, Republic of China

ABSTRACT: One significant challenge in nonlinear system identification developed for industrial processes is that the modeling samples often contain outliers and noise. In this work, a novel general identification method, Correntropy Kernel Learning (CKL), is proposed for the identification of nonlinear systems with outliers and noise. Unlike the traditional mean squared error criterion adopted by almost all the existing identification methods, correntropy is introduced into the field of nonlinear system identification. A new correntropy-based index is proposed to evaluate the performance of identification models for nonlinear systems with outliers and noise. The CKL identification method can reduce the effects of outliers by the use of a robust nonlinear estimator that maximizes correntropy. Without resorting to unnecessary efforts, the outlier samples can be simultaneously detected once the CKL identification model is obtained. Moreover, an efficient two-level training procedure is proposed to implement the CKL method in a more practical manner. The superiority of the proposed CKL method is first demonstrated through a benchmark example in different situations. It is also compared with other KL methods for identification of an industrial process in Taiwan. The benefit of its more accurate and reliable performance indicates that CKL is promising in practice for the identification of nonlinear systems with outliers.

Special Issue: David Himmelblau and Gary Powers Memorial
Received: June 4, 2013. Revised: October 28, 2013. Accepted: November 19, 2013.

1. INTRODUCTION

Nonlinear system identification is often used to estimate models of nonlinear systems based on the observed input−output data, but it is still a difficult task in practice. In the past two decades, considerable interest in both the theory and applications, such as artificial neural networks (ANNs), wavelet networks, support vector machines (SVMs), fuzzy systems, and other data/rule-based empirical methods, has been attracted among researchers.1−10 Among them, SVM, least-squares SVM (LS-SVM), and a number of kernel learning (KL) modeling methods11−23 have been increasingly reported for nonlinear system identification, including linear autoregressive models with exogenous inputs (ARX),14 state-dependent ARX models,15 nonlinear ARX (NARX) models,16 nonlinear autoregressive and moving average models,17 subspace models,18 Hammerstein systems,19,20 Wiener systems,21 and other applications.22,23 Among these methods, the structure determination of SVM can be implemented in a straightforward way. Furthermore, SVM and other KL methods can obtain relatively good modeling performance with a limited training data set. Therefore, the identification results show that SVM and related KL methods are attractive alternatives, especially when the training data are insufficient.11−23

In system identification development for industrial processes, one significant challenge is that the modeling samples often contain different kinds of outliers, caused by process disturbances, instrument degradation, transmission problems, and potential human errors.24−34 Outliers are observations which appear to deviate markedly from the typical ranges of other observations.25 The presence of outliers in the variables affects the quality and reliability of the data, resulting in erroneous interpretations concerning the output variables of interest.29 Without awareness, learning samples with outliers may lead to biased parameter estimation and the overfitting problem because the identification model is corrupted by fitting those unwanted data. Even though the modeling method itself is well selected, the obtained identification model may suffer a loss of generalization performance in the test phase because of data quality issues.34 Therefore, it becomes very important to remove the effect of outliers.

There are many outlier detection methods shown to be able to detect obvious outliers.24 As pointed out by Kadlec et al.,24 this issue is often solved in a rather ad hoc manner, leading to unnecessarily high costs of process model identification.33 Furthermore, these methods cannot detect all the inconspicuous outliers, as some are masked by their adjacent outliers. In practice, it is very likely that the refined data set obtained by an outlier detection method still contains some outliers.24−29 Recently, Bayesian approaches have been proposed to obtain robust identification models in the presence of outliers.32−34 Additionally, outlier models have been proposed to deal with different types of outliers.33 However, these interesting methods seem somewhat complex to implement. From a practical viewpoint, it should be more attractive to develop simpler and more general methods based
on existing nonlinear identification models without resorting to unnecessary efforts.

Although traditional SVM-based identification methods have good nonlinear modeling abilities, they are not robust to outliers. Recent studies have shown that the performance can be improved by robust SVM and weighted LS-SVM methods when the modeling set contains uncertainty.35−38 However, most of these methods are heuristic, relying on user-defined parameters, and they might be difficult to apply to general nonlinear system identification problems with outliers. To this end, this paper aims to develop a novel KL method for robust identification of nonlinear systems with outlier samples. Unlike the traditional mean squared error (MSE) criterion adopted by almost all existing identification methods,1−3 correntropy is used in nonlinear system identification. The concept of correntropy, recently proposed by Liu et al.,39 has attracted increasing attention, particularly in the area of signal processing.39−45 As a novel statistical measure, correntropy can deal with non-Gaussian noise and impulsive noise.39−45 However, to the best of our knowledge, little work has been reported on the application of correntropy in the field of process systems engineering,46 especially for the identification of nonlinear systems. Therefore, a correntropy kernel learning (CKL) method is proposed in this work. The CKL method can reduce the effects of outliers by the use of a robust nonlinear estimator that maximizes correntropy (MC). Moreover, the outlier samples can be simultaneously detected once the identification model is obtained.

The remainder of this paper is structured as follows. Correntropy and the MC criterion are introduced in section 2. The CKL identification method for nonlinear systems is proposed in section 3. Additionally, an efficient two-level training procedure with detailed algorithmic implementation is developed in this section. The proposed method is first illustrated by a benchmark problem in section 4. Additionally, it is further evaluated in an industrial process in Taiwan. Comparison studies with other methods are also investigated. Finally, concluding remarks are made in section 5.
2. CORRENTROPY AND MC CRITERION

2.1. Correntropy as a Novel Similarity. Correntropy (or cross correntropy) is defined as a novel measure of similarity between two vectors of random variables. The concept of correntropy is a generalized similarity measure between two arbitrary random variables W and Y with the same dimensions, defined by39

V(W, Y) = E[\kappa(W, Y)] = \int \kappa(w, y) \, dF_{WY}(w, y)    (1)

where V is correntropy, E[·] denotes the mathematical expectation, F_{WY}(w, y) denotes the joint distribution function of (W, Y), and \kappa(\cdot,\cdot) is a shift-invariant Mercer kernel.39 The most popular kernel used in correntropy is the Gaussian kernel with kernel width \sigma > 0, defined as

\kappa_\sigma(w, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\|w - y\|^2}{2\sigma^2}\right)    (2)

When the joint distribution of W and Y is unknown and only a finite number N of samples \{(w_i, y_i)\}_{i=1}^{N} are given, the sample correntropy estimator \hat{V}_N(W, Y) can be defined and calculated as follows:39

\hat{V}_N(W, Y) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(w_i, y_i) = \frac{1}{\sigma\sqrt{2\pi}} \, \frac{1}{N} \sum_{i=1}^{N} \exp\!\left[-\frac{(w_i - y_i)^2}{2\sigma^2}\right]    (3)

Intuitively, correntropy is closely related to the similarity between W and Y: if W is similar to Y, the difference between W and Y yields a large value of correntropy. Furthermore, this characteristic can be applied to the model estimation issue. That is, correntropy can be utilized as a goodness-of-fit indicator to describe how well the prediction of a model matches the actual data.39
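To make the estimator concrete, the following is a minimal NumPy sketch of the sample correntropy in eq 3 (the function name and array conventions are illustrative assumptions, not from the original):

```python
import numpy as np

def correntropy(w, y, sigma):
    """Sample estimator of correntropy (eq 3) with the Gaussian kernel of eq 2.

    w, y  : 1-D arrays of the same length (e.g., model predictions and targets)
    sigma : kernel width sigma > 0
    """
    e = np.asarray(w) - np.asarray(y)
    # Kernel mean, including the 1/(sigma*sqrt(2*pi)) normalization of eq 2.
    return np.mean(np.exp(-e ** 2 / (2.0 * sigma ** 2))) / (sigma * np.sqrt(2.0 * np.pi))
```

Note that the CE index of eq 5 below omits the kernel's normalizing factor, so its value lies in [0, 1].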
2.2. MC Criterion for Model Estimation. The development of a good identification model depends on suitable selection of the related model parameters.1−3 From the practical viewpoint, a good model should approximate the system and avoid the overfitting problem. Empirical studies have shown that k-fold (k = 5 or k = 10) cross-validation (CV) is a simple and efficient criterion for model selection.3 Generally, the model selection issue seeks to optimize a CV estimate of an appropriate statistic measuring generalization ability on unseen data. In the past, the MSE or root MSE (RMSE) indices were often used, defined as

\mathrm{MSE}_{S_{\mathrm{val}}}(e_i) = \frac{1}{N_{\mathrm{val}}} \sum_{i=1}^{N_{\mathrm{val}}} e_i^2, \qquad \mathrm{RMSE}_{S_{\mathrm{val}}} = \sqrt{\mathrm{MSE}_{S_{\mathrm{val}}}}    (4)

where e_i = f(x_i; \theta) − y_i, i = 1, ..., N_val. A k-fold CV procedure partitions the available data set into k disjoint subsets. Then, k models are trained, each on a different combination of k − 1 of the k subsets (S_trn), and the statistic is evaluated over the remaining partition (S_val). N_val is the number of samples in the validation data set (S_val).

MSE is a quadratic function in the joint space with a valley along the estimated line. The quadratic increase for values away from the estimated line has the net effect of amplifying the contribution of samples far away from the mean value of the error distribution, because MSE is a similarity metric in the joint space quantifying the difference between f(x_i; \theta) and y_i.39 As a result, Gaussian distributed residuals provide optimality for the MSE procedure,39,45 but not for other data distributions, particularly when the data distribution has outliers or is nonsymmetric. In this work, to overcome this disadvantage of MSE, a correntropy-based error (CE) criterion is proposed to evaluate the performance of model selection in developing nonlinear identification models. It is defined as follows:

\mathrm{CE}_{S_{\mathrm{val}}}(e_i) = \frac{1}{N_{\mathrm{val}}} \sum_{i=1}^{N_{\mathrm{val}}} \kappa_\sigma(e_i) = \frac{1}{N_{\mathrm{val}}} \sum_{i=1}^{N_{\mathrm{val}}} \exp\!\left(-\frac{e_i^2}{2\sigma^2}\right)    (5)

Now the concept of correntropy can be extended to the model estimation issue. The variable W can be considered as a mathematical expression of the unknown function f(X; \theta) with an input set X = \{x_i \in R^m\}_{i=1,...,N} and the model parameters \theta = [\theta_1, ..., \theta_M]^T, which approximate the dependence of an output set Y = \{y_i \in R\}_{i=1,...,N}. As a new measure, correntropy can be used to describe how well f(X; \theta) fits the data set Y. Consequently, maximizing the correntropy of the difference between f(X; \theta) and Y is called the MC criterion for model estimation.39 That is, for
a modeling data set S = {X, Y}, the MC criterion can be formulated as39,46

\hat{\theta}_{\mathrm{maxcorrentropy}} = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(f(x_i; \theta), y_i) = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(e_i)    (6)

where the difference is the error, i.e., e_i = f(x_i; \theta) − y_i, i = 1, ..., N, produced by the system during supervised learning. In the case of adaptive systems, \theta = [\theta_1, ..., \theta_M]^T is a set of M adjustable model parameters. Note that the following properties always hold in eq 6:

\lim_{e_i \to 0} \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(e_i) = 1; \qquad \lim_{|e_i| \to \infty} \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(e_i) = 0; \qquad 0 \le \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(e_i) \le 1    (7)

This means that a larger value of correntropy leads to a smaller fitting error of the model, while a smaller value of correntropy results in a larger fitting error; the value of correntropy lies in the range [0, 1]. Consequently, for model identification, the general estimation in eq 6 is equivalent to the following optimization problem:

\hat{\theta}_{\mathrm{maxcorrentropy}} = \arg\min_{\theta} \frac{1}{\sigma\sqrt{2\pi}} \sum_{i=1}^{N} \left[1 - \exp\!\left(-\frac{e_i^2}{2\sigma^2}\right)\right]    (8)

Then, eq 8 is differentiated with respect to \theta = [\theta_1, ..., \theta_M]^T. Setting the derivatives to zero yields a system of M equations:46

\sum_{i=1}^{N} e_i \, \rho(e_i) \, \frac{\partial e_i}{\partial \theta_j} = 0, \quad j = 1, \ldots, M    (9)

where \rho(e_i) = \exp(-e_i^2/2\sigma^2)/(\sigma^3 \sqrt{2\pi}), i = 1, ..., N, can be regarded as the weighting terms.39,46 The kernel width \sigma plays an important role in the smoothing process. When \sigma \to 0^+, the Gaussian kernel approaches the Dirac delta function; in this case, the MC estimation is identical to the maximum a posteriori (MAP) estimation.41 When \sigma \to \infty, the MC estimation is equivalent to the minimum MSE estimation.41 Consequently, as recently proved by Chen and Principe, the MC estimation is essentially a smoothed MAP estimation, including the MAP and the MSE estimations as extreme cases.41 Some approaches39,45−47 can be utilized to determine the kernel width \sigma for \rho(e_i), e.g., the common Silverman's rule.47 Here, the kernel width can be simply computed as46

\sigma = \frac{\max_i |e_i|}{2\sqrt{2}}, \quad i = 1, \ldots, N    (10)

Assume that any error value equal to or greater than the product of 2\sqrt{2} and the kernel width \sigma contributes very little to the value of correntropy. Therefore, the general estimation in eq 6 is also equivalent to a weighted least-squares problem:39

\hat{\theta}_{\mathrm{maxcorrentropy}} = \arg\min_{\theta} \sum_{i=1}^{N} \rho(e_i) \, e_i^2    (11)

This problem was first proposed for signal processing by Liu et al.39 The weighting terms \rho(e_i) = \exp(-e_i^2/2\sigma^2)/(\sigma^3 \sqrt{2\pi}), i = 1, ..., N, mean that large errors receive larger attenuation, so the estimation is resistant to outliers.40,45 Recently, Munoz and Chen46 applied the MC-based wavelet modeling method to fitting batch data according to time; that is, e_i = f(t_i; \theta) − y_i, i = 1, ..., N, where t_i denotes the ith time instance.

Compared with traditional criteria adopted for data-driven process modeling, such as the well-known minimum MSE, the MC criterion has several advantages: (1) it is always bounded for any distribution; (2) it contains all even-order moments, making it useful for nonlinear and non-Gaussian signal processing; (3) it is a local similarity measure, so it is robust to outlier samples.39−41 Additionally, correntropy has a close relationship with redescending M-estimators.42 Because of these advantages, the concept of correntropy and its maximum criterion have recently been applied to signal processing, spectral characterization, and audio/speech processing.39−45 However, little work on correntropy has been applied to the issue of nonlinear system identification.

By comparing MSE with correntropy, it is found that the two similarity metrics are not essentially rooted in the same soil. MSE is global, whereas correntropy is local. The global property implies that all samples in the joint space contribute equally to the value of the similarity measure.39 The MSE metric leaves the possibility that in a small region where the data have outlier samples, the estimated and the actual functions can be considerably different. This is not the case for the correntropy metric, which guarantees the minimal proximity of the estimated and the actual functions, because correntropy is able to easily distinguish the similarity of the collected data. Such similarity is important in our case when the estimated function is used to predict the behavior of processes with outlier data, and the accuracy of the prediction has to be assessed on a pointwise basis. The correntropy metric is a robust cost function for outlier samples because it does not amplify the effect of outliers.39
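As a concrete illustration, the following is a minimal NumPy sketch of the weighting terms \rho(e_i) with the kernel-width rule of eq 10, together with one pass of the resulting weighted least-squares problem (eq 11) for a simple linear-in-parameters model. The linear model and function names are illustrative assumptions; the method itself is applied to kernel models in section 3.

```python
import numpy as np

def mc_weights(e):
    """Correntropy weights rho(e_i) (from eq 9), with kernel width sigma
    set by the simple rule of eq 10."""
    e = np.asarray(e, dtype=float)
    sigma = np.max(np.abs(e)) / (2.0 * np.sqrt(2.0))               # eq 10
    rho = np.exp(-e ** 2 / (2.0 * sigma ** 2)) / (sigma ** 3 * np.sqrt(2.0 * np.pi))
    return rho, sigma

def irls_step(X, y, theta):
    """One iteratively reweighted least-squares pass for eq 11, assuming an
    illustrative linear-in-parameters model y ~ X @ theta."""
    rho, _ = mc_weights(X @ theta - y)
    W = np.diag(rho)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Because \rho(e_i) depends on the current residuals, eq 11 is naturally solved by alternating the weight update and the weighted fit, which is exactly the structure of the two-level procedure in section 3.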
3. CKL METHOD FOR NONLINEAR SYSTEM IDENTIFICATION

The main objective of this work is to develop a simple and general KL framework for robust identification of nonlinear processes. The MC criterion is applied to the issue of nonlinear system identification, especially in the presence of outliers. The central idea of the proposed CKL method is to integrate the MC criterion and KL into a unified framework. Without resorting to unnecessary efforts, the outliers can be simultaneously detected once the CKL identification model is obtained.

3.1. CKL Framework for Nonlinear System Identification. The nonlinear ARX (NARX) form, which can generally represent a wide class of discrete-time nonlinear systems,1−3 is investigated in this work. For simplicity, consider single-input−single-output (SISO) nonlinear systems using the NARX form governed by the following relationship:1−3
y_i = f(y_{i-1}, \ldots, y_{i-n_y}, u_{i-1}, \ldots, u_{i-n_u}) + e_i = f(x_i; \theta) + e_i, \quad i = 1, \ldots, N    (12)

where f(·) is the wanted nonlinear model; i is the time instance; and y_i, u_i, and e_i are the system output, the system input, and the noise at instance i, respectively. n_y and n_u are the corresponding lags of the output and the input. x_i is a general input vector which accounts for the serial correlations by augmenting the previous observations of the system output and the system input. Like eq 12, a general form of the kernelized nonlinear model for process modeling can be formulated as16

y_i = f(x_i; \theta) + e_i = f(x_i; w, b) + e_i = w^T \phi(x_i) + b + e_i, \quad i = 1, \ldots, N    (13)

where \theta = [w^T, b]^T; that is, w and b are the model parameter vector and the bias term, respectively. When the MC criterion39 and the KL framework with regularization11−13 are applied to eq 13, the proposed method seeks the nonlinear identification model by solving the following optimization problem:

\min J(w, b, e) = \frac{\gamma}{2} \sum_{i=1}^{N} \rho(e_i) e_i^2 + \frac{1}{2} \|w\|^2, \quad \text{s.t.} \; y_i - w^T \phi(x_i) - b - e_i = 0, \quad i = 1, \ldots, N    (14)

where the user-defined regularization parameter \gamma (\gamma > 0) determines the trade-off between the model's complexity and approximation accuracy. Here, the same regularization term \|w\|^2 as in LS-SVM13 is adopted so that the proposed method can be further compared with the LS-SVM related methods. The above problem cannot be solved directly because the weighting terms \rho(e_i) = \exp(-e_i^2/2\sigma^2)/(\sigma^3 \sqrt{2\pi}) depend on the model coefficients \theta = [w^T, b]^T. Here, a two-level iterative procedure is suggested as follows. In the first level, the weighting terms \rho(e_i) are fixed, so that a weighted KL problem is formulated. In the second level, the weighting terms \rho(e_i) are updated using the obtained model coefficients \theta = [w^T, b]^T. A detailed training algorithm of the proposed CKL method is described as follows.

Level 1 of CKL: Initialization and Update of the CKL Model. First, set the weighting terms \rho(e_i) to a fixed value. Assume that the initial value of \rho(e_i) is equal to one, i.e., \rho(e_i) = 1, i = 1, ..., N, which means all the training samples have the same weight toward the overall squared error. To solve the optimization problem, the Lagrangian can be constructed:

L = \frac{1}{2}\left(\|w\|^2 + \gamma \sum_{i=1}^{N} \rho(e_i) e_i^2\right) + \sum_{i=1}^{N} \alpha_i \left[y_i - w^T \phi(x_i) - b - e_i\right]    (15)

where \alpha = [\alpha_1, ..., \alpha_N]^T represents the Lagrange multipliers. The optimality conditions are as follows:

\partial L/\partial w = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i \phi(x_i)
\partial L/\partial e_i = 0 \;\rightarrow\; \alpha_i = \gamma \rho(e_i) e_i, \quad i = 1, \ldots, N
\partial L/\partial \alpha_i = 0 \;\rightarrow\; y_i - w^T \phi(x_i) - b - e_i = 0, \quad i = 1, \ldots, N
\partial L/\partial b = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i = 0    (16)

After elimination of the variables w and e_i, the following solution can be obtained:

\begin{bmatrix} K + \Omega & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}    (17)

where y = [y_1, ..., y_N]^T; \mathbf{1} \in R^{N\times 1} is a vector of ones; K \in R^{N\times N} is a kernel matrix whose elements are K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle, \forall i, j = 1, ..., N, using the kernel trick;11−13 and \Omega is a diagonal matrix whose diagonal elements are \Omega_i = 1/(\gamma \rho(e_i)), i = 1, ..., N. Notice that K + \Omega is symmetric positive-definite, so it is invertible. For simplicity, define H = K + \Omega; its inverse can then be computed as

P = H^{-1} = (K + \Omega)^{-1}    (18)

As can be seen in eq 16, the solution of the model coefficients \theta = [w^T, b]^T can be transformed to \theta = [\alpha^T, b]^T, which can be expressed as

b = \frac{\mathbf{1}^T P y}{\mathbf{1}^T P \mathbf{1}}, \qquad \alpha = P\left[y - \frac{\mathbf{1}\mathbf{1}^T P y}{\mathbf{1}^T P \mathbf{1}}\right]    (19)

Level 2 of CKL: Iterative Weighting. After the parameters are obtained in level 1, the prediction errors can be obtained:

e_i = y_i - \hat{y}_i = y_i - \langle w, \phi(x_i) \rangle - b = y_i - \sum_{j=1}^{N} \alpha_j \langle \phi(x_j), \phi(x_i) \rangle - b = y_i - \sum_{j=1}^{N} \alpha_j k_{ij} - b = y_i - \alpha^T k_i - b, \quad i = 1, \ldots, N    (20)

where k_i = [k_{i1}, ..., k_{iN}]^T \in R^{N\times 1} is a kernel vector with elements k_{ij} = \langle \phi(x_i), \phi(x_j) \rangle, \forall i, j = 1, ..., N. Then a new set of values for the weighting terms can be updated:

\rho(e_i) = \frac{\exp\!\left(-\dfrac{e_i^2}{2\sigma^2}\right)}{\sigma^3 \sqrt{2\pi}}, \quad i = 1, \ldots, N    (21)

and \Omega can also be updated with its diagonal elements \Omega_i = 1/(\gamma \rho(e_i)), i = 1, ..., N. Then, the new values of the coefficients \theta = [\alpha^T, b]^T can be re-estimated using eqs 19 and 20. The set of procedures between level 1 and level 2 is repeated until the
weighting terms \rho(e_i) are almost unchanged from run to run. Details will be discussed in section 3.2. Finally, a CKL identification model with the NARX form is obtained. For online estimation of a new sample x_t, the prediction \hat{y}_t can be obtained as follows:

\hat{y}_t = f(w, b; x_t) = \sum_{i=1}^{N} \alpha_i \langle \phi(x_i), \phi(x_t) \rangle + b = \alpha^T k_t + b    (22)

After training a CKL identification model, the outlier samples can be detected simultaneously. This is because the outlier samples have smaller weights \rho(e_i) (eq 21). Because of this advantage, the outliers can be removed by this postidentification method.46 Although these outliers are kept in the model, they cannot affect the identification model because their weights are very small. Therefore, despite the outlier samples, a robust CKL identification model can be obtained.
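The two-level procedure above can be summarized in a short sketch. The following NumPy implementation is a minimal illustration (the function name, the convergence test on \rho(e_i), the fixed iteration cap, and the direct matrix inversion are assumptions made for readability; section 3.2 gives the efficient update actually proposed):

```python
import numpy as np

def train_ckl(K, y, gamma, max_iter=10, tol=1e-3):
    """Two-level CKL training sketch (eqs 17-21).

    K     : N x N kernel matrix, K[i, j] = <phi(x_i), phi(x_j)>
    y     : N-vector of measured outputs
    gamma : regularization parameter (gamma > 0)
    """
    N = len(y)
    ones = np.ones(N)
    rho = np.ones(N)                 # level 1 initialization: equal weights
    for _ in range(max_iter):
        # Level 1: solve the linear system of eq 17 through eqs 18-19.
        P = np.linalg.inv(K + np.diag(1.0 / (gamma * rho)))        # eq 18
        b = (ones @ P @ y) / (ones @ P @ ones)                     # eq 19
        alpha = P @ (y - b * ones)
        # Level 2: residuals (eq 20) and new correntropy weights (eq 21).
        e = y - K @ alpha - b
        sigma = np.max(np.abs(e)) / (2.0 * np.sqrt(2.0))           # eq 10
        rho_new = np.exp(-e ** 2 / (2.0 * sigma ** 2)) / (sigma ** 3 * np.sqrt(2.0 * np.pi))
        if np.max(np.abs(rho_new - rho)) < tol:                    # weights settled
            rho = rho_new
            break
        rho = rho_new
    return alpha, b, rho             # small rho values flag likely outliers
```

A new sample x_t is then predicted through eq 22 as \hat{y}_t = \alpha^T k_t + b, where k_t collects the kernel evaluations between x_t and the training inputs.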
3.2. Efficient Training Procedure of the CKL Method. As aforementioned, a two-level iterative training procedure is proposed to obtain the CKL model. At each iteration, the main computational load is O(N^3) for the inverse of the matrix H = K + \Omega, i.e., P = H^{-1} in eq 18. The load might be reduced by using a conjugate gradient algorithm.13 Alternatively, a Cholesky factorization can also be applied because H is symmetric and positive definite.48 However, this is still computationally inefficient. Note that at each iteration, only the diagonal elements of \Omega change, while the other elements of H are always kept unchanged. Instead of using all the available data pairs to recalculate P, it is worthwhile to take advantage of the already available P to obtain the updated P with minimum effort.

First, according to the Sherman−Morrison−Woodbury formula,48 an efficient method for a rank-one update of the inverse of a matrix H can be obtained:

M^{-1} = (H + \mu v s^T)^{-1} = H^{-1} - \mu \, \frac{H^{-1} v s^T H^{-1}}{1 + \mu s^T H^{-1} v}    (23)

where M \in R^{N\times N} and H \in R^{N\times N} are both nonsingular matrices, v \in R^{N\times 1} and s \in R^{N\times 1} are vectors, and \mu = 1 (corresponding to an update) or \mu = −1 (corresponding to a downdate).48 This shows that the inverse of M = H + \mu v s^T can be computed from the already available inverse of H in a simple and inexpensive manner.

At the lth iteration, the related items H = K + \Omega, P = H^{-1}, \theta = [\alpha^T, b]^T, and \rho(e_i) of section 3.1 can be denoted as H_l = K + \Omega_l, P_l = H_l^{-1}, \theta_l = [\alpha_l^T, b_l]^T, and \rho_l(e_i), respectively. Correspondingly, the diagonal elements of \Omega_l can be denoted as \Omega_{l,i} = 1/(\gamma \rho_l(e_i)), i = 1, ..., N. Obviously, H_{l+1} can be expressed as

H_{l+1} = H_l + \sum_{i=1}^{N} \Delta_{l+1,i}    (24)

\Delta_{l+1,i} = \begin{bmatrix} 0 & \cdots & 0 \\ \vdots & \delta_{l+1,i} & \vdots \\ 0 & \cdots & 0 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1/\gamma \\ \vdots \\ 0 \end{bmatrix} [0 \cdots \eta_{l+1,i} \cdots 0] = v_{l+1,i} \, s_{l+1,i}^T, \quad i = 1, \ldots, N    (25)

where \delta_{l+1,i} = \eta_{l+1,i}/\gamma = \Omega_{l+1,i} - \Omega_{l,i} = \frac{1}{\gamma}\left(\frac{1}{\rho_{l+1}(e_i)} - \frac{1}{\rho_l(e_i)}\right), i = 1, ..., N. From eqs 24 and 25, H_{l+1} with the newly added element \Delta_{l+1,i} can be expressed as

H_{l+1,i} = H_{l+1,i-1} + \Delta_{l+1,i} = H_{l+1,i-1} + v_{l+1,i} \, s_{l+1,i}^T, \quad i = 1, \ldots, N    (26)

where H_{l+1,i}, i = 1, ..., N, denotes the ith modification of H_{l+1}, with H_l = H_{l+1,0} and H_{l+1} = H_{l+1,N} for simplicity. By plugging this expression into eq 23, the inverse of H_{l+1,i} is

P_{l+1,i} = (H_{l+1,i-1} + v_{l+1,i} s_{l+1,i}^T)^{-1} = P_{l+1,i-1} - \frac{P_{l+1,i-1} v_{l+1,i} s_{l+1,i}^T P_{l+1,i-1}}{1 + s_{l+1,i}^T P_{l+1,i-1} v_{l+1,i}}    (27)

where P_{l+1,i} = H_{l+1,i}^{-1}, i = 1, ..., N, and, correspondingly, P_l = P_{l+1,0} and P_{l+1} = P_{l+1,N} for simplicity. The efficient computation of P_{l+1,i} can thus be described as at most N rank-one modifications. However, it is not necessary to include all the elements to update P_{l+1}. In the development of this selection algorithm, two conditions should be considered:

• If \eta_{l+1,i} is too small, \Delta_{l+1,i} is unlikely to provide new information for P_{l+1}, i.e.,

|\eta_{l+1,i}| < \varepsilon_{\mathrm{update}}    (28)

and correspondingly, in this situation, P_{l+1,i} = P_{l+1,i-1}. Here \varepsilon_update > 0 is a user-defined small value used to check the modification of the matrix. Thus, m is the total number of updates for which |\eta_{l+1,i}| \ge \varepsilon_update, i = 1, ..., N, holds. Each rank-one updating step requires about O(N^2) calculations, so the total cost of obtaining the CKL model at the (l+1)th iteration is about O(mN^2). Generally, m \ll N when l \ge 1 and correspondingly O(mN^2) \ll O(N^3). Compared with direct computation of the inverse of H_{l+1} from scratch, the rank-one update algorithm is more efficient.

• If m is too small, updating the structure of P_{l+1} provides little new information and wastes computational time. A more practical criterion to stop the iteration can be defined via the update ratio:

\frac{m}{N} \le \varepsilon_{\mathrm{iteration}}    (29)

where 0 < \varepsilon_iteration \ll 1 is also a user-defined small value to judge the iteration steps. If eq 29 is satisfied, the (l+1)th iteration should be completed regardless of the maximal iteration number l_max. Thus, one more advantage of m is that it can be utilized as an index to check whether the iterative training procedure is complete. The flexible selection algorithm for the CKL identification model is preferable, so that P_{l+1} can be adjusted to compensate for changes in the process behavior.
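A minimal sketch of this selective rank-one updating is given below (the function signature and the return of the update ratio are assumptions made for illustration):

```python
import numpy as np

def update_P(P, omega_old, omega_new, gamma, eps_update=0.01):
    """Selective rank-one updates of P = (K + Omega)^{-1} (eqs 25-28).

    P          : inverse from the previous iteration, P_l
    omega_old  : previous diagonal of Omega, Omega_{l,i} = 1/(gamma*rho_l(e_i))
    omega_new  : new diagonal of Omega, Omega_{l+1,i}
    Returns the updated P_{l+1} and the update ratio m/N used in eq 29.
    """
    P = P.copy()
    N = len(omega_old)
    m = 0
    for i in range(N):
        delta = omega_new[i] - omega_old[i]          # delta_{l+1,i} in eq 25
        eta = gamma * delta                          # eta_{l+1,i} = gamma * delta
        if abs(eta) < eps_update:                    # eq 28: skip a tiny change
            continue
        p_i = P[:, i].copy()
        # Sherman-Morrison step of eq 27 for the rank-one change delta*e_i*e_i^T.
        P -= delta * np.outer(p_i, p_i) / (1.0 + delta * P[i, i])
        m += 1
    return P, m / N   # stop the outer iteration once m/N <= eps_iteration (eq 29)
```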
In summary, the proposed CKL method for nonlinear system identification and its two-level training procedure are illustrated in Figure 1.

Figure 1. Correntropy kernel learning (CKL) method for nonlinear system identification and its two-level efficient training procedure.

It should be noted that, using only the two-level algorithm in section 3.1, the CKL nonlinear identification model can be obtained in an iterative manner. Furthermore, section 3.2 provides an efficient training procedure for the CKL method. In view of the computational load, it is a necessary complement for CKL and makes the algorithm more practical. The training procedure in section 3.2 makes the CKL method more efficient in two ways. First, the matrix P_{l+1} can be updated quickly without starting from scratch. Moreover, a practical judgment is utilized to check whether the iterative training procedure is complete before the maximal number of iterations l_max. Consequently, the two-level training algorithm can be implemented in a more efficient manner in practice.

As for KL identification models, model/parameter selection is an important issue in machine learning. There are many theories which lead to different model selection criteria, e.g., cross-validation and Bayesian inference.11−13 However, there is still no optimal parameter selection theory specifically for industrial applications. In this paper, the MC-based k-fold CV procedure developed in section 2 is adopted. The Gaussian kernel K(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\tau) (\tau > 0) is used because it is a common kernel function. Consequently, the CKL identification model can be finally obtained using an MC-based k-fold CV procedure over a candidate set of kernel parameters \tau \in \tau_S and regularization parameters \gamma \in \gamma_S.

Generally, traditional identification methods, e.g., LS-SVM, are sensitive to outliers. They are based on the MSE loss function, which is only optimal when the underlying noises obey a Gaussian distribution.35−38 Unlike the MSE criterion, the non-Gaussian noise and outliers in the process can be suitably treated by the MC criterion.39 Additionally, the nonlinear relationship between the process input and output can be identified using the kernel trick and the regularization technique. Consequently, as expected, the proposed CKL method can obtain better identification performance in industrial systems, where the quality of modeling samples is often poor.

Remark 1. In essence, the proposed CKL method can be considered as a nonlinear robust estimator. At first glance, CKL is somewhat similar to the weighted LS-SVM approaches.35−38 However, most of the traditional weighted SVM and LS-SVM methods35−38 adopt different heuristic weighting strategies to reduce the effect of outlier samples. There are also several improved SVM-based methods adopting a robust error distribution other than the Gaussian. However, in practice, it is not a trivial task to design these weighting strategies (or the loss functions in SVM-related methods35). Additionally, it is difficult to check beforehand whether these weighting schemes are suitable for a complicated industrial data set. Unlike these heuristic schemes, which deal with special problems, CKL can better handle general identification problems with outliers. This is one main advantage of the proposed method.
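As an illustration of this model selection step, the sketch below performs an MC-based k-fold CV grid search using the CE index of eq 5 (the helper names, the fixed CE kernel width sigma_ce, and the reuse of the train_ckl sketch from section 3.1 are all assumptions for this example):

```python
import numpy as np
from itertools import product

def gauss_kernel(A, B, tau):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / tau)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / tau)

def select_parameters(X, y, tau_grid, gamma_grid, k=5, sigma_ce=1.0):
    """Pick (tau, gamma) maximizing the CE index (eq 5) over k-fold CV."""
    folds = np.array_split(np.random.permutation(len(y)), k)
    best, best_ce = None, -np.inf
    for tau, gamma in product(tau_grid, gamma_grid):
        scores = []
        for f in range(k):
            val = folds[f]
            trn = np.hstack([folds[j] for j in range(k) if j != f])
            alpha, b, _ = train_ckl(gauss_kernel(X[trn], X[trn], tau), y[trn], gamma)
            # Validation predictions via eq 22 and the CE index of eq 5.
            e = y[val] - (gauss_kernel(X[val], X[trn], tau) @ alpha + b)
            scores.append(np.mean(np.exp(-e ** 2 / (2.0 * sigma_ce ** 2))))
        if np.mean(scores) > best_ce:
            best, best_ce = (tau, gamma), np.mean(scores)
    return best
```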
4. ILLUSTRATIVE EXAMPLES

In this section, from different aspects and for comparison purposes, two examples are utilized to evaluate the characteristics of CKL and to illustrate its implementation, since it is a new identification method with a delicate algorithmic structure and properties. The simulation environment for both examples is MatLab V2009b with a CPU having a main frequency of 2.3 GHz and 4 GB of memory.

4.1. A Benchmark Problem for Comparative Study. First, a well-known SISO system in the NARX form, frequently referred to as a benchmark case by many other identification methods, e.g., ANN, SVM, KL, and fuzzy systems,5,9,16 is used to demonstrate the effect of CKL and to compare it with LS-SVM,13 which represents the existing popular KL approach. LS-SVM shows prediction performance comparable to that of SVM for many nonlinear modeling problems.13,16,19,20,37,38 Moreover, CKL is also compared with a weighted LS-SVM (WLS-SVM) method37 to show its adaptive weighting property by correntropy. The system output and its input variable can be described as5

y_{k+1} = \frac{y_k \, y_{k-1} \, y_{k-2} \, u_{k-1} (y_{k-2} - 1) + u_k}{1 + y_{k-1}^2 + y_{k-2}^2}, \qquad u_k = \begin{cases} \sin(\pi k/125), & k \le 203 \\ 0.8 \sin(\pi k/125) + 0.2 \sin(2\pi k/25), & k > 203 \end{cases}    (30)

Here, a sequence of 400 samples is collected. The first half of them are used for training, and the rest for testing. The general input vector of the CKL model consists of u_k and y_k, as well as their one- and two-step delayed terms, respectively, i.e., x_k = [y_k, y_{k-1}, y_{k-2}, u_k, u_{k-1}]^T. Both the deterministic and stochastic situations are considered, and the ability of the identification model to reduce the effect of outliers is specially explored.

First, all training and test samples are noise-free. The corresponding identification result for the test samples in the noise-free environment is shown in Figure 2. In this case, both the CKL and LS-SVM methods obtain almost the same good prediction performance. This is because the weights of the training samples, i.e., \rho(e_i), in the CKL model are almost the same in the noise-free environment. It also indicates that LS-SVM is a special case of CKL in the deterministic situation.
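For reproducibility, a minimal sketch generating this benchmark sequence and the corresponding NARX regressors is given below (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def benchmark_data(n=400):
    """Simulate eq 30 and build x_k = [y_k, y_{k-1}, y_{k-2}, u_k, u_{k-1}]
    with one-step-ahead targets y_{k+1}; the first half is for training."""
    k = np.arange(n)
    u = np.where(k <= 203,
                 np.sin(np.pi * k / 125.0),
                 0.8 * np.sin(np.pi * k / 125.0) + 0.2 * np.sin(2.0 * np.pi * k / 25.0))
    y = np.zeros(n + 1)
    for t in range(2, n):                        # eq 30 recursion
        y[t + 1] = (y[t] * y[t - 1] * y[t - 2] * u[t - 1] * (y[t - 2] - 1.0)
                    + u[t]) / (1.0 + y[t - 1] ** 2 + y[t - 2] ** 2)
    X = np.column_stack([y[2:n], y[1:n - 1], y[0:n - 2], u[2:n], u[1:n - 1]])
    Y = y[3:n + 1]                               # targets y_{k+1}
    return X, Y
```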
Figure 2. Both the CKL and LS-SVM identification methods for prediction of the nonlinear SISO system in the noise-free environment.

Then, to simulate the industrial environment, both the input and output signals are corrupted by independent Gaussian noise with a variance of 0.02. Furthermore, the system output is corrupted by some outliers (about 10% of the training and test samples). Correspondingly, the training samples become an error-in-variables set for model identification. This problem is still less investigated for nonlinear systems.2 The training results of a CKL identification model and its weighting terms \rho(e_i) are shown in Figure 3.

Figure 3. The main training results of a CKL identification model and its weighting terms \rho(e_i). (To show the figure more clearly, the other 50 training samples combined only with noise are not plotted.) Some outliers can be simply identified using a cutoff value once the CKL identification model is obtained.

As previously mentioned, a large training error leads to a small weight for the related sample. Consequently, the bad influence of outliers can be reduced by the effect of the correntropy-based weighting strategy. Finally, a relatively smooth model, which is more suitable for system identification, can be obtained.

Interestingly, the outlier samples can be simultaneously detected once a CKL identification model is obtained. Because of the effect of correntropy, outliers have much smaller weights than most normal samples. For example, using the cutoff value shown in Figure 3, some outliers can be simply postidentified. Additionally, after training a CKL model, the identification error distribution of the training set is shown in Figure 4. Most training errors obey the Gaussian distribution.

Figure 4. The identification error distribution of the CKL model for the training set.

Only a few of the errors are larger; they can be considered as outliers. Therefore, the CKL identification method can be further extended to outlier detection using a repeated procedure that removes the postidentified outliers from the training set. Although these outliers are retained in the CKL model, they show little effect on the identification performance because of their small weights. This advantage makes the CKL method suitable for practical identification problems, because modeling samples often contain different kinds of outliers and noise.

As previously mentioned, CKL can be considered as a weighted KL method. Here, to show the characteristics of CKL, it is further compared with the well-known WLS-SVM method proposed by Suykens et al.:37

\rho(e_i) = \begin{cases} 1, & \text{if } |e_i/\hat{s}| \le c_1 \\ \dfrac{c_2 - |e_i/\hat{s}|}{c_2 - c_1}, & \text{if } c_1 < |e_i/\hat{s}| \le c_2 \\ 10^{-4}, & \text{otherwise} \end{cases}    (31)
where c_1 and c_2 are user-defined parameters and \hat{s} is a robust estimate of the standard deviation of the LS-SVM error variables e_i.37 In the estimate of \hat{s}, one takes into account how much the estimated error distribution deviates from a Gaussian distribution.37 However, the WLS-SVM method is heuristic rather than general. Additionally, its weighting is not adaptive, because several user-defined parameters in eq 31 must be determined, which is not straightforward for many practical problems. This disadvantage exists in most of the traditional weighted KL methods using different heuristic strategies.35−38

The corresponding prediction result of the test samples combined only with Gaussian noise is shown in Figure 5. The WLS-SVM method shows slightly better prediction performance than LS-SVM. However, as shown in eq 31, its weights are discontinuous, and most of them are equal to 1 in this case. Consequently, CKL can achieve better prediction performance than the WLS-SVM and LS-SVM methods.
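The contrast between the two weighting schemes is easy to see in code. Below is a minimal sketch of both weight functions; the default values c1 = 2.5 and c2 = 3.0 are commonly suggested for this scheme but are an assumption here, since the text leaves them user-defined:

```python
import numpy as np

def wls_svm_weights(e, s_hat, c1=2.5, c2=3.0):
    """Heuristic, piecewise (discontinuous) WLS-SVM weighting of eq 31."""
    r = np.abs(np.asarray(e) / s_hat)
    return np.where(r <= c1, 1.0,
                    np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-4))

def ckl_weights(e, sigma):
    """Continuous, adaptive correntropy weighting of eq 21 for comparison."""
    e = np.asarray(e)
    return np.exp(-e ** 2 / (2.0 * sigma ** 2)) / (sigma ** 3 * np.sqrt(2.0 * np.pi))
```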
Figure 5. All of the CKL, WLS-SVM, and LS-SVM identification methods for prediction of the nonlinear SISO system with Gaussian noise. (The prediction comparisons when the system outputs are around 0 have validated that CKL can obtain much better results.)

When the system outputs are around 0, CKL is much better than the other two methods because of the large amplitude of the noise, which can also be considered as inconspicuous outliers or impulsive noise for these samples. The prediction result for the test samples combined with both noise and outliers is shown in Figure 6.

Figure 6. All of the CKL, WLS-SVM, and LS-SVM identification methods for prediction of the nonlinear SISO system with Gaussian noise and outliers. (The prediction comparisons shown in the ellipses have validated that CKL can reduce the effect of outliers and thus obtain better results.)

The prediction comparisons shown in the ellipses validate that CKL can reduce the effect of outliers and thus obtain better performance. Generally, CKL is superior to the WLS-SVM and LS-SVM methods in this error-in-variables environment. As shown in Figures 5 and 6, the CKL method traces most of the nonlinear dynamics during the test phase, although all samples are combined with different amplitudes of noise (compared to the samples themselves) and some outliers are large.

As for performance evaluation, two indices are investigated: the traditional RMSE and the CE proposed in section 2. Generally, in the Gaussian environment, a smaller value of RMSE means a larger value of CE; both indicate better prediction performance. Two scenarios are investigated below.

In the first scenario, the process is combined with Gaussian noise, as shown in Figure 5. CKL achieves the best prediction performance because its prediction is the closest to the realistic system output without outliers or noise. As shown in Table 1, the two indices CE_real and RMSE_real of the test samples clearly show the same results. However, the realistic system output is unknown, and correspondingly the two indices CE_real and RMSE_real in Table 1 are "virtual": they cannot be computed in a real situation. Alternatively, the two indices CE and RMSE should be used to evaluate the prediction performance according to the measured outputs. As shown in Table 1, the tendencies of CE and CE_real of the three methods are almost the same; i.e., 0.975/0.929/0.883 behaves almost the same as 0.985/0.945/0.899. The ratios of CE to CE_real (i.e., CE/CE_real) for the three methods are generally the same. On the other hand, at first glance, the tendencies of RMSE and RMSE_real of the three methods are also almost the same (0.023/0.040/0.053 and 0.018/0.035/0.049). However, the ratios of RMSE to RMSE_real (i.e., RMSE/RMSE_real) of the three methods are somewhat different, because some samples have a noise amplitude larger than their values, as shown in Figure 5. This indicates that CE and RMSE can both be adopted to evaluate the prediction performance in the Gaussian situation, and CE is more suitable when the amplitude of noise differs among the output samples.

In the second scenario, the process is combined with outliers and Gaussian noise (Figure 6). The simulations are run 10 times, and the mean results are listed in Table 1. Using both CE_real and RMSE_real for the test samples, CKL is superior to the other two methods. However, using the RMSE index alone, it is difficult to identify the best method because the values are almost the same (0.117/0.122/0.123). In our simulations, LS-SVM sometimes obtains the smallest RMSE among the three approaches. However, compared with the realistic output, LS-SVM actually suffers from overfitting of the outliers (Figure 6). Additionally, the ratios of RMSE to RMSE_real of the three methods are very different. Consequently, the traditional RMSE index widely adopted in process systems engineering is not a suitable index when the system is corrupted by outliers, because RMSE may be contaminated by the phenomenon of overfitting. Alternatively, as listed in Table 1, the tendencies of CE and CE_real of the three methods are still almost the same; i.e., 0.922/0.917/0.913 behaves almost the same as 0.980/0.973/0.968. The ratios of CE to CE_real of the three methods are almost the same because the larger errors receive larger attenuation. Therefore, the CE index is at least a useful auxiliary for performance evaluation. It is suggested that the CE index be utilized as a more suitable index for performance evaluation in the error-in-variables environment, especially with outliers.

Finally, the efficient training procedure of the CKL identification method developed in section 3.2 is investigated. In this work, the maximal iteration number is l_max = 10. The values of \varepsilon_update and \varepsilon_iteration can be chosen to be very small, e.g., \varepsilon_update = 0.01 and \varepsilon_iteration = 0.01. For the second scenario, the average results of 10 simulations are shown in Figure 7. In the first iteration, the weights of all the samples need to be updated. In the second iteration, only about 4% of the samples need to be updated; that is to say, about 96% of the samples keep the same weights from the second iteration, as shown in eq 28.
Table 1. Proposed CE and Traditional RMSE Indices for Performance Evaluation of the CKL, WLS-SVM, and LS-SVM Identification Methods in Different Scenarios^a

scenario                                      method      CE      RMSE    CE_real   RMSE_real   CE/CE_real   RMSE/RMSE_real
Gaussian noise, 100 test samples (Figure 5)   CKL         0.975   0.023   0.985     0.018       0.990        1.278
                                              WLS-SVM37   0.929   0.040   0.945     0.035       0.983        1.143
                                              LS-SVM13    0.883   0.053   0.899     0.049       0.982        1.082
outliers and noise, 200 test samples,         CKL         0.922   0.117   0.980     0.041       0.941        2.854
10 simulations                                WLS-SVM37   0.917   0.125   0.973     0.047       0.942        2.660
                                              LS-SVM13    0.913   0.127   0.968     0.053       0.943        2.396

^a CE and RMSE are evaluated against the measured system output; CE_real and RMSE_real are evaluated against the realistic system output without outliers or noise. A larger value of CE is better; a smaller value of RMSE is better.

Figure 7. The total update ratio (m/N in eq 29) in each iteration for training a CKL identification model (average results of 10 simulations).

On average, only four iterations are needed to train a CKL model. From the second to the fourth iterations, the total update ratios are always small, so only a few weights need to be updated. Consequently, it is more efficient to use this strategy in practice.

4.2. An Industrial Example in Taiwan. In this section, an industrial example in Taiwan is explored to further validate the effectiveness of the CKL identification method. A simplified flowchart of the process is shown in Figure 8. This reboiler is used to produce pure ethylene chloride.

Figure 8. A simplified flowchart of the reboiler in the production of ethylene chloride.

On the basis of fundamental knowledge and past experience in this production line, one important control loop for the final product is manipulation of the liquid level of NC-103B shown in Figure 8. Advanced control strategies can be designed once a suitable identification model for this loop is obtained. Therefore, the proposed CKL identification method is further validated using the data of December 2011 in this industrial case. In this process, the sampling time is 30 s. None of the training and testing data are preprocessed. The samples collected in the first 100 min are used for training, and two scenarios are investigated for testing.

The First Scenario: The testing samples are data from a sequence of 100 min under almost the same operating conditions as the training set.

The Second Scenario: Compared with the training set, the testing samples are data from a sequence of 60 min with a small change in the operating conditions.

Without any prior knowledge, the general input vector of the CKL model consists of u_k and y_k, as well as their one-step delayed terms, respectively, i.e., x_k = [y_k, y_{k-1}, u_k, u_{k-1}]^T. The training results of the CKL and WLS-SVM identification models and their weighting terms \rho(e_i) are shown in Figure 9. Compared to the simulation case in section 4.1, the outliers are not obvious, and some of them might be masked by their adjacent outliers. This is very common in many processes,29 especially when the available training samples are insufficient. The fitting results of CKL and WLS-SVM are almost the same. Additionally, the weighting terms of CKL are continuous; they can be adaptively determined by their training errors. However, the weights of WLS-SVM are discontinuous, and most of them are equal to 1. Therefore, the heuristic weighting strategy of traditional WLS-SVM37 is not suitable for many practical problems.

As for the first scenario, the prediction results of the testing samples for all of the CKL, WLS-SVM, and LS-SVM identification methods are shown in Figure 10. The corresponding prediction errors of the three methods are shown in Figure 11. In Figures 9−11, although the fitting results of the CKL and WLS-SVM approaches for the training data are almost the same (Figure 9), the prediction results on the testing set are different (Figures 10 and 11). Generally, the CKL method achieves better prediction performance than the other two methods because more samples have relatively small prediction errors.
Figure 9. The main training results of the CKL and WLS-SVM identification models and their weighting terms \rho(e_i). The \rho(e_i)'s of CKL are continuous and adaptively weighted by their training errors. The weights of WLS-SVM are discontinuous, and most of them are equal to 1.

Figure 10. All of the CKL, WLS-SVM, and LS-SVM identification methods for prediction for the first scenario of the industrial process.

Figure 11. Prediction errors of all the CKL, WLS-SVM, and LS-SVM identification methods for the first scenario of the industrial process.

As for the first scenario, both the CE and RMSE indices are adopted to evaluate the performance quantitatively, as tabulated in Table 2. The identification results of all the CKL, WLS-SVM, and LS-SVM methods with different model orders are also summarized in Table 2.

Table 2. Proposed CE and Traditional RMSE Indices for Performance Evaluation of the CKL, WLS-SVM, and LS-SVM Identification Methods for the First Scenario of the Industrial Process in Taiwan^a

general input vector (model orders)                      method      CE     RMSE
x_k = [y_k, u_k]^T                                       CKL         0.48   0.42
                                                         WLS-SVM37   0.46   0.43
                                                         LS-SVM13    0.46   0.43
x_k = [y_k, u_k, u_{k-1}]^T                              CKL         0.45   0.42
                                                         WLS-SVM37   0.32   0.53
                                                         LS-SVM13    0.33   0.52
x_k = [y_k, y_{k-1}, u_k]^T                              CKL         0.38   0.39
                                                         WLS-SVM37   0.34   0.46
                                                         LS-SVM13    0.35   0.44
x_k = [y_k, y_{k-1}, u_k, u_{k-1}]^T                     CKL         0.67   0.34
                                                         WLS-SVM37   0.52   0.45
                                                         LS-SVM13    0.54   0.43
x_k = [y_k, y_{k-1}, y_{k-2}, u_k, u_{k-1}, u_{k-2}]^T   CKL         0.47   0.39
                                                         WLS-SVM37   0.36   0.50
                                                         LS-SVM13    0.37   0.48

^a A larger value of CE is better; a smaller value of RMSE is better; the best result for each x_k is that of CKL.

In Table 2, whatever model orders are chosen, the CKL method achieves better performance than WLS-SVM and LS-SVM. In this case, suitable model orders can be roughly determined from both the CE and RMSE indices, i.e., x_k = [y_k, y_{k-1}, u_k, u_{k-1}]^T. If the model orders are insufficient, i.e., x_k = [y_k, u_k]^T, CKL is only a little better than WLS-SVM and LS-SVM. If the model orders are a little larger, e.g., x_k = [y_k, y_{k-1}, y_{k-2}, u_k, u_{k-1}, u_{k-2}]^T, CKL can still obtain much better predictions than WLS-SVM and LS-SVM, because errors in the variables accumulate with more orders. Therefore, from the results of offline identification in Table 2, the model orders can be simply determined using the CKL method. Intuitively, the concept of correntropy may be extended to the issue of model order determination, which is an interesting topic for future work.

As for the second scenario, the prediction results of the testing samples for all of the CKL, WLS-SVM, and LS-SVM identification methods are shown in Figure 12. The corresponding prediction error distributions of the three methods are shown in Figure 13. In this study case, the operating conditions of the training set are somewhat different. As shown in Figure 12, the online prediction results in the range of 15−40 min validate that CKL is superior to WLS-SVM and LS-SVM. This is because the identification model based on the traditional LS-SVM method is contaminated by some outliers: all weights of the LS-SVM identification model are equal, so the dynamic relationship between y_{k+1} and x_k is distorted to some extent. WLS-SVM is better than LS-SVM because the former utilizes a heuristic weighting strategy. Among the three methods, the CKL method achieves the best prediction performance with better CE and RMSE indices. Additionally, as shown in Figure 13, the prediction error distribution has a more Gaussian shape in this case. This is mainly because most of the internal dynamic relationship has been captured regardless of the existence of outliers.
Figure 12. Prediction results of all the CKL, WLS-SVM, and LS-SVM identification methods for the second scenario of the industrial process.

Figure 13. Prediction error distributions of all the CKL, WLS-SVM, and LS-SVM identification methods for the second scenario of the industrial process.

Finally, the training procedure of the CKL method developed in section 3.2 is further investigated for this industrial process. As shown in Figure 14 (x_k = [y_k, y_{k-1}, u_k, u_{k-1}]^T), all the weights of the samples need to be updated in the first iteration. In the second iteration, only about 10% of the samples need to be updated; that is to say, about 90% of the samples keep the same weights from the second iteration, as shown in eq 28. Seven iterations are needed to train this CKL model (for \varepsilon_iteration = 0.02). From the second to the seventh iterations, the update ratios are always small, so only a few weights need to be updated. Generally, the overall computational load of CKL is only a little larger than that of WLS-SVM, which is acceptable in most offline training applications. Therefore, it is more efficient to use this strategy in practice.

Figure 14. The update ratio (m/N in eq 29) in each iteration for training a CKL identification model of the industrial process.

Compared with the iterative results of the benchmark problem shown in Figure 7, the update ratio in each iteration of this industrial case is relatively large. This is mainly because the outliers are not obvious when the samples come from a normally running process. Instead, most of the samples are corrupted with noise, and some of them are indistinguishable outliers, as can also be seen in Figure 9. Therefore, this case can also be considered as an identification problem in the error-in-variables environment. By removing obvious outliers, the CKL method can achieve better results than the LS-SVM and WLS-SVM methods.

In this work, the proposed CKL method can be considered as an autoweighted KL method; therefore, it is mainly compared with the LS-SVM and WLS-SVM methods. As for outlier detection, several useful statistical methods are available. The Gaussian mixture model (GMM),28,31 which is a semiparametric method, can represent the data distribution for outlier detection. These methods can be utilized in the preprocessing step before identification. Additionally, the combination of the GMM-based method with the KL framework for nonlinear system identification is also an interesting topic for the future.

In summary, as shown in this paper, the proposed CKL method is more general and efficient than the LS-SVM and WLS-SVM methods for identification of nonlinear systems with outliers and noise. In general, this identification method can be extended to other KL models, such as state-dependent ARX models,15 nonlinear autoregressive and moving average models,17 subspace models,18 Hammerstein/Wiener models,19−21 and so forth. Similar to some KL methods, the CKL method for SISO NARX systems can be extended straightforwardly to multi-input−multi-output NARX systems.16

5. CONCLUSIONS

The objective of this work is to develop a general nonlinear identification method for systems with outliers and noise. Unlike the traditional MSE criterion adopted by almost all existing identification methods, correntropy is introduced into the area of nonlinear system identification. Without much effort in outlier detection, the proposed CKL method can be utilized for identification of nonlinear systems with outliers. The main appealing properties, such as the structural risk minimization principle, the kernel technique, the convex optimization problem, and a few free parameters to be adjusted, are preserved in this identification framework. Moreover, its distinguishing characteristics can be summarized in three main aspects:
(1) The CKL identification model can be obtained through an efficient two-level training procedure.
(2) Unlike traditional weighted KL methods, the CKL method can determine its continuous weights in an adaptive manner. The internal dynamic relationship of a nonlinear identification model can be captured well.
(3) Without resorting to unnecessary efforts, the outlier samples can be simultaneously detected once the CKL identification model is obtained. Furthermore, with the help of the novel CE index, the performance evaluation of nonlinear identification models should be more suitable for systems with outliers.
The superiority of the proposed CKL method in terms of accuracy and reliable performance has been validated through a benchmark example and an industrial process. Some interesting future studies have also been highlighted.
AUTHOR INFORMATION

Corresponding Author
*Tel.: +886-3-2654107. Fax: +886-3-2654199. E-mail: jason@wavenet.cycu.edu.tw.

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors would like to gratefully acknowledge the National Science Council, R.O.C., and the National Natural Science Foundation of China (Grant No. 61004136) for their financial support.

NOMENCLATURE: MAIN ABBREVIATIONS

ARX = autoregressive with exogenous input
ANN = artificial neural networks
CKL = correntropy kernel learning
CV = cross-validation
GMM = Gaussian mixture model
KL = kernel learning
LS-SVM = least-squares support vector machines
MAP = maximum a posteriori
MC = maximum correntropy
NARX = nonlinear autoregressive with exogenous input
SISO = single-input−single-output
SVM = support vector machines

PARAMETERS

CE = correntropy-based error
MSE = mean squared error
RMSE = root mean squared error

REFERENCES
(1) Ljung, L.; Hjalmarsson, H.; Ohlsson, H. Four encounters with system identification. Eur. J. Control 2011, 17, 449−471.
(2) Söderström, T. System identification for the errors-in-variables problem. Trans. Inst. Meas. Control (London) 2012, 34, 780−792.
(3) Hong, X.; Mitchell, R. J.; Chen, S.; Harris, C. J.; Li, K.; Irwin, G. W. Model selection approaches for non-linear system identification: a review. Int. J. Syst. Sci. 2008, 39, 925−946.
(4) Zhu, Y. C. Multivariable System Identification for Process Control; Elsevier Science & Technology Books: United Kingdom, 2001.
(5) Narendra, K. S.; Parthasarathy, K. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1990, 1, 4−27.
(6) Himmelblau, D. M. Accounts of experiences in the application of artificial neural networks in chemical engineering. Ind. Eng. Chem. Res. 2008, 47, 5782−5796.
(7) Bakshi, B. R.; Stephanopoulos, G. Wave-net: a multiresolution, hierarchical neural network with localized learning. AIChE J. 1993, 39, 57−81.
(8) Chen, J. H.; Bruns, D. D. WaveARX neural network development for system identification using a systematic design synthesis. Ind. Eng. Chem. Res. 1995, 34, 4420−4435.
(9) Wang, L. X. Adaptive Fuzzy Systems and Control: Design and Stability Analysis; Prentice Hall: Upper Saddle River, NJ, 1994.
(10) Harris, C. J.; Hong, X.; Gan, Q. Adaptive Modelling, Estimation and Fusion from Data: A Neurofuzzy Approach; Springer-Verlag: Heidelberg, 2002.
(11) Vapnik, V. N. The Nature of Statistical Learning Theory; Springer-Verlag: New York, 1995.
(12) Schölkopf, B.; Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, 2002.
(13) Suykens, J. A. K.; Van Gestel, T.; De Brabanter, J.; De Moor, B.; Vandewalle, J. Least Squares Support Vector Machines; World Scientific: Singapore, 2002.
(14) Zhang, J.; Sato, T.; Iai, S. Novel support vector regression for structural system identification. Struct. Control Health Monit. 2006, 14, 609−626.
(15) Toivonen, H. T.; Tötterman, S.; Åkesson, B. Identification of state-dependent parameter models with support vector regression. Int. J. Control 2007, 80, 1454−1470.
(16) Liu, Y.; Wang, H. Q.; Yu, J.; Li, P. Selective recursive kernel learning for online identification of nonlinear systems with NARX form. J. Process Control 2010, 20, 181−194.
(17) Martínez-Ramón, M.; Rojo-Álvarez, J. L.; Camps-Valls, G.; Muñoz-Marí, J.; Navia-Vázquez, Á.; Soria-Olivas, E.; Figueiras-Vidal, A. R. Support vector machines for nonlinear kernel ARMA system identification. IEEE Trans. Neural Networks 2006, 17, 1617−1622.
(18) Verdult, V.; Verhaegen, M. Kernel methods for subspace identification of multivariable LPV and bilinear systems. Automatica 2005, 41, 1557−1565.
(19) Goethals, I.; Pelckmans, K.; Suykens, J. A. K.; De Moor, B. Identification of MIMO Hammerstein models using least squares support vector machines. Automatica 2005, 41, 1263−1272.
(20) Li, C. H.; Zhu, X. J.; Cao, G. Y.; Sui, S.; Hu, M. R. Identification of the Hammerstein model of a PEMFC stack based on least squares support vector machines. J. Power Sources 2008, 175, 303−316.
(21) Tötterman, S.; Toivonen, H. T. Support vector method for identification of Wiener models. J. Process Control 2009, 19, 1174−1181.
(22) Gregorcic, G.; Lightbody, G. Nonlinear system identification: From multiple-model networks to Gaussian processes. Eng. Appl. Artificial Intelligence 2008, 21, 1035−1055.
(23) Ni, W. D.; Wang, K.; Chen, T.; Ng, W. J.; Tan, S. K. GPR model with signal preprocessing and bias update for dynamic processes modeling. Control Eng. Pract. 2012, 20, 1281−1292.
(24) Kadlec, P.; Gabrys, B.; Strandt, S. Data-driven soft sensors in the process industry. Comput. Chem. Eng. 2009, 33, 795−814.
(25) Grubbs, F. E. Procedures for detecting outlying observations in samples. Technometrics 1969, 11, 1−21.
(26) Pearson, R. K. Outliers in process modeling and identification. IEEE Trans. Control Syst. Technol. 2002, 10, 55−63.
(27) Chiang, L. H.; Pell, R. J.; Seasholtz, M. B. Exploring process data with the use of robust outlier detection algorithms. J. Process Control 2003, 13, 437−449.
(28) Hodge, V. J.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22, 85−126.
(29) Liu, H.; Shah, S. L.; Jiang, W. Online outlier detection and data cleaning. Comput. Chem. Eng. 2004, 28, 1635−1647.
(30) Zeng, J. S.; Gao, C. H. Improvement of identification of blast furnace ironmaking process by outlier detection and missing value imputation. J. Process Control 2009, 19, 1519−1528.
(31) Yang, X. W.; Latecki, L. J.; Pokrajac, D. Outlier detection with globally optimal exemplar-based GMM. In Proceedings of the SIAM International Conference on Data Mining; SIAM: Philadelphia, PA, 2009; pp 145−154.
(32) Khatibisepehr, S.; Huang, B. Dealing with irregular data in soft sensors: Bayesian method and comparative study. Ind. Eng. Chem. Res. 2008, 47, 8713−8723.
(33) Khatibisepehr, S.; Huang, B. A Bayesian approach to robust process identification with ARX models. AIChE J. 2013, 59, 845−859.
(34) Yu, J. A Bayesian inference based two-stage support vector regression framework for soft sensor development in batch bioprocesses. Comput. Chem. Eng. 2012, 41, 134−144.
(35) Chuang, C. C.; Su, S. F.; Jeng, J. T.; Hsiao, C. C. Robust support vector regression networks for function approximation with outliers. IEEE Trans. Neural Networks 2002, 13, 1322−1330.
(36) Hong, D. H.; Hwang, C. Support vector fuzzy regression machines. Fuzzy Sets Syst. 2003, 138, 271−281.
(37) Suykens, J. A. K.; De Brabanter, J.; Lukas, L.; Vandewalle, J. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 2002, 48, 85−105.
(38) Wen, W.; Hao, Z. F.; Yang, X. W. A heuristic weight-setting strategy and iteratively updating algorithm for weighted least-squares support vector regression. Neurocomputing 2008, 71, 3096−3103.
(39) Liu, W. F.; Pokharel, P. P.; Príncipe, J. C. Correntropy: properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286−5298.
(40) Príncipe, J. C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer-Verlag: New York, 2010.
(41) Chen, B. D.; Príncipe, J. C. Maximum correntropy estimation is a smoothed MAP estimation. IEEE Signal Process. Lett. 2012, 19, 491−494.
(42) He, R.; Zheng, W. S.; Hu, B. G.; Kong, X. W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074−2100.
(43) Huijse, P.; Estevez, P. A.; Zegers, P.; Príncipe, J. C.; Protopapas, P. Period estimation in astronomical time series using slotted correntropy. IEEE Signal Process. Lett. 2011, 18, 371−374.
(44) Xu, J.; Príncipe, J. C. A pitch detector based on a generalized correlation function. IEEE Trans. Audio, Speech, Language Process. 2008, 16, 1420−1432.
(45) Singh, A.; Príncipe, J. C. Information theoretic learning with adaptive kernels. Signal Process. 2011, 91, 203−213.
(46) Munoz, J. C.; Chen, J. H. Removal of the effects of outliers in batch process data through maximum correntropy estimator. Chemom. Intell. Lab. Syst. 2012, 111, 53−58.
(47) Silverman, B. W. Density Estimation for Statistics and Data Analysis; Chapman & Hall/CRC: London, 1986.
(48) Golub, G. H.; van Loan, C. F. Matrix Computations, 3rd ed.; The Johns Hopkins University Press: Baltimore, MD, 1996.