
Ind. Eng. Chem. Res. 2007, 46, 788-800

PROCESS DESIGN AND CONTROL

Modeling a Large-Scale Nonlinear System Using Adaptive Takagi-Sugeno Fuzzy Model on PCA Subspace

Jialin Liu*

Department of Information Management, Fortune Institute of Technology, 1-10, Nwongchang Road, Neighborhood 28, Lyouciyou Village, Daliao Township, Kaohsiung County, Taiwan, Republic of China

* Phone: +886-7-7889888, ext. 6122. Fax: +886-7-7889777. E-mail: [email protected].

A data-driven Takagi-Sugeno (TS) fuzzy model is developed for modeling a real plant situation with dependent inputs and a nonlinear, time-varying input-output relation. The collinearity of the inputs is eliminated through principal component analysis (PCA). The TS model splits the operating regions into a collection of IF-THEN rules. For each rule, the premise is generated by clustering the compressed input data, and the consequence is represented as a linear model. A post-update algorithm for the model parameters is also proposed to accommodate the time-varying nature of the plant. The effectiveness of the proposed model is demonstrated using real plant data from a polyethylene process.

1. Introduction

In many manufacturing processes, a model that can infer the final product qualities from online measurements is extremely useful for process monitoring and fault detection. However, development of such a model is difficult due to the nonlinear and time-varying nature of the input-output relation. For example, a polyethylene plant operates in multiple regions in order to manufacture products with different grades for different customers. New grades are being developed constantly in order to meet their evolving needs. Thus versatility and adaptability are desirable characteristics of a good process model, in addition to accuracy.

There are two main categories of process models: "first-principle" models and data-driven models. A first-principle physical model can be obtained from fundamental process knowledge. However, due to the complexity of the manufacturing process, such fundamental models either require a lot of effort and time to develop or are too simplistic to be accurate in practice. On the other hand, data-driven models provide accurate information for a particular process through the analysis of plant input-output data. Linear approaches can be easily updated to accommodate the time-varying nature of the plant; however, they lack the ability to extrapolate into different operating regions. To cover a wide range of operations, nonlinear models may be used. However, determining the appropriate structure of a nonlinear model is not easy, and such models are therefore difficult to update. Alternatively, an array of piecewise linear models may be used.

Measurement noises and redundancy often exist in the collected data. The former corrupts the data quality and reduces the model reliability; the latter induces collinearity of the variables, which results in over-fitting. Principal component analysis (PCA) is a popular multivariate statistical method for compressing high-dimensional inputs into a few orthogonal variables. Meanwhile, the measurement noises can also be properly filtered. By projecting online data onto the PCA subspace

constructed from the normal operating condition (NOC) data, the residuals and the distances from the subspace origin can be measured for process monitoring. Research on using PCA for online fault detection has been reviewed comprehensively by Russell et al.1 It is also advantageous to use compressed data for process modeling. In the online prediction phase, the inputs of new data have to be compressed before the outputs are predicted. If the compressed inputs are not consistent with the reference data set that was used to construct the model, alarms should be triggered instead of predicting infeasible outputs. Unreasonable extrapolation can therefore be avoided.

If piecewise linear models are used, a suitable model must be selected. For example, the Takagi-Sugeno (TS) model2 can be used to select a suitable linear model by fuzzy IF-THEN rules. The antecedent or premise of each rule can be described through linguistic variables and corresponding fuzzy sets.3 In a chemical process plant there are many variables. Moreover, many quantified descriptions may be required to represent the expert's knowledge. The number of fuzzy rules then increases exponentially with the number of linguistic variables, a problem known as the "curse of dimensionality". In order to alleviate the rule-explosion problem, Wang4 proved that the hierarchical approach5 increases the number of rules only linearly with the number of input variables by hierarchically connecting low-dimensional fuzzy subsystems. An alternative approach is to generate fuzzy partitions on the feature subspace extracted from the data distribution through PCA. Ravi et al.6 applied this method to wine classification and breast cancer determination problems. They reported that the fuzzy rules could be effectively reduced without loss of data classification capability.

Fuzzy rules can be generated using a fuzzy clustering method such as the fuzzy c-means (FCM) algorithm.7 The objective of FCM is to partition observations into c clusters based on the distances measured from the centers of the clusters. It imposes a hyperspherical shape on each cluster, regardless of the actual data distribution. Gustafson and Kessel8 extended FCM with an adaptive distance norm in order to reflect clusters of different geometrical shapes in one data set. FCM uses the constraint that the memberships of an observation in all clusters



must sum to 1, so that all new data are classified into the known groups and new operating regions cannot be identified. Krishnapuram and Keller9 proposed a possibilistic c-means (PCM) that uses a membership function that is inversely proportional to the dissimilarity between the observations and the cluster centers. Since they relaxed the constraints on the membership functions, the objective function had to be reformulated with a resolution parameter in order to avoid a trivial solution. The interpretation of memberships in FCM as degrees of sharing is therefore replaced in PCM by the possibilities of the observations belonging to the groups. Outliers can be detected through the sum of the memberships over all clusters in PCM. However, the approach introduces the resolution parameter, which is estimated from the dissimilarities among the clusters and the fuzzifier. In the fuzzy clustering literature, the fuzzifier is an empirical value, and in most cases it is set to 2. Krishnapuram and Keller10 reported that the PCM clustering results are sensitive to the fuzzifier. A proper fuzzifier therefore has to be determined before clustering the reference data with PCM in order to achieve a feasible solution.

Since real plant data have a time-varying nature, process models have to take new observations into account. Adaptive FCM11,12 has been applied to pattern recognition and fault detection in the anaerobic digestion process. In these approaches, the prototypes of FCM can be updated with new data. Since they coped with a two-dimensional data set, the collinearity of the data was not an issue. Teppola et al.13 applied adaptive FCM to monitor a wastewater treatment plant with 20 variables. In order to eliminate collinear variables, PCA was used to compress the reference data before fuzzy clustering, and the scores were clustered instead of the original variables. In their approach, when the prototypes were updated with new observations, the consistency of the new data with the reference data set was not validated. It is therefore possible that the prototypes were updated with infeasible scores (i.e., new data that did not belong to the PCA subspace). Adaptive FCM incorporating the TS fuzzy model has also been applied to monitoring and modeling a biological wastewater treatment process.14 The scores of the new observations were used to update the prototypes of FCM according to Teppola et al.,13 whereas the consequent parts of the TS model were not updated with the additional information. These approaches11-14 adapted known prototypes with new observations; however, an adaptive method for new operating conditions was not discussed.

Bayesian classification has also been extensively applied to different areas of pattern recognition.15 The mixture model is a linear combination of prior probabilities and conditional probability density functions, and it is used for evaluating the unknown probabilities in Bayesian classification. Generally, the maximum likelihood (ML) method is used to estimate the parameters of the mixture model: the means and covariances of each density function and the prior probabilities. Before new observations are classified into the clustered groups, the conditional probabilities can be used to evaluate how well the data fit the mixture. An outlier has a low value for every conditional probability density function,16 whereas good data have a high conditional probability for at least one class.
In this paper, historical data are used to build a piecewise linear virtual sensor model that infers the final quality from process variables. Before the inferential model is built, PCA is applied to remove collinearity within the inputs and to filter out outliers from the plant data. The compressed data are then clustered, using Bayesian classification, into antecedents of the IF-THEN rules of a TS model to represent the different operating regions. A

linear prediction model is trained as the consequence of each IF-THEN rule. In the online prediction phase, the process inputs should be within the PCA subspace and belong to at least one known operating region; otherwise, alarms are triggered to inform the operator about abnormalities. When the data of new events cannot be explained by the PCA subspace, the subspace has to be reconstructed to cover the new events. The clusters and inferential models are updated to the new PCA subspace using the proposed method and a recursive least-squares algorithm.

This paper is organized as follows. Section 2 presents the approach of splitting a nonlinear system into a series of linear subsystems. First, the statistical data compression tool, PCA, is briefly described. After that, the Bayesian classification method is applied to cluster the compressed inputs. Finally, a TS fuzzy model is presented for classifying the outputs into groups and building local linear subsystems. In order to accommodate the time-varying nature of processes, an adaptive TS model algorithm that updates the clusters and linear models to the new subspace is presented in section 3. Section 4 summarizes the proposed approach, in which the nonlinear modeling, online prediction, and post-updating model phases are discussed. In section 5, the advantages of the proposed method are assessed through a simulation example. In addition, real plant data from a polyethylene process are applied to demonstrate the effectiveness of the proposed method. The conclusions are given in the last section.

2. Basic Theory

2.1. Principal Component Analysis. Consider the data matrix W ∈ R^(m×n) with m rows of observations and n columns of variables. Each column is normalized to zero mean and unit variance: X = (W − 1W̄)S^(−1), where W̄ is the mean vector, 1 is a column vector whose elements are one, and S is a diagonal matrix of standard deviations. The eigenvectors (P) of the covariance matrix can be obtained from the normalized data set. The score vectors are the projections of the data matrix X onto each eigenvector:

  t_i = X p_i,  i = 1 ... n   (1)

The data matrix X can be decomposed as

  X = t_1 p_1^T + t_2 p_2^T + ... + t_K p_K^T + t_(K+1) p_(K+1)^T + ... + t_n p_n^T = X̂ + E   (2)

with X̂ being the projection of the data matrix X onto the subspace formed by the first K eigenvectors and E being the remainder of X, which is orthogonal to the subspace. The statistic Q is defined in order to examine whether new data can be explained by the PCA subspace:

  Q = e e^T = (x − x̂)(x − x̂)^T = x (I − P_K P_K^T) x^T   (3)

The loading vectors P_K ∈ R^(n×K) are the first K eigenvectors of the covariance matrix. The statistic Q is a measure of the approximation error of new data with respect to the PCA subspace. The confidence limit of Q is defined as follows:17

  Q_α = Θ_1 [c_α (2 Θ_2 h_0^2)^(1/2) / Θ_1 + 1 + Θ_2 h_0 (h_0 − 1) / Θ_1^2]^(1/h_0)

  h_0 = 1 − 2 Θ_1 Θ_3 / (3 Θ_2^2),  Θ_i = Σ_(j=K+1..n) λ_j^i,  i = 1, 2, 3   (4)

The percentile α is the probability of a type I error in hypothesis testing, and c_α is the standard normal deviate corresponding to the upper (1 − α) percentile.


Another measure of the difference between new data and the PCA subspace is the statistic T²:

  T² = x P_K Λ^(−1) P_K^T x^T   (5)

The diagonal matrix Λ contains the first K eigenvalues, Λ = diag[λ_1, λ_2, ..., λ_K]. The confidence limit is defined as

  T²_α = K(m − 1)/(m − K) F_(K,m−1,α)   (6)

where F_(K,m−1,α) is an F distribution with degrees of freedom K and m − 1. The new data belong to the PCA subspace within the (1 − α) confidence limits only when Q < Q_α and T² < T²_α.
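As a concrete illustration, the following minimal Python sketch (the function and variable names are ours, not from the paper; NumPy and SciPy are assumed) fits the PCA model on normalized data and applies the Q and T² tests of eqs 3-6 to a new observation:

    import numpy as np
    from scipy.stats import f, norm

    def fit_pca(X):
        """Eigendecomposition of the covariance matrix of the normalized data X."""
        eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        order = np.argsort(eigvals)[::-1]            # descending eigenvalues
        return eigvecs[:, order], eigvals[order]

    def control_limits(eigvals, K, m, alpha=0.01):
        """Q_alpha from eq 4 and T2_alpha from eq 6."""
        th = [np.sum(eigvals[K:] ** i) for i in (1, 2, 3)]
        h0 = 1.0 - 2.0 * th[0] * th[2] / (3.0 * th[1] ** 2)
        ca = norm.ppf(1.0 - alpha)                   # upper (1 - alpha) normal deviate
        q_lim = th[0] * (ca * np.sqrt(2.0 * th[1] * h0 ** 2) / th[0]
                         + 1.0 + th[1] * h0 * (h0 - 1.0) / th[0] ** 2) ** (1.0 / h0)
        t2_lim = K * (m - 1.0) / (m - K) * f.ppf(1.0 - alpha, K, m - 1)
        return q_lim, t2_lim

    def in_subspace(x, P, eigvals, K, q_lim, t2_lim):
        """True when a normalized row x satisfies Q < Q_alpha and T2 < T2_alpha."""
        t = x @ P[:, :K]                             # scores (eq 1)
        Q = float(x @ x - t @ t)                     # x (I - P_K P_K^T) x^T (eq 3)
        T2 = float(np.sum(t ** 2 / eigvals[:K]))     # x P_K Lambda^-1 P_K^T x^T (eq 5)
        return Q < q_lim and T2 < t2_lim

Only observations passing both tests are handed to the clustering and regression stages; the same gate reappears in the online prediction phase of section 4.2.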

2.2. Bayesian Classification. In this subsection, the Bayesian classification method15 is briefly described. The prior probabilities that c classes exist are P_j, j = 1 ... c. The conditional probability density function of x_i in class j, p(x_i|j;θ), is given by

  p(x_i|j;θ) = p(x_i, j)/P_j   (7)

where p(x_i, j) is the joint probability of x_i and the jth class, and θ is the parameter vector of the conditional probability density functions. The posterior probabilities can be obtained from the Bayes rule:

  P(j|x_i;θ,P) = p(x_i|j;θ) P_j / Σ_(k=1..c) p(x_i|k;θ) P_k   (8)

Here P = (P_1, P_2, ..., P_c) is the prior probability vector. If the parameters of the conditional probability density functions and the prior probabilities are known, the posterior probabilities can be calculated from eq 8. Those parameters can be iterated from the expectation-maximization (EM) algorithm. Let x_1, x_2, ..., x_m be random samples drawn from the probability density function p(x_i, j). Assuming statistical independence between the different samples, the joint probability density function p(X,j;θ,P) can be written as

  p(X,j;θ,P) = Π_(i=1..m) p(x_i, j;θ,P_j)   (9)

with X^T = [x_1^T, x_2^T, ..., x_m^T]. The log likelihood function can be defined as

  L(θ,P) ≡ ln(Π_(i=1..m) p(x_i, j;θ,P_j)) = Σ_(i=1..m) ln(p(x_i, j;θ,P_j))   (10)

Since class j cannot be observed directly, the expectation-maximization algorithm maximizes the expectation of the log likelihood function with respect to the parameters θ and P, given the observed samples:

  max_(θ,P) Q(θ,P) = max_(θ,P) E[Σ_(i=1..m) ln(p(x_i|j;θ) P_j)]   (11)

If a multivariate Gaussian distribution function is used as the conditional probability function,

  p(x_i|j;µ,Σ) = 1/((2π)^(n/2) |Σ_j|^(1/2)) exp(−(1/2)(x_i − µ_j) Σ_j^(−1) (x_i − µ_j)^T)   (12)

with µ_j and Σ_j being the mean vector and covariance matrix of class j. The parameters θ are given by µ = [µ_1, ..., µ_c] and Σ = [Σ_1, ..., Σ_c]. The above optimization problem can be solved by the following iterative algorithm:

E-Step: Calculate the expectation of the log likelihood function as follows:

  Q(µ(t), Σ(t), P(t)) = Σ_(j=1..c) Σ_(i=1..m) P(j|x_i;µ(t), Σ(t)) ln(p(x_i|j;µ(t), Σ(t)) P_j(t))   (13)

  P(j|x_i;µ(t), Σ(t)) = p(x_i|j;µ(t), Σ(t)) P_j(t) / Σ_(k=1..c) p(x_i|k;µ(t), Σ(t)) P_k(t)   (14)

M-Step: Compute the next estimates of the parameters by

  µ_j(t + 1) = Σ_(k=1..m) P(j|x_k;µ(t), Σ(t)) x_k / Σ_(k=1..m) P(j|x_k;µ(t), Σ(t)),  j = 1 ... c   (15)

  Σ_j(t + 1) = Σ_(k=1..m) P(j|x_k;µ(t), Σ(t)) (x_k − µ_j(t))^T (x_k − µ_j(t)) / Σ_(k=1..m) P(j|x_k;µ(t), Σ(t))   (16)

  P_j(t + 1) = (1/m) Σ_(i=1..m) P(j|x_i;µ(t), Σ(t))   (17)

The solution of the maximum log likelihood function is found by repeating the E- and M-steps until every parameter has converged to within a tolerance criterion ε. Since the optimal number of clusters c is not known a priori, multiple runs of EM with various numbers of clusters are typically required. In this paper, the Bayesian information criterion18 (BIC) is used to evaluate the most suitable number of clusters:

  BIC_c = −2Q(θ*, P*) + ν_c ln(m)   (18)

where θ* and P* are the solutions of EM with c clusters, ν_c is the number of fitting parameters, and m is the number of observations. Therefore, adjusting the number of clusters to maximize the expectation of the log likelihood is equivalent to minimizing the BIC.
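A hedged sketch of this clustering step is given below; it relies on scikit-learn's GaussianMixture, which implements the EM iterations of eqs 13-17 and reports a BIC analogous to eq 18 (computed from the observed-data log likelihood rather than the expected complete-data likelihood Q(θ*, P*), so the numerical values differ but the selection logic is the same):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_mixture(T, c_max=10, seed=0):
        """Run EM for c = 1..c_max clusters of the scores T and keep the
        mixture with the smallest BIC, in the spirit of eq 18."""
        models = [GaussianMixture(n_components=c, covariance_type='full',
                                  random_state=seed).fit(T)
                  for c in range(1, c_max + 1)]
        bics = [gm.bic(T) for gm in models]      # -2 ln L + nu_c ln(m)
        return models[int(np.argmin(bics))]

    # gm.predict_proba(T) yields the posteriors P(j|t_i) of eq 14, and
    # gm.means_ / gm.covariances_ / gm.weights_ hold mu_j, Sigma_j, and P_j.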

2.3. Fuzzy TS Modeling. Takagi and Sugeno2 proposed a fuzzy rule-based model, which decomposes a nonlinear system into c fuzzy rules:

  R_i: If t_j is A_i then y_ij = f_i(t_j),  i = 1 ... c, j = 1 ... m   (19)


where t_j is the jth input, which is the score vector from PCA; A_i and f_i, respectively, are the antecedent part and the consequence inferential function of the ith rule (R_i); and y_ij is the output estimated from the ith rule and the jth input. In all of the rules, the structures of the inferential functions are identical, and the parameters differ in order to approximate a nonlinear system.

The antecedent parts of a TS model are generated from an FCM-like algorithm in most fuzzy modeling applications.19 However, the memberships of an observation across all classes must sum to 1, which forces new data to be classified into the known groups even when outliers are observed. In this paper, the antecedents of the TS model are generated from Bayesian classification, which provides posterior probabilities instead of memberships. The advantage is that outliers among new observations can be identified from the conditional probabilities of the new data. Such outliers should be grouped into unknown events for the next model update instead of being used to estimate outputs from the current model. Therefore, the antecedent part for each rule can be written as

  A_i = {t_j | p(t_j|i) ≥ p_thr and P(i|t_j) ≥ P_thr} for i = 1 ... c   (20)

where p_thr and P_thr are the predefined thresholds for the conditional and posterior probabilities, respectively; the former determines the tolerance of outliers, and the latter represents the shareability of each observation. In the modeling phase, the consequence inferential function for each rule is considered as a linear model, represented as

  y_ij = (t_j − µ_i) B_i + E_i   (21)

where µ_i is the center of the ith cluster and E_i is the respective regression error. The regression matrix B_i can be estimated from the input/output data belonging to the ith cluster:

  B_i = (C_i^T C_i)^(−1) C_i^T Y_i,  C_i ≡ T_i − 1µ_i   (22)

  T_i^T = [t_j^T t_(j+1)^T ...],  t_j ∈ {t_j | p(t_j|i) ≥ p_thr and P(i|t_j) ≥ P_thr}
  Y_i^T = [y_j^T y_(j+1)^T ...]

where y_j is the output corresponding to t_j. In the prediction phase, the output of the ith rule can be estimated from the scores of the new inputs and the regression matrix B_i:

  R_i: If t_new is A_i then y_i,pred = (t_new − µ_i) B_i   (23)

where y_i,pred is the inference of the outputs using the ith rule and the new scores (t_new). The predictions can be obtained by Bayesian model averaging:

  y_pred = Σ_(i=1..c) P(i|t_new) y_i,pred / Σ_(i=1..c) P(i|t_new)   (24)

Since the dimensions of the PCA subspace are determined only by the inputs, the correlations between inputs and outputs are not taken into account. Therefore, the number of PCs is determined by cross-validation. The reference data are split into training and test sets. The relative root-mean-square errors (RRMSEs) of the test set are computed as a function of the number of PCs retained in the regression model formed with the training set:

  RRMSEs ≡ [(1/m_test) Σ_(i=1..m_test) ((y_i − ŷ_i)/y_i)^2]^0.5 × 100%   (25)

The optimum number of PCs is the one that produces the minimum RRMSEs of the test set. However, this method will determine more PCs than would be retained using the input data alone.
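To make the prediction step concrete, the sketch below implements eqs 23 and 24 with the outlier and sharing thresholds of eq 20; the rule list and its field names (prior, mu, Sigma, B) are illustrative assumptions, not the paper's notation:

    import numpy as np
    from scipy.stats import multivariate_normal

    def ts_predict(t_new, rules, p_thr=1e-4, P_thr=0.1):
        """Bayesian-model-averaged prediction (eq 24) over the local linear
        models (eq 23); each rule r carries its cluster prior, center mu,
        covariance Sigma, and regression matrix B."""
        dens = np.array([multivariate_normal.pdf(t_new, r['mu'], r['Sigma'])
                         for r in rules])
        if np.all(dens < p_thr):                 # eq 20: outlier, no rule fires
            raise ValueError('outlier: no rule explains the new scores')
        post = dens * np.array([r['prior'] for r in rules])
        post = post / post.sum()                 # posteriors P(i|t_new)
        # the largest posterior is at least 1/c, so with c <= 10 and
        # P_thr = 0.1 at least one rule always fires
        fired = [(w, r) for w, r in zip(post, rules) if w >= P_thr]
        num = sum(w * ((t_new - r['mu']) @ r['B']) for w, r in fired)
        den = sum(w for w, _ in fired)
        return num / den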

3. Update Inferential Model

The PCA subspace has to be reconstructed in order to explain the data of new events. Liu20 proposed a method that updates the trained clusters to the new PCA subspace, so that eventually only the data of new events need to be clustered on the new subspace. Since the updated clusters may be compatible with the new clusters, the similarities between clusters have to be measured. If the similarity degree is high enough, the compatible clusters need to be merged. After that, the posterior probabilities are recalculated from all of the clusters. Additionally, the TS models are adapted for a time-varying process. The regression matrices of all rules are not only updated but also corrected using the recursive least-squares algorithm with the additional information.

3.1. Update Clusters to the New Subspace. Consider the addition of the data of new events to the original data matrix. The number of observations increases from m to m*. The data matrix W*^T = [W^T W_new^T] is normalized as X* = (W* − 1W̄*)S*^(−1). The mean vector and the standard deviation matrix are W̄* and S*. The loading vectors P* and corresponding score vectors T* can be determined by assuming E* ≈ 0. The center (µ_j) and the covariance (Σ_j) of any class j in the original subspace can be transferred to the new subspace:

  µ_j* ≈ µ_j A + ∆W̄ S*^(−1) P*   (26)

  Σ_j* ≈ A^T Σ_j A   (27)

  A ≡ P^T S S*^(−1) P*,  ∆W̄ ≡ W̄ − W̄*

The jth center (µ_j*) and covariance (Σ_j*) can thus be transferred to the new subspace through linear operations with a coordinate rotation (A) and a shift (∆W̄ S*^(−1) P*), without re-clustering the trained data on the new subspace.

3.2. Merging Compatible Clusters. Frigui and Krishnapuram21 proposed a measure of cluster similarity, which is used here to decide whether clusters should be merged. The measure depends on the degree of sharing of each observation among all clusters. Let G_i = {t_k* ∈ T* | p(t_k*|i) ≥ p_thr and P(i|t_k*) ≥ P_thr} denote the set of all score vectors belonging to the ith cluster, where T* contains the score vectors of all observations on the new subspace. The similarity measure between cluster i and cluster j can be defined as

  SIM_ij = 1 − [Σ_(t_k* ∈ G_i ∪ G_j) |P(i|t_k*) − P(j|t_k*)|] / [Σ_(t_k* ∈ G_i) P(i|t_k*) + Σ_(t_k* ∈ G_j) P(j|t_k*)]   (28)

When SIM_ij = 1, cluster i and cluster j are identical. On the other hand, if SIM_ij = 0, cluster i and cluster j are disjoint, i.e., G_i ∩ G_j = Ø. A threshold value SIM_thr is used to determine whether two clusters should be merged. When SIM_ij ≥ SIM_thr, cluster i and cluster j are merged into a new cluster with the center and covariance as follows:

  µ_i* = Σ_(t_k* ∈ G_i ∪ G_j) P(i|t_k*) t_k* / Σ_(t_k* ∈ G_i ∪ G_j) P(i|t_k*)   (29)

  Σ_i* = Σ_(t_k* ∈ G_i ∪ G_j) P(i|t_k*) (t_k* − µ_i*)^T (t_k* − µ_i*) / Σ_(t_k* ∈ G_i ∪ G_j) P(i|t_k*)   (30)
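The following sketch illustrates the cluster transfer of eqs 26-27 and the similarity measure of eq 28; all names are illustrative, and the diagonal standard deviation matrices are passed as full arrays for simplicity:

    import numpy as np

    def transfer_cluster(mu, Sigma, P, S, P_new, S_new, dW):
        """Move a cluster center and covariance from the old PCA subspace to
        the new one (eqs 26-27); dW is the old mean minus the new mean of W."""
        S_new_inv = np.linalg.inv(S_new)
        A = P.T @ S @ S_new_inv @ P_new          # coordinate rotation
        shift = dW @ S_new_inv @ P_new           # coordinate shift
        return mu @ A + shift, A.T @ Sigma @ A

    def similarity(post_i, post_j, in_i, in_j):
        """SIM_ij of eq 28; post_i/post_j are posterior vectors over all
        scores, and in_i/in_j are boolean masks for membership in G_i, G_j."""
        union = in_i | in_j
        num = np.abs(post_i[union] - post_j[union]).sum()
        den = post_i[in_i].sum() + post_j[in_j].sum()
        return 1.0 - num / den

Cluster pairs with SIM_ij above SIM_thr would then be merged with eqs 29-30 before a few EM iterations restabilize the mixture.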

After the merging procedure, a number of EM iterations are performed to stabilize the structure of the TS model.

3.3. Adaptive TS Model. There are two steps in adapting the TS model. First, the regression matrix B_i of rule i is updated to the new subspace. After that, the recursive least-squares algorithm and the additional (m* − m) observations are used to correct the regression matrices. The new quality data Y_new are classified into each class; let Y_i* equal the quality data of the ith class (i.e., Y_i*^T = [Y_i^T Y_i,new^T], i = 1 ... c). Assuming that there are m_i observations in the ith class before adding the new data, the regression function is

  Y_i,(m_i) = (T*_i,(m_i) − 1µ_i*) B*_i,(m_i) + E_i,(m_i)   (31)

where E_i,(m_i) is the regression error. From the above equation, the regression matrix in the new subspace (B*_i,(m_i)) can be derived:

  B*_i,(m_i) = (C*_i,(m_i)^T C*_i,(m_i))^(−1) C*_i,(m_i)^T Y_i,(m_i),  C*_i,(m_i) ≡ T*_i,(m_i) − 1µ_i*   (32)

where C*_i,(m_i) can be obtained from the previous subspace scores and eq 26:

  C*_i,(m_i) = (T_i,(m_i) − 1µ_i) A   (33)

Therefore, the regression matrix in the new subspace before adding the new information can be directly transferred from the previous subspace:

  B*_i,(m_i) = A^(−1) B_i,(m_i)   (34)

After updating the TS model of the old event data to the new subspace, the regression matrices need to be corrected using the additional data. Assume that there are ∆m_i observations added to class i, ∆m_i ≡ m_i* − m_i. The matrices are corrected using the new data and the recursive least-squares algorithm:22

  B*_i,(m_i*) = B*_i,(m_i) + D (Y_(∆m_i) − (T*_(∆m_i) − 1µ_i*) B*_i,(m_i))   (35)

  D ≡ (C*_i,(m_i)^T C*_i,(m_i))^(−1) C*_(∆m_i)^T (I + C*_(∆m_i) (C*_i,(m_i)^T C*_i,(m_i))^(−1) C*_(∆m_i)^T)^(−1),  C*_(∆m_i) ≡ T*_(∆m_i) − 1µ_i*

where B*_i,(m_i) and B*_i,(m_i*) are the regression matrices with m_i and m_i* observations, respectively.
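A compact sketch of this correction step follows; it carries (C*^T C*)^(−1) along between updates via the matrix inversion lemma (the cached inverse is an implementation choice, not part of the paper's notation):

    import numpy as np

    def rls_update(B, CtC_inv, T_new, Y_new, mu):
        """Correct a local regression matrix with the Delta-m new
        observations of a class (eq 35); CtC_inv caches (C*^T C*)^-1."""
        C_new = T_new - mu                       # centered new scores
        G = np.linalg.inv(np.eye(len(C_new)) + C_new @ CtC_inv @ C_new.T)
        D = CtC_inv @ C_new.T @ G                # gain matrix D of eq 35
        B = B + D @ (Y_new - C_new @ B)
        # matrix inversion lemma keeps the cached inverse current
        CtC_inv = CtC_inv - D @ C_new @ CtC_inv
        return B, CtC_inv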

4. Proposed Approach

The proposed approach consists of three phases to model and predict a large-scale nonlinear system. In the modeling phase, a PCA subspace is constructed with historical data as a dimension reduction tool in order to remove the high-dimensionality, collinearity, and over-fitting problems of the Bayesian classification. The normal operating condition scores are clustered into several groups (i.e., a nonlinear system is split into a series of local linear subsystems). Each cluster represents a quasi-steady-state operating condition, in which a linear model is built using the input-output data belonging to the cluster. In the online prediction phase, new inputs are projected onto the PCA subspace to evaluate the statistics Q and T². Since online data may not belong to the PCA subspace, if either statistic exceeds its control limit, fault detection alarms should be triggered instead of predicting unreliable outputs. When new event data are collected, which may correspond to the production of new grades, the PCA model should be reconstructed in order to cover the new events. In the post-updating phase, only the new event data are clustered on the new PCA subspace; the existing clusters and local linear models are directly transferred to the new subspace by the proposed method.

4.1. Modeling Phase. The scores of the input variables belonging to normal operating conditions are clustered into c classes by EM iterations (i.e., eqs 13-17). The optimal number of clusters is determined by minimizing BIC_c in eq 18. An observation is neglected during an iteration if all of its conditional probabilities are less than the predefined threshold, in order to release outliers, which cause unstable clusters. The threshold of the conditional probability is an empirical value that depends on the data quality of the reference data set. The higher the conditional probability threshold, the more compact the clusters that can be obtained, whereas outliers and data of sparse production grades will be ignored. It is worth noting that the output variables are not included in the EM iterations. Since their data are not available in the online prediction phase, they cannot be used to split a nonlinear system into several local subsystems.

The different production grades may result in some clusters being partially overlapped and some being well separated. In order to build a robust and reasonable linear model, the input-output data are used when their posterior probabilities are greater than a predefined threshold. If the posterior probability threshold is low, more input-output data are used to build a local model. In the limiting case, if the threshold is set to zero, a local model is built with almost all of the training data, which defeats the purpose of splitting a nonlinear system into several linear subsystems. On the other hand, if the threshold is too high (i.e., approaching one), each local model becomes too sensitive because only data near the cluster centers are used. This may cause the predictions around the cluster boundaries to be infeasible. In this paper, the posterior probability threshold is set to 0.1 so that the data around the cluster boundaries can be used for regression, while the nonlinear system is split with fuzzy boundaries. When an observation belongs to several groups, the local predictions are averaged, using eq 24, into the predicted outputs.

Figure 1 shows the flow diagram of the modeling phase. Since the number of retained PCs affects the capability of model prediction, it is necessary to split the reference data into training and test sets and gradually increase the number of PCs used to build the TS model from the training data until the RRMSEs of the test data are minimized.

Figure 1. Flow diagram of the proposed method for modeling phase.
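This cross-validation loop can be sketched as follows; build_model stands in for training the TS model on the first K score columns and returning a predictor (an assumption for illustration, not the paper's interface):

    import numpy as np

    def rrmse(y, y_hat):
        """Relative root-mean-square error of eq 25, in percent."""
        return 100.0 * np.sqrt(np.mean(((y - y_hat) / y) ** 2))

    def select_n_pcs(build_model, T_train, y_train, T_test, y_test, K_max):
        """Increase the number of retained PCs and keep the count that
        minimizes the test-set RRMSEs."""
        errors = []
        for K in range(1, K_max + 1):
            predict = build_model(T_train[:, :K], y_train)
            errors.append(rrmse(y_test, predict(T_test[:, :K])))
        return int(np.argmin(errors)) + 1, errors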
4.2. Online Prediction. In real plant operations, new events occur that are unaccounted for by the training data set. An appropriate prediction model should not only predict the final qualities but also inform the user about the reliability of the predictions. The statistics Q and T², obtained from projecting new data onto the PCA subspace, can be used for online process monitoring. When either statistic is out of its control limit, the online observation is not consistent with the reference data, and alarms should be triggered instead of predicting unreliable outputs. Even if both statistics are within their control limits, the predictions are still not reliable when the input scores are outliers with respect to the existing clusters. Only when the observations are within the PCA subspace and can be classified into the known groups according to the posterior probabilities can the local outputs be estimated from the consequence parts of the TS model. A smoothed prediction is then determined from these outputs with Bayesian model averaging. Figure 2 shows the flow diagram of the online prediction phase, in which points A and B respectively represent the data flows of new events and known groups.

Figure 2. Flow diagram of the proposed method for online prediction phase.

4.3. Post-Updating Model. The flow diagram of the post-updating phase is shown in Figure 3. The PCA subspace has to be rebuilt to cover the data of new events. After that, only the new event data are clustered on the new subspace. The existing clusters on the original subspace can be updated to the new subspace utilizing the algorithm described in section 3.1. To evaluate the overlap of the updated and new clusters, the similarities among clusters need to be measured. If the similarity degree is high enough, compatible clusters are merged to reduce the total number of clusters. Once the number of clusters has changed, a number of EM iterations are needed to stabilize the parameters of the clusters. New output data are classified into groups according to the posterior probabilities of the corresponding inputs being greater than the posterior probability threshold. The regression matrices of the updated models are corrected with the additional information, as described in section 3.3. On the other hand, the new local models are generated with the input/output data of the new events.

Figure 3. Flow diagram of the proposed method for post-updating phase.

5. Illustrative Examples

5.1. Simulation Example. The simulation example is from Bang et al.,23 and it demonstrates that the proposed method can cope not only with the collinearity of the input variables and the nonlinearity between the input and output variables but also with time-varying characteristics. The original example is as follows:

  z = 1, 2, 3, ..., 1000
  x_1 = z + v_1, where v_1 ∼ N(0, 2.888 × 10)
  x_2 = z^2 + v_2, where v_2 ∼ N(0, 2.986 × 10^4)
  x_3 = z^0.5 + v_3, where v_3 ∼ N(0, 7.447 × 10^(−1))
  x_4 = z^(−0.5) + v_4, where v_4 ∼ N(0, 3.029 × 10^(−2))
  y = 1 − e^(−0.005z) + v_5, where v_5 ∼ N(0, 5 × 10^(−2))

where x_1 to x_4 are the input variables, y is the output variable, z is a latent variable, and N(µ,σ) is a normal distribution with mean µ and standard deviation σ. In the original example, the latent variable came from a uniform distribution. It was modified in this example in order to represent different operating conditions: the latent variable was generated from five normal distributions. The means and standard deviations are listed in Table 1. There were 200 training data and 20 test data generated from each normal distribution. In Table 1, the first training data set is denoted as training 1, and the first test data set is denoted as test 1; these are used to evaluate the model prediction capabilities. In addition, the data of test 2 and test 3 were used to evaluate the interpolation and extrapolation capabilities of the models, respectively. The data labeled training 2 were generated to represent new operating conditions.

Table 1. Latent Variable z Is Generated from Five Normal Distributions

  normal distribution   training data   test data
  N(300, 25)            training 1      test 1
  N(700, 25)            training 1      test 1
  N(900, 25)            training 1      test 1
  N(500, 25)            training 2      test 2
  N(100, 25)            training 2      test 3

Figure 4 plots the output variable y versus the input variables x_1 and x_3, in which black and white symbols respectively represent the data of training 1 and training 2. The training 1 data set was used to fit a fuzzy partial least-squares (FPLS23) model. The inner regression of the first scores is shown in Figure 5a. It shows that the FPLS effectively interpolated the data generated from N(500, 25), which was one of the training 2 data sets; however, it failed to extrapolate to the other training 2 data set, from N(100, 25). This is expected: no data-driven model, whether linear or nonlinear, can explain regions outside the reference data set. The data of training 1 and training 2 were then combined into a reference data set to which FPLS was applied. The inner regression of the first scores is shown in Figure 5b, in which one latent variable was capable of explaining the nonlinearities of the input-output data. The drawback is that all of the data had been used to rebuild the FPLS model, which is inevitable in general for a global nonlinear modeling method.
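For reference, the data generation just described can be sketched as follows (NumPy assumed; recall that N(µ,σ) denotes mean and standard deviation):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(z):
        """Inputs x1..x4 and output y of the section 5.1 example."""
        x = np.column_stack([
            z + rng.normal(0.0, 2.888e1, z.size),
            z ** 2 + rng.normal(0.0, 2.986e4, z.size),
            z ** 0.5 + rng.normal(0.0, 7.447e-1, z.size),
            z ** -0.5 + rng.normal(0.0, 3.029e-2, z.size),
        ])
        y = 1.0 - np.exp(-0.005 * z) + rng.normal(0.0, 5e-2, z.size)
        return x, y

    # training 1: 200 samples from each of N(300, 25), N(700, 25), N(900, 25)
    z_train1 = np.concatenate([rng.normal(m, 25.0, 200) for m in (300, 700, 900)])
    X_train1, y_train1 = make_data(z_train1)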

Comparisons of the prediction performance of PLS, FPLS, and the proposed method are shown in Figure 6a-c, in which the black symbols represent the test data from the test 1, test 2, and test 3 data sets and the white symbols are the predictions of the models built from the training 1 data set. The prediction performance of the three methods was comparable for the test 1 and test 2 data. However, all of the methods failed to predict the test 3 data because those data were not within the regions defined by the reference data set. In contrast with the other methods, the proposed method uses the statistics T² and Q to identify the regions of the inputs before predicting the outputs. As Figure 7 shows, the statistics T² and Q of test 3 were out of the control limits. Therefore, the input alarm was triggered instead of predicting infeasible outputs.

In the post-updating model phase, the training 2 data were included. The recursive PLS (RPLS24) and the proposed method did not need to recalculate the trained data, in contrast to FPLS, where the model had to be reconstructed using the training 1 and 2 data sets. The prediction results of the three methods are shown in Figure 6a-c with gray symbols. The detailed prediction performance and the captured variances of the reference data sets are listed in Table 2. Since FPLS is a nonlinear technique, the variations of the reference data can be effectively captured by one nonlinear latent variable. In contrast, the proposed method and PLS need two LVs in order to capture the nonlinear variations of the reference data. Comparing the prediction performance of the models built from the training 1 data set, the proposed method is better than PLS and comparable to FPLS, since the proposed method uses three sets of local linear models to approximate the different regions, in contrast with PLS, which only uses a global linear model. When the models were updated with the training 2 data set, RPLS still had one set of linear regression parameters, and its prediction performance was worse than that of the other methods. Comparing the prediction performance of the proposed adaptive model and the reconstructed FPLS in Table 2, the proposed method is better: since it consists of five local linear models, it is more flexible than the FPLS approach with three fuzzy rules.

Figure 6. Comparisons of the prediction performance with different approaches: (a) PLS and RPLS, (b) FPLS and reconstructed FPLS, (c) proposed method and adaptive model.

Figure 4. Simulation data plots of the output variable y vs the input variables x1 and x3.


Figure 5. First score plots of FPLS: (a) using the data of training set 1; (b) using all of the data of both training data sets.

5.2. Industrial Application. The important product quality variables of polyethylene (PE) are density and melt index (MI). Kiparissides et al.25 stated that the density of PE is inversely proportional to the reaction conditions, whereas MI depends exponentially on the operating conditions. Therefore, the variation of MI is more significant than that of density. This is consistent with real plant experience, in which maintaining the MI specification is more difficult than maintaining density. Stabilizing the variation of MI relies on online analyzers, which are expensive and unreliable, and therefore not popular in plants. In the absence of online analyzers, laboratory measurements of MI are performed every 2-4 h, making regulation of the operating conditions more difficult. Therefore, it is desirable to develop a virtual sensor model that uses the process variables to predict MI for quality control. Such a virtual sensor model must be able to describe the nonlinear relation between the process variable inputs and MI, be applicable to different grades of PE, and be easily updated to accommodate the time-varying nature of the plant.

The polyethylene plant studied in this paper is located in Kaohsiung, Taiwan. The process flow diagram is shown in Figure 8. A detailed description can be found in previous work.20 The 14 inputs of the process model, which are listed in Table 3, were chosen according to the plant experts' advice. The process final quality was the laboratory MI, which was measured every 4 h. The data of the 14 inputs and 1 output were collected over 1 month for the training data set. There were 150 observations in the data set, covering three different production grades, labeled as 1, 2, and 3. Figure 9 shows the measured values of the reactor temperatures, pressure, and laboratory MI, in arbitrary units for confidentiality. The reactor bottom temperature and pressure were regulated at different set points in grade 1, as shown in Figure 9, panels a and b, respectively. Since different types of modifiers were used at that time, the operators adjusted the operating conditions in order to maintain the MI within the specifications.


Figure 7. Statistic Q and T2 of test data.

Figure 8. Process flow diagram of polyethylene autoclave process.

Figure 9. Plots of process variables and final quality in the training data set: (a) reactor temperatures and measured MI; (b) reactor pressure and measured MI.

Table 2. Comparisons of Captured Variances of Reference Data and Prediction Performance with the Different Approaches

                        captured variances (%)        RRMSEs (%)
  method               LVs    X      y        test 1   test 2   test 3
  PLS                   2    98.6   73.4       6.49     7.93    56.6
  RPLS                  2    91.0   85.6       8.61     7.12    16.5
  FPLS                  1    77.0   77.6       5.65     4.95    67.6
  reconstructed FPLS    1    81.4   91.6       6.34     4.35    15.9
  proposed method       2    98.8   78.7       5.63     4.13    38.6
  adaptive model        2    97.8   94.9       5.4      4.26    11.5

Table 3. Variable Descriptions in the Polyethylene Process

  abbreviation   description
  FI-001         high-purity ethylene flow rate
  FIC-001        modifier (M1) flow rate
  FIC-002        modifier (M2) flow rate
  FI-002         reactor feeding flow rate
  TIC-001        reactor feeding temperature
  TIC-002        reactor top temperature
  TIC-003        reactor middle temperature
  TIC-004        reactor bottom temperature
  PIC-001        reactor pressure
  FI-003         initiator flow rate for regulating TIC-002
  FI-004         initiator flow rate for regulating TIC-003
  FI-005         initiator flow rate for regulating TIC-004
  FIC-003        low-pressure recycle flow rate
  FIC-004        purge flow rate

The cross-validation procedure in section 2.3 was used to determine the optimal number of principal components (PCs). There were 145 observations that could be explained by the PCA subspace, in which the captured variances were 91.68% with 4 PCs within the 99% confidence limits. The clustering results of the projection of the first two score vectors are shown in Figure 10. The solid line for each group represents conditional probabilities equal to 10^(−4), which was the conditional probability threshold set for outliers in this study. The data of grades 1 and 3 were clustered into groups C2 and C5 and groups C1 and C4, respectively. For grade 3, as shown in Figure 9b, the reactor pressure was regulated at different set points during production. The regression matrix for each linear model was generated according to the data from the group where the posterior probabilities of the observations were greater than 0.1. The RRMSEs were 9.16% for the training data set using the proposed method. The comparisons of the regression results of this method and partial least-squares (PLS) are shown in Figure 11. It is obvious that one set of parameters in PLS cannot properly explain the outputs for three different grades of products; the corresponding RRMSEs are 31.81%.

Figure 10. Clustering results on the t1-t2 plane of the PCA subspace.

Figure 11. Comparison of the regression results with the proposed method and PLS.

An additional 153 observations, from the month following the period of the above training data, were collected for the first test data set. The statistics Q and T² of the new data are shown in Figure 12. Only when Q < Q_α and T² < T²_α do the new data belong to the PCA subspace within the (1 − α) confidence limits. In this data set, there were only 88 observations that could be properly explained by the PCA subspace. The reason is that a new grade emerged after observation number 113. The data belonging to the PCA subspace were classified into three groups, as shown in Figure 13, in which the conditional probabilities were greater than the outlier threshold. It is worth noting that some of the observations of grade 1 were shared by C2 and C5. The Bayesian model averaging provided smooth switching between the two operating regions. The RRMSEs of the predictions within the PCA subspace were 12.84% for the proposed method and 53.12% for PLS, as shown in Figure 14.

Figure 12. Statistic Q and T2 for the first test data set.

Figure 13. Classification of data within the PCA subspace from the first data set.

The new PCA subspace was generated using the training and the first test data sets in order to explain the data of all events.

Figure 14. Comparison of the prediction results for the data within the PCA subspace using the proposed method and PLS.

Table 4. Similarity Measurements between Classes

       C1     C2     C3     C4     C5     C6     C7     C8
  C1   1.00   0.00   0.03   0.01   0.00   0.00   0.00   0.03
  C2          1.00   0.00   0.01   0.14   0.07   0.00   0.00
  C3                 1.00   0.00   0.01   0.00   0.00   0.13
  C4                        1.00   0.01   0.01   0.00   0.00
  C5                               1.00   0.20   0.00   0.00
  C6                                      1.00   0.00   0.00
  C7                                             1.00   0.00
  C8                                                    1.00

The captured variances were 90.53% with 4 PCs, and there were 290 observations within the new subspace. The cluster parameters on the previous subspace were updated to the new subspace; only the data of new events were clustered on the new subspace. Figure 15a displays the first two scores, which can be explained by the previous subspace, and the respective updated clusters on the new subspace. The clustering results of the new events are shown in Figure 15b. The new groups are labeled as C6, C7, and C8 and are represented as solid lines. The similarity measurements between the new and updated clusters are listed in Table 4. The similarities of C5 and C6 and of C3 and C8 are higher than those of the other pairs, which can also be observed in Figure 15b, but they do not exceed the threshold value for merging clusters, which was set to 0.9 in this study. Therefore, it was not necessary to merge clusters in this case. The regression matrices of the previous local models were updated to the new subspace and afterward were corrected using the recursive least-squares algorithm with the additional information. The regression matrices of the new clusters were generated according to the input-output data that were classified into the clusters with posterior probabilities of the observations greater than 0.1, which was the threshold for data sharing in this study.

Figure 15. Clustering results on the reconstructed PCA subspace: (a) updating cluster parameters from the previous subspace; (b) clustering data of new events on the new PCA subspace. The new clusters are represented as solid lines.

Figure 16. Comparison of the regression results with the proposed method and RPLS.

Figure 17. Statistic Q and T2 for the second test data set.

Figure 18. Projection of first two scores for classifying data within the PCA subspace from the second test data set.

Figure 16 shows the comparisons of the regression results with the proposed method and the recursive PLS algorithm. The RRMSEs were 10.07% and 44.10%, respectively. The RPLS reduced the RRMSEs from 53.12% to 44.10%, showing that the model was adapted with the new information, but it still could not tackle the input-output nonlinearity.

A second test data set was collected following the first one, over a period of 2 months. There were 291 observations in the data set, of which 283 observations could be explained by the PCA subspace, as shown in Figure 17. The data within the PCA subspace can be classified into the known clusters, in which the conditional probabilities are greater than 10^(−4), as shown in Figure 18. Groups C2, C5, and C6 shared the new data of grade 1. Comparisons of the prediction capabilities of the proposed method and RPLS are shown in Figure 19. The RRMSEs were 9.51% and 33.05%, respectively. It is obvious that one set of linear parameters cannot effectively deal with a nonlinear system with multiple operating conditions.

Figure 19. Comparison of the prediction results for the data of the second test data set.

6. Conclusions

An adaptive TS model on the PCA subspace for a large-scale nonlinear system is proposed. The collinearity of the process inputs is eliminated through PCA; meanwhile, outliers of the input data can be filtered out. The nonlinear system is split into a collection of local linear subsystems that are clustered from the compressed data. Due to the time-varying nature of


real plants, the PCA subspace has to be reconstructed in order to cover the data of new events. The Bayesian classification and the TS model are updated to the new subspace, and the consequence functions are corrected with the additional information. In the online prediction phase, the process inputs should be examined to determine whether the observations belong to the operating regions of the reference data set. If not, alarms should be triggered instead of misleading the operators with unreliable predictions. The proposed method is applied to MI prediction for a polyethylene plant. The results show that the nonlinear and time-varying characteristics of the polyethylene process can be dealt with effectively.

Acknowledgment

This work was supported by the National Science Council, Republic of China, under Grant NSC-94-2213-E-268-001.

Note Added after ASAP Publication. The version of this paper that was published on the Web 12/16/2006 had a minor typographical error in section 2.3 and minor errors involving parameter B in eq 34. The corrected version of this paper was reposted 12/21/2006.

Nomenclature

A = linear operator that rotates the coordinates of the subspace
A_i = antecedent part for the ith rule in the TS model
B_i = regression matrix for the ith rule in the TS model
B_i* = regression matrix for the ith rule in the TS model on the new subspace
c = number of classes
G_i = set of all score vectors belonging to the ith cluster
E = residual part of the data matrix
f_i = consequence inferential function for the ith rule in the TS model
L(θ,P) = log likelihood function
K = number of principal components
m_i = number of observations in the ith class
m_i* = number of observations in the ith class including data of new events
∆m_i = number of new event observations in the ith class
n = number of variables
P = loading matrix (in section 2.1)
P* = loading matrix in the new PCA subspace
P = prior probability matrix (in section 2.2)
P(j|x_i) = posterior probability of x_i in class j

P_j = prior probabilities
p(x_i|j;θ) = conditional probability density function of x_i given class j and the parameters θ
p(x_i, j) = joint probability density function of data x_i and class j
Q_α = confidence limit of the statistic Q corresponding to the (1 − α) percentile
Q(θ,P) = expectation of the log likelihood function with respect to the parameters θ and P
R_i = ith fuzzy rule
S = diagonal matrix of standard deviations
S* = diagonal matrix of standard deviations with new event data
SIM_ij = similarity measure between cluster i and cluster j
T = score matrix
T* = score matrix in the new PCA subspace
T²_α = confidence limit of the statistic T² corresponding to the (1 − α) percentile
t_i = ith score vector
W = process input data matrix
W* = process input data matrix with new event data
W̄ = mean vector of the process data
W̄* = mean vector of the process data with new event data
X = normalized data matrix
X* = normalized data matrix with new event data
X̂ = systematic part of the data matrix
x = normalized data vector
Y = quality data matrix
y_ij = outputs estimated from the ith rule and the jth inputs

Greek Letters

Λ = diagonal matrix of eigenvalues
Σ = covariance matrix of the normalized process data (in section 2.1)
Σ_i = covariance matrix of the Gaussian distribution of class i (in section 2.2)
σ_i = standard deviation of variable i
σ_i* = standard deviation of variable i with new events
θ = parameters of the conditional probability density functions
µ_i = mean vector of the Gaussian distribution of class i
µ_i* = mean vector of the Gaussian distribution of class i on the new subspace
ν_c = number of estimated parameters of EM

Literature Cited

(1) Russell, E.; Chiang, L. H.; Braatz, R. D. Data-Driven Methods for Fault Detection and Diagnosis in Chemical Processes; Springer-Verlag: London, 2000.
(2) Takagi, T.; Sugeno, M. Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Syst., Man Cybernetics 1985, 15, 116-132.
(3) Zadeh, L. A. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. Syst., Man Cybernetics 1973, 3, 28-44.
(4) Wang, L. Universal approximation by hierarchical fuzzy systems. Fuzzy Sets Syst. 1998, 93, 223-230.
(5) Raju, G. V. S.; Zhou, J.; Kisner, R. A. Hierarchical fuzzy control. Int. J. Control 1991, 54, 1201-1216.
(6) Ravi, V.; Reddy, P. J.; Zimmermann, H. J. Pattern classification with principal component analysis and fuzzy rule bases. Eur. J. Operational Res. 2000, 126, 526-533.
(7) Bezdek, J. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, 1981.
(8) Gustafson, D.; Kessel, W. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the IEEE Conference on Decision and Control, San Diego, CA, 1979; pp 761-766.
(9) Krishnapuram, R.; Keller, J. M. A possibilistic approach to clustering. IEEE Trans. Fuzzy Syst. 1993, 1, 98-110.


(10) Krishnapuram, R.; Keller, J. M. The possibilistic c-means algorithm: insights and recommendations. IEEE Trans. Fuzzy Syst. 1996, 4, 385-393.
(11) Marsili-Libelli, S.; Müller, A. Adaptive fuzzy pattern recognition in the anaerobic digestion process. Pattern Recognit. Lett. 1996, 17, 651-659.
(12) Marsili-Libelli, S. Adaptive fuzzy monitoring and fault detection. Int. J. COMADEM 1998, 1, 31-37.
(13) Teppola, P.; Mujunen, S.; Minkkinen, P. Adaptive fuzzy c-means clustering in process monitoring. Chemom. Intell. Lab. Syst. 1999, 45, 23-38.
(14) Yoo, C. K.; Vanrolleghem, P. A.; Lee, I. Nonlinear modeling and adaptive monitoring with fuzzy and multivariate statistical methods in biological wastewater treatment plants. J. Biotechnol. 2003, 105, 135-163.
(15) Theodoridis, S.; Koutroumbas, K. Pattern Recognition; Academic Press: San Diego, 1999.
(16) Medasani, S.; Krishnapuram, R. Categorization of image databases for efficient retrieval using robust mixture decomposition. Computer Vision Image Understanding 2001, 83, 216-235.
(17) Jackson, J. E. A User's Guide to Principal Components; Wiley: New York, 1991.
(18) Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461-464.
(19) Babuška, R. Fuzzy Modeling for Control; Kluwer Academic Publishers: New York, 1998.
(20) Liu, J. Process monitoring using Bayesian classification on PCA subspace. Ind. Eng. Chem. Res. 2004, 43, 7815-7825.
(21) Frigui, H.; Krishnapuram, R. A robust algorithm for automatic extraction of an unknown number of clusters from noisy data. Pattern Recognit. Lett. 1996, 17, 1223-1232.
(22) Ljung, L. System Identification: Theory for the User; Prentice Hall: Englewood Cliffs, NJ, 1999.
(23) Bang, Y. H.; Yoo, C. K.; Lee, I. Nonlinear PLS modeling with fuzzy inference system. Chemom. Intell. Lab. Syst. 2003, 64, 137-155.
(24) Qin, S. J. Recursive PLS algorithms for adaptive data modeling. Comput. Chem. Eng. 1998, 22, 503-514.
(25) Kiparissides, C.; Verros, G.; MacGregor, J. F. Mathematical modeling, optimization, and quality control of high-pressure polymerization reactors. J. Macromol. Sci., Rev. Macromol. Chem. Phys. 1993, C33 (4), 437-527.

(19) Babusˇkia, R. Fuzzy Modeling for Control; Kluwer Academic Publishers: New York, 1998. (20) Liu, J. Process monitoring using Bayesian classification on PCA subspace, Ind. Eng. Chem. Res. 2004, 43, 7815-7825. (21) Frigui, H.; Krishnapuram, R. A robust algorithm for automatic extraction of an unknown number of clusters from noisy data. Pattern Recognit. Lett. 1996, 17, 1223-1232. (22) Ljung, L. System Identification Theory for the User; Prentice Hall: Englewood Cliffs, NJ, 1999. (23) Bang, Y. H.; Yoo, C. K.; Lee, I. Nonlinear PLS modeling with fuzzy inference system. Chemom. Intell. Syst. 2003, 64, 137-155. (24) Qin, S. J. Recursive PLS algorithms for adaptive data modeling. Comput. Chem. Eng. 1998, 22, 503-514. (25) Kiparissides, C.; Verros, G.; MacGregor, J. F. Mathematical modeling, optimization, and quality control of high-pressure polymerization reactors. J. M. S-ReV. Macromol. Chem. Phys. 1993, C33 (4), 437-527.

ReceiVed for reView October 16, 2005 ReVised manuscript receiVed September 7, 2006 Accepted November 13, 2006 IE0511534