Process Monitoring Using Bayesian Classification on PCA Subspace

Ind. Eng. Chem. Res. 2004, 43, 7815-7825


Jialin Liu*

Department of Information Management, Fortune Institute of Technology, 125-8, Chyi-Wen Rd., Chyi-Shan 842, Kaohsiung County, Taiwan, Republic of China

A new approach to fault detection and isolation that combines principal component analysis (PCA) and Bayesian classification is proposed. For a given set of training data, a PCA subspace is constructed, and the score vectors are classified by solving the Bayesian classification problem with the deterministic annealing expectation maximization (DAEM) and robust mixture decomposition (RMD) algorithms. If data for new events do not belong to the existing subspace, the PCA subspace must be rebuilt to include the new events. Whereas the new data form their own classes in the new subspace, the existing classifications of the old data must also be updated in the newer subspace. This can be done using a simple translation and rotation if the dimensionality of the new subspace is the same as that of the old subspace. If the new subspace is of higher dimension, a good initial estimate is generated using translation and rotation; this estimate ensures convergence within a few expectation maximization steps. The proposed methodology is applied to the monitoring of a high-pressure polyethylene process with multiple operating modes. Results show that the fault detection capability can be expanded incrementally as operating data accumulate.

1. Introduction

Currently, most complex processes are equipped with distributed control systems (DCSs) and supervisory control and data acquisition (SCADA) systems, and massive amounts of process data are accessible in real time. However, useful information must be mined from such data to improve the performance of the operation. For example, clustering the historical data into different operating states associated with normal and abnormal situations, followed by on-line classification of process data into the predefined states, helps operators monitor the current status of the plant. In chemical processes, there are a large number of measured and controlled variables that are highly correlated.
In addition, measurement noise is inevitable in real plant data. Retaining the most important variables and removing the irrelevant ones through dimension reduction can effectively reduce the complexity of clustering. Principal component analysis (PCA) is a popular method of compressing high-dimensional inputs into a few orthonormal variables. Past researchers1,2 have constructed PCA subspaces from "healthy" plant data for the purpose of fault detection. Ku et al.3 developed the dynamic PCA (DPCA) method for extracting the dynamic correlations between variables in linear systems. Multiway PCA4 rearranges a three-way data matrix into a two-way matrix so that PCA can be applied for batch process monitoring. Functional PCA5,6 projects the trajectories of batches into functional space before the application of PCA to extract the correlation of trajectories between batches. Dong and McAvoy7 combined the principal curve algorithm8 and neural networks to propose a nonlinear PCA (NLPCA) approach for extracting the nonlinear correlations between variables. Because most processes are time-varying, Li et al.9 proposed a recursive PCA (RPCA) algorithm for adaptive process monitoring, in which the PCA model is updated in real time by new normal data to capture normal process changes and suppress false alarms. Research on using PCA for fault detection has been reviewed comprehensively by Venkatasubramanian et al.10

To simplify the classification of high-dimensional data, the score vectors of the PCA subspace are classified instead of the raw data. Teppola et al.11 used the fuzzy c-means (FCM) algorithm with score vectors for monitoring a wastewater treatment plant under different operating conditions. Choi et al.12 applied the credibilistic fuzzy c-means (CFCM) algorithm for continuous monitoring of the emissions of a power plant. Leung and Romagnoli13 combined PCA and a possible cause-and-effect graph (PCEG) to develop a fault diagnosis method. Hwang and Han14 applied a hierarchical clustering algorithm to monitor the different operating modes in the blast process of a steel plant. Akbaryan and Bishnoi15 extracted the features of score vectors through a wavelet-based methodology and then applied a binary decision tree to classify these features for fault diagnosis. For all of the score-based classification methods, the on-line data must be explained by the PCA subspace; if the score vectors do not reasonably represent the on-line data, use of the existing classifications results in misclassification and failure of fault isolation. Chen et al.16 addressed this problem by combining adaptive resonance theory (ART) with PCA: a PCA subspace is created for each cluster, and a new data point is checked to determine whether it belongs to the PCA subspace of an existing cluster. If it does not, the data point is assigned to a new cluster, and the PCA subspace of that cluster is created. Such an approach cannot represent all of the operating states in a single subspace, and the possibility of modifying old classifications with new data is neglected.

* Tel.: 886-7-6618851 ext. 3221. Fax: 886-7-6618850. E-mail: [email protected].
Bayesian classification is extensively applied in many areas of pattern recognition, e.g., bioinformatics and medical informatics,17-19 image analysis,20 and near-infrared (NIR) spectroscopy.21

10.1021/ie0498495 CCC: $27.50 © 2004 American Chemical Society. Published on Web 10/20/2004.

In process monitoring applications, Wang and McGreavy22 used the Bayesian automatic classification methodology developed by NASA to cluster the operating data of a refinery fluid catalytic cracking process into classes corresponding to different operating modes. Chen and Liu23 developed a mixture PCA algorithm that applies heuristic smoothing clustering to determine the number of clusters automatically and constructs a PCA subspace in each cluster for fault detection.

The mixture model, a linear combination of a priori probabilities and conditional probability density functions, is used to evaluate the unknown probabilities in Bayesian classification. Generally, the maximum likelihood (ML) method is used to estimate the parameters of the mixture model: the means and covariances of each density function and the a priori probabilities. The solution of the ML problem, usually known as the expectation maximization (EM) procedure, suffers from three well-known difficulties.24 First, there are many local maxima, which cause local-maximum-seeking algorithms to produce inconsistent solutions from different initial conditions. Second, the number of clusters has to be predefined. Last, outliers are inevitable in real plant data; they distort the parameters of the Gaussian density functions and destabilize the convergence of the parameter optimization.

To overcome the local maximum problem, Ueda and Nakano25 proposed the deterministic annealing expectation maximization (DAEM) algorithm, which minimizes the thermodynamic free energy instead of maximizing the likelihood function. They introduced a "temperature" parameter and an information entropy term in the cost function to smooth out the local maxima in the maximum likelihood function; this temperature is gradually reduced to achieve a global minimum.
Biernacki et al.26 chose the best set of initial conditions for the maximum likelihood parameters by running a few EM steps from multiple initial conditions, to avoid being trapped in suboptimal solutions. Ding et al.27 proposed an adaptive dimension reduction EM (ADREM) algorithm: the dataset is projected to a lower dimension through PCA (or any other dimension reduction method), and EM is run on the subspace. The EM solutions are expanded back to the original variable space, possible centers for each cluster are computed, and these centers are used to build a new subspace through a Gram-Schmidt orthonormalization process, in which EM is solved again.

The number of clusters is typically determined by multiple runs of EM with various numbers of clusters. An information index for each run, such as the Akaike information criterion28 (AIC), the Bayesian information criterion29 (BIC), or the normalized entropy criterion,30 is evaluated to determine the most suitable number of clusters. To alleviate the influence of outliers, Medasani and Krishnapuram31 proposed the robust mixture decomposition (RMD) algorithm, in which the data are ordered according to their probabilities under the mixture model and only those with high enough probabilities are retained for the EM iterations. Shoham32 used multivariate t-distributions instead of Gaussian distributions for the mixture model, as t-distributions, with their wider tails, enhance robustness.

In this paper, a method for process monitoring using Bayesian classification on PCA subspaces is proposed.

Figure 1. Concept of fault detection using PCA.

If the data for new events do not belong to the current subspace, the subspace must be reconstructed. An algorithm based on the DAEM and RMD algorithms and the Bayesian information criterion is used to estimate the parameters of the mixture model on the subspace. This paper is organized as follows: The basic theory of PCA and the algorithms related to Bayesian classification are presented in section 2. Section 3 describes the method of PCA-based Bayesian classification for process monitoring; in addition, an algorithm for updating the mixture model is proposed. Two illustrative examples are presented in section 4: a simplified mathematical example that demonstrates the method of updating parameters in the subspace, and an application of the proposed monitoring method to real plant data from a high-pressure polyethylene process. Finally, conclusions are given in section 5.

2. Basic Theory

2.1. Principal Component Analysis. The idea of PCA is to represent the data with fewer variables, each a linear combination of the original variables, through a subspace projection. For example, in Figure 1 there are three original variables; however, because the data spread on the plane formed by the two vectors PC1 and PC2, two variables are sufficient to represent them. Consider the data matrix Y ∈ R^(m×n) with m rows of observations and n columns of variables. Each column is normalized to zero mean and unit variance:

X = (Y − Ȳ)S^(-1)    (1)

where Ȳ is the mean vector and S is the diagonal matrix of standard deviations, S = diag[σ_1 σ_2 ⋯ σ_n]. The eigenvectors of the covariance matrix Σ are defined as follows:

Σ = [1/(m − 1)] X^T X

Σ p_i = λ_i p_i,  i = 1, ..., n    (2)

P = [p_1 p_2 ⋯ p_n]

where λ_i is the eigenvalue associated with eigenvector p_i, with the eigenvalues arranged in descending order. The score vectors are the projections of the data matrix X onto each eigenvector:

t_i = X p_i,  i = 1, ..., n    (3)

The data matrix X can be decomposed as

X = t_1 p_1^T + t_2 p_2^T + ⋯ + t_K p_K^T + t_{K+1} p_{K+1}^T + ⋯ + t_n p_n^T = X̂ + E    (4)

with X̂ being the projection of the data matrix X onto the subspace formed by the first K eigenvectors and E being the remainder of X, which is orthogonal to this subspace. If the first K components explain the data matrix properly, i.e., X ≈ X̂, then the data matrix is said to have K principal components. X̂ is known as the systematic part of X, and E is called the residual part. When PCA is applied to process monitoring, a PCA subspace is built from data for the plant under healthy conditions, and a number of principal components sufficient to represent normal operating conditions is determined, i.e., E ≈ 0 in eq 4. In the on-line monitoring stage, the statistic Q of a new data point x is defined as

Q = ee^T = (x − x̂)(x − x̂)^T = x(I − P_K P_K^T)x^T    (5)

where the loading matrix P_K ∈ R^(n×K) contains the first K eigenvectors of the covariance matrix. The statistic Q measures the approximation error of the new data with respect to the PCA subspace. The confidence limit of Q is defined as follows:33

Q_α = Θ_1 [c_α sqrt(2Θ_2 h_0²)/Θ_1 + 1 + Θ_2 h_0(h_0 − 1)/Θ_1²]^(1/h_0)    (6)

with Θ_i = Σ_{j=K+1}^n λ_j^i, i = 1, 2, 3, and h_0 = 1 − 2Θ_1Θ_3/(3Θ_2²). The percentile α is the probability of a type I error in hypothesis testing, and c_α is the standard normal deviate that cuts off an upper-tail area of α. Another measure of the difference between new data and the PCA subspace is the statistic T²:

T² = x P_K Λ^(-1) P_K^T x^T    (7)

where the diagonal matrix Λ = diag[λ_1 λ_2 ⋯ λ_K] contains the first K eigenvalues. T² represents the distance between the projection of the new data onto the subspace and the origin of the subspace. The T² confidence limit is defined as

T_α² = [K(m − 1)/(m − K)] F_{K,m−1,α}    (8)

where F_{K,m−1,α} is the upper α quantile of the F distribution with K and m − 1 degrees of freedom. As illustrated in Figure 1, the Q value of point a = (X1, Y1, Z1) and the T² value of point b = (X2, Y2, Z2) exceed the confidence limits Q_α and T_α², respectively. It should be noted that point a = (X1, Y1, Z1) cannot be explained by this PCA subspace even though its score is within the normal operating region. Therefore, whether a new data point can be explained by the PCA subspace should be verified before its score on the subspace is used to determine its operating status.

2.2. Bayesian Classification. The a priori probabilities of the c classes are denoted by P_j, j = 1, ..., c. Given the conditional probability density function of x_i in class j, p(x_i|j;θ), the joint probability of x_i and class j is

p(x_i, j; θ, P_j) = p(x_i|j;θ) P_j    (9)

where θ is the parameter vector of the conditional probability density functions. According to the total probability theorem, the probability density function of x_i is p(x_i;θ) = Σ_{k=1}^c p(x_i|k;θ) P_k. The a posteriori probabilities that the classes j = 1, ..., c exist, given the observed data x_i, can be obtained from Bayes rule as

P(j|x_i; θ, P) = p(x_i|j;θ) P_j / p(x_i;θ) = p(x_i|j;θ) P_j / Σ_{k=1}^c p(x_i|k;θ) P_k    (10)

Here, P = (P_1, P_2, ..., P_c) is the a priori probability vector. If the parameters of the conditional probability density functions and the a priori probabilities are known, the a posteriori probabilities can be calculated from eq 10.

2.2.1. Expectation Maximization Algorithm. Let x_1, x_2, ..., x_m be random samples drawn from the probability density function p(x_i, j). The joint probability density function p(X, j; θ, P) can be written as

p(X, j; θ, P) = Π_{i=1}^m p(x_i, j; θ, P_j)    (11)

with X^T = [x_1^T x_2^T ⋯ x_m^T]. The log-likelihood function can be defined as

L(θ, P) ≡ ln[Π_{i=1}^m p(x_i, j; θ, P_j)] = Σ_{i=1}^m ln[p(x_i, j; θ, P_j)]    (12)

Because class j cannot be observed directly, the expectation maximization algorithm maximizes the expectation of the log-likelihood function with respect to the parameters θ and P, given the observed samples:

max_{θ,P} Q(θ, P) = max_{θ,P} E(Σ_{i=1}^m ln[p(x_i|j;θ) P_j])    (13)

If a multivariate Gaussian distribution function is used, then

p(x_i|j; µ, Σ) = [1/((2π)^(n/2) |Σ_j|^(1/2))] exp[−(1/2)(x_i − µ_j) Σ_j^(-1) (x_i − µ_j)^T]    (14)

with µ_j and Σ_j being the mean vector and covariance matrix, respectively, of class j; the parameters θ are µ = [µ_1, ..., µ_c] and Σ = [Σ_1, ..., Σ_c]. The above optimization problem can be solved by the following iterative algorithm:

E Step. Calculate the expectation of the log-likelihood function as follows:


Q(µ(t), Σ(t), P(t)) = Σ_{j=1}^c Σ_{i=1}^m P(j|x_i; µ(t), Σ(t)) ln[p(x_i|j; µ(t), Σ(t)) P_j(t)]    (15)

P(j|x_i; µ(t), Σ(t)) = p(x_i|j; µ(t), Σ(t)) P_j(t) / Σ_{k=1}^c p(x_i|k; µ(t), Σ(t)) P_k(t)    (16)

M Step. Compute the next parameter estimates as

µ_j(t + 1) = [Σ_{k=1}^m P(j|x_k; µ(t), Σ(t)) x_k] / [Σ_{k=1}^m P(j|x_k; µ(t), Σ(t))],  j = 1, ..., c    (17)

Σ_j(t + 1) = [Σ_{k=1}^m P(j|x_k; µ(t), Σ(t)) (x_k − µ_j(t))^T (x_k − µ_j(t))] / [Σ_{k=1}^m P(j|x_k; µ(t), Σ(t))]    (18)

P_j(t + 1) = (1/m) Σ_{i=1}^m P(j|x_i; µ(t), Σ(t))    (19)

The maximum of the log-likelihood function is obtained by repeating the E and M steps until every parameter has converged to within a tolerance criterion ε.

2.2.2. Deterministic Annealing EM Algorithm. Because the initial values of the parameters µ(0), Σ(0), and P(0) are randomly selected, the posterior probabilities P(j|x_i; µ(t), Σ(t), P(t)) obtained from eq 16 are not reliable in the early stages of the iteration, and the parameters can become trapped in a local maximum. Ueda and Nakano25 proposed the deterministic annealing EM algorithm to resolve this problem. They introduced the tempered posterior F(j|x_i; µ(t), Σ(t), P(t)), given by

F(j|x_i; µ(t), Σ(t), P(t)) = [p(x_i|j; µ(t), Σ(t)) P_j(t)]^β / Σ_{k=1}^c [p(x_i|k; µ(t), Σ(t)) P_k(t)]^β,  0 ≤ β ≤ 1    (20)

The posterior F is a uniform distribution when β = 0 and reduces to the original posterior when β = 1; it changes from the former to the latter as β increases from 0 to 1. The DAEM algorithm solves the EM problem by replacing P(j|x_i; µ(t), Σ(t), P(t)) with F(j|x_i; µ(t), Σ(t), P(t)) at small values of β to reduce the ruggedness of the objective. The solution obtained at a small β is used as the initial guess for the EM problem at a larger β, and the procedure is repeated until the EM problem at β = 1 is finally solved. The method is analogous to simulated annealing, in which 1/β acts as the annealing temperature.

2.2.3. Robust Mixture Decomposition Algorithm. Although DAEM can effectively overcome the local maximum problem, outliers still affect its solutions. Medasani and Krishnapuram31 proposed the robust mixture decomposition (RMD) algorithm to alleviate the instability associated with outliers. They argued that, when θ is correctly estimated, the value of p(x_i;θ) for an outlier x_i should be much lower than that for the clustered data. They therefore sorted the probabilities in the order p(x_1;θ) ≥ p(x_2;θ) ≥ ⋯ ≥ p(x_m;θ) before each EM iteration and retained only the data with the r highest probabilities when calculating the parameters of the next step. The retention ratio r/m can be estimated from the quality of the dataset, or it can be included as one of the parameters to be optimized. Retaining a larger fraction of the data leads to finer classifications and better detection of minor faults, at the expense of more clusters and poorer convergence.

2.2.4. Bayesian Information Criterion. Using the following approximation of the expectation of the log-likelihood at the optimized parameters,

Q(θ*, P*) ≈ Q(θ, P) − (ν_c/2) ln(m)

Schwarz29 proposed the Bayesian information criterion (BIC) for determining the correct number of clusters:

BIC_c = −2Q(θ*, P*) + ν_c ln(m)    (21)

where ν_c is the number of fitting parameters and m is the number of observations. Maximizing the expectation of the log-likelihood is equivalent to minimizing the BIC. Using too many clusters decreases −2Q(θ*, P*) but increases ν_c ln(m), leading to a net increase in the BIC. Alternatively, one can use the Akaike information criterion,

AIC_c = −2Q(θ*, P*) + 2ν_c

McLachlan and Peel34 have compared the behavior of the BIC and AIC: when ln(m) > 2, i.e., m ≥ 8, the penalty term of the BIC is larger than that of the AIC, so the BIC is more suitable for cluster validation on large datasets.

3. Bayesian Classification on a PCA Subspace

In this paper, an approach to process monitoring using Bayesian classification on a PCA subspace is proposed.

3.1. Fault Isolation. Fault isolation determines whether new data can be explained by one of the existing classifications. The statistics Q and T² of the normalized on-line data are calculated from eqs 5 and 7, respectively. If Q > Q_α or T² > T_α², the new event does not belong to the existing PCA subspace and therefore cannot be explained by any of the current classifications. If Q < Q_α and T² < T_α², the probability of the score vector under the mixture model must additionally be greater than those of the data trimmed by the RMD algorithm. Only when the score of the on-line data is not an outlier with respect to the current classification sets can the score be used to calculate the posterior probability function

P(j|t) = p(t|j;θ̂) P_j / Σ_{k=1}^c p(t|k;θ̂) P_k. If the on-line score were used directly to evaluate the posteriors without these considerations, misclassification could easily result.

3.2. Updating the Mixture Model to a Newer Subspace. In process monitoring, there are two ways in which a new event may fail to be explained by the current classification set on the given PCA subspace. One is that the data of the new event contain a substantial component orthogonal to the current PCA subspace (Q > Q_α). The other is that the data lie on the given PCA subspace (Q ≤ Q_α) but with T² > T_α². In both cases, the classification model must be updated. An algorithm for updating the Bayesian model to the newer PCA subspace is presented here.

Consider the addition of data for new events to the original data matrix. The number of observations increases from m to m*. The data matrix Y*^T = [Y^T Y_new^T] is normalized as X* = (Y* − Ȳ*)S*^(-1), where Ȳ* is the mean vector and S* = diag[σ_1* σ_2* ⋯ σ_n*] is the standard deviation matrix. The first K* loading vectors P* and the corresponding score vectors T* are determined by assuming E* ≈ 0. Because the original data can be explained by both subspaces,

y ≈ ȳ + t P^T S ≈ ȳ* + t* P*^T S*    (22)

Hence, the relation between the scores in the new and old subspaces is given by

t* ≈ t A + ∆ȳ S*^(-1) P*,  A ≡ P^T S S*^(-1) P*,  ∆ȳ ≡ ȳ − ȳ*    (23)

The above equations show that the scores of the old data can be transferred to the newer subspace through a linear operation with a coordinate rotation (A) and a shift (∆ȳ S*^(-1) P*). Therefore, the mean of any class j in the original subspace (µ_j) can be transferred to the newer subspace via

µ_j* ≈ µ_j A + ∆ȳ S*^(-1) P*    (24)

Assuming that there are m_j observations in class j, the covariance matrix (Σ_j) on the original subspace is

Σ_j = [1/(m_j − 1)] Σ_{i=1}^{m_j} (t_i − µ_j)^T (t_i − µ_j)    (25)

and the covariance on the newer subspace is

Σ_j* = [1/(m_j − 1)] Σ_{i=1}^{m_j} (t_i* − µ_j*)^T (t_i* − µ_j*)    (26)

By substituting eqs 23-25 into the above equation, the covariance on the newer subspace can be estimated from that of the original subspace as

Σ_j* ≈ A^T Σ_j A    (27)

Equations 24 and 27 can be used to update the parameters of the mixture model to the newer subspace. The updated prior probabilities are P_j* = m P_j / m*. If the newer subspace is of higher dimension than the original subspace, i.e., K* > K, the covariance matrix Σ_j* calculated from eq 27 will be singular. In this case, the parameters cannot be transferred directly using eqs 24 and 27, and the coordinate rotation matrix must be modified as

A′ = [p_1 p_2 ⋯ p_K p_{K+1} ⋯ p_{K*}]^T S S*^(-1) P*    (28)

where p_{K+1} to p_{K*} are the (K + 1)th to (K*)th loading vectors of the original PCA. Because the features of the original dataset were extracted by only the first K loadings, it is safe to assume that the scores of the original data in the additional dimensions of the newer subspace are just noise with zero mean. Equation 24 is thus modified as

µ_j* = [µ_j 0 ⋯ 0] A′ + ∆ȳ S*^(-1) P*    (29)

It is also reasonable to assume that the noise components are uncorrelated and that their variances are the eigenvalues truncated in the previous PCA. Equation 27 is thus modified as

Σ_j* = A′^T diag(Σ_j, λ_{K+1}, ..., λ_{K*}) A′ = A^T Σ_j A + (S S*^(-1) P*)^T (Σ_{i=K+1}^{K*} λ_i p_i p_i^T)(S S*^(-1) P*)    (30)

With a few additional EM iterations, the cluster parameters of the original data in the newer subspace can be updated. The new data can be classified independently in the new PCA subspace using the DAEM procedure.

Table 1. Parameters of the Gaussian Distributions of Example 1

        group 1         group 2         group 3         group 4
µ       [4 0]           [-4 0]          [0 -4]          [0 4]
Σ       [2 0; 0 0.2]    [2 0; 0 0.2]    [0.2 0; 0 2]    [0.2 0; 0 2]
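Under the stated assumptions, the transfer of eqs 23-30 amounts to a few matrix products. The sketch below is my own illustration (all names are invented, not the author's); it moves one class's mean and covariance from the old K-dimensional subspace to the newer K*-dimensional one.

```python
import numpy as np

def transfer_class(mu_j, Sigma_j, P_old, lam_old, K, Kstar, P_star, s_old, s_new, dy):
    """Transfer a class mean/covariance from the old K-dim PCA subspace to the
    new Kstar-dim one (eqs 23-24 and 27-30).
    P_old: full n x n loading matrix of the old PCA; lam_old: its eigenvalues
    in descending order; P_star: n x Kstar new loadings; s_old, s_new:
    per-variable standard deviations; dy = ybar_old - ybar_new."""
    Sinv_P = np.diag(1.0 / s_new) @ P_star              # S*^-1 P*
    shift = dy @ Sinv_P                                 # the shift term of eq 23
    if Kstar == K:
        A = P_old[:, :K].T @ np.diag(s_old) @ Sinv_P    # rotation A, eq 23
        return mu_j @ A + shift, A.T @ Sigma_j @ A      # eqs 24 and 27
    # K* > K: augment with the previously truncated loadings (eq 28)
    Ap = P_old[:, :Kstar].T @ np.diag(s_old) @ Sinv_P
    mu_aug = np.concatenate([mu_j, np.zeros(Kstar - K)])   # eq 29: zero-mean noise scores
    Sig_aug = np.zeros((Kstar, Kstar))
    Sig_aug[:K, :K] = Sigma_j
    Sig_aug[K:, K:] = np.diag(lam_old[K:Kstar])            # truncated eigenvalues as noise variances
    return mu_aug @ Ap + shift, Ap.T @ Sig_aug @ Ap        # eq 30
```

A quick self-check of the sketch: rebuild the PCA on the augmented data, compute the class scores directly in the new subspace, and compare with the transferred parameters; they should agree up to the E ≈ 0 approximation.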

4. Examples

4.1. Example 1: Updating a Mixture Model. Four groups (groups 1-4) of random data are generated from four different Gaussian distributions whose means and covariance matrices are listed in Table 1. For each group, 100 data points are sampled on the X-Y plane about the mean according to the given distribution. The data on the Z axis are generated by the relation z = x + N(0, 0.01), where N(0, 0.01) is a normal distribution with zero mean and standard deviation 0.01. These four groups are shown in Figure 2 by open circles. In addition, another 200 data points, named groups 5 and 6, are generated from the same Gaussian distributions as groups 1 and 2; the difference is that their Z-axis data are generated by the relation z = −x + N(0, 0.01). In Figure 2, the data of groups 5 and 6 are shown by solid circles.

Figure 2. All data of groups 1-6; these data have been normalized to zero means and unit variances.

First, the PCA subspace is built with the data of groups 1 and 2; there are clearly two principal components. The score vectors are clustered by the DAEM algorithm, giving the result shown in Figure 3. The data of group 3 are then projected onto this subspace, and the statistics Q and T² are shown in Figure 4. Although the data of group 3 can be explained by the two-dimensional plane, the T² values are outside the control limit. Thus, the PCA subspace should be reconstructed using all of the data. Because the number of PCs is the same as in the original subspace, the scores of the data in groups 1 and 2, as well as the parameters of the mixture model, can be updated using eqs 24 and 27. The score vectors of group 3 form a new cluster in the new subspace. The results are shown in Figure 5; they demonstrate that the mixture model in the newer subspace can be obtained without iteration when the number of dimensions remains unchanged.

Figure 3. Clustering result for groups 1 and 2.

Figure 4. Q and T² statistics of group 3.

Figure 5. Clustering result for groups 1-3; the dashed lines represent the mixture model updated from the previous subspace.

Projecting the data of groups 4-6 onto the PCA subspace formed by groups 1-3 shows that those data are outside the control limits for both Q and T² (Figure 6), so the PCA subspace should again be rebuilt with all of the data. There are three PCs in the newer subspace. Three additional clusters are generated from the score vectors of groups 4-6 through the DAEM algorithm. Cluster boundaries, drawn where the conditional probability functions are larger than 5%, are shown as solid lines in Figure 7. The cluster boundaries of groups 1-3 can be obtained from the mixture model updated using eqs 29 and 30; they are shown as dashed lines in Figure 7. Table 2 shows that there are virtually no differences between these initial values and the final converged values after the EM iterations. Table 3 lists the means and variances of groups 1-3 projected back onto the original coordinates; they are close to the values used for generating the data, as listed in Table 1.

Figure 6. Q and T² statistics of groups 4-6.

Figure 7. Model from the DAEM algorithm (solid lines); initial values of the parameters estimated by the proposed method (dashed lines). (a) First and second scores. (b) First and third scores.

This example illustrates how the score vectors of on-line data should be used for process monitoring. First, both the statistics Q and T² must be tested; if either is outside its control limit, the PCA subspace should be rebuilt. At the same time, additional clusters must be generated from the new data, and the existing mixture model must be transferred to the newer subspace. If the number of PCs in the newer subspace is the same as in the previous subspace, the mixture model can be transferred directly using a simple translation and rotation. If the number of PCs increases, initial values of the mixture model parameters for the existing clusters can be obtained and then refined with a few EM iterations.

4.2. Example 2: High-Pressure Polyethylene Process Monitoring. The polyethylene plant used for this example is located in Kaohsiung, Taiwan. Five production lines produce high-density, low-density, and linear low-density polyethylene, among other products; in the high-pressure process, the operating conditions differ for each grade. High-purity ethylene (FI-001 in Figure 8) is mixed with recycled gas before entering the primary compressor. Modifiers are added (FIC-001 and FIC-002) according to the recipes for each grade. The flow rate of the feed entering the reactor is measured by FI-002. There are several reacting zones in the reactor, and the temperature profile is maintained by manipulating the initiator flow rates (FI-003, FI-004, and FI-005). The temperature controllers (TIC-001, TIC-002, and TIC-003) are located close to the top, middle, and bottom of the reactor. The product qualities are significantly affected by the temperature profile and the pressure in the reactor. The reactor pressure, approximately 1500-2000 kg/(cm²·g), is regulated by PIC-001 through the outflow.

A mixture of polyethylene and unreacted gas is flashed in two stages. The higher-pressure gas is recycled from the top of the high-pressure separator to the entrance of the secondary compressor. Almost all of the polymer, along with the rest of the unreacted gas, flows to the extrusion hopper; polymer is pumped to the extruder from the bottom of the extrusion hopper. The lower-pressure gas is recycled from the top of the extrusion hopper to the entrance of the primary compressor. The reaction conversion is only around 10-15%, so most of the unreacted gas recycles continuously through the process. Byproducts accumulate until the product qualities fall outside their control limits, at which point the operator is notified to purge the recycled gas through FIC-004 to maintain the purity of the reacting gas.

In this paper, data on the 13 process variables in Figure 8 were collected every 5 min over a 15-day period. After the data collected during shutdowns were trimmed, there were 4142 observations in the dataset, covering two grades of products labeled 1 and 2. The PCA subspace was constructed from this dataset; three PCs were identified that explained 91.4% of the variance. The initial value of β in the DAEM algorithm was set to 0.2, with β_new = 1.2β_old at each annealing step, to determine the fitting parameters. In addition, the retention ratio of the RMD algorithm was assumed to be 0.8.
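The on-line checks used throughout this example rest on the statistics and limits of eqs 5-8 and the decision rule of section 3.1. A minimal sketch follows; it is my own illustration, with all names invented. The quantiles c_α and F_{K,m−1,α} are passed in as plain numbers (they can be obtained from, e.g., scipy.stats), and `mixture_posterior` stands for any function returning the class posteriors of eq 10 on the scores.

```python
import numpy as np

def q_statistic(x, PK):
    """Q = squared residual of a normalized sample x outside the subspace (eq 5)."""
    e = x - (x @ PK) @ PK.T
    return float(e @ e)

def t2_statistic(x, PK, lam_K):
    """T^2 = distance of the score from the subspace origin (eq 7)."""
    t = x @ PK
    return float(t @ (t / np.asarray(lam_K)))

def q_limit(lam_residual, c_alpha=1.645):
    """Jackson-Mudholkar limit for Q (eq 6); lam_residual = lambda_{K+1..n};
    c_alpha is the upper-tail normal deviate (1.645 for alpha = 0.05)."""
    lam = np.asarray(lam_residual, dtype=float)
    th1, th2, th3 = lam.sum(), (lam ** 2).sum(), (lam ** 3).sum()
    h0 = 1.0 - 2.0 * th1 * th3 / (3.0 * th2 ** 2)
    term = (c_alpha * np.sqrt(2.0 * th2 * h0 ** 2) / th1
            + 1.0 + th2 * h0 * (h0 - 1.0) / th1 ** 2)
    return th1 * term ** (1.0 / h0)

def t2_limit(K, m, f_alpha):
    """Limit for T^2 (eq 8); f_alpha = F_{K, m-1, alpha} quantile."""
    return K * (m - 1) / (m - K) * f_alpha

def isolate(x, PK, lam_K, QL, TL, mixture_posterior, outlier_floor):
    """Section 3.1 decision: new event, outlier, or class posteriors."""
    if q_statistic(x, PK) > QL or t2_statistic(x, PK, lam_K) > TL:
        return 'new event'                 # not explained by the subspace
    post = mixture_posterior(x @ PK)       # P(j|t), eq 10 on the scores
    if post.max() < outlier_floor:         # below the RMD-trimmed probabilities
        return 'outlier'
    return post
```

The `outlier_floor` argument plays the role of the smallest retained mixture probability from the RMD step, so an in-subspace point that no existing class explains is flagged rather than force-classified.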
The number of classes was found to be 4 by the BIC. The projection of the training data onto the PCA subspace and the clustering results are shown in Figure 9a,b. Data for grade 1 were clustered as C1_1 and C1_2. C1_1 represents the normal operating conditions of grade 1; the operator verified that the product qualities were off-specification in cluster C1_2. Clusters C2_1 and C2_2 represent the data for grade 2, with C2_2 the class of normal operation for grade 2. Class C2_1 falls chronologically between C1_2 and C2_2, so it is easy to infer that the root causes of the abnormality in C1_2 had not been eliminated and that C2_1 represents abnormal operation. Because of the transition from the abnormality, some of the grade 2 data cannot be explained well by a Gaussian probability density function; such data become outliers trimmed by the RMD algorithm.

After that, data from grade 3 to grade 6 were collected over 21 days, giving 5650 observations. The Q and T² plots in Figure 10 show that the new data cannot be explained by the existing PCA subspace, so the subspace had to be rebuilt. Using all of the data to rebuild the subspace yields four PCs that explain 93.0% of the variance. Four new classes, labeled C3-C6, were found, as shown in Figure 11a,b. The known classes (C1_1, C1_2, C2_1, C2_2) were transferred to the newer subspace.

The clustering result shown in Figure 11a,b was applied to on-line fault isolation 15 days later. The products produced were grades 1 and 4, and the data were labeled as grades 1-2 and 4-2. Figure 12 shows

7822

Ind. Eng. Chem. Res., Vol. 43, No. 24, 2004

Table 2. Comparison of Initial Values and EM Solutions for Clustering Parameters

parameter | initial values | solutions of EM
µ1 | [-1.5441 -0.7623 -0.1747] | [-1.5430 -0.7622 -0.1747]
Σ1 | [0.2839 0.1265 0.0279; 0.1265 0.0867 0.0276; 0.0279 0.0276 0.0105] | [0.2845 0.1265 0.0272; 0.1265 0.0865 0.0275; 0.0272 0.0275 0.0109]
µ2 | [1.3414 0.6923 0.1934] | [1.3538 0.7032 0.1973]
Σ2 | [0.2820 0.1352 0.0325; 0.1352 0.1020 0.0342; 0.0325 0.0342 0.0133] | [0.2730 0.1259 0.0296; 0.1259 0.0937 0.0315; 0.0296 0.0315 0.0122]
µ3 | [0.8009 -1.2593 -0.7194] | [0.7970 -1.2458 -0.7122]
Σ3 | [0.0963 -0.1041 -0.0644; -0.1041 0.2134 0.1203; -0.0644 0.1203 0.0687] | [0.0965 -0.1076 -0.0660; -0.1076 0.2268 0.1271; -0.0660 0.1271 0.0720]

(Rows of each covariance matrix Σi are separated by semicolons.)
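The initial values in Table 2 come from carrying the known class parameters into the rebuilt subspace. Under an affine change of score coordinates, a Gaussian class's mean and covariance map as sketched below; the transformation matrix A and offset b here are illustrative placeholders, not values from the paper:

```python
import numpy as np

def transfer_gaussian(mu, sigma, A, b):
    """Map a Gaussian class N(mu, sigma) through the affine coordinate
    change t_new = A @ t_old + b: the mean is transformed directly and
    the covariance undergoes the congruence transform A @ sigma @ A.T."""
    return A @ mu + b, A @ sigma @ A.T
```

The transferred parameters then serve as the initial guesses that a few EM iterations refine into the solutions shown in Table 2.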

Table 3. Projection of Initial Values of Parameters onto the Original Coordinates (X and Y Axes)

state 1: µ = [4.3678 0.0065], Σ = [2.0864 0.0261; 0.0261 0.2090]
state 2: µ = [-3.5744 -0.0225], Σ = [2.1468 -0.0620; -0.0620 0.2489]
state 3: µ = [-0.0113 -4.3478], Σ = [0.2155 -0.0082; -0.0082 2.2884]

Figure 8. Process flow diagram of high-pressure polyethylene plant.

the Q and T² statistics for all of the data. Some of the grade 1-2 data (sample numbers 11123-11390, 11% of the data for grade 1-2) are outside the control limits. Figure 13a,b shows the results when the rest of the data are classified. All of the data for grade 4-2 are classified in class C4, which represents normal operating conditions for grade 4. Most of the grade 1-2 data are classified in cluster C1_1, which also represents normal operation for grade 1. The data around the new abnormality are classified in class C1_2, as shown in Figure 14. Therefore, C1_2 serves as the prewarning class for abnormal grade 1 production. As an abnormality in grade 1 production starts to develop, the data move from C1_1 to C1_2 before exiting the PCA subspace.

5. Conclusions

In this paper, a process monitoring methodology that combines PCA and Bayesian classification is proposed. Noise in the plant data can be filtered out and the features extracted with PCA, reducing the complexity of the Bayesian classification. When data for new events cannot be explained properly by the existing PCA subspace, the subspace should be rebuilt to cover all of the events. Because the data for new events do not belong to the previous subspace, in the newer subspace, only

Figure 9. Clustering results for grades 1 and 2 in the PCA subspace. (a) First and second scores. (b) First and third scores.
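The number of classes underlying the clustering in Figure 9 was chosen by minimizing BIC over candidate class counts. A sketch of that selection, using scikit-learn's Gaussian mixture as a stand-in for the DAEM/RMD fitting described in the paper (synthetic data, not the plant dataset):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_n_classes(scores, k_max=6, seed=0):
    """Fit Gaussian mixtures with 1..k_max components to the score
    vectors and return the component count that minimizes BIC."""
    bics = [GaussianMixture(n_components=k, random_state=seed)
            .fit(scores).bic(scores) for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1
```

BIC penalizes the log-likelihood by the number of estimated parameters νc times ln m, so adding a class is accepted only when it improves the fit enough to pay for its extra parameters.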

the data for new events need to be clustered. The existing classifications of the old data can be updated to the newer subspace using a simple transformation plus a few EM iterations. Moreover, the most time-consuming step of the classification, the DAEM algorithm, need only be applied to new data that do not belong to the existing PCA subspace. Fault detection


Figure 10. Q and T² statistics for different operating conditions.
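The two statistics plotted in Figure 10 can be computed per observation from the PCA model. A sketch, assuming a normalized observation x, the loading matrix P, and the eigenvalues of the retained PCs:

```python
import numpy as np

def q_and_t2(x, P, eigvals):
    """Q (squared prediction error) is the squared norm of the residual
    outside the PCA subspace; Hotelling's T^2 sums the squared scores
    scaled by the retained eigenvalues."""
    t = x @ P                  # score vector in the K-dimensional subspace
    residual = x - t @ P.T     # part of x the K PCs do not explain
    Q = float(residual @ residual)
    T2 = float(np.sum(t ** 2 / eigvals))
    return Q, T2
```

An observation violating the Q limit lies off the subspace entirely (a new event), whereas a T² violation is an unusual excursion within the subspace.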

Figure 12. Q and T² statistics on the newer PCA subspace for the training and test datasets.

Figure 11. Clustering results for all six grades on the newer PCA subspace. (a) First and second scores. (b) First and third scores.

knowledge can be mastered incrementally as operating data accumulate. However, meaningful classifications are obtained only if an adequate amount of data has been accumulated, so the method cannot be applied in a real-time manner. Moreover, outliers are trimmed by the RMD algorithm, resulting in the failure to isolate some minor faults. Furthermore, this method is essentially a method of fault detection but not of fault diagnosis. It will identify new modes of operation when faults cause the plant to stray into new areas of the

Figure 13. On-line fault isolation for different operating conditions. (a) First and second scores. (b) First and third scores.
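The on-line isolation shown in Figure 13 amounts to evaluating, for each new score vector, the posterior probability of every known class and assigning the point to the most probable one. A minimal sketch; the class parameters used in the test are hypothetical, not the fitted values from the paper:

```python
import numpy as np

def gaussian_pdf(t, mu, sigma):
    """Multivariate Gaussian density evaluated at score vector t."""
    d = len(mu)
    diff = t - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

def classify(t, priors, means, covs):
    """Bayesian classification: the posterior of class j is proportional
    to the a priori probability P_j times the class density p(t | j)."""
    joint = np.array([pj * gaussian_pdf(t, m, c)
                      for pj, m, c in zip(priors, means, covs)])
    posterior = joint / joint.sum()
    return int(np.argmax(posterior)), posterior
```

Points whose Q or T² statistic exceeds the control limits are flagged as outside the subspace first; only in-subspace points are assigned to a class this way.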

sensor state space. It cannot diagnose the nature of the fault. If the method has detected a new mode of operation, plant experts should be called in to elucidate the physical interpretation of this new mode. They can examine the contribution plots to determine the faulty variables and localize the units that might be in error. Furthermore, they can examine the known modes of operation that the plant was in before the new mode and examine faults related to those modes. Successful

Greek Letters
β = temperature parameter in the DAEM algorithm
Λ = diagonal matrix of eigenvalues
Σ = covariance matrix of the normalized process data (in section 2.1)
Σi = covariance matrix of the Gaussian distribution of class i (in section 2.2.2)
σi = standard deviation of variable i
σi* = standard deviation of variable i with new events
θ = parameters of the conditional probability density function
µi = mean vector of the Gaussian distribution of class i (in section 2.2.2)
νc = number of estimated parameters of EM

Figure 14. Abnormality prewarning for grade 1 production.

application of this approach to high-pressure polyethylene process monitoring demonstrates that the proposed technique of adaptively adding clusters is suitable for fault detection in processes with multiple operating modes.

Acknowledgment

This work was supported by the National Science Council, Republic of China, under Grant NSC-92-2213-E-268-001.

Nomenclature
A = linear operator for rotating the coordinates of the subspace
c = number of classes
E = residual part of the data matrix
F(j|xi) = a posteriori probability of xi in the DAEM algorithm
L(θ) = log-likelihood function
K = number of principal components
K* = number of principal components with new events
m = number of observations
m* = number of observations with the data for new events
n = number of variables
P = loading matrix (in section 2.1)
P = a priori probability matrix (in section 2.2)
P(j|xi) = a posteriori probability of xi
Pj = a priori probability of class j
p(xi|j;θ) = conditional probability density function of xi given class j and the parameters θ
p(xi,j) = joint probability density function of data xi and class j
Qα = confidence limit corresponding to the (1 - α)th percentile
r = number of data retained
S = diagonal matrix of standard deviations
S* = diagonal matrix of standard deviations with new events
Tα² = confidence limit corresponding to the (1 - α)th percentile
ti = ith score vector
X = normalized data matrix
X* = normalized data matrix with the data for new events
X̂ = systemic part of the data matrix
x = normalized data vector
Y = process data matrix
Y* = process data matrix with the data for new events
Ȳ = mean vector of process data
Ȳ* = mean vector of process data with new events


Received for review February 25, 2004
Revised manuscript received June 22, 2004
Accepted September 1, 2004

IE0498495