Mixture Principal Component Analysis Models for Process Monitoring


Ind. Eng. Chem. Res. 1999, 38, 1478-1488

Junghui Chen*
Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li, Taiwan 320, Republic of China

Jialin Liu
Center for Industrial Safety and Health Technology, Industrial Technology Research Institute, Hsin-Chu, Taiwan 310, Republic of China

A mixture principal component analysis (MixPCA) model detector is proposed. The detector creates each cluster by aggregating input exemplars. The number of clusters in the data set is automatically determined by a Gaussian filter technique called heuristic smoothing clustering. The structure building contains two stages: (1) sizing each cluster on the basis of the clustering analysis and (2) setting up the multivariable statistical confidence levels for each cluster. After the clusters are identified, MixPCA, which is a group of PCA models, is then created. If an input pattern comes from normal operation, it should fall into the accepted region whose criteria are set on the basis of the principal component analysis (PCA) T2 and Q charts. In addition, techniques for measuring the similarity between clusters and merging them are also developed to accommodate the new correlation structure of the process variables when new data sets are included. The proposed model development procedure is favorable because of its simplicity of construction, intuitive modeling based on the data distribution, and flexible updating procedures. Extensive testing on three problems demonstrates the effectiveness of the proposed methodology.

* To whom all correspondence should be addressed.

1. Introduction

Pattern classification is widely used and plays an important role in many applications, such as character recognition,1 automatic control,2 rule-based reasoning,3 and so forth. For many applications, the input attributes should be classified into a set of clusters in order to model and process uncertain and ambiguous data. Among these applications, principal component analysis (PCA) is often adopted to select features that have more discriminative power, and it becomes a basis for a recognition or classification scheme.4 This feature selection process is particularly important for determining the correlation of the input variables, so PCA is suitable for clustering analysis. In this paper, a new fault detection (FD) algorithm is developed to monitor plant faults occurring in multivariable and interacting systems. The proposed method integrates data clustering with PCA to build up a useful mixture model detection system.

Modern industrial plants are well-equipped with computerized measurement devices to control highly complex processes. Usually, thousands of observations are collected from the sensors every few seconds. This often creates "data overloads" or redundant information. Thus, experienced engineers are needed to sort through a cornucopia of data and to detect whether the process is normal. If an abnormal event cannot be identified in time, the malfunction may cause monetary loss, plant shutdown, or even significant impacts on personnel and equipment safety.

Traditionally, statistical process control (SPC) is known as a standard systematic method that provides statistical analysis on historical data for decision making.

However, this approach is not well suited to plant operators who must watch many variables and concentrate on keeping each variable at its setpoint. The major drawback of SPC is that it may lead to erroneous results when collinearities occur among the process variables. Several advanced SPC approaches to multivariable process monitoring have been developed,5-7 such as PCA and partial least squares. Their basic concept is to find a set of combinations of variables, or major principal components, that describes the major variations and trends in the operating data. Unlike traditional SPC schemes, the principal components can reduce the dimensionality of the monitoring problem. However, it is difficult to detect process faults when the normal data cannot be represented by a single set of combinations of variables; the system data are most likely distributed over several regions. In this paper, a modified clustering analysis is developed to overcome this problem.

Past research shows that the neural network (NN) is a powerful tool for FD.8-12 Through training, an NN is very good at nonlinear function approximation and pattern recognition, especially when expert diagnostic knowledge and prior fault-symptom models are not available. However, the popular NN approaches have some disadvantages.13 For example, the global approximation character of back-propagation neural networks (BPN) forces all classes to be separated by a hyperplane when BPN deals with all the data sets. Thus, false classification happens in areas that are not covered by the training patterns of BPN. The radial basis function network (RBFN) approach, which is a data-driven technique, locates the center and width of each neuron with a local radial basis function based on data density, but the region for each neuron is not often



appropriately bounded. Thus, it is tedious to locate the center of each neuron and to determine the neuron width of RBFN.

Apart from RBFN, another neural network paradigm adopted by many researchers recently is the elliptical basis function neural network. It is similar to RBFN with the Gaussian basis function; however, the elliptical basis function gives more favorable and intuitive results in function approximation and classification. Poggio and Girosi,14 from the viewpoint of approximation theory, set up a regularization network utilizing the elliptical basis function, but the number of basis functions was not clearly defined, even though a heuristic approach was applied to relieve the heavy computational load caused by the large number of units used. Kavuri and Venkatasubramanian9,10 developed a network using the ellipsoidal activation function for a fault classification problem. A diagonal covariance matrix was used in each elliptical unit in order to reduce the computational load; hence, the hyperelliptical shape of the units could not properly represent the data regions and their correlations. Johnston and Kramer15 built elliptical basis function networks with probability density estimation to calculate a different covariance matrix in each region. The number of basis functions was preassigned and computed using the k-means clustering algorithm. As they noted, too many small window basis functions could lead to spiky density estimates, while larger window functions could lose the capability of capturing the local characteristics of the data. Since the selection of the proper number of basis functions was still an unsolved problem, a trial-and-error procedure was often used.

The above literature demonstrates that neurons with elliptical basis functions can solve function approximation and classification problems, but the advantages of the elliptical function have not been fully explored. Therefore, this paper aims at two goals: (1) to propose a new technique for selecting the proper number of basis functions and constructing an elliptical basis function for each cluster and (2) to build up a mixture PCA (MixPCA) methodology for finding an acceptable region for each cluster based on the historical data and for detecting whether the operating condition falls out of the acceptable regions. The proposed technique, based on Gaussian smooth transformation, can extract the significant component directions and possible local regions from the operating data and avoid the drawback caused by a large knowledge base. The procedure can set up elliptical basis functions with different sizes, orientations, and shapes for each local region even when a high-dimensional system is encountered. The detector, called MixPCA, is then constructed on the basis of the decision regions for the clusters bounded by PCA. Besides, a similarity measure and a merging technique are also developed to handle the combined correlation structure of the process variables when new data sets are added, in order to avoid false diagnosis.

This paper is organized as follows. The MixPCA structure is first detailed in section 2. Section 3 shows the auto-clustering procedures. Section 4 focuses on how to include the new correlation structure of the process variables in order to avoid false detection when new data sets are added. Three examples and their results are listed in section 5. The characteristic discussion and

Figure 1. A single node of the PCA model.

comparison with other methods are also described. Section 6 gives the conclusions.

2. Mixture PCA Network Structure

Knowledge representation for complicated systems must be as clear to human users as possible. The knowledge in constructing MixPCA lies in its process of acquiring the operating regions. Each PCA cluster can be easily extracted from the operating region. Discussion of this structure is detailed in the following subsections.

2.1. Single-Cluster PCA Network. The operating region represented by a single cluster is described by the following equation:

\phi(x) = (x - c)^T S^{-1} (x - c)    (1)

where x is the input vector of this cluster, c is the cluster center vector of this cluster, and S is the covariance matrix for the data relationship. In other words, a single cluster (or node) of MixPCA consists of two parts (Figure 1). The first part represents a multivariate Gaussian distribution with two kinds of parameters for the node: the cluster center c and covariance matrix S. The ellipsoidal unit is used to represent data distribution. The second part describes the output of the node, indicating if the input variable falls into this accepted region. That is,

If (x - c)^T S^{-1} (x - c) \le d^2, then O(x) = \text{accept}; else O(x) = \text{reject}    (2)

where d2 is the adjustable region parameter related to the statistical hypothesis testing. The domain defined by the above equation is a hyperellipsoid whose principal directions may be oblique with respect to the axes of the input space. For a simple illustration, consider a monitoring system with three inputs, x1, x2, and x3, whose samples all lie on a plane, as shown in Figure 2. The samples can be enveloped by an ellipse. Thus, a PCA model with two principal components adequately describes all variations in the samples. However, since the PCA model itself lacks model fit statistics, the commonly used multivariate statistical process control statistic, Q, and the measure of variation within the PCA model given by Hotelling's T2 statistic can be applied to detect whether the samples are in this accepted region. (1) Statistic Q: The scalar value Q is a measure of the goodness of fit of sample x to the model. Q, the sum of squares of the residuals, is expressed as

Q = x^T (I - P_l P_l^T) x    (3)


Figure 2. Data distribution represented by the PCA model.

where the columns of Pl are the first l loading vectors retained in the PCA model and I is the identity matrix of appropriate size. Pl is composed of the eigenvectors of the covariance matrix of the data. All the data sets are collected from normal operation. It has been shown that the residuals have the properties of noise if there are more measurements than states.5

(2) Statistic T2: The scalar value T2 is a measure of the major variation of the sample x with respect to the model. T2, the sum of squares of the scores, is defined as

T^2 = x^T P_l \Lambda^{-1} P_l^T x    (4)

where Λ-1 is the diagonal matrix containing the inverses of the eigenvalues associated with the l eigenvectors (or principal directions) retained in the model. These two statistics, Q and T2, can be used for monitoring the process. The confidence control limit Qα for Q can directly follow that of the traditional PCA model.4 The control limit Tα2 for T2 can be calculated by means of the F-distribution:

T_\alpha^2 = \frac{l(m-1)}{m-l} F_{l,m-l,\alpha}    (5)

where m is the sample size and Fl,m-l,α has l and m - l degrees of freedom. These control limits are used with a 100(1 - α)% chance of a false alarm, where α is typically 0.99 or 0.95. Therefore, rewriting eq 2, the output O(x) is defined as

O(x) = \begin{cases} \text{accept}, & \text{if } Q \le Q_\alpha \text{ and } T^2 \le T_\alpha^2 \\ \text{reject}, & \text{otherwise} \end{cases}    (6)
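To make eqs 3-6 concrete, the following sketch (not from the original paper) shows how a single PCA node and its control limits might be computed with NumPy/SciPy. The 90% cumulative-variance rule for choosing l and the empirical percentile used for the Q limit are assumptions of this sketch; the paper takes the Q limit from the traditional PCA result4 and does not prescribe how l is chosen.

```python
import numpy as np
from scipy import stats

def fit_pca_node(X, var_goal=0.90, alpha=0.95):
    """Fit one PCA node on normal operating data X (m samples x n variables).

    Returns the pieces used in eqs 3-6: the center c, loading matrix P_l,
    retained eigenvalues, and the T2 limit of eq 5.  The Q limit here is an
    empirical percentile of the training residuals (an assumption of this
    sketch; the paper follows the traditional PCA limit of ref 4).
    """
    m, n = X.shape
    c = X.mean(axis=0)                       # cluster center vector
    Xc = X - c
    S = np.cov(Xc, rowvar=False)             # covariance matrix (eq 1)
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]         # sort eigenvalues descending
    eigval, eigvec = eigval[order], eigvec[:, order]

    # retain enough principal components to reach the variance goal
    l = int(np.searchsorted(np.cumsum(eigval) / eigval.sum(), var_goal)) + 1
    P, lam = eigvec[:, :l], eigval[:l]

    # eq 5: T2 control limit from the F-distribution
    T2_lim = l * (m - 1) / (m - l) * stats.f.ppf(alpha, l, m - l)
    # empirical Q control limit (assumption: percentile of training residuals)
    resid = Xc - Xc @ P @ P.T
    Q_lim = np.percentile(np.sum(resid ** 2, axis=1), 100 * alpha)
    return {"c": c, "P": P, "lam": lam, "T2_lim": T2_lim, "Q_lim": Q_lim}

def node_output(node, x):
    """Eq 6: accept x only if both Q (eq 3) and T2 (eq 4) are within limits."""
    xc = x - node["c"]
    t = node["P"].T @ xc                     # scores of x on the l loadings
    Q = xc @ xc - t @ t                      # x^T (I - P_l P_l^T) x
    T2 = np.sum(t ** 2 / node["lam"])        # x^T P_l Lambda^-1 P_l^T x
    return "accept" if (Q <= node["Q_lim"] and T2 <= node["T2_lim"]) else "reject"
```

A node fitted this way on normal data plays the role of the acceptance test in eq 6 and is reused below when several cluster nodes are combined into the mixture model.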

2.2. Mixture PCA Model. In general, it may not be adequate to use only one cluster represented by an ellipsoidal function to solve all knowledge representation problems. When the given data are widely dispersed, knowledge representation of this kind requires clustering of the data in the input space, which contains several clusters of distinct sizes. In this way, a complex system is decomposed into a few subsystems explicitly extracted by a set of clusters (or hyperellipsoids). This set of clusters is called the MixPCA model. In MixPCA, each cluster performs a particular decision function on incoming signals and carries a set of parameters pertaining to this cluster. The input variables are process variables, and the system is represented by several clusters whose final classification decision is represented by an output cluster. Therefore, there exist K clusters (or nodes), and the final decision output nodes can be expressed as

Cluster 1: \; (x - c_1)^T S_1^{-1} (x - c_1) \le d_1^2 \;\Rightarrow\; O_1(x) = \begin{cases} \text{accept}, & \text{if } Q_1 \le Q_{1\alpha} \text{ and } T_1^2 \le T_{1\alpha}^2 \\ \text{reject}, & \text{otherwise} \end{cases}

\vdots

Cluster K: \; (x - c_K)^T S_K^{-1} (x - c_K) \le d_K^2 \;\Rightarrow\; O_K(x) = \begin{cases} \text{accept}, & \text{if } Q_K \le Q_{K\alpha} \text{ and } T_K^2 \le T_{K\alpha}^2 \\ \text{reject}, & \text{otherwise} \end{cases}

where ch is the cluster center vector for node h, Sh is the covariance matrix for the data relationship in node h, Qh is the sum of squares of the residuals for cluster h, and Th2 is the sum of squares of the scores for cluster h. As the values of these parameters change, the ellipsoid-shaped function varies accordingly; thus, determining the appropriate number of nodes and the corresponding parameters is a critical point. In the next section, the proposed auto-clustering technique is developed to extract a minimal number of clusters and their parameters without the conventional trial-and-error approach. The clusters can then adequately describe how the data of the operating region are distributed.

To build a good monitoring system using the proposed method, it is important to consistently update the system model for new data sets. Some process operating conditions may have nothing to do with the initial historical data sets. Hence, updating procedures that extract new clusters for the new data sets as well as the previous data sets should be implemented to keep samples from being classified incorrectly and to avoid failing to detect an alarm condition. Note that the statistical hypothesis testing criteria are set up for the part of the data in each cluster, not for all data points. In this way, the problem of the traditional PCA method, which is always built on all the data sets without considering the characteristics of their distribution, is overcome when the system data are located in several regions. The cluster output is accepted when the corresponding input pattern falls inside the ellipsoid (near the ellipsoidal center) formed by the cluster. On the other hand, input patterns that are far away from the ellipsoidal center (that is, outside the ellipsoid) yield a rejected state.
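As a hedged illustration of how the K cluster nodes might be combined (the paper does not spell out the voting rule; treating a sample as normal when at least one node accepts it is an assumption consistent with the description above):

```python
def mixpca_monitor(nodes, x):
    """Sketch of the MixPCA decision over a list of fitted cluster nodes.

    `nodes` is a list of dictionaries returned by fit_pca_node (one per
    cluster).  A sample is taken as normal if at least one node accepts it,
    i.e., its Q_h and T_h^2 fall inside that node's limits; otherwise an
    alarm is raised.  (The "any node accepts" rule is an assumption.)
    """
    votes = [node_output(node, x) for node in nodes]
    return "normal" if "accept" in votes else "alarm"
```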


3. Clustering Analysis

Clustering algorithms are used extensively not only for the organization and categorization of data but also for data compression and model construction. One of the simplest techniques, accomplished by using a Parzen window, is to build an identical spherical window function around each training pattern in the data set and to sum them up for each of the K classes.16 However, there exist several problems with this approach, including the heavy computational load of the many basis functions used, a poor ability to capture the characteristics of the local data distribution, and so forth. A detailed treatment of this aspect can be found in the literature.15 The primary disadvantage of this technique is that all the window functions have the same size.

Different learning strategies have been proposed for the design of the density estimation, depending on how the centers of the clusters are specified. One simple approach is to fix the locations of the centers, chosen randomly from the input data sets. This approach is considered sensible, provided that the data are distributed in a representative manner.17 Another approach utilizes the elliptical basis functions and moves the locations of their centers with a self-organized learning rule, like the k-means clustering algorithm18 and fuzzy clustering.19 A fixed number of clusters is required so that these methods can avoid an excessive number of cluster centers and locate the cluster centers at the dense regions of patterns. Furthermore, some works adopt an error-correction learning strategy, like the gradient-descent procedure, to adjust the centers. A trial-and-error procedure is then inevitable for selecting the proper number of basis functions. Automatic extraction methods for an unknown number of clusters have also been discussed. The unsupervised robust C-prototypes algorithm,20 which combines fuzzy c-means clustering (FCM) and statistical estimation, can remove and merge compatible clusters; however, the maximum number of clusters must be predefined. Recently, a combined algorithm of FCM and information analysis has been used to select the proper number of clusters;21 however, a high computational load occurs when each possible clustering condition is evaluated in order to determine the proper number of clusters.

A heuristic smoothing clustering (HSC) algorithm based on the Gaussian filter is proposed here to automatically determine the proper number of clusters in the given data sets. Lin et al.22 originally used the one-dimensional Gaussian smoothing algorithm to automatically determine the spread parameter for smoothing a one-dimensional signal. The HSC algorithm is extended to a two-dimensional system in this paper. Given a two-dimensional signal h(x,y), its Gaussian smoothing H(x,y) is defined as

H(x,y) = h(x,y) * g(x,y,\sigma) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\mu,\nu) \left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)\right] \left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(\frac{-(y-\nu)^2}{2\sigma^2}\right)\right] d\mu \, d\nu    (7)

where * denotes convolution with respect to x and y, g(x,y,σ) is the Gaussian function with the spread parameter σ, and µ and ν are the dummy variables. Note that, for the purpose of fast computation and characterization of the local variation of the signal, the correlation coefficient σxy of x and y can be neglected in the bivariate Gaussian function if x and y in eq 7 are calculated by projecting the data set onto their two principal directions, which are mutually independent and orthogonal. Like image processing, h(x,y) is usually represented as a two-dimensional array of pixel intensities. Let Hx(x,y) and Hy(x,y) represent the first partial derivatives of H(x,y) with respect to x and y, respectively. A zero-crossing in Hx(x,y) and Hy(x,y) is a local maximum in H(x,y) if the sign of Hx(x,y) changes from positive to negative and the sign of Hy(x,y) changes from positive to negative. Hx(x,y) can be calculated as

H_x(x,y) = h(x,y) * \frac{\partial g(x,y,\sigma)}{\partial x} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\mu,\nu) \left[\frac{-(x-\mu)}{\sqrt{2\pi}\,\sigma^3} \exp\!\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)\right] \left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(\frac{-(y-\nu)^2}{2\sigma^2}\right)\right] d\mu \, d\nu    (8)

Likewise, Hy(x,y) can be obtained by the same procedure. The implemented algorithm can be explained by the following steps:

Step 0: Count the data intensity h(x,y) and compute the smoothed result H(x,y) for the given two-dimensional data sets.
Step 1: Let i = 0, σi = 1, and pi = ∞.
Step 2: Calculate Hx(x,y) and Hy(x,y) at σ = σi.
Step 3: Count pi+1, the number of zero-crossings in Hx(x,y) and Hy(x,y) whose sign changes from plus to minus. If pi+1 ≥ pi or pi+1 ≤ 1, stop.
Step 4: Update σi+1 = √(di+1), where di+1 is the maximum distance among all pairs of neighboring peaks.
Step 5: Set i = i + 1 and compute H(x,y). Go to step 2.

Note that pi+1 = pi means that the numbers of clusters at the ith iteration and at the (i-1)th iteration are the same even though σ is increased, although the number of peaks may still change. When pi+1 > pi, the number of peaks has increased relative to the (i-1)th iteration, which is contrary to our expectation for more smoothing. For details of the above method, refer to the above-mentioned reference. According to this heuristic algorithm, the cluster centers ci can be computed, and the partitioning of the clusters is

\mu_{ij} = \frac{\|x_j - c_i\|^{-2}}{\sum_{k=1}^{K} \|x_j - c_k\|^{-2}}, \quad i = 1, 2, ..., K; \; j = 1, 2, ..., m    (9)

\Omega_i = \{x_j \mid \mu_{ij} = \max_{1 \le k \le K}(\mu_{kj})\}

where µij is a numerical value in [0, 1] that expresses the degree of belonging between the datum xj and the ith cluster, and Ωi is the set of data which has the maximum degree of belonging to a given cluster i. Figure 3 presents an example of the proposed auto-clustering method. In Figure 3a, the original data set contains three Gaussian clusters; however, the exact number of clusters is not obvious from the raw data. Using the HSC algorithm, the exact number of clusters, which is expected to contain no spurious clusters, is determined. Figure 3b shows the contour of the final Gaussian smoothing result of the original data intensity, and Figure 3c plots the three elliptical clusters, corresponding to the principal directions. The locations of the three corresponding centers are constructed. Note that the proposed clustering algorithm defines the clusters and the corresponding


Figure 3. The application example of the HSC algorithm: (a) 300 data points represented by three distribution groups; (b) three exact clusters extracted by HSC; (c) each circle corresponding to the principal directions; (d) estimated probability density function.

Figure 5. The application example of regrouping: (a) historical data set represented by four distribution groups; (b) new data set represented by three distribution groups; (c) combinational plot for new and historical data sets; (d) merging similarity clusters from seven to six groups.

centers based on the density of the collected data. New centers will be generated if more data are accumulated around some region. In Figure 3c, the ellipsoids are plotted at two standard deviations in each direction. They indicate the locations of the clusters, not the limit bounds. After the clustering process is done, the mesh plot in Figure 3d shows the estimated density function of all data points toward the selected cluster centers. Unlike the proposed algorithm, the popular FCM algorithm preassigns either two or three clusters for the same data set (Figure 4, parts a and b). In this case, nothing ensures that FCM converges to an optimum solution. Accordingly, the performance depends entirely on the initial number of clusters selected. The characteristics of HSC will be further demonstrated in example 1.

Figure 4. The application example of the FCM algorithm: (a) data clustered by two groups; (b) data clustered by three groups. The solid points represent the found cluster center points.

One can also apply HSC to a more than two-dimensional system by projecting the data set onto the first two principal components to find the local maximum points without much computational load in the multidimensional smoothing transform. Rough values of the centers are thus obtained. Then, FCM can be applied to fine-tune the cluster centers based on the initial selection by HSC. Therefore, using HSC can overcome the problem of inappropriate initial positions in FCM.
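The following sketch (not from the paper) indicates how the HSC loop of steps 0-5 might be coded. Approximating the data intensity h(x,y) by a two-dimensional histogram, smoothing it with SciPy's Gaussian filter, and detecting peaks with a local-maximum filter instead of the zero-crossing test on Hx and Hy are all assumptions of this sketch; the membership assignment follows eq 9.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def hsc_cluster_count(scores, bins=50):
    """Heuristic smoothing clustering on 2-D score data (steps 0-5, sketched).

    scores : (m, 2) array of the data projected onto the first two
             principal directions.
    Returns the peak (cluster-center) bin indices and the final sigma.
    """
    h, xedges, yedges = np.histogram2d(scores[:, 0], scores[:, 1], bins=bins)
    sigma, p_prev = 1.0, np.inf               # step 1
    while True:
        H = gaussian_filter(h, sigma)         # eq 7 on the binned intensity
        # local maxima of H stand in for the zero-crossing test (assumption)
        peaks = (H == maximum_filter(H, size=3)) & (H > 0)
        coords = np.argwhere(peaks)
        p = len(coords)                       # step 3
        if p >= p_prev or p <= 1:
            return coords, sigma
        # step 4: sigma^2 <- largest nearest-neighbor distance between peaks
        d = max(np.min(np.linalg.norm(coords - ck, axis=1)
                       [np.any(coords != ck, axis=1)]) for ck in coords)
        sigma, p_prev = np.sqrt(d), p         # step 5

def hsc_partition(X2d, centers):
    """Eq 9: mu[j, i] is the degree of belonging of point j to cluster i."""
    d2 = ((X2d[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    mu = 1.0 / np.maximum(d2, 1e-12)
    mu /= mu.sum(axis=1, keepdims=True)       # normalize over the K clusters
    return mu, mu.argmax(axis=1)              # memberships and hard labels
```

Converting the returned bin indices to coordinates through the histogram edges, and fine-tuning the centers with FCM as described above, are left out for brevity.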

4. Similarity and Regrouping of Clusters

To build a good monitoring system using the proposed method, it is important to consistently update the system model for new data sets. Some process operating conditions selected later (Figure 5b) may have nothing to do with the initial historical data sets (Figure 5a). Hence, updating procedures for the new data sets should be included. The implementation can be explained by the following two strategies: similarity measuring and merging. The similarity measure is a tool used to decide the degree of similarity between two groups. If the similarity degree is high enough, a merging stage is needed to reduce the total number of clusters. Figure 5c indicates that the data distribution of cluster 3 in the historical data is quite close to that of cluster 2 in the new data. According to the similarity measure, cluster 2 in the new data set and cluster 3 in the old data set can be considered as the same group. Therefore, merging comes into play to join them together. Figure 5c shows that there are seven clusters (i.e., three new and four old ones) before merging takes place. After the clusters are regrouped, cluster 2 in the new data set and cluster 3 in the old data set merge together and yield six clusters (Figure 5d). The new parameters of the combined clusters are calculated on the basis of the parameters of all clusters (S and c) without rerunning the whole data sets. These steps are explained in the following subsections.

4.1. Similarity Measure. Many papers have elaborated on similarity measures from different viewpoints.23-25 However, similarity between any two groups depends on the subjective point of view of different observers and/or on the emphasis given to different members of the groups. For transparency and efficiency of this system, we consider that a distance measure should

be explicitly included. Besides, the similarity measure should take into account the covariance of each cluster, which determines the degree of sharing of each point among all clusters. The similarity measure between cluster i and cluster j can be defined as

SIM_{ij} = 1 - \frac{d_c(C_i, C_j)}{\left[\sum_{k=1}^{n} \left(\max_{1 \le l \le m}\{x_{lk}\} - \min_{1 \le l \le m}\{x_{lk}\}\right)^2\right]^{1/2}}    (10)

where the denominator on the right-hand side denotes the Euclidean distance between the minimum and the maximum values of each input variable k over all the data of the operating region, and the numerator is a metric dc for clusters Ci and Cj, which is defined as

d_c^2(C_i, C_j) = \|c_i - c_j\|^2 + \mathrm{tr}\!\left((S_i - S_j)^T (S_i - S_j)\right)    (11)

where the first term on the right-hand side is the Euclidean distance used as the dissimilarity measure between the two cluster centers ci and cj. Because similarity cannot be determined only from the distance between cluster centers, the second term is used to account for the size difference between the two clusters. The notation "tr" represents the trace. From eq 10, it is clear that the similarity degree of cluster i and cluster j over all the data sets falls into 0 < SIMij ≤ 1. When SIMij = 1, cluster i and cluster j are identical. When SIMij → 0, cluster i and cluster j are disjoint. The data in Figure 5c are evaluated with this method, and the results are listed in Table 1. Cluster 3 in the original data and cluster 2 in the new data have relatively strong similarity compared with the other cluster pairs. Thus, a threshold value SIMthr is needed to determine the merging limit.

Table 1. Similarity Measure for Two Groups of Data Distribution

Sij          old data
new data     C1        C2        C3        C4
C1           0.7319    0.7593    0.6815    0.5299
C2           0.7415    0.5964    0.9364    0.8609
C3           0.4298    0.8108    0.5515    0.5926

4.2. Merging. A new technique is developed to merge and analyze the grouped data sets in multidimensional space based on the metric dc. Let C = {C1, ..., CK} be a set of K clusters and c = {c1, ..., cK} be the set of the corresponding cluster centers. Because of the removing and/or merging of some of the clusters, the new K′ clusters can be derived without rerunning the whole data sets. Like the cost function of FCM, the generalized cost function for the proposed method can be defined by

J(U, C_1, ..., C_{K'}) = \sum_{i=1}^{K'} J_i = \sum_{i=1}^{K'} \sum_{j=1}^{K} \mu_{ij}^m \, d_c^2(C_i, C_j)    (12)

where 0 ≤ µij ≤ 1 represents the degree of belonging between the new cluster Ci and the old cluster Cj, and m is a weighting exponent. To optimize J with respect to all its input arguments, we use the Lagrange multiplier technique and obtain the following update equations:

c_i = \frac{\sum_{j=1}^{K} \mu_{ij}^m c_j}{\sum_{j=1}^{K} \mu_{ij}^m}, \qquad S_i = \frac{\sum_{j=1}^{K} \mu_{ij}^m S_j}{\sum_{j=1}^{K} \mu_{ij}^m}

\mu_{ij} = \left(\sum_{k=1}^{K'} \frac{d_c^2(C_i, C_j)}{d_c^2(C_k, C_j)}\right)^{-1/(m-1)}, \quad i = 1, ..., K'; \; j = 1, ..., K    (13)

The above equations are referred to as a generalized FCM algorithm. When Sj becomes an identity matrix, they reduce to the exact FCM algorithm. The merging scheme considers information about the shape and the size of each cluster through the covariance matrices. It reorganizes the model architecture when new features are incorporated, and the number of clusters can be reduced for some tasks. The dense regions shown in Figure 5c are partly shared by one or more clusters. Figure 5d shows the initial clusters (K = 7) of the two distribution data sets merged by applying the merging scheme with the number of clusters K′ specified to be 6, merging cluster 3 in the old data with cluster 2 in the new data.
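A minimal sketch of how eqs 10-13 might be implemented is given below (not part of the paper). The reading of the eq 13 exponent, the initialization of the K′ prototypes from the first K′ old clusters, and the fixed iteration count are assumptions of this sketch; only the stored cluster parameters (c, S) are used, so the raw data are never revisited, as intended by the merging scheme.

```python
import numpy as np

def dc2(ci, Si, cj, Sj):
    """Eq 11: squared cluster-to-cluster distance from centers and covariances."""
    D = Si - Sj
    return float(np.sum((ci - cj) ** 2) + np.trace(D.T @ D))

def similarity(ci, Si, cj, Sj, X):
    """Eq 10: similarity of two clusters, scaled by the range of all data X."""
    span = np.sqrt(np.sum((X.max(axis=0) - X.min(axis=0)) ** 2))
    return 1.0 - np.sqrt(dc2(ci, Si, cj, Sj)) / span

def merge_clusters(centers, covs, K_new, m=2.0, n_iter=50):
    """Generalized FCM merging of eqs 12-13 on stored (c, S) pairs only.

    centers : (K, n) array of old cluster centers
    covs    : (K, n, n) array of old cluster covariance matrices
    K_new   : number of clusters after merging (K' in the text)
    """
    centers, covs = np.asarray(centers), np.asarray(covs)
    K = len(centers)
    C = [centers[i].copy() for i in range(K_new)]     # assumed initialization
    S = [covs[i].copy() for i in range(K_new)]
    for _ in range(n_iter):
        d = np.array([[max(dc2(C[i], S[i], centers[j], covs[j]), 1e-12)
                       for j in range(K)] for i in range(K_new)])
        # eq 13: membership of old cluster j in new prototype i
        mu = np.array([[(d[i, j] / d[:, j]).sum() ** (-1.0 / (m - 1.0))
                        for j in range(K)] for i in range(K_new)])
        w = mu ** m
        # eq 13: prototype centers and covariances as weighted means
        C = [(w[i][:, None] * centers).sum(axis=0) / w[i].sum()
             for i in range(K_new)]
        S = [(w[i][:, None, None] * covs).sum(axis=0) / w[i].sum()
             for i in range(K_new)]
    return np.array(C), np.array(S), mu
```

Cluster pairs whose similarity exceeds the chosen SIMthr would be candidates for merging, after which merge_clusters can be called with a correspondingly reduced K_new.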

5. Examples

To test our algorithm, three examples are given here. The benchmark data sets of the first two examples are created artificially. They are two-dimensional and three-dimensional systems, respectively, so visualization is very easy in these cases. The last example focuses on the Tennessee Eastman process problem, a more challenging testbed for process monitoring and detection.

5.1. Example 1: Two-Dimensional Problem. A total of 1000 data points uniformly distributed within a doughnut-shaped region is generated (Figure 6). Like the flower problem, a standard situation in fault diagnosis,9,26-27 this problem contains two different regions: the center and annular regions, which represent normal operation and various fault classes, respectively. The aim of this problem is to show how MixPCA, with the clustering technique HSC, quickly extracts the proper nodes to construct a network without preassigning their number. Since no prior information about the data set is available as to the correct number of clusters or separable boundaries, traditional methods must somehow rely on observation of the data. However, the data illustrated in Figure 6a show that the pattern classes are not easily discernible, so only a trial-and-error procedure for selecting the number of clusters is possible in traditional learning. Using the proposed method, 49 clusters are obtained initially (Figure 7a). After the spread parameter is updated, the number of clusters is reduced to nine (Figure 7b). Finally, six possible positions of the cluster centers are computed (Figure 7c).


Figure 6. The application of data clustering in example 1: (a) 1000 data points uniformly distributed in an annular region; (b) final configuration of the elliptic functions where the bold lines correspond to the principal directions.

Figure 8. Clustering for the historical data set in example 2: (a) 200 data points drawn from the historical records; (b) two clusters extracted by HSC; (c) final configuration of the elliptic functions where the bold lines correspond to the principal directions.

Figure 7. The contour of H(x,y) by HSC in example 1: (a) 49 cluster points are located in the first iteration (next σ = 3.2818); (b) nine clusters are found in the second iteration (next σ = 5.2780); (c) six clusters are set up at the end (next σ = 5.8721). The cluster centers are represented by the cross points.

The number of clusters converges quickly in only three iterations. Figure 7 shows that the contour of H(x,y) for the data intensity becomes smoother as the number of clusters is reduced (the spread parameter σ changes from 1 to 5.8721). Therefore, HSC is capable of correctly classifying all the data without tedious trial-and-error procedures. The MixPCA model, with the six elliptical components shown in Figure 6b, is fitted to these data.

5.2. Example 2: Three-Dimensional Problem. The artificial data set in the second example consists of two clusters. Each cluster has 100 data points. This classification problem is a little more complicated than example 1, since the data clusters lie in a three-dimensional space (Figure 8a). The aim of this problem is to show that the proposed method can be applied to a multidimensional input space without any difficulty.

Before the application proceeds, the raw data are projected onto their first two principal directions to get their scores. Using HSC, two clusters are found. Figure 8b indicates that the clusters appear in two distinct

Figure 9. Clustering for historical data set in example 2: (a) 400 data points drawn from the historical (circle) and new (triangle) records; (b) two old clusters and two new clusters; (c) final configuration of clusters after the similar clusters are merged.

regions in the plane defined by the first two latent variables. The HSC results are then reprojected back to the original space to find the estimated clusters (Figure 8c). FCM can further refine the clusters using these initial values. In fact, the centers estimated by HSC are so accurate that no refining is necessary in our practical study.

To test the updating technique, another set of data containing two clusters is added. Each cluster also has 100 data points. The data set is displayed in Figure 9a. The similarity measure is used as a criterion for determining which clusters should be merged together. The results are shown in Table 2. Figure 9b indicates that a dense region is shared by cluster 1 of the historical data and cluster 1 of the new data. From Table 2, the degree of similarity of these two clusters is relatively high (above the threshold SIMthr = 0.85). Under such circumstances, the merging algorithm (eq 13) is applied to join these two similar clusters together (Figure 9c).


Figure 10. The data set for the normal condition (circle) and two newly added fault conditions (square).

Figure 11. Q and T2 plots for fault 1 projected onto each node of the mixture PCA model with 95% (dashed line) and 99% (solid line) confidence limits.

Table 2. Similarity Measure for Two Groups of Data Distribution

Sij          old data
new data     C1        C2
C1           0.8537    0.5954
C2           0.6133    0.5470

Table 3. Confidence Control Limits for Q and T2

          node 1    node 2    node 3
Q0.95     0.0153    1.1827    0.0012
Q0.99     0.0268    2.0686    0.0022
T20.95    6.2378    6.1170    6.2437
T20.99    9.4602    9.4131    9.4611

After the MixPCA model is constructed, two small changes are made in the normal data sets. These changes can be treated as fault conditions for this system. One hundred data points are generated again for each fault condition. Figure 10 shows the original data sets for the normal conditions plus the two newly added fault conditions. It is obvious that fault 1 is far away from the normal regions. However, it is hard to distinguish visually between the normal condition (C3) and fault 2. The control limits, Tα2 and Qα, on the normal operating data of each node are listed in Table 3. The results of each PCA node shown in Figure 11 indicate that, after 100 data points of fault 1 are added, something has changed in the system. All Q and T2 values exceed the 95% and 99% control limits. (Note that in Figures 11 and 12, the dashed line is the 95% confidence level and the solid line is the 99% confidence level.) In other words, the values of Q and T2 suggest that fault 1 is alien to the normal condition of the system. As for fault 2 (Figure 12), the Q plot for node 3 indicates that the process is out of control, although the T2 plot for node 3 does not. As Q is a measure of the goodness of fit of the data to the model, the Q plot in Figure 12 suggests that fault 2 could be the process drifting away from the original system.

5.3. Example 3: Tennessee Eastman Process Control Problem. The Tennessee Eastman process control problem (TEPCP)28 is used here to evaluate the proposed monitoring technique. There are 41 measurements and 12 manipulated variables from the controller. To illustrate our procedure and make a fair comparison with other techniques,30 such as PCA and NLPCA, the

Figure 12. Q and T2 plots for fault 2 projected onto each node of the mixture PCA model with 95% (dashed line) and 99% (solid line) confidence limits.

simulation data for the fault and the normal conditions are generated with the base control structure on the basis of McAvoy and Ye's work.29 For the normal operating conditions, the two disturbances considered are the A/C ratio and the B composition in the C feed. All measurements contain noise. The parameters and values of the operating conditions and the control strategy are based on Downs and Vogel's Tennessee Eastman problem. In this problem, only one fault, a kinetic drift, is used to verify the accuracy of MixPCA. According to McAvoy's work,30 the processes operated at two G:H mass ratio conditions are considered as normal conditions. Note that the process runs in different modes, so the relationship between the process variables would be nonlinear. The traditional PCA is less likely to obtain good results because it utilizes one single PCA model to cover the whole process running in different modes. Six process variables, the A feed flow rate, C feed flow rate, reactor temperature, purge flow rate, product separator underflow rate, and separator


Figure 13. Two clusters determined by HSC in two operating modes of TEPCP.

Table 4. Percentage of Variance Captured by Each PCA Node

no. of principal components    node 1    node 2
1                              37.40     49.93
2                              66.65     75.00
3                              80.94     85.10
4                              90.15     91.44
5                              97.02     97.16
6                              100.00    100.00

cooling water outlet temperature, are collected from normal operation. The MixPCA model is developed using 200 samples from the normal data. The proposed HSC method clearly shows that the first two score data sets contain two Gaussian clusters of various sizes and orientations (Figure 13). Two cluster nodes are needed to construct the model.

As shown in Table 4, four principal components capture nearly 90% of the variation, or information, of the data set in each node. Thus, the whole data set in the six original variables can be decomposed into two subclasses with four combinations of variables with very little loss of information. The control limits, Tα2 and Qα, on the normal operating data of each node are listed in Table 5.

A total of 40 data points with kinetic drift are selected from the two different G:H mass ratios for testing. Because the aim of process monitoring is to detect an abnormal situation as early as possible, the data of the initial period, which have just drifted away from the target, are used. The data are projected (or fed) into MixPCA. The Q residual plots for the 50:50 and 90:10 G:H mass ratios are shown in Figures 14a and 15a, respectively. It is evident that the residuals have increased remarkably. The T2 plots shown in Figures 14b and 15b also indicate that the process is out of the normal condition.

The results of this example indicate that MixPCA and NLPCA have comparable capability because both of them can capture the nonlinear relationship between the process variables. The research of McAvoy et al. developed NLPCA, whose nonlinear principal curves can explain the data variation. Although two nonlinear principal components are extracted to explain 98.73% of the data variance, NLPCA usually requires more computation and inevitably risks convergence to a local minimum when training the network. As opposed to their optimization method, under the same conditions our design procedure still yields very good results. On the basis of the data density distribution and the spirit of the concept of PCA, without training a network, the system is composed of two small cluster nodes. Since the least-squares estimation method in PCA computation is used to determine the parameters, the computational load is significantly reduced. Furthermore, the traditional T2 and Q charts in PCA can still be used here to explain the major variation of the data and the random noise in the data, respectively. Also, from the viewpoint of time series, the moving window method can still be applied. The same procedures in the proposed method can be used except that the data matrix is composed of

Figure 14. The disturbance data (G:H = 50:50) in two PCA nodes: (a) T2 plots and (b) Q plots. Both contain 95% (dashed line) and 99% (solid line) confidence limits.


Figure 15. The disturbance data (G:H = 90:10) in two PCA nodes: (a) T2 plots and (b) Q plots. Both contain 95% (dashed line) and 99% (solid line) confidence limits.

Table 5. Confidence Control Limits for Q and T2 in TEPCP

          node 1     node 2
Q0.95     0.6340     0.7690
Q0.99     1.0444     1.2483
T20.95    10.1514    10.1514
T20.99    13.9600    13.9600

time-lagged variables. A detailed discussion of the use of time-shifted duplicate variables can be found in the work on dynamic PCA.6,31
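As a small illustration of that remark (again not from the paper), a time-lagged data matrix could be assembled as follows before applying the same MixPCA procedure:

```python
import numpy as np

def lagged_matrix(X, lags=2):
    """Stack each sample with its `lags` previous samples (dynamic-PCA style).

    X : (m, n) data matrix ordered in time.  Returns an (m - lags, (lags+1)*n)
    matrix whose rows are [x_t, x_(t-1), ..., x_(t-lags)].
    """
    m, _ = X.shape
    return np.hstack([X[lags - k: m - k] for k in range(lags + 1)])
```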

6. Conclusions

The proposed combinational method, MixPCA, takes advantage of both PCA and HSC. PCA contributes a lot to compressing noise and extracting the process features to form a simpler and smaller informative subspace for the measured data sets, so it is suitable for multivariable process monitoring and detection. The HSC clustering algorithm, based on the Gaussian filter, is used to categorize the data distribution and extract the unknown number of clusters. Therefore, to enhance the detecting ability of PCA in real-life situations, one may incorporate HSC into PCA to get precise and fast classification of the input variables. The proposed clustering method, HSC, additionally gives a proper number of clusters without much iteration, according to the data intensity distribution. Therefore, HSC often results in a better representation of the structures inherent in the data than FCM clustering, which finds clusters with a preassigned number of clusters. In addition to HSC, MixPCA includes the traditional analysis tools of PCA, T2 and Q, which characterize the normal space, and applies them to each node directly. Besides, the merging technique, a generalized FCM, combines clusters on the basis of information about the shape and the size of each cluster through the covariance matrices calculated in each cluster, without recomputing all the data.

The testing results of this method based on the recorded data of the extensive examples are favorable for monitoring abnormal events through training and self-adjustment when new event knowledge is added. Besides, the proposed technique brings the following advantages for detection and monitoring systems: (1) simple and easy application; (2) no need for complex and deep domain or process knowledge; (3) good ability to handle nonlinear problems; (4) never forgetting past learning patterns or features; (5) updating of new knowledge or process correlations without retraining on all data sets.

Acknowledgment

The authors would like to thank the reviewers for their helpful suggestions on revising this paper. Also, this work is supported by the National Science Council, Republic of China, under Grant NSC-87-2218-E033-012, and by the Ministry of Economic Affairs of the Republic of China.

Nomenclature

H(x,y) = Gaussian smoothing of h(x,y)
Hx(x,y) = partial derivative of H(x,y) with respect to x
Hy(x,y) = partial derivative of H(x,y) with respect to y
J = cost function of the merging group
P = loading matrix
Qα = confidence limit corresponding to the (1 - α) percentile
S = covariance matrix of data
Sh = covariance matrix of cluster h
SIMij = similarity measure between cluster i and cluster j
Tα2 = confidence limit corresponding to the (1 - α) percentile
dc2 = distance measure for two clusters
h(x,y) = two-dimensional signal
x = process measurement vector
c = cluster center vector
ch = cluster center vector of cluster h

Greek Symbols

Ωi = set of data which has a maximum degree of belonging to a given cluster i
Λ = diagonal matrix with eigenvalues on the diagonal
µij = degree of association of the jth data point with the ith cluster, or degree of association of the jth cluster with the ith cluster
φ(x) = Gaussian activation function

Literature Cited (1) Kwan, H. K.; Cai, Y. A Fuzzy Neural Network and Its Application to Pattern Recognition, IEEE Trans. Fuzzy Syst. 1994, 2, 185. (2) Newton, S. C.; Mitra, S. Adaptive Fuzzy System for Control and Clustering of Arbitrary Data Patterns. In Proceedings of the IEEE International Conference on Fuzzy Systems; IEEE: Piscataway, NJ, 1992; p 363. (3) Abe, V. U.; Lan, M. S. A Neural-Network-Based Fuzzy Classifier. IEEE Trans. Syst. Man Cybernet. 1995, 25, 353. (4) Jackson, J. E. A User’s Guide to Principal Components; Wiley: New York, 1991. (5) Wise, B. M.; Ricker, N. L.; Veltkamp, D. J.; Kowalski, B. R. A Theoretical Basis for the Use of Principal Components Method for Monitoring Multivariate Processes. Process Control Qual. 1990, 1, 41. (6) Kresta, J. V.; MacGregor, J. F.; Marlin, T. E. Multivariate Statistical Monitoring of Process Operating Performance. Can. J. Chem. Eng. 1991, 69, 35. (7) Kourti, T.; MacGregor, J. F. Multivariate SPC Method for Process and Product Monitoring. J. Qual. Technol. 1996, 28, 409. (8) Hoskins, J.; Kaliyur, K.; Himmelblau, D. Fault Diagnosis in Complex Chemical Plant Using Artificial Neural Networks. AIChE. J. 1991, 37, 137. (9) Kavuri, S.; Venkatasubramanian, V. Representing Bounded Fault Classes Using Neural Networks with Ellipsoidal Activation Functions. Comput. Chem. Eng. 1993, 17, 139. (10) Kavuri, S.; Venkatasubramanian, V. Using Fuzzy Clustering with Ellipsoidal Units in Neural Networks for Robust Fault Classification. Comput. Chem. Eng. 1993, 17, 765. (11) Bernieri, A.; Betta, G.; Pietrosanto, A.; Sansone, C. A Neural Network Approach to Instrumental Fault Detection and Isolation. IEEE Trans. Instrum. Meas. 1995, 44, 747. (12) Rengaswamy, R.; Venkatasubramanian, V. A Syntactic Pattern Recognition Approach to Process Monitoring and Fault Diagnosis. Eng. Appl. Artif. Intell. J. 1995, 8, 35. (13) Chen, J.; Bruns, D. D. WaveARX Neural Network Development for System Identification Using a Systematic Design Synthesis. Ind. Eng. Chem. Res. 1996, 34, 4420. (14) Poggio, T.; Girosi, F. Networks for Approximation and Learning. Proc. IEEE 1990, 78, 1481. (15) Johnston, L. P. M.; Kramer, M. A. Probability Density Estimation Using Elliptical Basis Functions. AIChE J. 1994, 40, 1639.

(16) Parzen, E. On Estimation of Probability Density Function and Mode. Ann. Math. Stat. 1962, 33, 1065. (17) Lowe, D. Adaptive Radial Basis Function Nonlinearities, and the Problem of Generalisation. Proceedings of the First IEE International Conference on Artificial Neural Networks, IEE: London, UK, 1989; pp 171-175. (18) Krishnaiah, P. R., Kanal, L. N., Eds.; Classification, Pattern Recognition and Reduction of Dimensionality; Handbook of Statistics; North-Holland: Amsterdam, The Netherlands, 1982; Vol. 2. (19) Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, 1981. (20) Frigui, H.; Krishnapuram, R. A Robust Algorithm for Automatic Extraction of an Unknown Number of Clusters from Noisy Data. Pattern Recognit. Lett. 1996, 17, 1223. (21) Chen, J.; Wong, D. S. H.; Jang, S.-S.; Yang, S.-L. Product and Process Development Using Artificial Neural Network Model and Information Theory. AIChE J. 1998, 44, 876. (22) Lin, H.-C.; Wang, L.-L.; Yang, S.-N. Automatic Determination of the Spread Parameter in Gaussian Smoothing. Pattern Recognit. Lett. 1996, 17, 1247. (23) Liu, X. Entropy, Distance Measure and Similarity Measure of Fuzzy Sets and Their Relations. Fuzzy Sets Syst. 1992, 52, 305. (24) Pappis, C. P.; Karacapilidis, N. I. A Comparative Assessment of Measures of Similarity of Fuzzy Values. Fuzzy Sets Syst. 1993, 56, 171. (25) Hyung, L. K.; Song, Y. S.; Lee, K. M. Similarity Measure between Fuzzy Sets and between Elements. Fuzzy Sets Syst. 1994, 62, 291. (26) Vaidyanathan, R.; Venkatasurbramanian, V. Representing and Diagnosing Dynamic Process Data Using Neural Networks. Eng. Appl. Artif. Intell. J. 1992, 5, 11. (27) Vaidyanathan, R.; Venkatasurbramanian, V. On the Nature of Fault Space Classification Structure Developed by Neural Networks. Eng. Appl. Artif. Intell. J. 1992, 5, 289. (28) Downs, J. J.; Vogel, E. F. A Plant-wide Industrial Process Control Problem. Comput. Chem. Eng. 1993, 17, 245. (29) McAvoy, T. J.; Ye, N. Base Control for the Tennessee Eastman Problem, Comput. Chem. Eng. 1994, 18, 383. (30) Dong, D.; McAvoy, T. J. Nonlinear Principal Component Analysis - based on Principal Curves and Neural Networks. Comput. Chem. Eng. 1996, 20, 65. (31) Ku, W.; Storer, R. H.; Georegakis, C. Disturbance Detection and Isolation by Dynamic Principal Component Analysis. Chem. Intell. Lab. Syst. 1995, 30, 179.

Received for review September 8, 1998 Revised manuscript received December 11, 1998 Accepted December 14, 1998 IE980577D