3822
Ind. Eng. Chem. Res. 2002, 41, 3822-3838
Pattern Matching in Multivariate Time Series Databases Using a Moving-Window Approach Ashish Singhal† and Dale E. Seborg* Department of Chemical Engineering, University of California, Santa Barbara, California 93106
A novel methodology is proposed for matching patterns in time-series databases based on unsupervised learning and multivariate statistical techniques. The new approach provides a preliminary screening of large amounts of historical data in order to generate a candidate pool of similar periods of operation. This much smaller number of records can then be further evaluated by someone familiar with the process. A new distance similarity factor is proposed that complements the standard principal component analysis similarity factor. The two similarity factors and a moving-window approach are used to characterize the degree of similarity between the current period of interest and windows of historical data. A simulation case study has demonstrated that the proposed pattern matching technique can successfully distinguish between 28 operating conditions for a variety of disturbances, faults, and process changes. 1. Introduction Enormous amounts of data are routinely collected and stored for a wide variety of technical and business activities. In industrial plants, thousands of process variables are measured and recorded on a frequent basis, as often as every second. Product quality, production, and maintenance information are recorded less frequently and typically are stored in different databases. Thus, massive amounts of plant data are available for purposes of analysis and diagnosis. Data collection has become a mature technology, but the analysis of historical databases is an area of active research.1-3 A historical database contains potentially valuable process information, but it is notoriously difficult to extract. Manufacturing databases have, therefore, been called “data rich but information poor”. In recent years, the terms data mining and knowledge discovery have been used for a variety of activities, conferences, and commercial products that pertain to the analysis of large databases.1,2 However, relatively few engineering applications have been reported. Furthermore, the challenging and important problem of matching patterns in multivariate time-series data has received little attention. This paper is concerned with a general pattern matching problem: given an arbitrary set of multivariate time-series data, how can similar patterns be located in large historical databases? For an arbitrary data set, it cannot be assumed that either training data or a process model is available. A novel pattern matching methodology is proposed based on unsupervised learning and multivariate statistical techniques. Because the proposed methodology is quite general, it is applicable to a wide variety of pattern matching problems that include (i) abnormal situation analysis for plant operations, (ii) early detection of plant maintenance problems, and (iii) evaluation of business data such as marketing and sales data. 1.1. Pattern Matching in Time-Series Data. It is convenient to divide pattern recognition techniques into * Corresponding author. E-mail: seborg@engineering. ucsb.edu. † E-mail:
[email protected].
two categories: supervised and unsupervised learning techniques. Supervised learning techniques require training data for each type of pattern that is being considered. If training data are available, then wellknown pattern recognition techniques can be employed such as neural networks, statistical approaches, and rule-based systems.4-8 However, for the general pattern matching problem considered in this paper, it would be unduly restrictive to assume that previous occurrences of the current (arbitrary) pattern are available to serve as training data. For this reason, only unsupervised learning methods are considered in this paper. The survey paper by Davis et al.9 provides a comprehensive review of both supervised and unsupervised methodologies for data analysis including statistical techniques, neural networks, and knowledge-based systems. Next, we consider a variety of methods that can be used for pattern matching. Neural network techniques such as adaptive resonance theory and self-organizing maps7,10 have been successfully employed for pattern recognition and clustering, but they require considerable computational effort for large problems such as matching multivariate time-series patterns. Wavelets and sets of “shape primitives” have been used for trend identification of signals.11,12 For example, Wong, et al.12 have used fuzzy shape primitives, hidden Markov models, and neural networks for classification of abnormal data for a small number of process variables. However, this shape primitives approach does not capture the behavior of noisy multivariate time-series data, especially when the correlation between the process variables is high and the dimensionality of the data set is large. Kermit and Eide13 used the Haar wavelet transform and neural networks to classify different signals containing auditive information based on the capture of small signal segments present in specific types of sound. Their approach, known as the O-algorithm, compared segments from candidate audio signals against predefined templates stored in the network. Last et al.14 used signal processing techniques and the information-theoretic fuzzy approach to extract rules from historical data. Clustering techniques have been widely used for unsupervised pattern recognition.15 For example, Wang and McGreavy16 used clustering methods to classify
10.1021/ie010517z CCC: $22.00 © 2002 American Chemical Society Published on Web 04/03/2002
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3823
abnormal behavior of a fluid catalytic cracking process. Clustering of multivariate time-series data is an important problem for batch processes and also for continuous processes when the start and end times of abnormal situations are known a priori. However, clustering techniques are not promising for large databases of multivariate time-series data, where process dynamics blur the distinction between clusters. If a process model is available, pattern recognition can be based on simulated response data as well as experimental data. For example, the model can be used to simulate abnormal or fault situations that are not present in the available database. Then pattern classificaton can be based on the analysis of model residuals as well as process data. Basseville17 and Basseville and Nikiforov18 describe a variety of methods for detecting changes based on multivariate statistics and time-series models. However, these methodologies become increasingly complex for high-dimensional historical data. Furthermore, the assumption that an accurate process model is available is quite restrictive. A number of techniques have been developed for pattern matching in univariate time-series data.19-21 These methods mainly rely on the calculation of Euclidean distances of transformed data (such as discrete Fourier transforms). For multivariate applications, if training data are available, supervised pattern classification or fault diagnosis can be based on multivariate statistical techniques such as principal component analysis (PCA).22-24 Thus, PCA can also be used for fault diagnosis, as well as fault detection.22,25-28 For the present research on pattern matching using similarity factors, the closest literature references are the papers by Raich and C¸ inar.28-31 They developed innovative methods for discriminating between different types of faults using standard PCA metrics and the PCA similarity factor. Their approach relies on the building of PCA models using representative data for each type of disturbance or fault that is to be considered. Our methodology differs from that of Raich and C¸ inar30,31 because we are concerned with pattern matching rather than fault diagnosis. Consequently, we do not assume that known fault patterns are available a priori. Also, Raich and C¸ inar30,31 calculate PCA metrics for each sample in the current data set while our methodology is based on the matching of two sets of data and characterization of their degree of similarity. Furthermore, our proposed pattern matching strategy requires neither training data nor a priori knowledge. Similarity factors are also useful for pattern matching because they provide a metric for comparing two sets of data. Johannesmeyer et al.32 proposed a pattern matching strategy based on PCA similarity factors and a new similarity factor for alarm summaries. Kano et al.27,33 used dissimilarity factors for process monitoring based on eigenvalue decompositions of the data sets. Their dissimilarity factor27 can also be used for comparing data sets and is, therefore, applicable to pattern matching. This paper introduces a new distance similarity factor to characterize the distance between the subspaces spanned by two data sets. The standard PCA similarity factor and the new distance similarity factor provide the basis for the proposed pattern matching strategy. In a related paper by Johannesmeyer et al.32 and an earlier version of this paper by Singhal and Seborg,34 the authors assumed that the start and end times of the
different operating periods in the historical data were known a priori. In the present paper, a moving-window approach is proposed to locate similar periods of operation in a large historical database without knowledge of the start and end times. 2. Pattern Matching Using Existing Methodologies Pattern matching in historical data involves comparisons of two different data sets. Suppose that it is desired to find historical data sets that are similar to a particular data set of current interest. This latter data set will be referred to as the snapshot data (or template data). This section reviews standard PCA techniques that can be used to compare snapshot data with windows of historical data. 2.1. PCA Similarity Factors. PCA is a multivariate statistical technique which calculates the principal directions of variability in data and transforms a set of correlated variables into a new set of uncorrelated variables.35,36 The new uncorrelated variables are linear combinations of the original variables. PCA is widely used for monitoring of multivariable processes. The survey papers by Davis et al.,9 Kourti and MacGregor,37 and Zhang et al.24 and books by Jackson,35 Wang,38 and Chiang et al.39 describe the PCA methodology and its applications to a wide variety of processes. Consider an m × n data matrix, X, for n process variables with m measurements of each variable. Assume that the data for each variable have been scaled to zero mean and unit variance. Data matrix X can be expanded using principal components pi, score vectors ti, and a residual matrix, E:
X ) t1p1T + t2p2T + ... + tkpkT + E
(1)
Each of the k principal components is an eigenvector of Σ, the covariance matrix of X,
Σpi ) λipi
i ) 1, 2, ..., k
(2)
where λi is the ith eigenvalue of the covariance matrix
Σ)
XTX m-1
(3)
The principal components are numbered so that p1 corresponds to the largest eigenvalue and pk the smallest. The variability in the data matrix can be related to the principal components because the variability explained by a principal component is proportional to its eigenvalue.35 Although a variety of methods are available for choosing k, the number of principal components,35,40 we will use the simple method of selecting the number of principal components that describe 95% of the variance in the data. This choice facilitates fast computations for large databases. The first step in PCA process monitoring is the development of a PCA model using data that correspond to “normal operation”. This model can then be used for subsequent monitoring in the following manner. Consider a new 1 × n measurement vector at time j, x(j), that is projected onto the PCA model to give the corresponding 1 × k score vector, t(j):
t(j) ) x(j) P
(4)
3824
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
where P ) [p1, p2, ..., pk]. The elements of t(j) and the principal components can then be used to provide an estimate of the current data point
xˆ (j) ) t(j) PT
(5)
The difference between this estimate and the actual measurement vector, x(j), is the PCA model error or residual, e(j):
e(j) ) x(j) - xˆ (j)
(6)
Process monitoring based on PCA typically involves calculation of two statistics, the Q statistic and Hotelling’s T 2 statistic.41,42 The Q statistic at time j is calculated from the current residual:
Q(j) ) e(j) e(j)T
(7)
It provides a measure of how well the new measurement is described by the PCA model. Hotelling’s T 2 statistic can be calculated from
T 2(j) ) t(j) Λ-1tT(j)
(8)
where Λ is the diagonal eigenvalue matrix with k elements, λi, defined in eq 2. The T 2 statistic provides a measure of the variation within the PCA model for each measurement, x(j). Confidence limits for the Q and T 2 statistics43 can be calculated based on the assumption that the residuals in eqs 1 and 6 are independent and identically distributed (IID). Similarity factors based on the T 2 and Q statistics provide measures of the similarity between two data sets. Consider a historical data set H and a current snapshot data set S. A PCA model is built for the snapshot data set S, and the 95% confidence limits for the T 2 and Q statistics are calculated. These limits are denoted as T952 and Q95, respectively. The data for historical data set H are projected onto the PCA model for data set S, and the T 2 and Q values are calculated for each of the m samples in H. The T 2 and Q similarity factors are defined based on the numbers of 95% chart limit violations of these statistics. If the number of violations for data set H is less than or equal to a critical number r95, then data sets S and H are considered to be similar. The critical number of limit violations is calculated in the following manner.44 The numbers of Q and T 2 limit violations for a data window of size m are assumed to follow a binomial distribution with probability parameter . This premise is consistent with the IID assumption for the data used to build the PCA models. Parameter is the probability that a single data point violates the 95% chart limits. By definition, ) 0.05 if the limits on the T 2 statistic are 95% limits. The maximum number of T 2 limit violations that are allowed at a 95% confidence level is denoted by r95 and is referred to as the critical number of violations. It can be calculated as the inverse of the cumulative binomial distribution at the probability value of 0.9544
r95 ) max r
(9)
such that
∑ (k k)0 r
m
)
k(1 - )m-k < 0.95
The similarity factors are defined as follows.
1. T 2 similarity factor: Data set S is considered to be similar to data set H if the number of times that the T 2 statistic exceeds T952 is less than the critical number of violations, r95. For a data window of m ) 1024 observations, r95 ) 63. This value of m is used in the case study. 2. Q similarity factor: Data set S is considered to be similar to data set H if the number of times the Q statistic exceeds the Q95 value is less than the critical number, r95. The critical number of violations is calculated in the same manner as that for the T 2 similarity factor. Thus, r95 ) 63 for 1024 observations. 3. Combined discriminant similarity factor: This similarity factor combines the information contained in the T 2 and Q statistics to provide discrimination between fault types.28 At the 95% confidence level, the combined discriminant, C, is calculated as
( )
C)β
(11)
where β is a weighting factor between 0 and 1. In the absence of any additional information, the T 2 and Q statistics are weighted equally (β ) 0.5). The cutoff for the C statistic is, therefore, equal to 1. Data set S is considered to be similar to H if the number of times that the C statistic is greater than unity is less than r95. The critical number of violations is calculated in the same manner as that for the T 2 similarity factor and, thus, r95 ) 63. Krzanowski45 developed a method for measuring the similarity of two data sets using a PCA similarity factor, SPCA. Consider two data sets that contain the same n variables but not necessarily the same number of measurements. We assume that the PCA model for each data set contains k principal components, where k E n. The number of principal components, k, is chosen such that k principal components describe at least 95% of the total variance in each data set. The similarity between the two data sets is then quantified by comparing their principal components. The appeal of the similarity factor approach is that the similarity between two data sets is quantified by a single number, SPCA. Consider a current snapshot data set S and a historical data set H with each data set consisting of m measurements of the same n variables. Let kS be the number of principal components that describe at least 95% of the variance in data set S and kH be the number of principal components that describe at least 95% of the variance in data set H. Let k ) max(kS, kH), which ensures that k principal components describe at least 95% of the variance in each data set. Then subspaces of the S and H data sets can be constructed by selecting only the first k principal components for each data set. The corresponding (n × k) subspaces are denoted by L and M, respectively. The PCA similarity factor compares these reduced subspaces and can be calculated from the angles between principal components45
SPCA ) (10)
( )
Q T2 + (1 - β) Q95 T952
1
k
k
∑∑cos2 θij
ki)1j)1
(12)
where θij is the angle between the ith principal component of data set S and the jth principal component of data set H. The similarity factor can also be expressed
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3825
in terms of the subspaces L and M as45
SPCA )
trace(LTMMTL) k
(13)
Because L and M contain the k most significant principal components for S and H, SPCA is also a measure of the similarity between data sets S and H. 2.2. Dissimilarity Factor. Kano et al.27 recently developed a dissimilarity factor to compare data sets for purposes of process monitoring. Let the historical and snapshot data sets be denoted as H and S, respectively. The first step in the calculation of the dissimilarity factor is the eigenvalue/eigenvector decomposition of the covariance matrix of the combined data set X, which is defined as
X}
[] H S
˜ S ˜ λH i ) 1 - λi
(15)
˜ S ˜ where λH i and λi are the eigenvalues of the covariance matrices of the transformed data. If the data sets are similar to each other, then their eigenvalues must be close to 0.5, but if the data sets are different, then the largest and the smallest eigenvalues will be close to 0 and 1 respectively.27 Therefore, the dissimilarity factor, D, is defined as the deviation of the eigenvalues from a central value of 0.5,
4
n
∑ nj)1
2 ˜ (λH j - 0.5) )
4
Φ ) x(x jH - x j S)ΣS/-1(x jH - x j S)T
n
(λSj˜ - 0.5)2 ∑ n j)1
(16)
where n is the number of variables in each data set. When the data sets are similar to each other, D is close to zero, and when they are dissimilar, D ≈ 1. 3. Proposed Pattern Matching Methodology In this section we present a new similarity factor for data-set comparison and propose a pattern matching methodology to locate periods of historical data similar to the snapshot, without knowledge of the start and end times of the various operating conditions in the historical database. We also describe methods for generating the candidate pool of records that can be presented to a person familiar with the process for detailed analysis. 3.1. Distance Similarity Factor. In this section, a new distance similarity factor, Sdist, is introduced that compares two data sets that may have similar spatial orientation but are located far apart. The new similarity factor is particularly useful when two data windows have similar principal components but the values of the process variables may be different because of disturbances of varying magnitudes or setpoint changes. The
(17)
j H are sample mean vectors. ΣS is the where x j S and x covariance matrix for data set S, and ΣS/-1 is the pseudoinverse of ΣS. It can be calculated using a singular value decomposition. The number of singular values used to calculate the pseudoinverse is k ) max(kS, kH). The new distance similarity factor Sdist is defined as the probability that the center of the historical data set, x j H, is at least a distance Φ from the snapshot data set S:
(14)
The data sets H and S are then projected on the eigenvectors matrix of the covariance matrix of X and scaled by the corresponding eigenvalues to produce the transformed data sets H ˜ and S ˜ , respectively. Then, the covariance matrices of the transformed data sets are calculated. Finally, the eigenvalues and eigenvectors of the covariance matrices are calculated. Kano et al.27 have shown that the covariance matrices of both transformed data sets have the same eigenvectors, and their eigenvalues are related as
D}
distance similarity factor can be used to distinguish between these cases. The Mahalanobis distance,46 Φ, from the center of the historical data set (x j H) to the center of the current snapshot data set, x j S, is defined as
Sdist }
xπ2∫ e
∞ -z2/2
φ
[
)2 1-
dz
(18)
∫-∞φ e-z /2 dz
1 x2π
2
]
The error function in eq 18 can be evaluated using standard tables or software packages. Because the distance similarity factor provides a natural complement to the PCA similarity factor, the proposed pattern matching approach is based on both SPCA and Sdist. This approach compares the data sets at two levels: (i) orientation of their subspaces and (ii) the distance between them. It is interesting to note that the widely used Mahalanobis distance metric in the pattern matching literature and Hotelling’s T 2 statistic used in process monitoring are related to each other. If k singular values are used to calculate ΣS/-1, then the square of the Mahalanobis distance Φ is equal to Hotelling’s T 2 statistic described by eq 8 for the point x j H. The novelty of the proposed distance similarity factor is that it assigns a probability value lying between 0 and 1 to the Mahalanobis distance between the snapshot and historical data sets by using the Gaussian probability distribution. Although the Mahalanobis distance, the T 2 statistic, and the Gaussian probability distribution are wellknown, the combination of these concepts to produce a simple distance similarity factor, Sdist, is a novel contribution. 3.2. Moving-Window Approach. The proposed pattern matching strategy is summarized in the flowchart of Figure 1. First, the user defines the snapshot data that serve as a template for searching the historical database. The snapshot specifications consist of (i) the “relevant” process variables and (ii) the duration of the abnormal situation. These specifications can be arbitrarily chosen by the user; no special plant tests or preimposed conditions are necessary. To find periods of operation in historical data that are similar to the snapshot data, a window of the same size as the snapshot data is moved through the historical data. The similarity between the snapshot and the historical data in the moving window is characterized by the SPCA and Sdist similarity factors. The historical data windows with the largest values of similarity factors are collected in a candidate pool. The individual data windows in the candidate pool are called records. After the candidate pool has been formed, a person
3826
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
Figure 1. Proposed pattern matching approach.
Figure 2. Pattern matching in the historical data using a movingwindow approach.
familiar with the process can then perform a more detailed examination of the records. In the proposed pattern matching approach, the rate at which the window is moved through historical data determines the accuracy of pattern matching and the computational effort required. For example, suppose that the historical data window is moved one observation at a time, with each new observation replacing the oldest one. Then the pattern matching would be very effective, but the computational load would be very high because the number of comparisons between the historical data windows and the snapshot data would be almost the same as the number of observations in the entire historical database. To reduce the computational load, the moving window can be moved by w observations at a time, so that w new observations replace the w oldest observations. The integer w is referred to as the window movement rate. The accuracy of the pattern matching tends to decrease as w increases. Our experience indicates that choosing w to be one-tenth to onefifth of the length of the snapshot data window provides a satisfactory tradeoff. 3.3. Selection of the Candidate Pool. The candidate pool is constructed by selecting the “most similar” historical data windows. Figure 2 illustrates pattern matching using the moving-window methodology. Suppose that a historical data window has been selected for the candidate pool based on its SPCA and Sdist values. In general, this record will contain data from two operating periods, as shown in Figure 2. For the simulation case study, the record is considered to have been correctly identified if the center of the moving data window is located in the same type of operating period
Figure 3. Scatter diagram for PCA and distance similarity factors when the snapshot data are for operating condition F1 (catalyst deactivation). The candidate pool can be selected from the points within the clusters. The clusters are analyzed in Table 5.
as the snapshot data. This situation is illustrated in Figure 2 where the snapshot data are for an operating condition called F10. Then a correct match between the moving historical window and the F10 snapshot data occurs when the center of the moving window lies inside a F10 period of historical data. To avoid redundant counting of data windows that represent the same period of operation in the simulated historical database (i.e., historical data windows that are “too close” to each other), the historical data window that has the largest values of the similarity factors is chosen among the historical data windows that are within m observations of each other. This restriction results in the selection of data windows that are at least m observations apart. In the case study, m is chosen to be the number of data points in the snapshot data, i.e., m ) 1024. This value is also the duration of each operating period in the historical database. In the proposed methodology, SPCA and Sdist are used to measure the degree of similarity between the snapshot and historical data windows. Thus, it is convenient to plot the data windows that have the largest values of SPCA and Sdist on a scatter diagram, as shown in Figures 3 and 4. (Note that historical data windows that
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3827
where 0 E R E 1. The combined similarity factor, SF, is calculated for each of the historical data windows, and then the data windows are ranked according to decreasing values of SF. Then, a person who is familiar with the process can sequentially examine the data records that have the largest SF values. For example, the person could choose to evaluate a certain number of records (say, 5-10 records) or simply evaluate records until the common features associated with similar events become apparent. This approach is very simple to implement but requires specification of the weighting factor, R. In our experience, SPCA is able to capture the similarities in data sets better than Sdist. Therefore, SPCA is given twice as much weight as Sdist (i.e., R ) 2/3). In the absence of any a priori information, a value of R ) 1/2 can be used. The simulation results in section 5 indicate that this proposed pattern matching strategy performs well for a range of R values. 3.4. Performance Measures for Pattern Matching. Two important metrics are proposed to quantify the effectiveness of a pattern matching technique, but first several definitions are introduced: NP: The size of the candidate pool. NP is the number of historical data windows that have been labeled “similar” to the snapshot data by a pattern matching technique. The data windows collected in the candidate pool are called records. N1: The number of records in the candidate pool that are actually similar to the current snapshot, i.e., the number of correctly identified records. N2: The number of records in the candidate pool that are actually not similar to the current snapshot, i.e., the number of incorrectly identified records. By definition, N1 + N2 ) NP. NDB: The total number of historical data windows that are actually similar to the current snapshot. In general, NDB * NP. The first metric, the pool accuracy p, characterizes the accuracy of the candidate pool: Figure 4. Scatter diagram for PCA and distance similarity factors when the snapshot data are for operating condition F_9 (ramp change in the coolant feed temperature). The candidate pool can be selected from the points within the clusters. The clusters are analyzed in Table 6.
are closer than m observations have been omitted for the sake of clarity.) Each symbol in Figures 3 and 4 represents a historical data window for a simulation study that will be discussed in section 4. In practice, the user only sees the unlabeled records of Figures 3a and 4a. It is reasonable to select the clusters of historical data windows that have the largest values of SPCA and/ or Sdist for the candidate pool. For example, one could choose one of more of the numbered clusters in Figures 3b and 4b. In Figures 3b and 4b, the data windows that are correctly identified as being similar to the snapshot data are marked as “O” and the incorrectly identified ones are marked as “×”. As an alternative to visually clustering the data on the scatter diagram, standard clustering algorithms15 could be used. In some practical applications, it would be desirable to have a methodology that automatically generates the candidate pool without having a person to inspect a scatter diagram such as Figure 3 or Figure 4. One simple approach for automating the analysis is based on a weighted combination of the two similarity factors:
SF } RSPCA + (1 - R)Sdist
(19)
p}
N1 × 100% NP
(20)
A second metric, the pattern matching efficiency η, characterizes how effective the pattern matching technique has been in locating similar records in the historical database. It is defined as
η}
N1 × 100% NDB
(21)
When the pool size NP is small (NP < NDB), then the efficiency η will be small because N1 E NP. A theoretical maximum efficiency, ηmax, for a given pool size NP can be calculated as follows:
η
max
{
NP × 100% for NP < NDB } NDB for NP G NDB 100%
(22)
Because an effective pattern matching technique should produce large values of both p and η, an average of the two quantities (ξ) is used as a measure of the overall effectiveness of pattern matching.
ξ}
p+η 2
(23)
3828
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
In pattern matching problems, the relative importance of p and η metrics is application dependent. For example, a busy engineer may want to locate a small number of previous occurrences of an “abnormal situation” without having to waste time evaluating incorrectly identified records (“false positives”). In this situation, a large value of p is more important than a large value of η, and NP should be small (e.g., 2-5). In other types of applications, it might be desirable to locate most (or even all) previous occurrences of an abnormal situation, for business or legal reasons. This type of situation would arise during the investigation of a serious plant accident or after the recall of a defective product. Here, η is more important than p, and a relatively large value of the candidate pool size, NP, is acceptable. Fortunately, the proposed pattern matching technique can accommodate both types of applications, as demonstrated in section 5.5. 4. Simulation Case Study: Continuous Stirred Tank Reactor An extensive simulation case study is used to evaluate the performance of alternative pattern matching techniques for a wide variety of operating conditions and fault scenarios. A nonisothermal continuous stirred tank reactor (CSTR) with cooling jacket dynamics and a variable liquid level was simulated in order to generate historical data. The classic first-order irreversible reaction, A f B, was assumed. A schematic diagram of the CSTR and feedback control system is shown in Figure 5. A dynamic model for the CSTR can be derived based on the assumptions of perfect mixing and constant physical parameters.47 The mass, energy, and component balances around the reactor and cooling jacket are
QFCAF - QCA dCA ) -k0e-E/RTCA + dt Ah dT k0e ) dt
(24)
-E/RT
CA(-∆H) (QFTF - QT) + + FCp Ah UAC(TC - T) (25) FCpAh
dTC QC(TCF - TC) UAC(T - TC) ) + dt VC FCCpCVC
(26)
dh QF - Q ) dt A
(27)
where the process variables and model parameters are defined in the Notation section. The control valve dynamics are modeled by the first-order transfer function
Gv )
K τs + 1
(28)
where K ) 1/16 mA-1 and τ ) 2 s. The flow rate through each control valve is given by the following relation:
Flow ) Cvf(l)
x
∆Pv gs
(29)
where the symbols are defined in the Notation section.
Figure 5. Schematic of the CSTR system with cascade control. Table 1. Nominal Operating Conditions and Model Parameters for the CSTR Case Study Q ) 100 L/min QC ) 15 L/min TF ) 320 K TCF ) 300 K T ) 402.35 K TC ) 345.44 K CAF ) 1.0 mol/L CA ) 0.037 mol/L h ) 0.6 m
A ) 0.1666 m2 k0 ) 7.2 × 1010 min-1 ∆H ) -5 × 104 J/mol FCp ) 239 J/(L‚K) FCCpC ) 4175 J/(L‚K) E/R ) 8750 K UAC ) 5 × 104 J/(min‚K) VC ) 10 L
A linear valve characteristic (f(l) ) l) is assumed, and the pressure drop across each control valve is assumed to be constant for the entire flow range. The combination of these two assumptions results in a valve with a linearly installed flow characteristic.48 The control structure and the controller parameters have been described in detail by Johannesmeyer.49 The nominal operating conditions and model parameters are given in Table 1 and were also used by Johannesmeyer.49 The historical database for the CSTR case study was designed to include both normal operating periods, setpoint changes, and a wide variety of process changes and disturbances that will be referred to as “faults”. Many fault detection and diagnosis studies have been conducted using CSTR models,50,51 and a large number of possible fault conditions can be considered. The 28 operating conditions in Table 2 were chosen in order to simulate the wide range of disturbance and fault types that can be encountered in a typical historical database. Fault conditions included process disturbances (e.g., ramp change in TCF, step or sinusoidal changes in QF, etc.), instrumentation faults (e.g., dead coolant flow measurement, bias in reactor temperature measurement, etc.), and process changes (e.g., valve stiction, heat exchanger fouling, catalyst deactivation, etc.). Setpoint changes in the reactor temperature were also considered. In Table 2 the notation F_i is used to denote a change in fault i, in the negative direction. For example, snapshot F_4 is identical to snapshot F4, except that the measurement bias is negative, instead of positive. 4.1. Generation of the Historical Database. The historical database for the CSTR was generated by simulating the process for a period of approximately 39 days. The sampling period for all measurements was 5 s. This simulation resulted in over 666 000 data points for each of the 14 process variables in Table 3. The last four measurements in Table 3 are the controller output signals. For example, hC is the signal in mA from the level controller and QCC is the controller output in mA from the coolant flow-rate controller. Various pattern
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3829 Table 2. Operating Conditions and Faults ID
operating condition
description
N
normal operation
F1
catalyst deactivation
operation at the nominal conditions; no disturbances the activation energy ramps up
F2
heat exchanger fouling
the heat-transfer coefficient ramps down
F3
dead coolant flow measurement
F4 and F_4 F5 and F_5 F6 and F_6 F7 and F_7 F8 and F_8 F9 and F_9 F10 and F_10 F11 and F_11 F12
bias in the reactor temperature measurement coolant valve stiction + F7
the coolant flow-rate measurement stays at its last value the reactor temperature measurement has a bias dead band for stiction ) 5% of the valve span step change in the feed flow rate
F13 S1 and S_1 O1 O2 O3 O4
step change in QF ramp change in CAF ramp change in TF ramp change in TCF step change in PCU step change in PD damped oscillations in fthe feed flow rate autoregressive disturbance in the feed flow rate setpoint change in T high-frequency oscillations in the feed flow rate intermediate-frequency oscillations in the feed flow rate intermediate-frequency oscillations in the feed flow rate low-frequency oscillations in the feed flow rate
reduced set
full set
variable
CA T TC h Q QC QF
x x x x x x
x x x x x x x
CAF TF TCF hC QC TC QCC
reduced set x x x x x
N/A the ramp rate for E/R is +3 K/min the ramp rate for UAC is -125 J/(min‚K)/min N/A (4 K N/A (10 L/min
the feed concentration ramps up or down the feed temperature ramps up or down the coolant feed temperature ramps up or down step change in the upstream pressure in the cooling line step change in the downstream pressure in the reactor outlet line the feed flow changes as e-t/33 sin(2π/10) L/min QF(k) ) 0.8QF(k - 1) + w(k); w(k)∼ N(0,1)
the ramp rate is (6 × 10-4 (mol/L)/min the ramp rate is (0.1 K/min
setpoint change for the reactor temperature sustained oscillations of frequency of 3 cycles/min sustained oscillations of frequency of 1 cycle/min sustained oscillations of frequency of 0.5 cycles/min sustained oscillations of frequency of 0.2 cycles/min
(3 K
Table 3. Measurement Sets for the CSTR Case Study variable
nominal value
full set x x x x x x x
matching techniques were evaluated for the simulation case study using the full measurement set. A comparison was also made for a reduced measurement set in Table 3, where measurements of some key process variables such as feed and reactor concentrations and the coolant feed temperature were not available for analysis. The historical database was generated in the following manner. For each 120 min operating period, the mode of operation (i.e., fault type, setpoint change, or normal operation), as well as its magnitude and direction, was chosen randomly. The fault magnitude was chosen randomly to be between 25% and 125% of the nominal value in Table 2. After the mode of operation and any necessary parameters (i.e., fault direction and magnitude) were selected, the simulation ran for a total of 120 min. Each 120 min period was divided into 85.3 min for the selected operating mode and 34.7 min for the process to return to the nominal steady state. Thus, faults occurred one at a time (i.e., no simultaneous faults) and had the same duration. The simulation consisted of 463 operating periods, 409 periods of abnormal operation and 54 periods of normal operation
the ramp rate is (0.1 K/min (2.5 psi (5 psi 10 L/min N/A
10 L/min 10 L/min 10 L/min 10 L/min
during the 39 days of simulation. Each operating period contained 1024 samples at 5 s intervals. Gaussian process and measurement noise were added to process variables in the simulation. The magnitude of the measurement noise for each process variable is given in Table 4. The amplitude of each fault, disturbance, and setpoint change was chosen randomly, as noted earlier. 5. Results and Discussion In this section, the performances of different pattern matching techniques for the CSTR case study are compared. As noted in section 3.2, a data window that was the same size as the snapshot data (S) was moved by 200 observations at a time through the historical database. The ith moving window was denoted as Hi. Each snapshot data set was scaled to zero mean and unit variance. When the snapshot data, S, were compared to a data window Hi, the historical data were scaled using the scaling factors for the snapshot data. Similarity factors were then calculated for each Hi. After the entire database was analyzed for one set of snapshot data, the analysis was repeated for a new snapshot data set. A total of 28 different snapshot data sets, one for each of the 28 operating conditions in Table 2, were used for pattern matching. The nominal values of the fault conditions were used to generate the snapshot data. 5.1. Selection of the Candidate Pool. As described in section 3.3, either the scatter plots or the weighted similarity factor (SF in eq 19) can be used to select records for the candidate pool. Examples of the scatter
3830
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
Table 4. Noise Variances (in Appropriate Units) for the Measurements of Table 3 CA 5.56 ×
T 10-6
4.44 ×
TC 10-1
4.44 ×
h 10-1
1.60 ×
CAF
TF
TCF
5.56 × 10-6
4.44 × 10-1
4.44 × 10-1
Table 5. Analysis of the Data Windows in the Clusters of Figure 3 (F1 Snapshot Data) SPCA cluster #1
summary cluster #2
summary cluster #3
summary
0.839 0.835 0.839 0.825 0.768 0.768 0.766 0.764 0.763 0.765 0.761 0.764 0.531 0.527 0.530 0.537 0.527 0.530 0.531 0.532 0.537 0.528 0.525 0.529 0.526 0.531 0.532 0.528 0.528 0.527
Sdist
SF ) (2SPCA+ Sdist)/3
Q 10-3
4.44 × hC
QC
SPCA cluster #1
summary cluster #2
summary cluster #3
summary cluster #4
0.796 0.800 0.790 0.784 0.756 0.661 0.660 0.662 0.657 0.657 0.655 0.632 0.623 0.624 0.623 0.376 0.375 0.371 0.360 0.359 0.360 0.361 0.360 0.360 0.360 0.359 0.358 0.358 0.800 0.784
summary
plot approach for two different snapshot data sets (F1 and F_9) are shown in Figures 3 and 4. Depending on the user’s preference, clusters #1, #2, and/or #3 could be selected for the candidate pool from Figures 3 and 4. These clusters are analyzed in Tables 5 and 6 where the points within each cluster are ranked according to descending SF values with R ) 2/3. It is clear that clusters that have a higher value of SPCA produce more accurate pattern matching than those clusters that have a higher value of Sdist. To avoid the subjectivity associated with the interpretation of the scatter plots, the records will be rank ordered by decreasing SF value in order to generate the candidate pool for the proposed approach. 5.2. Overall Comparison of Pattern Matching Techniques. Different pattern matching methods are compared in terms of the pool accuracy (p) and search efficiency (η), which were defined in eqs 20 and 21. The simulation results are reported as average values for the 28 operating modes, where each operating condition was designated, in turn, as the “abnormal” situation for the snapshot data. In Table 7, the proposed pattern matching methodology is compared to several existing methods: the T 2, Q, combined discriminant, and Kano’s dissimilarity factor. The pool sizes for the T 2, Q, and the combined discriminant methods were obtained by selecting the
1.00 ×
QF
10-2
4.44 × 10-1
TC
QCC
Table 6. Analysis of the Data Windows in the Clusters of Figure 4 (F_9 Snapshot Data)
correctly identified?
0.273 0.650 yes 0.277 0.649 yes 0.256 0.644 yes 0.278 0.642 yes NP ) 4, N1 ) 4, N2 ) 0, p ) 100% 0.282 0.606 no 0.280 0.605 no 0.279 0.604 yes 0.279 0.603 yes 0.276 0.601 yes 0.271 0.600 yes 0.278 0.600 yes 0.258 0.595 yes NP ) 8, N1 ) 6, N2 ) 2, p ) 75% 0.321 0.461 yes 0.315 0.456 no 0.309 0.456 yes 0.292 0.455 no 0.312 0.455 yes 0.303 0.454 no 0.300 0.454 yes 0.297 0.454 yes 0.285 0.453 no 0.303 0.453 yes 0.307 0.452 no 0.294 0.451 yes 0.297 0.450 yes 0.286 0.450 yes 0.283 0.449 yes 0.283 0.447 no 0.279 0.445 yes 0.280 0.445 no NP ) 18, N1 ) 11, N2 ) 7, p ) 61%
QC 10-1
SF ) (2SPCA + Sdist)/3
Sdist
correctly identified?
0.222 0.605 yes 0.203 0.601 yes 0.212 0.598 yes 0.207 0.592 yes 0.226 0.580 yes NP ) 5, N1 ) 5, N2 ) 0, p ) 100% 0.250 0.524 yes 0.251 0.523 yes 0.243 0.522 no 0.249 0.521 yes 0.248 0.521 yes 0.234 0.514 yes 0.241 0.501 no 0.250 0.499 no 0.246 0.498 no 0.233 0.493 no NP ) 10, N1 ) 5, N2 ) 5, p ) 50% 0.244 0.332 no 0.241 0.331 no 0.245 0.329 yes 0.248 0.323 yes 0.249 0.322 no 0.245 0.322 no 0.242 0.321 no 0.244 0.321 no 0.243 0.321 no 0.243 0.321 no 0.244 0.321 no 0.245 0.320 no 0.242 0.319 no NP ) 13, N1 ) 2, N2 ) 11, p ) 15% 0.156 0.478 no 0.149 0.466 no NP ) 2, N1 ) 0, N2 ) 2, p ) 0%
Table 7. Comparison of Pattern Matching Methods method T 2 similarity factor Q similarity factor combined discriminant similarity factor dissimilarity factor (Kano et al.27) SPCA only SPCA-Sdist method
optimum NP
N1
N2
p (%)
η (%)
ξ (%)
116 56 82
12 8 13
104 48 69
19 16 18
72 52 79
45 34 49
15
8
7
50
51
51
39 17
15 12
24 5
37 69
89 79
63 74
historical data windows for which the value of the similarity factor was equal to 1, as defined in section 2.1, and then eliminating windows that were closer than m ) 1024 observations to each other. The T 2, Q, and the combined discriminant similarity factors produced large candidate pools which resulted in low accuracy (p < 20%). On the other hand, these methods had relatively large η values because of the large pool sizes. Thus, these methods lack the ability to distinguish between different operating conditions effectively. The dissimilarity factor of Kano et al.27 performed better than the T 2, Q, and combined discriminant similarity factors, producing more accurate results (larger p value) but with a low efficiency, η. The optimum pool size for the dissimilarity factor was
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3831
Figure 7. Comparison of the pattern matching performance for SPCA and the dissimilarity factor of Kano et al.27 (w ) 200). Figure 6. Effect of varying candidate pool size NP on pattern matching using SPCA alone (w ) 200).
obtained by calculating the mean values of p, η, and ξ for the 28 operating conditions, for a wide range of pool sizes, and then selecting the value of NP for which the average of ξ for the 28 operating conditions was a maximum. The computational requirements for the calculation of the similarity factors are modest. For example, it takes less than 4 min on a Pentium III/550 MHz computer running MATLAB version 5.3 to sweep through the historical database at a window movement rate of w ) 100 and to calculate SPCA and Sdist for all of the historical data windows. Figure 6 shows the variation of p, η, and ξ for different values of NP when the SPCA method is used by itself (i.e., R ) 1 in eq 19). These values are the mean values for the 28 operating conditions of Table 2. At very low values of NP, records that are very similar to the snapshot data are included in the candidate pool. As NP increases, the accuracy of pattern matching, p, decreases, while the efficiency, η, increases. The value of NP at which ξ was a maximum was chosen as the optimum value of NP. Note that the value of ξ remains relatively constant over a wide range of NP values in Figure 6. Therefore, depending on the application, the user can choose a small value of NP to obtain a high p value or a large NP to have a high η value without a significant change in ξ. As shown in Table 7 and Figure 7, SPCA performs better than other methods (except SPCA - Sdist) because it produces a higher value of ξ, and both the p and η curves for SPCA are above the corresponding curves for the dissimilarity factor. These results show that SPCA is more accurate than the dissimilarity factor. As described in section 5.1, SPCA and Sdist were calculated for each of the moving historical data windows with a window movement rate of w ) 200. The historical data windows corresponding to the largest values of SF in eq 19 were collected in the candidate pool. The SPCA value was weighted twice as much as the Sdist value (R ) 2/3), because the characteristics of each fault are captured more by the orientation of the subspaces than the distance between them. The effect of the candidate pool size, NP, on the pattern matching performance for
Figure 8. Effect of varying candidate pool size NP on pattern matching using the SPCA-Sdist combination (R ) 2/3 and w ) 200).
the SPCA-Sdist method is illustrated in Figure 8. The values of p, η, and ξ are average values for the 28 operating conditions of Table 2. The pool accuracy, p, decreases with increasing NP, while the efficiency, η, increases asymptotically toward 100% as NP becomes large. Note that η will be low for NP < NDB by definition since η E ηmax (eq 22). The value of ξ is the average of p and η and has a maximum at NP ) 17. The pattern matching results for the SPCA-Sdist method are superior to those obtained with SPCA alone, as shown in Table 7. 5.3. Comparison of Results When the Start and End Times of the Faults Are Known. In the previous section, the new moving-window approach was used for pattern matching where the start and end times of each operating condition were not known. By contrast, in previous studies,32,34,49 the start and end times were assumed to be known a priori. A comparison of pattern matching between the proposed moving-window technique and the ideal case where the start and end times of each operating condition in the historical data are
3832
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
Table 8. Effect of SPCA-Sdist Pattern Matching Method (r ) 2/3 and w ) 200) for Known Starts and Ends of Disturbances and the Reduced Measurement Set p η ξ optimum NP N1 N2 (%) (%) (%)
method unknown start and end times of disturbances known start and end times
17
12
5
69
79
74
14
11
3
79
74
77
Table 9. Detailed Results for the SPCA-Sdist Method (NP ) 17, r ) 2/3, and w ) 200) snapshot
NDB
N1
N2
p (%)
η (%)
ξ (%)
average N F1 F2 F3 F4 F_4 F5 F_5 F6 F_6 F7 F_7 F8 F_8 F9 F_9 F10 F_10 F11 F_11 S1 S_1 F12 F13 O1 O2 O3 O4
17 54 23 32 30 14 8 15 16 15 14 15 12 15 20 9 12 15 14 14 14 9 14 12 12 19 12 12 12
12 16 13 11 17 11 7 12 10 14 12 14 12 12 14 9 10 12 5 13 14 9 12 9 9 16 12 12 12
5 1 4 6 0 6 10 5 7 3 5 3 5 5 3 8 7 5 12 4 3 8 5 8 8 1 5 5 5
69 94 76 65 100 65 41 71 59 82 71 82 71 71 82 53 59 71 29 76 82 53 71 53 53 94 71 71 71
79 30 57 34 57 79 88 80 63 93 86 93 100 80 70 100 83 80 36 93 100 100 86 75 75 84 100 100 100
74 62 66 50 78 72 64 75 61 88 78 88 85 75 76 76 71 75 33 85 91 76 78 64 64 89 85 85 85
known a priori is shown in Table 8. Note that only the start and end times of operating conditions are known,
but the mode of the operation and the fault magnitude are not known. From Table 8, it is clear that the performance of the SPCA-Sdist pattern matching technique when the start and end times are not known is comparable to the ideal case where they are known. Thus, the moving-window approach is able to identify similar operating periods in the historical data very efficiently. 5.4. Results for Individual Operating Conditions. In the following discussion, R ) 2/3 and w ) 200 are used as design parameters for the proposed SPCASdist pattern matching method. The detailed results for different operating conditions and NP ) 17 are presented in Table 9. Operating conditions F_4, F_10, F12, F13, and S1 have the lowest p values and are thus the most difficult operating conditions to match in the historical data. Detailed misclassification results for Table 9 are presented in Table 10. For example, operating condition F_4 is misclassified with F1 because both result in a lower reactor temperature measurement, a key variable. Also, a step change in the reactor temperature setpoint (S1) is misclassified with operating conditions that involve large, sudden changes in the reactor temperature. Figure 9 shows the similarity of representative response data for operating conditions F_4 and S1 and the nominal fault magnitude. Fault F_10 is misclassified with F2 because both affect the coolant flow controller in a similar fashion by increasing the signal from the controller. Operating conditions such as F12 and O4 that have oscillations in the feed flow rate are misclassified with each other because the frequency of oscillation is very close in both cases. To develop further insight into misclassification issues associated with the CSTR case study, similarity factors were calculated for all possible pairs of the nominal operating conditions in Table 2. These results are shown in Tables 11 and 12. For each row, the operating condition in the first column is considered to be the
Table 10. Misclassification Results for Table 9 snapshot
correct (N1)
incorrect (N2)
misclassifications (operating condition and number)
N F1 F2 F3 F4 F_4 F5 F_5 F6 F_6 F7 F_7 F8 F_8 F9 F_9 F10 F_10 F11 F_11 F12 F13 S1 S_1 O1 O2 O3 O4
16 13 11 17 11 7 12 10 14 12 14 12 12 14 9 10 12 5 13 14 9 9 9 12 16 12 12 12
1 4 6 0 6 10 5 7 3 5 3 5 5 3 8 7 5 12 4 3 8 8 8 5 1 5 5 5
F13 (1) N (1), F2 (1), F5 (1), F_10 (1) N (1), F_8 (1), F10 (2), F_10 (2) none N (1), F2 (3), F8 (1), F_10 (1) F1 (5), F2 (1), F_6 (1), F_8 (3) N (1), F8 (4) N (1), F_8 (5), F_10 (1) F5 (1), F_10 (1), F11 (1) F1 (1), F8 (1), F_8 (1), F10 (1), F_10 (1) F1 (1), F4 (1), F_8 (1) N (1), F1 (1), F2 (1), F8 (1), F_11 (1) N (2), F5 (3) F_5 (1), F10 (1), F_10 (1) N (3), F2 (2), F10 (1), F_10 (1), S_1 (1) N (1), F2 (2), F7 (1), F_10 (1), S1 (2) N (2), F1 (3) N (1), F2 (7), F_5 (1), F8 (1), F_8 (1), F10 (1) N (1), F1 (1), F2 (1), F10 (1) F_6 (1), F10 (1), F_10 (1) F13 (2), O4 (6) F12 (4), O3 (1), O4 (3) N (1), F2 (1), F_4 (2), F8 (1), F_10 (1), F_11 (1), S_1 (1) N (2), F_4 (1), F_10 (2) O2 (1) F13 (4), O3 (1) F13 (5) F12 (5)
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3833
Figure 10. Historical period of operation similar to the snapshot data, found using the SPCA-Sdist method. Both data sets correspond to operating condition F1 (catalyst deactivation), but the starting times and fault magnitudes are different.
Figure 9. Similar profiles of relevant variables for two operating conditions listed in Table 10.
snapshot for the purpose of scaling. The boldface values in the tables indicate the SPCA > 0.9 Sdist > 0.9. If only SPCA is used, the results in Table 11 show that the positive and negative changes for operating conditions F4 through F11 have a high possibility of being misclassified with each other. However, pairs of operating conditions such as F4 and F_4 have small Sdist values. Consequently, using Sdist and SPCA together helps one to distinguish between these positive and negative changes. In fact, the combination of similarity factors provides a significant improvement over using SPCA alone, as shown in Table 7. The results presented in Tables 11 and 12 for the nominal values of the operating condition provide considerable insight into the misclassification problem. For example, operating conditions F2, F4, and S_1 show similarities because they result in an increase in the coolant flow QC and a decrease in the coolant temperature TC, as mentioned above. Operating conditions F6 and F12 show similarities for SPCA only, but are not similar when SPCA and Sdist are used together. Thus, Tables 11 and 12 indicate the operating conditions that
one might expect to be misclassified because of their similar patterns. However, random fault and disturbance magnitudes used in the simulations result in other misclassifications, as shown in Table 10. Figure 10 shows a window of the historical data that was correctly found to be similar to the F1 snapshot data (catalyst deactivation). The snapshot data are shown as a reference for comparison with the historical data. Note that the match is not exact because (i) the start and end times of the faults are not known and (ii) the fault magnitudes are different for the snapshot and historical data sets in Figure 10. Recall that the magnitudes of the faults in the historical data vary between 25% and 125% of the nominal values used for the snapshot data sets. However, the SPCA-Sdist method was able to correctly locate similar periods of operation in the historical data very effectively. 5.5. Effect of Design Parameters. For the proposed moving-window approach and the SPCA-Sdist method, R and w are two design parameters that influence pattern matching. We first consider the effect of these design parameters on pattern matching. The definition of SF in eq 19 requires the user to specify the value of the weighting parameter R. In the preceding section, we recommended using R ) 2/3 because SPCA is more effective than Sdist in characterizing the similarity between data sets. However, on the basis of his/her preference, the user could choose a different value of R ranging from 0 (Sdist only) to 1 (SPCA only). Table 13 demonstrates that the proposed pattern matching strategy is effective for a range of values of R. In the absence of any a priori information, R ) 1/2 should be used. In the moving-window approach, the window of historical data has the same length as the snapshot data, 1024 points. The window moves w data points forward in the historical data for the next comparison. The window movement rate w, by which the window is moved, affects the performance of the pattern matching
Table 11. Value of the PCA Similarity Factor SPCA for the Nominal Magnitudes of the Operating Conditions
3834 Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
Table 12. Value of the Distance Similarity Factor Sdist for the Nominal Magnitudes of the Operating Conditions
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3835
3836
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
Figure 12. Effect of varying candidate pool size NP on pattern matching using the SPCA-Sdist combination for the reduced measurement set (R ) 2/3 and w ) 200). Figure 11. Pattern matching results for different values of w for the SPCA-Sdist method (R ) 2/3). Table 13. Effect of Design Parameter r on Pattern Matching Using the SPCA-Sdist Method (w ) 200) R
optimum NP
N1
N2
p (%)
η (%)
ξ (%)
0 1/4 1/3 1/2 2/3 3/4 1
17 11 14 15 17 11 39
9 8 10 11 12 8 15
8 3 4 4 5 3 24
51 75 69 70 69 75 37
59 57 66 71 79 57 89
55 66 67 71 74 66 63
Table 14. Effect of Design Parameter w on Pattern Matching Using the SPCA-Sdist Method (r ) 2/3) w
optimum NP
N1
N2
p (%)
η (%)
ξ (%)
100 200 300 500
14 17 15 17
11 12 11 11
3 5 4 6
75 69 73 67
72 79 75 76
74 74 74 72
method. Pattern matching results for three different values of w are summarized in Table 14. The values of w were chosen so that they correspond to approximately 1/ , 1/ , 1/ , and 1/ of the snapshot window size of 1024 10 5 3 2 data points. Figure 11 shows the variation of p, η, and ξ with pool size NP, for different values of w. In Figure 11 and later figures, the values of p, η, and their average ξ are the mean values for the 28 operating conditions in Table 2. It is evident from Table 14 and Figure 11 that pattern matching results are not very sensitive to the choice of w, and the performance slowly decreases as w increases. Because the results for w ) 100 and 200 are very close, we recommend that w be approximately one-tenth to one-fifth the length of the snapshot window. Also note that the moving window can never match the exact start and end times of operating conditions in the historical data because w is only approximately 1/10, 1/5, and 1/2 the snapshot window length of m ) 1024. The remaining 24 observations slowly increase the mismatch between the moving window and the actual start and end times of the operating conditions in a cumulative manner.
5.6. Reduced Measurement Set. The pattern matching results for the reduced measurement set of Table 3 are shown in Figure 12. The best performance is achieved for NP ) 18, which gives p ) 58%, η ) 69%, and ξ ) 64%. From a comparison of Figures 11 and 12, it is clear that the absence of key measurements such as the feed and reactor concentrations and the coolant feed temperature has an adverse impact on pattern matching. However, satisfactory results are still obtained. 6. Conclusions A novel methodology has been developed for pattern matching in multivariate time-series databases. The proposed approach locates periods of historical data that are similar to an arbitrarily selected current period that is of interest. For example, the current period could represent a period of abnormal process operation. The diagnosis of the current period could be greatly facilitated if similar periods of operation could be located in the historical data. The proposed pattern matching methodology is both data driven and unsupervised (does not require process models or training data). The user must only specify the “snapshot data”, i.e., the relevant measured variables and the duration of the period of interest. The new approach is based on a PCA similarity factor and a new metric for the distance between two data sets. The computational load is modest, which allows processing of large amounts of process data in a relatively short period of time. In an extensive CSTR simulation case study, the proposed approach performed better than the existing PCA methodologies for a wide range of operating conditions, faults, and disturbances. The combination of PCA and distance similarity factors provides an effective way of matching patterns in large databases that consist of multivariate time-series data. Because the proposed methodology is unsupervised and is of a general nature, it has potential applications to a wide range of engineering, scientific, and business applications.
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002 3837
Acknowledgment Financial support from the UCSB Process Control Consortium, OSI Software, Inc., and Chevron Research & Technology Co. is gratefully acknowledged. Notation A ) cross-sectional area of the reactor (dm2) AC ) heat-transfer area (dm2) CA ) concentration of species A in the reactor (mol/L) CAF ) concentration of species A in the reactor feed stream (mol/L) Cp ) heat capacity of the reactor contents (J/g‚K) CpC ) heat capacity of the coolant (J/g‚K) Cv ) valve coefficient (L/min‚psi12) E ) activation energy (J/mol) f(l) ) valve characteristic gs ) specific gravity of the fluid h ) liquid level in the reactor (dm) ∆H ) heat of reaction (J/mol) k0 ) pre-exponential factor (min-1) l ) fraction that the valve is open (0 ) closed, 1 ) fully open) m ) number of data points in the snapshot data window LVc ) critical number of high limit or low limit violations N1 ) number of records in the candidate pool that are actually similar to the current snapshot N2 ) number of records in the candidate pool that are not similar to the current snapshot NDB ) total number of historical data records that are similar to the current snapshot NP ) candidate pool size ) N1 + N2 p ) pool accuracy (%) ) N1/NP × 100% ∆Pv ) pressure drop across the control valve (psi) Q ) flow rate of the reactor outlet stream (L/min) QC ) coolant flow rate (L/min) QF ) feed flow rate of the reactor feed stream (L/min) R ) universal gas constant (J/mol‚K) T ) reactor temperature (K) TC ) temperature of the coolant in the cooling jacket (K) TCF ) inlet coolant temperature (K) TF ) reactor feed temperature (K) U ) heat-transfer coefficient (J/min‚K‚dm2) w ) window movement rate Greek Symbols η ≡ pattern matching efficiency (%) ) N1/NDB × 100% F ≡ density of the reactor contents (g/L) FC ≡ density of the coolant (g/L) ξ ≡ average of pool accuracy and pattern matching efficiency (%) ) (p + η)/2
Literature Cited (1) Agrawal, R., Stoloroz, P., Piatetsky-Shapiro, G., Eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining; AAAI Press: Menlo Park, CA, 1998. (2) Apte´, C. Data Mining: An Industrial Research Perspective. IEEE Trans. Comput. Sci. Eng. 1997, 4 (2), 6-9. (3) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Commun. ACM 1996, 39, 27-34. (4) Bishop, C. M. Neural Networks for Pattern Recognition; Clarendon Press: Oxford, U.K., 1995. (5) Duda, R.; Hart, P. Pattern Classification and Scene Analysis; John Wiley: New York, 1973. (6) Fukunaga, K. Introduction to Statistical Pattern Recognition; Academic Press: New York, 1990. (7) Looney, C. G. Pattern Recognition Using Neural Networks: Theory and Algorithms for Engineers and Scientists; Oxford University Press: New York, 1997.
(8) Shu¨rmann, J. Pattern Classification: A Unified View of Statistical and Neural Approaches; John Wiley: New York, 1996. (9) Davis, J. F.; Piovoso, M. J.; Kosanovich, K. A.; Bakshi, B. R. Process Data Analysis and Interpretation. In Advances in Chemical Engineering; Wei, J., Anderson, J. L., Bischoff, K., Seinfeld, J., Eds.; Academic Press: New York, 1999; Vol. 25, pp 1-199. (10) Carpenter, G. A.; Grossberg, S.; Rosen, D. B. ART-2A: An Adaptive Resonance Algorithm for Rapid Category Learning and Recognition. Neural Networks 1991, 4, 493-504. (11) Vedam, H.; Venkatasubramanian, V. A Wavelet TheoryBased Adaptive Trend Analysis System for Process Monitoring and Diagnosis. Proc. Am. Control Conf. 1997, 309-313. (12) Wong, J. C.; McDonald, K. A.; Palazoglu, A. Classification of Abnormal Plant Operation Using Multiple Process Variable Trends. J. Process Control 2001, 11, 409-418. (13) Kermit, M.; Eide, Å. J. Audio Signal Identification Via Pattern Capture and Template Matching. Pattern Recognit. Lett. 2000, 21, 269-275. (14) Last, M.; Klein, Y.; Kandel, A. Knowledge Discovery in Time Series Databases. IEEE Trans. Syst. Manage. Cyber., Part B 2001, 31, 160-169. (15) Kaufman, L.; Rousseeuw, P. R. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley: New York, 1990. (16) Wang, X. Z.; McGreavy, C. Automatic Classification for Mining Process Operational Data. Ind. Eng. Chem. Res. 1998, 37, 2215-2222. (17) Basseville, M. On-Board Component Fault Detection and Isolation Using the Statistical Local Approach. Automatica 1998, 34, 1391-1415. (18) Basseville, M.; Nikiforov, I. V. Detection of Abrupt Changes; Prentice Hall Inc.: Englewood Cliffs, NJ, 1993. (19) Agrawal, R.; Lin, K.; Sawhney, H. S.; Shim, K. Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time Series Databases. Proceedings of the 21st International VLDB Conference; Morgan Kaufman: San Francisco, CA, 1995; pp 490-501. (20) Faloutsos, C.; Ranganathan, M.; Manolopoulos, Y. Fast Subsequence Matching in Time Series Databases. SIGMOD Record, 1994, 23, 419-429. (21) Keogh, E.; Chakrabarti, K.; Pazzani, M.; Mehrotra, S. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Info. Syst. 2001, 3, 263-286. (22) Wold, S. C. Pattern Recognition by Means of Disjoint Principal Component Models. Pattern Recognit. 1976, 8, 127-139. (23) Yoon, S.; MacGregor, J. F. Fault Diagnosis With Multivariate Statistical Models. Part I: Using Steady-State Fault Signatures. J. Process Control 2001, 11, 387-400. (24) Zhang, J.; Martin, E.; Morris, A. J. Fault Detection and Classification Through Multivariate Statistical Techniques. Proceedings of the American Control Conference; IEEE: Piscataway, NJ, 1995; pp 751-755. (25) Dunia, R.; Qin, S. J. Joint Diagnosis of Process and Sensor Faults Using Principal Component Analysis. Control Eng. Pract. 1998, 6, 457-469. (26) Ku, W.; Storer, R. H.; Georgakis, C. Disturbance Detection and Isolation by Dynamic Principal Component Analysis. Chemom. Intell. Lab. Syst. 1995, 30, 179-196. (27) Kano, M.; Nagao, K.; Ohno, H.; Hasebe, S.; Hashimoto, I. Dissimilarity of Process Data for Statistical Process Monitoring. Proceedings of the IFAC Symposium on Advances in the Control of Chemical Processes (ADCHEM); Elsevier Science: Kidlington, U.K., 2001; Vol. I, pp 231-236. (28) Raich, A.; C¸ inar, A. Statistical Process Monitoring and Disturbance Isolation in Multivariate Continuous Processes. Proceedings of IFAC-ADCHEM’94; Elsevier Science: Kidlington, U.K., 1995; pp 452-457. (29) Raich, A.; C¸ inar, A. Multivariate Statistical Methods for Monitoring Continuous Processes: Assessment of Discrimination Power of Disturbance Models and Diagnosis of Multiple Disturbances. Chemom. Intell. Lab. Syst. 1995, 30, 37-48. (30) Raich, A.; C¸ inar, A. Statistical Process Monitoring and Disturbance Diagnosis in Multivariate Continuous Processes. AIChE J. 1996, 42, 995-1009. (31) Raich, A.; C¸ inar, A. Diagnosis of Process Disturbances by Statistical Distance and Angle Measures. Comput. Chem. Eng. 1997, 21, 661-673. (32) Johannesmeyer, M. C.; Singhal, A.; Seborg, D. E. Pattern Matching in Historical Data. AIChE J., in press.
3838
Ind. Eng. Chem. Res., Vol. 41, No. 16, 2002
(33) Kano, M.; Hasebe, S.; Hashimoto, I.; Ohno, H. A New Multivariate Statistical Process Monitoring Method Using Principal Component Analysis. Comput. Chem. Engr. 2001, 25, 11031113. (34) Singhal, A.; Seborg, D. E. Matching Patterns from Historical Data Using PCA and Distance Similarity Factors. Proceedings of the 2001 American Control Conference; IEEE: Piscataway, NJ, 2001; pp 1759-1764. (35) Jackson, J. E. A User’s Guide to Principal Components; John Wiley: New York, 1991. (36) Jolliffe, I. T. Principal Component Analysis; SpringerVerlag: New York, 1986. (37) Kourti, T.; MacGregor, J. F. Multivariate SPC Methods for Process and Product Monitoring. J. Qual. Technol. 1996, 28, 409-428. (38) Wang, X. Z. Data Mining and Knowledge Discovery for Process Monitoring and Control; Springer: London, 1999. (39) Chiang, L. H.; Russel, E. L.; Braatz, R. D. Fault Detection and Diagnosis in Industrial Systems; Springer: London, 2001. (40) Valle, S.; Li, W.; Qin, S. J. Selection of the Number of Principal Components: The Variance of the Reconstruction Error Criterion with a Comparison to Other Methods. Ind. Eng. Chem. Res. 1999, 38, 4389-4401. (41) Kourti, T. J.; Lee, J.; MacGregor, J. F. Experiences with Industrial Applications of Projection Methods for Multivariate Statistical Process Control. Comput. Chem. Eng. 1996, 20, s745s750. (42) Martin, E. B.; Morris, A. J. An Overview of Multivariate Statistical Process Control in Continuous and Batch Performance Monitoring. Trans. Inst. Meas. Control 1996, 18, 51-60.
(43) Wise, B. M.; Gallagher, N. B. PLS Toolbox 2.1: User’s Manual; Eigenvector Research, Inc.: Manson, WA, 2000. (44) Singhal, A.; Seborg, D. E. Dynamic Data Rectification Using the Expectation Maximization Algorithm. AIChE J. 2000, 46, 1556-1565. (45) Krzanowski, W. J. Between-Groups Comparison of Principal Components. J. Am. Stat. Assoc. 1979, 74 (367), 703-707. (46) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press: London, 1979. (47) Russo, L. P.; Bequette, B. W. Effect of Process Design on the Open-Loop Behavior of a Jacketed Exothermic CSTR. Comput. Chem. Eng. 1996, 20, 417-426. (48) Seborg, D. E.; Edgar, T. F.; Mellichamp, D. A. Process Dynamics and Control; John Wiley: New York, 1989. (49) Johannesmeyer, M. C. Abnormal Situation Analysis Using Pattern Recognition Techniques and Historical Data. M.Sc. Thesis, University of California, Santa Barbara, CA, 1999. (50) Sorsa, T.; Koivo, H. Application of Artificial Neural Networks in Process Fault Diagnosis. Automatica 1993, 29, 843-849. (51) Vaidyanathan, R.; Venkatasubramanian, V. Representing and Diagnosing Dynamic Process Data Using Neural Networks. Eng. Appl. Artif. Intell. 1992, 5, 11-21.
Received for review June 18, 2001 Revised manuscript received December 10, 2001 Accepted December 13, 2001 IE010517Z