Ind. Eng. Chem. Res. 2001, 40, 3612-3622
Fault Diagnosis of Single-Variate Systems Using a Wavelet-Based Pattern Recognition Technique

Fardin Akbaryan and P. R. Bishnoi*

Department of Chemical and Petroleum Engineering, The University of Calgary, 2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada
A pattern recognition-based methodology is presented for fault diagnosis of a single-variate and dynamic system. A group of wavelet coordinates that discriminates the classes of events most efficiently among all wavelet coordinates is determined according to the linear discriminant basis (LDB) method and a principal component analysis (PCA) technique. The proposed feature extractor couples the LDB method with the double wavelet packet tree in order to determine the configuration of pattern windows that causes the most discrimination among classes. Lifting scheme-based wavelet filters are used so that the required computation time is reduced significantly without degrading the robustness of the method. To reduce the size of the feature space, the wavelet coordinates are projected, using a PCA technique, into a new low-dimensional space in which minimum correlation exists among the new space variables. The tuning of some parameters that affect the performance of the approach is also discussed. The feature classifier is a binary decision tree that employs a soft-thresholding scheme for recognition of a noisy input pattern. The performance of the proposed technique is examined using a classification benchmark problem and fault classification problems for the Tennessee Eastman process. It is observed that the proposed pattern recognition methodology satisfactorily classifies noisy input patterns into the known classes of events.

1. Introduction

A system operates in a faulty condition when its behavior deviates considerably from normal and predefined operating strategies. Equipment failure, sensor degradation, set-point changes, and disturbances in the input streams are instances of faulty states for a system. The first group of faults, known as deterministic faults, is generated by a fixed-magnitude cause and is usually damped by using a robust control strategy.
Various magnitudes of a deterministic cause, even at different operating points, produce faults with similar trends. The second group, known as stochastic faults, results from causes whose magnitudes change randomly with time. Moreover, the control scheme cannot drive the system back to a steady-state operating condition. A stochastic fault, even starting from the same initial operating point, can exhibit different patterns. Fault diagnosis is an important part of the process supervisory routines that determines the state of the system (faulty or normal) as well as the type of fault. Analytical model-based,1-5 causal analysis,6,7 and pattern recognition8-13 methods are the main groups of fault diagnosis approaches. Chemical processes are often characterized by nonlinear behavior, noisy inputs, and unknown parameters. Thus, a model describing the system behavior, either mathematically or qualitatively, will be quite complicated.1-3 In contrast, computer-based pattern recognition extracts a wealth of information from the large amount of process data quite satisfactorily, without concern about the nature of the system. Some fault diagnosis methods assume that the fault occurrence drives the system to a new steady-state condition. The system characteristics at the two different operating points are then used for diagnosis purposes.4,5,11,12

* To whom correspondence should be addressed. Tel.: (403) 220-6695. Fax: (403) 282-3945. E-mail: [email protected].

10.1021/ie000779l CCC: $20.00 © 2001 American Chemical Society. Published on Web 07/13/2001.

If the system happens to reach its initial steady-state condition, these methods will not be useful for fault diagnosis. As another shortcoming, these approaches cannot deal with stochastic faults because the system cannot reach a steady-state point. If transient trends of system variables are used as patterns, the fault diagnosis method is freed from considering the steady-state conditions.8-10,13 This implies that the diagnosis method is applicable equally to any type of fault and final system condition. In the present work, we propose a supervised pattern recognition methodology for fault diagnosis of single-variate and dynamic systems. The patterns are transient trends of a process variable resulting from a disturbance in the system. The technique assesses the similarity of new, unknown patterns with the prototypes of each class, and the most similar class is considered the source of the faulty behavior. Because transient trends contain valuable information scattered over the time and frequency domains, the feature extractor must be equally efficient in both domains. A multiscale wavelet-based transform serves in this work as the feature extractor. The Fourier transform (FT) suffers from an inability to extract time-domain information.14 Although the short-time FT (STFT) is able to process temporal features, its performance is inferior to that of the wavelet transform (WT), especially for short-lived segments of a pattern.14 The linear discriminant basis (LDB) method15 is modified in this work and used as the basis of the proposed feature extractor. In addition to the whole-line wavelet filters used in the original LDB method, lifting scheme (LS)-based wavelet filters14 are employed by the feature extractor. As their main advantage, the LS filters require less computation time than the whole-line wavelet filters. The information content of a signal depends on the length of the data sequence. A large window of data gives more information about the general trend of a pattern, whereas small windows focus more on the local structure of a pattern. The proposed feature extractor is able to choose the best set of nonoverlapping windows adaptively so that the selected features for each class are maximally discriminated. This helps the feature classifier define a more robust decision scheme. The proposed feature classifier is based on the binary decision tree (DT) approach implemented in many classification routines. A DT-based classifier is trained easily and needs few a priori assumptions. The proposed tree classifies the extracted features according to a soft-thresholding technique. The tree determines the a posteriori probabilities that the extracted features belong to different classes. For ease of understanding, the frameworks of multiscale feature extraction, the wavelet-based transforms, and the induction technique for the DT classifier are introduced first. The proposed methodology for feature extraction and classification, given a set of noisy data, is then described. To demonstrate the efficacy of the proposed algorithm, it is applied to simulated data.
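As an illustration of the lifting formulation mentioned above, one Haar decomposition level reduces to a single predict step and a single update step on the split samples, with no auxiliary filter buffers. This is a minimal sketch under our own naming, not code from the paper:

```python
import numpy as np

def haar_lifting_step(x):
    """One level of the Haar wavelet transform via lifting.

    Split -> predict -> update on the even/odd halves; the in-place
    structure is what makes lifting cheaper than convolution-based
    filter banks.
    """
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even            # predict odd samples from even ones
    approx = even + detail / 2.0   # update so the running average is kept
    return approx, detail

# usage: two-level decomposition of a short signal
signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a1, d1 = haar_lifting_step(signal)   # pairwise means and differences
a2, d2 = haar_lifting_step(a1)
```

Here the approximation coefficients are the pairwise means of the previous level, so coarser scales summarize the general trend while the detail coefficients capture local structure.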
2. Background on Pattern Recognition

Pattern recognition is an algorithm that determines the most appropriate class of event(s) for a given unlabeled input pattern. To recognize an unknown pattern, two main steps are usually followed: (1) feature extraction and (2) feature classification.

2.1. Feature Extraction. The transient behavior of a chemical process shows the effects of different physicochemical events such as process dynamics, sensor noise, faults, and external loads. These events, known as features, can be observed over different time and frequency ranges. Filtering is a conventional technique for extracting the features. A filtering technique that explores the entire frequency and time domains simultaneously is more reliable for extracting the features. Multiresolution analysis (MRA) of a pattern is considered a reliable basis for filtering the pattern in the time and frequency domains.10,14,16 A linear time-frequency representation (TFR) aims to present a pattern y(t) as a weighted summation of basis functions:

y(t) = \sum_{i=1}^{N} c_i \psi_i(t)   (1)

where \psi_i(t) stands for the basis function, c_i is the weighting factor, and N is the number of sample points. The MRA of a pattern combines the pattern's TFR at different sampling rates, i.e., resolutions, so that fine details and the general trend of the pattern can be captured with a desirable accuracy.14 The WT14 is considered a highly efficient approach for the MRA. The WT tiles the time-frequency plane effectively such that the main features of a pattern, located at various frequencies and times, are extracted with minimum redundancy. The wavelet basis functions are well localized in the time and frequency domains.

2.1.1. Wavelet Packet. The wavelet packet transform (WPT) is a generalized version of the WT that decomposes even the high-frequency bands kept intact in the WT. Unlike the WT, the WPT decomposes the pattern into more frequency bands at each time scale, so that a set of overcomplete wavelet coefficients is generated.14 In the WPT, any subspace, regardless of its type, is decomposed into two coarser subspaces:

\Omega_{m,f} = \Omega_{m-1,2f} \oplus \Omega_{m-1,2f+1},  m = L, L-1, ..., 0,  f = 0, 1, ..., 2^m - 1   (2)

where m represents the scale and f denotes the frequency counter for the subspaces at each scale. Each subspace is called a packet, and a binary packet tree is the ensemble of all of these subspaces. Thus, a technique must be implemented to choose the best set of basis functions from this group of redundant subspaces. The formulation for selecting the best packets, for classification purposes, is discussed below.

Best Basis Selection. Saito15 proposed the LDB methodology, which determines a set of best packets that maximize the discrimination among different classes of data. In this technique, the importance of each packet is measured quantitatively by a statistical distance-based criterion termed the discriminant information function (DIF), D(p,q), where p and q are two nonnegative vectors with \sum_k p_k = \sum_k q_k = 1. The DIF can be modeled by the j-divergence criterion:

D(p,q) = \sum_{i=1}^{n} \left( p_i \log\frac{p_i}{q_i} + q_i \log\frac{q_i}{p_i} \right)   (3)

The LDB method uses the time-frequency energy maps of the classes to compute the function D. The time-frequency energy map of class c is a table of positive real values, indexed by m, f, and k:

\Gamma_c(m,f,k) = \sum_{i=1}^{N_c} (d_{m,f,k}^{i})^2 \Big/ \sum_{i=1}^{N_c} |y_i^{(c)}|^2   (4)

where N_c is the number of patterns for the cth class and d_{m,f,k}^{i} is the kth wavelet coefficient located in the fth packet of the mth scale, obtained by transforming the ith pattern into a wavelet packet tree. The DIF for each packet is defined by

D(\{\Gamma_c(m,f,\cdot)\}_{c=1}^{N_C}) = \sum_{k=1}^{2^m} \sum_{i=1}^{N_C-1} \sum_{j=i+1}^{N_C} D(\Gamma_i(m,f,k), \Gamma_j(m,f,k))   (5)
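As a toy illustration of eqs 3-5, the sketch below builds one Haar detail packet for two small classes, forms their energy maps, and scores the packet with the j-divergence. The epsilon guard in the logarithms and the toy data are our assumptions, not prescribed by the paper:

```python
import numpy as np

def haar_detail(y):
    """One Haar detail packet: pairwise differences of the samples."""
    y = np.asarray(y, dtype=float)
    return y[1::2] - y[0::2]

def j_divergence(p, q, eps=1e-12):
    """Eq 3: symmetric relative entropy of two nonnegative sequences.
    A small epsilon keeps the logarithms finite (our guard)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

def energy_map(coeffs, patterns):
    """Eq 4: coefficient-wise packet energy of a class, summed over its
    training patterns and normalized by their total signal energy."""
    num = np.sum(np.asarray(coeffs) ** 2, axis=0)
    den = np.sum(np.asarray(patterns, dtype=float) ** 2)
    return num / den

# two training patterns per class: class A oscillates, class B is piecewise flat
class_a = [[0, 1, 0, 1], [0, 2, 0, 2]]
class_b = [[1, 1, 2, 2], [3, 3, 1, 1]]

gamma_a = energy_map([haar_detail(y) for y in class_a], class_a)
gamma_b = energy_map([haar_detail(y) for y in class_b], class_b)

# eq 5 for this single packet: with two classes there is one pairwise term
packet_score = j_divergence(gamma_a, gamma_b)
```

The oscillatory class concentrates its energy in the detail packet while the flat-paired class leaves it empty, so this packet receives a large discriminant score and would be favored by the best basis selection.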
where N_C is the number of classes. The details of the computation steps for the LDB method can be found in the work of Saito.15 The LDB reduces the complexity of the classification algorithm by retaining the most discriminant features. When the information content is dispersed throughout the entire time-frequency plane, however, retaining only a selected group of features may be far from the optimal solution. Englehart16 therefore used the PCA method to seek the best combination of all of the features in a lower dimensional space.

2.1.2. Dynamic WPT. Bakshi and Stephanopoulos17 proposed time-varying wavelet packet analysis, which is utilized mainly for on-line compression of nonstationary signals. The WT and WPT require a minimum number of data points to decompose a given signal into the next coarser scale. A WPT based on Haar wavelet filters, for instance, needs two data points to construct a two-level packet tree. As the number of samples increases to four, two more packet trees would
be added to the ensemble of packet trees. When packet trees of similar depth are arranged in a row, a double wavelet packet tree (DWPT) is constructed. The nodes of this tree are the single packet trees that are established on-line during sample collection. The configuration of the DWPT depends on the type of wavelet filters and the length of the signal. The best packet selection algorithms can be applied to find not only the best packets of each single tree but also the best set of packet trees within the double tree.

2.2. Feature Classification. The DT is a supervised classifier, with some desirable properties,18 that has been employed in a broad range of classification tasks. The outputs of a DT are as accurate as those of other classification algorithms such as artificial neural networks.19,20 A DT consists of a series of decision and terminal nodes. At each decision node, a specified test is performed on a selected element (attribute) of the input pattern. Depending on the test result, the pattern descends to another node until it reaches a terminal node (leaf). A DT is constructed by recursive partitioning of the data space, represented by the training patterns, until stopping criteria are met at each of the terminal nodes. Binary trees are preferred to nonbinary ones because the former are not biased in favor of attributes with many outcomes.21 Moreover, binary trees split the data space into more regions than nonbinary trees do. The classification error depends largely on the selection of appropriate attributes for each decision node. A test on an attribute that divides the data set nontrivially is considered a potential candidate for categorizing the input instance. The incremental tree induction (ITI) model21 employs a form of the Kolmogorov-Smirnov distance (KSD) to score each test for partitioning a set of examples at every decision node. For continuous attributes, the suggested KSD would be
KSD(T, A_k) = \max_{1 \le c \le C}
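A Kolmogorov-Smirnov-style score for a continuous attribute can be sketched as the largest gap between the empirical distribution functions of one class and the rest of the examples, maximized over the classes. This is a generic form under our own naming; the exact weighting used by ITI may differ:

```python
import numpy as np

def ks_distance(values_c, values_rest):
    """Two-sample Kolmogorov-Smirnov distance: the largest gap between
    the empirical CDFs of an attribute within class c and within the
    remaining examples."""
    cut_points = np.sort(np.concatenate([values_c, values_rest]))
    cdf_c = np.searchsorted(np.sort(values_c), cut_points, side="right") / len(values_c)
    cdf_r = np.searchsorted(np.sort(values_rest), cut_points, side="right") / len(values_rest)
    return float(np.max(np.abs(cdf_c - cdf_r)))

def score_attribute(attribute_column, labels):
    """Score one continuous attribute for a decision node: the best KS
    separation achieved by any one-class-versus-rest partition."""
    return max(
        ks_distance(attribute_column[labels == c], attribute_column[labels != c])
        for c in np.unique(labels)
    )

# usage: a well-separated attribute scores near 1, an uninformative one near 0
x = np.array([0.1, 0.2, 0.3, 5.0, 5.1, 5.2])
y = np.array([0, 0, 0, 1, 1, 1])
```

Because the score depends only on the ordering of the attribute values, it is insensitive to monotone rescaling of the attribute, which is convenient when the extracted wavelet features span very different magnitudes.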