Ind. Eng. Chem. Res. 2010, 49, 9184–9194
Unsupervised Process Fault Detection with Random Forests

Lidia Auret and Chris Aldrich*

Department of Process Engineering, University of Stellenbosch, Private Bag X1, Matieland, 7602 Stellenbosch, South Africa
* To whom correspondence should be addressed. Fax: +27(21)8082059. E-mail: [email protected].

Process monitoring technology plays a vital role in the automation of mineral processing plants, where there is an increased emphasis on safe, cost-effective, and environmentally responsible operation. Members of an important class of advanced diagnostic systems are data-driven and deal with potentially large numbers of variables at any given time by generating diagnostic sequences in lower-dimensional spaces. Despite rapid development in this field, nonlinear process systems remain challenging, and in this investigation, a novel approach to the monitoring of complex systems based on the use of random forest models is proposed. Random forest models consist of ensembles of classification and regression trees in which the model response is determined by voting committees of independent binary decision trees. In this study, a framework for diagnosing steady- and unsteady-state faults with random forests is proposed and demonstrated with simulated and real-world case studies.

1. Introduction

The chemical process industries are facing stiff challenges in the form of increasing energy costs, increasingly stringent environmental regulations, and global competition. Although advanced control is widely recognized as essential to meeting these challenges, implementation is hindered by more complex, larger-scale circuit configurations; the tendency toward plantwide integration; and, in some cases, an increasing shortage of trained personnel. In these environments, where process operations are highly automated, algorithms to detect and classify abnormal trends in process measurements are critically important.

Statistical projection methods, such as principal component analysis (PCA) and partial least-squares, are commonly used for this purpose. PCA, in particular, has been used widely in process monitoring and is typically used to build a model and a residual subspace from historical process data associated with normal operating conditions. This allows the process to be monitored by squared distances in the two subspaces, based on T2 and Q statistics.1 Only a few principal components are usually required to detect and identify process faults when the correlation between process variables is high, as is often the case with plant data.

These methods assume the process variables to be uncorrelated in time, which can frequently be unrealistic, and as a result, several extensions of these methods have been proposed to cope with systems for which these assumptions are violated. These include moving PCA,2 dynamic PCA,3 and time-lagged PCA.4 Despite these innovations, nonlinear methods are not yet well-established. This is even more so with equipment or process condition monitoring, where the dynamics of the systems can be strongly nonlinear.

More generally, fault diagnostic methods typically allow the projection of data onto some feature space (forward mapping), as well as reconstruction of the data from the projected features (reverse mapping). Reconstruction of the data is used for both fault detection and fault identification, that is, finding the process variables associated with and possibly causing the faulty condition. In principle, it is not necessary to use the same methods to extract features from the data and to do the forward and reverse
mappings, although this is often conveniently done by methods such as principal component analysis or nonlinear equivalents thereof. Based on the framework depicted in Figure 1, machine learning methods are readily extended to fault diagnostic applications in process systems. This allows greater flexibility, in that the three operations comprising feature extraction and the forward and reverse mappings are decoupled and can therefore be tailored to the specific system being monitored.

Figure 1. General approach to fault diagnosis, where data X ∈ R^(n×m) are projected onto a score space T ∈ R^(n×a) and a residual space E ∈ R^(n×m).

Within this framework, therefore, a novel approach based on the use of random forest (RF) models in multivariate statistical process control and process monitoring is considered, and its potential is demonstrated by three illustrative case studies. Although random forest models have recently started to attract attention in applications based on supervised fault detection,5,6 their use in an unsupervised capacity has received very little consideration to date. Therefore, this study focuses on using
unsupervised random forest methods with projection- and reconstruction-based fault diagnosis.

The rest of this article is organized as follows: In section 2, random forest models are briefly reviewed, after which the construction of fault diagnostic models with random forests is considered in section 3. In the three sections following thereafter, the application of random forests to process fault detection and identification is investigated, and in the last section, the conclusions ensuing from this study are summarized.

2. Random Forest Models

Random forests are nonlinear regression or classification models consisting of ensembles of regression or classification trees, such that each tree depends on a random vector sampled independently from the data.7

2.1. Classification Trees. A classification tree8 divides the feature space into recursive binary partitions, allocating a fixed-value prediction to each partition. To select optimal splitting variables and split positions ς for recursive partitioning, node purity criteria are used.8 For a classification problem with class outcome κ = 1, 2, ..., C, the following is defined for region R_η with N_η observations (node η):

  \hat{p}_{\eta\kappa} = \frac{1}{N_\eta} \sum_{x_i \in R_\eta} I(y_i = \kappa)    (1)
Here, p̂_ηκ is the fraction of samples allocated to node η with class outcome κ, and I(·) is an indicator function: if observation y_i belongs to class κ, the indicator function returns 1; if not, it returns 0. Observations reporting to node η are classified as class κ(η), the majority class in that node, where

  \kappa(\eta) = \arg\max_{\kappa} \hat{p}_{\eta\kappa}    (2)
The node impurity measure Q_η for node η of tree u can be defined as

  Q_\eta = \sum_{\kappa=1}^{C} \hat{p}_{\eta\kappa} (1 - \hat{p}_{\eta\kappa})    (3)
where Q_η is typically the Gini index,8 often used in classification problems. Other indices can also be used, if so desired. This procedure is known as the classification and regression tree (CART) algorithm.

2.2. Random Forests. A random forest consists of K decision trees, where each tree is generated based on a random vector. With random forests, these random vectors describe the training samples and available node splitting variables for each tree. The random forest construction algorithm7 is summarized as follows:
• For k = 1 to K (size of ensemble)
  o Construct a bootstrap sample with replacement, X_k, from the learning set X, of the same size as the learning set
  o Grow a random forest tree u_k on X_k by employing the CART algorithm, with the following modification at each node
    - Select M input variables from X_k to use as possible split variables
    - Calculate the best split position ς* from these M variables
    - Split the node into two child nodes
  o Repeat the above tree-growing algorithm until the following stopping criteria are achieved
    - A node size of one (for classification) or five (for regression)
    - A node with homogeneous class membership or response values
• Output the ensemble of trees {u_k}, k = 1, ..., K
• A new prediction is made as follows:
  o The prediction of each tree, h_k, is calculated
  o For classification, the majority vote over all K trees is assigned
  o For regression, the average response value over all K trees is assigned

Only two model parameters are required when constructing random forests, namely, the number of trees K and the number of random split variables M. A choice of M as the floored square root of m (the total number of predictor variables) is suggested by Breiman.7

2.3. Random Forest Feature Extraction. Random forests can be used to construct proximity matrices for data on the basis of whether samples report to the same leaf nodes or not. A leaf node essentially spans a hyperrectangle in the input space; if two samples report to the same hyperrectangle, they must be proximal. The algorithm for constructing a proximity matrix9 is summarized as follows (a code sketch follows eq 4 below):
• Construct a random forest model on learning set X with response Y
• Create an empty similarity (proximity) matrix S
• For each tree k = 1 to K
  o For each sample combination (i, j), determine the terminal nodes to which the samples report
  o If a sample combination (i, j) reports to the same terminal node, increase S_ij by 1
  o Repeat for all possible sample combinations
• Scale the similarity matrix S by dividing by the number of trees K; the similarity matrix is symmetric and positive-definite, with entries ranging from 0 to 1 and diagonal elements equal to 1

The similarity matrix can be converted into a dissimilarity matrix, D = 1 - S. The dissimilarity matrix can be considered as Euclidean distances in high-dimensional space, and multidimensional scaling (MDS)10 can be used to obtain a lower-dimensional representation of the distances between data points. Given a dissimilarity matrix, multidimensional scaling attempts to find coordinates for data points that preserve the dissimilarities as Euclidean distances in the coordinate space. The preservation of dissimilarities is measured by a stress function φ (eq 4), the sum of squared differences between point dissimilarities and the new coordinate distances.11 Classical multidimensional scaling finds an explicit solution to the stress function by using eigendecomposition of the proximity matrix. The eigenvectors are the MDS coordinates, and the eigenvalues give an indication of every coordinate's contribution to the squared distance between points. These MDS coordinates constitute the random forest features.

  \phi(T) = \sum_{i \neq j} \left( D_{ij} - \lVert T_i - T_j \rVert \right)^2    (4)
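To make the proximity and MDS steps above concrete, the following is a minimal sketch rather than the authors' implementation: it assumes scikit-learn's RandomForestClassifier as the ensemble, a toy labeled data set, and textbook classical MDS (double-centering of the squared dissimilarities), whose exact scaling may differ from the variant used in this study. The unsupervised contrast-class construction described next plugs into the same proximity and MDS steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data set standing in for the learning set X with response Y
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

K = 500  # number of trees in the ensemble
forest = RandomForestClassifier(n_estimators=K, random_state=0).fit(X, y)

# Leaf index of every sample in every tree: shape (n_samples, K)
leaves = forest.apply(X)

# Proximity S_ij = fraction of trees in which samples i and j share a leaf node
S = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
D = 1.0 - S  # dissimilarity matrix, D = 1 - S

# Classical MDS: double-centre the squared dissimilarities and eigendecompose
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain the first a coordinates, scaled by the root eigenvalues, as RF features
a = 2
T = eigvecs[:, :a] * np.sqrt(np.clip(eigvals[:a], 0.0, None))
print(T.shape)  # (200, 2)
```

Because forest.apply returns all leaf assignments at once, the proximity matrix reduces to a single vectorized comparison over trees.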
A feature extraction technique must be able to compute significant features for data sets where no response variables are present. This forms part of unsupervised learning, whereas supervised learning attempts to relate input variables to a given response. Random forest classification and regression are supervised learning methods and require modification of the unsupervised learning problem before they can be applied.12 This is done by augmenting the data with an artificial or contrasting class. The unsupervised learning problem then becomes a supervised problem requiring separation of the two
classes by constructing a random forest model for this purpose. From this model, proximities can be calculated for the original unlabeled data, and multidimensional scaling can be applied to produce a set of features (coordinates) and their corresponding eigenvalues. These features now represent a projection of the original unlabeled data. The approach is summarized as follows:
• For an unlabeled learning set with input variables X, create a synthetic data set X0 by random sampling from the product of the marginal distributions of X
• Label the response of the X inputs as class 1 and the response of the X0 inputs as class 2
• Concatenate input matrices X and X0 as Z, with concatenated response Y
• Construct a random forest classification model to predict Y (class labels 1 and 2) given Z
• Create an empty similarity (proximity) matrix S
• For each tree k = 1 to K
  o For each sample combination of X_i and X_j, determine the terminal nodes to which the samples report
  o If a sample combination of X_i and X_j reports to the same terminal node, increase S_ij by 1
  o Repeat for all possible sample combinations
• Scale the similarity matrix S by dividing by the number of trees K; the similarity matrix is symmetric and positive-definite, with entries ranging from 0 to 1 and diagonal elements equal to 1
• Determine the dissimilarity matrix D = 1 - S
• Use the dissimilarity matrix D as input to classical multidimensional scaling, retaining the first a scaling coordinates T as random forest features

To date, the application of unsupervised random forest feature extraction as outlined above has been limited to the investigation of certain properties of RF features and their use in cluster analysis.12 In contrast, this study extends RF feature extraction to a forward and reverse mapping scheme, to facilitate unsupervised fault diagnosis, as explained in more detail in the next section.

3. Application of Random Forests to Fault Diagnosis of Steady-State Systems

3.1. Feature Extraction, Mapping, and Demapping. Random forest feature extraction provides feature scores for training data only and cannot explicitly calculate scores for data that do not appear in the training set. To calculate the scores and reconstruction of unseen data, explicit mapping (forward) and demapping (reverse) models must be constructed. In this study, both forward and reverse mappings are done with random forests. Because random forest models can handle only one output at a time, mapping requires as many random forests as there are feature variables, whereas demapping requires as many random forests as there are input variables.

3.2. Confidence Limits for Score and Residual Distances. Because the scores associated with normal operating conditions (NOC) might not have a Gaussian distribution and could even have a clustered structure, one-class support vector machines were used to determine confidence limits for the scores in feature space.13 A one-class support vector machine finds a function that is positive in dense data regions and negative where there are no data, by mapping the scores to a high-dimensional space and finding a hyperplane in this space that separates the data from the origin. Such a model requires a parameter ν that controls the tradeoff between the complexity of the decision surface and the number of training data lying within the estimated region.
  \min_{w \in F,\ \rho \in \mathbb{R},\ \xi \in \mathbb{R}^{m}} \;\; -\nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i    (5)

  subject to \quad \lVert w \rVert_{1} = 1, \quad [w \cdot \phi(x_i)] \geq \rho - \xi_i, \quad \xi_i \geq 0, \; i = 1, ..., m
In eq 5, w defines a separating hyperplane in the high-dimensional space F, ρ is a threshold or significance level, ξ_i are the margin errors, ν is a decision surface complexity parameter, and φ(x) is a nonlinear mapping function applied to the vector x. A simple grid search with cross-validation can be employed to determine ν automatically.14 The score distance (s) of a data point is then the negative one-class support vector machine probability of classification to the region defined by the training data. A detection threshold s_α for the one-class support vector machine score distances can be determined by a simple percentile approach, with a 1 - α confidence.

The grid search involves 5-fold cross-validation using normal operating conditions, testing a range of ν values from 2^-19 to 2^9. The cross-validation criterion is the average one-class support vector machine prediction accuracy on the left-out data for each fold. The maximum ν value that still ensures 70% average prediction accuracy is selected as the model parameter. (The 70% criterion proved visually successful in accounting for irregular decision boundaries for various data sets and reflects the irregularity of the decision boundary, not the fault detection confidence limit.) The percentile approach of defining s_α ensures a user-definable confidence limit.

Residual distances (r) can be computed as the sum of squared differences between the original and reconstructed process variables. A detection threshold r_α can again be defined with a simple percentile approach.

3.3. Contribution Plots. The approach outlined in the preceding sections facilitates detection of faults in the score or residual space of the data, but does not give an indication of the original variables that might have given rise to the fault in the first place. To identify the process variables associated with fault conditions, one can quantify the squared deviation of each reconstructed process variable from its original value. This gives a residual distance contribution C_r,j for each process variable j.

The offline training algorithm for random forest fault diagnosis is given as follows (a condensed code sketch follows eq 8):
• For feature extraction, perform the following steps
  o For a scaled data set X of normal operating conditions, calculate random forest features with the random forest feature extraction algorithm
  o Determine the reduced dimensionality a that captures significant variance, using the crossing criterion (see section 7.2)
  o Define the random forest feature scores T as the first a score vectors of random forest features
  o Train the mapping model with X as input and T as output
    - Random forest regression, a models
• For feature characterization, perform the following steps
  o Train a one-class support vector machine model, calculate the probability values for all data points, and define the score distance (s) as the negative one-class support vector machine probability
  o Determine a detection threshold for the score distances using the percentile approach
• For variable reconstruction, perform the following steps
  o Train the demapping model with T as input and X′ as output
    - Random forest regression, m models
  o Calculate the squared prediction errors Q:

    Q_j = (X_j - X_j')^2    (6)

  o Calculate the residual distance r:

    r = \sum_{j=1}^{m} Q_j    (7)

  o Determine the detection threshold r_α for the residual distances using the percentile approach
• For the contribution calculations
  o Calculate the residual distance contributions:

    C_{r,j} = (X_j - X_j')^2    (8)
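The offline training algorithm above condenses to a short script. The sketch below is illustrative and not the authors' code: X_noc and T are placeholder NOC data and RF feature scores (in practice T comes from the feature extraction of section 2.3), one RandomForestRegressor is fitted per output for the forward and reverse mappings, ν for the one-class SVM is chosen with the 70% cross-validation rule (scikit-learn restricts ν to (0, 1], so the upper end of the 2^-19 to 2^9 grid is truncated), and the SVM decision function is used as a stand-in for the membership probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_noc = rng.normal(size=(200, 5))   # placeholder scaled NOC data (n x m)
T = rng.normal(size=(200, 2))       # placeholder RF feature scores (n x a)

# Forward mapping: one forest per feature; reverse mapping: one forest per variable
mapping = [RandomForestRegressor(n_estimators=500, random_state=0).fit(X_noc, T[:, i])
           for i in range(T.shape[1])]
demapping = [RandomForestRegressor(n_estimators=500, random_state=0).fit(T, X_noc[:, j])
             for j in range(X_noc.shape[1])]

def cv_accuracy(nu, scores, folds=5):
    """Average fraction of left-out NOC scores accepted by a one-class SVM."""
    accs = []
    for train, test in KFold(n_splits=folds, shuffle=True, random_state=0).split(scores):
        svm = OneClassSVM(nu=nu).fit(scores[train])
        accs.append(np.mean(svm.predict(scores[test]) == 1))
    return float(np.mean(accs))

# Largest nu that still gives at least 70% average prediction accuracy
grid = [2.0 ** p for p in range(-19, 1)]
feasible = [nu for nu in grid if cv_accuracy(nu, T) >= 0.70]
oc_svm = OneClassSVM(nu=max(feasible) if feasible else grid[0]).fit(T)

def score_distance(T_new):
    # Negative decision function as a monotone stand-in for the negative
    # membership probability used in the paper
    return -oc_svm.decision_function(T_new)

def residual_distance(X_new):
    T_new = np.column_stack([m.predict(X_new) for m in mapping])
    X_rec = np.column_stack([d.predict(T_new) for d in demapping])
    Q = (X_new - X_rec) ** 2            # eq 6 and, per variable, eq 8
    return Q.sum(axis=1), Q             # eq 7 and the contribution matrix

# Percentile detection thresholds on the NOC training data (99% limits)
s_alpha = np.percentile(score_distance(T), 99)
r_alpha = np.percentile(residual_distance(X_noc)[0], 99)
```

Online monitoring then only requires evaluating score_distance and residual_distance on new observations and comparing the results with s_alpha and r_alpha.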
The proposed methodology is subsequently demonstrated by means of three case studies.

4. Case Study 1: Synthetic Data

In the first example, a bivariate distribution of two variables X1 and X2 is considered, as indicated in Figure 2. The data indicated by the blue circle are considered to be normal operating or reference data, whereas all of the other data are considered to be faulty data, with different fault conditions indicated by different colors, namely, red (A), green (B), and magenta (C).

Figure 2. Synthetic data (top left) showing training (NOC) data and three fault conditions: A, B, and C. Corresponding RF features for A-C are shown at top right, bottom left, and bottom right, respectively. δ_s indicates the missing alarm rate based on score distances, and black lines indicate 99% score distance limits based on training data score distances.

Mapping of the three fault conditions by the RF model is shown in Figure 2 in the top right and bottom panels. In these panels, the solid dark lines indicate the 99% confidence limits of the normal data (blue in Figure 2, top left panel), as determined with the one-class support vector machine approach described earlier (eq 5). The features T1 and T2 were extracted from an average proximity matrix derived from five unsupervised RF models. Because two variables X1 and X2 were mapped to two features, T1 and T2, four RF regression models were trained: two for the
forward mapping and two for the reverse mapping. In all instances, parameters K = 1000 and M = 1 were used during the construction of the RF models.

Note that the fault conditions inside the normal data are easily discernible in the score space (red and green markers in the top right and bottom left panels, respectively). The fault condition in which the data (magenta markers) lie far outside the normal operating region is hardly distinguishable from the normal data on the feature map. The reason for this is that data can be mapped only to the constrained responses defined by the training data. This is an artifact of decision tree response constraints: a decision tree prediction (in regression) can only be an average of some subsample of the original training data responses. Hence, the predictions will necessarily lie within the range of the minimum and maximum of the training responses.

Although fault condition C is poorly detected in the feature space of the RF, it is very strongly detected in the residual subspace of the RF (see Figure 3), because the reconstructed data will very closely approximate the training (NOC) data, leading to large errors and, hence, easy fault detection. This inability to extrapolate beyond the boundaries of the training data can therefore be exploited in the residual subspace to allow sensitive detection of certain fault conditions.

Figure 3. Residual distance diagnostics for synthetic data. (δ_r indicates the missing alarm rate based on residual distances, and gray lines indicate 99% residual distance limits based on training data residual distances.)

5. Case Study 2: Fault Diagnosis in Steady-State Flotation of Platinum Group Metals

In the second case study, random forest fault diagnosis is applied to a multivariate data set of flotation cell images obtained from a platinum group metals plant in South Africa. The data set consists of two conditions: normal operating conditions (NOC) and a simulated fault. The data set reflecting normal operating conditions
contains 183 samples of 30 variables. These variables were obtained from video images of the froth and include bubble velocities (variables 1-6), stability measures (variables 7-9), bubble sizes (variables 10 and 21-29), color variables derived from the froth images (variables 11-20), and the process variable air flow (variable 30). A fault set was created by decorrelating the red, green, blue, and gray color values. This could simulate a contaminant that has different color absorption properties than the expected minerals. The data set associated with the fault also consists of 183 samples of 30 variables.

To visualize the difference between the NOC and fault data, the principal components and scores were calculated for the combined NOC and fault data. The scores of the first three components (accounting for 65.4% of the cumulative variance) are shown in Figure 4. From this plot, it is apparent that there is considerable overlap between the NOC and fault data.

Figure 4. Visualization of normal operating conditions and fault data for flotation images data with principal component scores. (Variance per component is shown in parentheses.)

The random forest method was compared with fault diagnosis based on principal component analysis, as well as principal components extracted from the data with autoassociative neural networks. The two principal component approaches are briefly summarized here.

5.1. Linear Principal Component Analysis. Principal component analysis (PCA), a linear technique, is the most widely used of all dimension reduction techniques,15 operating by finding orthogonal vectors in the measurement space onto which the data are projected. The criterion for determining these principal components is to maximize the variance of the data projected onto said components.1 Each principal component is a linear combination of the measured variables, uncorrelated with all other components. The first component accounts for the most variance, the second component for the next most variance, and so on. The determination of these components is a problem with a guaranteed optimal solution, in the form of the spectral decomposition of the covariance matrix of the data:16
• For a data set X (n observations by m variables), construct the covariance matrix Σ:

  \Sigma = \frac{1}{n - 1} X^{T} X    (9)

• Calculate the eigenvectors V and eigenvalues Λ of the covariance matrix Σ using eigenvalue decomposition:

  \Sigma = V \Lambda V^{T}    (10)

• Determine the reduced dimensionality a that captures significant variance
• Calculate the principal component scores using the principal components P (the first a columns of the eigenvector matrix V):

  T = XP    (11)

5.2. Nonlinear Principal Component Analysis (NLPCA). Kramer17 investigated a nonlinear PCA extension based on autoassociative networks. The feed-forward network has an input layer of process variables, a mapping (G) hidden layer of nonlinear activation functions, a bottleneck hidden layer representing the nonlinear PCA scores, a demapping (H) hidden layer of nonlinear activation functions, and an output layer of reconstructed process variables (see Figure 5). Once training of the network weights has been completed through conjugate gradient optimization, the network can be split into separate mapping and demapping networks for unseen data.

Figure 5. Structure of an autoassociative network for nonlinear principal component analysis, after Kramer.17

To compare the performance of the different fault diagnostic methods (linear PCA, nonlinear PCA, and RF), the missing alarm rate is used. The missing alarm rate δ is defined as the fraction of known fault samples that are not detected for a given data set. A missing alarm rate can be calculated for both the score distances and the residual distances (δ_s and δ_r, respectively). One-class support vector machines were used to determine confidence limits for the score spaces of all three techniques. Confidence limits based on 99th percentiles were calculated for all statistics. For visualization purposes, two model components (a = 2) were used for all three models.

As before, the RF parameters were K = 1000 and M = 5, with the average proximities of five forests used as input to multidimensional scaling. For the RF models, the squared correlation coefficient of the original and reconstructed variables was 0.92. In the linear PCA model, the first two principal components account for 60.9% of the variance of the data, whereas the first two nonlinear principal components extracted with the autoassociative neural network resulted in a squared correlation coefficient of 0.68 based on the original and reconstructed variables. A 30:6:3:6:30 configuration (one input layer, three hidden layers, and an output layer identical to the input layer) was used for the autoassociative neural network, with hyperbolic tangent activation functions for the mapping and demapping layers. Weight optimization was done with conjugate gradient descent and weight decay.

The results are shown in Figures 6 and 7. Figure 6 shows that none of the techniques are successful at detecting the fault based purely on score distances, with PCA, NLPCA, and RF missing alarm rates (based on score distances) of 91.3%, 80.9%, and 96.7%, respectively. As with the previous case study, it is noted that the RF scores for fault data are again constrained to a region within the normal operating scores. The constrained response artifact is again at play.
Figure 6. Linear PCA, nonlinear PCA, and RF scores and residual distances for flotation images data. Confidence limits are based on 99th percentiles of appropriate statistics.
The fault conditions showed up strongly in the residual space of the random forest models, as indicated by the missing alarm rate (based on residual distances) of 18.6%. Neither PCA nor NLPCA can successfully detect the fault condition with residual distances, with relevant missing alarm rates of 94% and 82.6%, respectively.

To identify the potential cause of the fault, contribution plots can be investigated. The contribution plots based on residual distances for the three approaches are given in Figure 7. The red, green, blue, and gray color values that were decorrelated are represented by variables 11, 12, 13, and 14, respectively. PCA identifies three process variables (11, 12, and 14) correctly; however, it must be remembered that these contributions are based on only 6% of the fault condition samples, as the relevant PCA missing alarm rate was 94%. NLPCA identifies the four process variables (11, 12, 13, and 14) correctly, based on 17.4% of all samples. RF identifies six process variables, including the four correct variables (11, 12, 13, and 14) and two incorrect variables (15 and 18), based on 81.4% of all samples. The incorrectly identified variables have lower contribution magnitudes than the correctly identified variables. Overall, all three methods can thus be said to have satisfactorily identified the relevant fault process variables.
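For reference, the linear PCA baseline of section 5.1 (eqs 9-11), together with a residual (Q) distance and a 99th-percentile limit, can be sketched as follows; the arrays are illustrative stand-ins rather than the flotation data, and in the study the score-space limit additionally uses a one-class SVM rather than a simple percentile.

```python
import numpy as np

rng = np.random.default_rng(0)
X_noc = rng.normal(size=(183, 30))          # stand-in for the NOC image features
X_fault = rng.normal(size=(183, 30)) + 1.0  # stand-in for the simulated fault set

# Scale with NOC statistics, then apply eqs 9-11
mu, sd = X_noc.mean(axis=0), X_noc.std(axis=0)
Z = (X_noc - mu) / sd
Sigma = Z.T @ Z / (Z.shape[0] - 1)          # eq 9: covariance matrix
lam, V = np.linalg.eigh(Sigma)              # eq 10: eigendecomposition
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
a = 2                                       # retained components (a = 2 in the study)
P = V[:, :a]                                # loading matrix

def q_residual(X):
    Zn = (X - mu) / sd
    T = Zn @ P                              # eq 11: principal component scores
    E = Zn - T @ P.T                        # reconstruction error in the residual subspace
    return np.sum(E ** 2, axis=1)

r_alpha = np.percentile(q_residual(X_noc), 99)      # 99% residual limit on NOC data
delta_r = np.mean(q_residual(X_fault) <= r_alpha)   # missing alarm rate on the fault set
print(f"missing alarm rate (residual): {delta_r:.2f}")
```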
Figure 7. Residual contributions for flotation images fault data, with residual contributions from 99th-percentile normal operating conditions indicated with horizontal black lines.

6. Case Study 3: Tennessee Eastman Process

The Tennessee Eastman process is a simulated chemical process developed to provide a realistic benchmark on which to evaluate the performance of process monitoring methods.18 The Eastman Chemical Company created a simulation of an actual chemical process with five major units and eight components.19 Simulation data of the Tennessee Eastman process with plantwide control based on proportional and proportional-integral control are available for normal operating conditions and 20 fault conditions (http://brahms.scs.uiuc.edu). The flow sheet for the process is given in Figure 8.

Gaseous reactants A, C, D, and E, with inert B, are fed to a water-cooled reactor, where liquid products G and H and liquid byproduct F are formed. The reactor product is cooled in a condenser and then separated in a vapor-liquid separator. The vapor phase from the separator is recycled through a compressor to the reactor, with a portion purged to avoid accumulation of inert B and byproduct F in the system. The liquid from the separator is pumped to a stripper to remove remaining reactants in the stream for recycle. Liquid products G and H report to the product stream. Process data consist of 41 measured variables (process and composition measurements) and 11 manipulated variables.
Figure 8. Process flow diagram for the Tennessee Eastman process.18
Table 1. Process Variables of the Tennessee Eastman Process(a)

variable  description                                       variable  description
1         A feed, stream 1 (PM)                             27        reactor feed component E (CM)
2         D feed, stream 2 (PM)                             28        reactor feed component F (CM)
3         E feed, stream 3 (PM)                             29        purge component A (CM)
4         total feed, stream 4 (PM)                         30        purge component B (CM)
5         recycle flow (PM)                                 31        purge component C (CM)
6         reactor feed rate (PM)                            32        purge component D (CM)
7         reactor pressure (PM)                             33        purge component E (CM)
8         reactor level (PM)                                34        purge component F (CM)
9         reactor temperature (PM)                          35        purge component G (CM)
10        purge rate (PM)                                   36        purge component H (CM)
11        separator temperature (PM)                        37        product component D (CM)
12        separator level (PM)                              38        product component E (CM)
13        separator pressure (PM)                           39        product component F (CM)
14        separator underflow (PM)                          40        product component G (CM)
15        stripper level (PM)                               41        product component H (CM)
16        stripper pressure (PM)                            42        D feed flow, stream 2 (MV)
17        stripper underflow (PM)                           43        E feed flow, stream 3 (MV)
18        stripper temperature (PM)                         44        A feed flow, stream 1 (MV)
19        stripper steam flow (PM)                          45        total feed flow, stream 4 (MV)
20        compressor work                                   46        compressor recycle valve (MV)
21        reactor cooling water outlet temperature (PM)     47        purge valve (MV)
22        separator cooling water outlet temperature (PM)   48        separator product liquid flow (MV)
23        reactor feed component A (CM)                     49        stripper product liquid flow (MV)
24        reactor feed component B (CM)                     50        stripper steam valve (MV)
25        reactor feed component C (CM)                     51        reactor cooling water flow (MV)
26        reactor feed component D (CM)                     52        condenser cooling water flow (MV)

(a) MV = manipulated variable, PM = process measurement, CM = composition measurement.
Table 2. Process Faults of the Tennessee Eastman Process

fault   description                                                 type
1       A/C feed ratio (B composition constant), stream 4           step change
2       B composition (A/C feed ratio constant), stream 4           step change
3       D feed temperature, stream 2                                step change
4       reactor cooling water inlet temperature                     step change
5       condenser cooling water inlet temperature, stream 2         step change
6       A feed loss, stream 1                                       step change
7       C header pressure loss (reduced availability), stream 4     step change
8       A-C feed composition, stream 4                              random variation
9       D feed temperature, stream 2                                random variation
10      C feed temperature, stream 4                                random variation
11      reactor cooling water inlet temperature                     random variation
12      condenser cooling water inlet temperature                   random variation
13      reaction kinetics                                           slow drift
14      reactor cooling water valve                                 sticking
15      condenser cooling water valve                               sticking
16-20   unknown                                                     unknown
Normal operating conditions are represented by 500 samples, with a further 960 samples available as validation data for normal operating conditions. Each of the 20 fault data sets consists of 960 samples, with the fault occurring after 161 time steps within these 960 samples. The input variables and faults are summarized in Tables 1 and 2.

Table 3 presents results for fault diagnosis on Tennessee Eastman process data using various feature extraction methods, namely, kernel principal component analysis (KPCA, from Zhang20), kernel independent component analysis (KICA, from Zhang20), PCA, and RF. The confidence limits for the score and residual distances were defined to give a 1% false alarm rate on the normal operating condition data used for validation. The RF parameters were K = 1000 and M = 7, with the average proximities of five forests used as input to multidimensional scaling. For the RF models, the squared correlation coefficient of the original NOC and reconstructed NOC variables
was 0.9121 for three model components and 0.9748 for 30 model components. For the PCA model, the first 30 principal components accounted for 89% of the variance of the NOC data.

From the results summarized in Table 3, KICA appears to be the most successful approach, based on missing alarm rates, with performance on this measure equal to or better than that of the other techniques for 15 faults. All techniques show better detection based on residual distances than on score distances. Even though RF shows better or equal detection performance for only 6 of the 20 faults, the margin by which the other techniques are better than RF is less than 10% for 12 of the 20 faults. RF can thus be considered comparable to the existing feature extraction techniques for this case study. In addition, no significant change in missing alarm rates is apparent whether 3 or 30 model components are extracted using RF models.

7. Practical Considerations

7.1. Using Large Data Sets for Model Construction. The RF fault diagnostic method as outlined in this work requires the calculation of a number of n × n proximity matrices during offline training, where n is the number of samples of NOC data available for training. This order-n^2 computational expense can be limiting when large training data sets are considered. A possible strategy for decreasing the offline computational load is subsampling of the training data. Online application of the RF method is not hindered by this order-n^2 expense, as only RF regression models are employed online, and RF predictions for unseen data are computationally inexpensive.

7.2. Selection of Number of Features. Although it is clearly advantageous to select as small a number of features for the fault diagnostic model as possible, as this reduces computational expense, guidelines for doing so are a challenge with no general solution. Obtaining RF proximities is a nonlinear transformation, whereas obtaining features from these proximities with classical multidimensional scaling is a linear transformation based on eigendecomposition similar to that used in
PCA. As a consequence, an approach similar to that used to select the number of components in PCA can also be used in the selection of MDS features. Parallel analysis18 is one such approach, where the eigenvalue profile of a data set is compared to a profile that assumes independent observation variables. The premise is that a comparison of these profiles will reveal how many features significantly capture the original proximities, as opposed to capturing random variations in the data.

Parallel analysis can be used directly in the RF fault diagnostic approach, based on the eigenvalue profiles associated with multidimensional scaling. In this case, two sets of eigenvalues are generated, namely, a structured set based on the RF proximities of the original data and an unstructured set based on random forest proximities of a randomly permuted version of the original data. By plotting both the structured and unstructured eigenvalues, the reduced feature space dimension can be approximated by the component index where the structured eigenvalues cross the unstructured eigenvalues. For the Tennessee Eastman problem, crossing occurs at around seven random forest features, even though this accounts for less than 8% of the cumulative variance of the NOC proximities, as indicated in Figure 9. Seven components were therefore used in the RF algorithm.

Figure 9. RF parallel analysis for NOC data (Tennessee Eastman process).

To investigate the effect of the number of features on the performance of the RF fault diagnostic model, average missing and false alarm rates for a range of model components were calculated for the Tennessee Eastman process. The results are presented in Figures 10 and 11. The false alarm rate ζ in Figure 11 is defined as the fraction of NOC samples that are wrongly flagged as fault conditions.

Figure 10. Average RF missing alarm rates for Tennessee Eastman process data. Averages were calculated from all fault conditions at each number of model components.

Figure 11. Average RF false alarm rates for Tennessee Eastman process data. Averages were calculated from the first 160 samples of all fault conditions at each number of model components.
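The parallel analysis described above can be sketched as follows, again as an illustration rather than the authors' code: the unsupervised RF proximities (section 2.3) are computed once for the original data and once for a column-permuted copy, both eigenvalue profiles are obtained from classical MDS, and the retained dimension is taken as the first index at which the structured profile drops below the unstructured one. X_noc is a toy stand-in for the scaled NOC matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_mds_eigenvalues(X, n_trees=500, seed=0):
    """Eigenvalue profile of classical MDS applied to unsupervised RF dissimilarities
    (synthetic contrast class drawn from the product of marginal distributions)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    X_synth = np.column_stack([rng.choice(X[:, j], size=n) for j in range(m)])
    Z = np.vstack([X, X_synth])
    y = np.r_[np.ones(n), np.zeros(n)]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(Z, y)
    leaves = forest.apply(X)                                       # original samples only
    S = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)  # proximities
    D = 1.0 - S
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    return np.clip(np.sort(np.linalg.eigvalsh(B))[::-1], 0.0, None)

X_noc = np.random.default_rng(1).normal(size=(100, 10))            # toy NOC data
X_perm = np.column_stack([np.random.default_rng(2).permutation(c) for c in X_noc.T])

ev_structured = rf_mds_eigenvalues(X_noc)
ev_unstructured = rf_mds_eigenvalues(X_perm)

crossings = np.nonzero(ev_structured < ev_unstructured)[0]
a = int(crossings[0]) if crossings.size else len(ev_structured)    # retained RF features
print("retained RF features:", a)
```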
Table 3. Missing Alarm Rates for the Tennessee Eastman Process, Based on 99% Control Limit(a)

        KPCA         KICA         PCA          RF(3)        RF(30)
fault   δs    δr     δs    δr     δs    δr     δs    δr     δs    δr     margin
1       0     0      0     0      0.01  0      0.99  0.01   0.98  0.01   0.01
2       0.02  0.02   0.02  0.02   0.02  0.01   0.98  0.02   0.98  0.02   0.01
3       0.96  0.95   0.99  0.97   0.99  0.99   0.99  0.99   0.99  0.99   0.04
4       0.91  0      0.19  0      0.70  0.02   1     0.39   1     0.32   0.32
5       0.73  0.75   0.75  0.72   0.75  0.77   0.99  0.76   0.99  0.76   0.04
6       0.01  0      0     0      0.01  0      1     0      1     0
7       0     0      0     0      0     0      0.96  0      0.96  0
8       0.03  0.04   0.03  0.02   0.03  0.06   0.94  0.02   0.97  0.03
9       0.96  0.96   0.99  0.97   0.99  0.98   1     1      1     1      0.03
10      0.57  0.49   0.19  0.22   0.61  0.62   0.97  0.57   0.99  0.54   0.35
11      0.76  0.19   0.42  0.23   0.53  0.49   0.98  0.46   1     0.45   0.26
12      0.02  0.03   0.01  0.01   0.02  0.05   0.97  0.01   0.94  0.01
13      0.06  0.05   0.05  0.05   0.05  0.05   0.94  0.05   0.95  0.05
14      0.21  0      0     0      0     0      1     0      1     0
15      0.92  0.94   0.97  0.95   0.98  0.98   0.98  0.97   1     0.94   0.02
16      0.70  0.48   0.23  0.13   0.81  0.66   1     0.77   1     0.72   0.59
17      0.26  0.05   0.09  0.03   0.22  0.06   1     0.14   1     0.13   0.10
18      0.10  0.10   0.11  0.09   0.11  0.10   1     0.11   1     0.11   0.02
19      0.97  0.51   0.30  0.15   0.95  0.91   1     0.99   1     0.99   0.84
20      0.59  0.48   0.50  0.35   0.68  0.48   0.99  0.55   1     0.54   0.19

(a) KPCA and KICA results from Zhang20 using 11 model components; PCA based on 30 components; and RF based on 3 and 30 components, as indicated. Boldface highlights which algorithms performed best (had the lowest missing alarm rates) for each fault.
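The detection metrics reported in Table 3 and in Figures 10 and 11 reduce to simple counts against the thresholds; a minimal illustration with synthetic residual distances (not the process data) follows.

```python
import numpy as np

def missing_alarm_rate(fault_distances, threshold):
    """delta: fraction of known fault samples that are not flagged."""
    return float(np.mean(fault_distances <= threshold))

def false_alarm_rate(noc_distances, threshold):
    """zeta: fraction of NOC samples that are wrongly flagged."""
    return float(np.mean(noc_distances > threshold))

rng = np.random.default_rng(0)
r_noc = rng.chisquare(df=5, size=960)            # illustrative NOC residual distances
r_fault = rng.chisquare(df=5, size=800) + 4.0    # illustrative faulty residual distances
r_alpha = np.percentile(r_noc, 99)               # 99% control limit, as in Table 3

print(f"delta_r = {missing_alarm_rate(r_fault, r_alpha):.2f}, "
      f"zeta = {false_alarm_rate(r_noc, r_alpha):.2f}")
```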
From Figure 10, it is clear that, for the Tennessee Eastman problem, the missing alarm rates are not sensitive to the number of model components at all and that the use of a single component in the model would be adequate. These results are in line with the data in Table 3, where the performances of the RF models based on the use of 3 or 30 components were very similar. The apparently negligible effect that the number of components included in the model has on its performance is a result of the same constrained response observed in the first case study. The random forest model is unable to extrapolate beyond the NOC data, whether in one or multiple dimensions, and hence faults cannot be detected in the feature space, as indicated by the δ_s curve close to unity (100% failure) in Figure 10. For the same reason, reconstruction of the data is also not affected by the number of components used, because data lying outside the NOC region lie outside it almost regardless of the dimensionality of this region.

Figure 11 shows false alarm rates for the RF model as a function of the number of components in the Tennessee Eastman problem, and in this case, the retention of seven components suggested by the parallel analysis appears to be very reasonable, as this corresponds closely with the minimum average false alarm rate.

7.3. Selection of RF Parameters. To determine the sensitivity of the RF fault diagnostic results to the random forest parameters, the flotation data set from case study 2 was revisited. The missing alarm rates were recorded as the RF parameters K and M were adjusted, with the results given in Figures 12 and 13. From these figures, it is evident that the RF diagnostics are not overly sensitive to the choice of RF parameters. The only notable trend is the increase in δ_r as the number of random split variables increases in the unsupervised forests.

Figure 12. Missing alarm rates for simulated fault conditions in flotation structures for varying numbers of random split variables, as applied to the unsupervised and regression RF models.

Figure 13. Missing alarm rates for simulated fault conditions in flotation structures for varying numbers of RF ensemble members, as applied to the unsupervised and regression RF models.
8. Discussion and Conclusions

In this study, a novel fault diagnostic strategy using random forest models was developed and investigated by way of three case studies. In the first case study, with synthetic data, the constrained response characteristic of random forest mapping, as evident from the RF scores, was highlighted. Decision trees can give only prediction responses that lie within the range of the training set responses, and this characteristic is inherited by random forest regression models. Random forest demapping can exploit this artifact, as fault scores will be reconstructed erroneously in the original variable space, resulting in the construction of a sensitive diagnostic for fault detection in the residual subspace.

In the second case study, a fault condition was simulated to represent a change in the froth structures of platinum flotation plants. The random forest approach proved markedly more successful at detecting the fault condition than linear or nonlinear principal component analysis. The success of the random forest method could be attributed to its strongly discriminatory residual distance diagnostics. Fault identification using contribution plots could be done satisfactorily for all three methods (random forests, linear PCA, and nonlinear PCA).

The third case study involved the benchmark Tennessee Eastman process, with random forest results compared to those of PCA, KPCA, and KICA. Even though KICA was the most successful (based on nominal missing alarm rates), the margin of improvement over the novel random forest approach was less than 10% for 12 of the 20 faults. RF can thus be considered comparable to the existing feature extraction techniques for this case study.

Detection of the faults in the residual space proved to be more successful than score distance detection for all the methods investigated. This would generally be the case where the training data of the random forest model can be enclosed by a single convex boundary, because the random forest model would then
be unable to project new (possibly faulty) data beyond the boundaries of these training data. However, when the training (NOC) data of the model are more complex, fault detection in the feature space would become important, for example, when the NOC data are clustered, as could be the case in processes moving from one discrete state to another, or when the data cannot be enclosed by a single convex boundary.

In conclusion, fault diagnosis based on the use of random forests appears to be a promising approach to process monitoring. The nonlinear nature of random forest features, the constrained response characteristic and the resulting residual distance detection ability, as well as the ability to identify faults, indicate potential for further application to chemical engineering systems.
Acknowledgment

The authors would like to acknowledge the Wilhelm Frank Scholarship of the University of Stellenbosch for funding in part the work of first author L.A.

Nomenclature

A = Tennessee Eastman reactant
B = Tennessee Eastman inert
C = Tennessee Eastman reactant
C_r = residual-based contribution
D = Tennessee Eastman reactant
D = dissimilarity matrix
e = independent Gaussian noise
E = residual matrix
E = Tennessee Eastman reactant
F = Tennessee Eastman byproduct
F = high-dimensional space
G = Tennessee Eastman product
h = classification prediction
H = Tennessee Eastman product
j = number of classes in response Y
k = ensemble member identifier
K = number of members in ensemble
m = number of process variables
M = number of random split variables
n = number of samples
p = proportion of samples
P = principal-component loading matrix
Q = variable residual
r = residual distance
r_α = residual distance threshold
s = score distance
S = similarity matrix
s_α = score-distance threshold
s_H = Hotelling's score distance
T = feature or score matrix
T2 = Hotelling's statistic
U = eigenvectors
w = separating hyperplane
x = data vector
X = process data/input for training algorithm
X̂ = reconstructed process data
Y = response of X
Z = concatenation of X and permuted X
a = reduced dimensionality/number of model components
G = forward mapping function
H = reverse mapping function
u = tree indicator

Greek Letters
α = significance level
δ = missing alarm rate
ζ = false alarm rate
η = node identifier
θ = random vector
Λ = eigenvalues
ν = decision surface complexity
ξ = margin error
ρ = threshold
ς = candidate split position
Σ = sample covariance matrix
φ = scaling cost function
φ = nonlinear mapping

Literature Cited

(1) MacGregor, J. F.; Kourti, T. Statistical process control of multivariate processes. Control Eng. Pract. 1995, 3, 403-414.
(2) Wold, S. Exponentially weighted moving principal components analysis and projections to latent structures. Chemom. Intell. Lab. Syst. 1994, 23, 149-161.
(3) Ku, W.; Storer, R. H.; Georgakis, C. Disturbance detection and isolation by dynamic principal component analysis. Chemom. Intell. Lab. Syst. 1995, 30, 179-196.
(4) Lee, C.; Choi, S. W.; Lee, I. Sensor fault identification based on time-lagged PCA in dynamic processes. Chemom. Intell. Lab. Syst. 2004, 70, 165-178.
(5) Yang, B.; Di, X.; Han, T. Random forests classifier for machine fault diagnosis. J. Mech. Sci. Technol. 2008, 22, 1716-1725.
(6) Maragoudakis, M.; Loukis, E.; Pantelides, P. Random Forests Identification of Gas Turbine Faults. In Proceedings of the 19th International Conference on Systems Engineering (ICSENG 2008); IEEE Press: Piscataway, NJ, 2008; pp 127-132.
(7) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.
(8) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees; Chapman & Hall: London, 1983.
(9) Breiman, L.; Cutler, A. Manual: Setting Up, Using, and Understanding Random Forests V4.0; Salford Systems: San Diego, CA, 2008; available at ftp://ftp.stat.berkeley.edu/pub/users/breiman/Using_random_forests_v4.0.pdf (accessed May 30, 2008).
(10) Cox, T. F.; Cox, M. A. A. Multidimensional Scaling; Monographs on Statistics and Applied Probability, Vol. 88; Chapman & Hall: London, 2001.
(11) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer Series in Statistics; Springer: New York, 2009.
(12) Shi, T.; Horvath, S. Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 2006, 15, 118-138.
(13) Jemwa, G. T.; Aldrich, C. Kernel-based fault diagnosis on mineral processing plants. Miner. Eng. 2006, 19, 1149-1162.
(14) Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Machine Classification; Technical Report; Department of Computer Science, National Taiwan University: Taipei, Taiwan, 2003; available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed March 30, 2009).
(15) Cayton, L. Algorithms for Manifold Learning; Technical Report CS2008-0923; Department of Computer Science, University of California at San Diego: San Diego, CA, 2005; available at http://people.kyb.tuebingen.mpg.de/lcayton/resexam.pdf (accessed October 21, 2009).
(16) Carreira-Perpinan, M. A. A Review of Dimension Reduction Techniques; Technical Report CS-96-09; University of Sheffield: Sheffield, U.K., 1997.
(17) Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233-243.
(18) Russell, E. L.; Chiang, L. H.; Braatz, R. D. Data-Driven Methods for Fault Detection and Diagnosis in Chemical Processes; Advances in Industrial Control; Grimble, M. J., Series Ed.; Springer: New York, 2000.
(19) Downs, J. J.; Vogel, E. F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245-255.
(20) Zhang, Y.; Qin, S. J. Improved nonlinear fault detection technique and statistical analysis. AIChE J. 2008, 54, 3207-3220.

Received for review December 12, 2009. Revised manuscript received May 24, 2010. Accepted June 28, 2010.

IE901975C