Adaptive k-Nearest-Neighbor Method for Process Monitoring

Wenbo Zhu,† Wei Sun,‡ and José Romagnoli∗,†

†Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, United States
‡College of Chemical Engineering, Beijing University of Chemical Technology, Beijing 100029, P.R. China

E-mail: [email protected]

Abstract

In this paper, an adaptive process monitoring method based on the k-nearest-neighbor (k-NN) rule is proposed to address the issues arising from nonlinearity, insufficient training data, and time-varying behaviors. Instead of recursively updating the model with every measurement, a distance-based updating rule is applied to search for target prototypes, thus reducing the computational load of online implementation. Furthermore, for fault identification, a subspace greedy search is introduced to complete the monitoring system. The approach searches for the combination of variables (subspace) that contributes most to the discrimination between normal and faulty data (the explanatory subspace). The Tennessee Eastman process (TEP) and data from an industrial pyrolysis reactor are used to evaluate the performance of the proposed approach against conventional methods.
Introduction

In today's chemical industry, many process measurements are recorded from various sensors in chemical plants. The accessibility of process data has driven the development and improvement of process monitoring in recent years. Researchers have applied different models for various objectives, including process visualization,1,2 fault detection,3,4 and measurement estimation.5,6 Compared with traditional model-based approaches, data-driven models are effective across multiple types of processes. Among these models, principal component analysis (PCA) based methods have attracted considerable interest in the fault detection literature.7–11 Given normal operating data, PCA captures the directions of largest variance for process monitoring. Hotelling T² and squared prediction error (SPE) statistics are then calculated to draw the control limits of the normal region, and any incoming sample exceeding the control limits is considered a fault. Nevertheless, the PCA model has a few major limitations. First, PCA is a linear dimensionality reduction technique, so it usually fails to predict proper control limits for a nonlinear data set. In addition, the T² thresholds depend on the given training data; if the training set is insufficient to describe the entire normal region, a high false alarm rate can result. The other drawback is that a fixed PCA model is unable to track normal process drift, while most industrial processes are time-varying due to catalyst deactivation, equipment aging, and tube coking.12

To address these issues, adaptation has been introduced into the fixed PCA model, i.e., recursive PCA (RPCA)13,14 and moving window PCA (MWPCA).15,16 Although such approaches can reduce the number of false alarms due to normal variations in the process, they still exhibit limitations in real processes. Both approaches are based on the linear PCA model, which is not robust in nonlinear cases. Besides, in the MWPCA approach, old samples are always replaced by new samples, which can mislead the model if many repetitive samples are updated (e.g., a steady-state region before a drift happens).

The k-nearest-neighbor algorithm (k-NN) is a broadly used method to classify an unknown object.17 An unlabeled object is assigned to the class with which it shares the highest similarity.
The Euclidean distance is commonly chosen as the similarity measure. The k-NN method has been applied in different areas, including pattern recognition, data mining, and outlier detection. However, the k-NN rule has several drawbacks that limit its application in online process monitoring. The computational complexity of the k-NN is O(dn²), where d is the dimension of the data and n is the number of training samples. Although the complexity can be reduced to O(d log n) by alternative data structures,18,19 it is still infeasible for online implementation with a large training set. In addition, like many machine learning methods, the conventional k-NN method is based on the closed-world assumption.20 In other words, there is a risk that an unknown fault is mistakenly recognized as a known type of fault or operating mode from the training set, which is undesirable when monitoring an industrial process. Few works have applied the k-NN to process monitoring and fault detection. He et al.21,22 developed a k-NN-based method for fault detection in semiconductor manufacturing processes, in which only normal data are included in the model. The fault detection rule follows the idea that a faulty sample should be distant from normal samples: the squared nearest-neighbor distance threshold bounding the normal region is obtained from a noncentral chi-square distribution. However, He's approach does not offer any adaptation strategy. If a proper adaptation rule can be determined such that only the necessary samples are updated, then tracking a time-varying process becomes possible.

In this paper, an adaptive fault detection approach based on the k-nearest-neighbor rule (Ak-NN) is proposed to address the issues arising from nonlinearity, insufficient training data, computational complexity, and time-varying behaviors. Instead of recursively updating every measurement for adaptation, a distance-based updating rule is applied to locate proper prototypes, thus reducing the computational load of online implementation. Furthermore, a fault identification method based on greedy search is also introduced to complete the monitoring system. The approach searches for the combination of variables (subspace) that contributes most to the discrimination between normal and faulty data (the explanatory subspace). The Tennessee Eastman process (TEP) and data from an industrial pyrolysis furnace are used as case studies to illustrate
the feasibility of the proposed approach. A comparison with conventional PCA and RPCA is also provided.
PROPOSED PROCESS MONITORING METHOD

In this section, the basic idea of the k-nearest-neighbor method is introduced. After that, the proposed approach for process monitoring and the methods used are discussed. For each fault record, contributing variables are identified by subspace greedy search using the concept of the explanatory subspace. Together, these parts compose a complete fault detection and diagnosis system.
k-Nearest Neighbors (k-NN) and Distance-Based Outliers

The k-nearest-neighbor method (k-NN) was first introduced by Hart23 for pattern classification during the 1960s. Due to its simplicity and flexibility, it was ranked as one of the top 10 algorithms in data mining24 and is widely used in pattern recognition, data mining, and outlier detection. An unlabeled object is assigned to the closest training class, where the Euclidean distance is typically used to measure the degree of similarity. Two voting schemes are used to classify the unlabeled object. In majority voting, every neighbor of an undetermined object has the same impact, and the object is assigned to the class that appears most frequently among its k nearest neighbors. Distance-weighted voting weights each neighbor's impact according to its distance, with the weight defined as w = 1/d². This approach reduces the sensitivity to the choice of the parameter k.

Based on the k-nearest-neighbor rule, Knorr et al.25 proposed a distance-based outlier detection method: an object O in a data set T is a DB(p, D) outlier if at least a fraction p of the objects in T lies at a distance greater than D from O. In other words, a large k-nearest-neighbor distance indicates the presence of an outlier. Hawkins gave a formal definition of an outlier as "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." Outlier detection has broad application in many areas, such as intrusion detection systems and credit card fraud detection.26,27 In process fault detection for a chemical plant, most abnormal conditions deviate noticeably from the normal condition. Thus, outlier detection theory extends naturally to fault detection on process data.
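To make these rules concrete, the following is a minimal sketch (not the authors' code; the function names and toy data are illustrative) of distance-weighted k-NN voting with w = 1/d² and of the k-dist outlier test that underlies the detection rule used later in the paper:

```python
import numpy as np

def knn_distance(x, train, k):
    """kth-nearest-neighbor (Euclidean) distance from x to a training set."""
    dists = np.linalg.norm(train - x, axis=1)
    return np.sort(dists)[k - 1]

def weighted_vote(x, train, labels, k):
    """Distance-weighted k-NN classification with weights w = 1/d^2."""
    dists = np.linalg.norm(train - x, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] ** 2 + 1e-12)  # guard against zero distance
    classes = np.unique(labels[idx])
    scores = [weights[labels[idx] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

def is_distance_based_outlier(x, train, k, d_alpha):
    """Flag x as an outlier when its k-dist to the (normal) data exceeds d_alpha."""
    return knn_distance(x, train, k) > d_alpha

# Toy usage: normal data around the origin, one distant query point
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))
print(is_distance_based_outlier(np.array([5.0, 5.0]), normal, k=5, d_alpha=2.5))
```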
Adaptive k-NN (Ak-NN)

In this section, we introduce a distance-based method for process monitoring and fault detection that is designed to adaptively track time-varying process behaviors and heuristically extend the continuous normal operating region. Intuitively, the common philosophy for determining a fault is that faulty data should exhibit little similarity with normal data. In terms of the k-NN rule, a fault can be detected if its k-nearest-neighbor distance (k-dist) is greater than that of normal samples. In our approach, only normal samples are used to initialize the model, from which a threshold k-nearest-neighbor distance dα can be determined for fault detection. After dα is determined from the training data, the model should acquire the latest information from the process in order to monitor time-varying behaviors. A recursive updating approach that adds every new sample within the threshold, as in RPCA,13 would enable the model to learn about the process adaptively. However, the computational complexity of the k-NN rule makes this infeasible for online implementation. In addition, for a chemical process with slow time-varying behaviors, much of the updating is unnecessary because consecutively incoming data often contain repetitive information. Ideal updating samples are prototypes that have relatively high similarity with the normal data yet are also able to provide novel information to the model. Updating the model with only these data ensures a low computational cost while keeping the prediction ability. In the following, such prototypes are formalized via the k-NN rule.

Definition 1 (Nearest-neighbor number of a point). For a data point x, the number of its neighbors in a data set D located within the threshold distance dα, denoted Nx, is defined as

$$N_x = |\{q \in D \mid \mathrm{dist}(x, q) \le d_\alpha\}|$$

The threshold dα itself is estimated from the k-dist curve of the training data, i.e., the sorted k-nearest-neighbor distances f(x), by locating a "valley" where the curve begins to rise steeply; the slope is evaluated with a step size s (s > 1). Since the "valley" points can be non-unique, in order to locate the last "valley" on the curve, the search domain is constrained to [a, b], the ending region of the curve, where a ∈ (0, b). The "valley" point is then defined as

$$x = \min\left\{ x \in [a, b] \;\middle|\; \frac{f(x + s/2) - f(x - s/2)}{s} \ge \frac{f(b) - f(0)}{b} \right\}$$
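A minimal sketch of this threshold search on a discrete k-dist curve follows (illustrative only; the helper names and the particular choices of a, b, and s are our assumptions):

```python
import numpy as np

def k_dist_curve(train, k):
    """Sorted kth-nearest-neighbor distances of the training points (the k-dist curve f)."""
    diff = train[:, None, :] - train[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    kth = np.sort(dists, axis=1)[:, k]  # column 0 is each point's zero self-distance
    return np.sort(kth)

def valley_threshold(f, a, b, s):
    """Last-'valley' rule: first index x in [a, b] whose local slope over a window
    of width s reaches the curve's average slope; returns f(x) as d_alpha."""
    avg_slope = (f[b] - f[0]) / b
    for x in range(a, b + 1):
        lo, hi = max(x - s // 2, 0), min(x + s // 2, len(f) - 1)
        if (f[hi] - f[lo]) / s >= avg_slope:
            return f[x]
    return f[b]

# Example: search the last 20% of the curve with a 5-point window
rng = np.random.default_rng(1)
f = k_dist_curve(rng.normal(size=(300, 4)), k=10)
d_alpha = valley_threshold(f, a=int(0.8 * len(f)), b=len(f) - 1, s=5)
```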
Figure 3: Threshold comparison for different data distributions via different methods: (a) sample data set with three different distributions; (b) conventional PCA T² threshold; (c) conventional k-NN decision boundary; (d) k-NN method bounded by the distance threshold.

Figure 3 illustrates the threshold comparison between the proposed method and conventional approaches. While PCA T² draws a good threshold for the linear data set, it overgeneralizes the threshold for the curved data set. The conventional k-NN, which focuses on boundary determination among classes, is not suitable for fault detection. The proposed Ak-NN method bounds the training data well in both the linear and nonlinear cases with the threshold obtained from the k-dist curve; by this comparison, the threshold determination appears reasonable and practical. Moreover, once the threshold distance is estimated from the k-dist curve, the training set can be condensed by applying the condensed nearest neighbor rule (cNN).31,32 Data condensation accelerates online computation, which is important for any method based on the k-NN rule.
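The paper does not show the condensation step itself; as an illustration only, the sketch below adapts the condensation idea to normal-only data by discarding points already covered within dα by a retained point (this rule and the names are our assumptions, not the authors' exact cNN procedure):

```python
import numpy as np

def condense(train, d_alpha):
    """Greedy one-class condensation: keep a training point only if no
    already-retained point lies within d_alpha of it."""
    kept = [train[0]]
    for x in train[1:]:
        if np.linalg.norm(np.array(kept) - x, axis=1).min() > d_alpha:
            kept.append(x)  # x is not yet covered; retain it as a prototype
    return np.array(kept)

# Usage: the condensed set replaces the full training set for online k-dist queries
```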
Updating Factor

Another important parameter in the proposed method is the updating factor p, which governs the degree of process variation the model accommodates. The setting of the p value depends on the process variation: a large p value corresponds to a process with intense varying behavior, while a small p value is used for slow drift in the process. As shown in Figure 4, a greater p value allows the model to extend its search space in order to collect more qualified prototypes, while a smaller p value tightens the search space. A hypothetical sketch of such an updating rule is given below.
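This excerpt does not state the exact prototype-qualification rule, so the sketch below is a guess for illustration only: a new sample is accepted when it is similar to the normal data (at least one neighbor within dα per Definition 1) yet still novel (no more than p·k such neighbors), which makes the accepted region grow with p; consult the full paper for the authors' precise rule.

```python
import numpy as np

def neighbor_count(x, train, d_alpha):
    """N_x from Definition 1: number of training points within d_alpha of x."""
    return int((np.linalg.norm(train - x, axis=1) <= d_alpha).sum())

def update_if_prototype(x, train, d_alpha, k, p):
    """Hypothetical updating rule: add x as a prototype when it is similar to
    the normal data (N_x >= 1) but still novel (N_x <= p * k)."""
    n_x = neighbor_count(x, train, d_alpha)
    if 1 <= n_x <= p * k:
        return np.vstack([train, x])  # accept x into the model
    return train                      # otherwise leave the model unchanged
```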
Figure 4: Updating region changed by p values (k = 10): (a) p = 0.25; (b) p = 0.50; (c) p = 0.75. The highlighted area is the updating region, which expands with an increasing p value.

In practice, an accurate value of the updating factor can be estimated from calibration data that contain a label for each sample. A reasonable p value can be determined by jointly considering different evaluation metrics. In this study, three common metrics, namely the missed detection rate (MDR), the false alarm rate (FAR), and the weighted average of MDR and FAR,
the misclassification rate (MR), are used to evaluate the p value. The expressions for the metrics are as follows:

$$\mathrm{MDR} = 100\% - \frac{|\{x \in \mathrm{Fault} : d(x) > d_\alpha\}|}{|\mathrm{Fault}|} \times 100\% \qquad (1)$$

$$\mathrm{FAR} = \frac{|\{x \in \mathrm{Normal} : d(x) > d_\alpha\}|}{|\mathrm{Normal}|} \times 100\% \qquad (2)$$

$$\mathrm{MR} = \frac{|\{x \in X : c(x) \ne y(x)\}|}{|X|} \times 100\% \qquad (3)$$

where c(x) is the target label and y(x) is the prediction. Although the p value can be optimized automatically by minimizing the MR of the calibration set, its effect on both MDR and FAR should still be taken into account in the final decision. Note that when there is a large imbalance between the amounts of faulty and normal data, the average of FAR and MDR can be used as the evaluation index instead of MR.
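A direct transcription of eqs 1–3 (the function and variable names are illustrative):

```python
import numpy as np

def monitoring_metrics(d, d_alpha, is_fault):
    """MDR, FAR, and MR per eqs 1-3, given k-dist values d, the threshold
    d_alpha, and boolean ground-truth fault labels is_fault."""
    alarm = d > d_alpha                           # predicted fault flags
    mdr = 100.0 * (1.0 - alarm[is_fault].mean())  # missed detection rate, eq 1
    far = 100.0 * alarm[~is_fault].mean()         # false alarm rate, eq 2
    mr = 100.0 * (alarm != is_fault).mean()       # misclassification rate, eq 3
    return mdr, far, mr
```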
Subspace Greedy Search

Once a fault is detected, fault identification is the natural next step to diagnose the detailed cause of the fault. In this section, fault identification based on the idea of the explanatory subspace is discussed. The explanatory subspace was introduced by Micenkova et al.33 to explain the cause of an outlier by searching for the most contributing combination of variables. Applying a similar idea, the fault identification problem can be recast as seeking the most contributing group of variables (subspace) that causes the discrimination between normal and faulty data. The most straightforward way to achieve this is to define a score function that measures disparity equally across dimensions; by comparing the scores, the cause of the faulty data can be identified. In reality, however, it is difficult to find a score function that measures disparity equally over multiple dimensions due to the curse of dimensionality. Another issue is computational complexity: for a measurement including d variables, the number of possible subspaces is 2^d, so it is infeasible to score all subspaces and pick
the one with the highest score by brute-force search. To address these issues, we apply a greedy algorithm that searches for the target subspace over multiple iterations, called subspace greedy search (SGS), making it suitable for online implementation. Instead of visiting all subspaces by brute force, the greedy algorithm reaches the target subspace by scoring only a few subspaces per iteration. The search starts from the lowest dimension and ascends to higher dimensions until no possible subspaces are left, with each iteration corresponding to one dimension. In each iteration, subspaces of the same dimension are created from combinations of the available variables, and the variables in highly scoring subspaces are kept for the next iteration. In this study, the score of a subspace is the k-dist between the fault and the normal samples in that subspace, which fairly compares the disparity among subspaces of the same dimension. A subspace with a high k-dist points to the variables that are critical in separating the faulty samples from the normal samples. Hence, following the greedy strategy, variables in the subspaces with high k-dist are kept for further testing and the rest are purged.

The detailed implementation is summarized in Algorithm 1. Candidate_list preserves the most contributing variables obtained in previous iterations, and the subspace list at each dimension is generated by combining the remaining variables with the candidate variables. The use of Candidate_list effectively reduces the number of combinations per iteration from C(m, d) to C(m−d+1, 1), where m is the number of variables left in that iteration. The score of each subspace is the k-dist between the faulty and normal data in that subspace. The subspace with the highest score is recorded in Local_best, and its variables are set as the candidates for the next iteration. Variables appearing in subspaces with above-average scores are kept for the next iteration, and the excluded variables are purged. The dimension is then increased by one. Once the dimension d exceeds the number of variables left, the algorithm ends, and the variables in Candidate_list are returned as the explanation of the disparity between normal and faulty data. The frequency of each variable in Local_best naturally ranks its contribution to the disparity.
Algorithm 1 Subspace greedy search
1: Candidate_list = Empty
2: Local_best = Empty
3: Variable_list = All variables
4: d = 1
5: while d ≤ |Variable_list| do
6:    score every d-dimensional subspace formed from Candidate_list plus one remaining variable (k-dist between faulty and normal data)
7:    record the best subspace in Local_best and set its variables as Candidate_list
8:    purge variables that appear only in below-average subspaces; d = d + 1
9: end while
10: return Candidate_list
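A compact Python sketch of SGS under the description above (illustrative; the helper names, the use of the mean k-dist as the subspace score, and the default k are our assumptions):

```python
import numpy as np

def k_dist(points, ref, k):
    """Mean kth-nearest-neighbor distance from each point in `points` to `ref`."""
    d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, k - 1].mean()

def subspace_greedy_search(normal, faulty, k=5):
    """Greedy search for the explanatory subspace separating faulty from normal data."""
    candidate, local_best = [], []
    variables = list(range(normal.shape[1]))
    d = 1
    while d <= len(variables):
        # Each d-dimensional subspace = current candidate + one remaining variable
        subspaces = [candidate + [v] for v in variables if v not in candidate]
        if not subspaces:
            break
        scores = [k_dist(faulty[:, s], normal[:, s], k) for s in subspaces]
        best = subspaces[int(np.argmax(scores))]
        local_best.append(best)
        candidate = best
        # Keep only variables that appear in subspaces scoring above the average
        keep = {v for s, sc in zip(subspaces, scores) if sc >= np.mean(scores) for v in s}
        variables = [v for v in variables if v in keep]
        d += 1
    return candidate, local_best
```

The returned candidate explains the fault, and counting how often each variable appears in `local_best` gives the contribution ranking described in the text.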