Process Systems Engineering
Robust Monitoring of Industrial Processes in the Presence of Outliers in Training Data Shiyi Bao, Lijia Luo, Jianfeng Mao, Di Tang, and Zhenyu Ding Ind. Eng. Chem. Res., Just Accepted Manuscript • DOI: 10.1021/acs.iecr.8b00464 • Publication Date (Web): 29 May 2018 Downloaded from http://pubs.acs.org on May 29, 2018
Industrial & Engineering Chemistry Research
Robust Monitoring of Industrial Processes in the Presence of Outliers in Training Data

Shiyi Bao, Lijia Luo∗, Jianfeng Mao, Di Tang, and Zhenyu Ding

Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310014, China

∗ Corresponding Author. Tel.: +86 (0571) 88320349. E-mail address: [email protected]

ABSTRACT: Data-driven process monitoring methods often use classical multivariate statistical analysis (MSA) techniques for data analysis and modeling, with the assumption that the training data contain only observations associated with normal operating conditions of a process. In practice, the training data of an industrial process may contain outliers. Classical MSA methods can be affected by outliers so strongly that the resulting monitoring methods fail to detect true process faults. In this paper, a robust process monitoring method is proposed to cope with the presence of outliers in the training data. First, the minimum covariance determinant (MCD) estimator is used to detect outliers and to obtain robust estimates of the location and scatter of the training data. A robust fault detection index, called the robust $T^2$ statistic, is then defined based on the MCD estimator. Squared variable deviation magnitude (SVDM) is defined as a robust measure of the deviation of a process variable from the normal range. An SVDM-based fault diagnosis method is proposed to identify faulty variables responsible for process faults. The performance of the proposed methods is evaluated through an industrial case study.

1. Introduction

Process monitoring is an effective means to guarantee safe operation and high quality production of industrial processes. The goal of process monitoring is to detect the occurrence of unusual events
(i.e., process faults) during process operations and further to diagnose the root causes of process faults. Process monitoring can be achieved using process mechanism models, process knowledge or process data. The most efficient way to implement process monitoring is to use process data, known as the data-driven process monitoring method, which avoids the need to understand the complicated physical and chemical mechanisms of industrial processes. Data-driven process monitoring methods have been extensively studied over the last few decades.1-3 These methods often apply multivariate statistical analysis (MSA) techniques to process data to extract information useful for fault detection and diagnosis. The commonly used MSA techniques include principal component analysis (PCA), discriminant analysis and multivariate regression. To build monitoring models and to determine control limits for fault detection indices, data-driven process monitoring methods always need a set of training data obtained when the process is operating in the normal region. The training data have a great effect on the process monitoring performance. In general, to get better monitoring performance, all training data should be associated with normal operating conditions of a process. However, in practice the training data set of an industrial process may contain outliers (i.e., data points deviating from the pattern represented by the majority of training data), such as faulty data, data collected during the shutdown and startup periods, and data obtained from different operating modes. Usual MSA techniques (e.g., PCA) are based on the empirical mean, covariance and correlation matrices, and least squares fitting. They can be adversely influenced by even a few outliers,4 and thus the resulting data-driven monitoring models may not represent the normal behaviors of industrial processes accurately.
In addition, outliers may spoil the fault detection indices (e.g., Hotelling's $T^2$ statistic) and their control limits, making
them unreliable. Therefore, traditional data-driven process monitoring methods may have poor fault detection and diagnosis performance when the training data set contains outliers. A possible way to cope with the presence of outliers in the training data set is to develop robust process monitoring methods based on robust MSA techniques that are not influenced much by outliers. Robust PCA (RPCA) is the most widely used robust MSA technique. Existing RPCA methods are mainly based on six different approaches:5,6 (i) replacing the classical covariance matrix by a robust covariance estimator, such as the M-estimators,7,8 minimum covariance determinant (MCD) estimator9 and S-estimators;10 (ii) projection pursuit (PP),11-14 which maximizes a robust measure of data variance (e.g., the median absolute deviation (MAD) or the more efficient Sn and Qn estimators15) to obtain consecutive projection directions; (iii) a combination of projection pursuit and robust covariance estimation;16 (iv) robust subspace estimation, which minimizes a robust scale (e.g., the M-estimator or least trimmed squares estimator9) of PCA residuals;17 (v) spherical and elliptical PCA;18 and (vi) principal component pursuit,19,20 which recovers a low-rank matrix from a high-dimensional data matrix corrupted by gross sparse errors (e.g., outliers). The applications of RPCA in the field of process monitoring have been reported in the literature.21-23 Chen et al.21 proposed to use an RPCA via projection pursuit in place of classical PCA for multivariate statistical process monitoring. Yvon et al.22 proposed fault detection and isolation methods on the basis of an RPCA model that is determined by a scale-M estimator.17 Yan et al.23 developed a robust multivariate statistical process monitoring method using the stable principal component pursuit approach proposed by Zhou et al.20 RPCA can be used to build robust monitoring models that are less influenced by outliers in the training data.
This, however, cannot eliminate the influence of outliers
on fault detection indices (especially their control limits) used together with RPCA models. Consequently, process monitoring performance can still be affected by outliers. Besides using robust MSA techniques, a straightforward way to avoid adverse effects of outliers is to remove outliers from the training data set before implementing process monitoring or to develop robust fault detection indices unaffected by outliers in the training data set. A number of multivariate outlier detection approaches have been proposed in the literature. The representative approaches include least trimmed squares,9 multivariate trimming,24 minimum volume ellipsoid,25 and minimum covariance determinant.26 Unfortunately, applications of these methods in the field of process monitoring are very limited.

In this paper, a new robust process monitoring method is proposed to cope with the presence of outliers in the training data set. First, the minimum covariance determinant (MCD) estimator is used to detect outliers in the training data set and to compute robust estimates of the location and scatter of the training data. A robust fault detection index, called the robust $T^2$ statistic, is then defined based on the MCD estimator. In addition, squared variable deviation magnitude (SVDM) is defined as a robust measure of the deviation of a process variable from the normal range. An SVDM-based fault diagnosis method is developed to identify faulty variables responsible for process faults. The proposed robust process monitoring method is tested on the Tennessee Eastman process to demonstrate its effectiveness and advantages.

The rest of the paper is organized as follows. Fundamental concepts of the minimum covariance determinant (MCD) estimator and $T^2$ statistic are introduced briefly in Section 2. Section 3 describes the robust process monitoring method and its components: the outlier detection method using the MCD
estimator, the robust $T^2$ statistic for fault detection, and the SVDM-based fault diagnosis method. In Section 4, the proposed method is tested on the Tennessee Eastman process. The conclusions are presented in Section 5.

2. Preliminaries

2.1. Minimum covariance determinant estimator

Rousseeuw25,26 proposed the minimum covariance determinant (MCD) estimator to compute robust estimates of multivariate location and scatter when outliers are present in the data set. Consider a data set $X = (x_1, \ldots, x_n)^T$ consisting of n observations of dimension v. The MCD estimator searches for a subset of h observations (out of n) whose covariance matrix has the lowest determinant. Let $\Omega(\mathrm{MCD}) = \{i_1, \ldots, i_h\}$ denote the indices of observations in the MCD subset. According to the definition of the MCD subset, the determinant of the covariance matrix of observations $x_{i_1}, \ldots, x_{i_h}$ is minimal over all subsets of observations of size h, namely

$$\det\left(\operatorname{cov}(x_{i_1}, \ldots, x_{i_h})\right) \le \det\left(\operatorname{cov}(x_{k_1}, \ldots, x_{k_h})\right) \tag{1}$$

for any subset $\{k_1, \ldots, k_h\}$ of $\{1, \ldots, n\}$. The MCD estimate of location is the average of the h observations in the MCD subset27

$$\hat{\mu}_{MCD} = \frac{1}{h} \sum_{i \in \Omega(\mathrm{MCD})} x_i \tag{2}$$

The MCD estimate of the scatter matrix is a multiple of the covariance matrix of the MCD subset27

$$\hat{\Sigma}_{MCD} = \frac{c(h,n,v)\, s(h,n,v)}{h-1} \sum_{i \in \Omega(\mathrm{MCD})} (x_i - \hat{\mu}_{MCD})(x_i - \hat{\mu}_{MCD})^T \tag{3}$$

where $c(h,n,v)$ and $s(h,n,v)$ are two correction factors to obtain consistent and unbiased estimates when the data come from a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, i.e., $x \sim N(\mu, \Sigma)$.28,29 The small sample bias-correction factor $s(h,n,v)$ is obtained by Pison et al.29 through a combination of Monte Carlo simulation and parametric interpolation. The consistency correction factor $c(h,n,v)$ is given by27

$$c(h,n,v) = \frac{h/n}{P\left(\chi^2_{v+2} \le \chi^2_{v,h/n}\right)} \tag{4}$$

where $\chi^2_{v,h/n}$ is the $h/n$ quantile of the $\chi^2_v$ distribution, and $\lfloor \cdot \rfloor$ is the greatest integer function30 (the MCD estimator attains its highest breakdown value at $h = \lfloor (n+v+1)/2 \rfloor$). A larger h yields more efficient estimates, but at the expense of a reduced breakdown value. The MCD estimator can be computed efficiently with the FAST-MCD algorithm.31 The FAST-MCD algorithm computes a one-step reweighted MCD (RMCD) estimate given by31

$$\hat{\mu}_{RMCD} = \frac{\sum_{i=1}^{n} w_i x_i}{m} \tag{5}$$

$$\hat{\Sigma}_{RMCD} = \frac{c^*(m,n,v)\, s^*(m,n,v)}{m-1} \sum_{i=1}^{n} w_i (x_i - \hat{\mu}_{RMCD})(x_i - \hat{\mu}_{RMCD})^T \tag{6}$$

where $m = \sum_{i=1}^{n} w_i$, and $w_i$ is the weight of the ith observation $x_i$. The factors $c^*(m,n,v)$ and $s^*(m,n,v)$ are correction factors to guarantee the consistency of the reweighted estimate and to reduce the small sample bias, as do the corresponding factors in Eq. (3). The weight $w_i$ is determined by31

$$w_i = \begin{cases} 1 & \text{if } d^2_{(MCD)}(x_i) \le d^2_{(MCD)*} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where $d^2_{(MCD)}(x_i)$ is the robust squared Mahalanobis distance (RSMD) defined by

$$d^2_{(MCD)}(x_i) = (x_i - \hat{\mu}_{MCD})^T \hat{\Sigma}_{MCD}^{-1} (x_i - \hat{\mu}_{MCD}) \tag{8}$$

and $d^2_{(MCD)*}$ is a threshold. A suggested threshold is the 0.975 quantile of the …

… $T^2_*$), a process fault is detected.
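The one-step reweighting of Eqs. (5)-(8) in Section 2.1 can be illustrated with a short sketch. It assumes that raw MCD estimates of location and scatter are already available (for example from a FAST-MCD implementation) and that the RSMD threshold is supplied by the caller. The function names and the `correction` argument, which stands in for the product of the two correction factors in Eq. (6) and defaults to 1, are hypothetical and not taken from the paper.

```python
import numpy as np


def squared_mahalanobis(X, mu, sigma):
    """Squared Mahalanobis distance of every row of X, as in Eq. (8)."""
    diff = X - mu
    # Solve sigma * z = diff^T rather than forming the explicit inverse.
    z = np.linalg.solve(sigma, diff.T).T
    return np.einsum("ij,ij->i", diff, z)


def reweight(X, mu_mcd, sigma_mcd, threshold, correction=1.0):
    """One-step reweighted estimates following Eqs. (5)-(7).

    `threshold` is the RSMD cutoff chosen by the caller.
    `correction` is an assumed placeholder for the product of the
    consistency and small-sample factors in Eq. (6).
    """
    d2 = squared_mahalanobis(X, mu_mcd, sigma_mcd)
    w = (d2 <= threshold).astype(float)           # Eq. (7): 1 = kept, 0 = outlier
    m = w.sum()
    mu_r = (w[:, None] * X).sum(axis=0) / m       # Eq. (5)
    diff = X - mu_r
    sigma_r = correction * (w[:, None] * diff).T @ diff / (m - 1)  # Eq. (6)
    return w, mu_r, sigma_r


# Toy data: four regular points and one gross outlier.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
w_demo, mu_demo, sigma_demo = reweight(
    X_demo, np.array([0.5, 0.5]), 0.25 * np.eye(2), threshold=5.0
)
```

In the toy call the last observation lies far from the initial location estimate, so it receives weight 0 and does not enter the reweighted mean or covariance.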
3.3. Fault diagnosis with the squared variable deviation magnitude

Variables responsible for a detected fault need to be determined in order to eliminate the fault. If a process fault occurs, some variables may exceed the normal operating ranges. The key to fault diagnosis is therefore to identify which variables have deviated from normal ranges.38 This requires an appropriate metric for quantifying deviations of variables from normal ranges. Squared variable deviation magnitude (SVDM) is defined as such a metric. The SVDM of the jth variable $x_j$ in an observation x is given by

$$\Delta_j^2 = \frac{(x_j - \hat{\mu}_{RMCD,j})^2}{\hat{\Sigma}_{RMCD,jj}} \tag{20}$$

where $\hat{\mu}_{RMCD,j}$ is the jth element of $\hat{\mu}_{RMCD}$, $\hat{\Sigma}_{RMCD,jj}$ is the jth diagonal element of $\hat{\Sigma}_{RMCD}$, and $\hat{\mu}_{RMCD}$ and $\hat{\Sigma}_{RMCD}$ are the reweighted MCD estimates of location and scatter of the training data. For observations obtained in normal operating conditions, the SVDM values of the v variables are below the confidence limits $\bar{\Delta}_\alpha = [\bar{\Delta}_{1,\alpha}, \ldots, \bar{\Delta}_{v,\alpha}]$ (i.e., $\Delta_j^2 \le \bar{\Delta}_{j,\alpha}$). The confidence limits $\bar{\Delta}_\alpha$ of SVDM can be estimated using the training data. If the SVDM values of the jth variable in the m normal training observations (with $w_i = 1$) are considered as a univariate sample $\Phi_j = \{\Delta_{j,1}^2, \ldots, \Delta_{j,m}^2\}$, its probability density function may be estimated by kernel density estimation39

$$\hat{f}(\Delta^2) = \frac{1}{m\theta} \sum_{k=1}^{m} K\left(\frac{\Delta^2 - \Delta_{j,k}^2}{\theta}\right) \tag{21}$$

where $\Delta_{j,k}^2$ is the SVDM value of the jth variable in the kth observation $x_k$, $\theta$ is a bandwidth parameter, and $K(\cdot)$ denotes a kernel function. The confidence limit, $\bar{\Delta}_{j,\alpha}$, of the SVDM for the jth variable at significance level $\alpha$ is computed by

$$\int_{-\infty}^{\bar{\Delta}_{j,\alpha}} \hat{f}(\Delta^2)\, d\Delta^2 = 1 - \alpha \tag{22}$$
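As an illustration of Eqs. (21) and (22), the sketch below estimates the confidence limit from a sample of SVDM values by integrating a kernel density estimate and solving for the point where the cumulative probability reaches 1 − α. It assumes a Gaussian kernel and, when no bandwidth is given, a normal-reference rule of thumb for θ; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
import statistics


def kde_cdf(value, sample, bandwidth):
    """Integral of the Gaussian-kernel estimate in Eq. (21) up to `value`."""
    root2 = math.sqrt(2.0)
    total = sum(
        0.5 * (1.0 + math.erf((value - s) / (bandwidth * root2))) for s in sample
    )
    return total / len(sample)


def kde_confidence_limit(sample, alpha=0.01, bandwidth=None, tol=1e-9):
    """Solve Eq. (22) for the confidence limit by bisection."""
    if bandwidth is None:
        # Assumed rule of thumb: 1.06 * std * m^(-1/5); needs at least 2 points.
        bandwidth = 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)
    target = 1.0 - alpha
    lo = min(sample) - 10.0 * bandwidth   # estimated CDF is ~0 here
    hi = max(sample) + 10.0 * bandwidth   # estimated CDF is ~1 here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, sample, bandwidth) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example with made-up SVDM values of one variable.
svdm_sample = [0.1, 0.4, 0.2, 1.3, 0.7, 0.05, 2.1, 0.9]
limit_99 = kde_confidence_limit(svdm_sample, alpha=0.01)
```

Because the Gaussian kernel places some probability mass below zero while SVDM values are non-negative, a boundary-corrected or one-sided kernel may be preferable in practice; the sketch keeps the simplest form.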
s,( , is setting it as the (1 − α) quantile of An alternative way to determine the confidence limit, Δ s,( . A the sample Φs . For a faulty observation, there may exist at least one variable for which Δs= > Δ variable is thus considered to be faulty (with (1 − α) ∙ 100% confidence) if its SVDM value exceeds the confidence limit. The variable with the largest SVDM may be responsible for the fault. 3.4. Process monitoring procedure Figure 1 illustrates the proposed robust process monitoring procedure, which consists of the main steps as follows: Stage I: Offline data analysis Step I-1: Detect outliers in the training data set X, and compute the MCD estimators C!"# and 'C!"# via Eq. (5) and Eq. (6). Step I-2: Determine the control limit d∗= for the robust T2 statistic via Eq. (18). Step I-3: Compute SVDM values of all variables in normal training observations via Eq. (20). 11
ACS Paragon Plus Environment
Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Step I-4: Determine confidence limits for the SVDM of all variables via Eq. (21) and Eq. (22). Stage II: Online process monitoring For each new observation, xnew = Step II-1: Compute the robust T2 value, dqD , of xnew via Eq. (19). = = Step II-2: Compare dqD to the control limit d∗= . If dqD ≤ d∗= , skip to Step II-5, otherwise detect
a fault and go to Step II-3. Step II-3: Compute SVDM values for all variables in xnew via Eq. (20), Step II-4: Compare SVDM values to the corresponding confidence limits to identify faulty variables. Step II-5: Return to step II-1 and monitor the next observation.
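Steps I-3, I-4 (using the quantile alternative) and II-2 to II-4 above can be sketched in a few lines. The sketch takes the robust $T^2$ value and its control limit as given inputs, uses a simple order-statistic quantile for the SVDM limits, and returns the flagged variables sorted by decreasing SVDM. All function names are hypothetical, and the quantile convention is an assumption.

```python
import math


def svdm(x, mu, sigma_diag):
    """Eq. (20): squared deviation of each variable, scaled by its robust variance."""
    return [(xj - mj) ** 2 / sjj for xj, mj, sjj in zip(x, mu, sigma_diag)]


def empirical_limits(normal_obs, mu, sigma_diag, alpha=0.01):
    """Per-variable (1 - alpha) sample quantile of the SVDM (Steps I-3 and I-4)."""
    columns = zip(*(svdm(x, mu, sigma_diag) for x in normal_obs))
    limits = []
    for col in columns:
        ordered = sorted(col)
        # Smallest order statistic with at least (1 - alpha) of the sample at or below it.
        k = max(0, math.ceil((1.0 - alpha) * len(ordered)) - 1)
        limits.append(ordered[k])
    return limits


def monitor(x_new, t2_new, t2_limit, mu, sigma_diag, limits):
    """Steps II-2 to II-4: return (fault_detected, faulty variable indices)."""
    if t2_new <= t2_limit:
        return False, []
    values = svdm(x_new, mu, sigma_diag)
    faulty = [j for j, (val, lim) in enumerate(zip(values, limits)) if val > lim]
    # Largest SVDM first: the most likely contributor to the fault.
    faulty.sort(key=lambda j: values[j], reverse=True)
    return True, faulty


# Toy example with two variables; mu and sigma_diag play the role of the
# reweighted MCD location and the diagonal of the reweighted MCD scatter.
mu_demo = [0.0, 0.0]
sigma_diag_demo = [1.0, 4.0]
normal_demo = [[1.0, 2.0], [-1.0, -2.0], [2.0, 0.0], [0.0, 4.0]]
limits_demo = empirical_limits(normal_demo, mu_demo, sigma_diag_demo, alpha=0.25)
result_demo = monitor([3.0, 1.0], 12.0, 5.0, mu_demo, sigma_diag_demo, limits_demo)
```

Indices in the returned list are zero-based, so index 0 corresponds to variable v1 in the paper's numbering.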
Figure 1. Flow diagram of the robust process monitoring procedure.

4. Case study: Tennessee Eastman process

4.1. Process description

The performance of the proposed robust process monitoring method is illustrated by a simulation
of the Tennessee Eastman (TE) process, a well-known benchmark and realistic representation of a real industrial process.40 This process consists of five major unit operations: an exothermic reactor, a condenser, a flash separator, a recycle compressor and a reboiled stripper. Two products G and H and a by-product F are produced from four reactants A, C, D and E and an inert B. The 33 process variables in Table 1 are monitored. The training data set consists of 960 samples, with 910 normal samples collected in normal operating conditions and 50 Gaussian distributed outliers. Moreover, 21 faulty data sets are generated by introducing the 21 process faults in Table 2 to the TE process. Each faulty data set contains 960 samples, with the fault occurring at the 161st sample.

Table 1. Monitored variables in the TE process.

| No. | Variable name | Units |
|-----|---------------|-------|
| v1 | A feed (stream 1) | kscmh |
| v2 | D feed (stream 2) | kg h-1 |
| v3 | E feed (stream 3) | kg h-1 |
| v4 | A and C feed (stream 4) | kscmh |
| v5 | Recycle flow (stream 8) | kscmh |
| v6 | Reactor feed rate (stream 6) | kscmh |
| v7 | Reactor pressure | kPa |
| v8 | Reactor level | % |
| v9 | Reactor temperature | ℃ |
| v10 | Purge rate (stream 9) | kscmh |
| v11 | Product separator temperature | ℃ |
| v12 | Product separator level | % |
| v13 | Product separator pressure | kPa |
| v14 | Product separator underflow | m3 h-1 |
| v15 | Stripper level | % |
| v16 | Stripper pressure | kPa |
| v17 | Stripper underflow (stream 11) | m3 h-1 |
| v18 | Stripper temperature | ℃ |
| v19 | Stripper steam flow | kg h-1 |
| v20 | Compressor work | kW |
| v21 | Reactor cooling water outlet temperature | ℃ |
| v22 | Condenser cooling water outlet temperature | ℃ |
| v23 | D feed flow valve (stream 2) | % |
| v24 | E feed flow valve (stream 3) | % |
| v25 | A feed flow valve (stream 1) | % |
| v26 | A and C feed flow valve (stream 4) | % |
| v27 | Compressor recycle valve | % |
| v28 | Purge valve (stream 9) | % |
| v29 | Separator pot liquid flow valve (stream 10) | % |
| v30 | Stripper liquid product flow valve (stream 11) | % |
| v31 | Stripper steam valve | % |
| v32 | Reactor cooling water flow valve | % |
| v33 | Condenser cooling water flow valve | % |
Table 2. TE process faults.

| No. | Faulty variable | Type |
|-----|-----------------|------|
| 1 | A/C feed ratio, B composition constant (stream 4) | Step |
| 2 | B composition, A/C feed ratio constant (stream 4) | Step |
| 3 | D feed temperature (stream 2) | Step |
| 4 | Reactor cooling water inlet temperature | Step |
| 5 | Condenser cooling water inlet temperature | Step |
| 6 | A feed loss (stream 1) | Step |
| 7 | C header pressure loss-reduced availability (stream 4) | Step |
| 8 | A, B, C feed composition (stream 4) | Random |
| 9 | D feed temperature (stream 2) | Random |
| 10 | C feed temperature (stream 4) | Random |
| 11 | Reactor cooling water inlet temperature | Random |
| 12 | Condenser cooling water inlet temperature | Random |
| 13 | Reaction kinetics | Slow drift |
| 14 | Reactor cooling water valve | Sticking |
| 15 | Condenser cooling water valve | Sticking |
| 16 | Unknown | Unknown |
| 17 | Unknown | Unknown |
| 18 | Unknown | Unknown |
| 19 | Unknown | Unknown |
| 20 | Unknown | Unknown |
| 21 | The valve for stream 4 was fixed at the steady state position | Constant |
4.2. Outlier detection results

To eliminate the influence of outliers on the process monitoring performance, the MCD estimator is applied to detect outliers in the training data set and to compute robust estimates of the location and scatter of the training data. A total of 49 outliers are detected by the outlier detection method in Section 3.1. Figure 2a shows the robust squared Mahalanobis distances (RSMDs) of all training samples. The red line represents the RSMD threshold used to distinguish outliers from normal samples. The threshold is selected as the 0.975 quantile (a common choice in the literature4,31,37) of the scaled F distribution in Eq. (11). Samples with RSMDs larger than the threshold are flagged as outliers. The 49 outliers are labeled with their sample numbers in Figure 2a. For comparison, Figure 2b shows the outlier detection result obtained using the classical squared Mahalanobis distance
(CSMD). The threshold of CSMD is the 0.975 quantile of the