Robust Self-supervised Model and Its Application for Fault Detection

Li Jiang, Zhihuan Song*, Zhiqiang Ge, Junghui Chen

State Key Laboratory of Industrial Control Technology, Institute of Industrial Process Control, Department of Control Science and Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, P. R. China

Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li 320, Taiwan, ROC

Abstract: Previous work on process monitoring has shown that chemical processes can be modeled with data-based models such as principal component analysis (PCA) and neural network (NN) models. However, it is difficult to train a model with good generalization capability for fault detection, especially for nonlinear processes. Based on the idea of making the trained model robust to noisy training data, this paper develops a unified training method for PCA and auto-encoder (AE) models. A unified model, called the Robust Self-supervised (RSS) model, is first proposed. Theoretical analysis then shows that models trained by the proposed method are more sensitive to fault occurrences in the process, and the corresponding monitoring statistic is introduced. The simulation results of three case studies evaluate the monitoring performance of both the Robust Auto-encoder and the Robust PCA models.
1. Introduction

Over the last decades, the modern chemical industry has made tremendous progress toward safe and high-quality production.1 Nowadays, the widespread use of computer-aided process control and instrumentation techniques in the process industry produces huge amounts of process data, yet many operational problems and inefficiencies still go undiagnosed; the situation is data rich but information poor. Therefore, data-driven process monitoring, or statistical process monitoring, has become one of the most active research areas in industrial process control.2 The multivariate statistical process monitoring (MSPM) tasks typically include (i) fault detection, (ii) fault identification or diagnosis, (iii) fault reconstruction, and (iv) product quality monitoring and control.3 To detect faults occurring in the process, most existing MSPM methods are based on multivariate statistics and machine learning. A traditional MSPM model such as principal component analysis (PCA) separates the data information into two parts: a systematic part and a noisy part. Nevertheless, conventional MSPM methods rest on the precondition that the process variables are linearly correlated with each other, which is quite restrictive in practice. To monitor nonlinear processes, several nonlinear extensions of MSPM methods have been proposed, including methods based on neural networks, kernel functions, and linear approximation. Kernel approaches such as kernel PCA (KPCA) map the measured data into a high-dimensional feature space through a kernel function.4 However, the size of the kernel matrix grows with the square of the number of samples, so the calculation becomes time consuming when there are many samples. Linear approximation methods for nonlinear process monitoring assume that the nonlinear space can be approximated by several local linear models, but this type of method may not be able to model strong nonlinearities in the process.5 Another type of nonlinear process monitoring method is the neural-network-based model, such as the auto-associative neural network and the principal curve method.6,7 After obtaining the associated scores and the correlated data with the principal curve method, these approaches use a neural network to map the original data into scores and to map the scores back to the original variables. In recent years, a new family of neural-network-based dimension reduction techniques, known as deep learning, has become popular in machine learning. Deep learning methods include auto-encoder-based models, restricted Boltzmann machine (RBM) based models, and others. One central idea of these methods is that they can be trained in an unsupervised way to learn representations of nonlinear data.8,9 The unification of linear and nonlinear methods has also been studied by several researchers, for example in nonlinear continuum regression (NLCR).10

Besides the nonlinearity issue, one major shortcoming of classical MSPM models is their brittleness with respect to corrupted or outlying data, which arise in process monitoring applications where some measurements may be arbitrarily corrupted. A number of robust approaches have been explored so far. There are three major ways to deal with this problem: (i) assume the data follow a Student's t-distribution in probabilistic modeling methods;11 (ii) detect and eliminate the corrupted data before applying classical MSPM methods;12 (iii) eliminate the influence of the corrupted data through robust training. This paper takes the third way, by training a robust MSPM model.

Motivated by the above analysis, this paper presents a novel robust modeling technique that unifies linear and nonlinear methods. A unification framework called the self-supervised model is first proposed to cover several existing MSPM methods, including PCA and the auto-associative neural network. The robust training approach is then introduced into the self-supervised model by adding artificial noise to the model input, and the corresponding process monitoring procedure after robust training is given. With the incorporation of robust training, whether the input data are corrupted or not, the model produces a robust output and the SPE statistic becomes more sensitive to faults. The advantage of the proposed robust self-supervised model is demonstrated through theoretical analysis and through case studies with linear and nonlinear models.

The rest of the paper is organized as follows. The next section introduces the unification framework of the different models. Section 3 formulates the robust training problem and the corresponding training criterion. Section 4 describes the application to the PCA model using linear methods, followed by a numerical example. Section 5 introduces a neural network called the auto-encoder, which can be used to reconstruct the original space; based on it, a Robust Auto-encoder model is proposed for monitoring nonlinear processes. In the case studies, the robust methods are compared with the traditional methods on two industrial processes to show the superiority of the proposed approach. Finally, conclusions are drawn.
2. Self-supervised Model

The self-supervised model is a unification of different models. Training a model to learn the identity mapping has been called self-supervised backpropagation or autoassociation.13,14 The self-supervised model maps the data into a lower-dimensional feature space and then restores the intrinsic features to the original dimensionality of the data. For an input training data vector $\mathbf{x} = [x_1, x_2, \ldots, x_J]^T \in \mathbb{R}^J$, where $J$ is the number of variables, the two-step mapping function that predicts the input data can be represented as

$$r(\mathbf{x}) = \theta\left(\boldsymbol{\beta};\, \phi\left(\boldsymbol{\alpha}; \mathbf{x}\right)\right) \qquad (1)$$

where $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ are the matrices of function parameters. The first step is the mapping $\phi(\cdot): \mathbb{R}^J \to \mathbb{R}^K$, the projection function that maps the input to the feature space, where $K$ is the dimension of the feature space. The latent variable $\mathbf{t}$ in the feature space is denoted as

$$\mathbf{t} = \phi\left(\boldsymbol{\alpha}; \mathbf{x}\right) \qquad (2)$$

The second mapping is a reconstruction function $\theta(\cdot): \mathbb{R}^K \to \mathbb{R}^J$, carried out from the feature space. The output vector of the self-supervised model is $\mathbf{r} = [r_1, r_2, \ldots, r_J]^T$, viewed as a prediction of the input vector $\mathbf{x}$. The self-supervised model can be trained by minimizing the sum of squared errors in the expected-loss form

$$L = \int \left\| r(\mathbf{x}) - \mathbf{x} \right\|^2 p(\mathbf{x})\, d\mathbf{x} \qquad (3)$$

where $\|\cdot\|^2$ denotes the squared Euclidean distance and $p(\mathbf{x})$ is the probability density function of the input data. The objective function given by eq 3 specializes to PCA, NLPCA, and the auto-encoder for different choices of $r(\cdot)$. In the experimental setting, for a finite discrete data set consisting of $I$ samples, the probability $p(\mathbf{x})$ is

$$p(\mathbf{x}) = \frac{1}{I} \sum_{i=1}^{I} \delta\left(\mathbf{x} - \mathbf{x}_i\right) \qquad (4)$$

in which $\delta$ is the Dirac delta function and $\mathbf{x}_i \in \mathbb{R}^{J \times 1}$ is the $i$th sample vector. The objective function of the empirical loss is then

$$L = \frac{1}{I} \int \left\| r(\mathbf{x}) - \mathbf{x} \right\|^2 \sum_{i=1}^{I} \delta\left(\mathbf{x} - \mathbf{x}_i\right) d\mathbf{x} = \frac{1}{I} \sum_{i=1}^{I} \left\| r(\mathbf{x}_i) - \mathbf{x}_i \right\|^2 \qquad (5)$$
where $r(\mathbf{x}_i)$ is the output of the self-supervised model when the input is $\mathbf{x}_i$. Minimizing eq 5 is the criterion of the self-supervised model in the experimental setting. The self-supervised model provides a unified form of different models, both linear and nonlinear. Table 1 lists the corresponding projection and reconstruction functions of the different models.

Table 1. Comparison of Different Self-supervised Methods

Methods        Projection function φ(x)    Reconstruction function θ(t)    Optimization parameters α, β
PCA            P^T x                       Q t                             P, Q
NLPCA          W11 f(W12 x + b1)           W21 f(W22 t + b2)               W11, W12, b1, W21, W22, b2
Auto-Encoder   f(W1 x + b1)                g(W2 t + b2)                    W1, b1, W2, b2
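To make the unified form concrete, the following minimal sketch (all variable names and data are hypothetical, not from the paper) evaluates the empirical loss of eq 5 for the PCA row of Table 1, where φ(x) = P^T x and θ(t) = Q t:

```python
import numpy as np

def self_supervised_loss(X, project, reconstruct):
    """Empirical loss of eq 5: mean squared reconstruction error over I samples."""
    R = np.array([reconstruct(project(x)) for x in X])  # r(x_i) for each sample
    return np.mean(np.sum((R - X) ** 2, axis=1))

# PCA instantiation from Table 1: phi(x) = P^T x, theta(t) = Q t
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                       # I = 100 samples, J = 5 variables
X -= X.mean(axis=0)                                     # mean-centered training data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Q = Vt[:2].T                                            # first K = 2 loading vectors
P = Q                                                   # for ordinary PCA, P = Q
loss = self_supervised_loss(X, lambda x: P.T @ x, lambda t: Q @ t)
```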

However, this unified training form by itself does not improve the generalization performance of the model; that is, the model may not perform well on new input data that do not belong to the training data set. As shown in Figure 1, x1, x2, and x3 are input training data drawn from the normal operating state, and they all lie in a local area in which they are close to each other. The green crosses represent normal data points, and the red crosses represent faulty data points. The self-supervised model learns the identity function, for which the output equals the input, i.e., r(x) = x, shown as the dotted oblique line in Figure 1. When a faulty datum x_t lies far from the normal area, the model simply returns the same value x_t. The model output is therefore not robust to the input data and cannot capture the normal-operating-state information. To address this problem, a novel training method for the self-supervised model is proposed.

Figure 1. Training and test results of the self-supervised model in a one-dimensional example.

Considering a multivariable process, some faults change the correlation among the monitored variables. In such cases, the fault is usually detected in the latent variable space, which is not the main concern of this paper. The robust training methods proposed here mainly improve the detection performance in the residual space.

3. Robust Self-supervised (RSS) Model

Models trained directly on the historical data matrix with the criterion of Section 2 cannot capture a stable structure. Therefore, to enforce model robustness to online data, this section first proposes an additional criterion: robustness to artificial corruption of the input, i.e., the model is trained to reconstruct clean data from an artificially corrupted input. The relationship between this objective function and the one in Section 2 is then investigated. Finally, the corresponding process monitoring statistic is introduced for monitoring the process after the model is trained.

3.1 Robust Training Criterion of the Self-supervised Model. It is assumed that the historical data vector x is generated from the normal operating condition. The normal data x can be seen as a noise-free representation h plus an additive measurement error ξ,

$$\mathbf{x} = \mathbf{h} + \boldsymbol{\xi} \qquad (6)$$

When test data $\mathbf{x}_t$ are corrupted by fault information, $\mathbf{x}_t$ may contain a fault change $\mathbf{f}$, formulated as

$$\mathbf{x}_t = \mathbf{h}_t + \mathbf{f} + \boldsymbol{\xi}_t \qquad (7)$$

The problem of obtaining an RSS model is defined as follows.

Problem (robust self-supervised model). Given a training data vector x = h + ξ drawn from the steady-state condition, h and ξ are unknown, but h is the idealized (noise-free) systematic normal part and ξ is known to follow a Gaussian distribution (measurement noise). For a test data vector x_t = h_t + f + ξ_t, h_t and h are drawn from the same noise-free normal data, ξ_t and ξ follow the same Gaussian distribution, and f is the unknown fault change. The goal is to use x to learn a model r(·) that is robust to the fault change f; that is, r(x_t) should approximate the unknown h_t.

To solve this problem, the optimization criterion of the self-supervised model in Section 2 is modified into an RSS training method. The key idea of training an RSS model is intuitive: to make the learned model robust to test data that contain fault information, train the model to reproduce the uncorrupted input from a corrupted one. The robust training of the self-supervised model first adds a Gaussian vector ε to each input data vector x before x is fed into the self-supervised model, and then trains the model to reconstruct the clean input:

$$L = \int \left\| r(\mathbf{x} + \boldsymbol{\varepsilon}) - \mathbf{x} \right\|^2 p(\mathbf{x})\, d\mathbf{x} \qquad (8)$$

where $\boldsymbol{\varepsilon} \sim N(0, \sigma_\varepsilon^2 \mathbf{I})$ and $\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_J]^T$ is a vector in which every element is independent and follows the same Gaussian distribution with variance $\sigma_\varepsilon^2$. Intuitively, the modified model enforces robustness to faulty inputs by trying to reconstruct the normal input from a corrupted one. Figure 2 is a schematic representation of the robust training process: an example x is corrupted to x + ε, mapped to the feature space, and the output r is used to predict x by minimizing the corresponding squared error L(x, r).
Figure 2. Robust training for input x.

In the process monitoring context, this robust training procedure can help the model withstand a certain degree of fault occurrence. In the experimental setting, training the self-supervised model with the criterion of Section 2 forces the learned model $r^*(\mathbf{x})$ to approximate h:

$$\min E\left(\left\| r(\mathbf{x}) - \mathbf{x} \right\|^2\right) \;\Rightarrow\; r^*(\mathbf{x}) \approx \mathbf{h} \qquad (9)$$

However, during online monitoring, when a fault change f occurs in the process, the model may not perform well; i.e., it cannot recover the noise-free part $\mathbf{h}_t$ from the faulty data $\mathbf{x}_t$. That is,

$$r^*(\mathbf{x}_t) \neq \mathbf{h}_t \qquad (10)$$

With the RSS model, the criterion leads to a more robust result:

$$\min E\left(\left\| r(\mathbf{x} + \boldsymbol{\varepsilon}) - \mathbf{x} \right\|^2\right) \;\Rightarrow\; r^*(\mathbf{x} + \boldsymbol{\varepsilon}) \approx \mathbf{h} \qquad (11)$$

Because the artificial additive noise ε can be arbitrary, it is predictable that, in the online test procedure, the model can recover the noise-free part $\mathbf{h}_t$ from the faulty data $\mathbf{x}_t$:

$$r^*(\mathbf{x}_t) \approx \mathbf{h}_t \qquad (12)$$
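A minimal sketch of the robust criterion of eq 8 in its empirical form follows (the model r is a placeholder; the essential point is the corrupted-input/clean-target pairing):

```python
import numpy as np

def rss_loss(X, r, sigma_eps, rng):
    """Empirical form of eq 8: corrupt each input with N(0, sigma_eps^2 I),
    but score the reconstruction against the clean input."""
    E = sigma_eps * rng.standard_normal(X.shape)  # artificial corruption epsilon
    R = r(X + E)                                  # the model sees x + eps ...
    return np.mean(np.sum((R - X) ** 2, axis=1))  # ... but must reproduce x
```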

This feature can be used for fault detection in process monitoring.

3.2 Regularization Formulation. In supervised learning there is a well-known link between training with noise and regularization: the two are equivalent for small additive noise.16 Similarly, for this unsupervised method, the optimization criterion of the RSS model is equivalent to a regularization term added to the output mean squared error, as stated by the following theorem.

Theorem 1. When the artificial noise is $\boldsymbol{\varepsilon} \sim N(0, \sigma_\varepsilon^2 \mathbf{I})$, the robust training criterion of the self-supervised model

$$\min \int \left\| r(\mathbf{x} + \boldsymbol{\varepsilon}) - \mathbf{x} \right\|^2 p(\mathbf{x})\, d\mathbf{x} \qquad (13)$$

is equal to the regularized reconstruction error

$$\min \int \left( \left\| r(\mathbf{x}) - \mathbf{x} \right\|^2 + \sigma_\varepsilon^2 \left\| \frac{\partial r(\mathbf{x})}{\partial \mathbf{x}^T} \right\|_F^2 \right) p(\mathbf{x})\, d\mathbf{x} + o\left(\sigma_\varepsilon^2\right) \qquad (14)$$

The proof uses a Taylor expansion around x and is given in Appendix A. Theorem 1 shows that the objective criterion of the RSS model (eq 13) can be viewed as two parts: the basic reconstruction error (the same as in eq 3) and a regularization term. Minimizing the basic reconstruction error forces r to follow the variation of x; in the extreme case of no regularization ($\sigma_\varepsilon^2 = 0$), the model learns the identity function, just like the basic self-supervised model. The regularization term is the Frobenius norm of the derivative of r(x); minimizing it makes r as unresponsive to variations of x as possible, i.e., it forces the optimized r to be robust to x. In the other extreme case of pure regularization ($\sigma_\varepsilon^2 \to \infty$), the model learns a flat function, because then $\left\| \partial r(\mathbf{x}) / \partial \mathbf{x}^T \right\|_F^2 = 0$. Therefore, the tradeoff between the two parts leads the model output r to follow the change of the input x around the high-density area of the input data, while the derivative of r becomes very small in areas far away from the high-density region.
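For the linear case $r(\mathbf{x}) = \mathbf{Q}\mathbf{P}^T\mathbf{x}$, the equivalence of Theorem 1 holds exactly (not just to order $\sigma_\varepsilon^2$), which a quick Monte Carlo sketch can confirm; all quantities below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, sigma = 4, 2, 0.1
Q, _ = np.linalg.qr(rng.standard_normal((J, K)))  # orthonormal Q (J x K)
P = 0.8 * Q                                       # some projection matrix
W = Q @ P.T                                       # linear model r(x) = W x

X = rng.standard_normal((200000, J))
E = sigma * rng.standard_normal(X.shape)
noisy = np.mean(np.sum(((X + E) @ W.T - X) ** 2, axis=1))  # left side of eq 13
reg = (np.mean(np.sum((X @ W.T - X) ** 2, axis=1))
       + sigma**2 * np.sum(W**2))                          # right side of eq 14
# noisy and reg agree up to Monte Carlo sampling error
```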

This behavior is illustrated in Figure 3 with a one-dimensional example.

Figure 3. Comparison of the self-supervised model and the robust model in a one-dimensional example.

Similar to Figure 1, x1, x2, and x3 are input training data, and x_t is a faulty datum. The result of the optimization is r*, the optimal function that minimizes the objective. In Figure 3, the dotted line is the identity function learned by the training method of Section 2, and the solid line is the function r* learned by robust training. For a change f applied to normal data x, i.e., x_t = x + f, r*(x_t) remains close to x when the change deviates from the area of the training data, because the derivative of r* in that area is very small.

3.3 Process Monitoring Using the RSS Model. Monitoring is performed with the reconstruction error in the residual space. The squared prediction error (SPE) is used to model the data variation in the residual space:

$$\mathrm{SPE}_{\mathrm{train}} = \left(\mathbf{x} - r^*(\mathbf{x})\right)^T \left(\mathbf{x} - r^*(\mathbf{x})\right) \approx \left(\mathbf{x} - \mathbf{h}\right)^T \left(\mathbf{x} - \mathbf{h}\right) = \boldsymbol{\xi}^T \boldsymbol{\xi} \qquad (15)$$

The control limit can be built using the $\chi^2$ distribution of $\mathrm{SPE}_{\mathrm{train}}$:

$$\mathrm{SPE} \sim g\,\chi_h^2 \qquad (16)$$

where the parameters g and h are obtained from

$$g h = \mathrm{mean}\left(\mathrm{SPE}_{\mathrm{train}}\right) \qquad (17)$$

$$2 g^2 h = \mathrm{var}\left(\mathrm{SPE}_{\mathrm{train}}\right) \qquad (18)$$

For online data $\mathbf{x}_t$, the SPE is

$$\mathrm{SPE}_{\mathrm{test}} = \left(\mathbf{x}_t - r^*(\mathbf{x}_t)\right)^T \left(\mathbf{x}_t - r^*(\mathbf{x}_t)\right) \approx \left(\mathbf{x}_t - \mathbf{h}_t\right)^T \left(\mathbf{x}_t - \mathbf{h}_t\right) = \left(\boldsymbol{\xi} + \mathbf{f}\right)^T \left(\boldsymbol{\xi} + \mathbf{f}\right) \qquad (19)$$

So the SPE becomes large when there is a deviation f from the normal data. In the traditional method, the trained model r* is forced to follow all the variations in the data by minimizing the reconstruction error, which makes the SPE statistic insensitive to fault changes in the test data. Regularization, in contrast, keeps the model from being overly sensitive to the input variance, and in particular makes it ignore changes that deviate from the main systematic variation. As shown in Figure 3, when a fault change f occurs, the SPE of the traditional method is smaller than that of the robust training method.

Offline training procedure
• Collect the training data set X ∈ ℝ^(I×J) from the normal operating state and normalize it
• Add Gaussian corruption noise E to the input X to obtain the noisy input X + E
• Train the model with the noisy input by minimizing eq 13
• Calculate the monitoring statistic SPE of the normal operating data using eq 15
• Determine the SPE confidence limit

Online monitoring procedure
• Obtain new data x_t and normalize it
• Compute the reconstruction output r*(x_t)
• Calculate the monitoring statistic SPE for the new sample and check whether it exceeds the confidence limit (a minimal code sketch of both procedures follows below)
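In the sketch below, the trained model r_star, the data names, and the 99% level are assumptions; g and h follow from the moment matching of eqs 17 and 18 (g h = mean, 2 g² h = variance):

```python
import numpy as np
from scipy.stats import chi2

def spe(X, r_star):
    """SPE of eqs 15/19: squared reconstruction error of each sample."""
    return np.sum((X - r_star(X)) ** 2, axis=1)

def spe_control_limit(spe_train, alpha=0.99):
    """g * chi2_h limit (eq 16), with g and h matched to the training moments."""
    m, v = spe_train.mean(), spe_train.var()
    g = v / (2.0 * m)            # from g*h = m and 2*g^2*h = v
    h = 2.0 * m ** 2 / v
    return g * chi2.ppf(alpha, h)

# offline: limit = spe_control_limit(spe(X_train, r_star))
# online:  a new sample x_t raises an alarm when
#          spe(x_t[None, :], r_star)[0] > limit
```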

4. Process Monitoring Based on PCA Using the Robust Training Method

PCA is a widely used dimension reduction and process monitoring technique. The idea of PCA is that high-dimensional data can often be represented by a much lower-dimensional representation (i.e., latent variables). This section develops a robustly trained PCA model to show the efficiency of linear process monitoring based on the RSS model.

4.1 Robust PCA (RPCA). As a typical linear self-supervised model, the traditional PCA model can be formulated through the minimum-reconstruction-error formulation

$$\min_{\mathbf{P}, \mathbf{Q}} \frac{1}{I} \sum_{i=1}^{I} \left\| \mathbf{x}_i - \mathbf{Q}\mathbf{P}^T \mathbf{x}_i \right\|^2 \quad \text{subject to } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}_K \qquad (20)$$

where $\mathbf{Q}\mathbf{P}^T\mathbf{x}_i$ is the linear representation of $r(\mathbf{x}_i)$ from Section 2, $\mathbf{P} = \{\mathbf{p}_k\} \in \mathbb{R}^{J \times K}$, $k = 1, 2, \ldots, K$, is the loading matrix that maps the input to the latent space, and $\mathbf{Q} = \{\mathbf{q}_k\} \in \mathbb{R}^{J \times K}$ is the reconstruction matrix that reconstructs the input from the latent variables. This formulation is equivalent to the variance-maximization formulation of PCA, which is the alternative derivation of PCA besides the minimum-error approach.15 As discussed in Section 3, artificial noise can be added to the input data to train a Robust PCA (RPCA), so the objective function becomes

$$\min_{\mathbf{P}, \mathbf{Q}} \frac{1}{I} \sum_{i=1}^{I} \left\| \mathbf{x}_i - \mathbf{Q}\mathbf{P}^T \left(\mathbf{x}_i + \boldsymbol{\varepsilon}_i\right) \right\|^2 \quad \text{subject to } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}_K \qquad (21)$$

By Theorem 1 (eq 14), this objective can also be rewritten in the regularized, continuous form
$$L = \int \left( \left\| \mathbf{x} - \mathbf{Q}\mathbf{P}^T \mathbf{x} \right\|^2 + \sigma_\varepsilon^2 \left\| \mathbf{Q}\mathbf{P}^T \right\|_F^2 \right) p(\mathbf{x})\, d\mathbf{x} \quad \text{subject to } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}_K \qquad (22)$$

Optimizing this loss function yields the RPCA principal components.

Theorem 2. Let $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$ be the eigendecomposition of the covariance matrix, where $\mathbf{D}^2 = \mathrm{diag}\left(\lambda_1, \lambda_2, \ldots, \lambda_J\right)$ with $\lambda_1 > \lambda_2 > \cdots > \lambda_J$, and $\mathbf{V} = \{\mathbf{v}_j\} \in \mathbb{R}^{J \times J}$, $j = 1, 2, \ldots, J$. For any $\sigma_\varepsilon^2$, let

$$(\mathbf{Q}, \mathbf{P}) = \arg\min_{\mathbf{Q}, \mathbf{P}} \int \left( \left\| \mathbf{x} - \mathbf{Q}\mathbf{P}^T \mathbf{x} \right\|^2 + \sigma_\varepsilon^2 \left\| \mathbf{Q}\mathbf{P}^T \right\|_F^2 \right) p(\mathbf{x})\, d\mathbf{x} \quad \text{subject to } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}_K \qquad (23)$$

Then

$$\mathbf{q}_k = \mathbf{v}_k, \quad \mathbf{p}_k = \frac{\lambda_k}{\lambda_k + \sigma_\varepsilon^2}\,\mathbf{v}_k, \quad k = 1, 2, \ldots, K \qquad (24)$$

$$\mathbf{P} = \mathbf{Q}\,\mathrm{diag}\left( \frac{\lambda_1}{\lambda_1 + \sigma_\varepsilon^2}, \frac{\lambda_2}{\lambda_2 + \sigma_\varepsilon^2}, \ldots, \frac{\lambda_K}{\lambda_K + \sigma_\varepsilon^2} \right) \qquad (25)$$

According to Theorem 2, if $\sigma_\varepsilon^2 = 0$, i.e., no noise is added, then P is exactly the matrix of the first K loading vectors of ordinary PCA. If $\sigma_\varepsilon^2 \ll \lambda_K$, then $\lambda_k / (\lambda_k + \sigma_\varepsilon^2) \approx 1$, so the latent variables will be almost the same as the PCA latent variables. For a datum x, the latent variable is $\mathbf{t} = \mathbf{P}^T\mathbf{x}$, and x can be projected onto the principal component subspace (PCS) and the residual subspace (RS), respectively:

$$\hat{\mathbf{x}} = \mathbf{Q}\mathbf{P}^T \mathbf{x} \in S_{\mathrm{PCS}} \qquad (26)$$

$$\tilde{\mathbf{x}} = \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right) \mathbf{x} \in S_{\mathrm{RS}} \qquad (27)$$

Since $\mathbf{Q}\mathbf{P}^T = \mathbf{Q}\,\mathrm{diag}\left( \frac{\lambda_1}{\lambda_1 + \sigma_\varepsilon^2}, \ldots, \frac{\lambda_K}{\lambda_K + \sigma_\varepsilon^2} \right)\mathbf{Q}^T$ and $\mathbf{q}_k = \mathbf{v}_k$, an input sample can be decomposed into

$$\mathbf{x} = \mathbf{Q}\mathbf{P}^T\mathbf{x} + \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right)\mathbf{x} = \mathbf{V}\,\mathrm{diag}\left( \frac{\lambda_1}{\lambda_1 + \sigma_\varepsilon^2}, \ldots, \frac{\lambda_K}{\lambda_K + \sigma_\varepsilon^2}, 0, \ldots, 0 \right)_{J \times J} \mathbf{V}^T\mathbf{x} + \mathbf{V}\,\mathrm{diag}\left( \frac{\sigma_\varepsilon^2}{\lambda_1 + \sigma_\varepsilon^2}, \ldots, \frac{\sigma_\varepsilon^2}{\lambda_K + \sigma_\varepsilon^2}, 1, \ldots, 1 \right)_{J \times J} \mathbf{V}^T\mathbf{x} \qquad (28)$$
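Theorem 2 gives the RPCA loadings in closed form; a sketch of that construction follows (function and variable names are hypothetical):

```python
import numpy as np

def rpca_loadings(X, K, sigma_eps2):
    """Closed-form RPCA of Theorem 2: q_k = v_k and
    p_k = lambda_k / (lambda_k + sigma_eps2) * v_k."""
    lam, V = np.linalg.eigh(X.T @ X)             # X assumed mean-centered
    order = np.argsort(lam)[::-1]                # eigenvalues in descending order
    lam, V = lam[order], V[:, order]
    Q = V[:, :K]                                 # reconstruction matrix (eq 24)
    P = Q * (lam[:K] / (lam[:K] + sigma_eps2))   # shrunken loadings (eq 25)
    return P, Q, lam

# sigma_eps2 = 0 recovers ordinary PCA, for which P equals Q
```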

The above result shows that, compared with traditional PCA, the RPCA model retains more information in the residual space, so its SPE statistic may be more sensitive. A detailed theoretical analysis of the statistics and a simple case study follow in the next two subsections.

4.2 Fault Detection Statistics. The RPCA-based monitoring method is similar to that used with PCA. A measure of the variation within the RPCA model is Hotelling's T² statistic, the sum of the normalized squared scores, defined as

$$T_R^2 = \mathbf{t}^T \boldsymbol{\Lambda}^{-1} \mathbf{t} \qquad (29)$$

where t is obtained from $\mathbf{t} = \mathbf{P}^T\mathbf{x}$ and $\boldsymbol{\Lambda}^{-1}$ is the diagonal matrix of the inverses of the eigenvalues associated with the retained PCs,

$$\boldsymbol{\Lambda} = \mathrm{diag}\left(\lambda_1, \lambda_2, \ldots, \lambda_K\right) \qquad (30)$$

The measure in the residual space is the SPE statistic defined in Section 3:

$$\mathrm{SPE}_R = \mathbf{x}^T \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right)^T \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right) \mathbf{x} \qquad (31)$$

Because of the different loading matrices, the T² and SPE statistics (Tong and Crowe, 1995) of traditional PCA and of robustly trained PCA differ as follows:

$$T_R^2 = \left(\mathbf{P}^T\mathbf{x}\right)^T \mathbf{D}^{-2}\,\mathbf{P}^T\mathbf{x} = \mathbf{x}^T \mathbf{V}\,\mathrm{diag}\left( \frac{\lambda_1}{\left(\lambda_1 + \sigma_\varepsilon^2\right)^2}, \ldots, \frac{\lambda_K}{\left(\lambda_K + \sigma_\varepsilon^2\right)^2}, 0, \ldots, 0 \right)_{J \times J} \mathbf{V}^T\mathbf{x} \qquad (32)$$

$$\mathrm{SPE}_R = \left( \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right)\mathbf{x} \right)^T \left(\mathbf{I} - \mathbf{Q}\mathbf{P}^T\right)\mathbf{x} = \mathbf{x}^T \mathbf{V}\,\mathrm{diag}\left( \left(\frac{\sigma_\varepsilon^2}{\lambda_1 + \sigma_\varepsilon^2}\right)^2, \ldots, \left(\frac{\sigma_\varepsilon^2}{\lambda_K + \sigma_\varepsilon^2}\right)^2, 1, \ldots, 1 \right)_{J \times J} \mathbf{V}^T\mathbf{x} \qquad (33)$$

Here T² and SPE denote the statistics of a normal sample x under ordinary PCA, while $T_R^2$ and $\mathrm{SPE}_R$ denote the statistics of the same datum under RPCA. Evidently, in RPCA-based fault detection $\mathrm{SPE}_R$ can also monitor the variation along the directions of the first K principal components. When $\sigma_\varepsilon^2 \ll \lambda_K$,

$$\frac{\lambda_k}{\left(\lambda_k + \sigma_\varepsilon^2\right)^2} \approx \frac{1}{\lambda_k} \qquad (34)$$

so $T_R^2$ will be close to T². If a fault f occurs in the process data x, the change of SPE under traditional PCA, $C_{\mathrm{PCA}}$, is

$$C_{\mathrm{PCA}} = \mathrm{SPE}_f - \mathrm{SPE} = \left(\mathbf{x} + \mathbf{f}\right)^T \mathbf{V}\,\mathrm{diag}\left(0, \ldots, 0, 1, \ldots, 1\right)_{J \times J} \mathbf{V}^T \left(\mathbf{x} + \mathbf{f}\right) - \mathbf{x}^T \mathbf{V}\,\mathrm{diag}\left(0, \ldots, 0, 1, \ldots, 1\right)_{J \times J} \mathbf{V}^T\mathbf{x} \qquad (35)$$

while the change of SPE under RPCA is

$$C_{\mathrm{RPCA}} = \mathrm{SPE}_{Rf} - \mathrm{SPE}_R = \left(\mathbf{x} + \mathbf{f}\right)^T \mathbf{V}\,\mathrm{diag}\left( \left(\frac{\sigma_\varepsilon^2}{\lambda_1 + \sigma_\varepsilon^2}\right)^2, \ldots, \left(\frac{\sigma_\varepsilon^2}{\lambda_K + \sigma_\varepsilon^2}\right)^2, 1, \ldots, 1 \right)_{J \times J} \mathbf{V}^T \left(\mathbf{x} + \mathbf{f}\right) - \mathbf{x}^T \mathbf{V}\,\mathrm{diag}\left( \left(\frac{\sigma_\varepsilon^2}{\lambda_1 + \sigma_\varepsilon^2}\right)^2, \ldots, \left(\frac{\sigma_\varepsilon^2}{\lambda_K + \sigma_\varepsilon^2}\right)^2, 1, \ldots, 1 \right)_{J \times J} \mathbf{V}^T\mathbf{x} \qquad (36)$$

Comparing $C_{\mathrm{RPCA}}$ and $C_{\mathrm{PCA}}$, it is evident that $C_{\mathrm{RPCA}} > C_{\mathrm{PCA}}$, so the SPE statistic of RPCA is more sensitive than that of ordinary PCA when a fault occurs, while the T² statistic performs almost the same as long as $\sigma_\varepsilon^2$ is small compared with $\lambda_K$. The simulation results of the following numerical example support this theoretical analysis.

4.3 Linear Numerical Example. For simplicity and clear illustration, only three variables are chosen for process modeling. The process is

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} t_1 + 5 t_2 \\ 2 t_1 + 4 t_2 \\ 3 t_1 + 2 t_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix} \qquad (37)$$

where $e_1$, $e_2$, and $e_3$ are independent, identically distributed Gaussian sequences with zero mean and variance 20, representing measurement uncertainty, and $t_1$ and $t_2$ are independent, identically distributed Gaussian sequences with zero mean and variance 100, representing the systematic latent variables. From this process, a data set X of 100 samples is generated to build the PCA and RPCA models. For online monitoring, another data set X_t containing 100 samples is also generated. In the scatter plot of Figure 4(a), X is represented by green circles, and the blue points are fault samples chosen from X_t; the test data lie far away from the training data set. After the PCA and RPCA models are trained, the 99% control limits of the T² statistic are plotted as ellipses in Figure 4(a). The blue solid line is the T² control limit of PCA, and the other lines are the RPCA control limits for different $\sigma_\varepsilon^2$. With $\sigma_\varepsilon^2 = 0.02$ (black line), i.e., an additive artificial noise variance of 0.02, the RPCA model detects the fault well with the T² statistic, whereas if $\sigma_\varepsilon^2$ is too large the fault cannot be detected, and if $\sigma_\varepsilon^2$ is too small the SPE will not be sensitive to the fault.
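A sketch of this example, combining the data generation of eq 37 with the closed-form loadings and the statistics of eqs 29 and 31 (the seed and the reuse of rpca_loadings from above are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 5.0],
              [2.0, 4.0],
              [3.0, 2.0]])                             # mixing matrix of eq 37

def generate(n):
    T = rng.normal(0.0, np.sqrt(100.0), size=(n, 2))   # latent t1, t2 (variance 100)
    E = rng.normal(0.0, np.sqrt(20.0), size=(n, 3))    # noise e1..e3 (variance 20)
    return T @ A.T + E

X = generate(100)                                      # training set
P, Q, lam = rpca_loadings(X - X.mean(0), K=2, sigma_eps2=0.02)

def statistics(x):
    """T_R^2 (eq 29) and SPE_R (eq 31) for one centered sample x."""
    t = P.T @ x
    T2 = t @ (t / lam[:2])                             # t^T Lambda^{-1} t
    e = x - Q @ t                                      # residual (I - Q P^T) x
    return T2, e @ e
```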

Figure 4. Monitoring results via PCA and RPCA: (a) T² statistic control limits; (b) SPE statistic.

Next, the SPE statistics of PCA and RPCA are depicted in Figure 4(b) by a black arrow and a red arrow, respectively. In particular, the red point and the green solid point are the reconstructions of the fault datum by PCA and by RPCA, respectively. Because the RPCA reconstruction (red line) is biased along the direction of the principal components, the SPE statistic of RPCA can detect a fault that occurs in that direction; in contrast, the SPE statistic of traditional PCA cannot detect this kind of fault. The monitoring results of the PCA and RPCA models are shown in Figure 5(a) and Figure 5(b). The T² performance of the two methods is almost the same when the artificial noise variance $\sigma_\varepsilon^2$ is 0.02, while the RPCA SPE performs better than the PCA SPE according to Figures 4(b) and 5(b), especially from sample 40 to 70.
Figure 5. Fault detection results with 99% control limits: (a) PCA; (b) RPCA.

5. Auto-encoder Model with Robust Training Methods

This section modifies the basic nonlinear method into a robust method and demonstrates its sensitivity for the nonlinear process monitoring problem with the auto-encoder (AE) model. The auto-encoder tries to learn the function r(x) ≈ x. The procedure is composed of two steps: an encoder and a decoder. The encoder maps the input data vector to a code vector that represents the input, and the decoder then uses this code vector to reconstruct the input vector with minimum error. Both the encoder and the decoder are artificial neural networks (ANNs), and the target output of the auto-encoder is the auto-encoder input itself. The structure of the basic auto-encoder is formulated as follows, with eq 38 the encoder and eq 39 the decoder:

$$\mathbf{t}_i = f\left(\mathbf{W}_1 \mathbf{x}_i + \mathbf{b}_1\right) \qquad (38)$$

$$r\left(\mathbf{x}_i\right) = g\left(\mathbf{W}_2 \mathbf{t}_i + \mathbf{b}_2\right) \qquad (39)$$

where $i = 1, 2, \ldots, I$ indexes the samples of the raw data, $\mathbf{x}_i \in \mathbb{R}^{J \times 1}$ is the $i$th sample vector, and $\mathbf{t}_i \in \mathbb{R}^{K \times 1}$ is the code, i.e., the features extracted from $\mathbf{x}_i$. $\mathbf{W}_1 \in \mathbb{R}^{K \times J}$ and $\mathbf{b}_1 \in \mathbb{R}^{K \times 1}$ are the weight matrix and bias between layer 1 (the input layer) and layer 2 (the hidden layer); $\mathbf{W}_2 \in \mathbb{R}^{J \times K}$ and $\mathbf{b}_2 \in \mathbb{R}^{J \times 1}$ are the weight matrix and bias between layer 2 and layer 3 (the output layer). If the encoder and decoder steps are both linear, i.e., $f(\cdot)$ is the identity function, the auto-encoder becomes an alternative version of PCA: a network with a central bottleneck of only K hidden units. If $f(\cdot)$ is the sigmoid function

$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (40)$$

the output range of the function is [0, 1], which produces the nonlinear transformation. Note that the conventional auto-encoder above can then only deal with inputs in the range [0, 1], usually regarded as binary variables (numbers in (0, 1) can be interpreted as probabilities). This setting is common in image recognition problems, but industrial process data generally do not follow it: they are real-valued after normalization. To deal with this, a linear decoder is used on top of the above auto-encoder for real-valued process data. In this paper, an auto-encoder with a sigmoid hidden layer and a linear output layer is used for the process monitoring problem (see Figure 6). It is formulated as

$$\mathbf{t}_i = \sigma\left(\mathbf{W}_1 \mathbf{x}_i + \mathbf{b}_1\right) \qquad (41)$$

$$r\left(\mathbf{x}_i\right) = \mathbf{W}_2 \mathbf{t}_i + \mathbf{b}_2 \qquad (42)$$
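A minimal PyTorch sketch of the network of eqs 41 and 42 (the layer sizes, borrowed from the TE case study later in the paper, are otherwise arbitrary):

```python
import torch
import torch.nn as nn

class LinearDecoderAE(nn.Module):
    """Sigmoid encoder (eq 41) with a linear output layer (eq 42),
    suitable for real-valued, normalized process data."""
    def __init__(self, n_vars=33, n_hidden=25):
        super().__init__()
        self.encoder = nn.Linear(n_vars, n_hidden)  # W1, b1
        self.decoder = nn.Linear(n_hidden, n_vars)  # W2, b2

    def forward(self, x):
        t = torch.sigmoid(self.encoder(x))          # t = sigma(W1 x + b1)
        return self.decoder(t)                      # r(x) = W2 t + b2
```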

Figure 6. Structure of the basic auto-encoder: (a) procedure; (b) structure.

5.1 Robust Auto-encoder (RAE) Model. Instead of training the nonlinear model directly on the normal process data, as in the traditional method, one can train the nonlinear model to reconstruct the normal data from noisy process data:

$$L = \frac{1}{I} \sum_{i=1}^{I} \left\| r\left(\mathbf{x}_i + \boldsymbol{\varepsilon}_i\right) - \mathbf{x}_i \right\|_2^2 \qquad (43)$$

This optimization is typically carried out by the back-propagation (BP) algorithm.14 The robust nonlinear training procedure can be summarized as: given an artificially noised input, make the corresponding output always fall back into the original region, i.e., enforce robustness to the inputs.
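A training-loop sketch for eq 43 follows (optimizer choice, learning rate, and epoch count are assumptions; only the noisy-input/clean-target pairing comes from the paper):

```python
import torch

def train_rae(model, X, sigma_eps, epochs=200, lr=1e-3):
    """Minimize eq 43: reconstruct the clean X from X + eps, eps ~ N(0, sigma_eps^2 I)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        eps = sigma_eps * torch.randn_like(X)  # fresh artificial corruption each pass
        loss = loss_fn(model(X + eps), X)      # noisy input, clean target
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```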

In the ideal setting, this objective can also be rewritten in the regularized, continuous form

$$L_{\mathrm{RAE}} = \int \left\| r(\mathbf{x} + \boldsymbol{\varepsilon}) - \mathbf{x} \right\|^2 p(\mathbf{x})\, d\mathbf{x} = \int \left( \left\| r(\mathbf{x}) - \mathbf{x} \right\|^2 + \sigma_\varepsilon^2 \left\| \frac{\partial r(\mathbf{x})}{\partial \mathbf{x}} \right\|_F^2 \right) p(\mathbf{x})\, d\mathbf{x} \qquad (44)$$

By the Euler–Lagrange equation, the optimal ideal result is

$$r^*(\mathbf{x}) = \mathbf{x} + \sigma_\varepsilon^2 \frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}} + o\left(\sigma_\varepsilon^2\right) \approx \mathbf{x} + \sigma_\varepsilon^2 \frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}} \qquad (45)$$

The inference uses the Euler–Lagrange equation in a way similar to the PCA case; the details can be found in Alain and Bengio.17 Assume that the training data sampled from the normal operating state can be expressed as the sum of a nonlinear function h and measurement noise ξ:

$$\mathbf{x} = \mathbf{h} + \boldsymbol{\xi} \qquad (46)$$

$$\boldsymbol{\xi} \sim N\left(0, \sigma_\xi^2 \mathbf{I}\right) \qquad (47)$$

The probability density function of x is then

$$p(\mathbf{x}) = \frac{1}{\left(2\pi\right)^{J/2} \left|\sigma_\xi^2 \mathbf{I}\right|^{1/2}} \exp\left( -\frac{1}{2} \left(\mathbf{x} - \mathbf{h}\right)^T \left(\sigma_\xi^2 \mathbf{I}\right)^{-1} \left(\mathbf{x} - \mathbf{h}\right) \right) \qquad (48)$$

Substituting eq 48 into eq 45 gives

$$\sigma_\varepsilon^2 \frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}} = \sigma_\varepsilon^2 \frac{\partial}{\partial \mathbf{x}}\left( -\frac{1}{2} \left(\mathbf{x} - \mathbf{h}\right)^T \left(\sigma_\xi^2 \mathbf{I}\right)^{-1} \left(\mathbf{x} - \mathbf{h}\right) \right) = -\frac{\sigma_\varepsilon^2}{\sigma_\xi^2} \left(\mathbf{x} - \mathbf{h}\right) = -\frac{\sigma_\varepsilon^2}{\sigma_\xi^2} \boldsymbol{\xi} \qquad (49)$$

From this result, for a normal data sample x,

$$\mathbf{x} = r^*(\mathbf{x}) - \sigma_\varepsilon^2 \frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}} = r^*(\mathbf{x}) + \frac{\sigma_\varepsilon^2}{\sigma_\xi^2} \boldsymbol{\xi} \qquad (50)$$

This shows that a proper $\sigma_\varepsilon^2$ can make the output $r^*(\mathbf{x})$ close to the unknown systematic part h.

5.2 Fault Detection Based on the RAE Model. As a nonlinear modeling method, the RAE can capture the nonlinear information in the training data. The measure in the residual space is the SPE statistic. For a normal datum x, the SPE of the RAE model is

$$\mathrm{SPE}_R = \left(r^*(\mathbf{x}) - \mathbf{x}\right)^T \left(r^*(\mathbf{x}) - \mathbf{x}\right) \qquad (51)$$

Using eq 50,

$$\mathrm{SPE}_R = \left(\frac{\sigma_\varepsilon^2}{\sigma_\xi^2}\right)^2 \boldsymbol{\xi}^T \boldsymbol{\xi} \qquad (52)$$

For a test datum with fault change f, the corresponding SPE is

24

Page 25 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

2

SPE f = ( r * ( x + f ) − x − f )

T

 2 ( r* (x + f ) − x − f ) =  σσ ε2  ( ξ + d )T ( ξ + d )  ξ 

(53)

In basic nonlinear the optimized output is approximately input. So r * ( x) ≈ x , so when a fault f happened to normal data x , the derivation of AE model SPE and RAE model SPE is as follows, respectively

C AE = SPE f − SPE ≈ 0 2

C RAE

(54) 2

σ 2  σ 2  T = SPERf − SPER =  ε2  ( ξ + f ) ( ξ + f ) −  ε2  ξT ξ σ  σ   ξ   ξ 

(55)

$C_{\mathrm{RAE}}$ is larger than $C_{\mathrm{AE}}$ when a proper $\sigma_\varepsilon^2$ is chosen.

5.3 Numerical Example. For simplicity and clear illustration, an example with three variables is chosen for process modeling. The following process is used to build the AE and RAE models:

$$x_1 = t + e_1, \quad x_2 = t^3 - 2t + e_2, \quad x_3 = 2t + e_3 \qquad (56)$$

where $t \in [-10, 10]$ and $e_1, e_2, e_3 \sim N(0, 0.2)$; this is clearly a nonlinear system. The AE and RAE models are built from 500 samples generated from this example, and the test data are a drift away from eq 56. Figure 7 shows the robustness test for different $\sigma_\varepsilon^2$, which is determined by trying different values during offline training. Figure 7 shows that with a proper $\sigma_\varepsilon^2$ ($\sigma_\varepsilon^2 = 0.2$) the test data are reconstructed back into the normal region, which means the SPE statistic can detect the fault well.
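A data-generation sketch for eq 56 (the seed and the normalization step are assumptions, since the paper does not specify them):

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.uniform(-10.0, 10.0, size=500)                 # t in [-10, 10]
E = rng.normal(0.0, np.sqrt(0.2), size=(500, 3))       # e1..e3 ~ N(0, 0.2)
X = np.column_stack([t, t**3 - 2.0 * t, 2.0 * t]) + E  # eq 56
X = (X - X.mean(0)) / X.std(0)                         # normalize before AE training
```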


Figure 7. RAE model reconstruction output with different noise levels.

6. Case Studies

6.1 Tennessee Eastman Process. The Tennessee Eastman (TE) process is a realistic industrial benchmark for comparing and evaluating the performance of various monitoring approaches. The TE benchmark is based on a simulation that contains five major unit operations: a reactor, a condenser, a compressor, a separator, and a stripper, as shown in Figure 8. In this paper, 33 variables are chosen for process monitoring. There are 21 different process faults, and they are tested with the AE model and the RAE model separately. Each fault test contains 960 samples, and all faults are introduced at sample 160.


Figure 8. TE benchmark process.

Table 2. Fault Detection Rate Comparison in TE Using AE and RAE

           AE       RAE      RAE      RAE      RAE
σ_ε²       0        0.08     0.12     0.16     0.2
Fault 3    0.076    0.102    0.136    0.181    0.210
Fault 4    0.911    0.972    0.996    1        1
Fault 11   0.670    0.716    0.782    0.830    0.853
Fault 15   0.088    0.122    0.175    0.240    0.276
Fault 21   0.498    0.530    0.560    0.607    0.628


Table 3. False Alarm Rate Comparison in TE

                   AE       RAE      RAE      RAE      RAE
σ_ε²               0        0.08     0.12     0.16     0.2
False alarm rate   0.0062   0.031    0.050    0.050    0.062

Table 2 shows that the basic AE method suffers from a fault detection rate so low that it cannot be used for fault detection directly. The proposed RAE method improves the fault detection rate by adding noises with different variances $\sigma_\varepsilon^2$, and the false alarm rates in Table 3 help determine a proper $\sigma_\varepsilon^2$ so that the false alarm rate does not become too high. After building the AE and RAE models for the TE process, the monitoring results are compared in Table 2, with the statistical confidence limits of both methods set to 99%. When training the RAE, artificial noise is added to the input data, and 25 hidden units are chosen according to a normality test of the reconstruction errors. From the fault detection rates in Table 2, one can see that RAE significantly outperforms AE, especially on faults 3, 4, 11, 15, and 21. The RAE approach therefore performs better than the AE approach and is more sensitive to small, slowly drifting faults.

6.2 CO2 Absorption Column. The CO2 absorption column is an important piece of processing equipment in the ammonia synthesis process, whose product NH3 is one of the main materials of the downstream urea synthesis process. In the ammonia synthesis process, one of the process materials is gaseous hydrogen, obtained from the methane decarburization unit. Nevertheless, after the decarburization unit, carbon still exists in the process gas in the form of gaseous CO2, which is useless for ammonia synthesis but exploitable in urea synthesis. Therefore, segregating the gaseous CO2 from the process gas, both to obtain pure material for the ammonia synthesis unit and to increase the CO2 supply of the urea synthesis process, is a particularly important step. The CO2 absorption and recovery unit is built after the decarburization unit in the ammonia synthesis process. The absorbent solution in the column is potassium carbonate, which reacts with CO2 to form potassium bicarbonate:

$$\mathrm{CO_2} + \mathrm{K_2CO_3} + \mathrm{H_2O} \rightleftharpoons 2\,\mathrm{KHCO_3} + Q$$

Through this reaction, most of the CO2 in the mixed process gas is absorbed, and only a very small residual amount of CO2 is transferred to the next unit with the feed gas. The recovery column is another key piece of equipment for CO2 recovery and barren-solution recycling: the column is heated by low-pressure steam, and the potassium bicarbonate decomposes according to

$$2\,\mathrm{KHCO_3} \xrightarrow{\Delta} \mathrm{K_2CO_3} + \mathrm{H_2O} + \mathrm{CO_2}\uparrow$$

The regenerated potassium carbonate solution is pumped back to the absorption column, and the recovered CO2 is transferred to the urea synthesis process as raw material after a series of separations. Detailed descriptions of the process variables are given in Table 4.

Table 4. Description of the Process Variables in the CO2 Absorption Column

Tag            Description
PC04011.PV     The pressure of process gas into 05E001
LC05020.PV     The liquid level of 05F003
TC05015.PV     The barren liquor's temperature at 05E003's exit
FC05015.PV     The flow of barren liquor to 05C001
FC05016.PV     The flow of half deficient liquor to 05C001
TI05016.PV     The temperature of process gas at 05F003's exit
PDR05016.PV    The differential pressure of process gas at 05C001's entrance
TI05018.PV     The temperature of rich liquor at 05C001's exit
LC05022.PV     The liquid level of 05C001
LA06001.PV     The high level alarming of 06F001
PC06001.PV     The pressure of process gas into unit 06
AR06001.PV     The concentration of residual CO2 in process gas

In total, 30000 samples were collected from the DCS of this process. To construct and validate the proposed fault detection models, 2000 of the samples are used for model training and the remaining ones for testing. At sample time 2801 of the testing data, a fault occurs. The RAE model is used for fault detection.

Figure 9. The process data, including training and testing data.


Figure 10. SPE statistic on the testing data using RAE.

Figure 10 shows the fault detection result on the testing data: the RAE detects the fault at sample time 2824. Table 5 gives the detailed fault detection rates using RAE with different noise levels. According to eq 53, the value of SPE_f reflects the true fault drift when $\sigma_\varepsilon^2 / \sigma_\xi^2$ equals 1 in the ideal situation, which means the artificial noise variance should be very close to the true noise variance of the data. Because the raw data are normalized, the noise variance is usually very small compared with the unit standard deviation, so variance values from 0.01 to 0.09 are examined. With increasing $\sigma_\varepsilon^2$, the fault detection rate grows, but the false alarm rate also increases. The simulation results show that when $\sigma_\varepsilon^2$ is 0.06, the model has good fault detection performance and an acceptable false alarm rate.
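Selecting $\sigma_\varepsilon^2$ this way amounts to a sweep over candidate variances; a sketch of that loop, reusing the spe and spe_control_limit helpers sketched in Section 3 (the function names and data splits here are assumptions):

```python
import numpy as np

def sweep_sigma(train_rae_at, X_train, X_normal, X_fault, grid):
    """Train one RAE per candidate noise variance and tabulate both rates."""
    rows = []
    for var in grid:                               # e.g. [0.0, 0.01, ..., 0.09]
        r_star = train_rae_at(X_train, np.sqrt(var))
        limit = spe_control_limit(spe(X_train, r_star))
        detection = np.mean(spe(X_fault, r_star) > limit)
        false_alarm = np.mean(spe(X_normal, r_star) > limit)
        rows.append((var, detection, false_alarm))
    return rows                                    # pick var balancing the two rates
```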


Table 5. Fault Detection and False Alarm Rates with Different σ_ε²

σ_ε²               0      0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
Fault 1            0.539  0.724  0.566  0.635  0.683  0.914  0.909  0.911  0.916  0.916
Fault 2            0.855  0.857  0.843  0.899  0.898  0.914  0.911  0.911  0.910  0.917
Fault 3            0.859  0.766  0.786  0.827  0.904  0.907  0.911  0.911  0.932  0.922
Fault 4            0.673  0.680  0.899  0.913  0.910  0.909  0.909  0.909  0.910  0.916
Fault 5            0.899  0.868  0.907  0.896  0.914  0.922  0.916  0.915  0.910  0.914
Fault 6            0.746  0.807  0.793  0.877  0.907  0.908  0.908  0.908  0.907  0.911
Average            0.762  0.784  0.790  0.841  0.870  0.912  0.910  0.911  0.914  0.916
False alarm rate   0.004  0.012  0.024  0.032  0.040  0.049  0.053  0.072  0.080  0.090

7. Conclusions

This paper has proposed a unified fault detection method based on RSS training. The unification rests on a common robust training framework for neural and statistical empirical modeling methods; the result is the RSS model, which includes Robust PCA and the RAE. The RSS model uses a common training procedure for modeling and fault detection. By adding Gaussian noise to the input training data, the RSS model can better detect small drifts in process monitoring. This property stems from the robust training and is derived by theoretical analysis. Based on the RSS model, a corresponding fault detection strategy is proposed, and the reconstruction errors are used to build the SPE monitoring statistic. The three case studies, comprising a nonlinear numerical process, the TE benchmark, and a real CO2 absorption column process, show that the robust training method avoids excessive false alarms compared with the basic method and is more sensitive to faults than the existing traditional methods.

Acknowledgment


This work was supported by the National Natural Science Foundation of China (NSFC) (61573308).

Appendix A

Proof of Theorem 1. The Taylor series expansion of $r(\mathbf{x} + \boldsymbol{\varepsilon})$ at x, for a small deviation ε, is

$$r_j(\mathbf{x} + \boldsymbol{\varepsilon}) = r_j(\mathbf{x}) + \sum_i \varepsilon_i \frac{\partial r_j(\mathbf{x})}{\partial x_i} + o\left(\sigma_\varepsilon^2\right) = r_j(\mathbf{x}) + \boldsymbol{\varepsilon}^T \frac{\partial r_j(\mathbf{x})}{\partial \mathbf{x}} + o\left(\sigma_\varepsilon^2\right)$$

Substituting this into eq 13, the objective function becomes

$$L = \int \left\| r(\mathbf{x} + \boldsymbol{\varepsilon}) - \mathbf{x} \right\|^2 p(\mathbf{x})\, d\mathbf{x} = \sum_{j=1}^{J} \int \left( r_j(\mathbf{x} + \boldsymbol{\varepsilon}) - x_j \right)^2 p(\mathbf{x})\, d\mathbf{x}$$

$$= \sum_{j=1}^{J} E\left[ \left( r_j(\mathbf{x}) - x_j \right)^2 + 2\,\boldsymbol{\varepsilon}^T \frac{\partial r_j(\mathbf{x})}{\partial \mathbf{x}} \left( r_j(\mathbf{x}) - x_j \right) + \left( \boldsymbol{\varepsilon}^T \frac{\partial r_j(\mathbf{x})}{\partial \mathbf{x}} \right)^2 \right] + o\left(\sigma_\varepsilon^2\right)$$

Because the artificial noise ε is independent of x and zero-mean, the cross term vanishes, and

$$L = \sum_{j=1}^{J} \left[ E\left( r_j(\mathbf{x}) - x_j \right)^2 + \mathrm{Tr}\left( E\left[ \frac{\partial r_j(\mathbf{x})}{\partial \mathbf{x}} \frac{\partial r_j(\mathbf{x})}{\partial \mathbf{x}}^T \right] E\left( \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T \right) \right) \right] + o\left(\sigma_\varepsilon^2\right)$$

Since $\boldsymbol{\varepsilon} \sim N(0, \sigma_\varepsilon^2 \mathbf{I})$, $E\left( \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T \right) = \sigma_\varepsilon^2 \mathbf{I}$, and the objective becomes the traditional squared reconstruction loss plus a contractive penalty on the model function:

$$L = E\left( \left\| \mathbf{x} - r(\mathbf{x}) \right\|^2 + \sigma_\varepsilon^2 \left\| \frac{\partial r(\mathbf{x})}{\partial \mathbf{x}^T} \right\|_F^2 \right) + o\left(\sigma_\varepsilon^2\right) = \int \left( \left\| r(\mathbf{x}) - \mathbf{x} \right\|^2 + \sigma_\varepsilon^2 \left\| \frac{\partial r(\mathbf{x})}{\partial \mathbf{x}^T} \right\|_F^2 \right) p(\mathbf{x})\, d\mathbf{x} + o\left(\sigma_\varepsilon^2\right)$$

Appendix B

Proof of Theorem 2. Using the Euler–Lagrange equation, the optimum satisfies

$$\mathbf{Q}\mathbf{P}^T\mathbf{x} = \sigma_\varepsilon^2 \mathbf{Q}\mathbf{P}^T \frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}} + \mathbf{x} = -\sigma_\varepsilon^2 \mathbf{Q}\mathbf{P}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \mathbf{x}$$

$$\Leftrightarrow \mathbf{Q}\mathbf{P}^T = \left( \mathbf{I} + \sigma_\varepsilon^2 \boldsymbol{\Sigma}^{-1} \right)^{-1} = \left( \mathbf{I} + \sigma_\varepsilon^2 \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \right)^{-1} = \left( \mathbf{I} + \sigma_\varepsilon^2 \left( \mathbf{V}\mathbf{D}^2\mathbf{V}^T \right)^{-1} \right)^{-1} = \mathbf{V}\,\mathbf{D}^2 \left( \mathbf{D}^2 + \sigma_\varepsilon^2 \mathbf{I} \right)^{-1} \mathbf{V}^T$$

where $\partial \log p(\mathbf{x}) / \partial \mathbf{x} = -\boldsymbol{\Sigma}^{-1}\mathbf{x}$, $\boldsymbol{\Sigma}$ is the covariance matrix, and $\boldsymbol{\Sigma} = \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$ is its eigendecomposition with $\mathbf{D}^2 = \mathrm{diag}\left(\lambda_1, \lambda_2, \ldots, \lambda_J\right)$, $\lambda_1 > \lambda_2 > \cdots > \lambda_J$. With the constraint $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}_K$, it follows that, after optimization,

$$\mathbf{q}_k = \mathbf{v}_k, \quad \mathbf{p}_k = \frac{\lambda_k}{\lambda_k + \sigma_\varepsilon^2}\,\mathbf{v}_k, \quad k = 1, \ldots, K$$

SUPPORTING INFORMATION

The details of the monitored variables and process faults used in the Tennessee Eastman (TE) process are given in the Supporting Information: Table S1 lists the 33 variables selected for monitoring the TE process, and Table S2 lists the 21 process faults used in this work.

References

(1) Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, S. N.; et al. A review of process fault detection and diagnosis: Part III: Process history based methods. Computers & Chemical Engineering 2003, 27 (3), 327-346.


(2) Ge, Z.; Song, Z.; Gao, F. Review of recent research on data-based process monitoring. Industrial & Engineering Chemistry Research 2013, 52 (10), 3543-3562.

(3) Qin, S. J. Survey on data-driven industrial process monitoring and diagnosis. Annual Reviews in Control 2012, 36 (2), 220-234.

(4) Lee, J. M.; Yoo, C. K.; Choi, S. W.; et al. Nonlinear process monitoring using kernel principal component analysis. Chemical Engineering Science 2004, 59 (1), 223-234.

(5) Ge, Z.; Zhang, M.; Song, Z. Nonlinear process monitoring based on linear subspace and Bayesian inference. Journal of Process Control 2010, 20 (5), 676-688.

(6) Kramer, M. A. Autoassociative neural networks. Computers & Chemical Engineering 1992, 16 (4), 313-328.

(7) Dong, D.; McAvoy, T. J. Nonlinear principal component analysis—based on principal curves and neural networks. Computers & Chemical Engineering 1996, 20 (1), 65-78.

(8) Hinton, G. E.; Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 2006, 313 (5786), 504-507.

(9) Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 2013, 35 (8), 1798-1828.

(10) Bakshi, B. R.; Utojo, U. Unification of neural and statistical modeling methods that combine inputs by linear projection. Computers & Chemical Engineering 1998, 22 (12), 1859-1878.


(11) Zhu, J.; Ge, Z.; Song, Z. Robust modeling of mixture probabilistic principal component analysis and process monitoring application. AIChE Journal 2014, 60 (6), 2143-2157.

(12) Ge, Z.; Yang, C.; Song, Z.; et al. Robust online monitoring for multimode processes based on nonlinear external analysis. Industrial & Engineering Chemistry Research 2008, 47 (14), 4775-4783.

(13) Ballard, D. H. Modular learning in neural networks. AAAI 1987, 279-284.

(14) Cottrell, G. W.; Munro, P.; Zipser, D. Learning internal representations from gray-scale images: An example of extensional programming. Ninth Annual Conference of the Cognitive Science Society 1987, 462-473.

(15) Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer Series in Statistics; Springer: Berlin, 2001.

(16) Bishop, C. M. Neural Networks for Pattern Recognition; Oxford University Press, 1995.

(17) Bengio, Y.; Yao, L.; Alain, G.; et al. Generalized denoising auto-encoders as generative models. Advances in Neural Information Processing Systems 2013, 899-907.
