Process Knowledge Discovery Using Sparse Principal Component Analysis

Oct 26, 2016 - School of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029 China. ‡ Department of Chemi...

Article

Process Knowledge Discovery Using Sparse Principal Component Analysis Huihui Gao, Shriram Gajjar, Murat Kulahci, Qun-Xiong Zhu, and Ahmet Palazoglu Ind. Eng. Chem. Res., Just Accepted Manuscript • DOI: 10.1021/acs.iecr.6b03045 • Publication Date (Web): 26 Oct 2016 Downloaded from http://pubs.acs.org on November 6, 2016


Industrial & Engineering Chemistry Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Industrial & Engineering Chemistry Research

Process Knowledge Discovery Using Sparse Principal Component Analysis

Huihui Gao 1, Shriram Gajjar 2, Murat Kulahci 3,4, Qunxiong Zhu 1, Ahmet Palazoglu 2*

1 School of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China

2 Department of Chemical Engineering, University of California, Davis, CA 95616, USA

3 Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark

4 Department of Business Administration, Technology and Social Sciences, Luleå University of Technology, Luleå, Sweden

* Corresponding author. Tel.: +1 530-752-8774. E-mail address: [email protected] (Ahmet Palazoglu).

Abstract: As the goals of ensuring process safety and energy efficiency become ever more challenging, engineers increasingly rely on data collected from such processes for informed decision making. During recent decades, extracting and interpreting valuable process information from large historical datasets has been an active area of research. Among the methods used, principal component analysis (PCA) is a well-established technique that allows for dimensionality reduction for large datasets by finding new uncorrelated variables, namely principal components (PCs). However, it is difficult to interpret the derived PCs, as each PC is a linear combination of all of the original variables and the loadings are typically nonzero. Sparse principal component analysis (SPCA) is a relatively recent technique proposed for producing PCs with sparse loadings via the variance-sparsity trade-off. We propose a forward SPCA approach that helps uncover the underlying process knowledge regarding variable relations. This approach systematically determines the optimal sparse loadings for each sparse PC while improving interpretability and minimizing information loss. The salient features of the proposed approach are demonstrated through the Tennessee Eastman process simulation. The results indicate how knowledge and process insight can be discovered through a systematic analysis of sparse loadings.

Keywords: Principal Component Analysis (PCA); Sparse Principal Component Analysis (SPCA); Process Knowledge Discovery; Tennessee Eastman Process


1. INTRODUCTION

The term knowledge discovery appeared around 1989 and is attributed to Frawley, Piatetsky-Shapiro and Matheus 1-3. According to Frawley et al. 2, knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The ability to exploit variable measurements to extract systemic information has always been a crucial component in process knowledge discovery. Historically, in process industries, data were collected for a few critical process variables and/or pieces of equipment and then used for troubleshooting and process monitoring. That practice has now changed substantively, driven by advancements in computing and sensing technologies. It is now common to have an archival history of thousands of sensors sampled every second over long periods of time. Decision making increasingly relies on data that arrive at overwhelming speed and volume. Without the aid of computing methodologies, it would not be feasible to examine the plethora of data generated today, let alone use it to generate useful insight. It is often said that “We are drowning in data but starved for knowledge.”

Meglen 3 described an exploratory data analysis procedure to uncover three main aspects of data: anomalous samples or measurements, significant relationships among the measured variables, and significant relationships or groupings among the samples. The primary tools used in this approach were factor analysis, PCA, and cluster analysis. Cios et al. 4 further mentioned the use of clustering and regression models as mechanisms to reveal structure from data. Sebzalli and Wang 5 used PCA and fuzzy c-means clustering to identify operational spaces and to develop strategies for manufacturing the desired products for a fluid catalytic cracking process. Using the data analysis approach, they were able to discover distinct operational zones that correspond to producing on- and off-specification products. Sebzalli and Wang 5 offered variable contribution plots as a way of identifying the most influential variables for each operating condition. They demonstrated that the knowledge discovered by using the PCA and fuzzy clustering approach can aid in developing more effective operational strategies for monitoring and rapid product changeover events.

Over the last decade, the field of multivariate statistics has focused on developing methods whereby data collected from many sensors are combined with process information, such as physical connectivity of process units, to offer a holistic picture of a large-scale process plant. Principal component analysis (PCA) is the most commonly used multivariate technique, with applications ranging from feature extraction to data dimension reduction to clustering 6-7. In its simplest form, raw data can be measured in two dimensions: number of samples (n) and number of variables (p), where both n and p can be large. PCA extracts the essential information from the p variables of the original dataset into k retained principal components (PCs). In most conventional settings, the data may be high-dimensional but the underlying signal would have a low-dimensional structure, underpinned by a limited number of physical phenomena that govern the system evolution. Thus, k is often much smaller than p. By choosing k features, irrelevant and redundant features are removed. Indeed, it is this conciseness that facilitates the comprehensibility and interpretability of the data. Wang et al. 8 performed PCA to obtain the structure-toxicity relationship for a panel of nanoparticles. They used PCA to process different acute toxicity measures that classified the particles and identified materials with an acute toxicity profile. They further demonstrated that PCA and contribution plot analysis can be implemented to identify the structural properties that could determine the acute cytotoxicity of the materials. Helena et al. 9 investigated the evolution of the groundwater composition using PCA and varimax rotation 10. Exploring the data in this manner allowed them to uncover strong associations between some variables as well as a lack of association between others. However, in PCA all p variables have non-zero loadings on the derived PCs. This, in turn, confounds the interpretation of PCs, especially when the dimension p is large.

A number of researchers have proposed approaches to improve interpretability in the PCA setting 11-15. Sparse principal component analysis (SPCA) is a relatively recent technique proposed for producing PCs with sparse loadings via the variance-sparsity trade-off. Several methodologies have been proposed in the literature to obtain sparse loadings 16-23. Trendafilov 24 and Jolliffe et al. 25 provide a review of the main approaches and recent developments for improving the interpretation of results obtained from PCA applications. Zou et al. 17 put forward a strategy to obtain sparse loadings by reformulating PCA as a regression problem and imposing LASSO (elastic net) constraints on the L1 norm of the regression coefficients (sparse loadings). This methodology, known as sparse principal component analysis (SPCA), has several advantages: it solves the optimization problem efficiently, at the cost of a single least-squares fit; it can be applied when p is much larger than the sample size (n); and the desired number of non-zero loadings (NNZL) can be specified independently for each component.

None of the studies mentioned above uses a dimension reduction technique to produce sparse loadings and thereby discover process knowledge, especially the relationships among process variables. SPCA offers an opportunity to establish this link between dimension reduction and knowledge discovery. In this paper, an efficient forward SPCA algorithm is proposed to generate the optimal sparse loadings, and a systematic scheme for process knowledge discovery based on the proposed forward SPCA method is presented. We regard this approach as a complementary tool to all of the methodologies that take advantage of correlation and regression methods as well as clustering techniques.

The remainder of this paper is organized as follows: Section 2 briefly reviews PCA and SPCA methods. Section 3 provides a detailed description of the proposed forward SPCA method to construct the SPCA model. Section 4 discusses the application of the method to a well-known chemical process simulation to verify the existing knowledge on the process by systematically discovering variable relationships. Section 6 offers conclusions.


2. PRELIMINARIES

This section briefly reviews the PCA and SPCA methods that form the backbone of the proposed knowledge discovery method.

2.1 Principal Component Analysis (PCA)

PCA is a classical dimensionality reduction method that projects the original data onto a lower-dimensional subspace by maximizing the variability explained by the reduced-order model. Consider the original data matrix $X = [X_1, X_2, \ldots, X_p] \in \mathbb{R}^{n \times p}$, where n denotes the number of samples and p denotes the number of process variables. Without loss of generality, each column in the data matrix is scaled to zero mean and unit variance. Here PCA is carried out using the singular value decomposition (SVD) of the data matrix. Let the SVD of X be

$$X = U D V^T \qquad (1)$$

The columns of $Z = UD$ are the PCs, and the columns of V are the corresponding loadings of the PCs. The sample variance of the ith PC is $d_i = D_{ii}^2 / n$. A key property of the PC directions is that they are orthogonal, so each explains unique features in the dataset. Usually the first k PCs are chosen to represent the variability in the p variables, where k is determined based on, among others, the widely accepted cumulative percent variance (CPV) approach to capture at least 85% of normal variability:

$$\frac{\sum_{i=1}^{k} d_i}{\sum_{i=1}^{p} d_i} \times 100\% \geq 85\% \qquad (2)$$
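The CPV rule of Eq. (2) can be illustrated with a minimal sketch. It assumes the singular values $D_{ii}$ have already been obtained from an SVD routine and are sorted in decreasing order; the function name `choose_k_by_cpv` and its arguments are hypothetical and are not part of the original work.

```python
from typing import Sequence


def choose_k_by_cpv(singular_values: Sequence[float], n: int,
                    threshold: float = 85.0) -> int:
    """Return the smallest k whose cumulative percent variance reaches threshold."""
    # Sample variance of each PC, d_i = D_ii^2 / n (see the text above Eq. 2).
    variances = [s * s / n for s in singular_values]
    total = sum(variances)
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, d in enumerate(variances, start=1):
        cumulative += d
        # Left-hand side of Eq. (2) for the first k components.
        if cumulative / total * 100.0 >= threshold:
            return k
    return len(variances)
```

For example, with singular values 3, 2 and 1 and n = 10, the first PC explains about 64% of the variance and the first two about 93%, so the 85% rule retains two components.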

In this manner, dimensionality reduction is achieved by identifying k unique patterns that underlie the dataset. The remaining dimensions are expected to be associated with unstructured (random) noise captured during the data collection process. Once k is determined, the original data matrix can be projected onto the loadings. Each PC then becomes a linear combination of all of the original p variables with varying loading magnitudes (and directions) to reflect the influence of that variable on the specific feature captured by that PC direction.

2.2 Sparse Principal Component Analysis (SPCA)

One of the drawbacks of PCA is that, as noted, all elements of the loading matrix V are typically nonzero which makes it often difficult to interpret the derived PCs and may confound the discovery of key patterns and trends in datasets. SPCA is introduced to reduce the number of variables that explicitly have non-zero loadings, thereby creating a sparse V matrix. To perform SPCA, PCA is recast as a regression-type optimization problem with a quadratic penalty; the LASSO penalty can then be directly integrated into the regression criterion, resulting in a modified PCA with sparse loadings.


Considering the first k PCs, let $x_i$ denote the ith row vector of the original data matrix X, $A_{p \times k} = [\alpha_1, \alpha_2, \ldots, \alpha_k]$ and $B_{p \times k} = [\beta_1, \beta_2, \ldots, \beta_k]$. The following optimization problem is solved to obtain the sparse principal components (SPCs):

$$(\hat{A}, \hat{B}) = \arg\min_{A, B} \sum_{i=1}^{n} \left\| x_i - A B^T x_i \right\|^2 + \lambda \sum_{j=1}^{k} \left\| \beta_j \right\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \left\| \beta_j \right\|_1 \qquad (3)$$

subject to $A^T A = I_{k \times k}$,

where $\lambda > 0$ and different $\lambda_{1,j}$ are allowed for penalizing the loadings of different PCs.

Let $\hat{V} = [\hat{V}_1, \ldots, \hat{V}_k]$ be the modified PCs, also called SPCs. Since the columns of $\hat{V}$ are correlated, $\mathrm{tr}(\hat{V}^T \hat{V})$ is too optimistic to represent the total variance 17, 24. Using the QR decomposition, we can easily obtain the adjusted variance by taking into account the correlations among the columns of $\hat{V}$. Suppose $\hat{V} = QR$, where Q is orthonormal and R is upper triangular; then the adjusted variance of the SPCs is denoted as

$$SV = \frac{\left[ \mathrm{diag}(\mathrm{qr}(X \hat{V})) \right]^2}{n} \qquad (4)$$

The percentage of variance explained (PVE) of the SPCs is denoted as

$$PVE = \frac{SV}{\mathrm{sum}(SV)} \times 100\% \qquad (5)$$

The CPV of the SPCs is equal to

$$\hat{E} = \frac{\mathrm{sum}(SV)}{p} \times 100\% \qquad (6)$$

Based on the above description, we can now denote the general SPCA algorithm 17 as

$$[\hat{V}, \hat{E}] = \mathrm{SPCA}(X, N, k) \qquad (7)$$

where X is the data matrix, k is the number of chosen PCs, $N = [N_1, N_2, \ldots, N_k]$ $(1 \leq N_i \leq p)$ is the desired NNZL in each SPC, $\hat{V} = [\hat{V}_1, \hat{V}_2, \ldots, \hat{V}_k]$ contains the sparse loading vectors and $\hat{E}$ is the CPV of the SPCs.
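As an illustration of Eqs. (4) to (6), the sketch below computes the adjusted variance, PVE and CPV from a set of score columns. It is only a sketch under stated assumptions: Eq. (4) is read as the squared diagonal of R from the QR factorization of the score matrix $X\hat{V}$, the QR step is replaced by an equivalent modified Gram-Schmidt pass in plain Python, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def adjusted_variances(score_columns: Sequence[Sequence[float]]) -> List[float]:
    """Adjusted variance of each SPC: R_jj^2 / n from a Gram-Schmidt QR (Eq. 4)."""
    n = len(score_columns[0])
    basis: List[List[float]] = []  # orthonormal directions found so far (columns of Q)
    result: List[float] = []
    for column in score_columns:
        residual = list(column)
        # Remove the part of this SPC already explained by the earlier SPCs.
        for q in basis:
            proj = sum(r * qi for r, qi in zip(residual, q))
            residual = [r - proj * qi for r, qi in zip(residual, q)]
        norm = math.sqrt(sum(r * r for r in residual))  # |R_jj|
        result.append(norm * norm / n)
        if norm > 1e-12:  # skip directions that add nothing new
            basis.append([r / norm for r in residual])
    return result


def pve(sv: Sequence[float]) -> List[float]:
    """Percentage of variance explained by each SPC (Eq. 5)."""
    total = sum(sv)
    return [v / total * 100.0 for v in sv]


def cpv(sv: Sequence[float], p: int) -> float:
    """Cumulative percent variance of the SPCs for p scaled variables (Eq. 6)."""
    return sum(sv) / p * 100.0
```

Each score column is passed as a list of n values, one list per SPC. Because later columns are first stripped of what earlier columns already explain, correlated SPCs are not credited twice for the same variance.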

3. PROCESS KNOWLEDGE DISCOVERY METHOD

This section provides a detailed description of the proposed scheme to establish the optimum SPCA model for process knowledge discovery. An overview of the steps is depicted in Figure 1.

3.1 Determination of original loadings and number of PCs

Without loss of generality, the data matrix is first scaled to zero mean and unit variance. Then the original loadings V of the PCs and the number of PCs k are calculated through the SVD of X and the CPV method, respectively. The detailed procedure is explained in Section 2.1.

3.2 Determination of sparse loadings based on forward SPCA

Once k and V are obtained, the optimum sparse loadings $\hat{V}$ are sought by considering the trade-off between sparsity and CPV explained. In this process, the determination of N, the NNZL on each SPC, is critical. As a heuristic for automated industrial process systems, one can consider the basic pairwise causality between a manipulated variable and its corresponding controlled variable. In most cases this would be the minimal expected relationship among process variables. The forward SPCA approach is initialized by this heuristic rule to find the optimum NNZL for each SPC sequentially.

Figure 1. A schematic of the forward SPCA method for process knowledge discovery. (Flowchart stages: data normalization of the normal data matrix X; PCA via the SVD of X with k determined by the 85% CPV rule; forward SPCA initialized at N(0) = [2, 2, ..., 2] and iterated over the SPCs until $\hat{E}(i) - \hat{E}(i-1) < \eta$ or $i > k$; discovery of process knowledge from the optimum sparse loadings.)

As noted, $N(0) = [N_1^0, N_2^0, \ldots, N_k^0] = [2, 2, \ldots, 2]$ is set as the initial basis. Second, the optimal NNZL for the first SPC, $N_1^{opt}$, is found by maximizing the CPV while fixing the NNZL of the other (k-1) SPCs. Third, the optimal NNZL for the second SPC, $N_2^{opt}$, is found in the same way, again fixing the NNZL of the other (k-1) SPCs. These steps are repeated until the difference between the new CPV and the old CPV falls below a pre-defined limit $\eta$, indicating that decreasing sparsity yields no further improvement in the variance explained. Algorithm 1 provides the detailed steps.

Algorithm 1. Forward SPCA Algorithm

1. Initialize N as $N(0) = [N_1^0, N_2^0, \ldots, N_k^0] = [2, 2, \ldots, 2]$, the NNZL in each SPC. Calculate the initial sparse loadings and initial CPV using the general SPCA algorithm in Section 2.2: $[\hat{V}(0), \hat{E}(0)] = \mathrm{SPCA}(X, N(0), k)$. Set $i = 1$, denoting the first SPC.

2. Search the optimal loadings of the ith SPC: $N_i^{opt} = \arg\max_{N_i} \left[ \hat{E}(i) = \mathrm{SPCA}(X, N(i), k) \right]$. Then update $N(i) = [N_1^{opt}, \ldots, N_i^{opt}, N_{i+1}, \ldots, N_k]$, $[\hat{V}(i), \hat{E}(i)] = \mathrm{SPCA}(X, N(i), k)$ and $i = i + 1$.

3. Check if $\hat{E}(i) - \hat{E}(i-1) < \eta$ or $i > k$; if so, go to step 4, otherwise go to step 2.

4. Determine the final sparse loadings: $N^{opt} = N(i-1)$, $[\hat{V}_{opt}, \hat{E}_{opt}] = \mathrm{SPCA}(X, N^{opt}, k)$.

The goal of forward SPCA is to determine if significant information gain can be realized by systematically sacrificing sparsity. With this simple but efficient forward SPCA method, only the first few SPCs acquire additional non-zero loadings, which achieves the desired balance between sparsity and information gain. Furthermore, one can readily distinguish and grasp the dominant information patterns captured by each SPC through the change of the loadings in each search step. We would like to emphasize the following:

Remark 1. The stopping criterion ($\eta$) does influence the determination of the sparse loadings. The smaller it is, the more non-zero loadings will be added to the SPCs, i.e., the loadings of the SPCs will be less sparse. In general, $\eta$ should be set so as to balance sparsity and variability. In this paper, $\eta$ is taken as a data-driven optimization heuristic capable of efficiently balancing information loss and sparsity.

Remark 2. We note that the SPCA algorithm is computationally more burdensome due to its iterative nature. However, in this work, the proposed optimization is carried out off-line on historical datasets, so the computational time needed to complete the iterations is of little practical concern.
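The search described in Algorithm 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the SPCA solver is abstracted as a hypothetical callable `cpv_of(N)` that returns the CPV for a given NNZL vector N, the candidate NNZL values for each SPC are assumed to range from 2 to p, ties are assumed to be broken in favor of the sparser choice, and the stopping check is interpreted as comparing the CPV after the current step with the CPV before it, keeping the previous N when the gain falls below $\eta$.

```python
from typing import Callable, List, Sequence, Tuple


def forward_spca(cpv_of: Callable[[Sequence[int]], float],
                 p: int, k: int, eta: float) -> Tuple[List[int], float]:
    """Greedy search over the NNZL of each SPC, one SPC at a time."""
    n_current = [2] * k              # step 1: N(0) = [2, 2, ..., 2]
    e_current = cpv_of(n_current)    # initial CPV
    for i in range(k):               # step 2, applied to SPC i+1
        best_n, best_e = n_current[i], e_current
        for candidate in range(2, p + 1):
            trial = list(n_current)
            trial[i] = candidate     # all other SPCs keep their NNZL
            e_trial = cpv_of(trial)
            if e_trial > best_e:     # strict '>' keeps the sparser choice on ties
                best_n, best_e = candidate, e_trial
        # Step 3: stop when the CPV gain is below eta and keep the previous N.
        if best_e - e_current < eta:
            break
        n_current[i] = best_n
        e_current = best_e
    return n_current, e_current      # step 4: N_opt and its CPV
```

In practice `cpv_of` would wrap an SPCA routine together with the adjusted-variance calculation of Section 2.2; any such wrapper is an assumption of this sketch.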

3.3 Process knowledge discovery by interpreting the sparse loadings

Once the optimum sparse loadings of the SPCs are obtained, one can extract valuable process knowledge by attempting to interpret them. It is noted that the process variables having non-zero loadings are associated with a major unit operation of the process. The dominant process variables having relatively high loadings on one SPC are most likely to be strongly correlated due to their inherent operational characteristics, being part of a control loop or having actual physical connections through the laws of conservation and thermodynamics. On the other hand, the process variables having relatively small loadings on one SPC would have weak correlations with the other variables. In particular, when only two variables load on one SPC, we argue that the SPC would most likely capture the causal relationships in the process.

4. APPLICATION OF THE PROPOSED METHOD

The method is illustrated through the benchmark Tennessee Eastman process. This process is reasonably well-known and sufficiently transparent to help validate the inferences made by the interpretation of the SPC loadings. The case study shows that fundamental process knowledge can indeed be discovered using the proposed approach.

4.1 Tennessee Eastman (TE) process

The TE process has five major unit operations: an exothermic reactor, a product condenser, a vapor-liquid separator, a recycle compressor and a reboiled product stripper. The flow diagram is shown in Figure 2. There are 22 continuous process measurements, 12 manipulated variables and 19 composition measurements. Four gaseous reactants A, C, D, E and inert B are fed to the reactor, where they react to form two liquid products G, H and one byproduct F. Detailed descriptions of this process can be found in the work of Downs and Vogel 26. A total of 33 variables, consisting of 22 continuous process measurements and 11 manipulated variables, are selected in this study, as listed in Table 1. For evaluation and comparison purposes, 960 normal samples with a sampling interval of 3 min are used to build the SPCA model. The simulation data can be downloaded from http://web.mit.edu/braatzgroup/links.html.

Figure 2. Process flow diagram of Tennessee Eastman process. (22 continuous measurements are in bold red, 11 manipulated variables are in bold blue)

Table 1. Variable designations and labels in the TE process.

| Variable Type | Variable Number | Variable Description |
|---|---|---|
| 22 Continuous Measurements | 1 (F1) | A feed |
| | 2 (F2) | D feed |
| | 3 (F3) | E feed |
| | 4 (F4) | A and C feed |
| | 5 (F5) | Recycle flow |
| | 6 (F6) | Reactor feed rate |
| | 7 (P7) | Reactor pressure |
| | 8 (L8) | Reactor level |
| | 9 (T9) | Reactor temperature |
| | 10 (F10) | Purge rate |
| | 11 (T11) | Separator temperature |
| | 12 (L12) | Separator level |
| | 13 (P13) | Separator pressure |
| | 14 (F14) | Separator underflow |
| | 15 (L15) | Stripper level |
| | 16 (P16) | Stripper pressure |
| | 17 (F17) | Stripper underflow |
| | 18 (T18) | Stripper temperature |
| | 19 (F19) | Stripper steam flow |
| | 20 (J20) | Compressor work |
| | 21 (T21) | Reactor cooling water outlet T |
| | 22 (T22) | Condenser cooling water outlet T |
| 11 Manipulated Variables | 23 (FC1) | D feed flow valve |
| | 24 (FC2) | E feed flow valve |
| | 25 (FC3) | A feed flow valve |
| | 26 (FC4) | A and C feed flow valve |
| | 27 (FC5) | Compressor recycle valve |
| | 28 (FC6) | Purge valve |
| | 29 (LC7) | Separator pot liquid flow |
| | 30 (LC8) | Stripper liquid product flow |
| | 31 (FC9) | Stripper steam valve |
| | 32 (TC10) | Reactor cooling water flow |
| | 33 (FC11) | Condenser cooling water flow |

4.2 SPCA model of TE process

As discussed in Section 3, the number of PCs retained would be k = 14 if one desires to explain ~85% of the CPV 27. The number of SPCs is also selected as 14 as a reasonable heuristic, although one can certainly experiment with fewer or more SPCs in the discovery model. For this example, a brute-force search to find the optimal sparse loadings of the SPCs would require a total of $33^{14}$ iterations, making this approach not only computationally expensive but also impractical and uninformative. By using the forward SPCA algorithm, the desirable and optimum sparse loadings can be obtained systematically and relatively quickly. Moreover, the forward SPCA algorithm sequentially adds relevant non-zero loadings onto the SPCs, allowing the user to extract more insight from the knowledge of which variables load as a priority in the next iterative step.

For the purposes of demonstration, the stopping criterion is set as $\eta = 0.005$. The search process is depicted in Figure 3. The PVE of each SPC in each iteration is shown in Figure 4. As the search step is repeated, the amount of captured variance increases monotonically. It is observed that from the base case to the first iteration, the CPV explained is raised from 62.0% to 72.8% by adding nine non-zero loadings to SPC1. The PVE of SPC1 more than doubles from 6.0% to 15.5%. Apart from SPC1, the PVE values of SPC2, SPC3 and SPC11 also get larger. From the first iteration to the second iteration, the CPV explained increases from 72.8% to 77.0% by adding eleven non-zero loadings to SPC2. The PVE of SPC2 increases from 4.4% to 8.7%, while the PVE of SPC3 decreases and the PVE of SPC7 increases slightly. The PVE values of the remaining 11 SPCs stay unchanged. From the second iteration to the third iteration, the CPV explained increases from 77.0% to 79.8% by adding eight non-zero loadings to SPC3. The PVE of SPC3 increases from 4.8% to 7.6%. The PVE values of the other thirteen SPCs remain unchanged. In the fourth iteration, the CPV explained is almost the same as that of the third iteration. That is to say, the captured variance does not increase significantly when sparsity is sacrificed further. In this respect, the algorithm is declared as converged.

The NNZL in the 14 SPCs for the final solution is $N^{opt} = [11, 13, 10, 2, \ldots, 2]$, and the optimum CPV is $\hat{E}_{opt} = 79.8\%$. The absolute values of the sparse loadings of the 14 SPCs in the four cases are shown in Figure 5. It is observed that only the first three SPCs (SPC1-SPC3) have at least ten but fewer than fifteen non-zero loadings, while the other 11 SPCs (SPC4-SPC14) have only two non-zero loadings.
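To put the brute-force count quoted above in perspective, the short sketch below compares it with an upper bound on the number of SPCA fits in a forward search. The forward-search bound is an assumption made for illustration only: it supposes that every SPC is scanned over all candidate NNZL values from 2 to p, plus one initial fit, and the function names are hypothetical.

```python
def brute_force_evaluations(p: int, k: int) -> int:
    """Every combination of NNZL (1..p) across k SPCs."""
    return p ** k


def forward_search_upper_bound(p: int, k: int, n_min: int = 2) -> int:
    """One initial fit plus a full scan of n_min..p for each of the k SPCs."""
    return 1 + k * (p - n_min + 1)


if __name__ == "__main__":
    # TE case study dimensions from the text: p = 33 variables, k = 14 SPCs.
    print(brute_force_evaluations(33, 14))
    print(forward_search_upper_bound(33, 14))
```

For p = 33 and k = 14, the brute-force count is on the order of $10^{21}$, whereas the assumed forward scan needs at most a few hundred SPCA fits, and fewer still when the stopping criterion is met early.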

The loadings of the first three dominant PCs obtained from PCA and from forward SPCA are presented in Figures 6 and 7, respectively. It can be observed that the forward SPCA approach yields a clear and sparse representation of each PC. Although the captured variance of forward SPCA is ~5% less than that of PCA, the NNZL in each SPC is significantly lower than that of the corresponding PC. This demonstrates that forward SPCA is able to balance variability and sparsity to obtain the desired solution.

Figure 3. The optimum search process for sparse loadings.


Figure 4. The percentage variance explained (PVE) by each SPC in base case and three iterations.


Figure 5. The absolute value of sparse loadings of 14 SPCs in the four different cases. (Each row denotes an SPC and each column denotes a variable. The color and size of the circles indicate the strength of the loading.)


Figure 6. The loadings of the first three PCs of PCA model for TE process.


Figure 7. The sparse loadings of the first three SPCs of SPCA model for TE process.


4.3 Process knowledge discovery based on sparse loadings

This section illustrates how valuable process knowledge can be discovered by interpreting the sparse loadings. Table 2 summarizes the results of the internal and external relations revealed from the 14 SPCs during the iterative search. An internal relation is defined as the relation between or among variables that are within the same operation unit/stream whereas an external relation is regarded as a relation between or among variables that belong to different operation units/streams.

Table 2. The internal and external relations revealed from SPCs in four cases (variables highlighted in red denote external relations).

Columns: Operation unit; Variable No.; Variable description; then, for each of the four cases (Base case, 1st iteration, 2nd iteration, 3rd iteration): Possible causality and SPC No.

(F1,FC3)

7

(F1,FC3)

7

(F1,FC3)

7

(F1,FC3)

7

(F2, FC1)

2

(F2, T21)

2

(F2, F14)

13

(F2, FC1)

2&3

(F2, F14)

13 (F3, FC2)

11

(F3, FC2)

11 (F14, FC2)

13

No. 1 (F1)

A feed

A feed 25 (FC3) A feed flow valve 2 (F2)

D feed

D feed 23 (FC1) D feed flow valve 3 (F3)

E feed

E feed

(F14, FC2)

13

(F3, FC2)

11

24 (FC2) E feed flow valve A and C 4 (F4) feed

A and C feed (F4, FC4)

10&11

(F4, FC4)

10

(F4, FC4)

10

(F4, FC4)

10

(F6, L8)

14

(F6, T21)

14

(F6, T21)

14

(F6, L8)

14

26 (FC4) A and C feed flow valve 6 (F6)

Reactor feed rate

(P7, P13, P16, 7 (P7)

Reactor pressure

(P7, P13)

1

F10, FC5,

(P7, P13, P16,

1

(P7, P13, P16, F10, FC5, Reactor

F10, FC5,

1

FC6) 8 (L8)

Reactor level

(T9, TC10)

3

FC6)

1 (T9, T21,

FC6)

3 TC10)

9 (T9)

Reactor temperature (F2, T21)

21 (T21) Reactor cooling water

(T9, TC10)

3

(T21, TC10)

2

(T21, TC10)

2

(T21, T22)

2

2 (T9, TC10)

3


outlet T (T11, T21, 32 (TC10) Reactor cooling water flow

(T21, T22)

2

3 T22)

10 (F10)

Purge rate

(F10, FC6)

8

(P7, P13, P16, Purge

(F10, FC6)

(F10, FC6)

8

(P7, P13, P16,

(F10, FC6) (P7, P13, P16,

8

28 (FC6) Purge valve

F10, FC5,

1

FC6)

F10, FC5,

1

FC6)

F10, FC5,

(T11,T22)

9

(T11,T22)

9

(T11,T22)

1

FC6) (T11,T22)

11 (T11) Separator temperature

8

9

9

(T11,T21, 3 T22)

12 (L12) Separator level Separator 13 (P13)

Separator pressure

(L12, LC7)

5

(T11, P13)

1

(T11, P13)

1

(T11, P13)

1

(P13, P7)

1

(L12, LC7)

5

(L12, LC7)

5

(L12, LC7)

5

(P7, P13, P16, 14 (F14)

Separator underflow

F10, FC5, (F14, FC2)

(P7, P13, P16, 1

29 (LC7) Separator pot liquid flow

FC6)

(F14, F2)

13

(L15, LC8)

4

Stripper level

16 (P16)

1

Stripper pressure

(F14, F2)

13

(F14, F2)

13

(L15, LC8)

4

(L15, LC8)

4

(F17, FC11)

6

(F17, FC11)

6

Stripper underflow

(F17, FC11)

6

F10, FC5,

(P7, P13, P16, 1

FC6) Stripper 18 (T18)

(L15, LC8)

4

(F17, FC11)

6

Stripper temperature

F10, FC5,

1

FC6)

(P7, P13, P16, (P16, T18, F10, FC5,

19 (F19)

1

FC6)

(P7, P13, P16, 17 (F17)

F10, FC5,

13 FC6)

15 (L15)

F10, FC5,

(P7, P13, P16,

1

(P16, T18, 1

FC9, F19)

Stripper steam flow

1 FC9, F19)

FC6) Stripper liquid product 30 (LC8)

(P16, T18,

(T18, FC9,

flow

(T18, FC9, 2

FC9, F19)

1

F19)

2 F19)

31 (FC9) Stripper steam valve 5 (F5)

Recycle flow

(F5, FC5)

12

(F5, FC5)

12

(F5, FC5)

12

20 (J20)

Compressor work

(J20, FC5)

2

(J20, FC5)

2

(J20, FC5)

2

Compressor

(F5, FC5)

12

27 (FC5) Compressor recycle valve

(P7, P13, P16, F10, FC5,

(P7, P13, P16, 1

FC6)

F10, FC5,

(P7, P13, P16, 1

FC6)

F10, FC5,

1

FC6) (T11, T21,

3

Condenser cooling water 22 (T22) outlet T

(T22, T11)

9

(T22, T11)

9

(F17, FC11)

6

(F17, FC11)

6

Condenser Condenser cooling water

(T21, T22)

2

T22)

(T22, T11)

9

(T21, T22)

2

(F17, FC11)

6

(T22, T11)

9

(F17, FC11)

6

33 (FC11) flow


First, the base condition (maximum sparsity) is studied to see which fundamental relations (correlations) are explained at this level, with the understanding that this SPCA model already captures a significant amount of variance (CPV = 62%) and thus the information contained therein is highly relevant. Even at the high level of sparsity realized at this CPV, the picture gleaned from this snapshot is quite revealing. As noted, each of the 14 SPCs has two non-zero loadings. Figure 5 and Table 2 indicate that the same variables (F4 and FC4) load on both SPC10 and SPC11 (recall that the SPCA algorithm generates SPCs that are not necessarily uncorrelated with each other); thus there are 13 distinct SPCs explaining partial knowledge of the TE process. One can observe that the process variables that have relatively large loadings in each of these 13 SPCs are associated with the major operation units/streams. Accordingly, the behavior of five major pieces of process equipment is captured, viz. the Reactor, Condenser, Separator, Compressor and Stripper. In addition, the behavior of four feed streams is revealed, viz. A feed, D feed, E feed and A and C feed, along with the Purge operation.

Given the variable pairs loading onto each SPC, the internal and external relationships can be discovered based on the dominant non-zero loadings. First we consider the internal relationships within an individual operation unit/stream. Taking the A feed stream in Table 2 as an example, the variables F1 and FC3 are grouped in SPC7. This immediately points to a control loop between F1 and FC3, because F1 is a measurement while FC3 is a manipulated variable. As these two process variables are associated with the same operation unit/stream, namely the A feed stream, they are regarded as an internal cause-effect (causal) pair. Another example is F6 and L8 in SPC14; because the two variables are both measurements, the relationship between them points to a correlated behavior caused by the inherent process dynamics. In the TE process, the reactor level is controlled by the feed rate, thus (F6, L8) represents an internal cause-effect pair. Similarly, the remaining internal relations can now be discovered:

• A feed stream: F1 and FC3 load onto SPC7.
• A and C feed stream: F4 and FC4 load onto SPC10 and SPC11.
• Reactor: F6 and L8 load onto SPC14, and T9 and TC10 load onto SPC3.
• Purge: F10 and FC6 load onto SPC8.
• Separator: L12 and LC7 load onto SPC5.
• Stripper: L15 and LC8 load onto SPC4.
• Recycle compressor: F5 and FC5 load onto SPC12.
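The internal/external distinction can be applied mechanically once each variable is assigned to its operation unit or stream. The following sketch does this for the base-case pairs; the unit assignment is transcribed from Table 2, and the names `UNIT_OF`, `BASE_CASE` and `relation_type` are hypothetical helpers introduced only for illustration.

```python
# Operation unit/stream of each variable appearing in the base case (from Table 2).
UNIT_OF = {
    "F1": "A feed", "FC3": "A feed",
    "F2": "D feed",
    "FC2": "E feed",
    "F4": "A and C feed", "FC4": "A and C feed",
    "F6": "Reactor", "P7": "Reactor", "L8": "Reactor",
    "T9": "Reactor", "T21": "Reactor", "TC10": "Reactor",
    "F10": "Purge", "FC6": "Purge",
    "T11": "Separator", "L12": "Separator", "P13": "Separator",
    "F14": "Separator", "LC7": "Separator",
    "L15": "Stripper", "F17": "Stripper", "LC8": "Stripper",
    "F5": "Compressor", "FC5": "Compressor",
    "T22": "Condenser", "FC11": "Condenser",
}

# Variable pair with non-zero loadings on each SPC in the base case.
BASE_CASE = {
    1: ("P7", "P13"), 2: ("F2", "T21"), 3: ("T9", "TC10"), 4: ("L15", "LC8"),
    5: ("L12", "LC7"), 6: ("F17", "FC11"), 7: ("F1", "FC3"), 8: ("F10", "FC6"),
    9: ("T11", "T22"), 10: ("F4", "FC4"), 11: ("F4", "FC4"), 12: ("F5", "FC5"),
    13: ("F14", "FC2"), 14: ("F6", "L8"),
}


def relation_type(variables):
    """'internal' if all variables share one unit/stream, otherwise 'external'."""
    units = {UNIT_OF[v] for v in variables}
    return "internal" if len(units) == 1 else "external"


internal = sorted(k for k, v in BASE_CASE.items() if relation_type(v) == "internal")
external = sorted(k for k, v in BASE_CASE.items() if relation_type(v) == "external")
print("internal SPCs:", internal)
print("external SPCs:", external)
```

With this assignment, nine SPCs carry internal pairs and the remaining five carry pairs that span two units or streams.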

Next, the external relationships among variables in different operation units are considered. There are a total of 5 SPCs representing possible external relationships: (P7, P13) captured by SPC1, (F2, T21) captured by SPC2, (F17, FC11) captured by SPC6, (T11, T22) captured by SPC9, and (F14, FC2) captured by SPC13. Unlike the internal causality, the external relations observed in the base case appear to be partial. Given that only 62% of CPV is explained by the base case model, these relationships need to be explored further. It is expected that more external relations will be uncovered as CPV increases in the subsequent iterations; thus these pairs present a preliminary yet incomplete picture of the process.

The results of the first iteration are studied next to see what new information is generated by increasing CPV. The new internal relations thus discovered are as follows:

• E feed stream: F3 and FC2 load onto SPC11. This is a control loop.
• Separator: T11 and P13 load onto SPC1. P13 is affected by T11 and vice versa due to the vapor-liquid equilibrium in the separator.
• Stripper: the four variables P16, T18, FC9 and F19 load onto SPC1. It can be seen from Figure 2 that T18 is controlled by FC9 and F19 is manipulated by FC9, which represents the cascade control loop. Also, P16 is related to T18 by thermodynamic laws.
• Compressor: J20 and FC5 load onto SPC2.
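The reasoning used to interpret a group of co-loading variables (a measurement grouped with a manipulated variable suggests a control relation, whereas measurements alone suggest correlation through the process dynamics) can be written as a simple rule. The sketch assumes that the manipulated variables are the eleven valve/flow variables numbered 23 to 33 in Table 2; the set `MANIPULATED` and the function `interpret_group` are hypothetical, and the rule is a heuristic reading aid rather than a causal test.

```python
# Assumed set of manipulated variables (variables 23-33 in Table 2).
MANIPULATED = {
    "FC1", "FC2", "FC3", "FC4", "FC5", "FC6",
    "LC7", "LC8", "FC9", "TC10", "FC11",
}


def interpret_group(variables):
    """Heuristic label for a group of variables loading onto the same SPC."""
    if any(v in MANIPULATED for v in variables):
        return "candidate control relation"
    return "process dynamics"


# New internal relations reported for the first iteration.
FIRST_ITERATION = {
    "E feed": ("F3", "FC2"),
    "Separator": ("T11", "P13"),
    "Stripper": ("P16", "T18", "FC9", "F19"),
    "Compressor": ("J20", "FC5"),
}

interpretation = {unit: interpret_group(g) for unit, g in FIRST_ITERATION.items()}
for unit, label in interpretation.items():
    print(f"{unit}: {label}")
```

Only the Separator pair (T11, P13) consists purely of measurements, which is consistent with its interpretation through vapor-liquid equilibrium rather than a control loop.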

Apart from the above additions to the internal causality information captured in the first iteration, the external relations, especially those captured by SPC1, are worthy of note. The variable P16 (Stripper pressure) appears and loads onto SPC1. Now all three pressure variables in the TE process load onto SPC1, which implies that they must represent a significant source of variation in the process. From a process discovery standpoint, this hints at a feature that embodies the causality among pressure measurements. Upon analyzing the flowchart of the TE process in Figure 2, it can be observed that both the Stripper and the Separator have return streams to the Reactor. Thus, these three pressures must be varying in concert, as together they influence many variables associated with the fundamental material and energy balances around the process units. The new external relations discovered are as follows:

• Variables P7, P13, P16, F10, FC5 and FC6 load onto SPC1 and reveal the overall pressure control loops for the TE process. Any changes in P7, P13 or P16 will affect F10, FC5 and FC6 and vice versa. This demonstrates how SPCA can assist in discovering the underlying process dynamics that may not be explicitly obvious on a process flow diagram.

The second iteration results are observed on SPC2. The new internal relations discovered are as follows:

• D feed stream: F2 and FC1 load onto SPC2. This is a control loop.
• Reactor: The internal causality is discovered for variables T21 and TC10 in SPC2. In addition, SPC3 captures the internal causality between T9 and TC10. Such information hints at the fact that the variables T9, TC10 and T21 are related in some manner. These three variables indeed are cascaded by the control loop as shown in Figure 2.
• Stripper: Variables T18, F19 and FC9 load onto SPC2 and reveal the temperature cascade control loop for the Stripper.
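The inference that T9, TC10 and T21 belong together follows from two SPC groups sharing the variable TC10. A minimal sketch of this merging step is given below; `merge_overlapping` is a hypothetical helper, and only the second-iteration internal groups for the Reactor and the Stripper are used as input.

```python
def merge_overlapping(groups):
    """Merge variable groups that share at least one variable (transitively)."""
    merged = []
    for group in groups:
        current = set(group)
        remaining = []
        for existing in merged:
            if existing & current:
                current |= existing   # absorb every overlapping group
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


groups = [
    ("T21", "TC10"),        # Reactor pair on SPC2
    ("T9", "TC10"),         # Reactor pair on SPC3
    ("T18", "F19", "FC9"),  # Stripper group on SPC2
]

clusters = merge_overlapping(groups)
for cluster in clusters:
    print(sorted(cluster))
```

The two Reactor pairs collapse into a single three-variable cluster through the shared manipulated variable, while the Stripper group remains separate.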

The new external relation discovered is as follows:

• Variables T21 and T22 load onto SPC2. It is easy to infer from Figure 2 that any changes in T21 are bound to affect T22.

With the third iteration results, one can observe that the internal causality in the D feed stream (F2, FC1) is now captured strongly by SPC3. Moreover, the internal causality between F6 and L8 in the Reactor is captured by SPC14. The external causality among variables T11, T21 and T22 is captured by SPC3. It is easy to infer from Figure 2 that any changes in T21 will affect the downstream equipment temperatures, viz. T22 and T11.

There is one curious and perhaps not fully explained feature here. The external relation (F14, FC2) first appears in the base case solution, where both variables load onto SPC13. This appears to indicate a correlation between the E feed flow valve (FC2) and the separator underflow (F14) that seems counterintuitive. Indeed, the loading of variable FC2 on SPC13 is quite weak (Figure 5), perhaps implying that this pairing may not be meaningful. In the next two iterations, F14 is paired with F2 (D feed flow), again with a very weak loading of F2 on SPC13. In the last iteration, the initial pairing reappears, confirming perhaps that looking for a dependency relation between F14 and another variable is futile. This phenomenon points to the possible conclusion that the separator underflow (F14) has a distinct directional behavior, almost individually loading on SPC13.

In summary, the final internal and external relations discovered by interpreting the SPCs are illustrated in Figures 7 to 10. The final summary of internal relations is as follows:

• For the four feed streams, the internal causalities resulting from control loops are all captured.
• Reactor: It was revealed that variable F6 (reactor feed rate) directly influences L8 (reactor level). T9 (reactor temperature) and T21 (reactor cooling water outlet temperature) are both controlled by manipulating TC10 (reactor cooling water flow valve).
• Purge: F10 (purge rate) is controlled by FC6 (purge valve).
• Separator: L12 (separator level) is controlled by LC7 (separator pot liquid flow valve).
• Stripper: L15 (stripper level) is controlled by LC8 (stripper liquid product flow valve). T18 (stripper temperature) is controlled by FC9 (stripper steam valve), and FC9 directly influences F19 (stripper steam flow).
• Compressor: F5 (recycle flow) is controlled by FC5 (recycle valve), and J20 (compressor work) is directly affected by FC5.

Apart from these internal causalities, the external relations are determined as follows:

• Temperature cluster: T21 (reactor cooling water outlet temperature) affects T22 (condenser cooling water outlet temperature) and T11 (separator temperature). The temperature measurements among the three unit operations appear to be correlated.
• Pressure cluster: P7 (reactor pressure), P13 (separator pressure) and P16 (stripper pressure) play an important role in the whole process and appear to be correlated. This is analogous to the temperature cluster for the same unit operations. In addition, the (F10, FC6) pair, which also loads on SPC8 as a control loop pairing, appears in this block to underscore the dependence of the purge operation on the pressure cluster associated with the three unit operations as noted. Finally, the appearance of FC5 (compressor recycle valve) in this block hints at the fact that the compressor operation is directly influenced by the variability in the pressure cluster.
• F17 (stripper underflow) is controlled by FC11 (condenser cooling water flow).

In any fluid-processing plant, temperature and pressure variables are not independent because of thermodynamic dependencies, which leads to variables overlapping in SPC1, SPC2 and SPC3 as the captured variance is increased. The proposed scheme effectively discovers such process knowledge by exploiting the forward SPCA technique and interpreting the resulting SPCs.


Figure 7. The non-zero loadings of the first SPC (SPC1) for the TE process.

Figure 8. The non-zero loadings of the second SPC (SPC2) for the TE process.


Figure 9. The non-zero loadings of the third SPC (SPC3) for the TE process.

Figure 10. The non-zero loadings of the remaining 11 SPCs (SPC4-SPC14) for the TE process.


5. CONCLUSIONS

An efficient and robust forward SPCA method is proposed as a tool for process knowledge discovery that complements existing techniques. The proposed forward search approach finds the optimum sparse loadings at a low computational cost. The set of sparse eigenvectors reveals valuable and meaningful patterns in the data, especially the internal and external relations that are associated with variables within and across units/streams, respectively. The proposed process knowledge discovery method is tested on the well-known TE process, demonstrating that our data-driven technique can help recover important process structure and knowledge that may otherwise require in-depth process and unit operations experience and modeling efforts. While some of the findings could be deemed obvious, the relevance and importance of the proposed method lie in its application to any complex process structure and in the extent of process knowledge that it can provide, particularly if little to no prior process knowledge exists.

ACKNOWLEDGEMENTS

Part of this research was carried out during the sabbatical stay of Murat Kulahci at the Department of Chemical Engineering at UC Davis, partly funded by Otto Mønsted’s Foundation in Denmark. Huihui Gao’s stay at UC Davis was funded by the International Joint Graduate-Training Program Scholarship provided by Beijing University of Chemical Technology.


