Hierarchical Multi-label Segmentation for System Identification Using Historical Data


Process Systems Engineering

Hierarchical multi-label segmentation for system identification using historical data. Manikandan S. and Raghunathan Rengaswamy. Ind. Eng. Chem. Res., Just Accepted Manuscript. DOI: 10.1021/acs.iecr.8b06335. Publication Date (Web): 06 May 2019.





Hierarchical multi-label segmentation for system identification using historical data

Manikandan S. and Raghunathan Rengaswamy*

Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai 600 036, India

E-mail: [email protected]

Abstract

Model predictive controllers (MPC) utilize a model of the process to optimize the future trajectory using an objective function and obtain a control move plan. Any new MPC implementation requires model identification, and the quality of the identified model depends on the information content of the data. Performing step tests to obtain informative data is time-consuming and may not be economical. Since process data is stored long term in industry, this data can be used for identification. However, historical data contains informative portions scattered among regions of insignificant variation, long-term disturbance effects, process interruptions, etc. The informative data required for identification can be mined from historical records using appropriate machine learning techniques. This paper focuses on generating high-quality data segments from historical records that can be used to identify reliable process models for use in any model-based controller such as MPC. An interval-halving based hierarchical classification method is proposed to identify segments and label them based on their information content and the presence of disturbance. The key distinction between the proposed method and the methods in the literature is the ability to identify process models from historical records that may comprise regions of low-quality data, beset with intermittent disturbance effects, and that have not been annotated in terms of these characteristics. The proposed algorithm is tested on simulated systems, and the method was able to identify process models from historical data with little to no annotation.

Introduction

Model Predictive Control (MPC) has found applications across various engineering domains in the last few decades.1 This is because the MPC framework solves an optimization problem that can be designed and customized to optimize various operational and economic objectives. It can also handle the constraints that are inherent in all physical systems. MPC performs control calculations using a representative model of the process: MPC algorithms compute an input move plan by optimizing the estimated future trajectory of the system using this model. The performance of MPC control schemes is significantly influenced by the approximation errors, or the extent of plant-model mismatch. MPC implementation in a new process therefore requires identification of a process model a priori.

A representative model of the process can be obtained in two ways. One is to develop the model using first principles. The other is to use process data to estimate a model. The second method requires informative data that can be used for modeling the process. Moreover, in order to use data-driven approaches to modeling, certain assumptions have to be made, and these assumptions may lead to a higher degree of plant-model mismatch. In order to obtain the best model of the process, one that is valid throughout the whole operating region, data must be collected systematically. Hence, a set of well-planned tests involving changing one or more inputs at a time is performed and the corresponding response data is collected. Generally, the inputs are perturbed using step-like signals, so this test phase is called the step test phase. This data is used for modeling after careful screening. Performing such tests may be time-consuming, economically unviable, and operationally demanding: the tests require critical monitoring, may cause disturbances in production, and may need careful scheduling to avoid the effects of disturbances on the collected data. Thus data collection and modeling is the most time-consuming step in the implementation of MPC, requiring as much as 90% of the implementation time.2

Industrial processes are often complex and have interacting dynamics. If one were to model an industrial process using first principles without simplifying assumptions, the resulting model is almost always nonlinear. Using a nonlinear model in the MPC framework typically results in nonlinear optimization problems that require high computational power and long computation times. Thus using first principles to obtain a model for control is often ineffective. Adaptive MPC and model-free predictive controller frameworks can simultaneously identify a new model from data and control the system,3,4 and these methods reduce model-plant mismatch. Hence data-driven model identification is preferred for MPC implementation.

With the development of cheaper storage devices and faster computational systems, it has become the norm to store considerable quantities of past process data. This bulk data is called the historical data of the process. Such data is used for evaluating controller performance, analyzing the effectiveness of various operations, and for fault diagnosis.5-8 These techniques utilize only the data corresponding to the region of interest at a desired granularity. Historical data can also be used for model identification with a few additional techniques to isolate informative data. Since MPC requires a dynamic model to be estimated, only data with significant variation in inputs and outputs can be used. In a typical MPC implementation, step tests are performed to ensure that a sufficient amount of informative data is collected. Since the data collected while performing a step test corresponds to only a short span of time, the effects of long-term disturbances such as corrosion, fouling of heat exchangers, and seasonal effects due to atmospheric temperature variations can be neglected. While performing a step test, the move plan is made such that none of the inputs affecting the same set of output variables are moved together; thus the possibility of correlated input moves is eliminated. Using data collected in this manner for model identification results in a high-quality model. Historical data, in contrast, contains a few regions with significant changes in inputs and outputs scattered across regions where there is negligible change in the input-output data.
These regions have to be automatically mined from the historical data. Historical data may also contain long-term disturbance effects; if historical data is to be used for modeling, these effects have to be accounted for. If the process involved is multivariate, then the choice of inputs may make the system ill-conditioned. In such cases, appropriate dimensionality reduction techniques need to be employed before model identification. Similarly, there may be segments of data where the inputs are varied in a correlated fashion. These segments have to be identified and can then either be eliminated from the data used for identification or included by choosing appropriate estimation techniques.

There are three techniques in the literature for mining historical data for system identification.9-11 These techniques address SISO problems. Multivariable analyses for segmentation based on clustering techniques have also been proposed.12-14 These techniques provide a data quality metric for each segment, typically the condition number of Fisher's information matrix for an assumed model structure, and classify the segments based on a threshold on this metric. The presence of disturbance variables and the effect of these variables on model identification are not taken into account in these formulations. We propose an algorithm that segregates historical data into various segments and classifies them based on the quality of the data and the presence of disturbance. Since the proposed algorithm classifies data into segments of similar properties, these can be used individually in model identification. By using readily available historical data, the time spent on performing tests while implementing MPC can be reduced.

In summary, model identification for MPC implementation using step tests is a time-consuming, uneconomical task. If historical data can be mined for regions where informative data is present, then step tests can be avoided. Historical data may contain regions of uninformative data, effects of unmeasured disturbances, long-term disturbance effects like corrosion, correlated input moves, etc. In order to use historical data for identification, appropriate data mining techniques that can classify the various regions based on certain properties of each segment need to be employed. Informative data, as detected by the algorithm, is then used for modeling.
If disturbance variables are measured, they can be included along with the input variables, and both the process and disturbance models can be identified together. In general, however, disturbance variables are seldom measured. Hence the objective is to obtain a process model alone in the presence of unmeasured disturbances. Once successful, the same algorithm can be quite easily extended to cases where disturbance variables are measured.

Motivating examples

If system identification is the goal, then finding the regions of historical data that are most suitable for model identification is the objective of any technique used for partitioning historical data. Existing segmentation algorithms9,10 use the condition number of Fisher's information matrix as a criterion to partition the data into regions of informative data. Fisher's information matrix is constructed as a function of past inputs and outputs based on the choice of model to be fitted. A Laguerre-based model is proposed in Peretzki et al.,9 while an ARX model structure is used in Shardt and Huang.10 These works do not include the disturbance in the construction of Fisher's information matrix. Thus the condition number of Fisher's information matrix does not change between data with and without disturbance, and using data identified with this criterion may lead to biased model parameter estimates. In Wang et al.,11 adjustments are made to include disturbance variables in Fisher's information matrix because they are continuous and measured. When the disturbance variables are unmeasured, they cannot be included in Fisher's information matrix. One approach to this problem is the classification of historical data based on the presence/absence of disturbance, without explicit identification of a disturbance model.

Partitioning based on the condition number of Fisher's information matrix alone is not sufficient for obtaining the best available informative data segments in historical data. The presence of poor-quality data reduces the accuracy of the identified model.15 Better models can be obtained by classifying the segmented data into various categories and using only the best available data for identification.
These categories are combinations of high (low) quality of data and presence (absence) of disturbance, so the segmented data can be labeled with four different labels. The best available data are the segments with high-quality data and no disturbance effects. In order to demonstrate the need for such a classification, example simulations that bring out two important ideas are presented in this section; the simulations use the model structure in Equation 1 and the error measure in Equation 2.

y[k] = \sum_{i=0}^{n} g(i)\, u[k-i] + e[k]    (1)

MSE = \sqrt{\dfrac{\sum_{i=1}^{N} \left(y[i] - \hat{y}[i]\right)^2}{N - p}}    (2)

Example - Effect of the presence of low-quality data on model identification and Fisher's information matrix

First, the effect of the presence of low-quality data, alongside high-quality data, on the estimated model and on Fisher's information matrix is studied. Data is generated using the process model structure given in Equation 1, with n = 10. The parameters used in the simulation are listed in Table 1. Datasets of 2000 samples are generated with different fractions of data having a non-persistently exciting (NPE) input and a persistently exciting (PE) input; data generated with a non-persistently exciting input reduces the quality of the dataset. The PE input is generated as a sum of 6 sinusoids, i.e., just one more than required to estimate all 10 parameters, while the NPE input is generated as a sum of 3 sinusoids. The simulated data is used for model identification. The performance of the estimated model is compared with the actual model using the mean squared error (MSE), computed as given in Equation 2, of the responses to an arbitrary input. The condition number κ(F) of Fisher's information matrix for the same model structure is also computed. This simulation is executed 1000 times, and the average values of MSE and condition number are plotted in Figure 1. The condition number of Fisher's matrix reaches a high value (κ(F) = 7.6481 × 10^13) only for the case when the whole dataset is simulated with the non-persistently exciting input and is thus of low quality. In the other cases, the condition number is well within the threshold specified in Shardt et al.,10 i.e., κ(F) ≤ 1 × 10^4. On the other hand, the mean squared error of the estimated model increases as the fraction of low-quality data is increased. This is because the parameters are estimated with bias, which leads to poor predictions.
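A minimal sketch of this experiment is given below (not the authors' code). It assumes an illustrative set of FIR coefficients rather than Table 1, arbitrary sinusoid frequencies, and treats Φ'Φ built from the lagged inputs as the Fisher information matrix (which holds up to the noise variance for white Gaussian noise); the validation error is a plain mean squared output error rather than Equation 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p = 10
g = 0.9 ** np.arange(n_p)                     # illustrative FIR coefficients (not Table 1)

def multisine(n, n_sines):
    """Sum of sinusoids: persistently exciting only up to order 2*n_sines."""
    k = np.arange(n)
    freqs = rng.uniform(0.05, 0.45, n_sines)  # arbitrary normalized frequencies
    return sum(np.sin(2 * np.pi * f * k) for f in freqs)

def regressor(u, n_p):
    """Rows are [u[k], u[k-1], ..., u[k-n_p+1]]."""
    return np.column_stack([u[n_p - 1 - i: len(u) - i] for i in range(n_p)])

def experiment(frac_npe, N=2000, noise_std=0.1):
    """Mix an NPE portion (3 sinusoids) with a PE portion (6 sinusoids)."""
    n_npe = int(frac_npe * N)
    u = np.concatenate([multisine(n_npe, 3), multisine(N - n_npe, 6)])
    y = np.convolve(u, g)[:N] + rng.normal(0.0, noise_std, N)
    Phi = regressor(u, n_p)
    g_hat, *_ = np.linalg.lstsq(Phi, y[n_p - 1:], rcond=None)
    kappa = np.linalg.cond(Phi.T @ Phi)       # condition number of the (scaled) Fisher matrix
    u_val = rng.standard_normal(500)          # arbitrary validation input
    err = np.convolve(u_val, g)[:500] - np.convolve(u_val, g_hat)[:500]
    return kappa, np.mean(err ** 2)           # plain mean squared output error

for frac in (0.0, 0.5, 0.9, 1.0):
    print(frac, experiment(frac))
```

Sweeping the NPE fraction in this sketch should reproduce the qualitative trend of Figure 1: the condition number stays moderate until the segment is almost entirely NPE, while the validation error grows with the NPE fraction.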

Figure 1: Effect of the presence of low-quality data on the MSE of the estimated model and on the condition number of Fisher's information matrix (x-axis: fraction of non-persistently exciting input present; left y-axis: mean squared error (MSE); right y-axis: condition number of Fisher's matrix).

Example - Effect of the presence of disturbance on model identification

In order to emphasize the need for segregation based on the presence of disturbance, a similar experiment is performed using the system structure specified in Equation 3, with n_u, n_d = 10. In this experiment, disturbance is introduced in various fractions of the data. A full-band PRBS input is used for the simulation. The disturbance variable d is assumed to be random and is generated as d ~ N(0, 1). The results are plotted in Figure 2. Since the data is of high quality, the condition number does not vary significantly throughout this experiment, but the MSE increases with the fraction of data containing the disturbance.

y[k] = \sum_{i=0}^{n_u} g(i)\, u[k-i] + \sum_{i=0}^{n_d} g_d(i)\, d[k-i] + e[k]    (3)

Table 1: Simulated system parameters

Coefficient   g(1)    g(2)    g(3)    g(4)    g(5)    g(6)    g(7)    g(8)    g(9)    g(10)
G_s           2.000   1.910   1.860   1.700   1.530   1.200   0.980   0.630   0.420   0.310
G_d           0.900   0.730   0.460   0.350   0.280   0.260   0.230   0.210   0.160   0.120
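The disturbance experiment can be sketched in the same way, here using the G_s and G_d coefficients of Table 1, a ±1 random binary sequence as a stand-in for the full-band PRBS, and an arbitrary noise level. Fitting a process-only FIR model by least squares to data that contains the unmeasured disturbance term of Equation 3 degrades the estimate, which is what drives the MSE trend in Figure 2, while the input regressor (and hence Fisher's matrix) is unaffected.

```python
import numpy as np

rng = np.random.default_rng(1)
g_s = np.array([2.000, 1.910, 1.860, 1.700, 1.530, 1.200, 0.980, 0.630, 0.420, 0.310])
g_d = np.array([0.900, 0.730, 0.460, 0.350, 0.280, 0.260, 0.230, 0.210, 0.160, 0.120])

N = 2000
u = rng.choice([-1.0, 1.0], N)                     # stand-in for a full-band PRBS
d = rng.standard_normal(N)                         # unmeasured disturbance, d ~ N(0, 1)
n_p = len(g_s)

def process_only_fit_mse(frac_disturbed):
    """Simulate Equation 3, fit a process-only FIR model, return validation MSE."""
    active = (np.arange(N) < int(frac_disturbed * N)).astype(float)
    y = (np.convolve(u, g_s)[:N] + active * np.convolve(d, g_d)[:N]
         + 0.1 * rng.standard_normal(N))           # arbitrary measurement noise level
    Phi = np.column_stack([u[n_p - 1 - i: N - i] for i in range(n_p)])
    g_hat, *_ = np.linalg.lstsq(Phi, y[n_p - 1:], rcond=None)
    u_val = rng.standard_normal(500)               # arbitrary validation input
    err = np.convolve(u_val, g_s)[:500] - np.convolve(u_val, g_hat)[:500]
    return np.mean(err ** 2)

for frac in (0.0, 0.5, 1.0):                       # validation error typically grows with frac
    print(frac, process_only_fit_mse(frac))
```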

Figure 2: Effect of the presence of disturbance on the MSE of the estimated model and on the condition number of Fisher's information matrix (x-axis: fraction of data containing disturbance; left y-axis: MSE of the estimated model; right y-axis: condition number).

Thus it is not sufficient to categorize data based only on the condition number of Fisher's information matrix. Even though data quality is detected by the condition number of Fisher's matrix, the detection is reliable only if the data range consists of poor-quality data alone. If the region contains even a small number of high-quality data samples along with the low-quality data, then the criterion fails. A sliding-window approach may end up including data regions that are of low quality until the criterion is met.


Thus the threshold used for the condition number becomes an important tuning factor and is likely to have a significant impact on the properties of the various segments. Using these segments directly for identification may result in models with poor predictions. Further, if the disturbance variables are not measured, then Fisher's information matrix will not be able to identify the informative segments properly.

Problem formulation and solution

In this paper, we propose a solution to the problem of identifying high-quality data regions with minimal disturbance effects from historical data, which can subsequently be used for dynamic model identification of SISO systems. The method proposed in this paper can, in principle, be readily extended to MIMO systems after addressing collinearity issues and correlated input moves. The underlying process is assumed to be linear time invariant (LTI); this assumption simplifies the solution methodology. Under very mild assumptions about local time invariance, the concatenation step of the algorithm (to be discussed later) can be modified to remove the LTI assumption. It is also assumed that the order of the process is known. This is an important assumption that allows proper identification of the segments. Removing this assumption would require the use of some other information metric (MSE is used in this paper) in the segmentation process. While there are several possible choices for this (AIC, BIC, entropy), a careful evaluation of these choices has to be made so that the proposed approach is routinely applicable to a large cross-section of problems. This is not pursued in this paper and will be a subject for future research. Finally, an FIR model structure is used, which is easily replaced with any other model form, particularly under the assumption of a known structure. The quality requirement on the data for identification depends on the number of parameters to be estimated. Since parametric models have fewer parameters, lower-quality data can be used to identify them. FIR models, being non-parametric, require more parameters to be estimated and hence require high-quality data. With the use of an FIR model structure, a worst-case scenario for the proposed method is therefore demonstrated.
The proposed approach follows the sequence of steps depicted in Figure 3 to solve the problem of interest. Each of these steps is described in detail in this section. One of the key requirements for system identification from historical data is the ability to isolate data segments that contain informative data. In this section, the multi-label classification approach is explained first, followed by a description of the proposed approach.

Figure 3: Proposed algorithm

Multi-label classification - key ideas

Classification is a machine learning technique that assigns class labels to data instances. Classification can be based on a single criterion, in which case it is binary classification, or on multiple criteria, in which case it is multi-dimensional classification. In a traditional multi-class problem, each data instance is assigned to one class. In contrast, in a multi-dimensional classification problem, a given data instance can be assigned to multiple classes.16 Multi-label classification can be thought of as a particular case of the multi-dimensional classification problem: here each data point is assigned to multiple classes, called labels, which take only binary values. There are several techniques for solving these types of problems. The two class variables of interest here are the quality of the data and the disturbance, with corresponding binary values high/low and present/absent, respectively. Thus, the segmentation of historical data is a multi-label classification problem, and a hierarchical approach is proposed for the identification of high-quality data segments from historical data. It is worth commenting here that this could be generalized to a multi-dimensional classification problem if the categories of data quality are made more fine-grained and/or if explicit disturbance types are introduced.
Another important distinction to note here is that associating single data instances with class labels is not appropriate in this problem; rather, a group of contiguous data has to be assigned class labels. Further, this has to be performed in an unsupervised manner, as it is assumed that annotation of the historical data is not generally available. Hence, segmentation is performed on a group of data rather than on a single data instance. Given a segment, the information content and the presence of disturbance in that segment need to be identified. Multi-label classification can be used to assign multiple labels to the data simultaneously. However, this version of multi-label classification cannot be used for classifying historical data segments for system identification, because the presence of disturbance manifests as model mismatch. An approximate model of the system has to be identified and analyzed to determine whether the model fit is good enough in order to identify the presence of disturbance effects. For identifying a model of a particular structure with sufficient accuracy, data of a certain quality is needed. Thus the labels are inter-related, and the required classification has to be performed sequentially: classification of the data based on quality has to precede detection of the presence of disturbance. Thus the proposed approach is hierarchical multi-label segmentation.

Interval-halving segmentation

For segmentation, one can use a top-down, bottom-up, or sliding-window approach.17 In the motivation section it was demonstrated that segments could be categorized as high quality even if several sub-segments of low-quality data are present within the segment. As a result, a segmentation that starts from the whole data and narrows down to segments that are of low quality (a top-down approach) is likely to perform better than the other approaches. In this work, the interval-halving method is used. Interval halving has been used in various data mining techniques and has been applied extensively in fields like econometrics, process control, fault diagnosis, and trend analysis.6,8,18,19 It can be used to detect a change in any statistical property of the data.
Interval halving works by recursively splitting the data into two halves based on some criterion. If the criterion is met, the splitting is terminated and the corresponding data window is stored. Otherwise, interval halving is continued until the desired criterion is met or some minimal length of data is reached. The remaining data is then processed in the same manner to obtain regions where the criterion is either met or not. A key advantage of interval halving is that segmentation can be performed based on any property of interest; mean, variance,18 Hurst exponent,20 and other properties have already been used for segmentation in various applications. A pictorial representation of the interval-halving process is shown in Figure 4.

Figure 4: Interval-halving method - one execution step. The x-axis line represents the whole data sequence that is being classified. The dashed line represents the next iteration cycle if the length of the data is greater than n_minL.

If the data consists of regions where the criterion is not met, then interval halving proceeds until a minimum length of data is reached. This results in more segments being identified than are present in the data. Hence, a post-processing algorithm is usually employed in which segments with similar properties are merged together. In fact, for the problem of interest here, the post-processing algorithm is of special significance. A flowchart of the interval-halving algorithm is shown in Figure 5. The interval-halving algorithm is demonstrated using a simple example, with data generated as shown below.

d[k] \sim \begin{cases} N(0,\ 0.1) & k \in (1, 100) \\ N(0.2,\ 0.1) & k \in (101, 150) \\ N(0.15,\ 0.1) & k \in (151, 200) \end{cases}

The data has three regions with different means and a common variance. The aim of the interval-halving approach is to identify that the data is comprised of three segments with different means, along with the locations of the transition points. Interval halving is performed with a mean tolerance of 0.01 and a minimum data length of 50 samples: if the mean of a segment is less than 0.01, interval halving is stopped and the corresponding region is marked; the algorithm then moves to the remaining segments. At the first iteration, taking the whole data as a single segment, the mean is computed (µ_S = 0.0891 > 0.01). Since the criterion is not met, interval halving is performed, a new segment 1 is considered, and the mean of this segment is computed. Since µ_S1 = 0.0032 < 0.01 (see Figure 6), the stopping criterion is satisfied and the segmentation moves on to the rest of the data. In the remaining segment the specified criterion is not met, as µ_S2 = 0.1749. Hence interval halving is performed and the mean of the new segment is computed. Even though µ_S3 = 0.2060 > 0.01, the algorithm stops because the requirement that the data should have at least 50 samples is reached. The mean of the remaining segment is also computed, and interval halving stops there as well because of the minimum length criterion. If the minimum length criterion were smaller (say 20), then the process would have continued.
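The recursion just described can be written compactly. The sketch below is not the authors' implementation: it interprets the tolerance as |mean| < 0.01 and stops splitting a segment once a further halving would fall below the minimum length, which reproduces the 100/50/50 split of the example.

```python
import numpy as np

def interval_halving(x, criterion, min_len):
    """Recursively split x into halves until `criterion(segment)` is met or the
    next split would produce segments shorter than `min_len`. Returns a list of
    (start, end, criterion_met) tuples with `end` exclusive."""
    segments = []

    def split(lo, hi):
        seg = x[lo:hi]
        if criterion(seg) or (hi - lo) // 2 < min_len:
            segments.append((lo, hi, bool(criterion(seg))))
            return
        mid = lo + (hi - lo) // 2
        split(lo, mid)      # classify the first half ...
        split(mid, hi)      # ... then the remaining segment of this split

    split(0, len(x))
    return segments

# Data of the worked example: three regions with different means, common variance.
rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(0.00, 0.1, 100),
                    rng.normal(0.20, 0.1, 50),
                    rng.normal(0.15, 0.1, 50)])
# With this tolerance the 100-sample low-mean region is usually accepted, and the
# two 50-sample regions are left unsplit by the minimum-length rule.
print(interval_halving(d, lambda s: abs(s.mean()) < 0.01, min_len=50))
```

The same routine can drive both levels of the hierarchy described later; only the criterion passed to it changes.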


Figure 5: General interval-halving technique.

Hierarchical multi-label segmentation

The proposed method consists of two sequential interval-halving based segmentations. The first segmentation is focused on identifying high-quality data segments; the second segmentation is aimed at identifying the presence of disturbance.

Remark 1. In general, historical data will contain large portions of data where the changes in the input-output data are insignificant. To reduce computational complexity, a pre-processing algorithm can be used to identify historical data with variance larger than a threshold value. Various methods are available for the identification of such regions from data.11,18,21

Figure 6: Interval-halving example for change-in-mean detection. The plot shows the sample data (x-axis: sample index, 0-200; y-axis: magnitude) with the segment means annotated: µ_S = 0.0891 for the whole data, µ_S1 = 0.003, µ_S2 = 0.1749, µ_S3 = 0.2060, and µ_S4 = 0.1439.

Segmentation 1 - Classifying the data based on quality of the data

The pre-processed data is analyzed using an interval-halving technique. The criterion used for classification is the rank of the input covariance matrix (R_uu). This value provides information about the maximum number of parameters that can be evaluated unambiguously. Thus, if n_p is the number of parameters to be estimated in the chosen FIR model, then the corresponding input covariance matrix should have rank ≥ n_p for proper identification. If a sufficient quantity of high-quality data is present along with low-quality data, then the segment will result in an input covariance matrix with rank ≥ n_p. Thus interval halving is continued as long as the segment has rank(R_uu) ≥ n_p: the resulting data segment is recursively split into two regions and each region is tested again for the rank of R_uu. The evaluation stops when the data region results in an input covariance matrix with rank less than the number of parameters to be identified, rank(R_uu) < n_p, or when the length of the data is less than a specified limit n_minL. This limit allows us to stop interval halving if the data is of high quality over the whole chosen range. At the end of this process, each of the final identified segments can be tagged with an identifier [HQ,?] or [LQ,?], which represents whether the segment is of high quality or not; the presence/absence of disturbance has not yet been determined.
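A minimal sketch of this segmentation-1 criterion, assuming an FIR model with n_p coefficients; the exact regressor construction and the numerical rank tolerance are implementation choices not specified in the text.

```python
import numpy as np

def lagged_regressor(u, n_p):
    """Rows are [u[k], u[k-1], ..., u[k-n_p+1]] for k = n_p-1, ..., N-1."""
    u = np.asarray(u, dtype=float)
    return np.column_stack([u[n_p - 1 - i: len(u) - i] for i in range(n_p)])

def is_high_quality(u, n_p):
    """Segmentation-1 criterion: the lagged input covariance matrix R_uu must
    have rank >= n_p so that all n_p FIR parameters are identifiable."""
    Phi = lagged_regressor(u, n_p)
    R_uu = Phi.T @ Phi / Phi.shape[0]
    return np.linalg.matrix_rank(R_uu) >= n_p

# A sum of 3 sinusoids (excitation order 6) fails the test for n_p = 10,
# while a random binary input passes it.
k = np.arange(2000)
u_npe = sum(np.sin(2 * np.pi * f * k) for f in (0.05, 0.12, 0.31))
u_pe = np.random.default_rng(0).choice([-1.0, 1.0], 2000)
print(is_high_quality(u_npe, 10), is_high_quality(u_pe, 10))
```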

Segmentation 2 - Classifying data based on presence of disturbance in the region

The individual regions, i.e., data of high quality and data of low quality, can be further classified based on the presence/absence of disturbance. All the segments identified in the previous step are now subjected to a second level of segmentation to convert the [xx,?] tags to [xx,D] or [xx,ND], signifying the presence or absence of disturbance respectively (xx denotes HQ or LQ). Since the presence of disturbance can be viewed as model mismatch, the residuals of the identification exercise will be colored noise signals; hence a whiteness test can be used to identify the presence of disturbance in the data. Similarly, fitting a lower-order model to the system, or fitting the correct structure when disturbance is present, results in a higher MSE for the model: MSE values for models identified on data with disturbance are much larger than for data without disturbance. In the proposed method, MSE is used as the criterion in the second level of segmentation. Model estimation can be performed using various methods such as least squares (LS) estimation, maximum likelihood estimation, or regularized least squares estimation; here LS estimation is used and the corresponding MSE is used for segmentation. Interval halving is performed based on a user-defined threshold ε_MSE of the MSE, beyond which the data is considered to contain disturbance. Interval halving proceeds as long as MSE ≥ ε_MSE and the minimum data length n_minL has not been reached. When the condition MSE < ε_MSE is satisfied, the region is marked as a region without disturbance. At the end of this stage, each segment is classified as one of [HQ,D], [HQ,ND], [LQ,D], [LQ,ND].
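A minimal sketch of the segmentation-2 criterion, assuming least-squares estimation of an FIR model with n_p coefficients and the error measure of Equation 2; eps_mse stands for the user-defined threshold ε_MSE.

```python
import numpy as np

def fir_mse(u, y, n_p):
    """Fit y[k] ≈ sum_i g(i) u[k-i] by least squares and return the prediction
    error measure of Equation 2 on the same segment."""
    u, y = np.asarray(u, float), np.asarray(y, float)
    Phi = np.column_stack([u[n_p - 1 - i: len(u) - i] for i in range(n_p)])
    y_t = y[n_p - 1:]
    g_hat, *_ = np.linalg.lstsq(Phi, y_t, rcond=None)
    resid = y_t - Phi @ g_hat
    return np.sqrt(resid @ resid / (len(y_t) - n_p))

def has_no_disturbance(u, y, n_p, eps_mse):
    """Segmentation-2 criterion: tag the segment [xx,ND] when the fit error is
    below the user-chosen threshold eps_mse."""
    return fir_mse(u, y, n_p) < eps_mse
```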

Concatenation and system identification

Since interval halving proceeds down to n_minL for regions where the data is of high quality, the algorithm in general identifies a larger number of segments than exist in the data. However, only four categories have to be identified. As a result, if there are contiguous intervals with the same tag ([HQ,D], [HQ,ND], [LQ,D], [LQ,ND]), they can all be merged to form an episode of longer duration with similar characteristics. Such a concatenation of the identified segments is performed to generate episodes of longer duration. After this, a list of episodes is generated, where each element of the list is of the form [{HQ or LQ}, {D or ND}, n_b, n_e], where n_b is the beginning sample number and n_e is the end sample number. Model identification can then be performed by consolidating all the data from the [HQ,ND] episodes. A least-squares modeling framework is used to generate a model of the given structure. An interesting point to note here is that, if it is assumed that the system is only locally time invariant, different models could be identified using different [HQ,ND] sections. The goodness of the identified model is evaluated using the MSE between the predicted and actual outputs. There are other methods, such as validation using test data of a known type, analysis of the residuals, and so on, for assessing goodness of fit; these will be explored in future work. The proposed method is schematically represented in Figure 7. When the model is not satisfactory, the model structure needs to be changed. This changes the required quality of the data; in other words, the required rank of the input covariance matrix R_uu has to be adjusted based on the chosen model structure. This would require segmentation 1 and segmentation 2 to be performed again. Since it is assumed that the model structure is known, this loop is not utilized in the present work; nonetheless, it is included for completeness of the algorithm. If this approach is extended to cases of unknown model structure, other criteria for the two levels of segmentation will also have to be evaluated over and above the iterative loop that is included.
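A minimal sketch of the concatenation step. It only merges contiguous, identically labelled segments into episodes of the form [label, n_b, n_e]; the re-checking of the merging thresholds (e_th, e_MSE) on the combined segment, described in the Discussion section, is omitted here.

```python
def concatenate_episodes(segments):
    """Merge contiguous segments sharing the same (quality, disturbance) label.
    `segments` is an ordered list of (label, n_b, n_e) tuples with n_e exclusive;
    the result is a list of [label, n_b, n_e] episodes."""
    episodes = []
    for label, n_b, n_e in segments:
        if episodes and episodes[-1][0] == label and episodes[-1][2] == n_b:
            episodes[-1][2] = n_e                 # extend the previous episode
        else:
            episodes.append([label, n_b, n_e])
    return episodes

# Two adjacent [HQ, ND] segments collapse into a single longer episode.
segs = [(("LQ", "D"), 0, 2048), (("HQ", "ND"), 2048, 3072), (("HQ", "ND"), 3072, 4096)]
print(concatenate_episodes(segs))
```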

Case studies

In this section, the validity of the proposed approach is evaluated on multiple case studies. For one case study, a detailed analysis that demonstrates the working of the algorithm and the impact of the tuning parameters on the results is provided. A summary of the results is then presented to show the generality of the proposed method.

Data generation procedure

We benchmark the approach on a system whose parameters are given in Table 2. Data is generated based on the details provided in Table 3. The PE input is generated as a pseudo-random binary signal (PRBS). The NPE input is generated as a sum of n_p/2 + 1 sinusoids. The disturbance input is randomly generated. The system is then simulated using the generated inputs, and measurement noise is added as e(k) ~ N(0, σ²) for all k = 1, 2, 3, ..., such that the signal-to-noise ratio (SNR) is maintained at 10. A plot of this data is shown in Figure 8.

Table 2: System parameters used in the case study (System 1)

Coefficient index   Process model   Disturbance model
g(0)                0.935           0.900
g(1)                0.753           0.730
g(2)                0.486           0.460
g(3)                0.346           0.350
g(4)                0.312           0.280
g(5)                0.276           0.260
g(6)                0.213           0.230
g(7)                0.176           0.210
g(8)                0.098           0.160
g(9)                0.032           0.120
g(10)               0.017           0.100
g(11)               0.010           0.010
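A minimal sketch of this data-generation step, assuming the Table 2 coefficients, a ±1 random binary sequence as a stand-in for the PRBS, arbitrary multisine frequencies, and SNR interpreted as a variance ratio; none of these choices are specified exactly in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
g_p = np.array([0.935, 0.753, 0.486, 0.346, 0.312, 0.276,
                0.213, 0.176, 0.098, 0.032, 0.017, 0.010])   # Table 2, process model
g_d = np.array([0.900, 0.730, 0.460, 0.350, 0.280, 0.260,
                0.230, 0.210, 0.160, 0.120, 0.100, 0.010])   # Table 2, disturbance model

def pe_input(n):
    """PE input: random binary sequence, a stand-in for a full-band PRBS."""
    return rng.choice([-1.0, 1.0], n)

def npe_input(n, n_p=12):
    """NPE input: sum of n_p/2 + 1 sinusoids at arbitrary frequencies."""
    k = np.arange(n)
    return sum(np.sin(2 * np.pi * f * k) for f in rng.uniform(0.02, 0.45, n_p // 2 + 1))

def simulate(u, d=None, snr=10.0):
    """FIR simulation with an optional disturbance and noise at the given SNR."""
    y = np.convolve(u, g_p)[: len(u)]
    if d is not None:
        y = y + np.convolve(d, g_d)[: len(u)]
    e = rng.normal(0.0, np.sqrt(np.var(y) / snr), len(u))    # SNR as a variance ratio
    return y + e

# Example: one 2048-sample block driven by the PE input without disturbance, and
# one driven by the NPE input with a random disturbance, as in Table 3.
y_a = simulate(pe_input(2048))
y_b = simulate(npe_input(2048), d=rng.standard_normal(2048))
```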

Table 3: Actual segment details of the simulated data

Index   Description                                   Range in actual data
1       Data of low quality with disturbance          (1, 2048)
2       Data of high quality without disturbance      (2049, 4096)
3       Data of low quality without disturbance       (4097, 6144)
4       Data of high quality without disturbance      (6145, 8192)
5       Data of high quality with disturbance         (8193, 10240)


Table 4: Meta parameters for hierarchical classification

Parameter                                Value
Minimum segment length                   2048
Threshold for R_uu                       20
Minimum length for interval halving      80
Merging threshold for segmentation 1     0
Threshold for MSE                        0.1
Merging threshold for segmentation 2     0.12

Results

The hierarchical segmentation algorithm is applied to the data with the meta parameters shown in Table 4. The result of the first level of segmentation is shown in Table 5. Segmentation 1 identified 5 regions in the data. Notice that actual regions 4 and 5 both contain high-quality data, but one has no disturbance while the other includes disturbance effects; thus there are only 4 distinct regions with respect to data quality. The additional segment identified by the algorithm contains the interface between the low-quality and high-quality data. Further, it can be seen that the actual ranges of these segments and the identified ranges are quite close.

The final result of the hierarchical classification is the assignment of class labels. The actual class labels and their ranges, along with the identified ranges, are shown in Table 6, and Figure 9 depicts this result pictorially. At the end of segmentation 2, 7 segments are identified, compared to the actual 5 segments. Figure 10 shows the actual ranges and the identified partitions. It can be seen that one extra segment has been identified for the [LQ,D] label and one extra segment for the [HQ,D] label. It can also be seen that the second range in the [LQ,D] class mostly belongs to the [LQ,ND] class. This misclassification is not of real concern, as these segments would be discarded from the identification exercise. The additional segment identified for the [HQ,D] label is a result of the misclassification in segmentation 1: the additional points added to the HQ-labeled data by segmentation 1 are almost always identified as a region of label [HQ,D]. This can be clearly seen in Figure 11, which shows more detail near the region of misclassification. Thus one can be reasonably assured that the data segments identified as [HQ,ND] are the best available data segments for identification.

Table 5: Segmentation 1 result - segmentation with respect to data quality

Segment index   Actual range     Identified range
1               (1, 2048)        (1, 2000)
2               (2049, 4096)     (2001, 2240), (2241, 4160)
3               (4097, 6144)     (4161, 6080)
4               (6145, 10240)    (6081, 10240)

Table 6: Segmentation 2 result - identification of the presence of disturbance

Class of segment   Actual range                  Identified range
[LQ,D]             (1, 2048)                     (1, 2000), (4161, 4183)
[LQ,ND]            (4096, 6144)                  (4183, 6080)
[HQ,D]             (8193, 10240)                 (2001, 2050), (8168, 10240)
[HQ,ND]            (2049, 4096), (6145, 8192)    (2051, 4160), (6081, 8167)

A comparison of the parameters estimated using the [HQ,ND] data with the actual parameters is shown in Figure 12, which demonstrates reliable identification. To benchmark models identified using data from the different segments, the mean squared errors between the responses of the true and estimated systems were computed for 2048 points generated using a full-band PRBS signal; the results are tabulated in Table 7. It can be seen that the MSE for the model from the [HQ,ND] segment is the lowest, suggesting that this model provides the best predictions, while the presence of disturbance leads to biased estimates. In this case, the model identified using the [LQ,ND] data also provides reasonable estimates, which suggests that it might be a good strategy to combine all the regions with no disturbance to build a model. There are, however, several important points to note. We have already shown in the motivation section that in some cases using segments with both high- and low-quality data can lead to poor estimates. Additionally, if the underlying process model changes over time in the historical data being analyzed, combining all regions with no disturbance might result in an overall poor model. Finally, this result should not be interpreted to mean that segmentation of historical data can be performed by looking only for the presence or absence of disturbance. As discussed before, identifying high-quality data segments always has to precede, in a hierarchical fashion, the segmentation that identifies the presence/absence of disturbance.

Table 7: MSE for the models identified from each category, based on simulation with a known input

Class of segment   MSE
(HQ,ND)            1.0040
(HQ,D)             1.1567
(LQ,ND)            1.0572
(LQ,D)             3.1015

The results presented so far are for optimized meta parameters. To study the effect of these parameters on the results, several trials were performed; the results are summarized in Table 8. To quantitatively measure the performance, a metric called the percentage of usable points lost is used, calculated as

\%Loss = \frac{n_{Missed}}{n_{Total}} \times 100

Here n_Missed is the number of data points misclassified out of the segments with high-quality data without disturbance, and n_Total is the total number of data points belonging to those segments. From the table it can be seen that in the worst case only about 3% of the usable data is lost. It can also be seen that the number of segments and their ranges are determined quite accurately in almost all cases. One general observation is that points are misclassified between [HQ,D] and [HQ,ND] and between [LQ,D] and [LQ,ND]; the second category is of little importance as far as the main goals of the hierarchical segmentation are concerned.

To test the generality of the approach, multiple simulations were performed with different systems. The best segmentation results are presented in Table 9 for five SISO systems with varying numbers of FIR coefficients. Systems 1, 2 and 3 are constructed FIR models; these are constructed to show that the algorithm can detect all the segments present in the given data. The segmentation result for System 2 is given in Figure 13. Systems 4 and 5 are based on FIR models estimated from the iddata1 and dry2 dryer datasets (available in the System Identification Toolbox of MATLAB), with 35 and 50 coefficients respectively. As can be seen from the table, the number of segments present in the data is identified quite well. One can also see that the worst-case percentage of usable points lost is 2.77%. This shows the general usefulness of the proposed approach in culling out useful segments of data from historical records from the viewpoint of system identification.

Discussion

In this section, the effect of the tuning parameters on the segmentation is discussed. In the proposed algorithm, the following parameters are used for adjusting the performance:

- Minimum length of the segment, n_minL
- Threshold for merging in segmentation 1, e_th
- Threshold for interval halving in segmentation 2, ε_MSE
- Threshold for merging identified segments in segmentation 2, e_MSE

The criterion of segmentation for identifying data quality is tied to the number of parameters in the model to be estimated: segmentation 1 is performed based on the criterion that the rank of the input covariance matrix has to be at least equal to the number of parameters to be estimated. The threshold for the rank of the input covariance matrix, the number of parameters to be estimated, and the minimum length of data are related to each other. The smallest segment identified by interval halving, n_minL, has to be at least 3n_p, where n_p is the number of parameters to be estimated, in order to obtain good estimates of the lagged input covariance matrix; thus n_minL ≥ 3n_p is recommended. Since the algorithm works by halving intervals recursively, using any value in (N/2^(a+1), N/2^a), where N is the length of the data, results in identification of the same segments.
With an increase in the value of n_minL, the number of misclassified points in segmentation 1 increases; using a minimum segment length between 3n_p and 6n_p provides good segmentation. Merging after segmentation 1 is performed by concatenating contiguous segments if their corresponding input covariance matrices have the same rank and the combined segment does not change the rank of the input covariance matrix. Thus the merging threshold e_th for segmentation 1 is based on the difference between the rank of the input covariance matrix of the combined segment and that of each individual segment. Since including even a few high-quality data points results in the combined data having a higher input covariance rank, e_th = 0 is chosen. Data points at the boundary between LQ and HQ regions contain the discontinuity introduced by the concatenation of the various segments. This discontinuity acts like a step input and results in the segment containing the interface being misclassified as HQ data; i.e., a few LQ data points are misclassified as HQ data points because of the way the data set is constructed. If the data were continuous and the quality changed gradually, then the regions would be properly identified without misclassification. The number of such misclassified data points is at most the product of the number of segments in the data and the minimum segment length n_minL.

The presence of disturbance introduces a plant-model mismatch and hence results in a higher value of the MSE: fitting a process model alone to regions with disturbance results in an MSE that is at least 10 times the MSE computed for data without disturbance. Interval halving is performed based on the MSE of the estimated model predictions. Since the model order is assumed to be known, data without disturbance will result in smaller MSE values, so interval halving is performed until the MSE is less than the specified threshold. The threshold for the MSE is optimized to ε_MSE = 0.1 in the case study. This threshold is the important tuning parameter in the segmentation algorithm; use of a higher or lower value leads to misclassification, and typically smaller values, ε_MSE ≤ 0.10, lead to reasonable classification. The merging criterion for segmentation 2 is chosen to be the sum of squared prediction errors: the merged segment is used to estimate the model again, and the corresponding MSE is used for further merging.
Since merging subsequent segments may result in a higher MSE, a merging threshold of e_MSE = 0.12 is chosen. The choice of the merging threshold e_MSE is a function of the segmentation threshold ε_MSE; choosing a higher value for e_MSE results in data segments with disturbance being merged with data segments without disturbance and may thus lead to misclassification. A different value of n_minL can be used in segmentation 2: since FIR estimation requires a minimum length of data for estimating the model, using n_minL in (4n_p, 6n_p) gives proper estimates of the parameters. In general, a reduction in ε_MSE will result in misclassification of regions without disturbance as regions with disturbance, and vice versa, but close to the optimal ε_MSE an increase or decrease has negligible effect on the identified segments. With the example systems used, the value of ε_MSE had to be increased to values such as 1.5 for System 3 and 2.5 for System 4 in order to obtain reasonable segregation. However, a threshold that is a fraction (5%) of the worst-case MSE seen during the segmentation process might generally provide good results. The MSE metric could also be augmented with whiteness tests to confirm that a good model has been developed, and MSE normalized by the mean of the output could be a more robust metric with some degree of universality.

Alternate options for the proposed algorithm

Interval halving can be performed in two different ways. Given the data, interval halving proceeds recursively until the stopping criterion is met, and the first segment is classified. In the next step, either the whole of the remaining data or only the remaining segment from the previous split can be considered for further interval halving. In the proposed work, the latter form of interval halving is used. These methods have their own merits and demerits. The main advantage of the latter method is that the length of each segment will always be N/2^a, where a ∈ Z_{≥0}. The same property can also be considered a disadvantage: if the boundary separating two different segments falls somewhere in between, say in (N/8, N/16), then the algorithm proceeds to perform interval halving down to N/16 before identifying the boundary. Thus this method leads to a larger number of segments being identified. On the other hand, the former method may identify the boundary with better accuracy; this needs to be explored in the future.

As an alternative to the MSE, the second segmentation can be performed using any information criterion (any form of AIC or BIC), the Hurst exponent, a whiteness test, etc. However, using the Hurst exponent introduces an additional tuning parameter, the window length n. The Akaike Information Criterion (AIC) is a function of the MSE and the number of parameters to be estimated; since the number of parameters to be estimated is the same throughout the segmentation, AIC becomes a function of the MSE alone for a given model structure. However, once the hierarchical segmentation is completed and a model identified, AIC might be used in an outer loop to find optimal model structures. This could be used to remove the assumption of a known model structure that is employed in this paper. While extending the algorithm to MIMO systems, the possibility of correlated input moves, and a method for handling them, have to be addressed; such segments can be removed from the estimation exercise. The number of parameters to be identified in the MIMO case increases, and correlated input moves directly affect the rank of the input correlation matrix. Hence, the robustness of the proposed algorithm when extended to MIMO systems has to be thoroughly tested.

MIMO extension

MIMO systems pose unique challenges when historical data has to be used for model identification. These challenges include collinearity in the input variables, correlated input moves, and data regions where only a few of the input variables have significant variation while the other inputs vary minimally. Both collinearity and correlated input moves require further analysis in addition to data quality. Collinearity of the inputs needs to be addressed by sub-selecting the inputs that are independent, by means of dimension reduction. Correlated input moves result in biased estimates of the parameters.
between input variables have to be analyzed in addition to the rank of the lagged correlation matrix. When historical data contains regions where only a few of the input variables have signicant variations, this calls for classifying data quality in segmentation 1 using multiple classes. These classes are, high, low and intermediate quality data segments. Intermediate quality segments are the segments where rank of the lagged input correlation matrix is high for a few input variables. Using intermediate quality data segments for model identication may not be appropriate for the elements corresponding to input variables with minimal variation. Patel, 21 attempted extension of segmentation technique to MIMO systems. In the work, author proposes two dierent model structures for handling the regions where only a few inputs have signicant variations. Thus segmentation 2 may also need to be altered to handle MIMO data containing such regions. Though MSE as a criterion for segmentation 2 can be used for MIMO identication, model structures may have to be explored to optimally utilize the segments of intermediate quality. Generalized hierarchical classication structure for MIMO system is given in Figure 14. With the following simplifying assumptions the proposed algorithm can be extended to MIMO system with minimal alterations. 1. The inputs are independent of each other 2. All inputs are either of high quality or of low quality in a given range The 2×2 system given in Equations 4-6 is identied as FIR model with 30 coecients using full band PRBS input signal. This FIR model consists of 123 parameters to be estimated as opposed to 11 parameters in the system equations. This FIR model is used for

The 2×2 system given in Equations 4-6 is identified as an FIR model with 30 coefficients using a full-band PRBS input signal. This FIR model consists of 123 parameters to be estimated, as opposed to 11 parameters in the system equations, and is used for data generation. The simulated data contains 5 segments with various attributes.

$$
y_p(z) = \begin{bmatrix} \dfrac{1}{z^2+0.2z+0.01} & \dfrac{2}{z^2+0.11z+0.7} \\[8pt] \dfrac{2}{z^2+0.2z+0.01} & \dfrac{1}{z^2+0.1z+0.2} \end{bmatrix} u(z) \qquad (4)
$$

$$
y_d(z) = \begin{bmatrix} \dfrac{1}{z^2+0.3z+0.03} \\[8pt] \dfrac{1}{z^3+0.1z^2+0.03z+0.2} \end{bmatrix} u(z) \qquad (5)
$$

$$
y(k) = y_p(k) + y_d(k) + e(k) \qquad (6)
$$

The proposed algorithm is applied to this data using the meta parameters listed in Table 10. The model for segmentation 2 is assumed to be an FIR model with 12 coefficients; hence, the MSE threshold used in segmentation 2 needed careful tuning to obtain a reasonable segmentation. The algorithm was able to identify all the segments contained in the data, with a misclassification of 1.05%. The results for the MIMO segmentation are given in Table 11 and Figure 15.
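To make the roles of the ΔMSE threshold and the minimum interval length concrete, the sketch below gives one illustrative reading of MSE-driven interval halving (not the authors' implementation; the homogeneity test and helper names are assumptions, while the threshold of 0.1 and minimum length of 80 mirror Table 10): a record is recursively halved until each piece either passes an FIR-fit MSE test or reaches the minimum length.

```python
# Hedged sketch of an interval-halving split driven by an FIR-fit MSE test.
# Illustrative reading only; the test, thresholds and helper names are assumptions.
import numpy as np

def fir_fit_mse(u, y, n_coeff=12):
    """Least-squares FIR fit on a segment; returns the mean squared residual."""
    Phi = np.array([u[k - n_coeff:k][::-1] for k in range(n_coeff, len(u))])
    theta, *_ = np.linalg.lstsq(Phi, y[n_coeff:], rcond=None)
    return float(np.mean((y[n_coeff:] - Phi @ theta) ** 2))

def halve(u, y, lo, hi, mse_threshold=0.1, n_min=80, segments=None):
    """Recursively halve [lo, hi) until the FIR fit is acceptable or the piece is too short."""
    if segments is None:
        segments = []
    n = hi - lo
    mse = fir_fit_mse(u[lo:hi], y[lo:hi])
    if mse <= mse_threshold or n <= n_min:
        segments.append((lo, hi, mse))        # accept; adjacent pieces could be merged later
        return segments
    mid = lo + n // 2
    halve(u, y, lo, mid, mse_threshold, n_min, segments)
    halve(u, y, mid, hi, mse_threshold, n_min, segments)
    return segments

# Toy record: an FIR response with a slowly drifting disturbance added in the second half.
rng = np.random.default_rng(2)
u = np.sign(rng.standard_normal(4000))
h = 0.8 ** np.arange(12)
y = np.convolve(u, h)[: len(u)] + 0.05 * rng.standard_normal(len(u))
y[2000:] += np.cumsum(0.02 * rng.standard_normal(2000))   # slow disturbance
for lo, hi, mse in halve(u, y, 0, len(u)):
    print(f"[{lo:4d},{hi:4d})  MSE = {mse:.3f}")
```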

Conclusion and Future work

In this paper, an algorithm to identify the most informative data in historical records is proposed. The efficacy of the proposed algorithm is demonstrated using simulation studies. The method was able to identify the number and ranges of segments fairly accurately and to label them with high precision. Though the algorithm is described using FIR models, it can easily be extended to other model structures. The possibility of automating the algorithm to identify the order of the underlying system and the presence of disturbance simultaneously needs to be explored; this will help in applying the proposed segmentation method to real systems. Extension of the proposed algorithm to MIMO systems also needs to be pursued. MIMO systems pose challenges such as correlated input signals, ill-conditioning, and insufficient order of excitation, all of which need to be carefully addressed. Once the algorithm is extended to MIMO systems, the performance of the proposed algorithm on industrial historical data has to be evaluated, and the use of the identified segments in model identification and control will have to be benchmarked.


References

(1) Lee, J. H. Model predictive control: Review of the three decades of development. Int. J. Control Autom. 2011, 9, 415–424.

(2) Morari, M.; Lee, J. H. Model predictive control: past, present and future. Comput. Chem. Eng. 1999, 23, 667–682.

(3) Fukushima, H.; Kim, T. H.; Sugie, T. Adaptive model predictive control for a class of constrained linear systems based on the comparison model. Automatica 2007, 43, 301–308.

(4) Genceli, H.; Nikolaou, M. New Approach to Constrained Predictive Control with Simultaneous Model Identification. AIChE J. 1996, 42, 2857–2868.

(5) Harding, J. A.; Shahbaz, M.; Srinivas; Kusiak, A. Data Mining in Manufacturing: A Review. J. Eng. Ind. 2006, 128, 969.

(6) Maurya, M. R.; Paritosh, P. K.; Rengaswamy, R.; Venkatasubramanian, V. A framework for on-line trend extraction and fault diagnosis. Eng. Appl. Artif. Intell. 2010, 23, 950–960.

(7) Villez, K.; Venkatasubramanian, V.; Rengaswamy, R. Generalized shape constrained spline fitting for qualitative analysis of trends. Comput. Chem. Eng. 2013, 58, 116–134.

(8) Das, L.; Srinivasan, B.; Rengaswamy, R. A novel framework for integrating data mining with control loop performance assessment. AIChE J. 2016, 62, 146–165.

(9) Peretzki, D.; Isaksson, A. J.; Bittencourt, A. C.; Forsman, K. Data mining of historic data for process identification. AIChE Annu. Meet., Conf. Proc. 2011.

(10) Shardt, Y. A.; Huang, B. Data quality assessment of routine operating data for process identification. Comput. Chem. Eng. 2013, 55, 19–27.

(11) Wang, J.; Su, J.; Zhao, Y.; Zhou, D. Searching historical data segments for process identification in feedback control loops. Comput. Chem. Eng. 2018, 112, 6–16.

(12) Abonyi, J.; Feil, B.; Nemeth, S.; Arva, P. Modified Gath-Geva clustering for fuzzy segmentation of multivariate time-series. Fuzzy Sets Syst. 2005, 149, 39–56.

(13) Abonyi, J.; Feil, B. Cluster Analysis for Data Mining and System Identification; Springer Science & Business Media, 2007.

(14) Dobos, L.; Abonyi, J. Fisher information matrix based time-series segmentation of process data. Chem. Eng. Sci. 2013, 101, 99–108.

(15) Ljung, L., Ed. System Identification (2nd Ed.): Theory for the User; Prentice Hall: Upper Saddle River, NJ, USA, 1999.

(16) Read, J.; Bielza, C.; Larranaga, P. Multi-Dimensional Classification with Super-Classes. IEEE Trans. Knowl. Data Eng. 2014, 26, 1720–1733.

(17) Keogh, E.; Chu, S.; Hart, D.; Pazzani, M. An online algorithm for segmenting time series. Proceedings 2001 IEEE International Conference on Data Mining, 2001; pp 289–296.

(18) Dash, S.; Maurya, M. R.; Venkatasubramanian, V.; Rengaswamy, R. A novel interval-halving framework for automated identification of process trends. AIChE J. 2004, 50, 149–162.

(19) Andrews, D. W. K. Tests for Parameter Instability and Structural Change With Unknown Change Point. Econometrica 1993, 61, 821–856.

(20) Srinivasan, B.; Spinner, T.; Rengaswamy, R. Control Loop Performance Assessment Using Detrended Fluctuation Analysis (DFA). Automatica 2012, 48, 1359–1363.

(21) Patel, A. Data Mining of Process Data in Multivariable Systems. 2016; http://www.diva-portal.se/smash/get/diva2:1072526/FULLTEXT01.pdf.

Graphical TOC Entry


Figure 7: Detailed flowchart of the proposed algorithm. The following notations are used to represent labels: HQ - high quality data, LQ - low quality data, D - disturbance is present, ND - no disturbance is present.

Figure 8: Data with each region marked. (Plot of output y1 versus time in seconds; the simulated record contains regions labeled [LQ,D], [HQ,ND], [LQ,ND], [HQ,ND], and [HQ,D].)

Figure 9: Segmentation 1 - Results of segmentation based on data quality. Regions shaded in red are regions with high quality data and regions shaded in blue are regions with low quality data. The axis tick marks are placed at the end of each segment with corresponding labels; identified labels are marked on top of the plot. (Plot of output y1 versus time in seconds; panel title "Segmentation based on the data quality index".)


Figure 10: Segmentation 2 - Results of segmentation based on disturbance. The region shaded in dark blue is the (LQ,D) segment, regions shaded in red are (HQ,ND) segments, the region shaded in orange is the (LQ,ND) segment, and the region shaded in blue is the (HQ,D) segment.

Figure 11: Combined result after the completion of the two segmentation algorithms. Class 1 - segment with (HQ,ND), Class 2 - segment with (LQ,ND), Class 3 - segment with (HQ,D), Class 4 - segment with (LQ,D). (Two panels of output y1 versus sample index: actual segments and identified segments in the simulated data.)

Figure 12: Comparison of actual parameters with estimated parameters using segments identified by the algorithm. (Plot of parameter magnitude versus parameter index for the actual system parameters, the parameters estimated using (HQ,ND) data, and the parameters estimated using (LQ,ND) data.)

Figure 13: Segmentation result for case study 2 with 20 segments. (Two panels of output y1 versus sample index: actual and identified segments in the simulated data.)

Figure 14: Hierarchical classification for MIMO systems.

Table 8: Consolidated result for case study with various tuning parameters. (For each case - actual data; changing both nminL1 and nminL2; changing only nminL2; effect of ΔMSE; effect of overestimation; and effect of underestimation - the table lists nminL1, nminL2, ΔMSE, the actual number of segments, the number of FIR coefficients, the identified [P,D], [P,ND], [NP,ND], and [NP,D] ranges, and the percentage of usable data points lost, which ranges from 0.66% to 3.03%.)

Table 9: Best segregation results for five different example systems.

System index | Actual number of segments | Number of FIR coefficients used | Identified number of segments | Percentage of usable data points missed
1 | 10 | 12 | 15 | 1.65
2 | 20 | 20 | 22 | 2.77
3 | 10 | 27 | 14 | 0.64
4 | 10 | 35 | 12 | 0
5 | 19 | 50 | 19 | 0.26

Table 10: Meta parameters for MIMO segmentation case study.

Parameter | Value
Minimum length for interval halving, nminL | 80
Threshold for rank of Ruu | 30
Threshold for merging in segmentation 1 | 0
Threshold for MSE, ΔMSE | 0.1
Threshold for merging in segmentation 2 | 0.15

Table 11: MIMO case - Results.

Class | Actual range | Identified range
[HQ,ND] | (1,2048) | (1,2030)
[LQ,D] | (2049,6144) | (2081,6118)
[LQ,ND] | (6145,8192) | (6119,8160)
[HQ,D] | (8193,10240) | (2031,2080), (8161,10240)

Figure 15: MIMO segmentation result. Class 1 - segment with (HQ,ND), Class 2 - segment with (LQ,ND), Class 3 - segment with (HQ,D), Class 4 - segment with (LQ,D). (Four panels versus sample index: actual and identified segments for outputs y1 and y2 in the simulated data.)