Preprocessing of Alarm Data for Data Mining - ACS Publications


Process Systems Engineering

Preprocessing of alarm data for data mining Zahra Mannani, Iman Izadi, and Nasser Ghadiri Ind. Eng. Chem. Res., Just Accepted Manuscript • DOI: 10.1021/acs.iecr.8b05955 • Publication Date (Web): 27 Feb 2019 Downloaded from http://pubs.acs.org on March 1, 2019


Industrial & Engineering Chemistry Research

Preprocessing of alarm data for data mining

Zahra Mannani, Iman Izadi,∗ and Nasser Ghadiri

Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran

Corresponding Author E-mail: [email protected]

Abstract

Many industries, including process industry, face an increased number of alarms every day. This is due to advanced computer-based monitoring and control technologies that are widely available in all industrial plants. On the other hand, data mining, as a method of finding patterns in data, has been widely used to discover patterns and relationships in alarm data, in hope of reducing the volume of alarms and operators’ workload. One of the first steps in data mining is to prepare and clean-up raw data for better mining, also known as preprocessing. In this paper, we focus on preprocessing of alarm data and investigate the steps required for data preparation. Two steps, namely removing chattering alarms and reconstruction of missing alarms, are more challenging. For chattering alarms, a number of algorithms are proposed with a discussion on the time-frame that should be selected for removing chattering alarms. As for reconstruction of missing alarms, two methods are presented using information from the same alarm tag, or other related alarms. A case study shows the efficiency of the proposed methods.

Keywords: Alarm management, Preprocessing, Data mining, Chattering alarms, Data imputation


1 Introduction

The alarm system, as one of the protection layers in process industries, continuously monitors the plant. Once an anomaly (often a process variable exceeding its predefined limits) is detected, an alarm notifies the operator to take appropriate corrective action. In recent years, due to technological developments and advanced distributed control systems (DCS), defining and configuring software alarms for any part of a process has become easy and cheap. This, as expected, has led to a large number of alarms being configured in plants. As a result, process industries encounter problems with their alarm systems on a daily basis. [1]

Improvement of alarm systems has been a main concern in process and other industries. Many products and services, collectively known as Alarm Management, have been developed to aid engineers and operators in dealing with their alarm systems. It has also been the focus of researchers, mainly in the last decade. Moreover, alarm standards have been developed by technical organizations. [2,3] Chattering alarms, alarm floods, nuisance and redundant alarms, and just the sheer number of alarms generated in a plant are some of the challenges faced in industry every day. The main focus of analyzing and improving alarm systems is to reduce the volume of alarms and present more relevant alarms to the operator. [3] This can generally bring the alarm system within the limits defined by alarm standards. Many resources and methods have been suggested over the past decades to reach this goal [4], including alarm processing techniques (e.g., filters [5], deadbands and delay timers [6], and time-deadbands [7]), graphical tools [8,9], correlation methods [10], cause and effect analysis [11], signed digraphs [12], and root-cause analysis [13].

Data mining, a general process of discovering patterns in data [14], has also been used for analysis of alarm systems since as early as 2009. [15] The storage of alarm messages in almost all plants that use DCS has facilitated the use of data mining techniques by providing huge volumes of historical alarm data. Data mining techniques find patterns and similarities among different alarms in a system, which can help engineers locate similar or duplicate




alarms, redesign the alarm system, and eventually improve its quality. Data mining is particularly useful in analyzing alarm floods, where a number of alarms are activated in a short period of time. [9,16,17] Different data mining techniques have been studied and used for this purpose, including the generalized sequential patterns (GSP) algorithm (a subcategory of sequential pattern mining) [15]; fuzzy association rules [18]; the apriori algorithm [19]; and a modified PrefixSpan [20]. For instance, Hu et al. used a method based on the apriori algorithm, a well-known data mining algorithm, to discover relationships between alarms and the state of the process. [19] In another work, Hu et al. utilized an itemset mining method to detect frequent alarm patterns during alarm floods. [9] Other works include a general weight-based multi-state sequential algorithm to find temporal patterns between alarms, which is particularly useful in detecting successive alarms. [10] From a different perspective, weighted fuzzy association rules were developed by Wang et al. to identify cause and effect relationships of frequent alarms. [18] Although the proposed method reduces the search region, it requires process data for implementation, which is a limitation. Büttner et al. proposed a web-based application to reduce alarm floods and identify the root cause of turmoil in a process, using machine learning concepts. [21] Their method uses the MMHC algorithm and maximum likelihood estimation, and the results are delivered to operators through a specifically designed human interface. The advantage of this method is that it can use the knowledge and expertise of plant and field experts in addition to the automatic data analysis.

Preprocessing (preparing raw data for mining) is one of the main steps of any data mining method. Regardless of the type of data, the mining technique employed, or the selected model, the data has to be prepared for better and more conclusive mining. Alarm data is no exception. Although preprocessing accounts for between 10 and 60 percent of data mining efforts [14], to the best of our knowledge, preprocessing of alarm data has not been studied before. In this paper, we approach this subject and try to provide a complete




and thorough guideline for preprocessing of alarm data. The basic steps of preprocessing of alarm data are discussed in this paper. However, two steps are particularly important and require detailed investigation: alarm chattering and missing data. The former has been briefly discussed in some research reports. [22] Here, we additionally discuss issues such as the selection of the time-frame and the amount of data to be removed. The difference between point-based data and interval-based data is also important and should be considered when chattering alarms are removed. Another contribution of the paper is the study of missing alarms and how they are to be replaced. Two methods are presented to reconstruct missing alarm messages (also known as data imputation). One method uses the historical information from the same tag to suggest a median alarm duration. Another, more involved method tries to find patterns between the missing alarm tag and other alarms. These patterns are then used to reconstruct the missing data based on the data from other messages. A case study is presented to show the detailed steps of preprocessing of alarm data.

The rest of this paper is organized as follows. In Section 2, basic concepts of data mining and the CRISP_DM method are reviewed. Then, early steps of data mining in the context of alarm systems are presented. Section 3 is dedicated to the preliminary stages of alarm data preprocessing. In Section 4, chattering alarms are discussed. In Section 5, the concept of missing alarms and their imputation is studied. In Section 6, a case study is presented. Finally, Section 7 concludes the paper.

2 Preparation for alarm data mining

Data mining is the process of discovering useful patterns in large data collections. [14] With technological advances in computers and databases, the amount of information and data stored in industrial settings is ever increasing. Applying data mining in its appropriate and standard form is important in any field, whereas misuse and incorrect application of data mining can cause erroneous results. Naturally, data mining methods such as clustering and pattern mining are used to analyze alarm data as well. However, in many industrial sectors, this process is conducted and implemented manually. Manual implementation is not only time consuming, inefficient and expensive, but also impractical for large datasets.

2.1 The CRISP_DM method of data mining

It is common practice to use a standard data mining framework when working with data. The Cross-Industry Standard Process for Data Mining (CRISP_DM) is a nonproprietary standard process that helps the data miner investigate the problems and difficulties of the procedure and prepare the data for efficient and conclusive mining. This process consists of six steps: [14]

1. Business/Research Understanding: First, the field expert should precisely set out the goals, needs and technical concerns to the data miner and provide him/her with the necessary information. This will allow the data miner to explore the data mining environment and the purpose of the project.

2. Data Understanding: This step, which is one of the most important parts of the data mining process, ensures that the work starts with adequate knowledge of the data. If not, it is still possible to carry out the mining process to obtain results and reach the evaluation stages, but the credibility or significance of the results is debatable.

3. Data Preparation: Databases often contain raw data which may not be suitable for mining and requires preparation and processing. This step is often regarded as the most tedious and time-consuming step of the procedure. [14]

4. Modeling: In this step, considering data conditions and other assumptions, an appropriate model (from many different available ones) has to be chosen. Then, the parameters of the model are defined and calculated to reach an acceptable result.




5. Evaluation: The outcome of the previous stage is one or more models. Before using and implementing their results, these models should be evaluated in terms of quality, effectiveness and how well they meet the set goals.

6. Deployment: Finding a model does not necessarily imply project completion, as it has to be deployed to the benefit of the specific industry. It can be used in a variety of ways, such as presenting a report, supporting managerial decisions, detecting anomalies and faults, parallel data mining, or reconfiguring settings.

It should be emphasized that the CRISP_DM process is a life cycle model and not a one-time operation. Similarly, as stated in the ISA 18.2 standard, alarm management is a life cycle as well. [2] Therefore, the steps of the procedure, including preprocessing, sometimes have to be repeated over multiple iterations or as part of scheduled maintenance. In the rest of this section, we quickly review the first two steps of the CRISP_DM method in the context of alarm management and then focus on the preprocessing of alarm data as the main contribution of the paper.

2.2 Research understanding: Alarm systems

According to the ISA 18.2 standard, an alarm system is defined as “a set of hardware and software that detects, communicates and informs the operator of alarms”. [2] In an industrial environment, alarms can be divided into two main categories: process alarms and system alarms. Process alarms, which account for most of the alarms, are determined based on process variables; examples are low pressure or high temperature alarms. System alarms, also known as digital or discrete alarms, are not generated based on process variables but by other devices; examples are network failure or pump malfunction alarms. The way alarms are defined in a plant depends on its alarm philosophy and the specific brand of DCS. In any case, the number of configured alarms in a plant is often higher than the number of process variables.

Alarm systems are not necessarily efficiently designed during the commissioning phase. They need to be monitored and maintained throughout the operation of a plant. This is known as the Alarm Life Cycle in the ISA 18.2 standard. [2] If alarms are not properly designed or maintained, false and/or nuisance alarms are observed. Nuisance alarms either provide no new information or require no action from the operator. They are one of the main contributors to alarm floods. [8] Extensive research is available on the analysis and correction of alarm floods, much of which involves data mining. In any case, before proceeding with the next steps of data mining, one has to become familiar with the way alarms are defined and triggered in the target plant.

2.3 Data understanding: Alarm data

Each alarm is, almost always, a line of text which is shown to the operator and also stored for future use. Alarms can be stored as simple flat files, or parsed and sent to a relational alarm database which stores them as several columns. [8] This database, however, does not contain alarms only, but also many other events that occur in the plant. Hence, it is often called the Alarm and Event (A&E) database. In addition to alarms, other messages (referred to as rows or records in a database) that might be stored in the A&E database are as follows:

• Return-to-normal (RTN): when an alarm is cleared.
• Acknowledgment: when the operator acknowledges an alarm.
• Operator action: when the operator changes something in the plant, e.g., increasing a setpoint or disabling an alarm.
• System event: when something in the DCS changes, e.g., an operator logs into the system.

Clearly, some of these messages are not important for data mining and are removed during preprocessing, as discussed later.



Depending on the brand of DCS, process structure, plant hierarchy, method of storage, and other factors, alarm messages can vary between plants. Samples of alarm messages from three different DCS implementations are illustrated in Figure 1. However, some fields of information (referred to as columns in databases) are commonly present in all alarm messages: [8]

• Time stamp: the time when the alarm is triggered in the system.
• Tag name: often the process variable name associated with the alarm.
• Alarm set point: the (high or low) alarm limit defined for the process variable.
• Tag description: description of the process variable associated with the alarm.
• Alarm type (or alarm identifier): a description of the specific type of alarm (e.g., high, low, or rate-of-change).
• Priority: severity and importance of the alarm (e.g., high, medium, emergency).
• Alarm state: the state of an alarm (active or inactive) or acknowledgment of an alarm by the operator.
• Value: the value of the process variable when the alarm occurs.

The number of columns and their names may vary between databases. At this stage the data miner is required to thoroughly investigate the database and the process to become familiar with the columns and their various values.

2.4 Point-based vs. interval-based analysis

Some events occur at an instant of time. Data associated with such events is referred to as point-based event data. On the other hand, most events have a period of occurrence and cannot be considered point-based. These events have a start and finish time, and their data is known as interval-based event data. [23]

Figure 1: Samples of alarm messages from three different DCS implementations

Alarms, in general, fall within the second category, as they have an activation (triggering or ON) time and a deactivation (clearance, RTN or OFF) time. So, generally, there are two messages associated with an alarm in the database: one when the alarm is activated (ALM message) and a second one when it is cleared (RTN message). However, in some implementations the RTN message might not be generated by the DCS or not stored in the database. In this case, we only see the alarm activation times, i.e., point-based data.

Therefore, there are two approaches to working with alarm data. The first approach is to deal with alarms as interval-based events, which is more coherent. In this case, both ON and OFF times are required for each individual alarm. In other words, in addition to the activation time of an alarm, its duration is also important. [24] In the second approach, alarms are considered point-based events. This approach is particularly useful when there is no RTN data. Even when RTN messages are available, some might prefer to ignore them and use ALM messages only. The latter approach is more common in alarm research. [9,10,22] In this paper, we consider both cases and discuss the differences whenever necessary.

Let $\mathrm{ON}_1, \mathrm{ON}_2, \mathrm{ON}_3, \ldots$ be the sequence of alarm activation times, and $\mathrm{OFF}_1, \mathrm{OFF}_2, \mathrm{OFF}_3, \ldots$ the sequence of RTN times for a unique alarm. Then, the $i$th alarm is active during the interval $[\mathrm{ON}_i, \mathrm{OFF}_i]$.
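The two views of alarm data can be sketched in Python as follows. This is only an illustrative sketch: the tuple layout `(time, alarm_id, kind)`, the labels `"ALM"` and `"RTN"` as stored values, the tag `TT101`, and the use of `None` for an RTN that never arrives are all assumptions, not part of any particular DCS format.

```python
from collections import defaultdict

def pair_intervals(messages):
    """Pair each ALM message with the next RTN of the same unique alarm.

    `messages` is an iterable of (time, alarm_id, kind) tuples sorted by time,
    where kind is "ALM" or "RTN" (assumed labels).
    Returns {alarm_id: [(on, off), ...]}; off is None when no RTN was found.
    """
    open_on = {}                      # alarm_id -> activation time awaiting an RTN
    intervals = defaultdict(list)
    for time, alarm_id, kind in messages:
        if kind == "ALM":
            if alarm_id in open_on:   # previous activation was never cleared
                intervals[alarm_id].append((open_on[alarm_id], None))
            open_on[alarm_id] = time
        elif kind == "RTN" and alarm_id in open_on:
            intervals[alarm_id].append((open_on.pop(alarm_id), time))
    for alarm_id, on in open_on.items():   # still active at the end of the data
        intervals[alarm_id].append((on, None))
    return dict(intervals)

messages = [
    (4, "PT002.PVHI", "ALM"),
    (20, "PT002.PVHI", "RTN"),
    (34, "PT002.PVHI", "ALM"),
    (40, "TT101.PVLO", "ALM"),
    (42, "PT002.PVHI", "RTN"),
]

# Interval-based view: [ON_i, OFF_i] pairs per unique alarm.
intervals = pair_intervals(messages)
# Point-based view: activation times only, RTN messages ignored.
point_based = [time for time, _, kind in messages if kind == "ALM"]
```

In this sketch the point-based view simply discards the RTN messages, while the interval-based view keeps the duration of each alarm.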




Figure 2: Alarm data preprocessing steps

3 Basic steps of preprocessing

The main purpose of the preprocessing stage is to prepare a set of data suitable for mining. "Suitable" is a vague concept here and very much depends on the type of data and the selected model. However, there are certain common concepts in the preprocessing stage of alarm data, which are discussed below. These are the basic steps required to clean up the alarm data, but their order depends on the structure of the data and its method of storage. Figure 2 summarizes these steps.

3.1 Data collection and consolidation

Before collecting alarm data, the data miner should select a specific period of time, one year for instance, as well as a specific area/unit of the plant. If different areas of the plant are physically separate, it is recommended to carry out the data mining process on each area individually, to avoid increasing computational costs. The first step is then to collect alarm data for the specified period of time. If the data is stored in a relational database, one can query the required data using Structured Query Language (SQL). If the data is given in individual flat files, such as comma-separated values (CSV) files or Microsoft Excel spreadsheets, then the files should be consolidated into one file with a common format. The data collected and consolidated for mining is referred to as the dataset.
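As an illustration of the consolidation step for flat files, the following Python sketch merges CSV exports whose headers differ into one list of records with common column names. The header names in `COLUMN_MAP` and the sample rows are hypothetical; a real export would need its own mapping.

```python
import csv
import io

# Hypothetical mapping from source-specific headers to a common schema.
COLUMN_MAP = {
    "Time": "timestamp", "TimeStamp": "timestamp",
    "Tag": "tag", "TagName": "tag",
    "Type": "alarm_type", "AlarmType": "alarm_type",
}

def consolidate(files):
    """Merge several CSV file objects into one list of dicts with common keys."""
    dataset = []
    for f in files:
        for row in csv.DictReader(f):
            # Keep only mapped columns and rename them to the common schema.
            dataset.append(
                {COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP}
            )
    return dataset

# Two in-memory stand-ins for exported flat files with different headers.
file_a = io.StringIO("Time,Tag,Type\n2018-01-01 00:00:04,PT002,PVHI\n")
file_b = io.StringIO("TimeStamp,TagName,AlarmType\n2018-01-01 00:00:20,TT101,PVLO\n")
dataset = consolidate([file_a, file_b])
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.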



3.2 Removing unnecessary records

As stated before, the A&E database contains many records of data not required for alarm data mining. These records should be removed before any further processing. Removing unnecessary data at early stages of preprocessing saves time and computational resources. For alarm data mining, records or messages that are removed from the dataset include:

1. Duplicate records: In some A&E databases, due to poor DCS configuration, some or even all messages are repeated. These duplicate records should be removed to make sure that each message is unique.

2. Journal or log alarms: In many system configurations, some alarms are not shown to the operator and are directly sent to the database. These are known as journal or log alarms. In some DCS implementations, for instance, when an instrument generates a signal out of the standard range (e.g., 4-20 mA), an alarm, known as BADPV, is generated and logged but not shown to the operator. Such alarms that are not visible to the operator are removed from the dataset. This is due to the fact that alarm standard measures are given in terms of alarms presented to the operator. [2]

3. Non-alarm messages: For alarm data mining, only records that are associated with an alarm (an alarm or RTN message) are required. If the operator’s reaction to alarms is of interest, acknowledgement messages (ACK messages) are required as well. Besides these records, other records should be removed from the dataset. These include but are not limited to operator changes and system events.

4. Records with missing columns: In any given database, there is a chance that some data is lost. For example the time stamp could be invalid or recorded as 0. If the missing column is not important it can be ignored. Otherwise, there is no significance in this record of data and it has to be removed from the dataset. Later, in Section 5, we will discuss how to proceed if one complete alarm message is missing from the dataset.
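The four removal rules above could be applied in a single pass, as in the following Python sketch. The record keys (`"timestamp"`, `"alarm_id"`, `"kind"`, `"journal"`), the message labels, and the sample records are assumptions made for illustration; which columns count as important in rule 4 depends on the mining task.

```python
def clean_records(records, keep_ack=False):
    """Apply the four removal rules to a list of message dicts (assumed keys)."""
    wanted = {"ALM", "RTN"} | ({"ACK"} if keep_ack else set())
    seen = set()
    cleaned = []
    for r in records:
        key = (r.get("timestamp"), r.get("alarm_id"), r.get("kind"))
        if key in seen:                   # 1. duplicate record
            continue
        seen.add(key)
        if r.get("journal"):              # 2. journal/log alarm, hidden from operator
            continue
        if r.get("kind") not in wanted:   # 3. non-alarm message
            continue
        if not r.get("timestamp") or not r.get("alarm_id"):
            continue                      # 4. important column missing or recorded as 0
        cleaned.append(r)
    return cleaned

records = [
    {"timestamp": 4, "alarm_id": "PT002.PVHI", "kind": "ALM"},
    {"timestamp": 4, "alarm_id": "PT002.PVHI", "kind": "ALM"},      # duplicate
    {"timestamp": 9, "alarm_id": "FT007.BADPV", "kind": "ALM", "journal": True},
    {"timestamp": 12, "alarm_id": "OP1", "kind": "LOGIN"},          # system event
    {"timestamp": 0, "alarm_id": "TT101.PVLO", "kind": "ALM"},      # invalid time stamp
    {"timestamp": 20, "alarm_id": "PT002.PVHI", "kind": "RTN"},
]
cleaned = clean_records(records)
```

Setting `keep_ack=True` retains acknowledgement messages for studies of operator reaction.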




3.3 Unification of data formats

In the dataset, some columns may not have a proper format. It is necessary to convert these columns into a standard format readable by data mining software. A very common case is the time stamp (date and time). Date and time are stored in many different formats and might be distributed among several columns, which should be consolidated into a single software-readable column. Also, in some databases, each record might have separate columns for alarm and RTN times, which needs to be taken into account. Another example is the value column, which might include the unit of measurement; the unit should be removed so that the column is purely numeric.
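A minimal Python sketch of these two conversions is given below. The candidate time formats in `FORMATS` and the function names are assumptions; the list would have to be adapted to the formats actually found in the dataset.

```python
import re
from datetime import datetime

# Candidate formats are assumptions; extend the list for the DCS at hand.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S"]

def parse_timestamp(date_text, time_text=""):
    """Join separate date/time columns and try each known format in turn."""
    text = f"{date_text} {time_text}".strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time stamp: {text!r}")

def parse_value(text):
    """Drop a trailing unit such as 'bar' and return the leading number, if any."""
    match = re.match(r"\s*([-+]?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

stamp = parse_timestamp("2018-03-01", "14:05:09")   # date and time in two columns
value = parse_value("17.5 bar")                     # value stored with its unit
```

Raising an error for an unknown format, rather than guessing, makes unexpected formats visible early in preprocessing.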

3.4 Removing unnecessary columns

Alarm databases include many columns, not all of which are required for mining. Thus, a subset with selected columns is kept and the rest of the columns are removed from the dataset. The columns that are retained depend on the specific purpose of data mining. If one is interested in finding the relationships and correlations between alarms, the columns that are required are time stamp, tag name, alarm type or identifier, alarm state, and sometimes, value. For other applications columns such as priority might be of interest as well.

3.5 Creating unique alarms

In the process industry, every process variable has a tag name. For example, PT002 is used for the pressure of a vessel. Whenever the value of a process variable crosses a threshold, the operator should be notified of the change (for PT002, a pressure increase or decrease). Hence, alarms are defined. Depending on the DCS, different thresholds, or alarm limits, can be defined for each process variable. The most common alarm limits, as illustrated in Figure 3, are high (HI or PVHI), high-high (HH or PVHH), low (LO or PVLO), and low-low (LL or PVLL). In addition, other alarms could be defined on a process variable. One example is the rate-of-change (ROC) alarm. Some DCS brands allow the definition of up to 16 alarms per process variable. For more advanced objects such as controllers, many other alarms can be defined, e.g., deviation high (DEVHI) and deviation low (DEVLO).

Figure 3: A process variable and its most common alarm limits (amplitude versus time, with the high-high, high, low, and low-low limits marked)

Alarms are therefore different from process variables, as one process variable might have several configured alarms. For alarm data mining, though, we need to identify each individual alarm, referred to as a unique alarm. In many DCS implementations, a tag can be assigned to each unique alarm. For example, a high alarm for the pressure tag PT002 might be identified as PT002AH. In this case, we can use the tag name as a unique identifier. If not, a unique alarm identifier is created by concatenating the tag name and the alarm type (or alarm identifier). For example, PT002.PVHI is the unique alarm identifier of the high alarm for PT002. Once the alarm identifier is created for each record, the tag name and alarm type columns can be discarded.
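Creating the unique alarm identifier can be sketched in a few lines of Python. The record keys `"tag"` and `"alarm_type"` and the output key `"alarm_id"` are assumed names; the dot separator follows the PT002.PVHI example above.

```python
def unique_alarm_id(record):
    """Return tag.type, or the tag alone when it already identifies one alarm."""
    tag, alarm_type = record["tag"], record.get("alarm_type")
    return f"{tag}.{alarm_type}" if alarm_type else tag

def add_alarm_ids(records):
    """Add an 'alarm_id' column and discard the tag and alarm type columns."""
    out = []
    for r in records:
        rest = {k: v for k, v in r.items() if k not in ("tag", "alarm_type")}
        out.append({"alarm_id": unique_alarm_id(r), **rest})
    return out

rows = add_alarm_ids([
    {"tag": "PT002", "alarm_type": "PVHI", "timestamp": 4},
    {"tag": "PT002AH", "alarm_type": None, "timestamp": 9},   # tag is already unique
])
```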

3.6 Data sorting

Because relational databases do not guarantee any row order, alarms might not be stored in chronological order. However, for data mining purposes the sequence of alarms is of significant importance. Therefore, alarms and RTNs in the dataset must be sorted in ascending order of their time stamps. In alarm data mining, each record represents an event and has an event number. It is best to assign event numbers according to time stamps. As a result, an extra column might be added to the dataset.
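A short Python sketch of this step follows; the key names `"timestamp"` and `"event"` are assumptions. Python's `sorted` is stable, so records with identical time stamps keep their stored order.

```python
def sort_and_number(records):
    """Sort records by time stamp and attach a 1-based event number."""
    ordered = sorted(records, key=lambda r: r["timestamp"])   # stable sort
    return [{"event": n, **r} for n, r in enumerate(ordered, start=1)]

events = sort_and_number([
    {"timestamp": 34, "alarm_id": "PT002.PVHI", "kind": "ALM"},
    {"timestamp": 4, "alarm_id": "PT002.PVHI", "kind": "ALM"},
    {"timestamp": 20, "alarm_id": "PT002.PVHI", "kind": "RTN"},
])
```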

3.7 Removing stale alarms

In industrial plants, we often observe some alarms that have been active for a long period of time, even years. These are known as stale alarms. According to the ISA 18.2 standard, alarms that persist for more than 24 hours may be considered stale alarms. [2] Stale alarms may be caused by sensors or other devices that are no longer part of the system, ignored hardware faults, software bugs, and so on. These alarms provide little or no information to the operator, have no significance, and their presence on the screen may cause confusion. Stale alarms are typically removed in early stages of alarm rationalization. For data mining, they are to be removed as well to prevent meaningless bias.
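For interval-based data, the 24-hour criterion can be applied directly to alarm durations, as in the Python sketch below. Time stamps are assumed to be in seconds, and an alarm with no RTN (`None`) is assumed to remain active until the end of the dataset; both are illustrative choices.

```python
STALE_SECONDS = 24 * 60 * 60   # 24-hour criterion, expressed in seconds

def remove_stale(intervals, dataset_end, limit=STALE_SECONDS):
    """Drop (on, off) intervals that last longer than `limit` seconds.

    An interval whose off time is None is treated as active until dataset_end.
    """
    kept = []
    for on, off in intervals:
        end = dataset_end if off is None else off
        if end - on <= limit:
            kept.append((on, off))
    return kept

sample = [(0, 600), (1000, 91000), (5000, None)]
fresh = remove_stale(sample, dataset_end=200000)
```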

3.8 The next steps

The basic steps of alarm data preprocessing, as discussed in the previous parts, are rather straightforward. However, there are two more steps that are of utmost importance and need more attention. These two steps concern chattering alarms and missing alarms or RTNs, which are studied in the next two sections.

4 Chattering alarms

A chattering alarm, according to the ISA 18.2 standard, is an alarm that repeatedly transitions between the alarm state and the normal state in a short period of time. [2] Usually, if an alarm is repeated three or more times in one minute, it can be regarded as chattering. Chattering alarms are probably the most common and the most frustrating problem in alarm systems. The ISA 18.2 standard emphasizes that no chattering is acceptable. [2] Chattering alarms can be caused by poor alarm design, process noise and oscillations, inaccurate instruments, etc. [25,26] Chattering alarms can generate hundreds or even thousands of messages within a few minutes or hours. This high volume of alarms is obviously a significant distraction to the operator. They can fill up the alarm monitor as well as the alarm database, and often show up at the top of the list of the most frequent alarms. In a study based on 75 alarm systems, it was observed that on average 74% of the total alarm load is due to chattering alarms. [26]
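The rule of thumb of three or more activations in one minute can be checked with a sliding window over the activation times of a unique alarm. The Python sketch below is one possible reading of that rule; treating exactly 60 seconds as still within the window is an assumption.

```python
def is_chattering(on_times, count=3, window=60):
    """True if `count` or more activations fall within any `window` seconds."""
    times = sorted(on_times)
    for i in range(len(times) - count + 1):
        # times[i] .. times[i + count - 1] are `count` consecutive activations.
        if times[i + count - 1] - times[i] <= window:
            return True
    return False

chattering = is_chattering([4, 34, 58])      # three activations in 54 seconds
quiet = is_chattering([4, 70, 140])          # activations spread over 136 seconds
```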

4.1 Removing chattering alarms

Since chattering alarms do not carry any significant information for analysis, it is common practice to remove them before data modeling. [16] However, the criteria and method for removing chattering are debatable. In this section, we propose some techniques for the removal of chattering alarms from the dataset as a preprocessing step. Notice that we do not intend to eliminate an actual chattering alarm from the system, which has been discussed elsewhere. [25] The goal here is to prepare a dataset with valuable information by removing chattering alarm records. Let $T_w$ (in seconds) be a time-frame to be used for removing chattering alarms. Depending on context, $T_w$ refers either to the length of the time-frame or to the time-frame itself. The selection of $T_w$ is discussed in Section 4.2.

4.1.1 Point-based data

As previously stated, in most research reports, point-based alarm data was used. The common practice to remove chattering from the dataset, is to consider a specific time-frame Tw and remove records of a unique alarm within this interval. This can be achieved in two ways: 1. Fixed time-frame: Whenever a unique alarm is activated no alarms will be accepted within Tw seconds of it. In other words, all records associated with a unique alarm during this Tw second period are removed. 22 So, if ONf is the first alarm in a chattering

15



Algorithm 1 Chattering removal algorithm for point-based data with a fixed time-frame 1: j = 1 2: while ONj exist do 3: q=1 4: while ONj+q − ONj < Tw do 5: remove ONj+q 6: q =q+1 7: end while 8: j =j+q 9: end while episode, then all the alarms for which ONi − ONf < Tw for i > f are removed. After the time-frame is over, the next alarm is maintained and the procedure is repeated. The chattering removal algorithm for point-based data with a fixed time-frame is given in Algorithm 1. 2. Moving time-frame: This is similar to the fixed time-frame, with a difference that the time-frame moves with the last alarm. Here, all alarms within Tw seconds of their previous alarms are removed. In other words, if ONi+1 − ONi < Tw , then the (i + 1)th alarm is removed. The chattering removal algorithm for point-based data with a moving time-frame is given in Algorithm 2. The two approaches are illustrated in Figure 4 for Tw = 60 seconds. As it can be observed, if we have a severely chattering alarm, in the first approach one alarm is maintained every Algorithm 2 Chattering removal algorithm for point-based data with a moving time-frame 1: j = 1 2: while ONj exist do 3: q=1 4: while ONj+q − ONj+q−1 < Tw do 5: remove ONj+q−1 6: q =q+1 7: end while 8: j =j+q 9: end while 16




Figure 4: Removal of chattering in point-based alarm data: original data (top); fixed time-frame (middle); moving time-frame (bottom). In each panel the horizontal axis is time (seconds) and the vertical axis is the alarm state (ON/OFF).

$T_w$ seconds, while in the second approach all alarms but the first one are eliminated. It is obvious that the former removes less data. However, it is the only approach that has been used. 22 It seems that, although chattering alarms are not acceptable, data analysts are reluctant to remove all of them for fear of losing valuable data.
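The two point-based strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the activation times are assumed to be sorted numbers (in seconds) belonging to one unique alarm, and the sample times are illustrative values for a severely chattering alarm.

```python
def remove_chattering_fixed(on_times, tw):
    """Fixed time-frame: drop every alarm closer than tw to the last KEPT alarm."""
    kept = []
    for t in on_times:
        # The window is anchored at the last kept alarm and does not move.
        if not kept or t - kept[-1] >= tw:
            kept.append(t)
    return kept


def remove_chattering_moving(on_times, tw):
    """Moving time-frame: drop every alarm closer than tw to its PREVIOUS alarm."""
    kept = []
    previous = None
    for t in on_times:
        # The window moves with every alarm, whether it was kept or removed.
        if previous is None or t - previous >= tw:
            kept.append(t)
        previous = t
    return kept


# Illustrative activation times (seconds) of one severely chattering alarm.
on_times = [4, 20, 34, 42, 46, 58, 64, 68, 72, 80, 94, 102, 106, 118, 128, 132, 136]
print(remove_chattering_fixed(on_times, 60))   # [4, 64, 128]
print(remove_chattering_moving(on_times, 60))  # [4]
```

With these sample values the fixed time-frame keeps roughly one alarm per 60 seconds, while the moving time-frame keeps only the first alarm of the episode, which mirrors the behaviour described above.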

4.1.2

Interval-based data

To the best of our knowledge, no research has been conducted on removing chattering alarms from interval-based data. Of course, there is always the option to ignore RTN records, convert interval-based data to point-based data, and use the aforementioned methods to deal with chattering. But here, we intend to keep interval-based data as it is. So, we make two assumptions. First, after chattering alarms are removed, one should still have interval-based data. Second, it is preferred to keep an alarm active during its chattering episode. Chattering alarms can be removed using two approaches:

1. Fixed time-frame: For a unique alarm, the activation time is determined as the activation time of the first alarm. The RTN time for the episode is selected as the RTN of the last alarm that is activated within the $T_w$-second time-frame starting from the RTN of the first alarm. Any record in between is considered a chattering alarm. Notice that here the time-frame is fixed and starts with the RTN of the first alarm. Then, the alarm interval that replaces a chattering episode is obtained as $[ON_f, OFF_j]$, where $j = \min\{i : ON_{i+1} - OFF_f > T_w\}$ and $ON_f$ is the first alarm in a chattering episode. The chattering removal algorithm for interval-based data with a fixed time-frame is given in Algorithm 3.

Algorithm 3 Chattering removal algorithm for interval-based data with a fixed time-frame
    $j = 1$
    while $ON_j$ exists do
        $q = 1$
        while $ON_{j+q} - OFF_j < T_w$ do
            remove $ON_{j+q}$ and $OFF_{j+q-1}$
            $q = q + 1$
        end while
        $j = j + q$
    end while

2. Moving time-frame: For a unique alarm, the activation time is determined as the activation time of the first alarm. The time-frame starts from the clearance time of the first alarm. The RTN time for the episode is selected as the RTN of the first alarm with no alarms in the next $T_w$ seconds. Any alarm in between is considered chattering. Here, the time-frame moves with the alarms within the chattering episode. As a result, the alarm interval that replaces a chattering episode is obtained as $[ON_f, OFF_j]$, where $j = \min\{i : ON_{i+1} - OFF_i > T_w\}$ and $ON_f$ is the first alarm in a chattering episode. The chattering removal algorithm for interval-based data with a moving time-frame is given in Algorithm 4.



Algorithm 4 Chattering removal algorithm for interval-based data with a moving time-frame
    $j = 1$
    while $ON_j$ exists do
        $q = 1$
        while $ON_{j+q} - OFF_{j+q-1} < T_w$ do
            remove $ON_{j+q}$ and $OFF_{j+q-1}$
            $q = q + 1$
        end while
        $j = j + q$
    end while

The two approaches are illustrated in Figure 5 for $T_w = 60$ seconds. Again, it can be observed that for a severely chattering alarm, in the first approach one alarm is maintained approximately every $T_w$ seconds, while in the second approach all alarms but the first one are eliminated. So, the former approach keeps more alarms with shorter durations, and the latter keeps fewer alarms with longer durations. Which approach is chosen depends on the application and the type of modeling. For example, for repetitive pattern recognition in alarms, the number of alarms is important, so the former approach is more suitable. But for methods that look for correlations based on time series algorithms, the duration of an alarm is more important, so the second approach is preferred.
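The interval-based variants can be sketched in the same spirit. The following Python function is a hedged illustration, not the authors' code: the name `merge_chattering_intervals`, the `(on, off)` tuple representation, and the sample intervals are all assumptions made for this example. Each chattering episode is replaced by one interval that starts at the first activation and ends at the RTN of the last alarm merged into the episode.

```python
def merge_chattering_intervals(intervals, tw, moving=False):
    """Merge chattering episodes of one unique alarm.

    intervals: sorted list of (on, off) times in seconds.
    moving=False: the time-frame starts at the RTN of the first alarm (fixed).
    moving=True:  the time-frame restarts at the RTN of each merged alarm.
    """
    merged = []
    i = 0
    while i < len(intervals):
        on_first, off_ref = intervals[i]
        off_last = off_ref
        k = i + 1
        # Absorb following alarms that start within tw of the reference RTN.
        while k < len(intervals) and intervals[k][0] - off_ref < tw:
            off_last = intervals[k][1]
            if moving:
                off_ref = off_last
            k += 1
        merged.append((on_first, off_last))
        i = k
    return merged


# Hypothetical (on, off) pairs in seconds, chosen only for illustration.
sample = [(0, 10), (30, 40), (80, 90), (120, 130), (300, 310)]
print(merge_chattering_intervals(sample, 60))               # [(0, 40), (80, 130), (300, 310)]
print(merge_chattering_intervals(sample, 60, moving=True))  # [(0, 130), (300, 310)]
```

On this sample the fixed time-frame yields more but shorter alarms, and the moving time-frame yields fewer but longer ones, which is the trade-off discussed above.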

4.2

Selection of removal time-frame

The efficiency of the chattering removal step depends on the proper selection of the time-frame $T_w$. One can select a pre-determined global time-frame for all the alarm tags, or use a different time-frame for each tag depending on the underlying process variable and its behavior. In this section we study different methods to select the time-frame.


Figure 5: Removal of chattering in interval-based alarm data: original data (top); fixed time-frame (middle); moving time-frame (bottom). In each panel the horizontal axis is time (seconds) and the vertical axis is the alarm state (ON/OFF).

4.2.1

Global time-frame

As stated before, the ISA 18.2 standard considers an alarm as chattering if it is repeated three or more times in one minute. 2 Based on this, a half-a-minute interval for all the chattering tags is acceptable. However, often a larger time-frame, from 60 to 300 seconds, is selected to make sure all chattering alarms are removed. 25,27 A time-frame of one minute is used in a number of research reports. 20,22

4.2.2

Variable-dependent time-frame

Different process variables have different time-constants and different rates of change. Some process variables are slow in nature, e.g., temperature, level, and concentration. Other process variables, such as pressure, flow, voltage, current, and motor speed, change much faster. As a result, chattering alarms for a pressure variable might be closer to each other (more alarms per minute) than those for a temperature variable. Therefore, it is reasonable to select the time-frame based on the process variable and its time-constant. This avoids removing important non-chattering alarms for a fast-changing tag. It also ensures that all chattering alarms for a slow tag are removed even if they are further apart.

Figure 6: ON–ON run-length distribution of an actual flow tag

An efficient tool to study the distribution of alarms for a unique alarm is the run-length distribution (RLD). In fact, an index to quantify alarm chattering has been proposed based on the RLD. 28 The RLD (or more specifically the ON–ON RLD) is the histogram of time differences between two consecutive alarms. Figure 6 shows the RLD of an actual flow tag. The more the distribution is concentrated at short run lengths (toward the left of the histogram), the more severe the chattering. Based on the RLD, the 50th percentile (median) of a unique alarm tag is calculated. This is the median of the time interval between two consecutive alarms and indicates, more or less, how far apart about 50% of the alarms for a unique tag are. Table 1 shows the median index for the most frequent alarms, categorized by type of process variable (temperature, level, flow, and pressure). The data is obtained from the alarm database discussed in the case study in Section 6. As can be seen, the median time between two consecutive alarms for temperature or level tags is 4 to 20 times that for pressure or flow tags. Therefore,


Table 1: ON–ON median index (in minutes) for different types of variables

(a) Temperature
Unique alarm    Median index
TAG127.PVHI     2.6833
TAG260.PVHI     2.5000

(b) Level
Unique alarm    Median index
TAG261.PVHI     4.7000
TAG128.PVLO     4.8167
TAG067.PVLO     4.1667
TAG138.PVHI     2.2167
TAG067.PVLL     4.1667

(c) Pressure
Unique alarm    Median index
TAG035.PVLO     0.0667
TAG036.PVLO     0.0833
TAG300.PVHI     0.0667
TAG303.PVHI     0.0667
TAG126.PVHH     0.7667

(d) Flow
Unique alarm    Median index
TAG121.PVLO     0.2000
TAG185.PVLO     1.2667
TAG122.PVLO     0.0500
TAG117.PVLO     0.6167
TAG031.PVHI     1.5167

if we use a global time-frame for all the tags, we end up removing many more pressure and flow alarms than temperature or level alarms. This suggests that, to have a more unified approach, it is better to use a variable-dependent time-frame for chattering removal. If the median index is within an acceptable range, it provides a fair indicator of how bad the chattering is for a unique alarm. It also suggests a measure of the time-frame: if we use the median for $T_w$, about 50% of the alarms are removed. Clearly, finding a specific time-frame for each individual variable in a large database might be tedious and time-consuming. But there are not too many different types of variables in a typical process industry. In fact, the four main variables (temperature, pressure, flow, and level) account for the vast majority of process variables. In addition, the ISA 18.2 standard recommends some typical values for configuring delay-timers for different types of variables, as given in Table 2. 2 The numbers are suggested based on the time-constants of these variables in a typical process. If these values are used as time-frames for removing chattering alarms, a more consistent dataset is obtained.
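As a small illustration of the ON–ON median index discussed above, the following Python sketch computes the run lengths between consecutive activations and their median. The function names and the sample activation times are hypothetical, and the times are assumed to be sorted numbers in seconds for one unique alarm.

```python
from statistics import median


def on_on_run_lengths(on_times):
    """Time differences between consecutive activations of one unique alarm."""
    return [later - earlier for earlier, later in zip(on_times, on_times[1:])]


def median_index(run_lengths):
    """50th percentile of the run lengths (the median index)."""
    return median(run_lengths)


# Made-up activation times (seconds): a burst of close alarms, then a long gap.
on_times = [0, 4, 8, 12, 200, 204]
runs = on_on_run_lengths(on_times)
print(runs)                # [4, 4, 4, 188, 4]
print(median_index(runs))  # 4
```

A histogram of `runs` would be the ON–ON RLD for this tag; a small median relative to the chosen time-frame signals that most activations belong to chattering episodes.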



Table 2: ISA 18.2 recommendations for delay-timers

Process variable    Delay-timer
flow                15 sec
pressure            15 sec
level               60 sec
temperature         60 sec

A final recommendation is to use a different time-frame $T_w$ for each type of variable. The time-frames can be selected directly from Table 2, or as a multiple of those numbers. So, if the data miner decides to use a 30 sec time-frame for removing chattering pressure alarms, a fair choice is to use 30 sec, 120 sec, and 120 sec for flow, level, and temperature tags, respectively. A similar analysis is valid if interval-based alarm data is available. In this case, the run-length distribution of the time between two consecutive alarms can be calculated and plotted. However, here the time difference is between the RTN of the first alarm and the activation of the second alarm. This is known as the OFF–ON RLD. Figure 7 shows the OFF–ON RLD of the same flow tag as Figure 6. Also, Table 3 lists the median index (here the median time difference between an RTN and the next alarm) for the most frequent alarms, categorized by type of process variable. Again, it can be observed that the OFF–ON median indices for flow and pressure tags are similar. The OFF–ON median indices for temperature and level tags are similar too, and 4 to 20 times those of flow and pressure tags. This also supports the recommendation that the time-frame for temperature and level tags should be at least 4 times that for flow and pressure tags. Here too, the numbers in Table 2, or a multiple thereof, can be selected.
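The recommendation above can be written down as a small lookup. The sketch below is an illustration only: the dictionary name, the function names, and the sample intervals are assumptions, while the delay-timer values are the ones listed in Table 2. It also shows how OFF–ON run lengths could be computed from interval-based data.

```python
# Delay-timer values (seconds) as listed in Table 2.
ISA_DELAY_TIMER_S = {"flow": 15, "pressure": 15, "level": 60, "temperature": 60}


def time_frames(multiple=1):
    """Variable-dependent time-frames: the Table 2 values times a common multiple."""
    return {variable: seconds * multiple for variable, seconds in ISA_DELAY_TIMER_S.items()}


def off_on_run_lengths(intervals):
    """Time from each RTN to the next activation, for sorted (on, off) pairs."""
    return [next_on - off for (_, off), (next_on, _) in zip(intervals, intervals[1:])]


print(time_frames(2))
# {'flow': 30, 'pressure': 30, 'level': 120, 'temperature': 120}

# Hypothetical (on, off) pairs in seconds for one unique alarm.
print(off_on_run_lengths([(0, 10), (30, 40), (80, 90)]))  # [20, 40]
```

Using a multiple of 2 reproduces the 30 sec / 120 sec choice mentioned in the text.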




Figure 7: OFF–ON run-length distribution of an actual flow tag

5

Missing alarm or RTN messages

As previously mentioned, there is a possibility that some data is lost in any database. This is not important if the missing data belongs to columns that are removed anyway. But sometimes one row (i.e., one message) is completely missing. This could be due to technical problems, temporary loss of communication, buffer overflow, data loss in networks, misconfiguration of software, data truncation, etc. Our surveys show that in many alarm databases about 0.5% to 5% of messages might be lost. A lost message may remain unnoticed in many applications. In fact, if point-based data is used there is almost never a way to decide whether a message is missing. But if interval-based data is used, missing messages can be detected because two messages are stored for each alarm (an alarm message and an RTN message). In this case, for a unique alarm, two alarm messages appear in the database without any RTN between them, or vice versa. One might go ahead and remove one of these messages, but this means we are removing




Table 3: OFF–ON median index (in minutes) for different types of variables

(a) Temperature
Unique alarm    Median index
TAG127.PVHI     1.7083
TAG140.PVHI     8.6416
TAG029.PVLO     1.5833
TAG069.PVLO     6.5083
TAG054.PVHI     6.1917

(b) Level
Unique alarm    Median index
TAG261.PVHI     3.4000
TAG128.PVLO     3.9750
TAG067.PVLO     4.1000
TAG138.PVHI     1.8333
TAG067.PVLL     4.1333

(c) Pressure
Unique alarm    Median index
TAG035.PVLO     0.0500
TAG125.PVLO     1.9117
TAG125.PVHI     0.6500
TAG300.PVHI     0.0167
TAG036.PVLO     0.0667

(d) Flow
Unique alarm    Median index
TAG121.PVLO     0.1000
TAG185.PVLO     0.5583
TAG117.PVLO     0.3083
TAG031.PVHI     0.7500
TAG030.PVLO     0.2500

useful information. A better way is to replace the missing data with a reasonable and justified value. This is known as data imputation in the data mining literature, and is an important step in preprocessing. In the rest of this section, we propose methods for alarm data imputation. This means adding an alarm or RTN message to the dataset if one is missing. In a clean alarm dataset, alarm and RTN messages should appear alternately. If this is not the case, some messages are missing. Missing messages are observed at the beginning and/or the end of the dataset, or in between. For a unique alarm, if the first message that appears in the dataset is an RTN message, then the alarm message is missing from the dataset. The alarm message might very well be available in the original A&E database, but it is not available to the data miner. Similarly, if the last message for a unique alarm is an alarm message, then the RTN is missing from the dataset. Examples of both cases are shown in Table 4. It is possible to use the methods proposed in this section to impute the missing messages, but there is a chance that the generated message falls outside the analysis interval (usually one year). In this case, the alarm and the RTN




Table 4: Missing messages at the beginning and the end of a dataset

(a) Missing alarm message at the beginning of a dataset
Date and time          Unique alarm      Alarm state
2018-01-21 07:56:10    TAG100.OFFNORM    RTN
2018-03-22 20:20:13    TAG100.OFFNORM    ALM
2018-03-22 20:24:05    TAG100.OFFNORM    RTN

(b) Missing RTN message at the end of a dataset
Date and time          Unique alarm      Alarm state
2018-03-20 01:57:03    TAG041.PVLO       ALM
2018-03-20 01:57:08    TAG041.PVLO       RTN
2018-05-10 20:59:55    TAG041.PVLO       ALM

Table 5: Missing messages in the middle of a dataset
Date and time          Unique alarm      Alarm state
2018-01-20 17:26:25    TAG097.PVHI       ALM
2018-01-20 17:26:27    TAG097.PVHI       RTN
2018-01-20 18:16:31    TAG097.PVHI       RTN
2018-01-20 18:16:51    TAG097.PVHI       ALM
2018-01-21 04:05:41    TAG097.PVHI       ALM
2018-01-21 04:06:16    TAG097.PVHI       RTN

messages can be added at the beginning of the analysis interval and the end, respectively. If the missing messages are not at the beginning and/or the end of a dataset (for an example see Table 5), then the data is missing from the original A&E database as well. In this case, the proposed methods can be readily used to impute the missing messages. In the following, two methods are proposed for alarm data imputation.
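Detecting missing messages amounts to checking that ALM and RTN messages alternate for each unique alarm. The following Python sketch illustrates that check; the function name and the return convention (a list of `(position, missing_state)` pairs, where `position` is the index before which a message is missing) are assumptions made for this example. The state sequences are the ones shown in Tables 4 and 5.

```python
def find_missing(states):
    """Locate breaks in the ALM/RTN alternation of one unique alarm.

    states: chronological list of "ALM"/"RTN" strings.
    Returns (position, missing_state) pairs; position == len(states)
    means the message is missing after the last record.
    """
    missing = []
    expected = "ALM"  # a clean sequence starts with an alarm message
    for i, state in enumerate(states):
        if state != expected:
            missing.append((i, expected))
        # Whatever was observed, the next message should be its opposite.
        expected = "RTN" if state == "ALM" else "ALM"
    if expected == "RTN":  # the last record was an alarm with no RTN
        missing.append((len(states), "RTN"))
    return missing


print(find_missing(["RTN", "ALM", "RTN"]))                       # [(0, 'ALM')]
print(find_missing(["ALM", "RTN", "ALM"]))                       # [(3, 'RTN')]
print(find_missing(["ALM", "RTN", "RTN", "ALM", "ALM", "RTN"]))  # [(2, 'ALM'), (4, 'RTN')]
```

The first two calls correspond to a missing message at the beginning and at the end of a dataset, and the third to missing messages in the middle.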

5.1

Alarm data imputation using ON–OFF median index

In the previous section we defined the ON–ON and OFF–ON median indices as measures of the typical distance between two consecutive alarms. A similar index can be defined for the duration of an alarm. In this case, the run-length distribution of the durations of a unique alarm is calculated and plotted, the so-called ON–OFF RLD. Figure 8 shows the ON–OFF RLD of the same flow tag as before. The ON–OFF median index can be calculated from this plot




Figure 8: ON–OFF run-length distribution of an actual flow tag

as the 50th percentile. This index is the median duration of this unique alarm. If an alarm message is missing, the index is used to estimate its activation time from the corresponding RTN message (the RTN time minus the median duration), and an alarm message is added to the dataset at that time. Similarly, if an RTN message is missing, one can be added to the dataset after the corresponding alarm message, using the alarm time plus the ON–OFF median index as the estimated RTN time.
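A minimal Python sketch of this imputation rule follows. The function names are hypothetical and the approach is simply "alarm time plus (or RTN time minus) the median duration"; the 4883-second median and the alarm time in the usage example are the values reported for TAG029.PVLO in the case study of Section 6, while the three durations passed to `on_off_median` are made up.

```python
from datetime import datetime, timedelta
from statistics import median


def on_off_median(intervals):
    """Median alarm duration in seconds from (alarm_time, rtn_time) datetime pairs."""
    return median((rtn - alm).total_seconds() for alm, rtn in intervals)


def impute_rtn(alarm_time, median_s):
    """Estimated RTN time for an alarm whose RTN message is missing."""
    return alarm_time + timedelta(seconds=median_s)


def impute_alm(rtn_time, median_s):
    """Estimated activation time for an RTN whose alarm message is missing."""
    return rtn_time - timedelta(seconds=median_s)


# Made-up complete intervals, only to show how the median index is obtained.
d = datetime(2018, 1, 1)
print(on_off_median([(d, d + timedelta(seconds=10)),
                     (d, d + timedelta(seconds=30)),
                     (d, d + timedelta(seconds=20))]))  # 20.0

# Alarm time and median duration taken from the case study in Section 6.
print(impute_rtn(datetime(2018, 1, 13, 19, 26, 1), 4883))  # 2018-01-13 20:47:24
```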

5.2

Alarm data imputation using temporal patterns

In the previous method, to impute alarm messages for a unique alarm with missing messages, we used information from the same alarm (i.e., the ON–OFF median index). A different approach is to use information from other alarms to impute missing messages. In the process industry, thanks to technologies such as sensor networks and DCS, a large number of variables are continuously measured and monitored. But not all of these variables are independent. In fact, although the size of the data in a process industry is very large, its rank is often very small. Usually only a few process variables, known as latent variables,




are considered independent and the rest of the variables can be obtained from this small set. The same is true for alarm variables. A number of research results confirm that there are some correlations between alarm data. 9,10,22 Here, we use this correlation, captured by temporal patterns, to impute missing alarm messages.

Kong et al. proposed a method to discover multi-temporal patterns in large databases. 29 They define four relationships between two timed events: equal, before, during, and overlap. These relationships are then used to discover patterns. The basis of this method is the Apriori algorithm of data mining, and its main advantage is that it reduces the number of dataset searches (to just one). This multi-temporal pattern recognition algorithm was later used to find multi-patterns in alarm data, 24 which in turn can predict successive alarms with the objective of suppressing the volume of alarms and reducing operator load. Here, we use a modified version of the same algorithm for alarm data imputation.

To use this algorithm, we need to divide the database into sections, referred to as transactions. Each transaction consists of a set of alarms that are assumed to be related. Therefore, the transactions are selected in a way that no two alarms in consecutive transactions are related. Thus, the transactions are selected to be at least 30 minutes apart (other time intervals can be used based on the application). In other words, periods of 30 minutes or more with no alarms are the dividers between transactions. Let $T_1$ be the set of all initial transactions. Then the following steps are performed for each missing message:

1. Find the transaction that contains the missing message, referred to as the main transaction ($T_M$). This method is only applicable to missing alarms that are part of a transaction.

2. For a meaningful and conclusive analysis, only transactions with at least 4 alarms are maintained and the rest are discarded. Let $T_2$ be the set of all transactions with at least 4 alarms. Find all the transactions that contain messages with the same unique tag as the missing message. This set is denoted by $T_T$. These transactions are used for imputation. If the unique tag appears only as a missing message, then obviously it is not possible to find a relationship for that alarm.

3. From $T_T$, find the transaction that has the largest number of unique alarms in common with the main transaction. This transaction is called the related transaction ($T_R$), and is the basis for imputing the missing message.

4. To use the algorithm, the transaction must be converted to a new form, stored in the variable $D_T$. Each alarm occurring in $D_T$ is called a state and is denoted by $s$. An example of how to construct $D_T$ is provided in Table 6.

Table 6: Example of $D_T$
State    ON time    OFF time
s1       5          16
s2       12         18
s1       20         29
s3       25         34
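The transaction-related steps above can be sketched as follows. This is an illustrative Python outline under several assumptions: alarms are `(tag, on, off)` tuples with times in seconds, the function names are hypothetical, a gap is measured from the latest RTN seen so far to the next activation, and ties between equally good related transactions are broken by order of appearance. The sample data is made up.

```python
def split_transactions(alarms, gap_s=1800):
    """Split (tag, on, off) alarms into transactions separated by quiet periods >= gap_s."""
    transactions, current, last_off = [], [], None
    for tag, on, off in sorted(alarms, key=lambda a: a[1]):
        if current and on - last_off >= gap_s:
            transactions.append(current)
            current, last_off = [], None
        current.append((tag, on, off))
        last_off = off if last_off is None else max(last_off, off)
    if current:
        transactions.append(current)
    return transactions


def related_transaction(main, candidates, tag):
    """Among candidates containing `tag`, pick the one sharing most unique tags with `main`."""
    main_tags = {t for t, _, _ in main}
    with_tag = [c for c in candidates
                if c is not main and any(t == tag for t, _, _ in c)]
    return max(with_tag,
               key=lambda c: len(main_tags & {t for t, _, _ in c}),
               default=None)


# Made-up alarms: two busy periods and one isolated alarm.
alarms = [("A", 0, 10), ("B", 5, 20), ("C", 30, 40), ("A", 50, 60),
          ("A", 4000, 4010), ("B", 4005, 4020), ("D", 4030, 4040), ("E", 4050, 4060),
          ("Z", 9000, 9010)]
t1 = split_transactions(alarms)              # all initial transactions
t2 = [tr for tr in t1 if len(tr) >= 4]       # keep transactions with at least 4 alarms
print([len(tr) for tr in t1])                # [4, 4, 1]
print(related_transaction(t1[1], t2, "A") is t1[0])  # True
```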

Now, we can run the algorithm on the related transaction $D_T$. Let $w$ be a predefined window of time, and consider an alarm, $ALM_A$, with the active interval $[ON_A, OFF_A]$. Then $C_A = [ON_A, OFF_A + w]$ is defined as the window constraint of $ALM_A$. Now, if another alarm, $ALM_B$, is triggered within the window constraint of $ALM_A$, i.e., $ON_B \in C_A$, then $ALM_B$ is said to satisfy the window constraint of $ALM_A$. 29 If $w$ is selected properly, the second alarm can be considered related to the first alarm. For alarm data, a window size of $w = 90$ seconds is appropriate, but the data miner can set it based on their specific set of data. Let $\mathrm{supp}$ be the number of instances of a certain pattern. Then $\mathrm{deg\_supp}$ is defined as:

$$\mathrm{deg\_supp} = \frac{\mathrm{supp}}{|E|}$$

where $|E| = \max_{j=1,2,\ldots,N} |E_j|$, $N$ is the number of unique alarms in $T_R$, and $|E_j|$ is the number of occurrences of the $j$th unique alarm in $T_R$.
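The two definitions above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, alarms are represented as `(on, off)` pairs in seconds, and the state list passed to `deg_supp` is the sequence of states from the Table 6 example.

```python
from collections import Counter


def satisfies_window(alarm_a, alarm_b, w):
    """True if alarm B is triggered within the window constraint [ON_A, OFF_A + w]."""
    on_a, off_a = alarm_a
    on_b, _ = alarm_b
    return on_a <= on_b <= off_a + w


def deg_supp(supp, states):
    """supp divided by |E|, the count of the most frequent unique alarm."""
    e = max(Counter(states).values())
    return supp / e


# States of the Table 6 example: s1 occurs twice, so |E| = 2.
states = ["s1", "s2", "s1", "s3"]
print(satisfies_window((5, 16), (12, 18), w=90))  # True
print(deg_supp(1, states))                        # 0.5
```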


The pattern finding algorithm in Algorithm 5 can now be executed. This is a modified version of the original algorithm 29 customized for alarm data. Here, $M$ is the set of all the messages that need to be imputed. Also, $C$ is the set of pattern candidates and $F$ is the set of candidates whose support is more than the minimal support, as defined in the algorithm. This algorithm uses four temporal predicates (i.e., equal, before, during, or overlap) defined as:

1. If $ON_i = ON_j$ and $OFF_i = OFF_j$, then $ALM_i$ equals $ALM_j$ ($ALM_i \overset{E}{\Rightarrow} ALM_j$).

2. If $0 \le ON_j - OFF_i \le w$, then $ALM_i$ is before $ALM_j$ ($ALM_i \overset{B}{\Rightarrow} ALM_j$).

3. If $ON_i = ON_j < OFF_i < OFF_j$ or $ON_j < ON_i < OFF_i \le OFF_j$, then $ALM_i$ is during $ALM_j$ ($ALM_i \overset{D}{\Rightarrow} ALM_j$).

4. If $ON_i < ON_j < OFF_i \le OFF_j$ or $ON_i < ON_j < OFF_j \le OFF_i$, or $ON_i \le ON_j < OFF_j < OFF_i$, then $ALM_i$ overlaps $ALM_j$ ($ALM_i \overset{O}{\Rightarrow} ALM_j$).

Then a temporal instance between two events $ALM_i$ and $ALM_j$ can be denoted as $\varphi := ALM_i \overset{R}{\Rightarrow} ALM_j$, where $R$ is a temporal predicate ($R \in \{E, B, D, O\}$). 29
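The four predicates can be checked mechanically. The following Python function is a sketch that evaluates the conditions exactly as listed above, in the order E, B, D, O; the function name, the `(on, off)` tuple representation, and the choice to return `None` when no predicate applies are assumptions for this example. The first two calls use intervals from the Table 6 example.

```python
def temporal_relation(alarm_i, alarm_j, w):
    """Return 'E', 'B', 'D' or 'O' for the relation of alarm i to alarm j, else None."""
    on_i, off_i = alarm_i
    on_j, off_j = alarm_j
    if on_i == on_j and off_i == off_j:
        return "E"  # equal
    if 0 <= on_j - off_i <= w:
        return "B"  # before
    if (on_i == on_j < off_i < off_j) or (on_j < on_i < off_i <= off_j):
        return "D"  # during
    if ((on_i < on_j < off_i <= off_j)
            or (on_i < on_j < off_j <= off_i)
            or (on_i <= on_j < off_j < off_i)):
        return "O"  # overlap
    return None


print(temporal_relation((5, 16), (12, 18), w=90))   # O
print(temporal_relation((12, 18), (20, 29), w=90))  # B
print(temporal_relation((12, 15), (5, 16), w=90))   # D
print(temporal_relation((5, 16), (5, 16), w=90))    # E
```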



Algorithm 5 Alarm data imputation algorithm
    $\mathrm{Result} = \emptyset$
    for $p = 1 : |M|$ do
        $A_p$ = unique alarm for the missing message $M_p$
        Find $T_M$
        if $T_M = \emptyset$ then
            continue with the next $p$
        end if
        Find $T_T$
        Find $T_R$
        $D_T$ = appropriate form of $T_R$
        $\mathrm{supp}(A_p) = \min\{3, \text{number of occurrences of } A_p \text{ in } T_R\}$
        $\mathrm{min\_deg\_supp} = \mathrm{supp}(A_p)/|E|$
        $F_0 = \emptyset$
        for every state $s$ in $D_T$ do
            if $\mathrm{deg\_supp}_S(s) \ge \mathrm{min\_deg\_supp}$ then
                $F_0 = F_0 \cup \{s\}$
            end if
        end for
        $C_1 = \{s_i \overset{R}{\Rightarrow} s_j \mid s_i \in S,\ s_j \in F_0\}$
        Filter $C_1$
        $k = 1$
        while $F_{k-1} \ne \emptyset$ do
            $F_k = \emptyset$
            for every sequence $\varphi \in C_k$ do
                if $\mathrm{deg\_supp}_S(\varphi) \ge \mathrm{min\_deg\_supp}$ then
                    $F_k = F_k \cup \{\varphi\}$
                end if
            end for
            $C_{k+1} = \{s_i \overset{R}{\Rightarrow} s_j \mid s_i \in F_k,\ s_j \in F_0\}$
            $k = k + 1$
        end while
        Filter $F$
        $\mathrm{Result}_p = F$
    end for

When candidates with two states are produced ($C_1$), we should focus on patterns associated with the missing alarm. We therefore select patterns that include the missing alarm tag, or that include a tag already found to be associated with the missing alarm. After the algorithm is executed, a set of rules is obtained that shows how the alarms are related to each other. Within this set, only rules that include the missing message tag are of interest. Once these rules are extracted, alarm patterns that include the missing message tag can be observed. These patterns provide an approximate duration for the alarm, which can be used for message imputation. The details are shown for the case study in Section 6.

Figure 9: A sample of alarm data

6

Case Studies

For the case study, we use alarm data from a refinery. A sample of the data is depicted in Figure 9, with renamed tag names. Column headers are added for convenience. The data for a period of 9 months was provided through a number of flat files. There are about 110000 messages in the dataset, including alarms, RTNs, and system events. In this database there are 332 TagNames, 12 AlarmTypes (e.g., PVLL, PVLO, PVHI, PVHH, BADPV, OFFNORM, etc.), and 4 Priorities (EMERGNCY, HIGH, LOW, and JOURNAL). The AlarmState has only two values (ALM and RTN), which indicate the activation and return-to-normal of an alarm. A blank AlarmState indicates a system event or operator action. The Value and AlarmLimit columns show the corresponding values for process alarms. But for system alarms, these two columns might be blank or populated with other values. For data analysis, RStudio, a software package based on the R language (a common language for data analysis), was used. The preprocessing steps are followed for this data. In this dataset, there are no duplicate messages, but there are messages that are not shown to the operator. The Priority of these messages is JOURNAL, and they are removed from the dataset. System events and operator actions (identified by a blank AlarmState column) are removed as well. By removing



unnecessary records, the number of records is reduced to about 20000. This indicates that a large number of messages are JOURNAL messages. Then, all data was consolidated into one file readable by RStudio. The date and time formats are changed to be readable by RStudio as well. The next step is to remove unnecessary columns, which include AlarmLimit, Priority, TagDescription, and Value. Then, unique alarms are created by concatenating TagName and AlarmType, e.g., TAG275.PVHI. This results in 404 unique alarms. The columns TagName and AlarmType can now be removed. In this dataset, the messages are already sorted in chronological order. We then remove stale alarms, which account for about 0.5% of the messages. We add a column, Number, to identify each message. The next two steps are the removal of chattering alarms and alarm data imputation. These steps are different for point-based vs. interval-based data. This dataset is obviously interval-based, but for demonstration purposes we consider both cases. If the RTN messages are removed, we end up with point-based data and about 10000 messages. If a fixed global time-frame of $T_w = 60$ seconds is used, 49.9% of the data is removed. This means that about half of the messages are chattering alarms, which is not unusual. Now if a variable-dependent fixed time-frame is selected based on the ISA recommendations for delay-timers (i.e., $T_w = 60$ seconds for temperature tags and so on), 45.2% of the messages are removed. This is about 5 percentage points less than with the global time-frame, which means that about 5% more messages, all of them for flow and pressure tags, are maintained. If we keep the RTN messages (hence interval-based data) and use a fixed global time-frame of $T_w = 60$ seconds, 54% of the data is removed. And if a moving global time-frame of the same size is used, 61.8% of the data is removed. The latter number, as expected, is larger than the former.
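The record-level steps described above can be outlined in code. The authors report using R; the following Python sketch is only an illustration of the same sequence of steps under stated assumptions: messages are dictionaries, the `DateTime` key and the function name are hypothetical, timestamps are ISO-style strings that sort chronologically, the sample rows are made up, and stale-alarm removal is omitted.

```python
def preprocess(records):
    """Drop JOURNAL messages and system events, build unique alarms, sort, and number."""
    cleaned = []
    for r in records:
        if r["Priority"] == "JOURNAL":  # not shown to the operator
            continue
        if not r["AlarmState"]:         # blank state: system event or operator action
            continue
        cleaned.append({
            "DateTime": r["DateTime"],
            # Unique alarm = TagName and AlarmType concatenated, e.g. TAG275.PVHI
            "UniqueAlarm": f'{r["TagName"]}.{r["AlarmType"]}',
            "AlarmState": r["AlarmState"],
        })
    cleaned.sort(key=lambda r: r["DateTime"])
    for number, r in enumerate(cleaned, start=1):
        r["Number"] = number            # identifier column for each message
    return cleaned


# Made-up sample rows for illustration.
rows = [
    {"DateTime": "2018-01-20 17:26:27", "TagName": "TAG097", "AlarmType": "PVHI",
     "Priority": "HIGH", "AlarmState": "RTN"},
    {"DateTime": "2018-01-20 17:26:25", "TagName": "TAG097", "AlarmType": "PVHI",
     "Priority": "HIGH", "AlarmState": "ALM"},
    {"DateTime": "2018-01-20 17:30:00", "TagName": "TAG100", "AlarmType": "OFFNORM",
     "Priority": "JOURNAL", "AlarmState": "ALM"},
    {"DateTime": "2018-01-20 17:31:00", "TagName": "TAG275", "AlarmType": "PVHI",
     "Priority": "LOW", "AlarmState": ""},
]
for message in preprocess(rows):
    print(message["Number"], message["DateTime"], message["UniqueAlarm"], message["AlarmState"])
```

Dropping the RTN rows from the result would give the point-based version of the same data.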
As for the missing messages, in this dataset about 1% of the messages are missing from the beginning and the end, and about 1% from the middle. We explore the two approaches to impute the missing messages. The first approach is based on the ON–OFF median index. As an example, consider TAG029.PVLO, which has a missing RTN (see Table 7 (a)). The

Figure 10: ON–OFF RLD for TAG029.PVLO (horizontal axis: time in seconds, scaled by 10^4; vertical axis: frequency; the marked value is 4883 s)

Table 7: Alarm and RTN messages for TAG029.PVLO before and after imputation

(a) Before imputation
Date and Time          AlarmState
2018-01-13 16:35:52    ALM
2018-01-13 18:12:42    RTN
2018-01-13 19:26:01    ALM
2018-01-24 15:15:34    ALM
2018-01-24 15:23:14    RTN

(b) After imputation
Date and Time          AlarmState
2018-01-13 16:35:52    ALM
2018-01-13 18:12:42    RTN
2018-01-13 19:26:01    ALM
2018-01-13 20:47:24    RTN
2018-01-24 15:15:34    ALM
2018-01-24 15:23:14    RTN

ON–OFF RLD of this unique alarm is depicted in Figure 10. It can be observed that the ON–OFF median index for this tag is 4883 seconds. This means that 50% of the time this alarm lasts no longer than 4883 seconds. So, this value can be used to construct the missing RTN, 4883 seconds after the alarm, as shown in Table 7 (b). Notice that this is not a chattering alarm and its median duration is long. The second approach for alarm data imputation is based on temporal patterns and information from other alarms. Notice that to use this approach, a tag has to satisfy some conditions, as previously stated. And even if all the conditions are satisfied, there is no guarantee that a temporal pattern exists, although most of the time it does. We first divide the dataset into transactions that are 30 minutes apart (meaning that



the last RTN in a transaction is more than 30 minutes apart from the first alarm in the following transaction). This gives us 538 transactions with sizes of 1 to 656 alarms. Now, for example, consider TAG035.PVLO with a missing RTN (see Table 9). The alarm generated at 2018-06-25 14:57:40 is obviously missing an RTN. We first find the transaction that this alarm belongs to, i.e., the main transaction. Then, we select transactions that have 4 or more alarms, which results in 106 transactions. Within this set, we find all transactions that include an alarm message for TAG035.PVLO, and among those the one that has the most unique alarm tags in common with the main transaction. This is the related transaction. Then we separate the subsets of the two transactions that include the common tags and run the algorithm to find rules. Here the rules that include TAG035.PVLO are important, which are shown in Table 8. Here, B, D, and O stand for before, during, and overlap, respectively.

Table 8: Discovered rules that include TAG035.PVLO
TAG030.PVLO=B=TAG035.PVLO
TAG035.PVLO=O=TAG063.OPLO
TAG035.PVLO=O=TAG185.PVLO
TAG063.OPLO=D=TAG035.PVLO
TAG185.PVLO=O=TAG035.PVLO
TAG030.PVLO=B=TAG035.PVLO=O=TAG185.PVLO
TAG030.PVLO=B=TAG185.PVLO=O=TAG035.PVLO
TAG035.PVLO=O=TAG063.OPLO=O=TAG185.PVLO
TAG035.PVLO=O=TAG185.PVLO=O=TAG063.OPLO
TAG035.PVLO=O=TAG185.PVLO=D=TAG063.OPLO
TAG063.OPLO=D=TAG035.PVLO=O=TAG185.PVLO
TAG185.PVLO=O=TAG035.PVLO=O=TAG063.OPLO
TAG063.OPLO=O=TAG185.PVLO=O=TAG035.PVLO
TAG185.PVLO=O=TAG063.OPLO=D=TAG035.PVLO
TAG185.PVLO=D=TAG063.OPLO=D=TAG035.PVLO

Among these rules, those with two tags are more useful. Those with three or more tags can be considered only if they are consistent with the two-tag rules. Based on these rules, the alarm patterns are obtained as shown in Figure 11. The imputed RTN message for


Figure 11: Temporal alarm patterns (tags shown: TAG030.PVLO, TAG185.PVLO, TAG035.PVLO, TAG063.PVLO)

Table 9: Alarm and RTN messages for TAG035.PVLO before and after imputation

(a) Before imputation
Date and Time          AlarmState
2018-06-12 20:36:43    RTN
2018-06-25 14:57:40    ALM
2018-08-14 12:53:16    ALM
2018-08-14 15:35:19    RTN

(b) After imputation
Date and Time          AlarmState
2018-06-12 20:36:43    RTN
2018-06-25 14:57:40    ALM
2018-06-25 19:48:05    RTN
2018-08-14 12:53:16    ALM
2018-08-14 15:35:19    RTN

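TAG035.PVLO is then obtained from

\[
\mathrm{OFF}^{\mathrm{new}}_{\mathrm{TAG035.PVLO}} = \max\left\{
\mathrm{ON}^{\mathrm{new}}_{\mathrm{TAG035.PVLO}} + \mathrm{OFF}^{\mathrm{old}}_{\mathrm{TAG035.PVLO}} - \mathrm{ON}^{\mathrm{old}}_{\mathrm{TAG035.PVLO}},\;
\mathrm{OFF}^{\mathrm{new}}_{\mathrm{TAG063.OPLO}} + \mathrm{OFF}^{\mathrm{old}}_{\mathrm{TAG035.PVLO}} - \mathrm{OFF}^{\mathrm{old}}_{\mathrm{TAG063.OPLO}}
\right\},
\]

or \(\mathrm{OFF}^{\mathrm{new}}_{\mathrm{TAG035.PVLO}} = \max\{\text{17:40:19},\ \text{19:48:05}\}\). So, the RTN message is added at 19:48:05. Table 9 shows the messages before and after imputation.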
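The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `ts` and `impute_rtn` are hypothetical, and the timestamps in the first call are made up for the example. Only the two candidate times in the last lines (17:40:19 and 19:48:05) come from the text.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def ts(text):
    """Parse a timestamp in the format used in Table 9."""
    return datetime.strptime(text, FMT)


def impute_rtn(on_new, on_old, off_old, related_offs):
    """Return the imputed RTN time (OFF^new) for the tag with the missing RTN.

    on_new       -- ALM time of the alarm whose RTN is missing
    on_old       -- ALM time of the same tag in the related transaction
    off_old      -- RTN time of the same tag in the related transaction
    related_offs -- (off_new, off_old) RTN pairs of related tags
    """
    # Candidate 1: the alarm lasts as long as it did in the related transaction.
    candidates = [on_new + (off_old - on_old)]
    # Other candidates: keep the old lag between this RTN and a related tag's RTN.
    for rel_off_new, rel_off_old in related_offs:
        candidates.append(rel_off_new + (off_old - rel_off_old))
    # As in the formula above, the latest candidate is taken.
    return max(candidates)


# Hypothetical timestamps (not taken from the plant data).
imputed = impute_rtn(
    on_new=ts("2018-02-01 08:00:00"),
    on_old=ts("2018-01-01 10:00:00"),
    off_old=ts("2018-01-01 12:00:00"),
    related_offs=[(ts("2018-02-01 10:15:00"), ts("2018-01-01 11:30:00"))],
)
print(imputed)  # 2018-02-01 10:45:00

# The two candidates reported in the text for TAG035.PVLO:
paper_result = max(ts("2018-06-25 17:40:19"), ts("2018-06-25 19:48:05"))
print(paper_result)  # 2018-06-25 19:48:05
```

In the hypothetical call, the first candidate is 08:00:00 plus the old two-hour duration (10:00:00), and the second is the related tag's new RTN plus the old 30-minute lag (10:45:00), so the later time is returned.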

7 Conclusion

Preprocessing is often regarded as the most time-consuming and tedious step in data mining. In this paper, preprocessing of alarm data for data mining was studied in detail. The procedure suggested for preprocessing of alarm data is summarized in the following steps:

1. After completing the first two steps of the CRISP-DM method, select a time period and an area/unit of the plant for data mining.




2. Collect alarm data over the selected period and for the selected area/unit.
3. Consolidate the collected data in one dataset.
4. Remove unnecessary records (duplications, system events, operator actions, acknowledgments, etc.).
5. Unify the data into a software-readable format.
6. Remove unnecessary columns (tag descriptions, values, alarm limits, etc.).
7. Create unique alarms if required (often by concatenating tag name and alarm type).
8. Sort the messages in chronological order.
9. Remove stale alarms.
10. Decide whether to proceed with point-based or interval-based data.
11. Select a time-frame Tw for chattering removal. It can be a predetermined global time-frame, or variable-dependent based on ISA 18.2 recommendations for delay-timers (Table 2).
12. Decide whether to use a fixed or moving time-frame based on the objective of data mining.
13. Based on the type of data (point-based vs. interval-based) and the time-frame (fixed vs. moving), use one of Algorithms 1–4 to remove chattering.
14. If the alarm data is interval-based, investigate whether any messages are missing.
15. Impute missing messages using either the ON-OFF median index or temporal patterns (Algorithm 5).
16. Proceed with step 4 of the CRISP-DM method.

In this paper, a detailed investigation of the concept of chattering alarms was given. It is a general practice to remove chattering alarms before mining. A few methods to remove



chattering were proposed here (steps 11–13 of the aforementioned procedure). To do that, a time-frame is selected and all chattering alarms within that time-frame are removed. Two approaches are proposed: a fixed time-frame and a moving time-frame. The fixed time-frame is simpler to implement, but it cannot remove all chattering alarms: depending on the time-frame, a certain number of alarms are preserved. The moving time-frame, on the other hand, is harder to implement, but it can remove all chattering alarms in an episode. Commonly, a global time-frame is selected for all alarm tags, which is simple to implement. An important observation, though, is that chattering might vary for different variables, depending on their time constants. Hence, a variable-dependent time-frame for chattering removal was presented. This gives the data miner a tool to fine-tune the algorithm. A general rule of thumb is to use the delay-timers suggested by the ISA 18.2 standard, or a multiple thereof, which is simple and general. It also provides a more consistent set of removal rules.

Another important step in preprocessing of alarm data is reconstructing missing messages. Missing alarm and/or RTN messages are observed in alarm databases. Data imputation ensures that the information carried by an alarm message whose RTN is missing (or vice versa) is preserved in the analysis. It was shown that, if a message is missing for a particular tag, the history of its alarms can be used to calculate the median alarm duration and construct a message accordingly. This approach is easy to implement and can be automated. A rich database with enough historical alarms for a specific alarm tag gives a good estimate of the median duration of alarms. Alternatively, one can use information from other related tags, through temporal patterns. Correlations are often observed between process variables and alarm tags, which can be used to find the patterns and use them for message imputation.
The temporal-pattern approach requires more computational resources and also some level of expert knowledge for evaluating the patterns. The advantage is that the missing message is imputed based on the closest pattern. Other pattern-finding techniques can also be used for data imputation. This needs further

investigation and is the subject of future study.
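The simpler of the two imputation approaches summarized above, which uses the median alarm duration of the same tag, can be sketched as follows. This is a minimal illustration and not the paper's exact ON-OFF median index: the function names `median_duration` and `impute_pair` and the alarm history are hypothetical, and the sketch assumes that complete (ALM, RTN) pairs for the tag are already available.

```python
from datetime import datetime, timedelta
from statistics import median


def median_duration(history):
    """Median ALM-to-RTN duration over complete (alm, rtn) pairs of one tag."""
    seconds = [(rtn - alm).total_seconds() for alm, rtn in history]
    return timedelta(seconds=median(seconds))


def impute_pair(alm, rtn, duration):
    """Fill in whichever of alm / rtn is None using the given duration."""
    if alm is not None and rtn is None:
        return alm, alm + duration
    if alm is None and rtn is not None:
        return rtn - duration, rtn
    raise ValueError("exactly one of alm / rtn must be missing")


# Hypothetical history for one alarm tag: durations of 10, 20 and 60 minutes.
day = datetime(2018, 1, 1)
history = [
    (day + timedelta(hours=1), day + timedelta(hours=1, minutes=10)),
    (day + timedelta(hours=5), day + timedelta(hours=5, minutes=20)),
    (day + timedelta(hours=9), day + timedelta(hours=10)),
]
d = median_duration(history)
print(d)  # 0:20:00

# An alarm raised at 15:00 with no RTN gets one 20 minutes later.
filled = impute_pair(day + timedelta(hours=15), None, d)
print(filled[1])  # 2018-01-01 15:20:00
```

In this example the median (20 minutes) is not affected by the single 60-minute alarm, whereas the mean would be 30 minutes.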

References

(1) Rothenberg, D. H. Alarm management for process control; Momentum, 2009.
(2) International Society of Automation (ISA), Management of Alarm Systems for the Process Industries, ANSI/ISA–18.2; 2016.
(3) Engineering Equipment and Materials Users Association (EEMUA), Alarm Systems: A Guide to Design, Management and Procurement, 3rd ed.; EEMUA Publication 191, 2013.
(4) Izadi, I.; Shah, S. L.; Chen, T. Effective resource utilization for alarm management. 49th IEEE Conference on Decision and Control (CDC). 2010; pp 6803–6808.
(5) Cheng, Y.; Izadi, I.; Chen, T. Optimal alarm signal processing: Filter design and performance analysis. IEEE Transactions on Automation Science and Engineering 2013, 10, 446–451.
(6) Izadi, I.; Shah, S. L.; Shook, D. S.; Kondaveeti, S. R.; Chen, T. A framework for optimal design of alarm systems. IFAC Proceedings 2009, 42, 651–656.
(7) Afzal, M. S.; Chen, T.; Bandehkhoda, A.; Izadi, I. Analysis and design of time-deadbands for univariate alarm systems. Control Engineering Practice 2018, 71, 96–107.
(8) Kondaveeti, S. R.; Izadi, I.; Shah, S. L.; Black, T.; Chen, T. Graphical tools for routine assessment of industrial alarm systems. Computers & Chemical Engineering 2012, 46, 39–47.




(9) Hu, W.; Chen, T.; Shah, S. L. Detection of Frequent Alarm Patterns in Industrial Alarm Floods Using Itemset Mining Methods. IEEE Transactions on Industrial Electronics 2018, 65, 7290–7300.
(10) Li, T.; Tan, W.; Li, X. Data mining algorithm for correlation analysis of industrial alarms. Cluster Computing 2017, 1–11.
(11) Hu, W.; Wang, J.; Chen, T.; Shah, S. L. Cause-effect analysis of industrial alarm variables using transfer entropies. Control Engineering Practice 2017, 64, 205–214.
(12) Yang, F.; Shah, S.; Xiao, D. Signed directed graph based modeling and its validation from process knowledge and process data. International Journal of Applied Mathematics and Computer Science 2012, 22, 41–53.
(13) Gao, H.; Xu, Y.; Zhu, Q. Spatial interpretive structural model identification and AHP-based multimodule fusion for alarm root-cause diagnosis in chemical processes. Industrial & Engineering Chemistry Research 2016, 55, 3641–3658.
(14) Larose, D. T.; Larose, C. D. Discovering knowledge in data: an introduction to data mining; John Wiley & Sons, 2014.
(15) Cisar, P.; Hostalkova, E.; Stluka, P. Data mining techniques for alarm rationalization. 19th European Symposium on Computer Aided Process Engineering, Cracow, Poland. 2009; pp 1457–1462.
(16) Ahmed, K.; Izadi, I.; Chen, T.; Joe, D.; Burton, T. Similarity analysis of industrial alarm flood data. IEEE Transactions on Automation Science and Engineering 2013, 10, 452–457.
(17) Folmer, J.; Vogel-Heuser, B. Computing dependent industrial alarms for alarm flood reduction. Systems, Signals and Devices (SSD), 2012 9th International Multi-Conference on. 2012; pp 1–6.




(18) Wang, J.; Li, H.; Huang, J.; Su, C. Association rules mining based analysis of consequential alarm sequences in chemical processes. Journal of Loss Prevention in the Process Industries 2016, 41, 178–185.
(19) Hu, W.; Chen, T.; Shah, S. L. Discovering association rules of mode-dependent alarms from alarm and event logs. IEEE Transactions on Control Systems Technology 2017, 26, 971–983.
(20) Niyazmand, T.; Izadi, I. Pattern Mining in Alarm Flood Sequences Using a Modified PrefixSpan Algorithm. ISA Transactions 2019.
(21) Büttner, S.; Wunderlich, P.; Heinz, M.; Niggemann, O.; Röcker, C. Managing Complexity: Towards Intelligent Error-Handling Assistance Trough Interactive Alarm Flood Reduction. International Cross-Domain Conference for Machine Learning and Knowledge Extraction. 2017; pp 69–82.
(22) Cheng, Y.; Izadi, I.; Chen, T. Pattern matching of alarm flood sequences by a modified Smith–Waterman algorithm. Chemical Engineering Research and Design 2013, 91, 1085–1094.
(23) Wu, S.-Y.; Chen, Y.-L. Mining nonambiguous temporal patterns for interval-based events. IEEE Transactions on Knowledge & Data Engineering 2007, 742–758.
(24) Karoly, R.; Abonyi, J. Multi-temporal sequential pattern mining based improvement of alarm management systems. Systems, Man, and Cybernetics (SMC), 2016 IEEE International Conference on. 2016; pp 003870–003875.
(25) Wang, J.; Chen, T. An online method for detection and reduction of chattering alarms due to oscillation. Computers & Chemical Engineering 2013, 54, 140–150.
(26) Hollifield, B.; Habibi, E. Alarm Management: A Comprehensive Guide; ISA, 2011.




(27) Guo, C.; Hu, W.; Lai, S.; Yang, F.; Chen, T. An accelerated alignment method for analyzing time sequences of industrial alarm floods. Journal of Process Control 2017, 57, 102–115.
(28) Kondaveeti, S. R.; Izadi, I.; Shah, S. L.; Shook, D. S.; Kadali, R.; Chen, T. Quantification of alarm chatter based on run length distributions. Chemical Engineering Research and Design 2013, 91, 2550–2558.
(29) Kong, X.; Wei, Q.; Chen, G. An approach to discovering multi-temporal patterns and its application to financial databases. Information Sciences 2010, 180, 873–885.



Table of Contents/Abstract Graphics
