Effect of Data Compression on Pattern Matching in Historical Data

Ind. Eng. Chem. Res. 2005, 44, 3203-3212


Ashish Singhal*
Controls Research Group, Johnson Controls, Inc., 507 East Michigan Street, Milwaukee, Wisconsin 53202

Dale E. Seborg†
Department of Chemical Engineering, University of California, Santa Barbara, California 93106

It is a common practice in the process industry to compress process data before it is archived. However, compression may alter the original data in a manner that makes extracting useful information from it more difficult. In this paper, popular data compression methods and their effect on pattern matching in historical data are evaluated. Pattern matching is performed using principal-component-analysis-based similarity factors. Simulation results indicate that wavelet-based compression provides the best compression for pattern matching, while compression using OSI PI software produces the best reconstruction of the data.

1. Introduction

Because of advances in information technology, large amounts of data produced by industrial plants are recorded as frequently as every 1 s using data historians. With data storage media being relatively inexpensive, the cost of storing data has dropped sharply. The mechanical and electronic aspects of data acquisition, organization, and storage are well developed.1-3 However, with rapid globalization and decentralization, many companies require data to be made available at different sites around the world. The focus for data historian software is therefore not only how to record data in an efficient and reliable manner but also how to make the data available on demand in a timely fashion to a wide variety of personnel at different locations. To speed up the transfer of relevant data across data networks and via the Internet, it is beneficial to transmit data in a compressed form. Data compression is also becoming relevant for control networks with wireless sensors that transmit data intermittently to extend battery life.

An efficient data compression method must maximize the degree of compression while retaining key features of the original data. There have been extensive studies in the areas of image compression and acoustic signal compression,4-6 but there are only a few relevant publications on the compression of process data.7 A recent paper by Thornhill et al.8 described the detrimental effect of popular compression methods on basic data statistics such as the average, variance, and spectral density.

One of the earliest papers on the compression of process data was published by Hale and Sellars.1 They provided an excellent overview of the issues in the compression of process data and also described piecewise linear compression methods that were being employed at the DuPont company. By compressing data, they were able to increase computer capability and satisfy the demand for process information by operators and engineers. Data compression thus permitted ever-increasing stored data to remain accessible with short response times.

* To whom correspondence should be addressed. Tel.: (414) 524-4688. Fax: (414) 524-5810. E-mail: [email protected].
† E-mail: [email protected].

Kennedy3 and Bascur and Kennedy9 extended the ideas proposed by Hale and Sellars1 by developing information systems that integrate data recording with efficient retrieval and presentation to aid operations management, maintenance, crisis management, warehousing, engineering, business, and scheduling, among others. Other researchers have developed several algorithms to compress time-varying signals in efficient ways. Bristol10 modified the piecewise linear compression methods of Hale and Sellars1 to provide a swinging-door data compression algorithm. Mah et al.11 proposed a complex piecewise linear online trending algorithm that performed better than the classical boxcar, backward-slope, and swinging-door methods. Although their algorithm could adapt to process variability and noise, it was slower than the swinging-door method. Bakshi and Stephanopoulos12-14 published a series of papers on representing trends and compressing process data using wavelet methods. Their methods were batch methods because a batch of data was required before it could be compressed. Recently, Misra et al.15,16 developed an online data compression method using wavelets, in which the algorithm computes and updates the wavelet decomposition tree before receiving the next data point. Their approach achieved better compression than batch methods.

In this paper, we evaluate different data compression methods in terms of pattern matching. The data compression methods are evaluated on the basis of not only how accurately they represent process data but also how they affect the identification of similar patterns in historical data. For example, a particular method may compress data so that the compressed data require very little storage while accurately representing the original process data. However, it may transform the original data in a way that is detrimental to methods that attempt to extract useful information from the data, such as the pattern-matching methodology presented by Singhal and Seborg.17-19 Preliminary results on this research topic were presented previously.20 This paper provides a detailed description of the different compression methods and their effects on pattern matching.



Figure 1. Boxcar method for online data compression: (×) recorded value; (○) last recorded point; (●) unrecorded value; (→) value that caused recording.

2. Popular Data Compression and Reconstruction Methods for Time-Series Data

This section describes some of the popular compression methods for time-series data. Because the accuracy of retrieved data depends not only on the method that was used for compression but also on the method that was used for reconstruction, simple reconstruction techniques such as zero-order hold and linear interpolation are also discussed briefly.

2.1. Data Compression Methods. The popular boxcar piecewise linear compression method1 is considered in this paper. Boxcar is also sometimes referred to as a "change-of-value" compression method and is illustrated in Figure 1. It is summarized as algorithm 1. Another common data compression method used in the process industries, data averaging, is described in section 2.1.1. Wavelet-based compression, a newer method that is gaining popularity, is discussed in section 2.1.2. Finally, data compression performed by the commercially available Plant Information, or PI, software21 is described briefly in section 2.1.3.

Algorithm 1. Boxcar compression algorithm. If V is the current value of the process variable, VR is the most recent recorded value, VL is the last value processed, and H is the recording limit, then do the following:
1. Calculate |V - VR|.
2. If |V - VR| ≥ H, then record the previous value processed, VL, not the current value V, which caused the triggering of the recording.
3. Recording is accomplished by setting VR = VL.
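To make the change-of-value logic concrete, a minimal Python sketch of algorithm 1 follows. The function name and the choice to always retain the first and last points are our own illustration, not part of any commercial implementation.

```python
import numpy as np

def boxcar_compress(values, times, H):
    """Boxcar (change-of-value) compression, following algorithm 1.

    A point is recorded only when the current value deviates from the
    most recently recorded value by at least the recording limit H; the
    point recorded is the previous value processed, not the one that
    triggered the recording.
    """
    rec_t, rec_v = [times[0]], [values[0]]   # always keep the first point
    v_R = values[0]                          # most recent recorded value
    for i in range(1, len(values)):
        if abs(values[i] - v_R) >= H:        # step 2 of algorithm 1
            rec_t.append(times[i - 1])       # record the last value processed
            rec_v.append(values[i - 1])
            v_R = values[i - 1]              # step 3: V_R <- V_L
    rec_t.append(times[-1])                  # keep the final point so that
    rec_v.append(values[-1])                 # reconstruction spans the record
    return np.array(rec_t), np.array(rec_v)

# usage: compress a noisy ramp with recording limit H = 0.5
t = np.arange(0.0, 60.0, 1.0)
x = 0.05 * t + 0.1 * np.random.randn(t.size)
t_c, x_c = boxcar_compress(x, t, H=0.5)
print(f"compression ratio = {t.size / t_c.size:.1f}")
```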



2.1.1. Data-Averaging Compression. A common compression technique for time-series data is to simply average the data over a specified period of time. In this case, the compression is performed offline, rather than online, because a batch of data must be collected before it can be averaged. For example, if the averaging period is 2 min while the sampling period is 5 s, then 24 samples need to be collected for averaging before the next compressed data point is recorded.

2.1.2. Wavelet-Based Compression. The development of wavelets has received considerable interest during the past 15 years.22,23 Wavelet analysis is an emerging field of mathematics that has provided new methods and algorithms suited for the types of problems encountered in process monitoring and control.24 The wavelet transform is a method that divides data, functions, or operators into different frequency components and then processes each component with a resolution matched to its scale.22 The wavelet transform of a time-varying signal depends on two variables: scale (frequency) and time. Wavelets provide a tool for time-frequency localization. This time-frequency representation is of particular interest in analyzing time-varying patterns that are characteristic of process data. A time-varying signal f(t) can be represented through the discrete wavelet transformation as follows:25

f(t) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} d(k,l)\, \psi(2^{-k}t - l)    (1)

where the integer k represents the frequency scale at which the signal is decomposed and l represents its position in time. The wavelet decomposition coefficients, d(k,l), describe the frequency content of the signal f(t) at different times. The function ψ(t) is defined as the mother wavelet function.22,23,25 The most common mother wavelet functions used in the literature are the Haar function and the Daubechies family of orthogonal wavelet functions.22,23 The Haar wavelet function is used in this paper and is defined as

\psi(t) = \begin{cases} 1 & 0 \le t < 1/2 \\ -1 & 1/2 \le t < 1 \\ 0 & \text{otherwise} \end{cases}    (2)

Wavelet transforms can be used to compress time-series data by thresholding the wavelet coefficients d(k,l).15,26 Thresholding is a method by which wavelet coefficients that are greater than a specified threshold are retained, while those smaller than the threshold are discarded. This type of thresholding is called hard thresholding.15 Another type of thresholding, known as soft thresholding, scales the wavelet coefficients that are smaller than the specified threshold.27-29 Hard thresholding is used in this paper. If d(k,l) is the wavelet coefficient at frequency scale k and at time location l and φ is the wavelet threshold, then the thresholded wavelet coefficients d*(k,l) are defined as

d^*(k,l) \equiv \begin{cases} 0 & |d(k,l)| < \phi \\ d(k,l) & |d(k,l)| \ge \phi \end{cases}    (3)
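The following sketch illustrates Haar decomposition with hard thresholding of the detail coefficients, as in eq 3, assuming a signal whose length is a power of 2. The helper names are hypothetical, and the decompose-threshold-reconstruct round trip is shown in one call for simplicity; a production system would store only the nonzero coefficients, or use an online algorithm such as that of Misra et al.15

```python
import numpy as np

def haar_step(x):
    """One level of the Haar transform: pairwise sums give the
    approximation, pairwise differences give the details."""
    pairs = x.reshape(-1, 2)
    a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return a, d

def haar_inverse(a, d):
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def wavelet_roundtrip(x, phi, levels=4):
    """Decompose, hard-threshold the details per eq 3, and reconstruct."""
    a, details = x.astype(float), []
    for _ in range(levels):
        a, d = haar_step(a)
        d[np.abs(d) < phi] = 0.0          # hard thresholding, eq 3
        details.append(d)
    for d in reversed(details):           # missing details are zero
        a = haar_inverse(a, d)
    return a

# usage: a noisy sinusoid of dyadic length 2**8
x = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.05 * np.random.randn(256)
x_hat = wavelet_roundtrip(x, phi=0.2)
print("reconstruction MSE:", np.mean((x - x_hat) ** 2))
```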

For data compression, only the nonzero thresholded wavelet coefficients d*(k,l) are stored. These thresholded coefficients can then be used to reconstruct the data when needed; for data reconstruction, the missing wavelet detail coefficients are assumed to be zero. In this paper, the recording limits for each process variable are used as threshold values. Like the data-averaging method, the wavelet-based compression algorithm used here is a batch method; online versions of the algorithm are also available.15

2.1.3. Compression Using Commercial PI Software. Commercial data archiving software, such as PI from OSI Software, Inc., is very efficient at storing data because it automatically compresses time-series data while storing them in a way that provides the best reconstruction. Because PI is widely used for data archiving, it is informative to compare this commercial software with the classical techniques. In particular, the BatchFile Interface for the PI software was used for data compression.21


2.2. Data Reconstruction Methods. All of the data compression methods described in the previous section result in lossy compression; i.e., it is not possible to reconstruct the compressed data to exactly match the original data. The accuracy with which compressed data can describe the original uncompressed data depends not only on the compression algorithm but also on the method of data reconstruction. Many reconstruction methods are available, such as zero-order hold, where the value of a variable is held at the last recorded value until the next recording. Although the zero-order hold is the simplest method, it is not the best method for reconstructing compressed data, because holding the value of the variable constant since the last recording discards much of the information about the possible values of the variable between recordings. However, a zero-order hold can be useful in situations where the process is at steady state. Linear interpolation is a simple method that can overcome part of this limitation by reconstructing data between recordings. It can provide more accurate reconstruction both for situations where the process is at steady state and for situations where process variables show trends. Higher-order methods such as cubic-spline interpolation can be used to obtain better reconstruction accuracy, but they are computationally more intensive and produce large deviations from the process trend line when the time between consecutive recordings is large. We use linear interpolation to reconstruct compressed data throughout this paper. More sophisticated methods, such as the use of the expectation-maximization algorithm for data reconstruction, have also been proposed.30,31 However, these methods are sensitive to the amount of missing data and do not perform well when more than 20% of the data are missing.30,31
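A brief sketch of the two simple reconstruction methods, assuming recordings rec_t and rec_v produced by any of the compression methods above (the function name and interface are hypothetical):

```python
import numpy as np

def reconstruct(rec_t, rec_v, t_query, method="linear"):
    """Rebuild a signal on a dense time grid from compressed recordings."""
    if method == "zero-order-hold":
        # hold the last recorded value until the next recording
        idx = np.searchsorted(rec_t, t_query, side="right") - 1
        return rec_v[np.clip(idx, 0, rec_v.size - 1)]
    # linear interpolation between consecutive recordings
    return np.interp(t_query, rec_t, rec_v)

# usage with hypothetical boxcar output
rec_t = np.array([0.0, 10.0, 25.0, 59.0])
rec_v = np.array([0.1, 0.6, 1.3, 3.0])
t_dense = np.arange(0.0, 60.0, 1.0)
x_zoh = reconstruct(rec_t, rec_v, t_dense, method="zero-order-hold")
x_lin = reconstruct(rec_t, rec_v, t_dense, method="linear")
```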

3. Pattern-Matching Approach

A new pattern-matching strategy developed by the authors is summarized in the flowchart of Figure 2.17-19,32 First, the user defines the snapshot data that serve as a template for searching the historical database. The snapshot specifications consist of (i) the relevant process variables and (ii) the duration of the period of interest. These specifications can be arbitrarily chosen by the user; no special plant tests or preimposed conditions are necessary.

Figure 2. Proposed pattern-matching approach.

To find periods of historical data that are similar to the snapshot data, a window of the same size as the snapshot data is moved through the historical data. The similarity between the snapshot and historical data in the moving window is characterized by the SPCA and Sdist similarity factors that are described in section 4. The historical data windows with the largest values of the similarity factors are collected in a candidate pool. The individual data windows in the candidate pool are called records. After the candidate pool has been formed, a person familiar with the process can then perform a more detailed examination of the records. Details of the moving-window approach for pattern matching are provided by Singhal and Seborg.17

4. Pattern Matching Using Similarity Factors

Pattern matching in historical data involves comparisons of different datasets. Suppose that it is desired to find historical datasets that are similar to a particular dataset of current interest. This latter dataset will be referred to as the snapshot data. Because process data are available as multivariate time series where variables are correlated with each other, metrics are required to compare the historical multivariate data to the snapshot data. The authors have proposed and used similarity factors based on principal-component analysis (PCA) for comparing such multivariate datasets.17-19,32 These similarity factors are briefly described next.

4.1. PCA Similarity Factor. Krzanowski33 developed a method for measuring the similarity of two datasets using a PCA similarity factor, SPCA. Consider two datasets that contain the same n variables but not necessarily the same number of measurements. We assume that the PCA model for each dataset contains k principal components, where k ≤ n. The number of principal components, k, is chosen such that k principal components describe at least 95% of the total variance in each dataset. The similarity between the two datasets is then quantified by comparing their principal components. The appeal of the similarity factor approach is that the similarity between two multivariate datasets is quantified by a single number, SPCA.

Consider a current snapshot dataset S and a historical dataset H, with each dataset consisting of m measurements of the same n variables. Let kS be the number of principal components that describe at least 95% of the variance in dataset S and kH be the number of principal components that describe at least 95% of the variance in dataset H. Let k = max(kS, kH), which ensures that k principal components describe at least 95% of the variance in each dataset. Then subspaces of the S and H datasets can be constructed by selecting only the first k principal components for each dataset. The PCA similarity factor compares these reduced subspaces and can be calculated from the angles between principal components:33

S_{PCA} = \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{k} \cos^2 \theta_{ij}    (4)

where θij is the angle between the ith principal component of dataset S and the jth principal component of dataset H. A value of SPCA close to 1 means that the datasets S and H are similar.
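Equation 4 can be evaluated directly from the singular-value decompositions of the mean-centered datasets: because the principal components have unit length, the double sum of cos² θij equals the squared Frobenius norm of the product of the two loading matrices. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def pca_components(X, var_explained=0.95):
    """Principal directions of a mean-centered dataset via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)         # cumulative variance
    k = int(np.searchsorted(frac, var_explained)) + 1
    return Vt, k                                   # rows of Vt are components

def s_pca(S, H, var_explained=0.95):
    """PCA similarity factor of eq 4 between datasets S and H."""
    Vs, ks = pca_components(S, var_explained)
    Vh, kh = pca_components(H, var_explained)
    k = max(ks, kh)                                # k = max(kS, kH)
    Ls, Lh = Vs[:k], Vh[:k]                        # first k components
    # sum_i sum_j cos^2(theta_ij) = || Ls Lh^T ||_F^2
    return np.sum((Ls @ Lh.T) ** 2) / k

# usage: two datasets with m = 200 measurements of n = 5 variables
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 5))
H = S + 0.1 * rng.standard_normal((200, 5))
print(f"S_PCA = {s_pca(S, H):.3f}")   # close to 1 for similar datasets
```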


Figure 3. Pattern matching in historical data using a moving-window approach.

4.2. Distance Similarity Factor. The distance similarity factor, Sdist,17 compares two datasets that may have similar spatial orientation but are located far apart. This similarity factor is particularly useful when two data windows have similar principal components, but the values of the process variables may be different because of disturbances of varying magnitudes or setpoint changes. The distance similarity factor can be used to distinguish between these cases. The Mahalanobis distance,34 Φ, from the center of the historical dataset, x̄H, to the center of the current snapshot dataset, x̄S, is defined as

\Phi = \sqrt{(\bar{x}_H - \bar{x}_S)\, \Sigma_S^{*-1}\, (\bar{x}_H - \bar{x}_S)^T}    (5)

where x̄S and x̄H are sample mean vectors, ΣS is the covariance matrix for dataset S, and ΣS*-1 is the pseudoinverse of ΣS calculated using singular-value decomposition. The number of singular values used to calculate the pseudoinverse is k = max(kS, kH). The new distance similarity factor, Sdist, is defined as the probability that the center of the historical dataset, x̄H, is at least a distance Φ from the snapshot dataset S:

S_{dist} \equiv 2\, \frac{1}{\sqrt{2\pi}} \int_{\Phi}^{\infty} e^{-z^2/2}\, dz = 2\left[1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\Phi} e^{-z^2/2}\, dz\right]    (6)
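Because the standard normal tail probability can be written in terms of the complementary error function, eq 6 reduces to Sdist = erfc(Φ/√2), which is straightforward to evaluate in software. A sketch under the definitions of eqs 5 and 6 (function names are hypothetical):

```python
import numpy as np
from math import erfc, sqrt

def s_dist(S, H, k):
    """Distance similarity factor of eqs 5 and 6; k singular values
    are used for the pseudoinverse, as in the text."""
    xs, xh = S.mean(axis=0), H.mean(axis=0)
    cov_s = np.cov(S, rowvar=False)
    U, sv, Vt = np.linalg.svd(cov_s)
    # pseudoinverse built from the k largest singular values
    cov_inv = Vt[:k].T @ np.diag(1.0 / sv[:k]) @ U[:, :k].T
    diff = xh - xs
    phi = sqrt(diff @ cov_inv @ diff)    # Mahalanobis distance, eq 5
    return erfc(phi / sqrt(2.0))         # eq 6 as a complementary error function

# usage with datasets like those in the S_PCA sketch
rng = np.random.default_rng(1)
S = rng.standard_normal((200, 5))
H = S + 0.05 * rng.standard_normal((200, 5))
print(f"S_dist = {s_dist(S, H, k=3):.3f}")
```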

Note that a one-sided Gaussian distribution is used in eq 6 because Φ ≥ 0. The error function in eq 6 can be evaluated using standard tables or software packages. Because the distance similarity factor provides a natural complement to the PCA similarity factor, the proposed pattern-matching approach is based on both SPCA and Sdist. This approach compares datasets at two levels: (i) the orientation of their subspaces and (ii) the distance between them.

It is interesting to note that the Mahalanobis distance metric widely used in the pattern-matching literature and Hotelling's T² statistic used in process monitoring are related to each other. If k singular values are used to calculate ΣS*-1, then the square of the Mahalanobis distance, Φ, is equal to Hotelling's T² statistic for the point x̄H. The novelty of the proposed distance similarity factor is that it assigns a probability value between 0 and 1 to the Mahalanobis distance between the snapshot and historical datasets.

4.3. Selection of the Candidate Pool. The candidate pool is constructed by selecting the most similar historical data windows. Figure 3 illustrates pattern matching using the moving-window methodology. Suppose that a historical data window has been selected for the candidate pool based on its SPCA and Sdist values. In general, this record will contain data from two

operating periods, as shown in Figure 3. For the simulation case study in section 5, the record is considered to have been correctly identified if the center of the moving data window is located in the same type of operating period as the snapshot data. This situation is illustrated in Figure 3, where the snapshot data are for an operating condition called F10. Then a correct match between the moving historical window and the F10 snapshot data occurs when the center of the moving window lies inside an F10 period of historical data.

To avoid redundant counting of data windows that represent the same period of operation in the simulated historical database (i.e., historical data windows that are too close to each other), the historical data window that has the largest values of the similarity factors is chosen among the historical data windows that are within m observations of each other. This restriction results in the selection of data windows that are at least m observations apart. In the case study, m is chosen to be the number of data points in the snapshot data. This value is also the duration of each operating period in the historical database.

4.4. Performance Measures for Pattern Matching. Two important metrics are proposed to quantify the effectiveness of a pattern-matching technique. First, however, several definitions are introduced:

NP: The size of the candidate pool. NP is the number of historical data windows that have been labeled similar to the snapshot data by a pattern-matching technique. The data windows collected in the candidate pool are called records.

N1: The number of records in the candidate pool that are actually similar to the current snapshot, i.e., the number of correctly identified records.

N2: The number of records in the candidate pool that are actually not similar to the current snapshot, i.e., the number of incorrectly identified records. By definition, N1 + N2 = NP.

NDB: The total number of historical data windows that are actually similar to the current snapshot. In general, NDB ≠ NP.

The first metric, the pool accuracy p, characterizes the accuracy of the candidate pool:

p \equiv \frac{N_1}{N_P} \times 100\%    (7)

A second metric, the pattern-matching efficiency η, characterizes how effective the pattern-matching technique is in locating similar records in the historical database. It is defined as

\eta \equiv \frac{N_1}{N_{DB}} \times 100\%    (8)

When the pool size NP is small (NP < NDB), the efficiency η will be small because N1 ≤ NP. A theoretical maximum efficiency, ηmax, for a given pool size NP can be calculated as follows:

\eta_{max} \equiv \begin{cases} \frac{N_P}{N_{DB}} \times 100\% & N_P < N_{DB} \\ 100\% & N_P \ge N_{DB} \end{cases}    (9)

Figure 4. Representative time series showing data compression using wavelet and PI methods.

Because an effective pattern-matching technique should ideally produce large values of both p and η, an average of the two quantities, ξ, is used as a measure of the overall effectiveness of pattern matching:

\xi \equiv \frac{p + \eta}{2}    (10)

In pattern-matching problems, the relative importance of p and η metrics is application dependent. For example, a busy engineer may want to locate a small number of previous occurrences of an abnormal situation without having to waste time evaluating incorrectly identified records (false positives). In this situation, a large value of p is more important than a large value of η, and NP should be small (e.g., 2-5). In other types of applications, it might be desirable to locate most (or even all) previous occurrences of an abnormal situation, for business or legal reasons. This type of situation could arise during the investigation of a serious plant accident or after the recall of a defective product. Here, η is more important than p, and a relatively large value of the candidate pool size, NP, is acceptable.
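The four measures defined in eqs 7-10 are simple ratios, as the following sketch makes explicit (function and argument names are ours):

```python
def pool_metrics(n1, n_pool, n_db):
    """Pool accuracy p (eq 7), efficiency eta (eq 8), theoretical
    maximum efficiency (eq 9), and the combined measure xi (eq 10)."""
    p = 100.0 * n1 / n_pool
    eta = 100.0 * n1 / n_db
    eta_max = 100.0 * min(n_pool / n_db, 1.0)
    xi = 0.5 * (p + eta)
    return p, eta, eta_max, xi

# usage: a pool of 10 records, 7 correct, with 20 truly similar
# windows in the historical database
p, eta, eta_max, xi = pool_metrics(n1=7, n_pool=10, n_db=20)
print(p, eta, eta_max, xi)   # 70.0 35.0 50.0 52.5
```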

5. Simulation Case Study: Continuous Stirred Tank Reactor (CSTR) Example

To compare the effects of data compression on pattern matching, a case study was performed for a simulated chemical reactor. A nonlinear CSTR with cooling jacket dynamics, a variable liquid level, and a first-order irreversible reaction, A → B, was simulated. A total of 28 different operating conditions that included a number of faults of varying magnitudes, disturbances, and setpoint changes were simulated for the CSTR, and 14 process variables were recorded for each operating condition. The details of the simulation study and the 28 different operating conditions for the CSTR are available in previous publications.17,19,32 The simulation generated two groups of data. The first group contained 28 snapshot datasets, each corresponding to the nominal magnitude of a fault, while the second group contained a large historical dataset of over 474 000 observations that included faults, disturbances, and setpoint changes of varying magnitudes.


Table 1. Representative Operating Periods Used To Determine Shewhart Chart Limits

run       run description
N1        no unusual disturbances
N2, N_2   sinusoidal pulse (half period) in coolant feed temperature with an amplitude of ±3 K
N3, N_3   sinusoidal pulse in feed temperature with an amplitude of ±3 K
N4, N_4   sinusoidal pulse in downstream pressure in the reactor outlet line with an amplitude of ±1.5 psi
N5, N_5   rectangular pulse in upstream pressure in the coolant line with an amplitude of ±0.75 psi (a step up at 250 min and a step back down at 1000 min)
N6, N_6   feed flow rate ramps at a rate of ±0.0075 L/min over 400 min and remains there for 800 min
N7, N_7   first-order exponential change (τ = 100 min) in feed concentration to a new steady-state value of 1 ± 0.018 mol/L

Table 2. Standard Deviation Values for the Process Variables of the CSTR Example

variable   std. dev. (σ)a      variable   std. dev. (σ)a
CA         2.40 × 10^-3        CAF        2.34 × 10^-3
T          6.94 × 10^-1        TF         6.61 × 10^-1
TC         1.92 × 10^-1        TCF        6.61 × 10^-1
h          4.16 × 10^-2        hC         1.66 × 10^-1
Q          1.14                QC         6.96 × 10^-2
QC         2.24 × 10^-1        TC         1.14 × 10^-1
QF         6.61 × 10^-1        TCC        7.44 × 10^-2

a Definitions and units are given in the Nomenclature section.

Data compression was performed for each of the 28 snapshot datasets, as well as for the historical dataset. The time series for the reactor concentration of species A (CA) and the exiting coolant temperature (TC) are plotted in Figure 4 to show the effect of the wavelet and PI compression methods.

5.1. Generation of the Recording Limits. For the CSTR simulation study, Shewhart chart limits were used to calculate the recording limits. The chart limits were constructed using data that included small disturbances.35 The representative operating periods are described in Table 1. Except for run N1, each run includes two disturbances, one in the positive direction (e.g., N3) and one in the negative direction (e.g., N_3). Thus, a total of 13 representative operating periods were considered, with each period lasting 120 min. Although a recent paper by Pettersson and Gutman36 proposed implementation of the boxcar and backward-slope compression methods with time-varying recording limits, our analysis is limited to compression methods with constant recording limits.

The recording limits for each variable were specified by calculating the Shewhart chart limits. The standard deviation for the ith process variable, σi, was determined using the methodology described above. Then the recording limit for that variable was set equal to cσi, where c is a scaling factor called the recording limit constant. The value of c was specified differently for each compression method, as described later. The standard deviations for the 14 measured variables are presented in Table 2.

6. Results and Discussion

The data compression methods were evaluated in two ways: (1) the reconstruction error and (2) the degree of similarity between the original and reconstructed data. The reconstruction error analysis calculates the difference between the original and compressed time series, while the calculation of similarity factors compares the datasets from a pattern-matching perspective. The SPCA and Sdist similarity factors were used to quantify the similarity between the original and reconstructed data. The results with respect to the reconstruction error are presented first, and the results for pattern matching are presented in section 6.2.

6.1. Comparison of Different Methods with Respect to the Reconstruction Error.

Table 3. Data Compression and Reconstruction Results for the CSTR Example for a Constant CR

compression method        recording limit constant (c)   CR      MSE
boxcar                    2.2295                         14.84   5.23
averaging over 1.25 min   NA                             14.60   24.69
wavelet                   2.2669                         14.83   2.61
PI                        3.0                            14.83   0.33

The data compression methods described in section 2.1 were first compared on the basis of the reconstruction error for a constant compression ratio (CR). It is easier to compare different methods with respect to the reconstruction error if all compression methods have the same CR. The CR is defined as

CR \equiv \frac{\text{number of data points in the original dataset}}{\text{number of data points in the compressed dataset}}    (11)

The mean-squared error (MSE) of reconstruction is defined as

MSE \equiv \frac{1}{m \times n} \sum_{i=1}^{n} \sum_{j=1}^{m} \epsilon_{i,j}^2    (12)

where m is the number of measurements in the original dataset, n is the number of variables, and εi,j = xi,j - x̂i,j, where xi,j represents the jth measurement of the ith variable in the original data and x̂i,j is the corresponding reconstructed value.

Obtaining a constant CR for all methods requires adjustment of the recording limits individually for each method. As mentioned in the previous section, the recording limits for a given method and each process variable are proportional to their standard deviations in Table 2. For example, the OSI PI recording limits were chosen as 3σi, while the recording limits for the boxcar method were adjusted to produce the same CR as that of the PI method; thus, the recording limits for the boxcar method were 2.23σi. Data compression was performed using PI's proprietary algorithm. The CR for the 28 snapshot datasets was 14.8 using PI. The recording limits for all other methods were then adjusted using numerical root-finding techniques, such as the bisection method, to obtain an average CR of approximately 14.8 for each method.

The average results for the 28 operating conditions presented in Table 3 show that the PI algorithm provides the best reconstruction of the compressed data, while wavelet-based compression is second best. The common practice of averaging the data provides the worst reconstruction.
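Both quantities are simple to compute, and the bisection tuning of the recording limit constant c mentioned above can be sketched as below. The compressor interface and the toy example are hypothetical; the bisection assumes that the CR of eq 11 increases monotonically with c.

```python
import numpy as np

def reconstruction_mse(X, X_hat):
    """MSE of eq 12, averaged over all m measurements of n variables."""
    return np.mean((X - X_hat) ** 2)

def tune_recording_limit(compressed_size, x, target_cr, c_lo=0.1, c_hi=100.0):
    """Bisect on the recording limit constant c until the compression
    ratio of eq 11 reaches target_cr; compressed_size(x, c) must return
    the number of points kept."""
    for _ in range(60):
        c = 0.5 * (c_lo + c_hi)
        cr = x.size / compressed_size(x, c)          # eq 11
        if abs(cr - target_cr) < 0.05:
            break
        if cr < target_cr:
            c_lo = c   # recording limit too small: too many points kept
        else:
            c_hi = c
    return c

# usage with a toy "compressor" that keeps every round(c)-th point
x = np.random.randn(10_000)
toy = lambda x, c: x[:: max(1, int(round(c)))].size
print(tune_recording_limit(toy, x, target_cr=14.8))
```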

Ind. Eng. Chem. Res., Vol. 44, No. 9, 2005 3209 Table 4. SPCA Values between the Reconstructed and Original Datasets for the CSTR Example method boxcar SPCA average 0.88 N 0.80 F1 0.84 F2 0.87 F3 0.83 F4 0.83 F_4 0.92 F5 0.87 F_5 0.93 F6 0.95 F_6 0.81 F7 0.83 F_7 0.83 F8 0.84 F_8 0.86 F9 0.85 F_9 0.93 F10 0.89 F_10 0.94 F11 0.87 F_11 0.80 S1 0.82 S_1 0.96 F12 0.88 F13 0.89 O1 0.85 O2 0.90 O3 0.99 O4 >0.99

Sdist

averaging SPCA

0.67 0.92 0.69 0.96 0.42 0.99 0.87 0.83 0.89 0.95 0.76 0.93 0.78 0.88 0.70 0.81 0.84 0.92 0.92 0.94 0.71 0.98 0.63 0.91 0.79 0.98 0.84 0.93 0.34 0.99 0.73 0.99 0.29 0.89 0.39 0.93 0.30 0.94 0.70 0.92 0.89 0.83 0.88 0.90 0.84 0.97 0.78 0.95 0.01 0.87 0.20 0.81 0.75 0.79 0.85 0.99 0.99 >0.99

Sdist

wavelet SPCA

Sdist

PI SPCA

0.97 0.95 >0.99 0.88 0.98 0.99 >0.99 0.83 0.98 0.99 >0.99 0.84 0.99 0.99 >0.99 0.88 0.99 0.97 >0.99 0.83 0.99 0.96 >0.99 0.83 0.98 0.98 >0.99 0.86 0.98 0.98 >0.99 0.86 0.99 0.94 >0.99 0.97 0.99 0.96 >0.99 0.96 0.98 0.99 >0.99 0.83 0.98 0.99 >0.99 0.83 0.98 0.99 >0.99 0.83 0.98 0.98 >0.99 0.83 0.98 0.99 >0.99 0.83 0.98 0.99 >0.99 0.85 0.98 0.88 >0.99 0.84 0.99 0.88 >0.99 0.99 0.99 0.88 >0.99 0.97 0.99 0.88 >0.99 0.97 0.99 0.80 >0.99 0.84 0.99 0.99 >0.99 0.82 0.99 0.99 >0.99 0.98 0.99 0.98 >0.99 0.87 0.99 0.88 >0.99 0.82 0.99 0.86 >0.99 0.89 1.00 0.97 >0.99 0.90 1.00 >0.99 >0.99 0.99 0.99 >0.99 >0.99 >0.99

Sdist 0.71 0.83 0.77 0.91 0.85 0.78 0.84 0.70 0.89 0.96 0.85 0.60 0.85 0.76 0.81 0.61 0.30 0.23 0.17 0.40 0.87 0.90 0.84 0.62 0.05 0.54 0.92 0.98 0.96

6.2. Effect of Data Compression on Pattern Matching. An objective of this paper is to evaluate the effect of data compression on pattern matching. Data compression affects pattern matching because the original and reconstructed datasets are not the same. To evaluate the effect of different compression methods on the effectiveness of the proposed pattern-matching methodology, similarity factors between the original and reconstructed data were calculated to determine how similar the reconstructed dataset was to the original one. For scaling purposes, the original snapshot dataset was considered to be the reference dataset. The SPCA and Sdist values were calculated for every snapshot dataset in the following manner. An original snapshot dataset was selected from the 28 datasets, and the SPCA and Sdist values for the original snapshot dataset and each of the reconstructed snapshot datasets were calculated. This process was repeated for each of the 28 original snapshot datasets. The detailed results for each operating condition are presented in Table 4. The methods of Table 3, with a constant CR of approximately 14.8, were employed for data compression.

Although the averaging compression method performed the worst in terms of the reconstruction error (cf. Table 3), it produced compressed datasets with a high degree of similarity to the original ones, as indicated by high SPCA and Sdist values. The wavelet compression method results in low MSE values and high SPCA and Sdist values. These results demonstrate that wavelet-based compression is very accurate in terms of both the reconstruction error and the similarity of the reconstructed and original datasets. Although the PI algorithm produces very low MSE values, indicating that it is a very accurate method in terms of the reconstruction error, it does not result in high SPCA and Sdist values.

Almost all compression methods result in reasonably high values of SPCA, between 0.85 and 0.99. However, the Sdist values exhibit large variations for different operating conditions, indicating mean shifts as a result of data compression. For example, all methods except

the averaging and wavelet methods produced low Sdist values for normal operation. The low values occur for normal operation because there is not much variation in the measurements. This results in small eigenvalues for the covariance matrix and makes the Mahalanobis distance very sensitive to mean shifts.

The averaging method produces high SPCA and Sdist values for all operating conditions. This suggests that although it is not a very accurate method for data compression, it is still very effective in representing the data characteristics for pattern matching. On the other hand, even though the PI algorithm results in a very low MSE, it does not represent the data very well for pattern matching. The wavelet method produces both low MSE and high similarity factor values. The wavelet transform preserves the essential dynamic features of the signal in the detail coefficients while retaining the correlation structure between the variables in the approximation coefficients. These two features of the wavelet transform produce low MSE and high SPCA values between the original and reconstructed data. They also minimize mean shifts and result in high Sdist values. By contrast, the PI method records data very accurately and produces very low MSE values, but its variable sampling rates change the correlation structure between variables and produce low SPCA values. Variable sampling also affects the mean value of the reconstructed data and produces low Sdist values.

The effect of compression on pattern matching using the PI software was investigated further by comparing the reconstructed and original datasets using similarity factors. Similarity factors were calculated for all possible pairs of the snapshot datasets. The results are shown in Figures 5 and 6. The original datasets were considered to be the reference datasets for the purpose of scaling. The numerical values represented as gray scale in a row are the similarity factors between the original snapshot and reconstructed datasets obtained from data compression using PI. The diagonal entries and the entries that are larger than the diagonal value in the same row are crossed.

The SPCA values for the original snapshot and reconstructed datasets are presented in Figure 5. This figure shows the effect of data compression on the correlation between the process variables. The diagonal values and the off-diagonal values in a row that are greater than the diagonal value are crossed in Figure 5b. These off-diagonal values indicate that data compression (using PI) changes the correlation between variables such that a snapshot is now more similar to the reconstructed datasets of other operating conditions than it is to its own reconstructed dataset. Figure 5b indicates that normal operation and operating conditions F1, F4, F_4, F6-F9, S_1, F13, and O2 are less similar to their reconstructed datasets than to other reconstructed datasets. For example, normal operation is more similar to the reconstructed datasets corresponding to operating conditions F1, F6, F_11, and F12, and F8 is more similar to the reconstructed datasets corresponding to operating conditions F2, F_4, F5, F_5, and S_1. These results suggest that mismatches between these operating conditions would be very high if the historical data were compressed and the snapshot data were not compressed. Thus, it would be very difficult to locate these operating conditions in the historical database if the data were compressed and


Figure 5. SPCA self-similarity matrices for original and compressed datasets.

recorded using PI. In Figure 5a, the diagonal SPCA values are exactly equal to 1 because the datasets being compared were not compressed, while in Figure 5b, data compression results in diagonal values that are less than 1. These results suggest that compression has a detrimental effect on pattern matching. Figure 6 shows the Sdist values for the original snapshot and reconstructed datasets. Note that the diagonal values for normal operation and operating conditions F10, F_10, F11, and F_11 are very small. In particular, the Sdist value for normal operation is only 0.05. The reason for the low Sdist values is that there is very little variation in the values of the process variables for these operating conditions. Consequently, the eigenvalues of the covariance matrix are small and, thus, the Mahalanobis distance becomes very sensitive to even small mean shifts. The normal operation is affected most strongly by mean shifts because the process variables have very small variations, and the data compression algorithm records only those values that have large deviations from the original mean. When the datasets being compared are not compressed, the diagonal Sdist values would be equal to 1, as indicated by Figure 6a.

Figure 6. Sdist self-similarity matrices for original and compressed datasets.

Figure 6b indicates that data compression using PI causes significant mean shifts for normal operation and operating conditions F11, F12, F13, O1, O2, and O4 because the off-diagonal Sdist values are larger than the diagonal values. For example, significant mean shifts can be observed for normal operation, which has larger Sdist values for the reconstructed datasets corresponding to operating conditions F12, F13, and O1-O4 than for the reconstructed normal operation.

6.2.1. Finding Similar Patterns in Compressed Historical Data. We now evaluate the effect of data compression on locating similar patterns in the historical data. Two situations are considered: (i) both the snapshot and historical data are compressed, and (ii) only the historical data are compressed. When the snapshot data were compressed, the same compression method was used for both the snapshot and historical data. The historical data for the CSTR example described in section 5 were compressed using three different methods: boxcar, averaging, and wavelet. The performance of the proposed pattern-matching technique for compressed historical data was then evaluated.

Table 5. Effect of Data Compression on Pattern Matching When Both Snapshot and Historical Data Are Compressed Using the Same Method

compression method            similarity factor   opt. NP   p (%)   η (%)   ηmax (%)   ξ (%)
original data (uncompressed)  SPCA                34        43      90      99         66
                              Sdist               25        41      68      97         54
boxcar                        SPCA                56        24      84      100        53
                              Sdist               60        19      73      100        46
averaging                     SPCA                21        49      65      95         57
                              Sdist               24        40      65      96         53
wavelet                       SPCA                34        38      82      99         60
                              Sdist               52        25      83      100        54

Table 6. Effect of Data Compression on Pattern Matching When Snapshot Data Are Not Compressed and Historical Data Are Compressed compression method

similarity factor

opt. NP

p (%)

η (%)

ηmax (%)

ξ (%)

original data (uncompressed) boxcar

SPCA Sdist SPCA Sdist SPCA Sdist SPCA Sdist

34 25 51 60 60 16 39 15

43 41 26 19 23 52 31 50

90 68 80 75 85 57 75 53

99 97 100 100 100 92 99 91

66 54 53 47 54 54 53 52

averaging wavelet

As described by Singhal19 and Singhal and Seborg,17 a data window that was the same size as the snapshot data (S) was moved through the historical database, 100 observations at a time (i.e., w = 100). The ith moving window was denoted as Hi. The snapshot data were scaled to zero mean and unit variance, and the historical data were scaled using the scaling factors for the snapshot data. Similarity factors were then calculated for each instance of the window moving through the historical data. After the entire historical dataset was analyzed for one set of snapshot data, the analysis was repeated for a new snapshot dataset. A total of 28 different snapshot datasets, one for each of the 28 operating conditions, were used for pattern matching.

Table 5 compares the pattern-matching results for different data compression methods. The best pattern-matching results were obtained when the data were compressed using the wavelet method. In Table 5, the optimum NP was determined by calculating ξ for different NP values and then choosing the value of NP for which ξ was the largest. Table 5 indicates that pattern matching is adversely affected by data compression when the data are compressed using either the averaging or the boxcar compression method. By contrast, wavelet-based compression has very little effect on pattern matching because similar results are obtained for both compressed and uncompressed data.

Table 6 presents the results for the situation when the snapshot data are not compressed and only the historical data are compressed. The p, η, and ξ values in Table 6 are somewhat lower compared to those in Table 5, particularly for wavelet compression. Thus, if the historical data are compressed, it may be beneficial to compress the snapshot data as well to obtain better pattern matching. In this situation, wavelet compression provides the best pattern matching. The ξ values for the averaging method are higher than those of the wavelet method because of large NP and η values. In general, data compression had a detrimental effect on pattern matching, but the SPCA and Sdist methods performed very well.
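A sketch of the moving-window search described above, reusing the s_pca function sketched in section 4.1 (the interface is hypothetical, and the published methodology additionally applies the Sdist test and the m-observation spacing rule when forming the candidate pool):

```python
import numpy as np

def moving_window_search(snapshot, history, step=100, top_n=10):
    """Slide a window the size of the snapshot through the historical
    data in steps of w = 100 observations, score each window with
    S_PCA, and return the best-scoring window start indices.

    As in the text, both datasets are scaled with the snapshot's
    mean and standard deviation.
    """
    m = snapshot.shape[0]
    mu, sd = snapshot.mean(axis=0), snapshot.std(axis=0)
    snap = (snapshot - mu) / sd
    scores = []
    for start in range(0, history.shape[0] - m + 1, step):
        window = (history[start:start + m] - mu) / sd
        scores.append((s_pca(snap, window), start))
    scores.sort(reverse=True)      # most similar windows first
    return scores[:top_n]          # candidate pool of size top_n
```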

7. Conclusions

A comparison of some popular data compression methods has shown that classical methods such as boxcar and data-averaging compression do not accurately represent data in terms of the reconstruction error. Data compressed using the PI software represent the original data very accurately but result in low similarity factor values for pattern-matching applications. Compression using the wavelet method produces reconstruction errors that are higher than those obtained with PI but much lower than those of conventional compression methods such as boxcar. Data compressed using wavelets also show a high degree of similarity with the original data. This feature tends to minimize the misclassifications that result from data compression. Thus, if the objective is to accurately represent data with minimum reconstruction errors, then the PI software provides the best compression. However, if pattern matching using similarity factors is the primary concern, then wavelet compression provides better results. When the historical data are compressed, it is beneficial to compress the snapshot data as well for pattern-matching applications.

Acknowledgment

The authors thank OSI Software, Inc., for providing financial support and the data archiving software PI and Gregg LeBlanc at OSI for providing software support during the research. Financial support from ChevronTexaco Research and Technology Co. is also acknowledged.

Nomenclature

CA = concentration of species A in the reactor (mol/L)
CAF = concentration of species A in the reactor feed stream (mol/L)
h = liquid level in the reactor (dm)
hC = level controller signal (mA)
Q = flow rate of the reactor outlet stream (L/min)
QC = flow rate controller signal (mA)
QC = coolant flow rate (L/min)
QF = feed flow rate of the reactor feed stream (L/min)
T = temperature in the reactor (K)
TC = reactor temperature controller signal (mA)
TC = temperature of the coolant in the cooling jacket (K)
TCF = temperature of the coolant feed (K)
TF = temperature of the reactor feed stream (K)

Literature Cited

(1) Hale, J. C.; Sellars, H. L. Historical Data Recording For Process Computers. Chem. Eng. Prog. 1981, 77 (11), 38-43.
(2) DeHeer, L. E. Plant-Scale Process Monitoring and Control Systems: Eighteen Years and Counting. In Computer-Aided Process Operations; CACHE/Elsevier: New York, 1987.
(3) Kennedy, J. P. Building an Industrial Desktop. Chem. Eng. 1996, 103 (1), 82-86.
(4) Ramaswamy, G. N.; Gopalakrishnan, P. S. Compression of Acoustic Features for Speech Recognition in Network Environments. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98, Seattle, WA, 1998; pp 977-980.
(5) Kjelso, M.; Gooch, M.; Jones, S. Performance Evaluation of Computer Architectures With Main Memory Data Compression. J. Syst. Archit. 1999, 45, 571-590.
(6) Bajaj, C.; Ihm, I.; Park, S. 3D RGB Image Compression for Interactive Applications. ACM Trans. Graphics 2001, 20 (1), 10-38.


(7) Vedam, H.; Venkatasubramanian, V.; Bhalodia, M. A B-Spline Based Method for Data Compression, Process Monitoring and Diagnosis. Comput. Chem. Eng. 1998, 22, S827-S830.
(8) Thornhill, N. F.; Choudhury, M. A. A. S.; Shah, S. L. The Impact of Compression on Data-Driven Process Analyses. J. Process Control 2004, 14, 389-398.
(9) Bascur, O. A.; Kennedy, J. P. Real Time Business and Process Analysis to Increase Productivity in the Process Industries. Proceedings of the International Conference and Exposition for Advanced Measurement, Control and Automation Techniques, Products and Services, ISA TECH 1999, Philadelphia, PA, 1999; pp 233-244.
(10) Bristol, E. H. Swinging Door Trending: Adaptive Trend Recording? Advances in Instrumentation and Control; Instrument Society of America: Research Triangle Park, NC, 1990; Vol. 45, pp 749-754.
(11) Mah, R. S. H.; Tamhane, A. C.; Tung, S. H.; Patel, A. N. Process Trending with Piecewise Linear Smoothing. Comput. Chem. Eng. 1995, 19, 129-137.
(12) Bakshi, B. R.; Stephanopoulos, G. Representation of Process Trends. III. Multiscale Extraction of Trends from Process Data. Comput. Chem. Eng. 1994, 18, 267-302.
(13) Bakshi, B. R.; Stephanopoulos, G. Representation of Process Trends. IV. Induction of Real-Time Patterns from Operating Data for Diagnosis and Supervisory Control. Comput. Chem. Eng. 1994, 18, 303-332.
(14) Bakshi, B. R.; Stephanopoulos, G. Compression of Chemical Process Data through Functional Approximation and Feature Extraction. AIChE J. 1996, 42, 477-492.
(15) Misra, M.; Qin, S. J.; Kumar, S.; Seemann, D. On-line Data Compression and Error Analysis Using Wavelet Technology. AIChE J. 2000, 46, 119-132.
(16) Misra, M.; Kumar, S.; Qin, S. J.; Seemann, D. Error Based Criterion for On-Line Wavelet Data Compression. J. Process Control 2001, 11, 717-731.
(17) Singhal, A.; Seborg, D. E. Pattern Matching in Multivariate Time Series Databases Using a Moving Window Approach. Ind. Eng. Chem. Res. 2002, 41, 3822-3838.
(18) Singhal, A.; Seborg, D. E. Pattern Matching in Historical Batch Data Using PCA. IEEE Control Syst. Mag. 2002, 22 (5), 53-63.
(19) Singhal, A. Pattern Matching in Multivariate Time-Series Data. Ph.D. Dissertation, University of California, Santa Barbara, CA, 2002.
(20) Singhal, A.; Seborg, D. E. Data Compression Issues with Pattern Matching in Historical Data. Proceedings of the 2003 American Control Conference, Denver, CO, 2003; pp 3696-3701.

(21) OSI Software, Inc. BatchFile NT, VAX and UNIX Interface to the PI System; OSI Software, Inc.: San Leandro, CA, 2000.
(22) Daubechies, I. Ten Lectures on Wavelets; SIAM: Philadelphia, PA, 1992.
(23) Mallat, S. G. A Wavelet Tour of Signal Processing; Academic Press: San Diego, 1998.
(24) Motard, R. L., Joseph, B., Eds. Wavelet Applications in Chemical Engineering; Kluwer Academic Publishers: Boston, MA, 1994.
(25) Rao, R. M.; Bopardikar, A. S. Wavelet Transforms: Introduction to Theory and Applications; Addison-Wesley: Reading, MA, 1998.
(26) Watson, M. J.; Liakopoulos, A.; Brzakovic, D.; Georgakis, C. A Practical Assessment of Process Data Compression Techniques. Ind. Eng. Chem. Res. 1998, 37, 267-274.
(27) Nason, G. P. Wavelet Shrinkage Using Cross-Validation. J. R. Stat. Soc. 1996, 58, 563.
(28) Bakshi, B. R. Multiscale PCA with Application to Multivariate Statistical Process Monitoring. AIChE J. 1998, 44, 1596-1610.
(29) Mathworks, Inc. Matlab Wavelet Toolbox: User's Guide; The Mathworks, Inc.: Natick, MA, 2001.
(30) Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemom. Intell. Lab. Syst. 1996, 19, 45-65.
(31) Roweis, S. EM Algorithms for PCA and SPCA. Neural Information Processing Systems 11 (NIPS'98); 1997; pp 626-632.
(32) Johannesmeyer, M. C.; Singhal, A.; Seborg, D. E. Pattern Matching in Historical Data. AIChE J. 2002, 48, 2022-2038.
(33) Krzanowski, W. J. Between-Groups Comparison of Principal Components. J. Am. Stat. Assoc. 1979, 74 (367), 703-707.
(34) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press: London, 1979.
(35) Johannesmeyer, M. C. Abnormal Situation Analysis Using Pattern Recognition Techniques and Historical Data. M.Sc. Thesis, University of California, Santa Barbara, CA, 1999.
(36) Pettersson, J.; Gutman, P. Automatic Tuning of the Window Size in the Box-Car Back Slope Data Compression Algorithm. J. Process Control 2004, 14, 431-439.

Received for review April 12, 2004
Revised manuscript received February 23, 2005
Accepted February 24, 2005

IE049707A