Cite This: Ind. Eng. Chem. Res. 2019, 58, 9564−9575
pubs.acs.org/IECR

A Deep Learning Approach for Process Data Visualization Using t-Distributed Stochastic Neighbor Embedding

Wenbo Zhu,† Zachary T. Webb,† Kaitian Mao,‡ and José Romagnoli*,†

†Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, United States
‡R&D Department, Shanghai SupeZET Engineering Technology Corp., Ltd, Shanghai 200335, China

Received: February 19, 2019. Revised: April 21, 2019. Accepted: May 16, 2019. Published: May 16, 2019.

ABSTRACT: A generic process visualization method is introduced, which visualizes real-time process information and correlations among variables on a 2D map using parametric t-SNE. As an unsupervised learning method, it learns the mapping by minimizing the Kullback−Leibler divergence between the original high-dimensional space and the latent space using a deep neural network. In practice, it is observed that the original parametric t-SNE method lacks generalization and struggles to visualize unseen operating conditions correctly. In this work, two steps to improve its generalization capacity are proposed. In the first step, the neural network is trained with additional dummy data, which are generated to mimic possible unseen conditions. In the second step, the structure of the neural network is reformulated with a new activation function designed to improve generalization for process data. The capability of the proposed approach was tested on two case studies, the Tennessee Eastman Process (TEP) and an industrial pyrolysis reactor. The results indicate that the proposed approach outperforms conventional methods in visualization as well as generalization capacity for unseen process conditions.



INTRODUCTION

In current chemical industries, thousands of measurements are transmitted each second from sensors across the plant to operating personnel via the distributed control system (DCS). With the assistance of the DCS, sensor measurements can be visualized, and the status of plant operations can be monitored. In the current DCS implementation, the measurement visualization strategy is based on individual variables, such that important measurements are either expressed in a line chart against time or labeled on the process diagram. This implementation is straightforward and clear, since it is able to provide a historical profile of each variable while also presenting an overview of plant operations. On the other hand, this implementation also has an obvious drawback: it is difficult to directly observe correlations among multiple variables. The significance of multivariate correlation is illustrated in Figure 1. Two artificial signals are generated to represent two process variables, which are plotted in a style similar to the DCS. From the line charts of each variable in Figure 1a, it could be intuitively inferred that the system is operating in a single steady state with some noise. However, when they are plotted against each other in Figure 1b to analyze their correlation, two separate clusters can be observed. These clusters indicate that the given system is moving between two different operating conditions instead of remaining in a single steady state. This demonstration reveals a weakness of the current data visualization strategy by showing its failure to provide in-depth information about the correlation of sensor measurements. Moreover, even the method in Figure 1b is only valid

for up to two variables. Visualization of correlations among multiple variables supplements the individual measurements and provides insight into the process operation.

Figure 1. A demonstration of the data visualization problem in the current data visualization system. The line chart visualizing each variable separately can lose multivariate correlation information.

In past decades, many efforts have been made to visualize complicated process information from high-dimensional measurements. Principal component analysis (PCA) is one of the most famous dimensionality reduction (DR) methods. Since its invention in 1901,1 PCA has been applied in many areas to analyze high-dimensional data sets. Despite its broad applications, PCA still suffers the common weakness of being a linear model. The self-organizing map (SOM)2,3 is another approach that is popular for process data visualization. Trained with known labeled data, the SOM is able to visualize recognized operating conditions via the u-matrix. Because the SOM is built inside a rigid box, classifying incoming samples correctly can be difficult, as they must be assigned to an area inside the box. Unseen conditions and even unknown faults are often not obvious to the viewer upon inspection. By comparison, t-distributed stochastic neighbor embedding (t-SNE)4 is a more effective tool to visualize data. It reduces the dimensionality of a data set by minimizing the Kullback−Leibler divergence between the high-dimensional space and the latent space. Although the t-SNE method is an excellent technique for data visualization in a low-dimensional space, it does require a huge computational cost.


Besides, t-SNE is neither parametric nor scalable, which means mapping new incoming data points requires running the optimization for the entire data set again. To combat this problem, a deep network is chosen to learn the parametric projection from the original space to the latent space, so that new data vectors can be mapped more efficiently; this is named parametric t-SNE.5 This approach has been applied to fault diagnosis for industrial processes.6 Nevertheless, in our practice applying parametric t-SNE5,6 to process visualization, we noticed that it was unable to achieve the expected performance, particularly regarding its generalization capacity. Although it can learn the probability distribution from the training data and provide a decent 2D projection of the given data set, it is unable to correctly visualize unseen operating conditions or faults, which refer to data samples or conditions that are not given to the model during training. This scenario commonly exists in chemical industries. For example, many processes experience catalyst deactivation, coking, and equipment aging, which cause normal drifting away from the previously defined normal operating regions.7 An appropriate visualization method should be able to distinguish different operating conditions clearly. Therefore, we propose two strategies to improve the generalization capacity of the original parametric t-SNE. The first strategy is to expand the training set. If all operation modes and all possible faults could be covered in the training set, then the model would be able to map any upcoming data samples onto the known regions regardless of the poor generalization capacity of the model itself. Although it is infeasible in practice to obtain all necessary data, following such a strategy, we can adopt the idea of creating dummy data that mimic the possible variations of each variable. On the other hand, the trade-off of this approach is a higher computational cost due to the computational complexity of the probability matrix calculation. Training with a large amount of dummy data could also potentially change the probability distribution learned from the original data set. Hence, to apply parametric t-SNE to a large data set more effectively, we propose a second strategy that improves the internal structure of the neural network in order to achieve a better generalization capacity. In this approach, the internal structure is reformulated, incorporating a newly designed activation function and the recently proposed batch normalization method.8 The proposed parametric t-SNE was then applied to the Tennessee Eastman Process and a pyrolysis process to verify its effectiveness.

The rest of the paper is organized as follows. First, we provide a review of the original idea of t-SNE and its parametric form. The proposed improvement is discussed after that. Then, the case studies of this work, the Tennessee Eastman Process (TEP) and a pyrolysis reactor, are introduced. In the end, the results of the proposed visualization method are summarized.



RELATED METHODS

In this section, the background and related methods are introduced. We start with the original t-SNE method, given that parametric t-SNE is an alternative form of the original method. Next, the parametric form of t-SNE is introduced, which utilizes a deep neural network to learn the mapping.

t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a nonlinear dimensionality reduction technique developed from stochastic neighbor embedding (SNE).9 It uses a Student t-distribution with a heavy-tailed probability distribution to solve the crowding problem found in the original SNE method.4 Denote the probability distribution in the original space as p_ij:

p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma_i^2)}   (1)

and the distribution in the latent space as q_ij:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}   (2)

The cost function that minimizes the Kullback−Leibler divergence between the high-dimensional space and the latent space is given as

C = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}   (3)

Using the gradient descent method to optimize the cost function, the distribution in the original space can be expressed on the low-dimensional map. The gradient is given by

\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}   (4)
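As a concrete illustration of eqs 1−3, the sketch below evaluates the affinity matrices and the KL cost for small data sets. It is a minimal sketch under our own simplifications: a single global sigma is assumed, whereas full t-SNE tunes a per-point sigma_i through a perplexity search.

```python
import numpy as np

def tsne_cost(X, Y, sigma=1.0):
    """Minimal sketch of eqs 1-3: KL(P||Q) between the Gaussian
    affinities P of the original space and the Student t affinities Q
    of the latent space. A fixed global sigma is an assumption here."""
    # Squared Euclidean distance matrices in both spaces
    dX = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    dY = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

    # Eq 1: Gaussian affinities in the original space
    P = np.exp(-dX / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # self-affinities are set to zero
    P /= P.sum()

    # Eq 2: heavy-tailed Student t affinities in the latent space
    Q = 1.0 / (1.0 + dY)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    # Eq 3: KL(P || Q), guarded against log(0)
    eps = 1e-12
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```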


Parametric t-SNE. Parametric t-SNE was proposed in order to avoid the optimization burden of the t-SNE method when it is applied to the same data set more than once. Taking advantage of the learning ability of deep networks, a feed-forward neural network is used to learn the parametric mapping from the high-dimensional space to the latent space. There are two stages in the training procedure proposed by van der Maaten, namely pretraining with the restricted Boltzmann machine (RBM)10,11 and fine-tuning using the cost function from t-SNE.5

RBM. Training a deep network with multiple hidden layers is challenging, because a typical deep network can contain millions of parameters. Activation functions used in nonlinear transformations (e.g., the sigmoid function) can cause gradient vanishing and exploding issues for networks with many layers. This makes training extremely difficult for the parameters in earlier layers. Until the mid-2000s, these networks were nearly impossible to train effectively, but the development of the RBM and its application to layer-wise greedy training of deep networks10,12 revolutionized the parameter initialization of deep networks. The RBM is a generative artificial neural network that learns the probability distribution of a given data set using an energy function model:

p(x) = \frac{e^{-E(x)}}{Z}   (5)

where Z is called the partition function: Z = \sum_x e^{-E(x)}. The standard type of RBM has one hidden layer and one visible layer. The structure of the RBM is illustrated in Figure 2.

Figure 2. Illustration of the RBM.

The probability of a data vector in the hidden layer can be written as

P(x) = \sum_h P(x, h) = \sum_h \frac{e^{-E(x, h)}}{Z}   (6)

In the RBM, the binary energy function is defined as

E(v, h) = -b'v - c'h - h'Wv   (7)

where b and c are the biases of the visible and hidden layers to be learned through training. In practice, the RBM can be extended to a Gaussian distribution for real-valued data:

E(v, h) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - c'h - h'Wv   (8)

To train the RBM, the objective is to maximize P(v), the log-likelihood function at a single data point v, calculated as

\frac{\partial \log P(v)}{\partial \theta} = \sum_h P(h|v) \frac{\partial [-E(v, h)]}{\partial \theta} - \sum_{\tilde{v}} \sum_h P(\tilde{v}, h) \frac{\partial [-E(\tilde{v}, h)]}{\partial \theta}   (9)

Once the RBM for the first layer of the network is trained, its output serves as the input for the next layer's RBM. This is repeated for each layer in the network until they are all trained. This configuration, where the RBMs are stacked together in a chain, serves as the network structure, with the RBM training providing better approximations of the desired parameter initialization than a random one.

Fine Tuning. After pretraining, the model parameters are initialized, and the network is fine-tuned with the t-SNE cost function (eq 3) using gradient-based backpropagation. In the feed-forward network, the term q_ij is modified as follows:

q_{ij} = \frac{(1 + \|f(x_i|W) - f(x_j|W)\|^2/\alpha)^{-(\alpha+1)/2}}{\sum_{k \neq l} (1 + \|f(x_k|W) - f(x_l|W)\|^2/\alpha)^{-(\alpha+1)/2}}   (10)

The training procedure is summarized in Figure 3. By fine-tuning with the Kullback−Leibler divergence as the cost function, the local structure of the data is preserved in the latent space.

Figure 3. Illustration of the fine-tuning procedure.
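To make the fine-tuning objective concrete, the sketch below evaluates KL(P||Q) for one minibatch with q_ij computed from the network outputs as in eq 10. It is a minimal sketch under our own simplifications (P for the batch is assumed precomputed from eq 1, and α is a fixed scalar), not the authors' implementation.

```python
import tensorflow as tf

def tsne_kl_loss(P_batch, Y, alpha=1.0):
    """Fine-tuning cost (eq 3) with the latent affinities of eq 10.

    P_batch: (n, n) affinity matrix of the minibatch, precomputed from
             eq 1 (normalized, zero diagonal). Batching P this way is a
             simplification of van der Maaten's procedure.
    Y:       (n, 2) network outputs f(x|W) for the same minibatch.
    """
    # Pairwise squared distances between latent points
    d = tf.reduce_sum(tf.square(Y[:, None, :] - Y[None, :, :]), axis=-1)
    # Student t kernel with alpha degrees of freedom (eq 10 numerator)
    Q = tf.pow(1.0 + d / alpha, -(alpha + 1.0) / 2.0)
    Q = Q - tf.linalg.diag(tf.linalg.diag_part(Q))  # q_ii = 0
    Q = Q / tf.reduce_sum(Q)                        # eq 10 denominator
    eps = 1e-12
    return tf.reduce_sum(P_batch * tf.math.log((P_batch + eps) / (Q + eps)))
```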

PROPOSED METHOD

Before going through the details of the proposed methodology, we would like to first present the problem of the original parametric t-SNE. A visualization example is given in Figure 4. Following the training procedure of the original parametric t-SNE, the normal operation and a step change fault (Fault 2) from the TEP13 are visualized by a neural network with one hidden layer. Two scenarios are demonstrated in Figure 4. Figure 4a presents the scenario of known pattern visualization, for which both normal and faulty data are given in training, while Figure 4b presents the scenario of unknown pattern visualization, for which only normal data are given. For known patterns, the parametric t-SNE creates a useful and aesthetic map, but when faced with unseen patterns, the visualization fails to distinguish the fault and crowds it into the normal region. Similar results can be observed for other TEP faults in the Supporting Information.

Figure 4. Demonstration of two scenarios of parametric t-SNE visualization. A 31−64−2 neural network was chosen to visualize the TEP data set, in which 31 input variables are mapped onto a 2D plane. The result was obtained by training with different data sets: (a) training the network with data from both the normal region and the step fault; (b) training the network with only data from the normal region.

Therefore, to improve the robustness of the parametric t-SNE method, two strategies are introduced together into the original method to achieve a better generalization capacity.

Combinatorial Variation Creation. From the visualization comparison in Figure 4, it is observed that the original parametric t-SNE has poor generalization performance when facing unseen conditions. In the deep learning literature, data augmentation is a common method to improve the robustness and generalization of deep learning models. In the U-Net approach,14 data augmentation successfully helped the DL model accomplish a biomedical image segmentation task with an extremely small data set (only seven given images). Vincent's denoising autoencoder approach15 likewise shows that introducing additional variation into the training data is a practical way to improve network robustness. Inspired by these methods, we propose an approach that creates dummy training data in order to mimic possible unseen operation conditions. The procedure for dummy data creation is illustrated in Figure 5.

Figure 5. Illustration of the proposed combinatorial variation creation method.

From an original training set, a number of samples is drawn first (normally 1% to 10% of the total data, depending on the data size) and then duplicated into two groups. For one group, a subset of variables is selected with a random dimension d, and a variation of σ times the standard deviation is introduced into each selected variable. The same method is applied to the remaining unselected variables in the other duplicated group. It should be noted that the variation added to each target variable can be either positive or negative. The above procedure is repeated n times to collect enough dummy data for training. The choice of n is usually not critical. Typically, the value of n depends on the number of variables in the system, where at least one variation should be created for each variable. Although more variations provide a larger exploration space, longer computational time and higher memory occupancy should also be taken into consideration. In addition, a trial-and-error method is suggested to fine-tune the value of n. A minimal sketch of this procedure is given below.
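The sketch assumes a NumPy array of training data; the function name and defaults (frac, sigma) are our illustrative choices, with sigma following the values quoted later in the paper (15 for the TEP, 10 for the pyrolysis process).

```python
import numpy as np

def create_dummy_data(X, n=50, frac=0.05, sigma=15.0, rng=None):
    """Sketch of the combinatorial variation creation procedure described
    above: sample a small fraction of X, duplicate it into two groups,
    split the variables at a random dimension d, and add +/- sigma times
    the per-variable standard deviation to each group's variables."""
    rng = np.random.default_rng(rng)
    n_samples, n_vars = X.shape
    std = X.std(axis=0)
    dummies = []
    for _ in range(n):
        # Sample a small subset of the training data
        idx = rng.choice(n_samples, size=max(1, int(frac * n_samples)),
                         replace=False)
        base = X[idx]
        # Split the variables into two complementary groups of random size d
        d = rng.integers(1, n_vars)
        perm = rng.permutation(n_vars)
        for group in (perm[:d], perm[d:]):
            dup = base.copy()
            # The variation can be either positive or negative
            signs = rng.choice([-1.0, 1.0], size=group.size)
            dup[:, group] += signs * sigma * std[group]
            dummies.append(dup)
    return np.vstack(dummies)
```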

Network Reformulation. Besides training the parametric t-SNE with dummy data, a second approach is proposed to enhance the generalization capacity of parametric t-SNE. As described in previous sections, the parametric t-SNE is based on a feed-forward neural network, where each layer can be expressed as

z = x \cdot w + b   (11)

y = f(z)   (12)

where x is the input to the layer, w is the weight matrix, b is the bias, y is the layer output, and f(z) is the sigmoid activation function. When a step change is introduced to the model of Figure 4b, the input disturbance seems to be suppressed in either eq 11 or eq 12. In other words, the problem could be caused either by the linear transformation or by the setting of the activation function. Hence, the hidden layer responses of both equations are inspected. In Figure 6a, it can be clearly noticed that the step disturbance from Fault 2 causes a very strong response through the linear transformation in eq 11, while the subsequent sigmoid activation function regularizes the response, causing the mutation of the disturbance in Figure 6b.

Figure 6. Output responses: (a) eq 11 and (b) eq 12. The responses are generated by the model trained with only normal data. The y-axis corresponds to the data samples, in which the first 480 samples are from normal operation and the following 480 samples are from Fault 2. The x-axis represents the responses of the 64 hidden neurons.

Therefore, improving the generalization of the parametric t-SNE for process visualization requires redesigning the activation function. On the basis of our knowledge of process data and deep learning, we define three properties of an ideal activation function to visualize process dynamics properly.

a. Unbounded Function. As the sample shown in Figure 6, the sigmoid function in the neural network regulates the disturbance from the step changes, in which positive signals are regulated up to 1 and negative signals are regulated to 0. Thus, the step changes are suppressed in the final output. To avoid this problem, the activation function should be unbounded for both positive and negative signals, by which different kinds of process changes can be captured. Such an approach is quite different from many deep learning models that intentionally regulate negative signals, for example, the ReLU and sigmoid functions.

b. Nonlinearity. The activation function should be nonlinear. Since many of the correlations in chemical processes are nonlinear, nonlinear activation can offer a better visualization capacity to the neural network. Moreover, nonlinear activation is the prerequisite for stacking multiple neural layers, which enables deep networks to learn complicated features.

c. Stable Gradient. In training the neural network, the tuning of parameters is based on the back-propagation algorithm using the error gradient. Thus, the gradient provided by the activation function should be stable to avoid gradient vanishing or exploding problems, by which one-stage training (without pretraining) becomes possible.

The first two points are designed for a better visualization effect in the feed-forward direction, while the third point assists the training of the neural network in the feed-back direction. To satisfy the above characteristics, a novel activation function is defined as follows:

g(x) = x + \tanh(\lambda x)   (13)

It is simply a combination of a linear and a nonlinear function, in which λ is a positive parameter learned in the training stage to adjust the nonlinearity corresponding to the specific process. The derivative of the function, which is between 1 and 1 + λ, is also stable. In addition, to further enhance the generalization of the visualization network, batch normalization8 is also introduced into the network architecture. Batch normalization is widely used in the deep learning community to normalize the input of a layer by adjusting and scaling the activations. It has been reported that batch normalization can accelerate the training of deep networks,8 making the optimization more stable and smooth.16 A recent study16 also indicates that batch normalization helps the optimization land in flatter minima, which is reported to achieve better generalization performance. Adoption of batch normalization is a promising development that helps the convergence of the neural network in minimizing the KL divergence. Batch normalization can be expressed as follows:

BN(x) = \frac{\gamma(x - \mu)}{\sigma} + \beta   (14)


where μ is the training batch mean, σ is the training batch standard deviation, γ is the scale, and β is the offset. The building layers are reformulated as follows:

z = x \cdot w + b   (15)

z_{bn} = \frac{\gamma(z - \mu)}{\sigma} + \beta   (16)

y = g(z_{bn})   (17)
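As an illustration of eqs 15−17, the sketch below wires one building block as linear transform, batch normalization, and the proposed activation g(x) = x + tanh(λx), with λ kept positive through a softplus reparameterization. The class and variable names are our own, assuming the Keras layer API; this is not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ReformulatedLayer(layers.Layer):
    """Sketch of one reformulated building block (eqs 15-17)."""

    def __init__(self, units):
        super().__init__()
        self.dense = layers.Dense(units)        # eq 15: z = x.w + b
        self.bn = layers.BatchNormalization()   # eq 16
        # lambda is a positive parameter learned during training (eq 13);
        # softplus is one assumed way of keeping it positive.
        self.lam_raw = self.add_weight(name="lam_raw", shape=(),
                                       initializer="zeros", trainable=True)

    def call(self, x, training=False):
        z = self.dense(x)
        z_bn = self.bn(z, training=training)
        lam = tf.math.softplus(self.lam_raw)    # lambda > 0
        return z_bn + tf.tanh(lam * z_bn)       # eq 17 with g from eq 13
```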

CASE STUDY

Two case studies are discussed in this work. The first is the Tennessee Eastman Process (TEP), which is widely used as a benchmark data set for validating algorithms in process control and monitoring. The second application is an industrial pyrolysis reactor, which contains multiple steady-state operation modes and process drifting due to coking.

Tennessee Eastman Process. TEP is a realistic simulation originally developed by Downs and Vogel.13 It has been widely used as a benchmark in the evaluation of process monitoring and control studies. Liquid products (G and H) with an unwanted byproduct (F) are generated from four gaseous reactants (A, C, D, and E) with an inert (B). Four competing reactions follow the Arrhenius law. A total of 53 measurements (41 measured variables and 12 manipulated variables) come from five unit operations: a reactor, condenser, separator, compressor, and stripper. The process simulation contains 20 process faults. The simulation data was generated by the code from Ricker,17 by which all process variables are available. Figure 7 illustrates the process schematic.

Figure 7. Tennessee Eastman Process scheme.

In this work, data from 24 h of normal operation are used to train the model. The model is then tested on 48 h of operation, which contain 24 h of different kinds of faults. In the data set, 22 process measurements (XMEAS(1−22)) and 9 nonconstant manipulated variables are included. The purpose of testing on the TEP data is to validate that the proposed method is able to provide good visualization for different types of faults. Also, as a benchmark, the proposed method can be cross-validated and compared with other methods to evaluate its performance.

Pyrolysis Reactor. The pyrolysis reactor, an important unit in the chemical industry, is used to crack heavier hydrocarbons into lower molecular weight hydrocarbons. The whole reactor is placed in a fired furnace, where the energy required for the reaction is generated from fuel gas burning. In this study, the data come from an industrial-scale furnace reactor in which naphtha is cracked to ethylene. As shown in Figure 8, six coils of naphtha are mixed with steam before cracking in the furnace. The ethylene product is generated by cracking the mixture through the furnace at a high temperature (around 1000 °C).

Figure 8. Demonstration of the pyrolysis reactor.

In this case study, data from one operation cycle, covering roughly 90 days of process operation, are used to train the model, and another operation cycle is used for model validation. A total of 64 variables are selected, consisting of hydrocarbon flows, diluted steam flows, crossover temperature, cracking temperature, and pressure measurements for each of the six coils. In addition, the total hydrocarbon flow, overall outlet temperature, fuel gas flow rate, fuel gas molecular weight, and the furnace temperature are also included. The process data of the two operation cycles are plotted in Figure 9.

Figure 9. Pyrolysis process data from two cycles. The training cycle is marked in gray; after the training cycle is the testing cycle.

From the two operation cycles, several differences can be noticed. In general, the operation conditions in each operation cycle (especially the furnace temperature) drift due to normal coking and deformation in the cracking tubes. After each operation cycle, the coke is cleared, but the tube deformation is difficult to recover, which can cause shifting between different operation cycles. Therefore, when the data from two cycles are compared, their operation conditions can be slightly different, in order to keep a constant temperature for cracking heavy hydrocarbons. In the given data set, the values of hydrocarbon flows, steam flows, crossover temperature, and even the furnace temperature are not exactly the same, while these necessary drifts and fluctuations keep the cracking temperature constant. Besides, a fault can be noticed at the beginning of the validation cycle, where all hydrocarbon flows are shut off. After training on the given cycle, a desired visualization model should be able to visualize process shifting between different operation cycles, as well as significant abnormal conditions in the unseen cycle.

Dimensionality reduction methods can provide a better understanding of different operating cycles, particularly of the overall distribution of the data sets and the local events in the high-dimensional space. The two operation cycles are visualized together in a 2D map (see Figure 10) using both t-SNE and PCA.

Figure 10. Visualization of two operation cycles using (a) t-SNE and (b) PCA, for which the training cycle is shown in warm colors and the testing cycle is shown in cool colors. The color map indicates their time-series sequence.

From the comparison of the two methods, the nonlinear DR technique, t-SNE, provides a better overview of the given data sets, in which different operation cycles are well separated and local events in both cycles are also identified as small clusters on the map. The aforementioned fault at the beginning of the second operation cycle can also be clearly grouped apart from the normal conditions. On the other hand, the PCA model successfully isolates the faults away from the normal conditions, but it fails to provide more details inside each operation cycle or to separate different operation cycles. Although the simple linear model is able to proportionally represent the variance and distribution of the data set, the drawback of the linear model, that complicated nonlinear correlations cannot be expressed well in a linear configuration, is also obvious. Hence, approximation of the t-SNE cost function by a neural network for process visualization seems promising in practice.



RESULT AND DISCUSSION

The visualization performances are validated on both the TEP data and the pyrolysis process. For both processes, the DL framework is implemented in tensorflow18 in a Python 3.6 environment using a graphics processing unit (GPU). The network architectures are selected as m−64−32−2, where m is the input size of the neural network. For the TEP, m is set as 31, and for the pyrolysis process, m equals 64. The learning rate is set as 0.001 for all cases. Training takes roughly 400 epochs to converge, where Adam19 is used as the optimizer. The architecture and parameter settings are determined by trial and error. Training multiple times is encouraged to help the model converge to a better minimum. Regarding the parameter settings for variation addition, 50 groups of variations are created for the TEP, while 60 groups are used for the pyrolysis process. The variation level σ is determined based on the distribution of the data set, where σ is set as 15 for the TEP and 10 for the pyrolysis process.
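For concreteness, the m−64−32−2 architecture and optimizer settings quoted above can be assembled as follows, reusing the ReformulatedLayer sketched earlier. The training loop itself is omitted, since the KL loss requires pairwise batch affinities; this wiring is our assumption, not the authors' released code.

```python
from tensorflow import keras

def build_network(m):
    """Sketch of the m-64-32-2 visualization network described above."""
    inputs = keras.Input(shape=(m,))
    h = ReformulatedLayer(64)(inputs)
    h = ReformulatedLayer(32)(h)
    outputs = keras.layers.Dense(2)(h)   # 2D latent map
    return keras.Model(inputs, outputs)

model = build_network(31)                # m = 31 for the TEP
optimizer = keras.optimizers.Adam(learning_rate=0.001)
# model.compile is omitted: the t-SNE KL loss sketched earlier needs the
# per-batch affinity matrix P, so a custom training loop is one option.
```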


In recent research on activation function design, Ramachandran et al.20 proposed a design method using a combination of exhaustive and reinforcement-learning-based searches over a number of unary and binary functions. Several other activations that achieved promising performance in their experiments are also investigated in this work; they are listed in Figure 11a,b. Additionally, a linear multilayer perceptron (MLP) model was also tested to verify the necessity of a nonlinear activation for process visualization.

Figure 11. Comparison of different activation functions. Panels a and b summarize different activations from Ramachandran's work.20 (c) Their error curves on testing data of the pyrolysis reactor.

They are compared under the same configuration on test data of the pyrolysis process in Figure 11c. Through the comparison, the proposed activation function shows the best performance. The activation function sinc(x) + x, which is also a combination of a linear unit and a weak nonlinear unit, gives a performance comparable to the proposed one, which is consistent with the three properties we defined for selecting an ideal activation for process visualization problems. The linear MLP also provides a satisfying result on the testing set, while the improvement from the nonlinear activation is clearly noticeable in the comparison. The linear model could potentially struggle to represent the intrinsic nonlinear correlations among multiple variables, while the nonlinear functions excel at this.

TEP Results. The comprehensive results, which include 14 faults, are summarized in the Supporting Information. The results consist of a comparison between the proposed method and other popular methods, including PCA, Isomap,21 Locally Linear Embedding (LLE),22 and Locality Preserving Projection (LPP).23 Besides, in the Supporting Information, each test set (normal and faulty data) is visualized with t-SNE as a reference for the overall data visualization, as well as compared with the original parametric t-SNE. For Faults 3, 5, and 9, all methods are unable to distinguish the faulty data from normal data, because these faults are caused by temperature changes that are not included in the measurements of the given data sets. Hence, in these methods, the fault data are highly overlapped with normal data. In the other tests, it can be observed that the proposed visualization method effectively shows the difference for various kinds of faults. Clear separation can be observed for the step faults, while for faults caused by random variations, faulty data are dispersed around the normal region. In comparison with other methods, the performance is comparable for most of the faults, while for Faults 4, 11, and 14 (see Figure 12), the proposed approach outperforms the other methods by providing clearer separations and better representations.

Figure 12. Visualization result for normal condition (blue) and three TEP faults (red) from (a) the proposed method, (b) PCA, (c) Isomap, (d) LLE, and (e) LPP.


Table 1. Averaged JS Divergence Metrics of Different Methods on Both the Training Cycle and the Testing Set

method             training   testing
proposed method    0.446      0.517
para t-SNE         0.482      0.597
PCA                0.544      0.523
Isomap             0.452      0.565
LLE                0.560      0.567
LPP                0.508      0.564
t-SNE              0.311      0.241
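The averaged Jensen−Shannon divergence behind Table 1 can be sketched as follows. The exact affinity construction and averaging convention are not spelled out in the text, so this sketch assumes row-normalized affinity matrices P (original space, eq 1) and Q (latent space, eq 2) and per-row averaging.

```python
import numpy as np

def avg_js_divergence(P, Q, eps=1e-12):
    """Averaged JS divergence between row-normalized affinity matrices
    of the original space (P) and the latent space (Q). The per-row
    averaging used here is our assumption, not the paper's exact recipe."""
    P = P / (P.sum(axis=1, keepdims=True) + eps)
    Q = Q / (Q.sum(axis=1, keepdims=True) + eps)
    M = 0.5 * (P + Q)                     # mixture distribution
    kl_pm = np.sum(P * np.log((P + eps) / (M + eps)), axis=1)
    kl_qm = np.sum(Q * np.log((Q + eps) / (M + eps)), axis=1)
    return float(np.mean(0.5 * (kl_pm + kl_qm)))
```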

Figure 15. Prospective application of the proposed approach in process monitoring: (a) time-series visualization, (b) fault detection visualization, (c) multiple operation modes visualization, (d) prospective event management.

Pyrolysis Process Results. To further validate the performance of the proposed approach in a more complicated scenario, the visualization results on the pyrolysis reactor are summarized in this section. The proposed method was trained on one complete operation cycle, and the trained model was then verified on another cycle, which shifts from the training data. The same configuration was used for comparison with the original parametric t-SNE, PCA, Isomap, LLE, and LPP. The projection by t-SNE is given as the reference for the data set. Figure 13 summarizes the results on the training data set.

Figure 13. Training results of (a) the proposed method, (b) the original parametric t-SNE, (c) t-SNE as the reference of the data set, (d) PCA, (e) Isomap, (f) LLE, and (g) LPP. They are colored by their corresponding time sequence.

It can be observed that both the proposed method and the original parametric t-SNE method create clear visuals that capture the local as well as the global characteristics of the training set, comparable to the nonparametric t-SNE in Figure 13c. PCA is able to reveal the global drifting of the process, but it fails to provide more local details for the training data. Among the selected manifold methods, LPP gives a projection similar to PCA, while Isomap provides extra details in the drifting parts (colored in orange and red). LLE seems to give the worst mapping from the high-dimensional space, since it only indicates the outliers of the data sets, while any further information about process operation cannot be inferred from this 2D projection.

The generalization ability is then examined on the validation cycle, which is summarized in Figure 14.

Figure 14. Testing results of (a) the proposed method, (b) the original parametric t-SNE, (c) t-SNE as the reference of the data set, (d) PCA, (e) Isomap, (f) LLE, and (g) LPP. They are colored by their corresponding time sequence.

The proposed method provides a proper visualization of the testing cycle, where the abnormal conditions (dark blue clusters) are isolated from the normal conditions, and normal process drifting can also be easily seen on the 2D map. PCA gives a similar projection that satisfies the basic requirements of the process visualization purpose. The 2D representations of LLE and LPP retain the essential global information of the data set, but the quality of the representation is not comparable with the results from the proposed method and PCA. The original parametric t-SNE and Isomap show poor representations of the testing cycle, where neither the global information nor the local events, including drifting and faults, are shown.

Besides direct visual inspection of the 2D map, the averaged Jensen−Shannon (JS) divergence is applied in this work as the metric to evaluate performance. It is often challenging to make a direct comparison among different DR methods, so here we quantify their performance by comparing the probability divergence between the original space and the latent space using the JS divergence. The JS divergence results of the different methods on both the training and testing cycles are summarized in Table 1. The JS divergence results are consistent with the 2D representations of the training and testing results. The proposed method shows a promising visualization performance on the given data set, as well as a satisfactory generalization capacity on the unseen conditions. In addition, readers should also note one flaw of the proposed method. As a nonlinear visualization method stacking multiple weak nonlinear activation functions, a distortion problem is unavoidable, whereby multivariate correlations in the latent space cannot be represented proportionally. In other words, the distances between different patterns in the 2D space may not match the original space perfectly.

Monitoring Application. As an unsupervised learning method, the proposed visualization method itself does not draw conclusions about the process operation, because that is not the design objective of such a method. After training the visualization model on historical data sets, the proposed method can be prospectively embedded into a monitoring system as the front-end interface, along with a fault detection and diagnosis method on the back end. Figure 15 demonstrates such an application in our monitoring system for the pyrolysis reactor, in which the proposed visualization method is deployed as the interface and the adaptive k-nearest-neighbor fault detection24 is running on the back end. The proposed visualization method is able to provide various process information, including process drifting over time (Figure 15a), fault detection decisions (Figure 15b), and visualization of multiple operation modes (Figure 15c). Beyond that, it can be further developed into an interactive event management system that stores faulty events on the map and provides extra information, for example, diagnosis results, on top of the map (Figure 15d).

CONCLUSION

In this work, we introduced a process visualization method using parametric t-SNE to provide high-level process information in the form of a 2D map. Instead of monitoring a line-chart diagram for a single variable in the DCS, the proposed method can extract features from multiple process variables and indicate patterns corresponding to different process behaviors. As an unsupervised learning approach, this method does not require prior knowledge, such as well-labeled data, to provide correct visualization. The t-SNE mapping is approximated by a deep neural network that can learn the probability distribution of the original space. We addressed the drawback of the original parametric t-SNE, namely that its generalization capacity is limited, and proposed two approaches to improve it: training with dummy data and reformulating the internal structure of the neural network. The method with the proposed adjustments achieves superior visualization performance on the TEP and the pyrolysis process data sets.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.iecr.9b00975. Visualization results for normal conditions and the described faults from this work and other methods (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

José Romagnoli: 0000-0003-3682-1305

Notes

The authors declare no competing financial interest.



REFERENCES

(1) Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 1901, 2, 559−572.
(2) Zhong, B.; Wang, J.; Wu, H.; Zhou, J.; Jin, Q. SOM-based visualization monitoring and fault diagnosis for chemical process. 2016 Chinese Control and Decision Conference (CCDC), May 28−30, 2016; pp 5844−5849.
(3) Robertson, G.; Thomas, M.; Romagnoli, J. A. Topological preservation techniques for nonlinear process monitoring. Comput. Chem. Eng. 2015, 76, 1−16.
(4) van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579−2605.
(5) van der Maaten, L. Learning a parametric embedding by preserving local structure. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009; pp 384−391.
(6) Ruixue, J.; Jing, W.; Jinglin, Z. Fault diagnosis of industrial process based on the optimal parametric t-distributed stochastic neighbor embedding. Sci. China Inf. Sci. DOI: 10.1007/s11432-018-9807-7.
(7) Gallagher, N. B.; Wise, B. M.; Butler, S. W.; White, D. D., Jr; Barna, G. G. Development and benchmarking of multivariate statistical process control tools for a semiconductor etch process: improving robustness through model updating. IFAC Proceedings Volumes 1997, 30, 79−84.
(8) Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. https://arxiv.org/abs/1502.03167, 2015.
(9) Hinton, G. E.; Roweis, S. T. Stochastic neighbor embedding. Advances in Neural Information Processing Systems; 2003; pp 857−864.
(10) Hinton, G. A practical guide to training restricted Boltzmann machines. Momentum 2010, 9, 926.
(11) Hinton, G. E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554.
(12) Hinton, G. E.; Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504−507.
(13) Downs, J. J.; Vogel, E. F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245−255.





(14) Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2015; pp 234−241.
(15) Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371−3408.
(16) Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How does batch normalization help optimization? (No, it is not about internal covariate shift). https://arxiv.org/abs/1805.11604, 2018.
(17) Ricker, N. L. Decentralized control of the Tennessee Eastman challenge process. J. Process Control 1996, 6, 205−221.
(18) Abadi, M.; et al. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015; https://www.tensorflow.org/.
(19) Kingma, D.; Ba, J. Adam: A method for stochastic optimization. https://arxiv.org/abs/1412.6980, 2014.
(20) Ramachandran, P.; Zoph, B.; Le, Q. V. Searching for activation functions. https://arxiv.org/abs/1710.05941, 2017.
(21) Tenenbaum, J. B.; De Silva, V.; Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290, 2319−2323.
(22) Roweis, S. T.; Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323−2326.
(23) He, X.; Niyogi, P. Locality preserving projections. Advances in Neural Information Processing Systems; 2004; pp 153−160.
(24) Zhu, W.; Sun, W.; Romagnoli, J. Adaptive k-Nearest-Neighbor method for process monitoring. Ind. Eng. Chem. Res. 2018, 57, 2574−2586.
